[KinoSearch] multilanguage indexing and search
Hugues de Mazancourt
hugues at mazancourt.net
Sun Dec 3 09:24:51 PST 2006
On 2 Dec 2006, at 22:23, Alex Aver wrote:
> 2006/12/1, Marvin Humphrey <marvin at rectangular.com>:
>> On Dec 1, 2006, at 8:09 AM, Alex Aver wrote:
> Why can't I use a simple $word_char_tokenizer for this set of languages?
> A universal stemmer for mixed texts is the problem. I can separate words
> in Latin & Cyrillic characters and use a special stemmer for Russian
> words. But how can I separate English & French?
You don't necessarily need to. 80% of the job an English stemmer does is
to remove "s"/"es" at the end of a word, which also works fine for
French. The other rules (such as s/ed$//) won't hurt because they
don't match French words.
You can also add some French rules to your stemmer, such as
s/aux$/al/, which won't have any effect on English words.
In fact, the most important thing is that you use the *same* stemmer
for indexing and querying, whatever stemming it performs.
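To make the point concrete, here is a rough sketch in Python (not KinoSearch code, and not its API) of one rule list shared by the indexing side and the query side. The names stem, analyze, build_index and search are made up for the illustration, the length guard is my own assumption, and the plural rule only strips a final "s", which is cruder than a real English stemmer.

```python
import re

# Hypothetical rule list: (pattern, replacement). Rules are tried in order
# and the first one that matches wins.
RULES = [
    (re.compile(r"aux$"), "al"),  # French rule: "chevaux" -> "cheval"
    (re.compile(r"ed$"), ""),     # English rule: "walked" -> "walk"
    (re.compile(r"s$"), ""),      # plural "s", shared by English and French
]

def stem(word):
    """Lowercase the word and apply the first matching rule, if any."""
    word = word.lower()
    # Assumption: leave very short words alone ("les", "is", ...).
    if len(word) <= 3:
        return word
    for pattern, replacement in RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

def analyze(text):
    """Split on word characters, then stem. Used for documents AND queries."""
    return [stem(token) for token in re.findall(r"\w+", text)]

def build_index(docs):
    """Map each stemmed term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every stemmed query term."""
    results = None
    for term in analyze(query):
        hits = index.get(term, set())
        results = hits if results is None else results & hits
    return results or set()
```

Because build_index and search both go through analyze, a query for "cheval" finds a document containing "chevaux", and it would keep working even if the rules were wrong in some linguistic sense, as long as both sides apply the same ones.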
>> Tokenizing Japanese is really, really hard
>> anyway, and KinoSearch provides no native support for it.
> Yes, tokenizing Japanese is hard, but possible - afair dpsearch &
> mnogosearch can index and search in Japanese. But it isn't a critical
> point at this moment ;)
mnoGoSearch uses ChaSen, a free Japanese parser that has a Perl
front-end. See http://rpmfind.net/linux/RPM/suse/9.3/i386/suse/i586/
More generally, there are some pointers on analyzing Japanese here:
KinoSearch mailing list
KinoSearch at rectangular.com