[KinoSearch] utf8 (unicode) any progress on TokenBatch?

Marc Elser melser at gmx.ch
Mon Aug 14 03:04:05 PDT 2006



Hi Marvin,

I desperately need utf8 support in KinoSearch. I already found your
discussion with someone else where you wrote that one would have to
write a special TokenBatch class, but I didn't exactly understand what
you meant by absorbing the utf8 flag from the last scalar.

Have you had time to do some work on this yet? Or would it be sufficient
to modify the analyzers and set the utf8 flag after the "gettext" method
from the TokenBatch?

It's especially a problem because words like "fröhlich" are split into
two search terms, "fr" and "lich", which produces false matches. As we
have documents in more than one language, I must use utf8 strings.

Thanks for any help.

PS: I'm doing speed tests and have indexed parsed office files. Currently
I have indexed about 100'000 files and it's lightning fast. I was trying
Plucene before and it could not even handle 10'000 files (the index grew
to hundreds of megs and searches were ultra slow).

With KinoSearch it takes about 0.005 secs to search through the 100'000
docs. Amazing work! Once this utf8 problem is solved, KinoSearch will be
just perfect.





_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



