[KinoSearch] utf8 (unicode) any progress on TokenBatch?
melser at gmx.ch
Mon Aug 14 03:04:05 PDT 2006
I desperately need utf8 support in KinoSearch. I already found your
discussion with someone else where you wrote that one would have to
write a special TokenBatch class, but I didn't exactly understand what
you meant by absorbing the utf8 flag from the last scalar.
Have you had time to do some work on this yet? Or would it be sufficient
to modify the analyzers and set the utf8 flag after the "gettext" method
from the TokenBatch?
It's especially a problem because words like "fröhlich" are split into
two search terms, "fr" and "lich", which produces false matches. As we
don't have one language only, I must use utf8 strings.
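
To illustrate the kind of split described above, here is a minimal sketch
in Python. It is not KinoSearch's analyzer code; the function names
tokenize_bytes and tokenize_text are made up for this example, and the
tokenizer is assumed to be a plain "\w+" regex. When the input is treated
as raw bytes, the two bytes that encode "ö" are not word characters, so
the word breaks apart. Once the bytes are decoded as UTF-8 (roughly what
setting the utf8 flag would achieve), the word stays whole. The exact
fragments depend on the tokenizer; in this sketch they come out as "fr"
and "hlich".

```python
import re

WORD = r"\w+"  # assumed tokenizer pattern, for illustration only


def tokenize_bytes(raw):
    """Tokenize undecoded bytes: \\w matches only ASCII word characters."""
    return re.findall(WORD.encode("ascii"), raw)


def tokenize_text(raw):
    """Decode as UTF-8 first: \\w then matches Unicode letters such as 'ö'."""
    return re.findall(WORD, raw.decode("utf-8"))


# "fröhlich" as UTF-8 bytes; "ö" (U+00F6) is encoded as 0xC3 0xB6.
raw = "fr\u00f6hlich".encode("utf-8")

print(tokenize_bytes(raw))  # [b'fr', b'hlich']
print(tokenize_text(raw))   # ['fröhlich']
```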
Thanks for any help.
PS: I'm doing speed tests and have indexed parsed office files. Currently
I have indexed about 100'000 files and it's lightning fast. I was trying
Plucene before and it could not even handle 10'000 files (the index grew
to hundreds of megs and searches were ultra slow).
With KinoSearch it takes about 0.005 secs to search through the 100'000
docs. Amazing work! If this utf8 problem is solved, KinoSearch is just