[KinoSearch] compound word splitting
melser at gmx.ch
Thu Aug 24 12:30:57 PDT 2006
> It would be possible to write a compound splitter based on longest
> match. Start with a lexically sorted array of words that can be
> substrings. Proceed character by character through the token text,
> finding the longest possible match using binary search. If you find a
> match, lop it off the token text and start again from there. Index
> multiple tokens at one position by setting the position increment
> argument for TokenBatch::append() to 0. (This technique is also how you
> would implement a synonym analyzer.) It wouldn't be perfect, for the
> reasons discussed on that page, but it would be better than nothing.
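[Editor's note: the greedy longest-match scheme quoted above can be sketched in a few lines. This is a language-agnostic illustration in Python, not KinoSearch code (KinoSearch is Perl, and the `TokenBatch::append()` position-increment step is left out); the function and parameter names are hypothetical.]

```python
from bisect import bisect_right

def split_compound(token, vocab):
    """Greedy longest-match compound splitter (hypothetical sketch).

    `vocab` is the list of words that may appear as substrings; it is
    sorted lexically so binary search applies. At each step the longest
    vocab word that is a prefix of the remaining text is lopped off.
    Returns the list of parts, or None if the token cannot be split.
    """
    vocab = sorted(vocab)          # binary search needs lexical order
    parts, rest = [], token
    while rest:
        # All prefixes of `rest` sort at or before `rest` itself, and a
        # longer prefix sorts after a shorter one, so scanning backward
        # from the insertion point finds the longest prefix first.
        i = bisect_right(vocab, rest)
        match = None
        for j in range(i - 1, -1, -1):
            word = vocab[j]
            if rest.startswith(word):
                match = word
                break
            if word[:1] != rest[:1]:
                # Earlier entries sort lower still; none can be a prefix.
                break
        if match is None:
            return None            # no split; index the token whole
        parts.append(match)
        rest = rest[len(match):]   # lop it off and start again
    return parts
```

As noted above, this is not perfect: greedy matching can commit to a long first word and then fail where a shorter one would have succeeded, so a real implementation might fall back to indexing the unsplit token.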
This sounds great and not too complex, but I have one big question: is
this something I can do entirely in the Perl part of KinoSearch, or does
it involve writing XS/C code? If it does, I don't even have to start,
because I don't know C.
And what is a lexically sorted array?
> I don't have time to write, test, or support something like this, but
> I'd be happy to continue discussing the design on list. My primary
> interest lies in providing the most elegant abstract framework possible
> within which such things can be implemented.
No problem. If I can implement it, I will try, but I also have to do
some more googling on how to find binding characters in compound words;
as far as I know there are only a few.
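[Editor's note: the "binding characters" are the German linking elements (Fugenelemente), and the inventory is indeed small. A hedged, self-contained sketch of how a splitter might try them on the leftover fragment of a compound; the list and names here are illustrative, not exhaustive or from KinoSearch:]

```python
# Illustrative (not exhaustive) list of German linking elements,
# longest first so "es" is tried before "e" or "s".
LINKING_ELEMENTS = ("es", "en", "er", "e", "n", "s")

def strip_linking(fragment):
    """Yield candidate continuations of a partially split compound:
    first with each matching linking element removed, then as-is.
    E.g. after matching 'arbeit' in 'arbeitszimmer', the leftover
    'szimmer' yields 'zimmer' (stripping 's') before 'szimmer'.
    """
    for link in LINKING_ELEMENTS:
        if fragment.startswith(link) and len(fragment) > len(link):
            yield fragment[len(link):]
    yield fragment  # fall back to trying the fragment unchanged
```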
KinoSearch mailing list
KinoSearch at rectangular.com