[KinoSearch] compound word splitting

Marc Elser melser at gmx.ch
Thu Aug 24 12:30:57 PDT 2006



Hi Marvin,

> It would be possible to write a compound splitter based on longest 
> match.  Start with a lexically sorted array of words that can be 
> substrings.  Proceed character by character through the token text, 
> finding the longest possible match using binary search.  If you find a 
> match, lop it off the token text and start again from there.  Index 
> multiple tokens at one position by setting the position increment 
> argument for TokenBatch::append() to 0.  (This technique is also how you 
> would implement a synonym analyzer.)  It wouldn't be perfect, for the 
> reasons discussed on that page, but it would be better than nothing.
This sounds great and not too complex, but I have one big question: is 
this something I can do entirely in the Perl part of KinoSearch, or does 
it involve writing XS/C code? If it does, I don't even have to start, 
because I don't know C.

And what is a lexically sorted array?
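
For reference, a "lexically sorted array" is simply a word list kept in 
dictionary (string comparison) order, which is what makes binary search 
possible. The longest-match idea quoted above can be sketched as follows. 
This is only a minimal illustration in Python, not KinoSearch code: the 
names in_lexicon, split_compound, with_position_increments and the min_len 
parameter are all hypothetical, and the (text, increment) pairs merely 
stand in for whatever would be passed to TokenBatch::append().

```python
from bisect import bisect_left


def in_lexicon(lexicon, word):
    """Binary search for word in a lexically sorted list (hypothetical helper)."""
    i = bisect_left(lexicon, word)
    return i < len(lexicon) and lexicon[i] == word


def split_compound(token, lexicon, min_len=3):
    """Greedy longest-match split; returns [token] if no complete split is found."""
    parts = []
    start = 0
    while start < len(token):
        # Try the longest remaining candidate first, then shrink it.
        for end in range(len(token), start + min_len - 1, -1):
            if in_lexicon(lexicon, token[start:end]):
                parts.append(token[start:end])
                start = end  # lop the match off and continue from there
                break
        else:
            return [token]  # nothing matches at this offset: keep the token whole
    return parts


def with_position_increments(token, lexicon):
    """Pair each term with a position increment (assumed convention:
    1 for the original token, 0 for parts stacked at the same position)."""
    parts = split_compound(token, lexicon)
    if parts == [token]:
        return [(token, 1)]
    return [(token, 1)] + [(part, 0) for part in parts]


# The lexicon must be sorted for the binary search to work.
lexicon = sorted(["dampf", "schiff", "fahrt", "schifffahrt"])
print(split_compound("dampfschifffahrt", lexicon))
# -> ['dampf', 'schifffahrt']
```

Because the search is greedy, it picks "schifffahrt" over "schiff" + 
"fahrt", and it does not handle binding characters between the parts; 
both are among the imperfections already mentioned in the quoted text.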
> 
> I don't have time to write, test, or support something like this, but 
> I'd be happy to continue discussing the design on list.  My primary 
> interest lies in providing the most elegant abstract framework possible 
> within which such things can be implemented.
No problem. If I can implement it, I will try, but I also have to do some 
more googling about how to find the binding characters in compound words; 
as far as I know there are only a few.

Best regards,

Marc






