[KinoSearch] compound word splitting

Marvin Humphrey marvin at rectangular.com
Thu Aug 24 11:39:16 PDT 2006




On Aug 24, 2006, at 12:37 AM, Marc Elser wrote:
> I was googling more for compound word splitting and maybe there's a  
> solution which could work for KinoSearch too.
>
> There's a program called TSearch V2
> (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/), a
> PostgreSQL extension which adds inverted full-text search indexes
> and new functions to PostgreSQL. One function is 'lexize', to which
> you must pass the encoding and a word, and which returns the
> component words if you have a dictionary which is tagged for
> compound words. There are some dictionaries for Swedish, German and
> other languages, although I don't know if the other dictionaries are
> tagged too.

I had a look, but I bounced off of the TSearch source code, which is  
sparsely commented and uses some PostgreSQL stuff I'm unfamiliar  
with.  I'm not sure which files to look at, and the files I did look  
at I didn't grok.

I tried googling this subject myself, and I came up with this page:

http://www.glue.umd.edu/~oard/courses/708a/fall01/838/P2/

It would be possible to write a compound splitter based on longest  
match.  Start with a lexically sorted array of words that can be  
substrings.  Proceed character by character through the token text,  
finding the longest possible match using binary search.  If you find  
a match, lop it off the token text and start again from there.  Index  
multiple tokens at one position by setting the position increment  
argument for TokenBatch::append() to 0.  (This technique is also how  
you would implement a synonym analyzer.)  It wouldn't be perfect, for  
the reasons discussed on that page, but it would be better than nothing.
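
As a rough illustration of the longest-match idea described above, here is a
minimal sketch in Python. It is not KinoSearch code and does not call
TokenBatch::append(); the function names, the min_len cutoff, and the
(text, position_increment) tuples are all assumptions made for the example.
It also assumes that a token which cannot be split completely is left
unsplit, which is only one of several possible policies.

```python
from bisect import bisect_left


def longest_prefix(words, text, start, min_len=3):
    """Return the longest entry of the sorted list `words` that
    text[start:] begins with, or None if there is no such entry."""
    # Try the longest candidate first, shrinking one character at a time.
    for end in range(len(text), start + min_len - 1, -1):
        candidate = text[start:end]
        i = bisect_left(words, candidate)  # binary search in the sorted list
        if i < len(words) and words[i] == candidate:
            return candidate
    return None


def split_compound(words, token, min_len=3):
    """Greedy longest-match split. Returns the list of parts, or an
    empty list if the token cannot be consumed completely (assumption)."""
    parts = []
    pos = 0
    while pos < len(token):
        match = longest_prefix(words, token, pos, min_len)
        if match is None:
            return []
        parts.append(match)  # lop the match off and continue from there
        pos += len(match)
    return parts


def tokens_with_increments(words, token):
    """Hypothetical output format: (text, position_increment) pairs.
    The original token advances the position by 1; its parts get an
    increment of 0 so they share that position."""
    out = [(token, 1)]
    parts = split_compound(words, token)
    if len(parts) > 1:
        out.extend((part, 0) for part in parts)
    return out


# The word list must be sorted for the binary search to work.
WORDS = sorted(["ball", "fuss", "fussball", "spiel"])
print(tokens_with_increments(WORDS, "fussballspiel"))
```

Because the match is greedy, "fussball" is taken in preference to "fuss"
followed by "ball". That is also where the approach can go wrong: a long
first match may leave a remainder that is not in the word list, in which case
this sketch simply gives up on the split.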

I don't have time to write, test, or support something like this, but  
I'd be happy to continue discussing the design on list.  My primary  
interest lies in providing the most elegant abstract framework  
possible within which such things can be implemented.


Marvin Humphrey

--
I'm looking for a part time job.



