[KinoSearch] compound word splitting
marvin at rectangular.com
Thu Aug 24 11:39:16 PDT 2006
On Aug 24, 2006, at 12:37 AM, Marc Elser wrote:
> I was googling more for compound word splitting, and maybe there's a
> solution which could work for KinoSearch too.
> There's a program called TSearch V2 (http://www.sai.msu.su/~megera/
> postgres/gist/tsearch/V2/), a PostgreSQL extension which
> enhances PostgreSQL by adding inverted full-text search indexes and
> new functions. One function is 'lexize', to which
> you pass the encoding and a word, and which returns the parts of a
> compound word if you have a dictionary which is tagged for compound
> words. There are some dictionaries for Swedish, German and other
> languages, although I don't know if the other dictionaries are
> tagged too.
I had a look, but I bounced off of the TSearch source code, which is
sparsely commented and uses some PostgreSQL stuff I'm unfamiliar
with. I'm not sure which files to look at, and the files I did look
at I didn't grok.
I tried googling this subject myself, and I came up with this page:
It would be possible to write a compound splitter based on longest
match. Start with a lexically sorted array of words that can be
substrings. Proceed character by character through the token text,
finding the longest possible match using binary search. If you find
a match, lop it off the token text and start again from there. Index
multiple tokens at one position by setting the position increment
argument for TokenBatch::append() to 0. (This technique is also how
you would implement a synonym analyzer.) It wouldn't be perfect, for
the reasons discussed on that page, but it would be better than nothing.
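
To make the longest-match idea concrete, here is a rough sketch in Python rather than anything KinoSearch-specific. All of the names (in_lexicon, split_compound, tokens_with_increments, min_len) are made up for illustration. Skipping one character when nothing matches and the three-character minimum are my own assumptions. The (text, position_increment) pairs only stand in for whatever you would hand to TokenBatch::append(); the real call is not shown.

```python
from bisect import bisect_left


def in_lexicon(lexicon, word):
    """Binary search a sorted list of words for an exact match."""
    i = bisect_left(lexicon, word)
    return i < len(lexicon) and lexicon[i] == word


def split_compound(token, lexicon, min_len=3):
    """Greedy longest-match split. `lexicon` must be a sorted list.

    min_len is an assumed lower bound on part length, to avoid
    matching tiny fragments.
    """
    parts = []
    start = 0
    while start < len(token):
        # Try the longest candidate first, shrinking one character at a time.
        for end in range(len(token), start + min_len - 1, -1):
            if in_lexicon(lexicon, token[start:end]):
                parts.append(token[start:end])
                start = end  # lop off the match and resume after it
                break
        else:
            # Assumption: no word starts here, so skip one character.
            start += 1
    return parts


def tokens_with_increments(token, lexicon):
    """Pair the original token with increment 1 and each part with 0,
    so that all of them land on the same position."""
    parts = split_compound(token, lexicon)
    out = [(token, 1)]
    out.extend((part, 0) for part in parts if part != token)
    return out


lexicon = sorted(["ball", "fuss", "fussball", "platz", "spiel"])
print(tokens_with_increments("fussballplatz", lexicon))
```

Note that the greedy match takes "fussball" whole and never considers "fuss" plus "ball", which is one of the ways longest match falls short.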
I don't have time to write, test, or support something like this, but
I'd be happy to continue discussing the design on list. My primary
interest lies in providing the most elegant abstract framework
possible within which such things can be implemented.
KinoSearch mailing list
KinoSearch at rectangular.com