[KinoSearch] vectors + large indices

Marvin Humphrey marvin at rectangular.com
Sun May 3 16:19:39 PDT 2009


> > * Highlighter's internal workings have changed.  This process is  
> >   incomplete and the present implementation is buggy.
> 
> Which parts of it are buggy? 

Trimming of excerpts.

Take a look at the first result on this page:

    http://www.rectangular.com/cgi-bin/uscon_search.cgi?q=congress;offset=20

What happens is that we start with a window that's a little wider than the
requested excerpt length and try to edit it down so that it starts on sentence
boundaries whenever possible.  However, sometimes the word being highlighted
is at the edge of the window and gets cut off, resulting in an excerpt that's
seemingly irrelevant.
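
To make the failure mode concrete, here is a minimal Python sketch of one way to trim a window without losing the hit.  It is not the KinoSearch code: the function name `trim_excerpt`, the `slack` parameter, and the crude sentence-boundary regex are all hypothetical, and it assumes the highlighted term is no longer than the requested excerpt.  The idea is simply to accept a sentence boundary as the excerpt start only if the whole hit still fits, and otherwise fall back to centering on the hit.

```python
import re

# Crude sentence boundary: terminal punctuation followed by whitespace.
# (An assumption for illustration; real sentence detection is harder.)
BOUNDARY = re.compile(r"[.!?]\s+")

def trim_excerpt(text, hit_start, hit_end, excerpt_len, slack=20):
    """Return (start, end) offsets of an excerpt that contains the hit.

    Assumes hit_end - hit_start <= excerpt_len.
    """
    # Start with a window a little wider than the requested length.
    center = (hit_start + hit_end) // 2
    half = (excerpt_len + slack) // 2
    win_start = max(0, center - half)

    # Collect sentence starts inside the window that precede the hit.
    candidates = [0] if win_start == 0 else []
    for m in BOUNDARY.finditer(text, win_start, hit_start):
        candidates.append(m.end())

    # Keep only starts from which the entire hit fits in the excerpt.
    candidates = [c for c in candidates if hit_end - c <= excerpt_len]

    if candidates:
        # Earliest usable boundary gives the most leading context.
        start = min(candidates)
    else:
        # No usable boundary: center on the hit rather than cut it off.
        start = max(0, center - excerpt_len // 2)

    end = min(len(text), start + excerpt_len)
    return start, end
```

The check on `hit_end - c` is the part that matters here: a boundary that would push the hit past the end of the excerpt is rejected rather than preferred.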

I think this happens for maybe 1 in every 10-20 excerpts.

I looked into trying to fix it a while back, but my initial assessment was
that there were some fundamental problems that needed to be addressed, so I
put it on the back burner.

Since that time, we had that discussion at
<https://issues.apache.org/jira/browse/LUCENE-1522>, which yielded a
high-level plan for a Lucene/Lucy highlighter which is reasonably close to the
KinoSearch implementation.  I expect to refactor according to that plan and
then troubleshoot after the transition.

Marvin Humphrey
