[KinoSearch] Highlighter Bug

Marvin Humphrey marvin at rectangular.com
Wed Nov 17 13:13:11 PST 2010


(cc'ing lucy-dev)

On Wed, Nov 17, 2010 at 01:26:51PM +0100, Nick Wellnhofer wrote:
> On 17/11/2010 00:52, Marvin Humphrey wrote:
> >On Tue, Nov 16, 2010 at 05:15:42PM +0100, Nick Wellnhofer wrote:
> >>
> >>I found a bug in the highlighter in KinoSearch 0.30_121 and 0.31. For
> >>some searches the excerpt doesn't contain any of the search terms, but
> >>only a sentence following the one that looks like it should have been
> >>picked.
> >>
> >>I had a quick look at the code in
> >>core/KinoSearch/Highlight/Highlighter.c and the test "candidate>= top"
> >>in Highlighter_raw_excerpt looks fishy to me. AFAICS this might cause
> >>sentences to be skipped.
> >
> >You're probably right that that's where we're skipping ahead, but it's not
> >clear what aspect of the input is causing Raw_Excerpt() to malfunction.
> >The check inside that block is supposed to verify that the chosen excerpt
> >contains at least some relevant material -- that's what the S_has_heat()
> >call is there for.
> 
> I had a closer look and the error case is that you have three sentences 
> where only the first and the last contain keywords but the middle one is 
> chosen for the excerpt.
> 
> >This will be hard to debug without sample material.  Once we have a
> >document which triggers the bad behavior, we can start throwing in
> >debugging printf's.  If you can supply me with sample code which has the
> >problem, I'll have a good hard look.
> 
> See the attached test case.
> 
> Nick

Thank you for the test case!  It illustrates the problem well and I have
started investigating.

Since Lucy has inherited KinoSearch's highlighter code, the problem also
exists in Lucy and I have opened an issue there.

    https://issues.apache.org/jira/browse/LUCY-126

Can you please create a JIRA account for yourself and attach your test case to
that issue as a contribution to the Apache Software Foundation, so that we can
use it for Lucy as well?  Just as is is fine, no need to adapt it for Lucy
specifically.

(Nick is already familiar with the intellectual property issues involved, but
for the rest of the KS list, here's an explanation:  When a contribution
arrives on the KinoSearch mailing list, we may assume that it is intended
for use in the KinoSearch project under KinoSearch's GPL/Artistic license.
This license is more restrictive than Lucy's Apache License 2.0; we can't
use such contributions for Lucy without an additional step granting an
additional license for the code to the ASF.)

Marvin Humphrey

> #! perl
> use strict;
> 
> use KinoSearch;
> 
> my $schema = KinoSearch::Schema->new;
> 
> my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new( 
>     language => 'de',
> );
> 
> $schema->spec_field( 
>     name => 'url',
>     type => KinoSearch::FieldType::StringType->new(),
> );
> $schema->spec_field( 
>     name  => 'title',
>     type => KinoSearch::FieldType::FullTextType->new(
>         analyzer      => $polyanalyzer,
>         boost         => 3,
>         highlightable => 1,
>     ),
> );
> $schema->spec_field(
>     name => 'bodytext',
>     type => KinoSearch::FieldType::FullTextType->new(
>         analyzer      => $polyanalyzer,
>         highlightable => 1,
>     ),
> );
> 
> my $indexer = KinoSearch::Indexer->new(
>     schema   => $schema,
>     index    => 'debug_index',
>     create   => 1,
>     truncate => 1,
> );
> 
> $indexer->add_doc(
>     doc => {
>         url      => 'urn:test',
>         title    => 'Test',
>         bodytext => <<'EOF',
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla NNN bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla MMM bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> EOF
>     },
> );
> 
> $indexer->commit();
> 
> my $searcher = KinoSearch::Searcher->new(
>     index => 'debug_index'
> );
> my $query = 'nnn mmm';
> my $bodytext_hl = KinoSearch::Highlight::Highlighter->new(
>     searcher => $searcher,
>     query    => $query,
>     field    => 'bodytext'
> );
> my $hits = $searcher->hits(
>     query      => $query,
>     offset     => 0,
>     num_wanted => 10,
> );
> 
> while(my $hit = $hits->next()) {
>     print($bodytext_hl->create_excerpt($hit), "\n");
> }
> 

> _______________________________________________
> kinosearch mailing list
> kinosearch at rectangular.com
> http://rectangular.com/cgi-bin/mailman/listinfo/kinosearch




More information about the kinosearch mailing list