[KinoSearch] Subclassable Highlighter
Marvin Humphrey
marvin at rectangular.com
Fri Feb 1 16:47:21 PST 2008
On Jan 31, 2008, at 2:37 PM, Father Chrysostomos wrote:
>
> On Jan 30, 2008, at 1:54 PM, I wrote:
>
>> This patch makes the changes to Highlighter.pm’s public interface,
>> and modifies the test accordingly. It does not yet use
>> HighlightSpan and HeatMap
>
> This patch makes it use those two.
Another nice patch. :) Sorry it's taken me a bit to respond -- been
sick. Today I'm just narcoleptic instead of comatose, though, so
I've been able to complete the review.
> I saw in another message that you wanted the Scorer to provide the
> HighlightSpans.
Having pondered the subject a little longer, and having seen your
patch, I now lean towards Weight as the best place to put the
highlight_spans() method.
Scorer actually wouldn't work, at least not as Scorer is conceived
and implemented today. Scorers are not serializable, and they are
tied to a particular machine and a particular IndexReader. So this
would be problematic:
    for my $server_name (@server_names) {
        push @searchers, KinoSearch::Search::SearchClient->new(
            peer_address => "$server_name:$port",
            password     => $pass,
            schema       => $schema,
        );
    }
    my $multi_searcher = KinoSearch::Search::MultiSearcher->new(
        searchables => \@searchers,
        schema      => $schema,
    );
    my $highlighter = KinoSearch::Highlight::Highlighter->new(
        searcher => $multi_searcher,
        query    => $query,
        field    => 'content',
    );
MultiSearcher doesn't _have_ an internal IndexReader. It can't be
used to make a Scorer.
However, MultiSearcher _does_ know enough to turn a Query into a
Weight. And Weight knows the IDF, plus all the other per-document-
collection information, so it can properly calculate the weight for
each HighlightSpan.
Using IDF, custom query boosts, and such will yield higher-quality
excerpts. The highlighter algo prior to this patch was all about
keyword density, which can be misleading. For instance, if you're
searching for 'the grifters', a passage with a high density for the
word "the" looks like gold to the Highlighter, which can't tell that
"grifters" is a more discriminatory term.
You've apparently grokked this already, as I see that your
implementations of highlight_spans() call make_weight() internally.
(Kudos on figuring that out, as the documentation is incomplete.)
+    if(@$posit_vec) {
+        $weight = $self->make_weight($searcher)->get_value;
+    }
Compiling a Query to a Weight, though, is a little expensive to be
doing for each document. I think the better solution is to have
the Highlighter compile the Query to a Weight once and cache it as a
member var, then have the cached Weight do the work.
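Roughly, the compile-once idea looks like the sketch below. The My::* classes are hypothetical mocks so that the example runs on its own; they are not the real KinoSearch classes, and span_weight_for_doc() is an invented name. Only make_weight() and get_value() mirror calls that appear in the patch.

```perl
use strict;
use warnings;

package My::MockQuery;    # hypothetical stand-in for a Query
my $compile_count = 0;    # counts how often we "compile"
sub new { my ( $class, %args ) = @_; return bless {%args}, $class }

sub make_weight {
    my ( $self, $searcher ) = @_;
    $compile_count++;     # pretend this step is expensive
    return My::MockWeight->new( value => 2.5 );
}
sub compile_count { return $compile_count }

package My::MockWeight;    # hypothetical stand-in for a Weight
sub new { my ( $class, %args ) = @_; return bless {%args}, $class }
sub get_value { return $_[0]{value} }

package My::CachingHighlighter;    # hypothetical Highlighter subclass

sub new {
    my ( $class, %args ) = @_;
    my $self = bless {%args}, $class;

    # Compile the Query to a Weight once and cache it as a member var.
    $self->{weight} = $self->{query}->make_weight( $self->{searcher} );
    return $self;
}

# Per-document work goes through the cached Weight; no recompiling.
sub span_weight_for_doc {
    my ( $self, $doc_id ) = @_;
    return $self->{weight}->get_value;
}

package main;

my $highlighter = My::CachingHighlighter->new(
    query    => My::MockQuery->new,
    searcher => undef,    # unused by the mock
);
$highlighter->span_weight_for_doc($_) for 1 .. 100;
print "make_weight calls: ", My::MockQuery->compile_count, "\n";
```

A hundred documents get processed, but make_weight() runs only once, in the constructor.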
We'll need to make some more APIs public in order for you to access
these capabilities in your custom Highlighter subclass.
  * Weight
  * Query::make_weight
  * Weight::highlight_spans
Existing subclasses of Weight like TermWeight will stay private.
We'll probably need other things, like Searchable::doc_freq(). In
theory, we need Searchable::create_weight() too, but I'd like to
refactor that away if possible. (I tried once before; I can't
remember why it didn't work out.)
We also need Weight and make_weight to be public in order to support
WildCardQuery, so this work has other applications.
> I’m not sure exactly how you want to accomplish that, but this is a
> start. It uses the Query for now, but at least it eliminates
> _starts_and_ends and _calc_best_location from Highlighter.pm.
It's very nice work, and I'm pleased to have committed it as r2982,
with one typo fixed (s/highlight_data/highlight_spans/ in Query.pm).
Thanks!
> I did run into a problem with multiple HighlightSpans with the same
> start offset. It makes the ( 1 / ( 1 + log($diff) ) ) formula in
> HeatMap.pm blow up (log 0). So I’ve added code that eliminates
> duplicates, adding the weights together. I’m not sure if this is
> how it should be done.
I'll cover this topic in a separate post.
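In the meantime, for anyone following along, here's a sketch of the failure and of the merge-by-start-offset workaround described in the quoted text. Spans are plain hashes here rather than HighlightSpan objects, the sub names are hypothetical, and keeping the larger end offset when merging is my assumption, not necessarily what the patch does.

```perl
use strict;
use warnings;

# Hypothetical spans; the first two share a start offset.
my @spans = (
    { start => 10, end => 15, weight => 1.0 },
    { start => 10, end => 15, weight => 0.5 },
    { start => 40, end => 48, weight => 2.0 },
);

# Merge spans with the same start offset, adding the weights together.
# Keeping the larger end offset is an assumption made for this sketch.
sub merge_same_start {
    my @in = @_;
    my %by_start;
    for my $span (@in) {
        my $merged = $by_start{ $span->{start} } ||= { %$span, weight => 0 };
        $merged->{weight} += $span->{weight};
        $merged->{end} = $span->{end} if $span->{end} > $merged->{end};
    }
    return map { $by_start{$_} } sort { $a <=> $b } keys %by_start;
}

# The proximity factor from the thread. Perl dies on log(0), so $diff
# must be at least 1 (true for distinct integer start offsets).
sub proximity {
    my $diff = shift;
    return 1 / ( 1 + log($diff) );
}

my @merged = merge_same_start(@spans);
for my $i ( 1 .. $#merged ) {
    my $diff = $merged[$i]{start} - $merged[ $i - 1 ]{start};
    printf "offsets %d and %d: proximity %.3f\n",
        $merged[ $i - 1 ]{start}, $merged[$i]{start}, proximity($diff);
}
```

After merging, no two spans share a start offset, so $diff never reaches 0 and the log() call is safe.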
PS: I saw this comment in HeatMap.pm:
    # XXX: This calls the same methods over and over, as does the block
    # below. Is there any way to speed this up?
    my @orig_posits = sort {
        $a->get_start_offset <=> $b->get_start_offset ||
        $b->get_end_offset <=> $a->get_end_offset
    } @$spans;
If that section turns out to be a bottleneck, it's trivial to port it
to C, where it will be lightning fast.
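Short of porting to C, a pure-Perl option is to call each accessor once per span and sort on the cached keys (a Schwartzian transform). That's a different technique from the C port, and I haven't measured it. My::Span below is a hypothetical stand-in for HighlightSpan so that the sketch runs on its own.

```perl
use strict;
use warnings;

package My::Span;    # hypothetical stand-in for HighlightSpan
sub new { my ( $class, %args ) = @_; return bless {%args}, $class }
sub get_start_offset { return $_[0]{start_offset} }
sub get_end_offset   { return $_[0]{end_offset} }

package main;

my $spans = [
    My::Span->new( start_offset => 20, end_offset => 25 ),
    My::Span->new( start_offset => 5,  end_offset => 9 ),
    My::Span->new( start_offset => 5,  end_offset => 12 ),
];

# Schwartzian transform: wrap each span with its precomputed keys,
# sort on those (start ascending, then end descending), then unwrap.
my @orig_posits =
    map  { $_->[2] }
    sort { $a->[0] <=> $b->[0] || $b->[1] <=> $a->[1] }
    map  { [ $_->get_start_offset, $_->get_end_offset, $_ ] } @$spans;

my $order = join ", ",
    map { $_->get_start_offset . "-" . $_->get_end_offset } @orig_posits;
print "$order\n";    # prints "5-12, 5-9, 20-25"
```

The ordering is the same as in the original sort block; only the number of method calls changes, from two per comparison to two per span.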
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/