[KinoSearch] Merging indexes, etc
Marvin Humphrey
marvin at rectangular.com
Fri Oct 20 07:07:50 PDT 2006
On Oct 19, 2006, at 3:39 AM, henka at cityweb.co.za wrote:
> OK, so the danger of clobbering a working master index is almost zero.
That's the design. Doug Cutting gets most of the credit. Or whoever
he learned it from does. :)
However, KinoSearch improves on Lucene's model in that Lucene, while
indexing, makes multiple commits at times which the user cannot track
or control -- so it's more difficult to recover from a crashed
indexing session. There isn't the equivalent of KinoSearch's single
commit.
> Backing up the amount of data involved is, well, becoming difficult...
Do you have a plan in place in case the unlikely occurs and an index
does become corrupted? Obviously, this isn't supposed to happen, and
bug reports regarding busted KS indexes have been very rare. But I'm
human, and so are computers.
> Here's the error btw:
>
> Can't locate object method "_release_locks" via package
> "self" (perhaps
> you forgot to load "self"?) at
> /usr/lib/perl5/site_perl/5.8.7/i486-linux/KinoSearch/InvIndexer.pm
> line
> 273.
Looks like the DESTROY method is misbehaving. Maybe this will help?
-sub DESTROY { shift->_release_locks }
+sub DESTROY {
+ my $self = shift;
+ $self->_release_locks;
+}
I suspect what's happening is that the object is being reclaimed
before _release_locks() gets dispatched, because of the funny way I
used shift(). By making a local copy of $self, the object's refcount
should be increased and it should stick around until the end of the
block.
> Search performance is great, btw. Normal keyword searching still
> gives
> sub-second search times, with phrase searching pushing it just over
> 1s...
> and this on a machine (granted, a dual opteron with dual cores) bogged
> down with crawlers and indexers running simultaneously.
Excellent.
> Well done Marvin. This is friggin good stuff. I must confess to a
> certain level of anxiety building up to my first test search using
> KS with
> a decent size index, but your excellent work surprised my pants off.
> There appears to be almost no performance penalty (well, small)
> between
> searching an index of a few MB and 36GB.
There is room for improvement yet. KinoSearch does not currently
implement Lucene's skipTo() function.
skipTo() is called by a Scorer against a TermDocs object. Say you
have a phrase search for '"the mermen"'. 'the' is quite common, while
'mermen' should be quite rare, and let's assume for the purposes of
argument that we're not stopalizing 'the'. What KinoSearch currently
does is something like this:
while (defined ( my $mermen_doc = $mermen_term_docs->next ) )
while ( defined ( my $the_doc = $the_term_docs->next ) ) {
last if $the_doc >= $mermen_doc;
}
if ( $the_doc == $mermen_doc ) {
my $score = $scorer->calc_score($the_doc);
$hit_collector->collect($the_doc, $score);
}
}
skipTo() replaces next().
$mermen_doc = $mermen_term_docs->next;
$the_doc = $the_term_docs->next;
while ( defined ( $mermen_doc = $mermen_term_docs->skip_to
($the_doc) ) ) {
while ( defined ( $the_doc = $the_term_docs->skip_to
($mermen_doc) ) ) {
# ...
The idea is, if the first doc with 'mermen' in it is number 20000,
and 'the' appears in 15000 of the docs between 0 and 20000, don't
iterate over them one by one checking to see if they match -- just
skip to 20000 (or the next number above that) as quickly as possible.
Lucene's implementation of termDocs.skipTo() still basically requires
that the entire file containing the document numbers be read
linearly, so disk i/o is not lessened. However, it cuts down on CPU
consumption by deciphering maybe only 10% of the file instead of
100%. I've toyed with whether I can improve this scheme by reworking
the index structure to use a fixed block size, but it doesn't look
like I can do much without making things a lot more complex at index
time. So at this point, it's just a matter of porting skipTo().
Adding skipTo() to SegTermDocs and MultiTermDocs is easy, and should
yield some improvement of speed on phrase queries right away.
However, more significant benefits will accrue when I port Lucene's
BooleanScorer2 and its dependencies. That's more work.
> I imagine (with a tremor in my
> voice) that searching 1-2TB (that's T for terra people) will also
> scale
> well.
As you scale up in index size, search-time costs should scale up
pretty more or less linearly. The bottleneck at search-time is the
time to process oodles of docs which match common terms. As the
number of docs increases, so do CPU and disk i/o requirements.
At some point, index size becomes too great for any one machine to
handle gracefully. What needs to happen then is for documents to be
distributed around several machines on several indexes -- so you no
longer have one monolithic index. Each machine then searches against
its smaller index, the results are pooled, and there's something like
a runoff election to determine which documents get returned.
KinoSearch does not yet have the infrastructure to support this, but
the design is out there and just needs to be implemented.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
More information about the kinosearch
mailing list