[KinoSearch] Merging indexes, etc

Marvin Humphrey marvin at rectangular.com
Fri Oct 20 07:07:50 PDT 2006




On Oct 19, 2006, at 3:39 AM, henka at cityweb.co.za wrote:

> OK, so the danger of clobbering a working master index is almost zero.

That's the design.  Doug Cutting gets most of the credit.  Or whoever  
he learned it from does. :)

However, KinoSearch improves on Lucene's model in that Lucene, while  
indexing, makes multiple commits at times which the user cannot track  
or control -- so it's more difficult to recover from a crashed  
indexing session.  There isn't the equivalent of KinoSearch's single  
commit.

> Backing up the amount of data involved is, well, becoming difficult...

Do you have a plan in place in case the unlikely occurs and an index  
does become corrupted?  Obviously, this isn't supposed to happen, and  
bug reports regarding busted KS indexes have been very rare.  But I'm  
human, and so are computers.

> Here's the error btw:
>
> Can't locate object method "_release_locks" via package  
> "self" (perhaps
> you forgot to load "self"?) at
> /usr/lib/perl5/site_perl/5.8.7/i486-linux/KinoSearch/InvIndexer.pm  
> line
> 273.

Looks like the DESTROY method is misbehaving.  Maybe this will help?

-sub DESTROY { shift->_release_locks }
+sub DESTROY {
+    my $self = shift;
+    $self->_release_locks;
+}

I suspect what's happening is that the object is being reclaimed  
before _release_locks() gets dispatched, because of the funny way I  
used shift().  By making a local copy of $self, the object's refcount  
should be increased and it should stick around until the end of the  
block.

> Search performance is great, btw.  Normal keyword searching still  
> gives
> sub-second search times, with phrase searching pushing it just over  
> 1s...
> and this on a machine (granted, a dual opteron with dual cores) bogged
> down with crawlers and indexers running simultaneously.

Excellent.

> Well done Marvin.  This is friggin good stuff.  I must confess to a
> certain level of anxiety building up to my first test search using  
> KS with
> a decent size index, but your excellent work surprised my pants off.
> There appears to be almost no performance penalty (well, small)  
> between
> searching an index of a few MB and 36GB.

There is room for improvement yet.  KinoSearch does not currently  
implement Lucene's skipTo() function.

skipTo() is called by a Scorer against a TermDocs object.  Say you  
have a phrase search for '"the mermen"'. 'the' is quite common, while  
'mermen' should be quite rare, and let's assume for the purposes of  
argument that we're not stopalizing 'the'. What KinoSearch currently  
does is something like this:

    while (defined ( my $mermen_doc = $mermen_term_docs->next ) )
       while ( defined ( my $the_doc = $the_term_docs->next ) ) {
           last if $the_doc >= $mermen_doc;
       }
       if ( $the_doc == $mermen_doc ) {
           my $score = $scorer->calc_score($the_doc);
           $hit_collector->collect($the_doc, $score);
       }
    }

skipTo() replaces next().

    $mermen_doc = $mermen_term_docs->next;
    $the_doc    = $the_term_docs->next;
    while ( defined ( $mermen_doc = $mermen_term_docs->skip_to 
($the_doc) ) ) {
       while ( defined ( $the_doc = $the_term_docs->skip_to 
($mermen_doc) ) ) {
           # ...

The idea is, if the first doc with 'mermen' in it is number 20000,  
and 'the' appears in 15000 of the docs between 0 and 20000, don't  
iterate over them one by one checking to see if they match -- just  
skip to 20000 (or the next number above that) as quickly as possible.

Lucene's implementation of termDocs.skipTo() still basically requires  
that the entire file containing the document numbers be read  
linearly, so disk i/o is not lessened.  However, it cuts down on CPU  
consumption by deciphering maybe only 10% of the file instead of  
100%.  I've toyed with whether I can improve this scheme by reworking  
the index structure to use a fixed block size, but it doesn't look  
like I can do much without making things a lot more complex at index  
time.  So at this point, it's just a matter of porting skipTo().

Adding skipTo() to SegTermDocs and MultiTermDocs is easy, and should  
yield some improvement of speed on phrase queries right away.   
However, more significant benefits will accrue when I port Lucene's  
BooleanScorer2 and its dependencies.  That's more work.

> I imagine (with a tremor in my
> voice) that searching 1-2TB (that's T for terra people) will also  
> scale
> well.

As you scale up in index size, search-time costs should scale up  
pretty more or less linearly.  The bottleneck at search-time is the  
time to process oodles of docs which match common terms.  As the  
number of docs increases, so do CPU and disk i/o requirements.

At some point, index size becomes too great for any one machine to  
handle gracefully.   What needs to happen then is for documents to be  
distributed around several machines on several indexes -- so you no  
longer have one monolithic index.  Each machine then searches against  
its smaller index, the results are pooled, and there's something like  
a runoff election to determine which documents get returned.

KinoSearch does not yet have the infrastructure to support this, but  
the design is out there and just needs to be implemented.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch




More information about the kinosearch mailing list