[KinoSearch] Merging indexes, etc

henka at cityweb.co.za henka at cityweb.co.za
Thu Oct 19 03:39:54 PDT 2006




Thanks for the detailed response, Marvin.

>> 1.  Would the master index survive an interrupted
>> $iv->add_invindexes($a,$b,...), and be receptive to further
>> add_invindexes()s without a problem?  If not, how can it be resolved?
>
> There's one critical split-second during finish(), when
> "segments.new" gets renamed to "segments".  Up until that point, no
> changes are permanent.  The new files are there, but they won't be
> used.  If the InvIndexer ceases operations at any point before the
> renaming of "segments.new", then the next time an InvIndexer is
> created against that invindex location, it will overwrite the unused
> segment files.

OK, so the danger of clobbering a working master index is almost zero. 
Backing up the amount of data involved is, well, becoming difficult...
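
A minimal sketch of that commit-by-rename idea, in plain Perl with no
KinoSearch calls.  The helper names (stage_segments, commit_segments,
read_segments) are hypothetical and this is not how InvIndexer is actually
written; it only illustrates why an interruption before the rename leaves
the old "segments" file in place.  It assumes a POSIX filesystem, where
rename() within one filesystem is atomic.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helpers for illustration only, not part of KinoSearch.

# Write the new contents to "segments.new"; readers never look at this file.
sub stage_segments {
    my ( $dir, $contents ) = @_;
    open my $fh, '>', "$dir/segments.new"
        or die "Can't open $dir/segments.new: $!";
    print {$fh} $contents;
    close $fh or die "Can't close $dir/segments.new: $!";
    return;
}

# Publish the staged file in one step by renaming it over "segments".
sub commit_segments {
    my ($dir) = @_;
    rename "$dir/segments.new", "$dir/segments"
        or die "Can't rename $dir/segments.new: $!";
    return;
}

# Return the whole contents of the live "segments" file.
sub read_segments {
    my ($dir) = @_;
    open my $fh, '<', "$dir/segments"
        or die "Can't open $dir/segments: $!";
    local $/;
    my $contents = <$fh>;
    close $fh;
    return $contents;
}

my $dir = tempdir( CLEANUP => 1 );

# A session that completes: stage, then commit.
stage_segments( $dir, "generation 1\n" );
commit_segments($dir);

# An "interrupted" session: staged, but the rename never happens.
stage_segments( $dir, "generation 2\n" );

print read_segments($dir);    # prints "generation 1"
```

The leftover "segments.new" is simply overwritten by the next stage, which
mirrors the "it will overwrite the unused segment files" behaviour described
above.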

>> 2.  The docs aren't specific:  I presume one can
>> $iv->delete_docs_by_term($term) on the same $iv as the one being
>> operated on with add_invindexes()?  I've encountered no errors so far,
>> but was wondering.
>
> Hmm, so you have something like this?


## foreach tempindex {

>      my $invindexer = KinoSearch::InvIndexer->new(
>          invindex => $invindex,
>          analyzer => $analyzer,
>      );
>      $invindexer->delete_docs_by_term($term);
>      $invindexer->add_invindexes( $another_invindex, $yet_another_invindex );
>      $invindexer->finish;

## }

Exactly (but with loop pseudo-code added).  Here's the error btw:

Can't locate object method "_release_locks" via package "self" (perhaps
you forgot to load "self"?) at
/usr/lib/perl5/site_perl/5.8.7/i486-linux/KinoSearch/InvIndexer.pm line
273.

For now, I'm just not calling $iv->finish if nothing was added (but that
presents a problem if delete_docs_by_term($term) has been called).

> I didn't consider that use case, so there's no test written for it,
> but I think it ought to work.

My tests later today will reveal whether the delete_docs_by_term() calls
are working.  My initial test run on a small subset of the data (creating
a 36GB index) did not delete existing docs, so I will let you know.

Search performance is great, btw.  Normal keyword searching still gives
sub-second search times, with phrase searching pushing it just over 1s...
and this on a machine (granted, a dual opteron with dual cores) bogged
down with crawlers and indexers running simultaneously.  Searching will
eventually be done on multiple dedicated machines, so search times will
decrease even further.

Well done Marvin.  This is friggin good stuff.  I must confess to a
certain level of anxiety building up to my first test search using KS with
a decent size index, but your excellent work surprised my pants off. 
There appears to be almost no performance penalty (well, small) between
searching an index of a few MB and 36GB.  I imagine (with a tremor in my
voice) that searching 1-2TB (that's T for tera, people) will also scale
well.

> Eventually, the merge logic is going to change some and the
> restriction against performing add_doc and add_invindexes on the same
> InvIndexer object will be lifted.  The reason that restriction exists
> is that merging of indexes/segments which may have different field
> defs is complex.  However, Dave Balmain has come up with a design
> which solves that problem and I'm going to implement it.

The current restriction on add_doc and add_invindexes is OK in our case -
the merging is separate from indexing proper.

>> 3.  During a merge operation of many temp indexes into a master
>> index, if no call is made to add_invindexes() before a finish() (maybe
>> because of an empty/invalid temp index, etc), it generates an error
>> (sorry, busy with a run at the moment, so will paste sample error
>> later).  I've recoded the logic to side-step the error (ie, don't
>> finish() if nothing is added), but I wonder if this might have any
>> repercussions (ie, calling "my $iv =
>> KinoSearch::InvIndexer->new(...)" on the same index without calling
>> finish() in a loop).
>
> There's some stuff in there to make finish() a no-op if nothing's
> being changed.  Sounds like that's failing, but I don't understand
> why.  What's being called on $iv in between the last spec_field() and
> finish() ?

Nothing:  the logic is as you outlined above:

loop for all temp indexes.
    ::InvIndexer->new;
    delete existing docs by $term;
    add temp index to main index if temp index OK;
    ->finish;      #  fails here if previous add not performed.
                      #  finish *must* be called because we have
                      #  delete_docs_by_term() above...
end loop;

Simple as that.  If this cannot be resolved, I might have to have two
separate loops:  one to delete_docs_by_term for all temp indexes, then the
other to merge 'em with master.

>
>> 4.  What's a good test to detect a bad/invalid/broken temp index?
>> At the moment I just check if the "segments" file exists and is
>> non-zero.
>
> If an indexing session is interrupted before finish completes, the
> segments file will exist, and it will have a length -- however when
> KS reads it, KS will see that the invindex doesn't have any
> segments.  That is, or ought to be, a valid state (I don't think I
> have a test written guaranteeing that it will be).
>
>> However, the segments file *will* exist if a temp index run is
>> interrupted - what other files *shouldn't* exist (and indicate a temp
>> index which is half-baked) so that I can refine the temp index
>> validation?
>
> The problem is that you can't tell whether or not an indexing session
> was interrupted by the file contents of the invindex.  I'd suggest
> adding failsafe logic to the app that creates your sub-index which
> tells you whether or not the indexing session completed.
>
>     $invindexer->finish;
>     session_succeeded();
>
> If session_succeeded() doesn't fire, assume that you have a broken
> sub-index and need to repeat whatever actions it took to build it.

Good suggestion.  This approach is simpler, methinks.
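
One way the session_succeeded() idea could look, as a rough sketch in plain
Perl.  Everything here is an assumption for illustration: the ".complete"
marker name, its location next to (not inside) the temp index directory, and
the session_started() / session_completed() helpers are all hypothetical,
and none of it is KinoSearch API.  The merge loop would then test the marker
instead of checking whether "segments" exists and is non-zero.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical marker file kept beside the temp index directory, so that
# nothing written into the invindex itself can disturb it.
sub marker_path {
    my ($invindex_path) = @_;
    ( my $path = $invindex_path ) =~ s{/+\z}{};    # drop trailing slashes
    return "$path.complete";
}

# Call before (re)building a temp index: remove any stale marker.
sub session_started {
    my ($invindex_path) = @_;
    my $marker = marker_path($invindex_path);
    if ( -e $marker ) {
        unlink $marker or die "Can't remove $marker: $!";
    }
    return;
}

# Call immediately after $invindexer->finish has returned.
sub session_succeeded {
    my ($invindex_path) = @_;
    my $marker = marker_path($invindex_path);
    open my $fh, '>', $marker or die "Can't open $marker: $!";
    print {$fh} time() . "\n";    # record when the session completed
    close $fh or die "Can't close $marker: $!";
    return;
}

# Used by the merge step: true only if the build ran to completion.
sub session_completed {
    my ($invindex_path) = @_;
    return -e marker_path($invindex_path) ? 1 : 0;
}

my $base       = tempdir( CLEANUP => 1 );
my $temp_index = "$base/temp_index_1";
mkdir $temp_index or die "Can't mkdir $temp_index: $!";

session_started($temp_index);
my $before = session_completed($temp_index);    # build not finished yet

# ... build the sub-index here, ending with $invindexer->finish; ...
session_succeeded($temp_index);
my $after = session_completed($temp_index);

print "before finish: $before, after finish: $after\n";
```

If the indexing process dies anywhere before session_succeeded() runs, the
marker is absent and the temp index is treated as broken and rebuilt.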




_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



