[KinoSearch] KinoSearch index deployment/distribution
Marvin Humphrey
marvin at rectangular.com
Sun Apr 16 11:20:59 PDT 2006
Hello Kevin,
In order to present a coherent reply, I've been working to make
KinoSearch's behavior in this area coherent. Version 0.09, released
Thursday, is the first version to support incremental indexing. With
all earlier versions, you could only index in a single shot, so
complex resource locking issues didn't come into play. The
infrastructure is about 75% there in 0.09, but it doesn't yet work as
planned.
> Specifically I'm wondering best practices for dealing with an
> infrastructure with multiple front-ends who would be querying the
> index, and potentially building up the index.
With 0.09, the best option is probably brute force: build a fresh index
elsewhere and swap it in, as described at the end of this message.
Here's how things are supposed to work, and will work Real Soon Now(TM):
An invindex is made up of "segments", each of which is an independent
inverted index. A small, central file, named "segments", contains
information about which segments are valid. Searchers (actually,
IndexReaders within Searchers) consult the "segments" file and open
up SegReaders for each valid segment. These are based on
filehandles, and the SegReader stays valid even if the files it's
accessing get unlinked. However, because SegReaders keep reading the
same files for as long as the readers exist, Searchers go out of date:
you need to reopen them once the index gets updated, or they will
return outdated results.
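To illustrate the filehandle behavior described above, here is a minimal sketch in Python (not KinoSearch code). It assumes a POSIX filesystem, where an open handle keeps working after the file is unlinked. The function name and the file contents are made up for the example.

```python
import os
import tempfile

def read_after_unlink():
    # Create a stand-in for a segment file (contents are illustrative only).
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.write("old segment data")

    reader = open(path)      # plays the role of a reader holding a filehandle
    os.unlink(path)          # the writer removes the file from the directory
    still_listed = os.path.exists(path)

    data = reader.read()     # on POSIX, the open handle can still read the data
    reader.close()
    return still_listed, data
```

The reader keeps seeing the old data for as long as it holds the handle, which is why a stale Searcher keeps answering from the old segments until it is reopened.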
The only way in which an InvIndexer modifies an existing segment is by
rewriting the list of documents within that segment which are
deleted. The primary file (the big one with the .cfs extension),
once written, is never modified. When you add documents to an
InvIndexer, they are poured into a new segment. It's perfectly OK to
do this while the index is being accessed because the new segment is
not listed in the "segments" file, so Searchers can't see it.
When you call $invindexer->finish(), first, a bunch of sorting/
merging/writing stuff happens. One item of interest is the recycling
of existing segments. KinoSearch determines which segments are the
best candidates for absorption using an algorithm based on the
Fibonacci series, trying to strike a balance between minimizing the
number of segments which a Searcher must search against and the index-
time overhead of merging segments together. Segments so identified
are dumped into the new segment.
Then, near the end of $invindexer->finish(), the "segments" file and
the deletions files get rewritten. During this window, the
InvIndexer claims a commit lock on the invindex and no new Searchers/
IndexReaders may be opened. After the lock is released, any segments
which were consolidated into the new segment are unlinked.
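The visibility rule described above can be modeled with a toy example in Python. This is only a sketch of the idea, not KinoSearch's file format or code. The JSON manifest, the seg_N names, and the commit() helper are assumptions invented for illustration. The commit lock is omitted; a write-then-rename stands in for it so that a reader never sees a half-written manifest.

```python
import json
import os
import tempfile

def read_manifest(index_dir):
    # Readers only ever learn about segments through the "segments" file.
    with open(os.path.join(index_dir, "segments")) as f:
        return json.load(f)

def commit(index_dir, new_segment, absorbed):
    # Drop the absorbed segments from the list and add the new one.
    valid = [s for s in read_manifest(index_dir) if s not in absorbed]
    valid.append(new_segment)
    # Write the new manifest to a scratch file, then rename it into place.
    tmp = os.path.join(index_dir, "segments.tmp")
    with open(tmp, "w") as f:
        json.dump(valid, f)
    os.replace(tmp, os.path.join(index_dir, "segments"))
    # Only after the manifest is updated are the old segment files unlinked.
    for seg in absorbed:
        os.unlink(os.path.join(index_dir, seg))

def demo():
    index_dir = tempfile.mkdtemp()
    for name in ("seg_1", "seg_2"):
        open(os.path.join(index_dir, name), "w").close()
    with open(os.path.join(index_dir, "segments"), "w") as f:
        json.dump(["seg_1", "seg_2"], f)
    # A new segment exists on disk but is not yet listed, so it is invisible.
    open(os.path.join(index_dir, "seg_3"), "w").close()
    before = read_manifest(index_dir)
    commit(index_dir, "seg_3", absorbed=["seg_1", "seg_2"])
    after = read_manifest(index_dir)
    return before, after, sorted(os.listdir(index_dir))
```

In this model, a reader that opened seg_1 and seg_2 before the commit can keep using its open handles, while any reader opened afterwards sees only seg_3.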
> Is it possible to merge multiple indexes together?
At index-time: version 0.10 will add the add_invindexes() method to
InvIndexer, making it possible for one invindex to absorb others.
That feature is done in Subversion as of revision 851.
At search time: Not yet, and probably not soon. There are two kinds
of items on my TODO list: items which happen before the file format
is finalized and KinoSearch comes out of alpha, and items which
happen after that. Search-time multiplexing of results from multiple
invindexes is definitely after.
> or is it best to have just one box dedicated to doing the actual
> indexing and pushing the results out to the front-ends? or should
> the front-ends NFS mount the index and use it that way (I strongly
> suspect this would have locking issues based on some other things
> I've read).
The locking mechanism relies on lockfiles created then unlinked
within the system's temp directory. It doesn't use flock, so flock
portability isn't an issue. However, if you have multiple systems
trying to access a frequently-updated common index and they don't
agree about where the lockfiles are located, eventually something is
going to go haywire. I'd roll my own mutex for that situation.
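One way to sketch such a hand-rolled mutex is a lockfile at a path that every participant agrees on. The Python below is a minimal illustration under that assumption. The lockfile_mutex name and its timeout and poll parameters are hypothetical, and nothing here is part of KinoSearch. It relies on exclusive file creation being atomic on the filesystem in use, which should be verified for any shared or network mount.

```python
import os
import time
from contextlib import contextmanager

@contextmanager
def lockfile_mutex(lock_path, timeout=10.0, poll=0.1):
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only one
            # process at a time can succeed in creating the lockfile.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        # Record the owner's pid to help with debugging stale locks.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)   # releasing the lock means removing the file
```

An indexer would wrap its update in `with lockfile_mutex(path):`, and front-ends would take the same lock while reopening their Searchers. A process that dies while holding the lock leaves the file behind, so a real deployment would also need a policy for stale locks.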
> Is it possible to build an index while simultaneously searching? My
> current incarnation seems to indicate no, but maybe I've overlooked
> something. Perhaps I should build the index in a temp directory and
> then copy it over the latest index once it's completed generating,
> then force the searcher to re-open at this point.
That's what I recommend for 0.09.
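The build-elsewhere-then-swap approach might be sketched as follows in Python. This is an illustration only: build_index is a hypothetical callback standing in for whatever code writes a complete index into a directory, and the directory names are made up. It swaps by renaming rather than copying, which is a substitution for the copy step discussed above. There is a brief moment between the two renames when the live path does not exist.

```python
import os
import shutil
import tempfile

def rebuild_and_swap(live_dir, build_index):
    live_dir = os.path.abspath(live_dir)
    parent = os.path.dirname(live_dir)
    # Build next to the live index so the renames stay on one filesystem.
    new_dir = tempfile.mkdtemp(prefix="invindex_new_", dir=parent)
    build_index(new_dir)            # hypothetical: writes a complete index

    old_dir = live_dir + ".old"
    if os.path.exists(live_dir):
        os.rename(live_dir, old_dir)    # move the current index aside
    os.rename(new_dir, live_dir)        # put the fresh index in its place
    shutil.rmtree(old_dir, ignore_errors=True)
```

After the swap, any Searcher that was already open still holds handles on the old files and has to be reopened to see the new index.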
Best,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/