[KinoSearch] more fun with kinosearch
marvin at rectangular.com
Mon Jun 19 16:55:25 PDT 2006
More good stuff, Eric!
On Jun 18, 2006, at 4:22 PM, Eric Lease Morgan wrote:
> My medium-term goal is to demonstrate how a new & improved library
> catalog could function.
> To accomplish these goals I have begun experimenting more with
> KinoSearch and content from Project Gutenberg. First I downloaded
> the RDF representation of Project Gutenberg content -- all 44 MB of
> it. I then parsed the RDF and cached what I needed to a (MyLibrary)
> database. I then looped through each Project Gutenberg record --
> all 24,000 of them -- downloaded the full text of each item, and
> fed the whole thing to KinoSearch.
Out of curiosity, do you know the total amount of disk space a
decompressed Project Gutenberg takes up?
> 3. KinoSearch requires a lot of extra disk space in order to
> optimize. I scaled back my experiment a few times, and the last
> time I squeaked by with less than a MB to spare. When optimization
> was complete I was given back more than 10 GB of disk space. BTW, my
> index is 7 GB in size.
The disk space eaten during the indexing process is not something I
have concerned myself with overmuch. It is probably possible to dial
it back some.
[Aside to Dave Balmain: this would be achieved by having the posting
serialization use compressed integers for position and offset data.]
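For anyone curious what "compressed integers" means here: Lucene-style indexes store posting data as variable-length byte-encoded ints, so small numbers (the common case for position deltas) take one byte instead of four. A rough Python sketch of the idea -- not KinoSearch's actual C implementation:

```python
def encode_vint(n):
    """Encode a non-negative int as a variable-length byte string.
    Each byte carries 7 bits of payload, low-order group first;
    the high bit flags that another byte follows."""
    out = bytearray()
    while True:
        if n < 0x80:
            out.append(n)
            return bytes(out)
        out.append((n & 0x7F) | 0x80)
        n >>= 7

def decode_vint(data):
    """Decode a variable-length int; returns (value, bytes_consumed)."""
    value = shift = i = 0
    while True:
        byte = data[i]
        value |= (byte & 0x7F) << shift
        i += 1
        if not byte & 0x80:
            return value, i
        shift += 7
```

Values under 128 cost a single byte, which is why this pays off so well for position and offset deltas.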
> 5. Freetext searches take a long time to execute, longer than most
> people will be willing to wait. At the same time, my hardware is
> not very big. It is considered to be a hand-me-down.
Throwing hardware at this problem will probably help a lot.
Another thing you will definitely want to do with any large
deployment is run under mod_perl or FastCGI, so you can reuse a
persistent Searcher object. That way you aren't reloading the
Searcher's caches with each query.
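To make the lifecycle point concrete, here's a language-neutral sketch in Python (hypothetical names -- the point is when the object gets built, not the KinoSearch API):

```python
class Searcher:
    """Stand-in for a search object with expensive startup costs."""
    def __init__(self, index_path):
        self.index_path = index_path
        # Slow part: reading term dictionaries, norms, etc. into memory.
        self.caches = self._warm_caches()

    def _warm_caches(self):
        return {"term_dict": "...", "norms": "..."}

    def search(self, query):
        return f"results for {query!r}"

# Plain-CGI style: a fresh Searcher per request pays the
# warm-up cost on every single query.
def handle_request_cgi(query):
    return Searcher("/path/to/invindex").search(query)

# mod_perl/FastCGI style: construct once at process startup,
# then reuse the warmed-up object across many requests.
PERSISTENT_SEARCHER = Searcher("/path/to/invindex")

def handle_request_persistent(query):
    return PERSISTENT_SEARCHER.search(query)
```

Both handlers return the same results; only the second amortizes the cache-loading cost.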
Your major bottleneck happens elsewhere, though...
> 6. Indexed fields (title, creator, subject, etc.), whose content
> is much smaller than the free text field, respond *much* quicker.
Resource consumption at search-time is dominated by the time spent
pawing through common terms. I'd be curious to know whether it's
disk access or CPU that's the bottleneck right now.
If you're pressed for resources, you can opt to insert a Stopalizer
into the PolyAnalyzer chain. That's not the default for a variety of
reasons. (For starters, your search for 'to be or not to be' won't
match anything because those are all stop words.)
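The effect of a stoplist filter is easy to demonstrate. A toy Python sketch (a tiny made-up stoplist, not KinoSearch's Stopalizer):

```python
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of", "and"}  # tiny sample list

def tokenize(text):
    """Naive whitespace tokenizer, lowercased."""
    return text.lower().split()

def stopalize(tokens, stoplist=STOP_WORDS):
    """Drop stop words from a token stream, as a stoplist filter would."""
    return [t for t in tokens if t not in stoplist]

print(stopalize(tokenize("To be or not to be")))
# -> [] : every token is a stop word, so the query matches nothing
print(stopalize(tokenize("The north carolina coast")))
# -> ['north', 'carolina', 'coast']
```

You trade away a few pathological queries in exchange for never having to scan the enormous posting lists for "the" and friends.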
> 8. Precision-recall is not the greatest because there is too much
> noise in the full text. For example, searches for "north carolina"
> return too much irrelevant stuff.
I'd like to make it possible to improve this a bit. At present, the
Similarity class is not exposed. For many purposes, its absence
isn't important, but if you have a really big document collection,
the drive to learn the math behind search scoring, and the time to
experiment, it's possible to make some gains.
The main reason Similarity is not exposed is that it has to specify C
callbacks which are invoked for every document which matches the
query (Perl callbacks would be unacceptably slow). It's a bit
awkward to present an interface on a Perl module requiring the user
to supply a C pointer-to-function. :)
However, a recent contribution to Lucene (Chris Hostetter's
SweetSpotSimilarity) has pointed the way to exposing this
functionality, and I look forward to incorporating it. The idea is
that Similarity will still be a C struct wrapped in a Perl object,
and it will still use C callbacks, which you won't be able to
override... but these functions will use variables that you can
affect. IOW, the contours of the curve will be pre-set, but you'll
be able to tweak the slopes.
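To illustrate the "pre-set contours, tweakable slopes" idea: SweetSpotSimilarity's length normalization gives a flat norm of 1.0 to documents whose length falls inside a configurable plateau, and falls off outside it. A Python sketch of that curve (parameter names are mine, modeled on Lucene's SweetSpotSimilarity.lengthNorm()):

```python
import math

def sweet_spot_length_norm(num_terms, min_len=1, max_len=100, steepness=0.5):
    """Length normalization with a tunable 'sweet spot': documents
    whose term count falls in [min_len, max_len] get a flat norm of
    1.0; outside the plateau the norm decays, with `steepness`
    controlling how quickly."""
    # How far outside the plateau we are (0 inside it).
    overshoot = (abs(num_terms - min_len) + abs(num_terms - max_len)
                 - (max_len - min_len))
    return 1.0 / math.sqrt(steepness * overshoot + 1.0)
```

The user never overrides the function itself -- only min_len, max_len, and steepness -- which is exactly the shape of API that C callbacks can support.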
For now, you might try a little Easter Egg I have hidden in the
distro:

    # WARNING: UNSUPPORTED HACK
    my $title_similarity = KinoSearch::Search::TitleSimilarity->new;
    $invindexer->spec_field(
        name       => 'title',
        similarity => $title_similarity,
    );

This is an index-time change, so you'd need to re-index.
Be aware, this stuff can take a little effort to grok. The defaults
work well, so we have the common cases covered. Hopefully we can
work up an API that makes it possible to tune results some without
understanding *all* the theory.
> suggesting alternative queries through the use of dictionaries and
You may be interested to know that the way Google, MSN, etc. do "did
you mean" is by tracking over time what people type after they enter
something else. If 75% of people type "librarian" after "libarian",
then anybody who enters "libarian" in the future is going to see a
"did you mean: librarian?" suggestion.
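The bookkeeping behind that is simple to sketch. A toy Python version (my own names; real systems add session windows, edit-distance checks, and much more data):

```python
from collections import defaultdict, Counter

class ReformulationTracker:
    """Suggest 'did you mean' corrections by tracking, across many
    sessions, what users type immediately after a given query."""
    def __init__(self, threshold=0.75):
        self.followups = defaultdict(Counter)
        self.threshold = threshold

    def record_session(self, queries):
        # Count each query -> next-query transition.
        for first, second in zip(queries, queries[1:]):
            self.followups[first][second] += 1

    def suggest(self, query):
        counts = self.followups.get(query)
        if not counts:
            return None
        best, n = counts.most_common(1)[0]
        # Only suggest if a clear majority reformulated the same way.
        if n / sum(counts.values()) >= self.threshold:
            return best
        return None
```

With enough traffic, misspellings correct themselves: no dictionary required.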
> Wish me luck.
KinoSearch mailing list
KinoSearch at rectangular.com
More information about the KinoSearch mailing list