marvin at rectangular.com
Fri May 15 12:12:19 PDT 2009
On Thu, May 14, 2009 at 08:42:32AM -0700, webmasters at ctosonline.org wrote:
> I’ve just discovered the hard way that Scorer->next has to return doc
> ids in order. Here is a patch to mention it in the cookbook:
Thanks, applied as r4594.
> Is it this way to allow KS to tiptoe over deleted documents efficiently?
It's all about scalability.
When scoring, all doc id sets are represented as iterators. To get the
intersection of iterators, they need to proceed in an orderly, predictable
Say you want the top hits for "foo AND bar". If you weren't worried about
scalability, you could cache separate result sets as arrays -- one for 'foo'
and one for 'bar' -- then intersect them, sort, and grab the top few results.
However, that doesn't scale up to millions of hits because of the size of
those result sets.
So instead, the iterators for 'foo' and 'bar' ascend through doc nums in sync,
hits are collected one at a time into a priority queue which only holds as
many hits as absolutely necessary, and less relevant hits fall out the bottom
of the queue, keeping memory costs under control.
A better implementation of PrefixQuery would actually use a priority queue to
hold the PostingLists rather than grab all the results at construction time.
But then the cookbook entry would have to be a lot longer and more elaborate.
More information about the kinosearch