[KinoSearch] opening up the scorers

Nathan Kurz nate at verse.com
Sat Apr 19 12:10:45 PDT 2008



On Fri, Apr 18, 2008 at 7:50 PM, Marvin Humphrey <marvin at rectangular.com> wrote:
> > My desire for simplicity makes me wonder if
> > one could just have a single 'QueryNode' class that instantiates a
> > customizeable Scorer.
> >
>
>  I don't quite follow.

Instead of building a tree of different classes of Query, it seems
simpler to me to build a tree out of nodes of the same type and move
the class to a field:

QueryNode:
    scorer: "KSx::MyScorer"
    children:  [Child1, Child2]

Probably just my quirk, though. I've never liked subclassing for
the sake of avoiding function pointers (or their Perlish equivalents).
I'd also want a better name than QueryNode. ;)  So long as this tree
can be easily parsed and 'optimized', I guess I don't have a problem
with the current approach, though.
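
A minimal sketch of the single-node idea, in Python rather than Perl for brevity. The registry, the `make_scorer` method, and the `MyScorer` constructor are all hypothetical names for illustration, not KinoSearch API:

```python
from dataclasses import dataclass, field

# Hypothetical registry mapping a class-name string to a scorer class.
SCORERS = {}

def register(name):
    def wrap(cls):
        SCORERS[name] = cls
        return cls
    return wrap

@register("KSx::MyScorer")
class MyScorer:
    def __init__(self, children):
        self.children = children

@dataclass
class QueryNode:
    scorer: str                                   # class name, stored as data
    children: list = field(default_factory=list)

    def make_scorer(self):
        # Build child scorers first, then look up this node's scorer by name.
        kids = [child.make_scorer() for child in self.children]
        return SCORERS[self.scorer](kids)

tree = QueryNode("KSx::MyScorer",
                 [QueryNode("KSx::MyScorer"), QueryNode("KSx::MyScorer")])
root = tree.make_scorer()
```

Every node in the tree has the same type; only the `scorer` string differs, so swapping in a variant means changing a field rather than adding a subclass.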

>  You mean how would you persuade QueryParser to use your ORQuery variant
> rather than the default?

Yes, I'm wondering how to get a variant to actually be used.  As it
is, the official way seems to be to rewrite QueryParser to use my
own classes, but this seems onerous.  Or one could post-process the
Query tree and swap in the custom class.   Alternatively, one could
take the approach I did before I bogged down, and conclude that it's
simpler to skip the indirection and build the Scorer tree directly.

>  Probably we'd need to give QueryParser some sort
> of make_orquery() factory method you could override.
>  I'm not sure I want that to happen right away in core, though.
> QueryParser-type classes are sadly prone to death by Featuritis.  This is
> the kind of thing I'd rather see refined via KSx.

Definitely the custom scorer should go in KSx (or in some other
userspace) but there needs to be some way to use this class without
writing a lot of other infrastructure.  Either QueryParser needs to be
more easily subclassable, or needs to have customizable types (skip
factories, all we need is a class name string), or there needs to be a
hook to post-process the Query tree (s/// for trees a la XSLT).
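
The post-processing option could be sketched roughly as below, again in Python. The `Node` type, the `rewrite` function, and the `KSx::MyORQuery` name are assumptions made up for the example; nothing here is an existing QueryParser hook:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                     # e.g. "ORQuery", "TermQuery"
    children: list = field(default_factory=list)

def rewrite(node, swaps):
    """Return a copy of the tree with node kinds replaced per `swaps`."""
    return Node(
        swaps.get(node.kind, node.kind),
        [rewrite(child, swaps) for child in node.children],
    )

# A tree as a parser might hand it back (hypothetical shape).
parsed = Node("ANDQuery", [
    Node("ORQuery", [Node("TermQuery"), Node("TermQuery")]),
    Node("TermQuery"),
])

# Swap every ORQuery for the custom variant, leaving the rest alone.
custom = rewrite(parsed, {"ORQuery": "KSx::MyORQuery"})
```

The parser stays untouched, and the substitution is a single dictionary that the user supplies.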

> QueryParser doesn't parse 'NOT
> brobniquitz' down to a NOTQuery because it's standard behavior for search
> engines to parse that kind of thing as a void query with no result set
> rather than return the universe.

I strongly think you want to 'return the universe' here.  If you
design the system so it doesn't choke on large result sets, it will be
truly industrial strength and multi-purpose.   Instead of thinking
about this as a search engine (with standard search engine
constraints) think of KinoSearch as a general purpose database with
some really cool retrieval functions.   Make it strong and fearless!
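
As a rough illustration of "returning the universe" without choking on it, a bare NOT can be produced lazily. This is a sketch under the assumption that doc ids run from 0 to `max_doc - 1`; `not_query` and its parameters are hypothetical:

```python
def not_query(excluded, max_doc):
    """Yield every doc id in [0, max_doc) that is not in `excluded`."""
    excluded = set(excluded)
    for doc_id in range(max_doc):
        if doc_id not in excluded:
            yield doc_id

# The caller pulls only as many hits as it wants; nothing is materialized.
hits = not_query(excluded=[2, 5], max_doc=8)
first_three = [next(hits) for _ in range(3)]
```

Because the result is a generator, a very large universe costs nothing until the caller actually iterates over it.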

> > > ANDORQuery is the odd one out, because it doesn't really mean
> > > 'a AND/OR b'.
>
> > Ditto.  Why not just layer an AND and an OR?
> >
>
>  I don't think that's quite the same thing??

I was shooting from the hip, but I think 'A AND (A OR B)' would
produce the same results once normalized.  Given the way caching
works, this probably isn't actually that expensive, but I can
certainly see why it isn't perfect.  Alternatively, one could allow
OrScorer a non-zero no-match score, or come up with an
'OptionalTermScorer'.

But you are probably right:  while I like the building block
simplicity of these approaches, it's not that bad to have a custom
Scorer for this situation.  Although "Term AND OptionalTerm" is pretty
clear too.    If you do go with a RequiredAndOptionalScorer, though,
I'd request that it be able to handle arbitrary subqueries under the
Required half, rather than just straight Terms.
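
A toy comparison of the two approaches, with result sets modelled as dicts from doc id to score. The three functions and the additive scoring are simplifying assumptions for illustration, not how KinoSearch scorers work:

```python
def and_scores(left, right):
    # Intersection: a doc must match both sides; scores add.
    return {d: left[d] + right[d] for d in left.keys() & right.keys()}

def or_scores(left, right):
    # Union: a doc may match either side; scores add where both match.
    return {d: left.get(d, 0.0) + right.get(d, 0.0)
            for d in left.keys() | right.keys()}

def required_optional(required, optional):
    # Dedicated scorer: match set is `required`; `optional` only adds score.
    return {d: s + optional.get(d, 0.0) for d, s in required.items()}

a = {1: 1.0, 2: 0.5, 4: 2.0}      # could be the output of any subquery
b = {2: 1.5, 3: 1.0}

layered = and_scores(a, or_scores(a, b))   # 'A AND (A OR B)'
direct = required_optional(a, b)
```

Both produce exactly the documents matched by A. The layered form counts A's score twice, so the raw scores differ from the dedicated version unless they are adjusted afterwards.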

> Or, even better, say you have a simple TermQuery, and you
> find out that the term isn't in the index (because $searchable->doc_freq
> returns 0).  Then you can just return undef (indicating a null result set)
> instead of a Scorer.

This doesn't really strike me as an 'even better', but a recipe for
poorly handled rare error conditions.  A query with no results is
going to be very fast to run, so this optimization isn't really saving
much.  And it definitely made my code paths uglier trying to handle
this case.  I'd much prefer to get an actual empty set of results
after running the search (which I need to handle anyway) rather than
have this special case.
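
A sketch of the alternative, assuming a scorer is simply something the caller iterates over. `EmptyScorer`, `TermScorer`, `make_scorer`, and the dict-based index are hypothetical stand-ins:

```python
class EmptyScorer:
    """Scorer that matches nothing; stands in for a term not in the index."""
    def __iter__(self):
        return iter(())

class TermScorer:
    def __init__(self, postings):
        self.postings = postings      # list of (doc_id, score) pairs
    def __iter__(self):
        return iter(self.postings)

def make_scorer(index, term):
    # Always return a scorer object, never None, even when doc freq is 0.
    postings = index.get(term, [])
    return TermScorer(postings) if postings else EmptyScorer()

def search(index, term):
    # One code path: no special case for "there was no scorer".
    return [doc_id for doc_id, _score in make_scorer(index, term)]

index = {"sorbet": [(3, 1.2), (7, 0.4)]}
```

Here `search(index, "brobniquitz")` yields an ordinary empty list through the same loop that handles real hits.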

>  There is actually quite a lot that happens in between a Query and a Scorer.
> That's where the "Weight" classes come in - they encapsulate the process of
> compiling a Query to a Scorer.

Any chance you could write up what actually happens here?  And then
perhaps feeling too embarrassed to publish the as-builts, rework this
part of the architecture to make it simple, streamlined, and 3x5
cardable?  ;)


Nathan Kurz
nate at verse.com

> > ps. The ice cream goes pretty well: http://screamsorbet.com/
> >
>
>  Beet Lemon Sorbet!  Awesome.

Yeah, it's surprisingly good.  I'll send you up some once we figure
out the right way to package it for shipping.
