[KinoSearch] passing positions

Marvin Humphrey marvin at rectangular.com
Wed Sep 5 21:12:56 PDT 2007




On Jul 25, 2007, at 12:23 AM, Nathan Kurz wrote:

> (prescript:  I'm not attached to the names or terminology I'm using
> here.  Sometimes it is because I'm being sloppy, sometimes because I'm
> wrong, sometimes because I've chosen a name different than the current
> so I can contrast them, and sometimes because the current names don't
> resonate for me.  But in the end, we'll call things whatever you
> want.)

FWIW, I've written up the naming principles I follow as a blog entry:  
<http://www.rectangular.com/blog/my_name_is_variable.html>.

One of your inventions is Scorer_Advance.  I like it as a substitute  
for Scorer_Next, and it might be worth a global search and replace  
since that method isn't public yet. :)  However, in your code it  
appears to be a substitute for Scorer_Skip_To.  If that's correct, I  
think it's not the best choice, because there's greater contrast in a  
Scorer_Next:Scorer_Skip_To duality than a Scorer_Next:Scorer_Advance  
duality.

> I was able to think about how to pass positions a bit today, but am
> still searching for better ideas. But I thought I'd bounce my current
> thoughts off you.
>
> My mental goal is to allow a Phrase scorer to sit on top of an
> arbitrary tree of Boolean type scorers, and have enough information to
> determine if the terms found occur in the document as a phrase.  For
> example, "a|b c"  might parse to scorers nested as such:
> Phrase(
>     And(
>          Or(a b)
>          c
>    )
> )
> It's not that the actual Phrase scorer has to work like this, but if
> it did (and was efficient) I think we'd be in a good position to do
> generalized proximity matching.

 From a DRY standpoint, it would be nice to have a single  
PhraseScorer working over sub-scorers rather than having one which  
uses sub-Scorers and one which uses PostingLists.  Let's proceed  
under that assumption and mod PhraseScorer.  We can go back and  
unroll an auxiliary PostingLists version for efficiency if need be.
>
> I'm thinking that every Scorer will have a Match struct, either within
> it's Tally or standalone.  I'm not against having it in the Tally (or
> conjoined with it), but I haven't yet figured out how to make it work
> that way.

I think similar reasoning led you to Match and me to Tally.

> The trickiness (and I don't like trickiness) is that each Match is
> allowed to contain either an array of positions, or an array of Match
> structs:

I doubt that's necessary.  Just create a default wrapper at the  
lowest level.  That's how TermScorer does things presently.

> struct Match {
>       int field;       // set to MATCH_HAS_SUBMATCHES if submatch  
> array
>                        // set to MATCH_MULTIPLE_FIELDS if necessary
>                        // defaults to MATCH_INVALID_FIELD
>                        // else field number that positions come from

This variable name violates my "avoid overload overload" rule. :)  
"field" has a very specific meaning in the context of KS and this  
isn't it.

>
>       int alloc;       // set to 0 if static position data is  
> referenced
>                        // otherwise allocated length of positions
>                        //   (and lengths if lengths is non-null)
>                        // arrays below are copy on write if alloc is 0



>
>       int count;       // number of positions or submatches in data  
> array
>                        // also number in lengths array if lengths  
> non-null
>
>       void *data;      // used as union {
>                        //               // field ==  
> MATCH_HAS_SUBMATCHES ?
>                        //               Match **submatches;  // :
>                        //               int *positions;
>                        //         } data;

I think we can avoid this union.  See below.

>
>       int *lengths;    // NULL means all 1, else 'int array[count]'
>                        // (lengths is unused if MATCH_HAS_SUBMATCHES)
> }
>
> I'd hoped to be able to avoid the position/submatch duality, but I
> couldn't come up with any way to flatten the results that seemed
> efficient and general.

This was the driving factor behind the ScoreProx class.

The solution is general because each scorer presents an array of  
ScoreProx in its Tally.  That includes TermScorer, which presents a  
single element array.

It's efficient because the ScoreProx objects persist.  TermScorer  
keeps just copies pointers into it from the Posting Object.   
PhraseScorer keeps the same one around and reallocates space for the  
positions on an as-needed basis.

A better name for the ScoreProx class would be appreciated.  :)  It's  
the worst class name in KS, and the "num_sproxen" member var in Tally  
is the worst member var name.

> This structure, like Tally, would be created at Scorer init, and
> used/reused over the life of the Scorer.  For efficiency, it wouldn't
> be refcounted, and the values would only be valid until the next call
> to Scorer_Advance().

Yes.  Just so.  Tally does that.

> I think it can be very fast to create the positions, because the
> pointers to the subscorers Match structs remain the same over the life
> of a scorer.
>
> For an And or Or scorer:
>  // at init:
>  self->match->field = MATCH_HAS_SUBMATCHES;
>  for (int i = 0; i < self->num_clauses; i++) {
>      // push adds match to data taking care of alloc and count
>      self->match->push(self->match, self->subclause[i]->match);
>  }
>
>  // per doc:
>  // nothing! it just points through to the static matches in the  
> subscorers
>  // (although an Or scorer needs to be sure unmatched scorers are  
> zeroed)

Collation of positions gets complicated when these scorers are nested.

> For a Term scorer:
>  // at init:
>  self->match->field = field_num;  // as passed by query
>  // self->match->alloc left at zero
>
>  // per doc:
>  self->match->data = posting->positions;
>  self->match->count = posting->num_positions;

Yes.

> Things are going to be trickier for a the Phrase Scorer.  Rather than
> passing through Matches of its children, it will create its own.  The
> trickiness comes because if the subscorer Matches themselves have
> submatches, it needs to descend to the bottom of these trees,
> presumably treating the various bottom matches as alternative
> possibilities for that position in the phrase. Then it will add the
> positions (and lengths) of these phrases to its own Match struct.  I'm
> not sure how to deal with (or exclude) the possibility of phrases
> occurring in multiple fields (submatches?).  I'm not sure yet how to
> do any of this efficiently.

Probably we'll need a priority queue.  Or some other sorting scheme.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch




More information about the kinosearch mailing list