[KinoSearch] API request for KS::InvIndexer -- field_value as arrayref

Marvin Humphrey marvin at rectangular.com
Wed May 2 19:36:27 PDT 2007




Peter,

What's the ultimate goal here?  Is it that you want to supply
pre-parsed fields?  I've been thinking about that a bit myself, because
for HTML parsing with per-position boosts, I want to store a version
with tags stripped, but the tags have to be there at parse-time to
determine boost for each token (bigger, heavier text = bigger boost).
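
To make the HTML case concrete, here is a rough sketch in plain Perl. It
does not use any KinoSearch API. The function name parse_with_boosts and
the %TAG_BOOST table are hypothetical, the boost values are invented, and
it assumes well-formed markup with no void or self-closing tags. It only
shows why the tags must still be present while tokenizing, even though
the stored text has them stripped.

```perl
use strict;
use warnings;

# Hypothetical boost table: tag name => multiplier (made-up values).
my %TAG_BOOST = ( h1 => 3.0, h2 => 2.0, b => 1.5, strong => 1.5 );

# Walk the markup once.  Returns the tag-stripped text and an arrayref
# of [ token_text, boost ] pairs, one per word.
sub parse_with_boosts {
    my ($html) = @_;
    my @open;             # stack of currently open tag names
    my @tokens;
    my $stripped = '';

    # Capturing parens in split keep the tags as separate chunks.
    for my $chunk ( split /(<[^>]+>)/, $html ) {
        next unless length $chunk;

        if ( $chunk =~ m{^<(/?)\s*(\w+)[^>]*>$} ) {
            my ( $closing, $tag ) = ( $1, lc $2 );
            if ($closing) {
                pop @open if @open && $open[-1] eq $tag;
            }
            else {
                push @open, $tag;
            }
            next;         # tags never reach the stored text
        }

        # Boost for this run of text is the product of all open tags.
        my $boost = 1.0;
        for my $tag (@open) {
            $boost *= $TAG_BOOST{$tag} if exists $TAG_BOOST{$tag};
        }

        $stripped .= $chunk;
        push @tokens, [ $_, $boost ] for $chunk =~ /(\w+)/g;
    }
    return ( $stripped, \@tokens );
}

my ( $stored, $tokens ) =
    parse_with_boosts('<h1>Big News</h1> plain <b>bold</b> text');
printf "%-6s %.1f\n", @$_ for @$tokens;
print "stored: $stored\n";
```

The stored string and the token list come out of the same pass, which is
the part a single string-valued field cannot express.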

Another possibility would be to allow TokenBatch objects as field  
values rather than arrayrefs.  But in either case we have the problem  
of how to join them together to form the string to be stored.

>   while ( my ( $title, $content ) = each %source_docs ) {
>     $invindexer->add_doc({
>        title   => $title,
>        content => $content, # could be arrayref or scalar string
>     });
>   }
>
> where the field value of each hashref key/value pair could be a  
> scalar string (as it is now) or an arrayref of scalar strings.
>
> If it were an arrayref, then the pos_inc would bump by +1 for every  
> item in the array.

What I would really like to see here is for this to be implemented as  
an Analyzer subclass.  Possibly to be published on CPAN as a plugin  
within a "KinoSearchX" namespace.  I want to accommodate this in a
way that is convenient and fast.

I am reluctant to complicate the API for InvIndexer->add_doc, though,  
because it's a bottleneck that many different problems must pass  
through -- like Searcher->search.  It would be better design to  
divide and conquer this problem and implement a solution within a  
purpose-built class.  Then we can work on it in isolation, or even  
replace it with a second version if a better algo occurs to us --  
without disrupting other KS users or cluttering the API for an  
essential method.

If we need to modify some low-level aspect of KS to support such a  
class, that's cool.  Especially if the low-level mod can be put into  
service supporting other higher-level needs.

Hmm.  This gives me an idea about how to simplify add_doc.  If we  
resurrect KinoSearch::Document::Doc, implemented as a blessed hash  
with boost stored as an inside-out member, the Doc object can carry  
the boost information -- and we can eliminate the extra args to  
InvIndexer->add_doc.

See where I'm going with this?
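
A minimal sketch of the blessed-hash-plus-inside-out idea, assuming
nothing about how KinoSearch::Document::Doc would actually be written.
The package name My::Doc and the set_boost/get_boost accessors are
hypothetical. The hash holds only field values, and the boost lives in a
lexical keyed by object address, so it can never collide with a field
that happens to be named "boost".

```perl
use strict;
use warnings;

{
    package My::Doc;    # hypothetical stand-in, not the real KS class

    use Scalar::Util qw( refaddr );

    my %boost_of;       # inside-out storage, keyed by object address

    sub new {
        my ( $class, %fields ) = @_;
        my $self = bless {%fields}, $class;
        $boost_of{ refaddr $self } = 1.0;    # default boost
        return $self;
    }

    sub set_boost { $boost_of{ refaddr $_[0] } = $_[1] }
    sub get_boost { $boost_of{ refaddr $_[0] } }

    # Inside-out members must be cleaned up by hand.
    sub DESTROY { delete $boost_of{ refaddr $_[0] } }
}

my $doc = My::Doc->new(
    title   => 'Panda',
    content => 'eats shoots and leaves',
);
$doc->set_boost(2.5);

# The hash still looks like a plain field => value mapping.
print join( ', ', sort keys %$doc ), "\n";
print $doc->get_boost, "\n";
```

With something like this, add_doc could take a single Doc argument and
read the boost off the object rather than from extra parameters.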

> Example:
>
>  my $content = ['eats shoots and leaves', 'by the morning train'];

Where are these texts coming from?  If you join them with  
"A_SEPARATOR_THAT_NEVER_APPEARS_IN_THE_TEXT" you could hack up a  
custom Tokenizer which recognizes that string and bumps the position  
increment rather than adding a Token.

Then you have the same problem as me with the HTML tags, though,  
because you don't want metadata like that separator polluting the  
stored version.  Hmm.

Are there other reasons that solution wouldn't work for you?
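
For illustration, the separator hack might look roughly like this in
plain Perl. This is not a KinoSearch Tokenizer subclass. The function
tokenize_with_gaps is a hypothetical name, and a [ text, pos_inc ] pair
stands in for a real Token. It assumes the separator is made up of word
characters only and is surrounded by whitespace when the texts are
joined.

```perl
use strict;
use warnings;

# Assumed sentinel: any string guaranteed absent from the source text.
my $SEPARATOR = "A_SEPARATOR_THAT_NEVER_APPEARS_IN_THE_TEXT";

# Returns a list of [ text, pos_inc ] pairs.  The separator itself never
# becomes a token; it only widens the gap before the next real token.
sub tokenize_with_gaps {
    my ($joined) = @_;
    my @tokens;
    my $pos_inc = 1;
    for my $word ( $joined =~ /(\w+)/g ) {
        if ( $word eq $SEPARATOR ) {
            $pos_inc++;
            next;
        }
        push @tokens, [ $word, $pos_inc ];
        $pos_inc = 1;
    }
    return @tokens;
}

my $joined = join " $SEPARATOR ",
    'eats shoots and leaves', 'by the morning train';

printf "%-8s +%d\n", @$_ for tokenize_with_gaps($joined);
```

Here "by" gets a position increment of 2, so a phrase query for
"leaves by" would not match across the two array items.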

> [1] "seems" because I'm having a hard time wrapping my head around  
> some of the magic in the interaction between TokenBatch and the  
> Analyzer.

Thanks for that bit of feedback.  If we can improve the
architecture/documentation of those two so that the API is easier to
grok, great.
Power is more important than ease of use, though, since relatively  
few users will need to write custom Analyzer subclasses.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
