[KinoSearch] API request for KS::InvIndexer -- field_value as arrayref

Marvin Humphrey marvin at rectangular.com
Thu May 3 11:51:55 PDT 2007



Peter,

I think the general solution will be to make it possible for  
Analyzers to affect the stored text.  To do this, we'll have to give  
them access to the document itself, and the field name.  Here's one  
possibility, which just extends the existing Analyzer->analyze method  
by adding more args:

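    sub analyze {
        # $doc and $field_name are the proposed additional args
        my ( $self, $token_batch, $doc, $field_name ) = @_;
        ...
        # overwrite the text that will be stored for this field
        $doc->{$field_name} = $new_text;
        return $new_token_batch;
    }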

I think we can do better than that, though, as you'll see below...

On May 3, 2007, at 7:10 AM, Peter Karman wrote:

> The ultimate goal in my suggestion was to convey basic structural  
> positional information about a doc's contents via the contents'  
> data structure,

---->8 SNIP 8<----

> $ cat foo.html
> <html>
>  <body>
>   <div>eats shoots and leaves</div>
>   <div>by the morning train</div>
>  </body>
> </html>
>
>
>  # the long way
>  my @divs;
>  foreach my $div ($parser->parse_html('foo.html'))
>  {
>      push(@divs, $div);
>  }
>  $invindexer->add_doc({ content => \@divs });
>
>  # the short way
>  $invindexer->add_doc({ content => $parser->parse_html('foo.html') });

That would solve your specific problem of wanting to forbid phrase  
matching in certain cases.  However, it encodes one particular kind  
of metadata using one particular convention.  Forbidding phrase  
matches across structural divisions is a worthy idea (: and I intend  
to steal it for KinoSearch::Simple->parse_html :) but I don't think  
the proposed implementation is general enough.  There's no way we can  
anticipate all the different kinds of metadata people might want to  
pass through InvIndexer->add_doc.  You haven't solved my problem of  
how to pass visual text weight metadata, for example.

My inclination is to allow documents to use
whatever-the-hell-they-want as field values.  Filehandles.  Arrayrefs.
Arbitrary objects.  Undefs.  The only things really standing in the
way of this now are:

   * The utf8::upgrade calls performed by InvIndexer, which
     can probably be moved to individual analyzers.
   * The field name verification regime, intended to thwart
     misspelled field names, which I'm not sure what to do
     about but would like to keep if possible.
   * The current behavior of DocWriter/DocReader.
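
On the second point, here's one way the name check might survive once
values can be anything.  This is only a sketch: verify_field_names and
the %declared hash are names I made up for illustration, not existing
InvIndexer API.  The idea is to look at the keys of the doc hash and
never at the values.

```perl
use strict;
use warnings;

# Hypothetical helper: validate field *names* against the declared set
# while leaving the *values* (strings, arrayrefs, objects, undef) alone.
sub verify_field_names {
    my ( $declared, $doc ) = @_;
    my @unknown = sort grep { !exists $declared->{$_} } keys %$doc;
    die "Unknown field name(s): @unknown\n" if @unknown;
    return 1;
}

# Assumed set of field names declared up front.
my %declared = map { $_ => 1 } qw( title content );

# Odd values, correct names: passes.
my $ok = eval {
    verify_field_names( \%declared,
        { title => undef, content => [ 'a', 'b' ] } );
};

# Plain string value, misspelled name: dies.
my $bad = eval { verify_field_names( \%declared, { contnet => 'typo' } ) };
my $err = $@;

print $ok  ? "accepted\n" : "rejected\n";
print $bad ? "accepted\n" : "rejected: $err";
```

None of this touches the values themselves, which is what arbitrary
field values would need.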

To facilitate this, we can add a public, overridable method:
Analyzer->analyze_field.  (Also, Analyzer->analyze should probably be
renamed to process_batch or something like that.)  Here's how
LCNormalizer->analyze_field would look:

   sub analyze_field {
     my ( $self, $doc, $field_name ) = @_;
     # the upgrade InvIndexer performs today moves into the analyzer
     utf8::upgrade( $doc->{$field_name} );
     return KinoSearch::Analysis::TokenBatch->new(
       text => lc( $doc->{$field_name} ),
     );
   }
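
If you want to poke at the shape of this without KinoSearch installed,
here's a self-contained approximation.  Sketch::TokenBatch and
Sketch::LCNormalizer are stand-ins I invented for this example; the
real TokenBatch does far more than hold a string.

```perl
use strict;
use warnings;

# Stand-in for KinoSearch::Analysis::TokenBatch (assumption: it only
# needs to remember the text it was constructed with).
package Sketch::TokenBatch;
sub new {
    my ( $class, %args ) = @_;
    return bless { text => $args{text} }, $class;
}
sub text { $_[0]{text} }

# Hypothetical lowercasing analyzer built around the proposed hook.
package Sketch::LCNormalizer;
sub new { return bless {}, shift }

sub analyze_field {
    my ( $self, $doc, $field_name ) = @_;
    # The analyzer, not the indexer, decides how to coerce the value.
    utf8::upgrade( $doc->{$field_name} );
    return Sketch::TokenBatch->new( text => lc( $doc->{$field_name} ) );
}

package main;

my $doc   = { title => 'Eats Shoots And Leaves' };
my $batch = Sketch::LCNormalizer->new->analyze_field( $doc, 'title' );

print $batch->text, "\n";     # lowercased text handed to the index
print $doc->{title}, "\n";    # stored text keeps its original case
```

The thing to notice is that the analyzer gets the whole doc plus the
field name, so it is free to leave the stored value alone (as here) or
replace it.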

This setup would allow you to perform Swish analysis entirely within
an Analyzer, or to pre-process everything and create a TokenBatch
later.  We'd still need to add some methods to TokenBatch to fully
support what you want to do, but here's a rough outline of how things
could work:

   sub analyze_field {
     my ( $self, $doc, $field_name ) = @_;
     my $divs = $doc->{$field_name};    # arrayref, one string per div
     my $token_batch = KinoSearch::Analysis::TokenBatch->new;

     # tokenize each div separately, then merge into one batch
     # (analyze_text and eat don't exist yet)
     for my $div (@$divs) {
       my $sub_batch = $self->{tokenizer}->analyze_text($div);
       $token_batch->eat($sub_batch);
     }

     # ugly, wouldn't really want to do this...
     $doc->{$field_name} = join( "\n", @$divs );

     return $token_batch;
   }
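
To make the outline concrete, here's a toy version in pure Perl.
Everything in it is an assumption on my part: the Sketch:: classes, the
[ text, position ] token layout, and especially the way eat() leaves a
position gap (arbitrarily 100) between sub-batches so that a phrase
can't match across a div boundary.  The regex split is just a
placeholder for the real Tokenizer.

```perl
use strict;
use warnings;

# Stand-in for TokenBatch.  Each token is [ text, position ].
package Sketch::TokenBatch;
sub new { return bless { tokens => [], next_pos => 0 }, shift }

sub add_token {
    my ( $self, $text ) = @_;
    push @{ $self->{tokens} }, [ $text, $self->{next_pos}++ ];
}

# Hypothetical eat(): append another batch, skipping $gap positions
# first so tokens from different batches are never adjacent.
sub eat {
    my ( $self, $other, $gap ) = @_;
    $gap = 100 unless defined $gap;
    $self->{next_pos} += $gap if @{ $self->{tokens} };
    $self->add_token( $_->[0] ) for @{ $other->{tokens} };
}

sub tokens { @{ $_[0]{tokens} } }

package Sketch::DivAnalyzer;
sub new { return bless {}, shift }

# Placeholder tokenizer: lowercase and split on non-word characters.
sub analyze_text {
    my ( $self, $text ) = @_;
    my $batch = Sketch::TokenBatch->new;
    $batch->add_token( lc $_ ) for grep { length } split /\W+/, $text;
    return $batch;
}

sub analyze_field {
    my ( $self, $doc, $field_name ) = @_;
    my $batch = Sketch::TokenBatch->new;
    $batch->eat( $self->analyze_text($_) ) for @{ $doc->{$field_name} };
    return $batch;
}

package main;

my $doc = { content => [ 'eats shoots and leaves', 'by the morning train' ] };
my @tokens =
  Sketch::DivAnalyzer->new->analyze_field( $doc, 'content' )->tokens;

# Print each token with its position.
printf "%-8s %d\n", @$_ for @tokens;
```

With this layout "leaves" lands at position 3 and "by" at 104, so a
phrase query for "leaves by" has nothing adjacent to match.  The toy
skips the join() step, so the stored field stays an arrayref.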

Allowing KS documents to have arbitrary structure also moves us a few
steps towards the concept of an OO database, which I'd really dig.
It would also be great to allow integer or float type fields in
addition to the string-type fields currently supported.

> Hacking up a custom Tokenizer for what I'm guessing is a common  
> case for marked up docs seems prohibitive for the casual user.

Yes, and another problem is that KinoSearch's XS-based Tokenizer is
much faster than alternative pure-Perl implementations, so a
hand-rolled Tokenizer costs indexing speed on top of the effort.

> Perhaps an example in the Tutorial, or an Advanced Tutorial,  
> showing how/why someone would want to create their own Analyzer?

I think the place for this is the Analyzer documentation.  Analyzer  
exists to be subclassed.  Right now the docs are sparse; they could  
be much longer.  Subclassing Analyzer is an "expert API" task, so  
verbose docs are OK.

The other possibility is to add a tutorial under KinoSearch::Docs, or  
even publish such a tutorial on a WikiToBeNamedLater, reserving  
Analyzer's POD for concise API documentation.  I lean towards  
stuffing everything into Analyzer, though.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


