[KinoSearch] API request for KS::InvIndexer -- field_value as arrayref
Marvin Humphrey
marvin at rectangular.com
Thu May 3 11:51:55 PDT 2007
Peter,
I think the general solution will be to make it possible for
Analyzers to affect the stored text. To do this, we'll have to give
them access to the document itself, and the field name. Here's one
possibility, which just extends the existing Analyzer->analyze method
by adding more args:

    sub analyze {
        my ( $self, $token_batch, $doc, $field_name ) = @_;
        ...
        $doc->{$field_name} = $new_text;
        return $new_token_batch;
    }

I think we can do better than that, though, as you'll see below...
On May 3, 2007, at 7:10 AM, Peter Karman wrote:
> The ultimate goal in my suggestion was to convey basic structural
> positional information about a doc's contents via the contents'
> data structure,
---->8 SNIP 8<----
> $ cat foo.html
> <html>
> <body>
> <div>eats shoots and leaves</div>
> <div>by the morning train</div>
> </body>
> </html>
>
>
> # the long way
> my @divs;
> foreach my $div ($parser->parse_html('foo.html'))
> {
> push(@divs, $div);
> }
> $invindexer->add_doc({ content => \@divs });
>
> # the short way
> $invindexer->add_doc({ content => $parser->parse_html('foo.html') });
That would solve your specific problem of wanting to forbid phrase
matching in certain cases. However, it encodes one particular kind
of metadata using one particular convention. Forbidding phrase
matches across structural divisions is a worthy idea (: and I intend
to steal it for KinoSearch::Simple->parse_html :) but I don't think
the proposed implementation is general enough. There's no way we can
anticipate all the different kinds of metadata people might want to
pass through InvIndexer->add_doc. You haven't solved my problem of
how to pass visual text weight metadata, for example.
My inclination is to allow documents to use whatever-the-hell-they-
want as field values. Filehandles. Arrayrefs. Arbitrary objects.
Undefs. The only things really standing in the way of this now are:
   * The utf8::upgrade calls performed by InvIndexer, which
     can probably be moved to individual analyzers.
   * The field name verification regime, intended to thwart
     misspelled field names, which I'm not sure what to do
     about but would like to keep if possible.
   * The current behavior of DocWriter/DocReader.
To facilitate this, we can add a public, overridable method:
Analyzer->analyze_field. (Also, Analyzer->analyze should probably be
renamed to process_batch or something like that.) Here's how
LCNormalizer->analyze_field would look:

    sub analyze_field {
        my ( $self, $doc, $field_name ) = @_;
        utf8::upgrade( $doc->{$field_name} );
        return KinoSearch::Analysis::TokenBatch->new(
            text => lc( $doc->{$field_name} ),
        );
    }

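To make the calling side concrete, here is a rough, self-contained sketch of how an indexer might dispatch to analyze_field once per field while still rejecting misspelled field names. Everything in the Sketch:: namespace, plus the analyze_doc function, is a stand-in invented for illustration; none of it is KinoSearch API, and the real InvIndexer may end up organized quite differently.

```perl
use strict;
use warnings;

# Hypothetical stand-in for KinoSearch::Analysis::TokenBatch; it only
# remembers the text it was handed.
{
    package Sketch::TokenBatch;

    sub new {
        my ( $class, %args ) = @_;
        return bless { text => $args{text} }, $class;
    }

    sub text { return $_[0]->{text} }
}

# Hypothetical base class: by default, treat the field value as a
# plain string and pass it through untouched.
{
    package Sketch::Analyzer;

    sub new {
        my $class = shift;
        return bless {}, $class;
    }

    sub analyze_field {
        my ( $self, $doc, $field_name ) = @_;
        utf8::upgrade( $doc->{$field_name} );
        return Sketch::TokenBatch->new( text => $doc->{$field_name} );
    }
}

# Lowercasing subclass, mirroring the LCNormalizer example above.
{
    package Sketch::LCNormalizer;
    our @ISA = ('Sketch::Analyzer');

    sub analyze_field {
        my ( $self, $doc, $field_name ) = @_;
        utf8::upgrade( $doc->{$field_name} );
        return Sketch::TokenBatch->new( text => lc( $doc->{$field_name} ) );
    }
}

# Hypothetical indexer loop. Field names are checked against the set
# of registered analyzers, so the check never looks at the field value
# and the value can be any type the analyzer understands.
sub analyze_doc {
    my ( $analyzers, $doc ) = @_;
    my %batches;
    for my $field_name ( sort keys %$doc ) {
        my $analyzer = $analyzers->{$field_name}
            or die "Unknown field name: '$field_name'";
        $batches{$field_name} = $analyzer->analyze_field( $doc, $field_name );
    }
    return \%batches;
}
```

Calling analyze_doc( { title => Sketch::LCNormalizer->new }, { title => 'Eats SHOOTS' } ) yields a batch whose text is lowercased, while a document containing a field with no registered analyzer dies. Since only the analyzer touches the value, the utf8::upgrade call lives in the analyzer rather than in the indexer.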
This set-up would allow you to perform Swish analysis entirely within
an Analyzer. Or to pre-process everything and create a TokenBatch
later. We'd still need to add some methods to TokenBatch to fully
support what you want to do, but here's a rough outline of how things
could work:

    sub analyze_field {
        my ( $self, $doc, $field_name ) = @_;
        my $divs        = $doc->{$field_name};
        my $token_batch = KinoSearch::Analysis::TokenBatch->new;
        for my $div (@$divs) {
            my $sub_batch = $self->{tokenizer}->analyze_text($div);
            $token_batch->eat($sub_batch);
        }

        # ugly, wouldn't really want to do this...
        $doc->{$field_name} = join( "\n", @$divs );

        return $token_batch;
    }

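Here is one way the not-yet-written eat method could keep phrases from matching across divs, as a self-contained sketch. It rests on an assumption that is mine rather than anything settled: each token carries an integer position, and eat can leave a gap of unused positions between one div and the next. Sketch::PosBatch, tokenize, analyze_divs and has_phrase are all hypothetical names, and the pure-Perl tokenizer is only a placeholder for the real one.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a TokenBatch: an ordered list of
# [ text, position ] pairs.
{
    package Sketch::PosBatch;

    sub new {
        my $class = shift;
        return bless { tokens => [], next_pos => 0 }, $class;
    }

    # Append one token at the next free position.
    sub append {
        my ( $self, $text ) = @_;
        push @{ $self->{tokens} }, [ $text, $self->{next_pos}++ ];
        return;
    }

    # Absorb another batch, shifting its positions past our own.
    # $gap (an assumption of this sketch) skips that many positions
    # between the two runs so that a phrase cannot straddle them.
    sub eat {
        my ( $self, $other, $gap ) = @_;
        return unless @{ $other->{tokens} };
        $gap = 0 unless defined $gap;
        my $offset = $self->{next_pos};
        $offset += $gap if @{ $self->{tokens} };
        for my $token ( @{ $other->{tokens} } ) {
            my ( $text, $pos ) = @$token;
            push @{ $self->{tokens} }, [ $text, $offset + $pos ];
        }
        $self->{next_pos} = $offset + $other->{next_pos};
        return;
    }

    sub tokens { return @{ $_[0]->{tokens} } }
}

# Placeholder tokenizer: lowercase runs of word characters.
sub tokenize {
    my ($text) = @_;
    my $batch = Sketch::PosBatch->new;
    $batch->append( lc $_ ) for $text =~ /\w+/g;
    return $batch;
}

# Tokenize each div separately and merge the results into one batch.
sub analyze_divs {
    my ( $divs, $gap ) = @_;
    my $batch = Sketch::PosBatch->new;
    $batch->eat( tokenize($_), $gap ) for @$divs;
    return $batch;
}

# Naive two-term phrase check: $second must sit exactly one position
# after some occurrence of $first.
sub has_phrase {
    my ( $batch, $first, $second ) = @_;
    my %first_at;
    for my $token ( $batch->tokens ) {
        $first_at{ $token->[1] } = 1 if $token->[0] eq $first;
    }
    for my $token ( $batch->tokens ) {
        return 1 if $token->[0] eq $second && $first_at{ $token->[1] - 1 };
    }
    return 0;
}
```

With the two divs from foo.html and a gap of 0, has_phrase finds "leaves by", because "leaves" and "by" land on adjacent positions. With a gap of 1 it does not, while "morning train" inside the second div still matches.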
Allowing KS documents to have arbitrary structure also moves us a few
steps towards the concept of an OO database, which I'd really dig.
It would also be great to allow integer or float type fields in
addition to the string-type fields currently supported.
> Hacking up a custom Tokenizer for what I'm guessing is a common
> case for marked up docs seems prohibitive for the casual user.
Yes, and another problem is that KinoSearch's XS-based Tokenizer is
much faster than alternative pure-Perl implementations.
> Perhaps an example in the Tutorial, or an Advanced Tutorial,
> showing how/why someone would want to create their own Analyzer?
I think the place for this is the Analyzer documentation. Analyzer
exists to be subclassed. Right now the docs are sparse; they could
be much longer. Subclassing Analyzer is an "expert API" task, so
verbose docs are OK.
The other possibility is to add a tutorial under KinoSearch::Docs, or
even publish such a tutorial on a WikiToBeNamedLater, reserving
Analyzer's POD for concise API documentation. I lean towards
stuffing everything into Analyzer, though.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/