[KinoSearch] Re: Analyzer API mods (was API request for KS::InvIndexer...)
Peter Karman
peter at peknet.com
Fri May 4 11:52:07 PDT 2007
Marvin Humphrey scribbled on 5/4/07 12:58 PM:
> I think Analyzer needs three public analyze_xxxxx methods:
> analyze_field, analyze_text, and analyze_batch. They will take
> different arguments, but each will return a TokenBatch.
sounds sane.
> analyze_batch() should take the place of the current analyze(). All
> subclasses will have to implement it.
>
> The only change from the current implementation of analyze_text() is
> that calls to analyze() will need to be swapped out for calls to
> analyze_batch(). Then it needs public docs.
>
> SegWriter should be adapted to use analyze_field instead of analyze_text
> as it does now.
>
> Note: analyze_text() is not just a convenience method; it also allows a
> small optimization, avoiding a string copy or two when a subclass
> overrides it. Instead of copying the text into a TokenBatch then
> processing the copy and creating a second TokenBatch, we start with the
> original, process, and create a TokenBatch. That saves 3 string
> alloc/copy ops in the case of LCNormalizer and 1 in the case of
> Tokenizer. (LCNormalizer has more due to crossing the Perl/C boundary
> in Token->get_text and Token->set_text.)
>
glad you're thinking of these things.
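
A minimal sketch of how the three methods described above might fit together. Everything in the Sketch:: namespace is a hypothetical stand-in, not the real KinoSearch classes, and the default analyze_field shown here is an assumption. The point is only that analyze_batch is the one method every subclass must implement, while a subclass such as a lowercasing normalizer can override analyze_text to work on the original string and skip the intermediate batch.

```perl
use strict;
use warnings;

{
    # Hypothetical stand-in for a token batch: a list of token hashes.
    package Sketch::TokenBatch;
    sub new { my $class = shift; return bless { tokens => [] }, $class }

    sub append {
        my ( $self, $text, $pos_inc ) = @_;
        push @{ $self->{tokens} },
            { text => $text, pos_inc => defined $pos_inc ? $pos_inc : 1 };
        return $self;
    }
    sub texts { my $self = shift; return map { $_->{text} } @{ $self->{tokens} } }
}

{
    package Sketch::Analyzer;
    sub new { my $class = shift; return bless {}, $class }

    # Batch in, batch out. Every subclass has to supply this one.
    sub analyze_batch { die "analyze_batch must be implemented by a subclass\n" }

    # Default: copy the text into a one-token batch, then process that batch.
    sub analyze_text {
        my ( $self, $text ) = @_;
        my $batch = Sketch::TokenBatch->new;
        $batch->append($text);
        return $self->analyze_batch($batch);
    }

    # Assumed default: look the field up in the doc and treat it as text.
    sub analyze_field {
        my ( $self, $doc, $field_name ) = @_;
        return $self->analyze_text( $doc->{$field_name} );
    }
}

{
    package Sketch::LCNormalizer;
    our @ISA = ('Sketch::Analyzer');

    sub analyze_batch {
        my ( $self, $batch ) = @_;
        my $out = Sketch::TokenBatch->new;
        $out->append( lc( $_->{text} ), $_->{pos_inc} ) for @{ $batch->{tokens} };
        return $out;
    }

    # Override: lowercase the original string directly, so no intermediate
    # batch is built and then copied.
    sub analyze_text {
        my ( $self, $text ) = @_;
        return Sketch::TokenBatch->new->append( lc($text) );
    }
}

my $normalizer = Sketch::LCNormalizer->new;
my $batch = $normalizer->analyze_field( { title => 'Hello World' }, 'title' );
print join( '|', $batch->texts ), "\n";
```

Both paths give the same tokens. The override only changes how many intermediate copies are made along the way.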
> In order to make things work for you, I think we need to add
> TokenBatch->eat.
>
> $token_batch->eat( $other, $additional_pos_inc );
>
> $additional_pos_inc would be added to the pos_inc of the last Token in
> the cannibalistic batch. From perl-space we can have it default to 0 if
> only one arg is supplied; from C it will be required, of course. By
> setting it to 1, you'll be able to interrupt phrase matching as requested.
>
> sub analyze_field {
>     my ( $self, $doc, $field_name ) = @_;
>     my $token_batch = KinoSearch::Analysis::TokenBatch->new;
>     my @frags = $self->{parser}->parse( $doc->{$field_name} );
>     for my $frag (@frags) {
>         my $sub_batch = $self->{tokenizer}->analyze_text($frag);
>         $token_batch->eat( $sub_batch, 1 );
>     }
>     return $token_batch;
> }
>
nice.
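
To make the proposed eat() semantics concrete, here is a toy pure-Perl sketch. Sketch::EatBatch and tokenize() are hypothetical helpers, not KinoSearch code. The sketch follows the description literally: the extra increment is added to the pos_inc of the last token already in the eating batch, and it defaults to 0 when omitted. How pos_inc values map to absolute positions inside KinoSearch is not modeled here, and whether the eaten batch is emptied is left open (this sketch copies its tokens).

```perl
use strict;
use warnings;

{
    package Sketch::EatBatch;
    sub new { my $class = shift; return bless { tokens => [] }, $class }

    sub append {
        my ( $self, $text, $pos_inc ) = @_;
        push @{ $self->{tokens} },
            { text => $text, pos_inc => defined $pos_inc ? $pos_inc : 1 };
        return $self;
    }

    # Absorb the tokens of $other. The extra increment lands on the last
    # token this batch already holds; with an empty batch there is nothing
    # to adjust.
    sub eat {
        my ( $self, $other, $additional_pos_inc ) = @_;
        $additional_pos_inc = 0 unless defined $additional_pos_inc;
        if ( @{ $self->{tokens} } ) {
            $self->{tokens}[-1]{pos_inc} += $additional_pos_inc;
        }
        push @{ $self->{tokens} }, map { +{ %$_ } } @{ $other->{tokens} };
        return $self;
    }

    sub texts    { my $self = shift; return map { $_->{text} } @{ $self->{tokens} } }
    sub pos_incs { my $self = shift; return map { $_->{pos_inc} } @{ $self->{tokens} } }
}

# Hypothetical whitespace tokenizer standing in for analyze_text().
sub tokenize {
    my ($text) = @_;
    my $batch = Sketch::EatBatch->new;
    $batch->append($_) for split ' ', $text;
    return $batch;
}

my $token_batch = Sketch::EatBatch->new;
for my $frag ( 'a b', 'c d' ) {
    $token_batch->eat( tokenize($frag), 1 );
}
print join( ',', $token_batch->texts ),    "\n";
print join( ',', $token_batch->pos_incs ), "\n";
```

With two fragments, the only token whose pos_inc changes is the last one of the first fragment, which is where the break between fragments is recorded.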
>
>>> * The utf8::upgrade calls performed by InvIndexer, which
>>> can probably be moved to individual analyzers.
>>
>> agreed. perhaps with a syntactically sweet wrapper in the base
>> Analyzer class?
>> So analyzer methods that care could call:
>>
>> $self->utf8ify( $field_value );
>
> That's not a bad idea. utf8::upgrade is a funny, non-perlish function.
> It modifies its argument in place. (So would utf8ify.) Also, it's
> always available: you don't have to 'use utf8' in order to get it -- and
> indeed you shouldn't, unless you really want your source code
> interpreted as utf8.
>
> Probably what we should do is implement our own replacement in XS. I'm
> not sure it ought to be a method in Analyzer, though. It might be
> better as a function in KinoSearch::Util::StringHelper (which would get
> a public API). That way other classes can use it: QueryParser, etc.
>
good idea to make it a Util method.
In Swish XS, I do something like this (not real code):
    void
    swish_utf8ify(self, str)
        SV* self;
        SV* str;
      CODE:
        char * buf = SvPV(str, PL_na);
        if (!SvUTF8(str))
        {
            if (swish_is_ascii(buf))
                SvUTF8_on(str); /* flags original SV */
            else
                croak("%s is not flagged as a UTF-8 string and is not ASCII", buf);
        }
where swish_is_ascii() just makes sure there is no byte > 127. I'm sure there's
a native Perl equivalent.
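
For comparison, a rough pure-Perl sketch of the same check. The name utf8ify is the hypothetical one from earlier in this thread, not an existing KinoSearch or Swish function. It relies on the @_ alias to modify the caller's variable in place, and on utf8::is_utf8 and utf8::upgrade, which are available without 'use utf8'.

```perl
use strict;
use warnings;

# Flag a string as UTF-8 in place, but only when that is known to be safe:
# already-flagged strings pass through, pure ASCII gets the flag turned on,
# and anything else is rejected rather than guessed at.
sub utf8ify {
    return 1 if utf8::is_utf8( $_[0] );
    die "string is not flagged as UTF-8 and is not ASCII\n"
        if $_[0] =~ /[^\x00-\x7F]/;    # any byte above 127
    utf8::upgrade( $_[0] );            # ASCII only: bytes unchanged, flag on
    return 1;
}

my $ascii = "plain ascii";
utf8ify($ascii);

my $latin1 = "caf\xE9";                # unflagged, with one high byte
my $accepted = eval { utf8ify($latin1); 1 };

print "ascii flagged: ",   ( utf8::is_utf8($ascii) ? "yes" : "no" ), "\n";
print "latin1 accepted: ", ( $accepted             ? "yes" : "no" ), "\n";
```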
You'll notice I don't handle Latin1 or EBCDIC as utf8::upgrade() claims to.
The utf8 pod recommends Encode, and that's what Search::Tools::Transliterate
uses for its to_utf8() method (all in perl space).
That's not as friendly for the (common western) case of full Latin1, but I
figure it's a Good Thing to force users to be aware of their source encodings,
rather than quietly converting it to UTF-8. I know that not everyone agrees with
me on that.
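
As an illustration of the "be explicit about the source encoding" approach, here is a small sketch using the core Encode module. The helper name decode_declared is hypothetical and this is not the Search::Tools::Transliterate implementation. The caller must name the encoding, and malformed input dies instead of being quietly converted.

```perl
use strict;
use warnings;
use Encode ();

# Decode bytes using an encoding the caller has to declare. FB_CROAK makes
# malformed input fatal; LEAVE_SRC keeps Encode from modifying the input.
sub decode_declared {
    my ( $bytes, $encoding ) = @_;
    die "a source encoding is required\n" unless defined $encoding;
    return Encode::decode( $encoding, $bytes,
        Encode::FB_CROAK | Encode::LEAVE_SRC );
}

my $latin1_bytes = "caf\xE9";

# Declared correctly, the Latin1 bytes decode to four characters.
my $chars = decode_declared( $latin1_bytes, 'ISO-8859-1' );

# The same bytes are not valid UTF-8, so declaring UTF-8 fails loudly.
my $utf8_failed = !eval { decode_declared( $latin1_bytes, 'UTF-8' ); 1 };

print "decoded length: ", length($chars), "\n";
print "bogus UTF-8 rejected: ", ( $utf8_failed ? "yes" : "no" ), "\n";
```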
>>> The other possibility is to add a tutorial under KinoSearch::Docs, or
>>> even publish such a tutorial on a WikiToBeNamedLater, reserving
>>> Analyzer's POD for concise API documentation. I lean towards
>>> stuffing everything into Analyzer, though.
>>
>> docs_in_analyzer++
>
> OK, cool.
>
> Would you like to work collaboratively on this stuff, the way Pudge and
> I did on the Filter classes? I can take care of everything, but A)
> there's other work that has to be done that only I can do, B) the code
> will come out better if we ensure that at least two people grok it, and
> C) this is a point at which KS and Swish3 meet and the handshaking will
> probably be cleaner if you develop a deep understanding of how the KS
> side works.
I'd be happy to try, though I can't promise to be as fast at it as that last
round with you and Pudge. I'll be more likely to chip away rather than plunge ahead.
I think your example code above is a great roadmap though and I'll get at it as
I can.
--
Peter Karman . http://peknet.com/ . peter at peknet.com
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch