[KinoSearch] Re: Analyzer API mods (was API request for KS::InvIndexer...)

Peter Karman peter at peknet.com
Fri May 4 11:52:07 PDT 2007



Marvin Humphrey scribbled on 5/4/07 12:58 PM:

> I think Analyzer needs three public analyze_xxxxx methods: 
> analyze_field, analyze_text, and analyze_batch.  They will take 
> different arguments, but each will return a TokenBatch.

sounds sane.

> analyze_batch() should take the place of the current analyze().  All 
> subclasses will have to implement it.
> 
> The only change from the current implementation of analyze_text() is 
> that calls to analyze() will need to be swapped out for calls to 
> analyze_batch().  Then it needs public docs.
> 
> SegWriter should be adapted to use analyze_field instead of analyze_text 
> as it does now.
> 
> Note:  analyze_text() is not just a convenience method; it also allows a 
> small optimization, avoiding a string copy or two when a subclass 
> overrides it.  Instead of copying the text into a TokenBatch then 
> processing the copy and creating a second TokenBatch, we start with the 
> original, process, and create a TokenBatch.  That saves 3 string 
> alloc/copy ops in the case of LCNormalizer and 1 in the case of 
> Tokenizer.  (LCNormalizer has more due to crossing the Perl/C boundary 
> in Token->get_text and Token->set_text.)
> 

glad you're thinking of these things.


> In order to make things work for you, I think we need to add 
> TokenBatch->eat.
> 
>    $token_batch->eat( $other, $additional_pos_inc );
> 
> $additional_pos_inc would be added to the pos_inc of the last Token in 
> the cannibalistic batch.  From perl-space we can have it default to 0 if 
> only one arg is supplied; from C it will be required, of course.  By 
> setting it to 1, you'll be able to interrupt phrase matching as requested.
> 
>     sub analyze_field {
>         my ( $self, $doc, $field_name ) = @_;
>         my $token_batch = KinoSearch::Analysis::TokenBatch->new;
>         my @frags = $self->{parser}->parse( $doc->{$field_name} );
>         for my $frag (@frags) {
>             my $sub_batch = $self->{tokenizer}->analyze_text($frag);
>             $token_batch->eat( $sub_batch, 1 );
>         }
>         return $token_batch;
>     }
> 

nice.
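
To check my understanding of the pos_inc bump, here is a minimal sketch in plain Perl. ToyBatch is a hypothetical stand-in, not the real TokenBatch, and I'm assuming two things the proposal doesn't spell out: that the bump is applied after the other batch's tokens have been absorbed, and that a token's pos_inc is the distance to the token that follows it.

```perl
use strict;
use warnings;

# ToyBatch is a hypothetical stand-in, not KinoSearch::Analysis::TokenBatch.
package ToyBatch;

sub new { return bless { tokens => [] }, shift }

# Split on whitespace; every token starts with a pos_inc of 1.
sub from_text {
    my ( $class, $text ) = @_;
    my $self = $class->new;
    push @{ $self->{tokens} }, { text => $_, pos_inc => 1 } for split ' ', $text;
    return $self;
}

# Absorb the other batch's tokens, then add $additional_pos_inc
# (default 0) to the pos_inc of the last token now in this batch.
sub eat {
    my ( $self, $other, $additional_pos_inc ) = @_;
    $additional_pos_inc = 0 unless defined $additional_pos_inc;
    push @{ $self->{tokens} }, @{ $other->{tokens} };
    $self->{tokens}[-1]{pos_inc} += $additional_pos_inc
        if @{ $self->{tokens} };
    return $self;
}

# Assumption: each token's pos_inc is the distance to the next token.
sub positions {
    my ($self) = @_;
    my $pos = 0;
    my %pos_of;
    for my $token ( @{ $self->{tokens} } ) {
        $pos_of{ $token->{text} } = $pos;
        $pos += $token->{pos_inc};
    }
    return \%pos_of;
}

package main;

my $batch = ToyBatch->new;
$batch->eat( ToyBatch->from_text($_), 1 ) for ( 'quick brown', 'fox jumps' );

my $pos_of = $batch->positions;
printf "%s=%d\n", $_, $pos_of->{$_} for qw( quick brown fox jumps );
```

Under those assumptions "brown" lands at position 1 and "fox" at position 3, so a phrase query for "brown fox" would not match across the fragment boundary.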


> 
>>>   * The utf8::upgrade calls performed by InvIndexer, which
>>>     can probably be moved to individual analyzers.
>>
>> agreed. perhaps with a syntactically sweet wrapper in the base 
>> Analyzer class?
>> So analyzer methods that care could call:
>>
>>  $self->utf8ify( $field_value );
> 
> That's not a bad idea.  utf8::upgrade is a funny, non-perlish function.  
> It modifies its argument in place.  (So would utf8ify.)  Also, it's 
> always available: you don't have to 'use utf8' in order to get it -- and 
> indeed you shouldn't, unless you really want your source code 
> interpreted as utf8.
> 
> Probably what we should do is implement our own replacement in XS.  I'm 
> not sure it ought to be a method in Analyzer, though.  It might be 
> better as a function in KinoSearch::Util::StringHelper (which would get 
> a public API).  That way other classes can use it: QueryParser, etc.
> 

good idea to make it a Util method.
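
For anyone following along, the in-place behaviour of utf8::upgrade is easy to see in a few lines. This is just an illustration using the core utf8:: functions, with no 'use utf8' needed:

```perl
use strict;
use warnings;

# Four characters; \xE9 keeps the string in Perl's native byte form, unflagged.
my $str = "caf\xE9";
my $flag_before = utf8::is_utf8($str) ? 1 : 0;

# Modifies $str itself; the return value is not the upgraded string.
utf8::upgrade($str);

my $flag_after = utf8::is_utf8($str) ? 1 : 0;
my $num_chars  = length $str;    # character count is unchanged by the upgrade

print "flag before: $flag_before, after: $flag_after, chars: $num_chars\n";
```

The character content is the same before and after; only the internal representation and the flag change.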

In Swish XS, I do something like this (not real code):

void
swish_utf8ify(self, str)
   SV* self;
   SV* str;

   PREINIT:
   char * buf;

   CODE:
   buf = SvPV_nolen(str);     /* length is not needed here */

   if (!SvUTF8(str))
   {
       if (swish_is_ascii(buf))
          SvUTF8_on(str);     /* flags original SV */
       else
          croak("%s is not flagged as a UTF-8 string and is not ASCII", buf);
   }

where swish_is_ascii() just makes sure there is no byte > 127. I'm sure there's
a native Perl equivalent.
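
A rough pure-Perl equivalent might look like the sketch below. The names is_ascii and utf8ify are hypothetical, borrowed from this thread, and not an existing API in either project. It relies on utf8::upgrade leaving ASCII bytes untouched and only turning the flag on.

```perl
use strict;
use warnings;
use Carp qw( croak );

# Hypothetical helper: true if the string contains no character above 127.
sub is_ascii {
    my ($str) = @_;
    return $str !~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Hypothetical helper: flag the caller's scalar as UTF-8 in place,
# refusing anything that is neither flagged already nor plain ASCII.
sub utf8ify {
    return if utf8::is_utf8( $_[0] );    # $_[0] aliases the caller's scalar
    croak("string is not flagged as UTF-8 and is not ASCII")
        unless is_ascii( $_[0] );
    utf8::upgrade( $_[0] );              # ASCII is already valid UTF-8
    return;
}

my $ascii = "hello";
utf8ify($ascii);

my $latin1 = "caf\xE9";
my $ok = eval { utf8ify($latin1); 1 };
print "latin1 rejected\n" unless $ok;
```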

You'll notice I don't handle Latin1 or EBCDIC as utf8::upgrade() claims to.
The utf8 pod recommends Encode, and that's what Search::Tools::Transliterate
uses for its to_utf8() method (all in perl space).

That's not as friendly for the (common western) case of full Latin1, but I
figure it's a Good Thing to force users to be aware of their source encodings,
rather than quietly converting their data to UTF-8. I know that not everyone
agrees with me on that.
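
As a sketch of what "be aware of your source encoding" could look like with Encode: the function name to_utf8_from below is hypothetical and is not the Search::Tools::Transliterate API. The caller must name the encoding, and malformed input is fatal rather than silently patched up.

```perl
use strict;
use warnings;
use Carp qw( croak );
use Encode ();

# Hypothetical helper: decode bytes to Perl characters from a named encoding.
sub to_utf8_from {
    my ( $bytes, $encoding ) = @_;
    croak("source encoding required") unless defined $encoding;

    # FB_CROAK dies on malformed input instead of substituting characters.
    return Encode::decode( $encoding, $bytes, Encode::FB_CROAK );
}

my $chars = to_utf8_from( "caf\xE9", 'iso-8859-1' );
printf "%d characters, last code point U+%04X\n",
    length $chars, ord substr( $chars, -1 );
```

Latin1 users only pay the cost of stating 'iso-8859-1' once, and nothing gets converted behind their backs.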


>>> The other possibility is to add a tutorial under KinoSearch::Docs, or 
>>> even publish such a tutorial on a WikiToBeNamedLater, reserving 
>>> Analyzer's POD for concise API documentation.  I lean towards 
>>> stuffing everything into Analyzer, though.
>>
>> docs_in_analyzer++
> 
> OK, cool.
> 
> Would you like to work collaboratively on this stuff, the way Pudge and 
> I did on the Filter classes?  I can take care of everything, but A) 
> there's other work that has to be done that only I can do, B) the code 
> will come out better if we ensure that at least two people grok it, and 
> C) this is a point at which KS and Swish3 meet and the handshaking will 
> probably be cleaner if you develop a deep understanding of how the KS 
> side works.

I'd be happy to try, though I can't promise to be as fast at it as that last
round with you and Pudge. I'll be more likely to chip away rather than plunge ahead.

I think your example code above is a great roadmap though and I'll get at it as
I can.

-- 
Peter Karman  .  http://peknet.com/  .  peter at peknet.com
