[KinoSearch] Re: Analyzer API mods (was API request for KS::InvIndexer...)

Marvin Humphrey marvin at rectangular.com
Fri May 4 15:20:46 PDT 2007

> I'd be happy to try, though I can't promise to be as fast at it as that
> last round with you and Pudge. I'll be more likely to chip away rather
> than plunge ahead.

This is nowhere near as ambitious as what Pudge and I took on.  We're  
talkin' order-of-magnitude-less-ambitious.

That project had several major challenges.  In addition to all the  
good stuff Pudge contributed, I had to write the classes which are  
now MultiLexicon, LexCache, SegLexCache, MatchFieldQuery,  
MatchFieldWeight, and MatchFieldScorer, plus add a bunch of stuff to  
BitVector.  There was a lot of complex C code which had to be  
written, tested, and debugged.

Here's what you and I need to do:

  Task 1:

    * Copy and paste analyze_field, analyze_text, and analyze_batch
      from that previous email into Analyzer.pm, replacing the current
      analyze() and analyze_text().
    * Perform minor mods on existing analyze_text() methods in
      LCNormalizer and PolyAnalyzer.
    * Change 151-analyzer.t to use a custom subclass of Analyzer (since
      analyze() is a no-op, but its replacement analyze_batch() dies an
      abstract death).
    * Add perfunctory tests for analyze_field to the relevant test files:
        o 150-polyanalyzer.t
        o 151-analyzer.t
        o 153-lc_normalizer.t
        o 154-tokenizer.t
        o 155-stopalizer.t
        o 156-stemmer.t
    * Adjust Analyzer's documentation to reflect the new regime.
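
To make the shape of Task 1 concrete, here is a minimal sketch of the arrangement described above: analyze_batch() dies in the base class, while analyze_text() and analyze_field() delegate to it, so a test needs a custom subclass. This is not the code from the previous email. The package names (Sketch::Analyzer, Sketch::UpperCaser), the array-ref "batch", and the ($doc, $field_name) argument order are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the Analyzer base class; not KinoSearch code.
package Sketch::Analyzer;

sub new { my $class = shift; return bless {@_}, $class }

# Abstract: subclasses must override this.
sub analyze_batch {
    my $self = shift;
    die ref($self) . " must implement analyze_batch\n";
}

# Default: wrap the text in a one-item batch (assumed to be a plain
# array ref here), then delegate to analyze_batch.
sub analyze_text {
    my ( $self, $text ) = @_;
    return $self->analyze_batch( [$text] );
}

# Default: pull the field's text out of the doc hash, then delegate.
sub analyze_field {
    my ( $self, $doc, $field_name ) = @_;
    return $self->analyze_text( $doc->{$field_name} );
}

# Custom subclass of the kind a test file could define for itself.
package Sketch::UpperCaser;
our @ISA = ('Sketch::Analyzer');

sub analyze_batch {
    my ( $self, $batch ) = @_;
    return [ map { uc($_) } @$batch ];
}

package main;

my $analyzer = Sketch::UpperCaser->new;
my $batch = $analyzer->analyze_field( { title => 'hello world' }, 'title' );
print join( ' ', @$batch ), "\n";

# Calling the base class directly should die.
my $survived = eval { Sketch::Analyzer->new->analyze_text('x'); 1 };
print "base class died: $@" unless $survived;
```

With this layout a subclass only has to supply analyze_batch(); overriding analyze_text() or analyze_field() becomes an optional optimization.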

  Task 2:

    * Change SegWriter to use analyze_field.
    * Add optimized analyze_field implementations to LCNormalizer and
    * Add optimized analyze_field implementation to Tokenizer.  This is
      harder because it requires some advanced XS.
    * Test that you can mod a document's contents, using code nearly
      identical to what will end up in the Swish/KS glue eventually.
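
The last item in Task 2 might look roughly like the sketch below: an analyze_field() override that rewrites the document's stored field before producing tokens. The class name Sketch::TagStripper, the whitespace tokenization, and the ($doc, $field_name) signature are assumptions for illustration only; the eventual Swish/KS glue may look quite different.

```perl
use strict;
use warnings;

# Hypothetical analyzer; not part of KinoSearch.
package Sketch::TagStripper;

sub new { return bless {}, shift }

# Split on whitespace; a stand-in for real tokenization.
sub analyze_text {
    my ( $self, $text ) = @_;
    return [ split ' ', $text ];
}

# Strip markup from the field, store the cleaned text back into the
# doc hash, then tokenize the cleaned text.
sub analyze_field {
    my ( $self, $doc, $field_name ) = @_;
    ( my $stripped = $doc->{$field_name} ) =~ s/<[^>]*>//g;
    $doc->{$field_name} = $stripped;    # the doc itself is modified
    return $self->analyze_text($stripped);
}

package main;

my %doc = ( body => '<p>Hello <b>world</b></p>' );
my $tokens = Sketch::TagStripper->new->analyze_field( \%doc, 'body' );
print "stored: $doc{body}\n";
print "tokens: @$tokens\n";
```

Because analyze_field() receives the whole doc hash rather than just a string, the caller sees the modified contents after analysis.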

  Task 3:

    * Expand Analyzer's docs with regard to subclassing.

  Task 4:

    * Copy and paste the utf8ify code into StringHelper.pm.
    * Add some tests to verify that it works.
    * Replace calls to utf8::upgrade with utf8ify.
    * We'll skip moving the utf8 conversion from InvIndexer to the
      Analyzers for now, since that has other implications.
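
For Task 4, the tests could take roughly the following shape. The real utf8ify code is in the earlier email and is not reproduced here; utf8ify_sketch() below is a hypothetical stand-in that assumes the helper upgrades its argument in place and tolerates undef. Only the core utf8::upgrade and utf8::is_utf8 functions are used.

```perl
use strict;
use warnings;

# Hypothetical stand-in for utf8ify; the real code may behave differently.
# Works in place on the caller's variable via the @_ alias.
sub utf8ify_sketch {
    return unless defined $_[0];    # assumption: undef is left alone
    utf8::upgrade( $_[0] ) unless utf8::is_utf8( $_[0] );
    return;
}

my $text = "caf\xe9";               # Latin-1 string, UTF8 flag off
print "before: ", ( utf8::is_utf8($text) ? "flagged" : "unflagged" ), "\n";

utf8ify_sketch($text);
print "after:  ", ( utf8::is_utf8($text) ? "flagged" : "unflagged" ), "\n";
print "length: ", length($text), "\n";    # still counted in characters
```

Call sites that currently read utf8::upgrade($text) would then become a single call to the helper, which keeps the conversion policy in one place.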

Marvin Humphrey
Rectangular Research

KinoSearch mailing list
KinoSearch at rectangular.com
