[KinoSearch] Re: Analyzer API mods (was API request for KS::InvIndexer...)

Marvin Humphrey marvin at rectangular.com
Fri May 4 15:20:46 PDT 2007




> I'd be happy to try, though I can't promise to be as fast at it as that
> last round with you and Pudge. I'll be more likely to chip away rather
> than plunge ahead.

This is nowhere near as ambitious as what Pudge and I took on.  We're  
talkin' order-of-magnitude-less-ambitious.

That project had several major challenges.  In addition to all the  
good stuff Pudge contributed, I had to write the classes which are  
now MultiLexicon, LexCache, SegLexCache, MatchFieldQuery,  
MatchFieldWeight, and MatchFieldScorer, plus add a bunch of stuff to  
BitVector.  There was a lot of complex C code which had to be  
written, tested, and debugged.

Here's what you and I need to do:

  Task 1:

    * Copy and paste analyze_field, analyze_text, and analyze_batch
      from that previous email into Analyzer.pm, replacing the current
      analyze() and analyze_text().
    * Perform minor mods on existing analyze_text() methods in
      LCNormalizer and PolyAnalyzer.
    * Change 151-analyzer.t to use a custom subclass of Analyzer (since
      analyze() is a no-op, but its replacement analyze_batch() dies an
      abstract death).
    * Add perfunctory tests for analyze_field to the relevant test
      files.
        o 150-polyanalyzer.t
        o 151-analyzer.t
        o 153-lc_normalizer.t
        o 154-tokenizer.t
        o 155-stopalizer.t
        o 156-stemmer.t
    * Adjust Analyzer's documentation to reflect the new regime.
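
    A minimal sketch of the custom-subclass approach for 151-analyzer.t,
    assuming only what the list above says: the base class's
    analyze_batch() dies, so the test needs a subclass that overrides it.
    The package names My::AbstractAnalyzer and My::PassThroughAnalyzer are
    hypothetical stand-ins, not KinoSearch classes, and the one-argument
    analyze_batch() signature is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the abstract base class; not KinoSearch code.
package My::AbstractAnalyzer;
use Carp qw( croak );

sub new {
    my $class = shift;
    return bless {}, $class;
}

# Abstract method: dies unless a subclass overrides it.
sub analyze_batch {
    my $self = shift;
    croak( ref($self) . " must implement analyze_batch" );
}

# Hypothetical test subclass: hands its input back unchanged.
package My::PassThroughAnalyzer;
our @ISA = ('My::AbstractAnalyzer');

sub analyze_batch {
    my ( $self, $batch ) = @_;    # assumed signature
    return $batch;
}

package main;

# Calling the abstract method on the base class should die.
my $base_ok = eval { My::AbstractAnalyzer->new->analyze_batch( ['foo'] ); 1 };
print "base class died as expected\n" unless $base_ok;

# The subclass can be exercised without dying.
my $got = My::PassThroughAnalyzer->new->analyze_batch( [ 'foo', 'bar' ] );
print "subclass returned: @$got\n";
```

    A pass-through subclass like this keeps the test focused on the base
    class's plumbing rather than on any particular analysis behavior.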

  Task 2:

    * Change SegWriter to use analyze_field.
    * Add optimized analyze_field implementations to LCNormalizer and
      PolyAnalyzer.
    * Add optimized analyze_field implementation to Tokenizer.  This
      one's harder because it requires some advanced XS.
    * Test that you can mod a document's contents, using code nearly
      identical to what will end up in the Swish/KS glue eventually.

  Task 3:

    * Expand Analyzer's docs with regard to subclassing.

  Task 4:

    * Copy and paste the utf8ify code into StringHelper.pm.
    * Add some tests to verify that it works.
    * Replace calls to utf8::upgrade with utf8ify.
    * We'll skip moving the utf8 conversion from InvIndexer to the
      Analyzers for now, since that has other implications.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


