[KinoSearch] Re: Analyzer API mods (was API request for KS::InvIndexer...)
marvin at rectangular.com
Fri May 4 15:20:46 PDT 2007
> I'd be happy to try, though I can't promise to be as fast at it as
> that last round with you and Pudge. I'll be more likely to chip away
> rather than plunge ahead.
This is nowhere near as ambitious as what Pudge and I took on. We're
That project had several major challenges. In addition to all the
good stuff Pudge contributed, I had to write the classes which are
now MultiLexicon, LexCache, SegLexCache, MatchFieldQuery,
MatchFieldWeight, and MatchFieldScorer, plus add a bunch of stuff to
BitVector. There was a lot of complex C code which had to be
written, tested, and debugged.
Here's what you and I need to do:
* Copy and paste analyze_field, analyze_text, and analyze_batch
  from that previous email into Analyzer.pm, replacing the current
  analyze() and analyze_text().
* Perform minor mods on existing analyze_text() methods in
  LCNormalizer and PolyAnalyzer.
* Change 151-analyzer.t to use a custom subclass of Analyzer (since
  analyze() is a no-op, but its replacement analyze_batch() dies an
* Add perfunctory tests for analyze_field to the relevant test
* Adjust Analyzer's documentation to reflect the new regime.
* Change SegWriter to use analyze_field.
* Add optimized analyze_field implementations to LCNormalizer and
* Add optimized analyze_field implementation to Tokenizer. This
  harder because it requires some advanced XS.
* Test that you can mod a document's contents, using code nearly
  identical to what will end up in the Swish/KS glue eventually.
* Expand Analyzer's docs with regard to subclassing.
* Copy and paste the utf8ify code into StringHelper.pm.
* Add some tests to verify that it works.
* Replace calls to utf8::upgrade with utf8ify.
* We'll skip moving the utf8 conversion from InvIndexer to the
  Analyzers for now, since that has other implications.
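
To make the subclassing item concrete, here is a rough, self-contained sketch of the kind of custom subclass the 151-analyzer.t change calls for. It is not the code from the previous email: the Sketch::* package names, the constructor, and every method signature (array refs of strings for batches, a plain hash ref for the document) are assumptions made up for illustration. Only the three method names and the fact that the base analyze_batch() dies come from the list above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the base class. NOT KinoSearch's actual code.
package Sketch::Analyzer;

sub new { my $class = shift; return bless {@_}, $class }

# Subclasses are expected to override this. Assumed here to take and
# return an array ref of token strings.
sub analyze_batch {
    my $self = shift;
    die ref($self) . " must implement analyze_batch\n";
}

# Assumed default: wrap the text in a one-element batch and delegate.
sub analyze_text {
    my ( $self, $text ) = @_;
    return $self->analyze_batch( [$text] );
}

# Assumed default: pull one field out of a document hash and delegate.
sub analyze_field {
    my ( $self, $doc, $field_name ) = @_;
    return $self->analyze_text( $doc->{$field_name} );
}

# Hypothetical custom subclass of the sort a test file could define.
package Sketch::UpperCaser;
our @ISA = ('Sketch::Analyzer');

# Only analyze_batch is overridden; analyze_text and analyze_field
# are inherited and end up calling this method.
sub analyze_batch {
    my ( $self, $tokens ) = @_;
    return [ map { uc } @$tokens ];
}

package main;

my $analyzer = Sketch::UpperCaser->new;
my $doc      = { title => 'hello kino search' };
my $tokens   = $analyzer->analyze_field( $doc, 'title' );
print join( ' ', @$tokens ), "\n";    # prints "HELLO KINO SEARCH"
```

The point of the layering in this sketch is that a subclass only has to supply analyze_batch, while classes that can do better (the optimized analyze_field items above) would override the higher-level methods directly.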