[KinoSearch] Re: Analyzer API mods (was API request for KS::InvIndexer...)
Marvin Humphrey
marvin at rectangular.com
Fri May 4 15:20:46 PDT 2007
> I'd be happy to try, though I can't promise to be as fast at it as that
> last round with you and Pudge. I'll be more likely to chip away rather
> than plunge ahead.
This is nowhere near as ambitious as what Pudge and I took on. We're
talkin' order-of-magnitude-less-ambitious.
That project had several major challenges. In addition to all the
good stuff Pudge contributed, I had to write the classes which are
now MultiLexicon, LexCache, SegLexCache, MatchFieldQuery,
MatchFieldWeight, and MatchFieldScorer, plus add a bunch of stuff to
BitVector. There was a lot of complex C code which had to be
written, tested, and debugged.
Here's what you and I need to do:
Task 1:
* Copy and paste analyze_field, analyze_text, and analyze_batch
from that previous email into Analyzer.pm, replacing the current
analyze() and analyze_text().
* Perform minor mods on existing analyze_text() methods in
LCNormalizer and PolyAnalyzer.
* Change 151-analyzer.t to use a custom subclass of Analyzer (since
analyze() is a no-op, but its replacement analyze_batch() dies an
abstract death).
* Add perfunctory tests for analyze_field to the relevant test
files.
o 150-polyanalyzer.t
o 151-analyzer.t
o 153-lc_normalizer.t
o 154-tokenizer.t
o 155-stopalizer.t
o 156-stemmer.t
* Adjust Analyzer's documentation to reflect the new regime.
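For orientation, here is a rough sketch of the shape Task 1 is aiming at: a base class whose analyze_batch() dies an abstract death, default analyze_text() and analyze_field() methods that funnel into it, and a custom subclass of the sort 151-analyzer.t could define. It is not the code from the previous email. The Sketch:: package names, the argument lists, and the use of plain array refs in place of real token batches are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the Analyzer base class (not KinoSearch code).
package Sketch::Analyzer;

sub new { my $class = shift; return bless {@_}, $class }

# Abstract method: subclasses must override it.
sub analyze_batch {
    my $self = shift;
    die ref($self) . " must implement analyze_batch\n";
}

# Default: wrap the text in a one-element batch and delegate.
# (Assumed signature; a plain array ref stands in for a token batch.)
sub analyze_text {
    my ( $self, $text ) = @_;
    return $self->analyze_batch( [$text] );
}

# Default: pull the field's value out of the doc hash and analyze it.
# (Assumed signature: doc hash ref plus field name.)
sub analyze_field {
    my ( $self, $doc, $field_name ) = @_;
    return $self->analyze_text( $doc->{$field_name} );
}

# Custom subclass of the kind a test file could define.
package Sketch::UpperCaser;
our @ISA = ('Sketch::Analyzer');

sub analyze_batch {
    my ( $self, $batch ) = @_;
    return [ map { uc $_ } @$batch ];
}

package main;

my $analyzer = Sketch::UpperCaser->new;
my $got = $analyzer->analyze_field( { title => 'foo bar' }, 'title' );
print "@$got\n";    # prints "FOO BAR"

# The base class itself should refuse to analyze anything.
my $died = !eval { Sketch::Analyzer->new->analyze_batch( [] ); 1 };
print $died ? "base class died as expected\n" : "base class did not die\n";
```

With this layout a subclass only has to supply analyze_batch(). The optimized analyze_field overrides in Task 2 would then be a matter of skipping the intermediate calls.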
Task 2:
* Change SegWriter to use analyze_field.
* Add optimized analyze_field implementations to LCNormalizer and
PolyAnalyzer.
* Add optimized analyze_field implementation to Tokenizer. This one's
harder because it requires some advanced XS.
* Test that you can mod a document's contents, using code nearly
identical to what will end up in the Swish/KS glue eventually.
Task 3:
* Expand Analyzer's docs with regard to subclassing.
Task 4:
* Copy and paste the utf8ify code into StringHelper.pm.
* Add some tests to verify that it works.
* Replace calls to utf8::upgrade with utf8ify.
* We'll skip moving the utf8 conversion from InvIndexer to the
Analyzers for now, since that has other implications.
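The tests for utf8ify could look something like the sketch below. The utf8ify code itself lives in the earlier email and is not reproduced here, so utf8ify_standin is a hypothetical placeholder that simply calls the core utf8::upgrade function; the real routine may well do more. The point is only the kind of checks worth making: the UTF8 flag ends up on, and the string still holds the same characters.

```perl
use strict;
use warnings;

# Stand-in only: the real utf8ify is assumed to differ from this.
sub utf8ify_standin {
    utf8::upgrade( $_[0] );    # operates in place on the caller's scalar
    return;
}

my $latin1 = "caf\xE9";        # 4 characters, UTF8 flag off
my $before = $latin1;          # copy for comparison afterwards

utf8ify_standin($latin1);

# The internal representation changes, the characters do not.
print utf8::is_utf8($latin1) ? "flag on\n" : "flag off\n";
print $latin1 eq $before ? "same characters\n" : "characters changed\n";
print length($latin1), "\n";
```

In the real test file these checks would presumably be Test::More assertions run against StringHelper's utf8ify rather than a stand-in.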
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/