[KinoSearch] API request for KS::InvIndexer -- field_value as arrayref
Marvin Humphrey
marvin at rectangular.com
Wed May 2 19:36:27 PDT 2007
Peter,
What's the ultimate goal here? Is it that you want to supply
pre-parsed fields? I've been thinking about that a bit myself, because
for HTML parsing with per-position boosts, I want to store a version
with tags stripped, but the tags have to be there at parse-time to
determine boost for each token (bigger, heavier text = bigger boost).
Another possibility would be to allow TokenBatch objects as field
values rather than arrayrefs. But in either case we have the problem
of how to join them together to form the string to be stored.
> while ( my ( $title, $content ) = each %source_docs ) {
> $invindexer->add_doc({
> title => $title,
> content => $content, # could be arrayref or scalar string
> });
> }
>
> where the field value of each hashref key/value pair could be a
> scalar string (as it is now) or an arrayref of scalar strings.
>
> If it were an arrayref, then the pos_inc would bump by +1 for every
> item in the array.
What I would really like to see here is for this to be implemented as
an Analyzer subclass, possibly published on CPAN as a plugin within a
"KinoSearchX" namespace. I want to accommodate this in such a way
that it is convenient and fast.
I am reluctant to complicate the API for InvIndexer->add_doc, though,
because it's a bottleneck that many different problems must pass
through -- like Searcher->search. It would be better design to
divide and conquer this problem and implement a solution within a
purpose-built class. Then we can work on it in isolation, or even
replace it with a second version if a better algo occurs to us --
without disrupting other KS users or cluttering the API for an
essential method.
If we need to modify some low-level aspect of KS to support such a
class, that's cool. Especially if the low-level mod can be put into
service supporting other higher-level needs.
Hmm. This gives me an idea about how to simplify add_doc. If we
resurrect KinoSearch::Document::Doc, implemented as a blessed hash
with boost stored as an inside-out member, the Doc object can carry
the boost information -- and we can eliminate the extra args to
InvIndexer->add_doc.
See where I'm going with this?
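To make the idea concrete, here is a minimal sketch of a blessed-hash
document with boost kept as an inside-out member. The package name
My::Doc, the set_boost/get_boost accessors, and the default boost of
1.0 are all made up for illustration; none of this is existing
KinoSearch API.

```perl
package My::Doc;    # hypothetical stand-in, not KinoSearch::Document::Doc
use strict;
use warnings;
use Scalar::Util qw( refaddr );

# Inside-out storage: boost lives in this side table, keyed by object
# address, so it can never collide with a field stored in the hash.
my %boost;

sub new {
    my ( $class, %fields ) = @_;
    my $self = bless {%fields}, $class;
    $boost{ refaddr($self) } = 1.0;    # assumed default boost
    return $self;
}

sub set_boost { my ( $self, $value ) = @_; $boost{ refaddr($self) } = $value }
sub get_boost { my ($self) = @_; return $boost{ refaddr($self) } }

# Drop the side-table entry when the object is destroyed.
sub DESTROY { my ($self) = @_; delete $boost{ refaddr($self) } }

package main;

my $doc = My::Doc->new(
    title   => 'Eats, Shoots and Leaves',
    content => 'by the morning train',
);
$doc->set_boost(2.5);

# Fields are ordinary hash entries; boost rides along out of band.
printf "%s => %s\n", $_, $doc->{$_} for sort keys %$doc;
printf "boost: %s\n", $doc->get_boost;
```

With something along those lines, add_doc would only ever need the
one Doc argument. Note that a plain refaddr-keyed side table like this
is not thread-safe.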
> Example:
>
> my $content = ['eats shoots and leaves', 'by the morning train'];
Where are these texts coming from? If you join them with
"A_SEPARATOR_THAT_NEVER_APPEARS_IN_THE_TEXT" you could hack up a
custom Tokenizer which recognizes that string and bumps the position
increment rather than adding a Token.
Then you have the same problem as me with the HTML tags, though,
because you don't want metadata like that separator polluting the
stored version. Hmm.
Are there other reasons that solution wouldn't work for you?
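Here is a rough sketch of that separator hack in plain Perl, just to
show the mechanics. It does not use the real Tokenizer or TokenBatch
classes: the tokenize_with_gaps function, the { text, pos_inc } hashes,
and the choice of the ASCII unit separator as the sentinel are all
assumptions for the sake of the example.

```perl
use strict;
use warnings;

# Hypothetical sentinel; it must never occur in the real text.
my $SEPARATOR = "\x{1F}";    # ASCII unit separator, an arbitrary choice

# Turn a joined string into a list of { text, pos_inc } hashes.
# A separator emits no token; it adds one to the next token's pos_inc.
sub tokenize_with_gaps {
    my ($joined) = @_;
    my @tokens;
    my $pos_inc = 1;
    for my $piece ( $joined =~ /(\Q$SEPARATOR\E|\w+)/g ) {
        if ( $piece eq $SEPARATOR ) {
            $pos_inc++;
            next;
        }
        push @tokens, { text => $piece, pos_inc => $pos_inc };
        $pos_inc = 1;
    }
    return \@tokens;
}

my @values = ( 'eats shoots and leaves', 'by the morning train' );
my $tokens = tokenize_with_gaps( join( $SEPARATOR, @values ) );

# Print each token with its absolute position.
my $position = -1;
for my $token (@$tokens) {
    $position += $token->{pos_inc};
    printf "%2d %s\n", $position, $token->{text};
}
```

"leaves" ends up at position 3 and "by" at position 5, so a phrase
query for "leaves by" would not match across the array boundary. The
sketch does nothing about keeping the separator out of the stored
version.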
> [1] "seems" because I'm having a hard time wrapping my head around
> some of the magic in the interaction between TokenBatch and the
> Analyzer.
Thanks for that bit of feedback. If we can improve the
architecture/documentation of those two so that the API is easier to
grok, great.
Power is more important than ease of use, though, since relatively
few users will need to write custom Analyzer subclasses.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/