[KinoSearch] Dynamic schemas - How?
Marvin Humphrey
marvin at rectangular.com
Tue Feb 27 12:45:22 PST 2007
On Feb 27, 2007, at 4:12 AM, Marc Elser wrote:
>> package MySchema::$field_name;
>> use base qw( KinoSearch::Schema::Field );
(whoops, "Field" should have been "FieldSpec" -- the class has been
renamed since I last did work on KS::Simple.)
> Yes there are multiple specs because I have multiple indexes.
OK. Would it be feasible to create static Schemas that know
everything except for the field names?
Here's how the KS 0.20 API could change to accommodate your needs.
    # MySchema.pm
    package UnAnalyzedFieldSpec;
    use base qw( KinoSearch::Schema::FieldSpec );
    sub analyzed {0}

    package MySchema;
    use base qw( KinoSearch::Schema );
    sub analyzer {
        KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
    }

    # invindexer.plx
    MySchema->init_field( title   => 'KinoSearch::Schema::FieldSpec' );
    MySchema->init_field( content => 'KinoSearch::Schema::FieldSpec' );
    MySchema->init_field( url     => 'UnAnalyzedFieldSpec' );
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => MySchema->clobber('/path/to/invindex'),
    );
That's closer to the earlier KS design, and I think ought to work for
you. Yes?
Most people would put the calls to init_field() in MySchema.pm, but
in your case you would defer them to your scripts.
FWIW, this design breaks with the ORM model which Schema was loosely
derived from, since there's no longer a one-to-one mapping between
fields in the index and classes directly below the Schema subclass's
package. Not that that's a problem.
>> Do they ever change?
> Well, they do change occasionally and then the index with the
> changed field is being rebuilt.
The only problem that would arise is if you change up the FieldSpec
subclass after data has been indexed using it, then try to search or
modify the index. The Schema architecture will always have that
vulnerability.
But we already had the same problem with Analyzers: with KS 0.15, if
you switch up an analyzer between index-time and search-time, you get
garbage.
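A toy illustration of that mismatch, as a sketch only: the two
"analyzers" and the in-memory postings hash below are stand-ins
invented for this example, not KinoSearch classes. Here the mismatch
shows up as missing hits; with real analyzers the symptoms can vary.

```perl
use strict;
use warnings;

# Two toy analyzers standing in for real KinoSearch analyzers.
sub analyze_lc  { return map { lc } split /\W+/, $_[0] }
sub analyze_raw { return split /\W+/, $_[0] }

# Build a tiny inverted index with the lowercasing analyzer:
# term => { doc_id => 1 }.
my @docs = ( 'Dynamic Schemas', 'Static schemas' );
my %postings;
for my $doc_id ( 0 .. $#docs ) {
    $postings{$_}{$doc_id} = 1 for analyze_lc( $docs[$doc_id] );
}

# Look up the first term produced by the given analyzer.
sub search {
    my ( $analyzer, $query ) = @_;
    my ($term) = $analyzer->($query);
    return sort keys %{ $postings{$term} || {} };
}

# Same analyzer at search time: the term matches what was indexed.
my @same = search( \&analyze_lc, 'Schemas' );

# Different analyzer at search time: 'Schemas' was never indexed as-is.
my @different = search( \&analyze_raw, 'Schemas' );

printf "same analyzer: %d hits, different analyzer: %d hits\n",
    scalar @same, scalar @different;
```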
>> Do you ever need to add fields in the middle of an indexing
>> session or do you know them all up front?
> I know them upfront because they're defined in an xml which is
> parsed, but they never change in the middle of indexing.
OK. I'd like to accommodate people who want to add new fields in the
middle of an indexing session, too. That's hard, but maybe it's
possible.
The big tradeoff is that if fields aren't limited to a known, finite
set, a lot of validation has to be turned off.
For instance, InvIndexer->delete_by_term verifies that the field in
question 1) is known, and 2) is spec'd as indexed. If it isn't
known, you probably misspelled it; if it wasn't indexed, no docs will
be found and no deletions will occur -- and that's something you
probably want to know about.
But if the fact that a Schema doesn't know about a field doesn't mean
anything, then we have to accept silent failure in both those cases.
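The two checks described above might look roughly like this. It is a
minimal sketch: the %fields table and check_delete_field() are
hypothetical names for illustration, not the actual InvIndexer code.

```perl
use strict;
use warnings;

# Hypothetical field table: field name => properties from its spec.
my %fields = (
    title   => { indexed => 1 },
    content => { indexed => 1 },
    blob    => { indexed => 0 },
);

# Returns 1 if deleting by a term in this field can work; dies with a
# diagnostic if the field is unknown or was not indexed.
sub check_delete_field {
    my ($field) = @_;
    die "Unknown field: '$field'\n"       unless exists $fields{$field};
    die "Field '$field' is not indexed\n" unless $fields{$field}{indexed};
    return 1;
}

# 'titel' is a deliberate misspelling; 'blob' is known but not indexed.
for my $field (qw( title titel blob )) {
    my $ok = eval { check_delete_field($field) };
    print $ok ? "$field: ok\n" : "$field: $@";
}
```

If the set of fields is open-ended, the first check has to go, and a
misspelled field name can no longer be told apart from a new field.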
> This still leaves me with the problem that you can specify not only
> the fields you want to index in our config-xml but also the indexes
> you want to create. So I would also have to define the
> KinoSearch::Schema classes through an eval, but it would at least
> save me another eval for setting up the fields.
Hmm. Technically, you don't need an eval -- you can manipulate @ISA
directly if you turn off strict refs.
    {
        no strict 'refs';
        @{ $class . '::ISA' } = ('KinoSearch::Schema');
    }
I'd consider either that or an eval acceptable if we can't figure out
a better way to handle things, but it's still somewhat inelegant.
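For reference, here is a self-contained sketch of the @ISA approach
that runs without KinoSearch installed. Toy::Schema stands in for
KinoSearch::Schema, and make_schema_class() and the 'products' index
name are hypothetical, chosen only for the example.

```perl
use strict;
use warnings;

# Stand-in for KinoSearch::Schema so the sketch runs on its own.
package Toy::Schema;
sub new      { my $class = shift; return bless {}, $class }
sub describe { my $self  = shift; return ref($self) . ' isa Toy::Schema' }

package main;

# Hypothetical helper: create one Schema subclass per index name.
sub make_schema_class {
    my ($index_name) = @_;

    # Only allow names that are safe to use in a package name.
    die "bad index name '$index_name'" unless $index_name =~ /\A\w+\z/;

    my $class = "MySchema::$index_name";
    {
        # Symbolic reference to the new package's @ISA array.
        no strict 'refs';
        @{ $class . '::ISA' } = ('Toy::Schema');
    }
    return $class;
}

my $class  = make_schema_class('products');
my $schema = $class->new;
print $schema->describe, "\n";
```

Keeping "no strict 'refs'" inside the smallest possible block leaves
strict checking active for the rest of the script.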
If Schemas were objects rather than classes -- which is something I
considered -- we wouldn't have this problem.
    my $schema = KinoSearch::Schema->new(
        analyzer => KinoSearch::Analysis::Tokenizer->new,
    );
    $schema->spec_field( name => 'title' );
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => $schema->clobber('/path/to/invindex'),
    );
However, I rejected that design because I know that if we did that,
less experienced users would copy and paste the schema code between
index and search scripts, violating DRY and causing a bunch of nasty
errors once the copies drift out of sync. Shunting everyone into
module use is less error-prone and encourages good programming
practice, though it serves your particular use case less well.
In Perl, which allows you to create classes on the fly, we can still
pull it off. A less dynamic language might not be able to...
> But maybe you also know of a better solution for the subclassing
> problem for every index.
OK... new thought.... how about allowing instances of your Schema
subclass to add fields?
    # invindexer.plx
    my $schema = MySchema->new;
    $schema->add_field( title   => 'KinoSearch::Schema::FieldSpec' );
    $schema->add_field( content => 'KinoSearch::Schema::FieldSpec' );
    $schema->add_field( url     => 'UnAnalyzedFieldSpec' );
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => $schema->clobber('/path/to/invindex'),
    );
init_field() would be a class method only. Fields so registered
would serve as the starter set for each instance.
add_field() would be an instance method only. Fields so registered
would only be known to the object it was called upon.
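One way the class-level starter set and per-instance fields could fit
together, as a sketch under assumptions: Toy::Schema, %class_fields,
and field_names() are invented for illustration, and the proposed
init_field()/add_field() methods may end up looking quite different.

```perl
use strict;
use warnings;

package Toy::Schema;

# Class-level registry: class name => { field name => FieldSpec class }.
my %class_fields;

# Class method: register a field in the starter set for this class.
sub init_field {
    my ( $class, $name, $spec ) = @_;
    die "init_field() is a class method" if ref $class;
    $class_fields{$class}{$name} = $spec;
}

# Constructor: each instance starts with a copy of the class's fields.
sub new {
    my $class  = shift;
    my %fields = %{ $class_fields{$class} || {} };
    return bless { fields => \%fields }, $class;
}

# Instance method: register a field known only to this object.
sub add_field {
    my ( $self, $name, $spec ) = @_;
    die "add_field() is an instance method" unless ref $self;
    $self->{fields}{$name} = $spec;
}

sub field_names { my $self = shift; return sort keys %{ $self->{fields} } }

package main;

Toy::Schema->init_field( title => 'FieldSpec' );
my $first  = Toy::Schema->new;
my $second = Toy::Schema->new;
$first->add_field( url => 'UnAnalyzedFieldSpec' );

print join( ',', $first->field_names ),  "\n";    # title,url
print join( ',', $second->field_names ), "\n";    # title
```

Because new() copies the starter set, add_field() on one object never
touches the class-level registry or other instances. The sketch keys
the registry by exact class name, so subclasses do not inherit fields.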
Hmm... add_field() actually solves another problem. It allows us to
record a mapping of field name to FieldSpec class name in
segments_XXX.yaml, then either validate the mapping against the
schema used to open the file, or register new mappings on an instance
without polluting class variables.
You're lucky in that you know all the field names at search time, so
you can create the Schema on the fly then, too. But say you
didn't... having $schema->open('/path/to/invindex') call add_field()
and build your field list for you solves that problem.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/