[KinoSearch] Serialized Schema (was KinoSearch::FieldSpec::text)
Marvin Humphrey
marvin at rectangular.com
Fri Sep 7 12:25:27 PDT 2007
On Sep 6, 2007, at 11:23 PM, Nathan Kurz wrote:
>> It would be really nice if FieldSpecs themselves were completely
>> serializable. After the "text" change, we have exactly one fixed
>> class def. I was thinking about adding text::unstored,
>> text::unanalyzed, text::unstoredunanalyzed, etc... but that quickly
>> gets ridiculous.
>
> I fear I'm a slow student, but why is this ridiculous? The particular
> names you have chosen are little cumbersome (because you are presuming
> direct mapping to a subclasses) but
The above formulation assumes a no-argument constructor. All the
information about the field type's behavior is carried by the class
name. If we simply add the following class to KS, along with alias
resolution in Schema.pm...
    package KinoSearch::FieldSpec::text::unanalyzed;
    use parent qw( KinoSearch::FieldSpec::text );
    sub analyzed { 0 }
... then it becomes possible for a user to spec a %fields hash like so:
    our %fields = (
        title => 'text',
        url   => 'text::unanalyzed',
    );
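The alias resolution mentioned above could look roughly like the following. This is only a sketch: resolve_field_spec is a hypothetical helper, not an existing Schema.pm function, and the two stub packages stand in for the real classes so the example runs without KinoSearch installed.

```perl
use strict;
use warnings;

# Stand-in classes (assumptions) so the sketch is self-contained.
package KinoSearch::FieldSpec::text {
    sub new      { return bless {}, shift }
    sub analyzed { 1 }
}
package KinoSearch::FieldSpec::text::unanalyzed {
    our @ISA = ('KinoSearch::FieldSpec::text');
    sub analyzed { 0 }
}

package main;

# Hypothetical helper: expand a short alias such as 'text::unanalyzed'
# into a full class name, then call its no-argument constructor.
sub resolve_field_spec {
    my $alias = shift;
    my $class = "KinoSearch::FieldSpec::$alias";
    die "Unknown field type '$alias'" unless $class->can('new');
    return $class->new;
}

my %fields = (
    title => 'text',
    url   => 'text::unanalyzed',
);

# Turn each alias into a FieldSpec object.
my %specs = map { $_ => resolve_field_spec( $fields{$_} ) } keys %fields;

printf "%s analyzed: %d\n", $_, $specs{$_}->analyzed for sort keys %specs;
```

Because the constructor takes no arguments, everything the field does has to be encoded in the class name, which is exactly the limitation at issue.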
However, that stratagem scales poorly, because you need a unique
class name for each combination of characteristics.
If we make it possible to embed serialized FieldSpecs in an invindex,
though... we don't need to add all those subclasses to the KS core. :)
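To make the idea concrete, here is a minimal sketch of a FieldSpec whose behavior lives in data rather than in its class name, so that it can be written out and read back. Everything in it is an assumption for illustration: the My::FieldSpec class, the to_hash/from_hash methods, the default values, and the choice of JSON (via the core JSON::PP module) are all hypothetical. This message does not specify how serialized FieldSpecs would actually be stored in an invindex.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical generic spec: one class, behavior carried as data.
package My::FieldSpec {
    my @TRAITS = qw( indexed stored analyzed vectorized binary compressed );

    sub new {
        my ( $class, %args ) = @_;
        # Defaults are assumptions chosen to resemble a 'text' field.
        my %self = (
            indexed    => 1, stored => 1, analyzed   => 1,
            vectorized => 1, binary => 0, compressed => 0,
            %args,
        );
        return bless \%self, $class;
    }

    sub analyzed { $_[0]{analyzed} }

    # Dump the known traits as a plain hash, and rebuild from one.
    sub to_hash {
        my $self = shift;
        return +{ map { $_ => $self->{$_} } @TRAITS };
    }
    sub from_hash {
        my ( $class, $hash ) = @_;
        return $class->new(%$hash);
    }
}

package main;

my %schema = (
    title => My::FieldSpec->new,
    url   => My::FieldSpec->new( analyzed => 0 ),
);

# Serialize the whole schema to text...
my $json   = JSON::PP->new->canonical->pretty;
my %dumped = map { $_ => $schema{$_}->to_hash } keys %schema;
my $text   = $json->encode( \%dumped );

# ...and rebuild it from that text alone, with no per-combination subclasses.
my $decoded  = $json->decode($text);
my %restored = map { $_ => My::FieldSpec->from_hash( $decoded->{$_} ) }
    keys %$decoded;

print $text;
```

With this shape, an unstored, unanalyzed field is just another set of values in the dump, not another class in the core.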
> aren't there only a few types one
> wants to support natively: text, blob, keyword, number, maybe date,
> what else?
Something like that.
'keyword' is not a very useful type because it's so close to 'text'.
It's not desirable because various 'keyword' fields might or might
not be analyzed (e.g. for lower-casing), vectorized, or stored.
Users will end up creating their own subclasses to get the exact
behavior they want anyway.
For now, I think we need only one: text. We might also add 'blob'
because it's easy and straightforward.
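    package KinoSearch::FieldSpec::blob;
    use parent qw( KinoSearch::FieldSpec );
    # TRUE and FALSE are assumed to be constants; defined here so the
    # class compiles under "use strict".
    use constant { TRUE => 1, FALSE => 0 };
    sub indexed    { FALSE }
    sub stored     { TRUE }
    sub analyzed   { FALSE }
    sub vectorized { FALSE }
    sub binary     { TRUE }
    sub compressed { FALSE }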
> Also, my instinct (perhaps because I've only been looking at the
> Scorer side) is that these field types are going to be most useful if
> they have a corresponding scorer,
Agreed.
> such that you can do stuff queries
> like "keyword_field:tag text_field:word && number_field:<10".
That kind of query would be nice to support.
> Would recording the analyzer steps be enough to do this?
For the various number types, and for 'date' as well depending on
implementation: the existing query classes won't work well, if at all.
However, I don't think that's an immediate concern. My main goal
with serializing Schema is to make the invindex file format self-
describing, so that it becomes possible to read one without the need
for any auxiliary information.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/