[KinoSearch] Serialized Schema (was KinoSearch::FieldSpec::text)

Marvin Humphrey marvin at rectangular.com
Fri Sep 7 12:25:27 PDT 2007




On Sep 6, 2007, at 11:23 PM, Nathan Kurz wrote:

>> It would be really nice if FieldSpecs themselves were completely
>> serializable.  After the "text" change, we have exactly one fixed
>> class def.   I was thinking about adding text::unstored,
>> text::unanalyzed, text::unstoredunanalyzed, etc... but that quickly
>> gets ridiculous.
>
> I fear I'm a slow student, but why is this ridiculous? The particular
> names you have chosen are a little cumbersome (because you are presuming
> a direct mapping to subclasses) but

The above formulation assumes a no-argument constructor.  All the  
information about the field type's behavior is carried by the class  
name.  If we simply add the following class to KS, along with alias  
resolution in Schema.pm...

   package KinoSearch::FieldSpec::text::unanalyzed;
   use parent qw( KinoSearch::FieldSpec::text );
   sub analyzed { 0 }

... then it becomes possible for a user to spec a %fields hash like so:

   our %fields = (
      title => 'text',
      url   => 'text::unanalyzed',
   );
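
As a rough sketch of what such alias resolution could involve: the
resolve_field_class helper below, and its rule that a name starting
with a lowercase letter is an alias, are assumptions made for
illustration, not the actual Schema.pm code.

```perl
use strict;
use warnings;

# Hypothetical helper: map a short alias from %fields to a full class name.
# Assumption: names starting with a lowercase letter are aliases that live
# under KinoSearch::FieldSpec; anything else is taken as a full class name.
sub resolve_field_class {
    my ($spec) = @_;
    return $spec =~ /^[a-z]/ ? "KinoSearch::FieldSpec::$spec" : $spec;
}

my %fields = (
    title => 'text',
    url   => 'text::unanalyzed',
);

for my $name ( sort keys %fields ) {
    printf "%-5s => %s\n", $name, resolve_field_class( $fields{$name} );
}
```

A user-supplied class name such as MySchema::body (also hypothetical)
would pass through this helper unchanged.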

However, that stratagem scales poorly, because you need a unique  
class name for each combination of characteristics.
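
To make the scaling problem concrete, here is a small sketch that
enumerates the class names needed when each characteristic can be
switched off independently.  The three flags are illustrative only;
the naming pattern follows the text::unstoredunanalyzed example.

```perl
use strict;
use warnings;

# Illustrative flags only; each one can be on or off independently.
my @flags = qw( unstored unanalyzed unvectorized );
my $n     = @flags;

# Enumerate every subset of @flags with a bitmask: 2**$n names in total.
my @class_names;
for my $mask ( 0 .. 2**$n - 1 ) {
    my @on = grep { $mask & ( 1 << $_ ) } 0 .. $#flags;
    push @class_names, @on ? 'text::' . join( '', @flags[@on] ) : 'text';
}

print scalar(@class_names), " class names for $n flags\n";
print "  $_\n" for @class_names;
```

Every additional independent flag doubles the number of names.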

If we make it possible to embed serialized FieldSpecs in an invindex,  
though... we don't need to add all those subclasses to the KS core. :)
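
A minimal sketch of the idea, assuming a field spec whose behavior
lives in data rather than in its class name.  Sketch::FieldSpec, its
defaults, and the use of JSON as the serialization format are all
assumptions for this example; nothing here is existing KS API.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical stand-in for a FieldSpec configured by data.  Property
# names mirror the FieldSpec methods mentioned in this thread; the
# default values are assumed for the sketch.
package Sketch::FieldSpec;

my %DEFAULTS = ( indexed => 1, stored => 1, analyzed => 1, vectorized => 1 );

sub new {
    my ( $class, %args ) = @_;
    return bless { %DEFAULTS, %args }, $class;
}

# Plain-hash form, suitable for any serializer.
sub to_hash {
    my ($self) = @_;
    return { %$self };
}

sub from_hash {
    my ( $class, $hash ) = @_;
    return $class->new(%$hash);
}

sub analyzed { $_[0]{analyzed} }
sub stored   { $_[0]{stored} }

package main;

my $json     = JSON::PP->new->canonical;
my $url_spec = Sketch::FieldSpec->new( analyzed => 0 );
my $dump     = $json->encode( $url_spec->to_hash );
my $restored = Sketch::FieldSpec->from_hash( $json->decode($dump) );

print "$dump\n";
print "analyzed after round trip: ", ( $restored->analyzed ? 1 : 0 ), "\n";
```

With this shape, one class covers every combination of
characteristics, and the serialized form records which one was used.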

> aren't there only a few types one
> wants to support natively:  text, blob, keyword, number, maybe date,
> what else?

Something like that.

'keyword' is not a very useful type because it's so close to 'text',
and its behavior is underspecified: various 'keyword' fields might or
might not be analyzed (e.g. for lower-casing), vectorized, or stored.
Users will end up creating their own subclasses to get the exact
behavior they want anyway.

For now, I think we need only one: text.  We might also add 'blob'  
because it's easy and straightforward.

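   package KinoSearch::FieldSpec::blob;
   use parent qw( KinoSearch::FieldSpec );

   # TRUE and FALSE are not Perl builtins; define them here so the
   # subs below return real boolean values.
   use constant { TRUE => 1, FALSE => 0 };

   sub indexed    { FALSE }
   sub stored     { TRUE }
   sub analyzed   { FALSE }
   sub vectorized { FALSE }
   sub binary     { TRUE }
   sub compressed { FALSE }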

> Also, my instinct (perhaps because I've only been looking at the
> Scorer side) is that these field types are going to be most useful if
> they have a corresponding scorer,

Agreed.

> such that you can do queries
> like "keyword_field:tag text_field:word && number_field:<10".

That kind of query would be nice to support.

> Would recording the analyzer steps be enough to do this?

For the various number types, and for 'date' as well depending on  
implementation: the existing query classes won't work well, if at all.

However, I don't think that's an immediate concern.  My main goal
with serializing Schema is to make the invindex file format
self-describing, so that it becomes possible to read one without the
need for any auxiliary information.
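
As a sketch of what "self-describing" could mean in practice, the
example below stores per-field properties in a file inside the index
directory and reads them back knowing only that directory.  The
schema.json file name, its layout, and the two helper subs are
invented for this sketch and say nothing about the real invindex
format.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw( tempdir );
use JSON::PP;

my $json = JSON::PP->new->canonical->pretty;

# Indexing side: store per-field properties next to the index data.
# The file name and layout are invented for this sketch.
sub write_schema {
    my ( $dir, $fields ) = @_;
    my $path = File::Spec->catfile( $dir, 'schema.json' );
    open my $fh, '>', $path or die "Can't write $path: $!";
    print {$fh} $json->encode( { fields => $fields } );
    close $fh or die "Can't close $path: $!";
    return $path;
}

# Reading side: needs only the index directory, no Schema subclass.
sub read_schema {
    my ($dir) = @_;
    my $path = File::Spec->catfile( $dir, 'schema.json' );
    open my $fh, '<', $path or die "Can't read $path: $!";
    my $content = do { local $/; <$fh> };
    close $fh or die "Can't close $path: $!";
    return $json->decode($content)->{fields};
}

my $dir = tempdir( CLEANUP => 1 );
write_schema(
    $dir,
    {   title => { analyzed => 1, stored => 1 },
        url   => { analyzed => 0, stored => 1 },
    }
);

my $fields = read_schema($dir);
for my $name ( sort keys %$fields ) {
    print "$name: analyzed=$fields->{$name}{analyzed}\n";
}
```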

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


