[KinoSearch] Serialized Schema (was KinoSearch::FieldSpec::text)

Marvin Humphrey marvin at rectangular.com
Thu Sep 6 15:26:26 PDT 2007




On Sep 6, 2007, at 6:57 AM, Peter Karman wrote:
> natively supported field types make a lot of sense to me.

Hmm... I wasn't originally thinking of this as "native support", just  
flexible shorthand -- but now that you mention it, I guess the change  
amounts to the same thing.

In addition to the keystroke savings, the idea was that if someone  
wrote a search app in another language, it would be able to read the  
invindex, see "text", and know that the field should be assigned a  
particular set of characteristics.

[... mind races ...]

It would be really nice if FieldSpecs themselves were completely  
serializable.  After the "text" change, we have exactly one fixed  
class def.   I was thinking about adding text::unstored,  
text::unanalyzed, text::unstoredunanalyzed, etc... but that quickly  
gets ridiculous.

Instead, what if you could insert a FieldSpec class def into an  
invindex, then assign field names to it?

   field_specs:
     keyword:
       analyzed: 0
       stored: 0
   fields:
     title: text
     body: text
     category: keyword
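
To make the lookup concrete, here is a minimal Python sketch of how a reader in another language might resolve each field name to its characteristics. The function name, the dict layout, and the assumption that "text" means analyzed and stored are all illustrative guesses, not KinoSearch API.

```python
# Hypothetical sketch: none of these names exist in KinoSearch.
# Assumption: the built-in "text" spec is both analyzed and stored.
BUILTIN_SPECS = {"text": {"analyzed": 1, "stored": 1}}


def resolve_fields(schema):
    """Map each field name to the full set of flags for its spec."""
    specs = dict(BUILTIN_SPECS)
    for name, overrides in schema.get("field_specs", {}).items():
        merged = dict(BUILTIN_SPECS["text"])  # start from the "text" defaults
        merged.update(overrides)              # apply the spec's own settings
        specs[name] = merged
    resolved = {}
    for field, spec_name in schema["fields"].items():
        if spec_name not in specs:
            raise KeyError(f"field {field!r} uses unknown spec {spec_name!r}")
        resolved[field] = specs[spec_name]
    return resolved


# The same data as the sketch above, written as a plain dict.
schema = {
    "field_specs": {"keyword": {"analyzed": 0, "stored": 0}},
    "fields": {"title": "text", "body": "text", "category": "keyword"},
}
resolved = resolve_fields(schema)
```

With this layout, text::unstored and friends never need their own classes; each variant is just another entry under field_specs.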

The trick is that while FieldSpec specifies a bunch of booleans that  
are trivial to serialize, it also contains things like analyzer that  
aren't... or does it?

Turns out that it's possible to write sane serialization code for all  
of KinoSearch's Analyzer classes.  Rough sketch:

   analyzers:
     main_analyzer:
       polyanalyzer:
         language: en
     whitespace_tokenizer:
       tokenizer:
         token_re: '\S+'

We could still make it possible to extend behavior with customized  
non-serializable analyzers:

    custom_analyzer: "MyApp::CustomAnalyzer"

The resulting invindex just wouldn't be portable.
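
A rough Python sketch of the loading side, under stated assumptions: the builder table and function names are hypothetical, only the regex tokenizer is implemented (a polyanalyzer would need stemming and so on), and a bare class name is treated as "cannot rebuild this from the file alone".

```python
import re


def build_tokenizer(params):
    """Rebuild a regex tokenizer from its serialized parameters."""
    pattern = re.compile(params["token_re"])
    return lambda text: pattern.findall(text)


# Hypothetical registry: serialized analyzer kind -> builder function.
BUILDERS = {"tokenizer": build_tokenizer}


def load_analyzers(section):
    """Return (analyzers, portable) for an 'analyzers' section."""
    analyzers, portable = {}, True
    for name, desc in section.items():
        if isinstance(desc, str):
            # A bare class name such as "MyApp::CustomAnalyzer": the file
            # does not describe its behavior, so the index is not portable.
            analyzers[name] = None
            portable = False
            continue
        kind, params = next(iter(desc.items()))
        if kind not in BUILDERS:
            raise ValueError(f"no builder for analyzer kind {kind!r}")
        analyzers[name] = BUILDERS[kind](params)
    return analyzers, portable


section = {
    "whitespace_tokenizer": {"tokenizer": {"token_re": r"\S+"}},
    "custom_analyzer": "MyApp::CustomAnalyzer",
}
analyzers, portable = load_analyzers(section)
tokens = analyzers["whitespace_tokenizer"]("hello  big world")
```

Dropping the custom_analyzer entry from the section would leave portable set to True.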

Let's say all of this would go into a file called schema.yaml.

If we can stuff Analyzers and FieldSpecs into a serialized Schema,  
then we've solved a problem in Lucene that I'd given up on solving:  
it's not possible to read a Lucene index without knowing additional  
information not present in the index itself -- you have to know the  
Analyzer that was used.

Unfortunately, the decision to punt on that problem, which led to the  
present implementation of Schema, left KinoSearch with a nasty,  
though rarely encountered defect: if you change certain aspects of  
the Schema class (e.g. analyzer choice or behavior), KS can crash or  
behave bizarrely.  But... if the Schema is fully described by its  
serialized form, that problem goes away for everyone except people  
doing non-serializable custom extensions.
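
The idea can be sketched in Python as a round trip: the indexer writes the schema next to the index data, and the searcher rebuilds it from that file instead of from application code. Everything here is an assumption for illustration, including the function names and the use of JSON as a stand-in for schema.yaml (the Python standard library has no YAML parser).

```python
import json
import os
import tempfile

SCHEMA_FILE = "schema.json"  # stand-in for the schema.yaml proposed above


def write_schema(index_dir, schema):
    """Index time: store the schema alongside the index files."""
    path = os.path.join(index_dir, SCHEMA_FILE)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(schema, fh, indent=2, sort_keys=True)


def read_schema(index_dir):
    """Search time: recover the schema from the index directory alone."""
    path = os.path.join(index_dir, SCHEMA_FILE)
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


index_time_schema = {
    "analyzers": {"main_analyzer": {"polyanalyzer": {"language": "en"}}},
    "field_specs": {"keyword": {"analyzed": 0, "stored": 0}},
    "fields": {"title": "text", "body": "text", "category": "keyword"},
}

with tempfile.TemporaryDirectory() as index_dir:
    write_schema(index_dir, index_time_schema)
    search_time_schema = read_schema(index_dir)
```

Because the searcher never consults the application's current Schema class, later edits to that class cannot disagree with what the index was built with.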

Another advantage: I'm pretty sure that the Schema subclass is only  
needed at index-time... so it would no longer be necessary to keep  
track of an extra .pm file.

This is worth doing.  :)

Peter, I know Swish works off of a configuration file.  What do you  
think of having Schema write out something analogous to the Swish  
config file during InvIndexer->finish?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


