[KinoSearch] Backwards Compatibility Policy

Marvin Humphrey marvin at rectangular.com
Tue Jan 23 13:59:07 PST 2007




On Jan 22, 2007, at 11:28 PM, Henka wrote:

>> This policy would be put in place as of revision 1.0, which is likely
>> to be 0.20 with a few minor bugfixes after a few weeks of testing.
>
> As an aside, will 0.20 break backwards compatibility for existing  
> indexes?
> I seem to recall that it would.

Yes, it will shatter it.  Every current app will fail  
catastrophically.  Continuity from the current file format to the new  
would be too hard and take too long, so it's time to invoke the  
"alpha" clause.  And once we've committed to that course of action,  
doing things halfway doesn't make sense.  We don't want subtly  
degraded behavior and intermittent failure.

The new format is designed to be more mutable, so we shouldn't have to
resort to such measures again, especially not soon.  Below you'll
find the text of a Lucene JIRA issue I've opened, explaining some
of the reasoning.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Begin forwarded message:
From: "Marvin Humphrey (JIRA)" <jira at apache.org>
Date: January 23, 2007 11:41:49 AM PST
To: java-dev at lucene.apache.org
Subject: [jira] Created: (LUCENE-783) Store all metadata in human-readable segments file
Reply-To: java-dev at lucene.apache.org

Store all metadata in human-readable segments file
--------------------------------------------------

                  Key: LUCENE-783
                  URL: https://issues.apache.org/jira/browse/LUCENE-783
              Project: Lucene - Java
           Issue Type: Improvement
           Components: Index
             Reporter: Marvin Humphrey
             Priority: Minor


Various index-reading components in Lucene need metadata in addition to data.
This metadata is presently stored in arbitrary binary headers and spread out
over several files.  We should move to concentrate it in a single file, and
this file should be encoded using a human-readable, extensible, standardized
data serialization language -- either XML or YAML.
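
As a rough illustration of the proposal, here is a minimal Python sketch of
what a centralized, human-readable metadata file and a parser for a small
YAML subset might look like.  The field names (format, segments, seg_1,
seg_2), their reading as per-segment document counts, and the function
parse_metadata are all hypothetical; this is not the actual file format or
parser of Lucene or KinoSearch.

```python
def parse_metadata(text):
    """Parse a small YAML subset: 'key: value' pairs, with one level of
    nesting expressed by indentation. Values are ints or strings."""
    root = {}
    current = root
    for raw in text.splitlines():
        # Drop comments and trailing whitespace. (Assumption: values never
        # contain '#'; a fuller parser would have to handle quoting.)
        line = raw.split("#", 1)[0].rstrip()
        if not line.strip():
            continue
        indented = line[0] == " "
        key, _, value = line.strip().partition(":")
        value = value.strip()
        if not indented:
            current = root
        if value == "" and not indented:
            # A top-level key with no value opens a nested mapping.
            root[key] = current = {}
        else:
            current[key] = int(value) if value.isdigit() else value
    return root


# Hypothetical metadata file: a format version plus document counts per segment.
SEGMENTS = """\
format: 1
segments:
  seg_1: 1200
  seg_2: 310
"""

meta = parse_metadata(SEGMENTS)
```

Because unknown keys simply land in the dictionary, a reader that does not
need a given value can ignore it, which is the kind of extensibility argued
for above.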

* Making metadata human-readable makes debugging easier.  Centralizing it
   makes debugging easier still.  Developers benefit from being able to scan
   and locate relevant information quickly and with less debug printing.  Users
   get a new window through which to peer into the index structure.
* Since metadata is written to a separate file, there would no longer be a
   need to seek back to the beginning of any data file to finish a header,
   solving issue LUCENE-532.
* Special-case parsing code needed for extracting metadata supplied by
   different index formats can be pared down.  If a value is no longer
   necessary, it can just be ignored/discarded.
* Removing headers from the data files simplifies them and makes the file
   format easier to implement.
* With headers removed, all or nearly all data structures can take the
   form of records stacked end to end, so that once a decoder has been
   selected, an iterator can read the file from top to tail.  To an extent,
   this allows us to separate our data-processing algorithms from our
   serialization algorithms, decoupling Lucene's code base from its file
   format.  For instance, instead of further subclassing TermDocs to deal with
   "flexible indexing" formats, we might replace it with a PostingList which
   returns a subclass of Posting.  The deserialization code would be wholly
   contained within the Posting subclass rather than spread out over several
   subclasses of TermDocs.
* YAML and XML are equally well suited for the task of storing metadata,
   but in either case a complete parser would not be needed -- a small subset
   of the language will do.  KinoSearch 0.20's custom-coded YAML parser
   occupies about 600 lines of C -- not too bad, considering how miserable C's
   string handling capabilities are.
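
The "records stacked end to end" idea can be sketched roughly as follows, in
Python for brevity.  The class names Posting and PostingList come from the
text above, but everything else is an assumption made for illustration: the
DocFreqPosting subclass, its fixed-width layout of two big-endian 32-bit
integers, and the read() method are hypothetical and do not reflect Lucene's
actual encoding or API.

```python
import io
import struct


class Posting:
    """Base class: each subclass knows how to decode one record of its format."""

    @classmethod
    def read(cls, stream):
        raise NotImplementedError


class DocFreqPosting(Posting):
    """Hypothetical record: document number and term frequency."""

    _layout = struct.Struct(">II")  # assumed layout: two big-endian uint32s

    def __init__(self, doc, freq):
        self.doc = doc
        self.freq = freq

    @classmethod
    def read(cls, stream):
        raw = stream.read(cls._layout.size)
        if not raw:
            return None  # clean end of file
        if len(raw) < cls._layout.size:
            raise ValueError("truncated record")
        return cls(*cls._layout.unpack(raw))


class PostingList:
    """Iterates a headerless data file from top to tail with a chosen decoder."""

    def __init__(self, stream, posting_class):
        self.stream = stream
        self.posting_class = posting_class

    def __iter__(self):
        while True:
            posting = self.posting_class.read(self.stream)
            if posting is None:
                return
            yield posting


# Two records back to back, with no header in front of them.
data = struct.pack(">II", 3, 2) + struct.pack(">II", 7, 1)
postings = [(p.doc, p.freq) for p in PostingList(io.BytesIO(data), DocFreqPosting)]
```

Supporting another record format would then mean adding another Posting
subclass and passing it to PostingList, leaving the iteration code untouched.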





