[KinoSearch] Using KinoSearch file format as generic inverted index

Marvin Humphrey marvin at rectangular.com
Tue Aug 29 18:38:32 PDT 2006



Simon,

I don't know if this is realistic right now.  The problem is that  
KinoSearch, like Lucene, is tightly bound to its file format.  It's  
possible, but it would take a fairly deep understanding of KS's data  
structures and an awful lot of hacking on stuff that isn't public.

Nevertheless, this is *precisely* the direction that I want to take  
KinoSearch, and Lucy.

Document and Field should both be abstract classes that specify  
serialization/deserialization methods.
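
A minimal sketch of that idea, written in Python for brevity rather than in KinoSearch's Perl/C. The names Field, TextField, serialize and deserialize, and the length-prefixed byte layout, are assumptions made for illustration; none of them are actual KinoSearch or Lucy API. A Document base class could follow the same pattern.

```python
from abc import ABC, abstractmethod
import struct


class Field(ABC):
    """Hypothetical abstract field: each subclass owns its byte layout."""

    @abstractmethod
    def serialize(self) -> bytes:
        """Return the bytes that would be written to the index."""

    @classmethod
    @abstractmethod
    def deserialize(cls, data: bytes) -> "Field":
        """Rebuild a field from bytes produced by serialize()."""


class TextField(Field):
    """Illustrative concrete field holding a name and a string value."""

    def __init__(self, name: str, value: str):
        self.name = name
        self.value = value

    def serialize(self) -> bytes:
        name = self.name.encode("utf-8")
        value = self.value.encode("utf-8")
        # Two big-endian uint32 lengths, followed by the raw UTF-8 bytes.
        return struct.pack(">II", len(name), len(value)) + name + value

    @classmethod
    def deserialize(cls, data: bytes) -> "TextField":
        name_len, value_len = struct.unpack_from(">II", data, 0)
        start = 8  # size of the two uint32 length prefixes
        name = data[start:start + name_len].decode("utf-8")
        value = data[start + name_len:start + name_len + value_len].decode("utf-8")
        return cls(name, value)
```

With this shape, code that reads or writes the index only ever calls serialize() and deserialize(), so a new layout means a new subclass rather than changes scattered across many classes.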

In the near term, that change is crucial for liberating KinoSearch  
from its file format.  Once the logic for reading the index lives in  
a plugin, and fewer classes spec how to read the index directly,  
backwards compatibility suddenly gets a hell of a lot easier... and  
we can get rid of that dang "alpha" label.

In the longer term, I hope to enable innovation along the lines of  
what you propose to do.  Other examples include...

    * "boost per-position", allowing, say, text between h1
      tags to contribute more than text between p tags.
    * tracking part-of-speech per-position
    * associating each term with LSA vectors
    * ????? -- a generic inverted index will hopefully be put
      to uses that are not currently envisioned.

The plan is to battle-test the abstraction privately first using a  
new file format which will fit with this scheme more comfortably than  
the current one. The target release for the private API is 0.20.   
Once we cross that threshold, it will be easier to do what you  
propose, if you're willing to live on the bleeding edge and hack away  
at the internals.

> Related to my previous post and some algorithms I've been playing
> around with, I'd like to try and see if I can get a performance boost
> out of using the KinoSearch InvIndex to store some graph data.
>
> I need to store a node id and then a list of other node ids that it
> links to. The edge needs to have 2 other arbitrary fields attached to
> it - a type and value (although I suppose the type could be done by
> having each different type in different indexes). Preferably each node
> should be able to be looked up as an id or as a value.

Does it need to be per-position or per-term?  This can get fairly  
expensive if you need it per-position.  Think of whether each word in  
a book's index needs the tagging, or whether each page number within  
each index entry needs the tagging.  If it's each page number, then  
you need a lot more space than if it's per-term.
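
To put rough numbers on the book-index analogy, here is a small sketch in Python. The toy postings dictionary and the one-byte tag size are assumptions chosen for illustration, not measurements of a real index.

```python
def extra_bytes(postings, bytes_per_tag=1):
    """Return (per_term, per_position) overhead for a tag of the given size.

    postings maps term -> {doc_id: [positions]}.
    """
    # One tag for each distinct term (each entry in the book's index).
    per_term = bytes_per_tag * len(postings)
    # One tag for each occurrence (each page number under each entry).
    per_position = sum(
        bytes_per_tag * len(positions)
        for docs in postings.values()
        for positions in docs.values()
    )
    return per_term, per_position


# Toy data: two terms, seven occurrences in total.
postings = {
    "index": {1: [3, 17, 42], 2: [5]},
    "graph": {1: [8], 3: [2, 9]},
}
print(extra_bytes(postings))  # (2, 7)
```

The per-term cost grows with the size of the vocabulary, while the per-position cost grows with the total number of occurrences, which is usually far larger.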

> Understandably the docs don't really go into how to do this - the
> various classes seem a bit ... sparse on POD :)

Sparse on visible POD at least for private classes -- by design.   
However, have you snooped the actual module code rather than just  
running it through perldoc or looking on search.cpan.org?  In some  
cases, there's fairly extensive documentation hidden away -- see  
OutStream for a good example.

> Any idea on whether this is a sane thing to do and, if so, how to go
> about doing it?

The main classes you would need to be concerned with at search-time  
are...

    * TermEnum/SegTermEnum -- an "array" of Terms.
    * TermDocs/SegTermDocs/MultiTermDocs -- for each term, an "array"
      of doc numbers and other info.
    * TermBuffer -- does the deserialization for SegTermEnum.

TermEnum and TermDocs aren't really arrays, they're iterators, but  
it's easier if we think of them as giant arrays.
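
The "iterators that behave like giant arrays" picture can be sketched as follows. This is a toy in-memory model in Python; the function names term_enum and term_docs, and the (doc number, frequency) tuple, are hypothetical stand-ins and do not reflect the actual KinoSearch class interfaces.

```python
from typing import Dict, Iterator, List, Tuple

# Toy index: term -> list of (doc_num, freq), with doc numbers ascending.
INDEX: Dict[str, List[Tuple[int, int]]] = {
    "graph": [(1, 2), (4, 1)],
    "index": [(1, 1), (2, 3), (4, 1)],
    "node": [(2, 1)],
}


def term_enum(index: Dict[str, List[Tuple[int, int]]]) -> Iterator[str]:
    """Walk the terms in sorted order, like stepping through an 'array' of Terms."""
    for term in sorted(index):
        yield term


def term_docs(
    index: Dict[str, List[Tuple[int, int]]], term: str
) -> Iterator[Tuple[int, int]]:
    """For one term, walk its 'array' of doc numbers plus other info (here, freq)."""
    yield from index.get(term, [])


for term in term_enum(INDEX):
    for doc_num, freq in term_docs(INDEX, term):
        print(term, doc_num, freq)
```

A real implementation would stream these from disk one entry at a time instead of holding them in memory, which is why they are iterators rather than arrays.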

At index-time, it's harder to describe what's going on, but for the  
record, the classes that handle the low level writing are  
PostingsWriter, TermInfosWriter, and SegWriter.

The idea is that you would stuff an extra number somewhere into  
the file format, then recover it later and probably make use of it in  
a specialized scorer.  I haven't thought too much about the API.   
Maybe something pack-ish?  We'll see.
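
One way to picture a "pack-ish" approach is the sketch below, which uses Python's struct module. The fixed-width record layout (doc number, frequency, one extra byte) and the boosted_score formula are assumptions made up for illustration; they are not the KinoSearch file format or its scoring, and a real format would likely use a more compact encoding.

```python
import struct

# Hypothetical posting record: uint32 doc number, uint16 frequency,
# and one extra unsigned byte carrying the application-specific number.
POSTING = struct.Struct(">IHB")


def write_postings(postings):
    """Serialize (doc_num, freq, extra) tuples into one bytes object."""
    return b"".join(POSTING.pack(doc, freq, extra) for doc, freq, extra in postings)


def read_postings(data):
    """Recover the (doc_num, freq, extra) tuples written by write_postings()."""
    return list(POSTING.iter_unpack(data))


def boosted_score(freq, extra):
    """Toy 'specialized scorer': scale frequency by the extra byte (0..255)."""
    return freq * (1.0 + extra / 255.0)


raw = write_postings([(1, 2, 0), (7, 2, 255)])
for doc_num, freq, extra in read_postings(raw):
    print(doc_num, boosted_score(freq, extra))
```

The point of the sketch is only the round trip: the extra number is written alongside each posting at index time and handed back unchanged at search time, where a scorer can decide what it means.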

For background, see the POD in KinoSearch::Docs::FileFormat and  
<http://wiki.apache.org/jakarta-lucene/FlexibleIndexing>.

Cheers,

Marvin Humphrey

--
I'm looking for a part time job.



