[KinoSearch] getting back to mmap
nate at verse.com
Mon May 5 15:48:32 PDT 2008
On Thu, May 1, 2008 at 11:43 PM, Marvin Humphrey <marvin at rectangular.com> wrote:
> I suspect we will still want to encapsulate IO functions within InStream
> and OutStream... We'll still have code that looks like this:
> u32_t doc_num = InStream_Read_C32(instream);
While I think this makes good sense for the current format, I view
this as an implementation detail that should be internal to the
default Index format. For an uncompressed format using mmap (which I
agree is useful only in certain limited circumstances), I'd really like to be
able to knock the whole thing down to "posting += posting->length"
(with some appropriate bounds checks).
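To make the "posting += posting->length" idea concrete, here is a minimal sketch of walking uncompressed, length-prefixed records in a flat byte region. The region could come from mmap, but any buffer works for the sketch. The record layout (posting_header_t, with a length field that counts the header itself) and the function names are assumptions made for illustration, not the KinoSearch format or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header: `length` is the size of the whole record
 * in bytes, header included. Not the actual KinoSearch layout. */
typedef struct {
    uint32_t length;
    uint32_t doc_num;
} posting_header_t;

/* Return the length of the record at `cur`, or 0 if [cur, end) does not
 * hold a complete, well-formed record. These are the bounds checks. */
static size_t
posting_length(const uint8_t *cur, const uint8_t *end)
{
    posting_header_t head;
    assert(cur <= end);
    if ((size_t)(end - cur) < sizeof head) return 0;     /* no room for a header */
    memcpy(&head, cur, sizeof head);                     /* avoids unaligned reads */
    if (head.length < sizeof head) return 0;             /* corrupt: would never advance */
    if (head.length > (size_t)(end - cur)) return 0;     /* truncated record */
    return head.length;
}

/* Walk every complete record in the region, remember the last doc_num
 * seen, and return how many records were visited. */
size_t
count_postings(const uint8_t *base, size_t size, uint32_t *last_doc)
{
    const uint8_t *cur = base;
    const uint8_t *end = base + size;
    size_t count = 0;
    size_t len;

    while ((len = posting_length(cur, end)) != 0) {
        posting_header_t head;
        memcpy(&head, cur, sizeof head);
        *last_doc = head.doc_num;
        count++;
        cur += len;   /* the "posting += posting->length" step */
    }
    return count;
}
```

The loop does no decoding at all: advancing is a single pointer addition, and the only per-record cost beyond that is the validation in posting_length.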
For compressed formats, I would still want to be able to use
format-specific decompression. For example, I've been tempted to use
SQLite as a data store, and access its BTrees directly. While I guess
one could do this with a specially subclassed InStream, I'd rather
have the external interface be at some higher level.
Note that I'm not looking to index text content stored in SQLite,
rather to use SQLite (or at least its BTree implementation) to manage
the binary PostingLists used by KinoSearch. I'm also interested in
using my own custom file formats, as well as trying a balanced file
system like Reiser. Thus I would like a clearly defined interface
between the Scorer and the Index that doesn't make assumptions about
how the Index will fulfill the 'Posting_next' request.
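As a rough sketch of what such an interface could look like, the Scorer below sees only a "next" function pointer, and each Index backend supplies its own. All names here (posting_t, posting_source_t, array_source, total_freq) are hypothetical and chosen for illustration; none of them are existing KinoSearch types. The in-memory array backend stands in for whatever a real backend (mmap, SQLite BTree, custom file) would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded posting handed to the Scorer. */
typedef struct {
    uint32_t doc_num;
    uint32_t freq;
} posting_t;

/* The whole contract between Scorer and Index: fill *out with the next
 * posting and return 1, or return 0 when the list is exhausted. */
typedef struct posting_source posting_source_t;
struct posting_source {
    int  (*next)(posting_source_t *self, posting_t *out);
    void  *state;   /* backend-private data, opaque to the Scorer */
};

/* One possible backend: postings already sitting in an array. */
typedef struct {
    const posting_t *items;
    size_t           count;
    size_t           pos;
} array_state_t;

static int
array_next(posting_source_t *self, posting_t *out)
{
    array_state_t *st = self->state;
    assert(st->pos <= st->count);
    if (st->pos == st->count) return 0;
    *out = st->items[st->pos++];
    return 1;
}

posting_source_t
array_source(array_state_t *st)
{
    posting_source_t src = { array_next, st };
    return src;
}

/* Scorer side: it never learns how the postings are stored. */
uint64_t
total_freq(posting_source_t *src)
{
    posting_t p;
    uint64_t  total = 0;
    while (src->next(src, &p)) {
        total += p.freq;
    }
    return total;
}
```

Swapping in a different store would mean writing another function with the same signature as array_next and leaving total_freq untouched.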
> Well, did you at least notice that we were designing the file format with
> SSDs in mind when you were scoring our discussion for buzzword compliance?
I did notice, and this was one of the sections that made me uneasy.
I'd guess that KinoSearch is not currently being limited by disk seek
times, but by sustained transfer rates --- or at least I'd hope so.
My goal would be to have it limited by memory bandwidth, which is many
times faster than an SSD accessed as a disk.
Am I missing something here that would give SSDs a real advantage?
Or reasons that they benefit from being treated differently? My
instinct is that they'd be of more benefit for truly random access or
for cases where reads and writes are more balanced. I'm not familiar
with them in practice.
> > Unfortunately, I won't be able to back
> > that up with code any time soon. :(
> Well, that just means both communiques and code will emerge from me more
I understand, and am appreciative that the responses come at all.
> I just wish we could get you, Dave Balmain, Mike McCandless and myself all
> together hashing out a file format in the same forum at the same time.
If I'm useful at all in my currently addled state, I'd be happy to try.
nate at verse.com