[KinoSearch] fast phrase matching [patch]
nate at verse.com
Thu Sep 27 13:06:16 PDT 2007
On 9/27/07, Marvin Humphrey <marvin at rectangular.com> wrote:
> I found what I was looking for. There's a book called "Programming
> from the Ground Up", by Jonathan Bartlett, available both in
> dead-tree format and as a PDF under the GNU Free Documentation
> License.
Thanks, I shall check it out. I don't recall if I sent this in
another message, but provenance aside, this is excellent on the
Linux memory management side:
> The classes we should focus on are InStream and OutStream. First
> order of business is to study how the simple functions look in
> assembly: read_byte, read_int, etc.
Yes, so long as the Zero order of business is to determine if they
make sense architecturally. :) I'd wonder whether keeping this more
encapsulated in the Posting class might be more flexible. We'll
still probably need the same routines, though.
> After that, we move on to what are currently called VInts and
> VLongs. (I'm contemplating renaming them C32 and C64 for "compressed
> 32/64-bit integer", since they are no longer the same as Lucene
> VInts/VLongs -- they're now BER compressed integers, as used by
> Perl's pack() function.) First, we'd like to see whether those
> functions are fully optimized under the current scheme.
Check. Presumably the internal Perl code is optimized. And does
SQLite use the same format? The code there is generally pretty and
likely well tuned.
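
For reference, the BER compressed integer format that Perl's pack() documents for its "w" template stores an unsigned integer as base-128 digits, most significant digit first, with the high bit set on every byte except the last. The following is only a minimal sketch of that encoding, not KinoSearch's or Perl's actual code: the function names ber_encode_u32 and ber_decode_u32 are hypothetical, the <stdint.h> types stand in for the project's own u32_t, and the decoder does no bounds or overflow checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode value as a BER compressed integer: base-128 digits, most
 * significant first, high bit set on every byte except the last.
 * buf must have room for 5 bytes.  Returns the number of bytes written.
 * (Hypothetical helper, for illustration only.) */
size_t
ber_encode_u32(uint32_t value, uint8_t *buf)
{
    uint8_t tmp[5];
    size_t  n = 0;
    size_t  i;

    /* Collect 7-bit groups, least significant first. */
    do {
        tmp[n++] = (uint8_t)(value & 0x7F);
        value >>= 7;
    } while (value != 0);

    /* Emit the groups in reverse, flagging all but the final byte. */
    for (i = 0; i < n; i++) {
        uint8_t digit = tmp[n - 1 - i];
        buf[i] = (i + 1 < n) ? (uint8_t)(digit | 0x80) : digit;
    }
    return n;
}

/* Decode one BER compressed integer starting at *src and advance *src
 * past it.  Assumes well-formed input that fits in 32 bits.
 * (Hypothetical helper, for illustration only.) */
uint32_t
ber_decode_u32(const uint8_t **src)
{
    const uint8_t *p = *src;
    uint32_t value = 0;
    uint8_t  byte;

    do {
        byte  = *p++;
        value = (value << 7) | (uint32_t)(byte & 0x7F);
    } while (byte & 0x80);

    *src = p;
    return value;
}
```

The decoder consumes one byte per loop iteration and never needs to know the length up front, which is why it fits reading a single posting at a time from a stream or a mapped region.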
> Second, data compression
> is the bugaboo for integrating mmap, and we can brainstorm
> alternative approaches while optimizing.
I don't think this is really a problem. As you mentioned in another
thread, we don't really need to be using bulk reads, and once we are
decompressing a single posting at a time I think there are elegant
solutions for this.
> > I'll send you a version tomorrow with such things included for you to
> > decide where you want to draw that line. It's out of sync with the
> > patch right now.
> I went ahead and committed the version you supplied as r2555. Thanks!
Sorry for never sending that. I've been distracted with other projects.
> Some mild mods followed in r2556 (<http://xrl.us/6uch>):
> * The "inline" keyword has been replaced with the recently introduced
> INLINE Charmonizer macro, which is empty if the compiler doesn't
> support inline functions.
Yes, good idea.
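
A rough sketch of how such a macro can work, under stated assumptions: HAS_INLINE is a hypothetical symbol standing in for whatever the configuration probe actually defines, and add_one is a made-up function. This is not Charmonizer's real output.

```c
/* HAS_INLINE is a hypothetical symbol that a configure-time probe would
 * define when the compiler accepts the "inline" keyword. */
#ifdef HAS_INLINE
  #define INLINE inline
#else
  #define INLINE /* expands to nothing on compilers without inline */
#endif

/* Made-up example function: compiles the same way with or without
 * inline support, since INLINE simply disappears when unsupported. */
static INLINE unsigned int
add_one(unsigned int x)
{
    return x + 1;
}
```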
> * A sanity check was added at the top of winnow_anchors() to prevent
> possible invalid pointer de-refs. Technically, this wasn't
> necessary because the calling function cannot currently supply data
> which triggers such a problem, but I was uneasy about the absence
> of a local safety mechanism.
I think this might be a better case for an ASSERT, but I understand
> * The assignment of the iteration variable "i" was moved from the
> variable declaration at the top of PhraseScorer_calc_phrase_freq
> to the loop initiation. This will thwart problems if stuff gets
> moved around and i winds up with a new value prior to the loop.
> Again, not technically necessary, just defensive programming
Good thing. It ended up the way it did because I converted it from
C99 where 'i' was both declared and initialized in the loop header.
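
To illustrate the two styles being discussed, here is a small sketch. The functions sum_c89 and sum_c99 are hypothetical examples and have nothing to do with the real PhraseScorer_calc_phrase_freq.

```c
#include <stddef.h>

/* C89 style: "i" is declared at the top of the block, but assigned in
 * the loop header, so code inserted before the loop cannot leave it
 * holding a stale starting value. */
unsigned long
sum_c89(const unsigned int *vals, size_t count)
{
    size_t        i;
    unsigned long total = 0;

    for (i = 0; i < count; i++) {
        total += vals[i];
    }
    return total;
}

/* C99 style: "i" is declared and initialized in the loop header and is
 * not visible outside the loop at all. */
unsigned long
sum_c99(const unsigned int *vals, size_t count)
{
    unsigned long total = 0;

    for (size_t i = 0; i < count; i++) {
        total += vals[i];
    }
    return total;
}
```

A mechanical C99-to-C89 conversion that hoists "size_t i = 0;" to the top of the function keeps the initialization far from the loop, which is the situation the commit moved away from.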
> * winnow_anchors() now returns a u32_t rather than a size_t,
> because it's returning a count of u32_t rather than char and the
> C standard defines size_t as "A type to define sizes of strings
> and memory blocks."
OK. I went with this because we add the return value to a pointer,
and I thought that on a 64-bit system this might save a back and forth
conversion. One could either argue that it is being used to define
the size of a memory block, or that the standard's definition is not
exclusive of other uses.
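
A sketch of the usage pattern in question, with loud caveats: compact_nonzero is a hypothetical stand-in and not the real winnow_anchors(), end_after_compact is likewise made up, and the <stdint.h> types stand in for the project's u32_t.

```c
#include <stdint.h>

/* Hypothetical stand-in, not the real winnow_anchors(): moves the
 * nonzero values to the front of the array and returns how many of
 * them there are, as a count of elements rather than bytes. */
uint32_t
compact_nonzero(uint32_t *vals, uint32_t count)
{
    uint32_t kept = 0;
    uint32_t i;

    for (i = 0; i < count; i++) {
        if (vals[i] != 0) {
            vals[kept++] = vals[i];
        }
    }
    return kept;
}

/* The caller adds the returned count to a pointer to find the new end
 * of the live data.  Pointer arithmetic accepts any integer type here. */
uint32_t*
end_after_compact(uint32_t *vals, uint32_t count)
{
    return vals + compact_nonzero(vals, count);
}
```

Either return type compiles and behaves the same for counts that fit in 32 bits. Whether a 32-bit count costs an extra widening step on a 64-bit target depends on the compiler and platform, so that would need to be checked in the generated assembly rather than assumed.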
> >> PS: Tabs suck.
> > Oops. Have I been including them in things I send?
> In this patch at least. No big deal, I zapped 'em.
Hmm, I downloaded the patch I attached, and didn't find any tabs in
it. Either I've done something wrong twice, or maybe something else is
nate at verse.com