[KinoSearch] fast phrase matching [patch]
Marvin Humphrey
marvin at rectangular.com
Fri Sep 28 01:31:35 PDT 2007
On Thu, Sep 27, 2007 at 02:06:16PM -0600, Nathan Kurz wrote:
> > The classes we should focus on are InStream and OutStream. First
> > order of business is to study how the simple functions look in
> > assembly: read_byte, read_int, etc.
>
> Yes, so long as the Zero order of business is to determine if they
> make sense architecturally. :)
I like the sound of that. :)
>I'd wonder whether keeping this more
> encapsulated in the Posting class might be more flexibile. We'll
> still probably need the same routines, though.
Other things besides Postings need these routines: Lexicons, term vectors,
document storage, etc.
> > After that, we move on to what are currently called VInts and
> > VLongs. (I'm contemplating renaming them C32 and C64 for "compressed
> > 32/64-bit integer", since they are no longer the same as Lucene VInts/
> > VLongs -- they're now BER compressed integers, as used by Perl's pack
> > () function.) First, we'd like to see whether those functions are
> > fully optimized under the current scheme.
>
> Check. Presumably the internal Perl code is optimized.
That seems likely.
IIRC, what's in KS is very close to what's in pp_pack.c. I certainly studied
it.
There are a couple interesting differences between the Lucene VInt and the BER
compressed integer. You can tack "leading zeroes" onto the BER compressed
integer, but not the VInt. Also, there's an extra count to keep track of at
either read or write; VInt pays the penalty on read, BER pays the penalty on
write.
> And does SQLite use the same format? The code there is generally pretty and
> likely well tuned.
Dunno about that. I've never spelunked the SQLite code base, and a web search
for 'sqlite compressed integer' didn't turn up anything of interest.
> > Second, data compression
> > is the bugaboo for integrating mmap, and we can brainstorm
> > alternative approaches while optimizing.
>
> I don't think this is really a problem. As you mentioned in another
> thread, we don't really need to be using bulk reads, and once we are
> decompressing a single posting at a time I think there are elegant
> solutions for this.
Cool. There ought to be room for improvement here. KS is double buffering
all input/output. The setvbuf thing is really weird, but I couldn't deny
the benchmarks. Maybe an 'objdump -S FSFileDes.o' will tell me something.
> > > I'll send you a version tomorrow with such things included for you to
> > > decide where you want to draw that line. It's out of sync with the
> > > patch right now.
> >
> > I went ahead and committed the version you supplied as r2555. Thanks!
>
> Sorry for never sending that. I've been distracted with other projects.
No worries, mate. Can you show me how you normally like your DEBUG and ASSERT
macros set up?
> > * A sanity check was added at the top of winnow_anchors() to prevent
> > possible invalid pointer de-refs. Technically, this wasn't
> > necessary because the calling function cannot currently supply data
> > which triggers such a problem, but I was uneasy about the
> > absence of a
> > local safety mechanism.
>
> I think this might be a better case for an ASSERT, but I understand
> the impulse.
My guess was that you'd had an ASSERT there but deleted it. However, I don't
think it's a good candidate for an ASSERT, because you need specifically
pathological data to trigger the problem.
> > * winnow_anchors() now returns a u32_t rather than a size_t,
> > because it's returning a count of u32_t rather than char and the
> > C standard defines size_t as "A type to define sizes of strings
> > and memory blocks."
>
> OK. I went with this because we add the return value to a pointer,
> and I thought that on a 64-bit system this might save a back and forth
> conversion.
OK, that's a good enough reason. I'll change it back.
> Hmm, I downloaded the patch I attached, and didn't find any tabs in
> it. Either I've done something wrong twice, or maybe something else is
> adding them.
Output of a search for tabs is below. I also find them with vim.
Cheers,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Last login: Fri Sep 28 00:31:20 on ttyp1
Welcome to Darwin!
teddy:~ marvin$ curl -o winnow_anchors.diff http://www.rectangular.com/pipermail/kinosearch/attachments/20070910/79f710e0/winnow_anchors-0001.obj
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 8914 100 8914 0 0 35623 0 --:--:-- --:--:-- --:--:-- 88862
teddy:~ marvin$ ack -a "\t" winnow_anchors.diff
--- c_src/KinoSearch/Search/PhraseScorer.c (revision 2526)
+++ c_src/KinoSearch/Search/PhraseScorer.c (working copy)
+ const u32_t *candidates, const u32_t *const candidates_end,
+ size_t offset)
+{
+ if (++candidates == candidates_end) goto DONE;
+ if (++anchors == anchors_end) goto DONE;
+ /* copy anchor positions skipping those less than initial offset */
+ if (phrase_offsets[0] > posting->prox[i]) anchors_end--;
+ /* subtract phrase_offsets[0] to account for leading virtual terms */
+ else *anchors++ = posting->prox[i] - phrase_offsets[0];
+ i++;
+ /* prepare the non-overwritten list of potential next terms */
+ /* reduce anchor_set in place to those that match the next term */
+ anchors_remaining = winnow_anchors(anchors_start, anchors_end,
+ candidates_start, candidates_end,
+ phrase_offsets[i]);
+ /* adjust end for number of anchors that remain */
+ anchors_end = anchors_start + anchors_remaining;
--- perl/t/502-phrasequery.t (revision 2526)
+++ perl/t/502-phrasequery.t (working copy)
teddy:~ marvin$
_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
More information about the kinosearch
mailing list