[KinoSearch] get doc/query similarity
marvin at rectangular.com
Tue Apr 15 23:59:45 PDT 2008
On Apr 15, 2008, at 9:33 PM, jack_tanner at yahoo.com wrote:
> This is a kind of duplicate detection task. I have a corpus of
> documents written by a known, small set of authors. I want to rank
> the authors w.r.t. how much they repeat themselves. To do that, I
> want to take all docs written by the same author, compute their
> pairwise similarities, and then average those similarities.
> (Probably just take the mean.) I'm going to repeat this for all
> authors. At the end, I have a "repetitiveness" score for each
> author. This score is the actual end goal.
Neat. Not that this is what you're doing, but I can imagine something
like this being used as a supervisory tool for people who get paid for
generating content when the primary criterion is volume rather than
quality. Copy-and-paste documents with minor variations would appear
tightly grouped in vector space.
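For illustration, here is a minimal sketch of the per-author averaging described above. It is written in Python and does not use KinoSearch: it assumes plain cosine similarity over raw term counts, and the names `cosine`, `repetitiveness` and `rank_authors` are hypothetical. Whatever similarity measure is actually used would slot in where `cosine` is called.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def repetitiveness(docs):
    """Mean pairwise similarity over one author's documents."""
    # Assumption: whitespace tokenization and raw counts, no IDF weighting.
    vectors = [Counter(d.lower().split()) for d in docs]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two docs: nothing to compare
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def rank_authors(corpus):
    """corpus maps author -> list of document strings; most repetitive first."""
    scores = {author: repetitiveness(docs) for author, docs in corpus.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Note that the number of pairs grows quadratically with the number of docs per author, so for prolific authors sampling pairs may be worth considering.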
>> The brute force way is to take the contents of a document or possibly
>> a distillation of the contents and use that as your query, hand it off
>> to a Searcher and see what the search gives back. That gives you a list
>> of docs, though -- not just one. You can constrain the search by
>> adding a "primary key"-type requirement, though performance of such a
>> search might be a concern with large indexes due to the way KS
>> compiles its queries.
> I can definitely do that, and then just loop over the hits until I
> get the doc of interest. The only problem is if the doc of interest
> is not retrieved at all... but then I can assign that a score of 0.
Please let us know how it goes.
I suggest using only one field, otherwise you might get some
distortions and exaggerations in the scoring curves as artifacts of
the query parsing wizard.
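The loop-over-the-hits approach discussed above might look something like the following sketch. It is Python rather than KinoSearch code, and `hits` is a hypothetical list of (doc_id, score) pairs standing in for whatever the Searcher hands back; `score_for` is likewise an invented name.

```python
def score_for(hits, target_id):
    """Return the score of target_id among hits, or 0.0 if it was not retrieved."""
    for doc_id, score in hits:
        if doc_id == target_id:
            return score
    # The doc of interest never came back, so treat the pair as dissimilar.
    return 0.0
```

For example, `score_for([("d1", 0.5), ("d2", 0.25)], "d2")` gives 0.25, and any id missing from the hits gives 0.0.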
You may also run afoul of the max_clause_count of 1024 in BooleanQuery
because the queries will have so many components. To defeat this in
0.1x, add this to your code:
# hack to override safety feature
KinoSearch's scoring model uses Lucene's slight variant on vanilla
TF/IDF. Length normalization is in there; the resolution is low, but
that shouldn't matter. The one thing that's a little unusual is the
addition of a "coord" function which boosts OR'd queries when multiple
clauses match. It will affect your scores, but probably not too much
since the formula is proportional: num_matchers / max_matchers.
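To make the coord factor concrete, here is a small Python sketch. It assumes the factor is multiplied into the raw score, as in Lucene's classic scoring; the function names are hypothetical and this is not KinoSearch's actual code.

```python
def coord(num_matchers, max_matchers):
    """Fraction of the query's clauses that matched the document."""
    if max_matchers == 0:
        return 0.0
    return num_matchers / max_matchers


def apply_coord(raw_score, num_matchers, max_matchers):
    """Scale a raw score by the coord factor (assumed to be multiplicative)."""
    return raw_score * coord(num_matchers, max_matchers)
```

So a document matching 3 of 4 OR'd clauses keeps three quarters of its raw score, while one matching a single clause keeps a quarter.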
KinoSearch mailing list
KinoSearch at rectangular.com