[KinoSearch] get doc/query similarity
marvin at rectangular.com
Tue Apr 15 10:30:41 PDT 2008
> Ping? Still trying to compute similarity of two indexed docs... A
> weighted cosine or some such.
I've started a reply to this several times, then balled it up and
ashcanned it. I understand what you want theoretically, and the
document frequency and term frequency information is in the index and
accessible, at least via private APIs. The question is how to achieve
whatever your end goal is efficiently and conveniently.
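
For reference, here is a minimal sketch of a weighted cosine computed
from term frequency and document frequency. It is plain Python and
does not use any KinoSearch API: the names tfidf_vector and cosine,
the doc_freq mapping, and the smoothed idf formula are all
assumptions made for illustration, and the token lists stand in for
whatever the index would supply.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, num_docs):
    """Map each term in `tokens` to tf * idf.

    `doc_freq` (term -> number of docs containing it) and `num_docs`
    are hypothetical inputs standing in for index statistics.
    The idf is smoothed so unseen terms do not divide by zero.
    """
    tf = Counter(tokens)
    return {
        term: count
        * (math.log((1 + num_docs) / (1 + doc_freq.get(term, 0))) + 1.0)
        for term, count in tf.items()
    }


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse term -> weight vectors."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_freq = {"search": 40, "index": 25, "perl": 5}
    a = tfidf_vector(["search", "index", "perl", "perl"], doc_freq, 100)
    b = tfidf_vector(["search", "index", "index"], doc_freq, 100)
    print(cosine(a, b))
```

The result lies between 0 and 1 for non-negative weights. Computing
it for one pair is cheap; the cost is in fetching the term statistics
for each document, which is why the access pattern matters.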
> Thanks for that example. Let me be more clear about what is desired:
> I need to compute the similarity of two indexed documents.
Are you doing something akin to a "more like this" query? What does
the end-user API look like?
The brute-force way is to take the contents of a document, or possibly
a distillation of the contents, use that as your query, hand it off to
a Searcher, and see what the search gives back. That gives you a bunch
of docs, though -- not just one. You can constrain the search to a
single document by adding a "primary key"-type requirement, though
performance of such a search might be a concern with large indexes due
to the way KS compiles its queries.
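
One way to picture the "distillation" step, sketched in plain Python
rather than against the KinoSearch API: keep only the few terms with
the highest tf-idf weight and treat them as optional clauses, with
the primary key as a required clause. The function names, the
max_terms cutoff, and the dict returned by build_query are all
hypothetical; turning that structure into a real query is left to
whatever query classes the library provides.

```python
import math
from collections import Counter


def distill(tokens, doc_freq, num_docs, max_terms=5):
    """Return up to `max_terms` terms ranked by tf * idf.

    `doc_freq` and `num_docs` are hypothetical stand-ins for index
    statistics. Ties are broken alphabetically for stable output.
    """
    tf = Counter(tokens)

    def weight(term):
        idf = math.log((1 + num_docs) / (1 + doc_freq.get(term, 0)))
        return tf[term] * idf

    ranked = sorted(tf, key=lambda term: (-weight(term), term))
    return ranked[:max_terms]


def build_query(terms, key_field=None, key_value=None):
    """Describe the query as data: optional terms plus a required key.

    This is an illustrative structure only, not a KinoSearch object.
    """
    query = {"should": list(terms), "must": {}}
    if key_field is not None:
        query["must"][key_field] = key_value
    return query


if __name__ == "__main__":
    doc_freq = {"the": 100, "cat": 3, "dog": 10}
    tokens = ["cat", "cat", "dog", "the", "the", "the"]
    top = distill(tokens, doc_freq, num_docs=100, max_terms=2)
    print(build_query(top, key_field="id", key_value="42"))
```

Common words such as "the" get an idf near zero and drop out, which
keeps the query short. The required key clause is what narrows the
result to the single document being compared.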
If that doesn't meet your needs, then I'm not sure how to answer. I'm
approaching this like a Query/Scorer design question -- I assume that
you need not only to compare two documents, but that you need to do it
*more than once*. Is that right?