[KinoSearch] get doc/query similarity

Marvin Humphrey marvin at rectangular.com
Tue Apr 15 10:30:41 PDT 2008



> Ping? Still trying to compute similarity of two indexed docs... A  
> weighted cosine or some such.

I've started a reply to this several times, then balled it up and  
ashcanned it.  I understand what you want theoretically, and the  
document frequency and term frequency information is in the index and  
accessible, at least via private APIs.  The question is how to achieve  
whatever your end goal is efficiently and conveniently.
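
For concreteness, here is a minimal sketch of the kind of weighted cosine
under discussion. It is plain Python and does not use KinoSearch: the
function names, the smoothed idf formula, and the inputs (a token list per
document, a term-to-document-frequency map, and a total document count) are
all assumptions made for illustration. How those inputs would be pulled out
of an index is left open.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, num_docs):
    """Map each term to tf * idf.

    The smoothed idf used here, log((1 + N) / (1 + df)) + 1, is one
    common choice and an assumption of this sketch.
    """
    tf = Counter(tokens)
    return {
        term: count * (math.log((1 + num_docs) / (1 + doc_freq.get(term, 0))) + 1.0)
        for term, count in tf.items()
    }


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical statistics for a three-document collection.
doc_freq = {"apple": 2, "banana": 1, "cherry": 1}
num_docs = 3
doc_a = tfidf_vector(["apple", "banana"], doc_freq, num_docs)
doc_b = tfidf_vector(["apple", "cherry"], doc_freq, num_docs)
similarity = cosine(doc_a, doc_b)
```

The arithmetic is the easy part. Computed this way, each comparison touches
every term in both documents, which is why the number of comparisons matters
for the design.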

> Thanks for that example. Let me be more clear about what is desired:  
> I need to compute the similarity of two indexed documents.

Are you doing something akin to a "more like this" query?  What does  
the end-user API look like?

The brute-force way is to take the contents of a document, or possibly  
a distillation of the contents, use that as your query, hand it off to  
a Searcher, and see what the search gives back.  That gives you a bunch  
of docs, though -- not just one.  You can constrain the search by  
adding a "primary key"-type requirement, though performance of such a  
search might be a concern with large indexes due to the way KS  
compiles its queries.
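
One way to read "a distillation of the contents" is to keep only the
highest-weighted terms of the source document and use those as the query
terms. The sketch below is a guess at that idea in plain Python, not
KinoSearch code: `distill_terms`, its `limit` parameter, the smoothed idf,
and the sample statistics are all hypothetical.

```python
import math
from collections import Counter


def distill_terms(tokens, doc_freq, num_docs, limit=10):
    """Return up to `limit` terms ranked by tf * idf, highest first.

    Ties are broken alphabetically so the result is deterministic.
    """
    tf = Counter(tokens)
    weights = {
        term: count * (math.log((1 + num_docs) / (1 + doc_freq.get(term, 0))) + 1.0)
        for term, count in tf.items()
    }
    ranked = sorted(weights, key=lambda term: (-weights[term], term))
    return ranked[:limit]


# Hypothetical statistics: "the" occurs in every document, the others are rare.
doc_freq = {"the": 1000, "kinosearch": 3, "scorer": 5}
tokens = ["the", "the", "kinosearch", "the", "scorer", "the"]
query_terms = distill_terms(tokens, doc_freq, num_docs=1000, limit=2)
```

With these numbers the common word drops out despite its high term
frequency, and the two rare terms become the query. The resulting list would
still have to be turned into whatever query object the Searcher accepts.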

If that doesn't meet your needs, then I'm not sure how to answer.  I'm  
approaching this like a Query/Scorer design question -- I assume that  
you need not only to compare two documents, but that you need to do it  
*more than once*. Is that right?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



