I found a description here: http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/

If it is still the same four years later, it looks like Lucene does an index lookup for each important term in the example doc, boosting each term by its weight. My guess is that this is a little slower than a 2-3 word query but still scalable.
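For reference, a minimal sketch of driving it directly against an IndexReader, assuming the 3.x-era contrib class (the "body" field name and the tuning thresholds are just placeholders, not anything from the blog post):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.similar.MoreLikeThis;

    public class MltExample {
      // Find the n documents most similar to the document with internal id docId.
      public static TopDocs similarTo(IndexReader reader, int docId, int n) throws IOException {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] { "body" }); // placeholder field name
        mlt.setMinTermFreq(2);     // skip terms that are rare in the example doc
        mlt.setMinDocFreq(5);      // skip terms that are rare in the whole index
        mlt.setMaxQueryTerms(25);  // keep only the highest-weighted terms
        mlt.setBoost(true);        // boost each query term by its computed weight
        Query query = mlt.like(docId); // one boosted term per important term
        return new IndexSearcher(reader).search(query, n);
      }
    }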

Has anyone used this on a very large index?

Thanks,
Pat

On 3/11/12 10:45 AM, Pat Ferrel wrote:
MoreLikeThis looks exactly like what I need. I would probably create a new "like" method that takes a Mahout vector and builds a search? I build the vector by starting from a doc and reweighting certain terms. The prototype just reweights words, but I may experiment with Dirichlet clusters and reweighting an entire cluster of words so you could boost the importance of a topic in the results. Either way, the result of the algorithm would be a Mahout vector.
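Roughly what I have in mind for that "like(Vector)" method, as a sketch only (this is not an existing MoreLikeThis method; the dictionary that maps vector indices back to term strings is assumed to be the one used when the vectors were built, and setBoost-on-Query is the 3.x-era Lucene API):

    import java.util.Iterator;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.mahout.math.Vector;

    public class VectorQueryBuilder {
      // Turn a sparse Mahout term-weight vector into a boosted Lucene query.
      public static Query toQuery(Vector termWeights, String[] dictionary, String field) {
        BooleanQuery query = new BooleanQuery();
        Iterator<Vector.Element> it = termWeights.iterateNonZero();
        while (it.hasNext()) {
          Vector.Element e = it.next();
          TermQuery tq = new TermQuery(new Term(field, dictionary[e.index()]));
          tq.setBoost((float) e.get()); // the reweighted term weight becomes the boost
          query.add(tq, BooleanClause.Occur.SHOULD);
        }
        return query;
      }
    }

For a big vector you'd want to keep only the top-weighted terms, since BooleanQuery caps the number of clauses (1024 by default).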

Is there a description of how this works somewhere? Is it basically an index lookup? I always thought the Google feature used precalculated results (and it probably does). I'm curious, but mainly asking to see how fast it is.

Thanks
Pat

On 3/11/12 8:36 AM, Paul Libbrecht wrote:
Maybe that's exactly it, but... given a document with n tokens A and m tokens B, wouldn't a query A^n B^m find what you're looking for?

paul

PS I've always viewed queries as linear forms on the vector space, and I'd like to see this written out mathematically one day...
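Roughly, for the two-term example above (just the standard vector-space-model view, nothing Lucene-specific):

    \[
      \mathrm{score}_q(d) \;=\; \sum_{t} b_q(t)\, w_d(t) \;=\; \langle q, d \rangle,
      \qquad
      \cos(q, d) \;=\; \frac{\langle q, d \rangle}{\lVert q \rVert\, \lVert d \rVert}
    \]

so a query A^n B^m acts as the linear form d -> n*w_d(A) + m*w_d(B), and cosine similarity is the same form after normalizing both sides.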
On Mar 11, 2012, at 7:23 AM, Lance Norskog wrote:

Look at the MoreLikeThis feature in Lucene. I believe it does roughly
what you describe.

On Sat, Mar 10, 2012 at 9:58 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
I have a case where I'd like to get documents which most closely match a particular vector. Mahout's RowSimilarityJob is ideal for precalculating similarity between existing documents, but in my case the query is constructed at run time. So the UI constructs a vector to be used as a query. We have this running in a prototype using a run-time calculation of cosine similarity, but the implementation is not scalable to large doc stores.

One thought is to calculate fairly small clusters. The UI will know which cluster to target for the vector query. So we might be able to narrow down
the number of docs per query to a reasonable size.

It seems like a place for multiple hash functions maybe? Could we use some
kind of hack of the boost feature of Solr or some other approach?
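For the hash-function idea, something like random-hyperplane hashing for cosine similarity is what I was thinking of; a rough sketch using Mahout vectors (the class name, bit count, and seed are made up for illustration, and numBits must be at most 32 for an int signature):

    import java.util.Random;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    // Random-hyperplane LSH: vectors whose bit signatures collide are likely to
    // have high cosine similarity, so only those need the exact calculation.
    public class CosineLsh {
      private final Vector[] hyperplanes;

      public CosineLsh(int numBits, int dimension, long seed) {
        Random rnd = new Random(seed);
        hyperplanes = new Vector[numBits];
        for (int i = 0; i < numBits; i++) {
          Vector h = new DenseVector(dimension);
          for (int d = 0; d < dimension; d++) {
            h.setQuick(d, rnd.nextGaussian());
          }
          hyperplanes[i] = h;
        }
      }

      // One bit per hyperplane: the sign of the dot product with the vector.
      public int signature(Vector v) {
        int sig = 0;
        for (int i = 0; i < hyperplanes.length; i++) {
          if (hyperplanes[i].dot(v) >= 0) {
            sig |= 1 << i;
          }
        }
        return sig;
      }
    }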

Does anyone have a suggestion?


--
Lance Norskog
goks...@gmail.com
