Re: Vector based queries

2012-03-11 Thread Bill Bell
It is way too slow. On Mar 11, 2012, at 12:07 PM, Pat Ferrel wrote: > I found a description here: > http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/ > > If it is the same four years later, it looks like Lucene is doing an index > look

Re: Vector based queries

2012-03-11 Thread Pat Ferrel
I found a description here: http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/ If it is the same four years later, it looks like Lucene is doing an index lookup for each important term in the example doc, boosting each term based on the term weights. My guess would be that this
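
The term-by-term lookup with boosts described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how MoreLikeThis is actually implemented: the function name `boosted_query`, the `max_terms` cutoff, and the sample weights are all hypothetical. It keeps the highest-weighted terms and writes them in the `term^boost` form of the Lucene query syntax.

```python
def boosted_query(weights, max_terms=25):
    """Build a Lucene-style query string from the highest-weighted terms.

    weights: {term: weight} for the example document (hypothetical input).
    max_terms: how many of the top terms to keep (an assumed cutoff).
    """
    # Sort terms by descending weight and keep only the most important ones.
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:max_terms]
    # Emit each surviving term as term^boost; zero-weight terms are dropped.
    return " ".join(f"{term}^{weight:.2f}" for term, weight in top if weight > 0)


# Hypothetical term weights for one example document.
weights = {"lucene": 0.82, "index": 0.41, "the": 0.01, "boost": 0.6}
query_string = boosted_query(weights, max_terms=3)
print(query_string)
```

Terms containing query-syntax characters would need escaping before being sent to a real query parser; that step is left out here.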

Re: Vector based queries

2012-03-11 Thread Pat Ferrel
MoreLikeThis looks exactly like what I need. I would probably create a new "like" method to take a Mahout vector and build a search? I build the vector by starting from a doc and reweighting certain terms. The prototype just reweights words but I may experiment with Dirichlet clusters and rewei
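
A minimal sketch of the "start from a doc and reweight certain terms" step, assuming the document vector is available as a plain term-to-weight mapping. The `reweight` helper and the boost factors are hypothetical and stand in for whatever the prototype really does; nothing here uses the Mahout vector classes.

```python
def reweight(doc_vector, boosts, default=1.0):
    """Return a new {term: weight} dict with each weight scaled by its boost.

    Terms without an entry in boosts are multiplied by default (left unchanged
    when default is 1.0). The input vector is not modified.
    """
    return {term: w * boosts.get(term, default) for term, w in doc_vector.items()}


# Hypothetical document vector and boost factors chosen by the UI.
doc_vector = {"lucene": 0.4, "vector": 0.3, "the": 0.1}
query_vector = reweight(doc_vector, {"vector": 2.0, "the": 0.0})
print(query_vector)
```

The resulting mapping could then be handed to whatever builds the actual search, for example a variant of the "like" method mentioned above.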

Re: Vector based queries

2012-03-11 Thread Paul Libbrecht
Maybe that's exactly it but... given a document with n tokens A, and m tokens B, a query A^n B^m would find what you're looking for, wouldn't it? paul PS I've always viewed queries as linear forms on the vector space and I'd like to see this really mathematically written one day... On 11 March 2012 at 07:
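
One way to make the "queries as linear forms" remark concrete, as a simplified sketch only: treat the boosts of the query A^n B^m as coefficients and score a document by the weighted sum of its components. Real Lucene scoring also applies factors such as idf and length normalization, so this is an idealization, and the names `linear_score`, `d1`, and `d2` are hypothetical.

```python
from collections import Counter


def linear_score(query_weights, doc_vector):
    """Evaluate the query as a linear form: sum over terms of weight * component."""
    return sum(w * doc_vector.get(term, 0.0) for term, w in query_weights.items())


# A document with n = 3 tokens A and m = 2 tokens B gives the query A^3 B^2.
tokens = ["A", "A", "A", "B", "B"]
query_weights = Counter(tokens)

# Two hypothetical document vectors and their component-wise sum.
d1 = {"A": 1.0, "C": 4.0}
d2 = {"B": 2.0}
d_sum = {"A": 1.0, "B": 2.0, "C": 4.0}

s1 = linear_score(query_weights, d1)
s2 = linear_score(query_weights, d2)
s_sum = linear_score(query_weights, d_sum)
# Linearity: the score of the summed vector equals the sum of the scores.
print(s1, s2, s_sum)
```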

Re: Vector based queries

2012-03-10 Thread Lance Norskog
Look at the MoreLikeThis feature in Lucene. I believe it does roughly what you describe. On Sat, Mar 10, 2012 at 9:58 AM, Pat Ferrel wrote: > I have a case where I'd like to get documents which most closely match a > particular vector. The RowSimilarityJob of Mahout is ideal for > precalculating

Vector based queries

2012-03-10 Thread Pat Ferrel
I have a case where I'd like to get documents which most closely match a particular vector. The RowSimilarityJob of Mahout is ideal for precalculating similarity between existing documents but in my case the query is constructed at run time. So the UI constructs a vector to be used as a query.
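
For reference, the brute-force version of "documents which most closely match a particular vector" might look like the sketch below. It assumes cosine similarity as the measure and sparse vectors stored as plain dicts; the toy corpus and the names `cosine` and `rank` are hypothetical. It scores every document at query time, which is exactly the cost an index-based approach would try to avoid.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def rank(query, docs, top_n=10):
    """Score every document against the query; return the best (doc_id, score) pairs."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy corpus and a query vector built at run time.
docs = {
    "d1": {"lucene": 2.0, "index": 1.0},
    "d2": {"mahout": 3.0, "vector": 1.0},
    "d3": {"lucene": 1.0, "vector": 1.0},
}
query = {"lucene": 1.0, "vector": 0.5}
results = rank(query, docs, top_n=2)
print(results)
```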