It is way too slow.

Sent from my Mobile device
720-256-8076
On Mar 11, 2012, at 12:07 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> I found a description here:
> http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
>
> If it is the same four years later, it looks like Lucene is doing an index
> lookup for each important term in the example doc, boosting each term based
> on the term weights. My guess would be that this is a little slower than a
> 2-3 word query but still scalable.
>
> Has anyone used this on a very large index?
>
> Thanks,
> Pat
>
> On 3/11/12 10:45 AM, Pat Ferrel wrote:
>> MoreLikeThis looks exactly like what I need. I would probably create a new
>> "like" method that takes a Mahout vector and builds a search. I build the
>> vector by starting from a doc and reweighting certain terms. The prototype
>> just reweights words, but I may experiment with Dirichlet clusters and
>> reweighting an entire cluster of words so you could boost the importance
>> of a topic in the results. Still, the result of either algorithm would be
>> a Mahout vector.
>>
>> Is there a description of how this works somewhere? Is it basically an
>> index lookup? I always thought the Google feature used precalculated
>> results (and it probably does). I'm curious, but mainly asking to see how
>> fast it is.
>>
>> Thanks,
>> Pat
>>
>> On 3/11/12 8:36 AM, Paul Libbrecht wrote:
>>> Maybe that's exactly it, but... given a document with n tokens A and m
>>> tokens B, wouldn't a query A^n B^m find what you're looking for?
>>>
>>> paul
>>>
>>> PS: I've always viewed queries as linear forms on the vector space, and
>>> I'd like to see this written out mathematically one day...
>>>
>>> On Mar 11, 2012, at 07:23, Lance Norskog wrote:
>>>
>>>> Look at the MoreLikeThis feature in Lucene. I believe it does roughly
>>>> what you describe.
>>>>
>>>> On Sat, Mar 10, 2012 at 9:58 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>>> I have a case where I'd like to get the documents that most closely
>>>>> match a particular vector. Mahout's RowSimilarityJob is ideal for
>>>>> precalculating similarity between existing documents, but in my case
>>>>> the query is constructed at run time: the UI builds a vector to be
>>>>> used as a query. We have this running in a prototype using a run-time
>>>>> calculation of cosine similarity, but that implementation does not
>>>>> scale to large doc stores.
>>>>>
>>>>> One thought is to calculate fairly small clusters. The UI will know
>>>>> which cluster to target for the vector query, so we might be able to
>>>>> narrow the number of docs per query down to a reasonable size.
>>>>>
>>>>> It seems like a place for multiple hash functions, maybe? Could we use
>>>>> some kind of hack of the boost feature of Solr, or some other approach?
>>>>>
>>>>> Does anyone have a suggestion?
>>>>
>>>>
>>>> --
>>>> Lance Norskog
>>>> goks...@gmail.com
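
To make the mechanics concrete: what MoreLikeThis builds internally, and what
Paul's A^n B^m suggestion describes, is just a flat OR-query of boosted term
queries. Below is a minimal sketch of building such a query directly from a
term-weight map (e.g. pulled out of a Mahout vector), assuming the Lucene
3.x-era API of the time; the class name, the field name "body", and the
helper methods are illustrative, not part of Lucene or Mahout.

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Sketch only: turns a term -> weight map (e.g. extracted from a Mahout
// vector after reweighting) into one boosted OR-query, which is roughly
// what MoreLikeThis assembles internally from the source document.
public class VectorQueryBuilder {

  public static BooleanQuery buildLikeQuery(Map<String, Float> termWeights) {
    BooleanQuery query = new BooleanQuery();
    for (Map.Entry<String, Float> e : termWeights.entrySet()) {
      TermQuery tq = new TermQuery(new Term("body", e.getKey()));
      tq.setBoost(e.getValue());                 // per-term boost = vector weight
      query.add(tq, BooleanClause.Occur.SHOULD); // any term may match
    }
    return query;
  }

  public static TopDocs searchLike(IndexSearcher searcher,
                                   Map<String, Float> termWeights,
                                   int n) throws IOException {
    return searcher.search(buildLikeQuery(termWeights), n);
  }
}

Note that MoreLikeThis keeps this fast by capping the number of generated
query terms (maxQueryTerms, 25 by default) and filtering on minimum term and
document frequencies, so the resulting query is not much heavier than an
ordinary hand-typed one; a query built from a full, uncapped vector would be
correspondingly slower.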