It is way too slow

Sent from my Mobile device
720-256-8076

On Mar 11, 2012, at 12:07 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> I found a description here: 
> http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
> 
> If it is the same four years later, it looks like Lucene is doing an index 
> lookup for each important term in the example doc, boosting each term based on 
> its term weight. My guess would be that this is a little slower than a 2-3 word 
> query but still scalable.
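> 
> In Lucene terms I think that lookup comes down to something like the sketch 
> below (rough and untested; the field name "text" and the frequency thresholds 
> are just my assumptions):
> 
>   import java.io.IOException;
>   import org.apache.lucene.index.IndexReader;
>   import org.apache.lucene.search.IndexSearcher;
>   import org.apache.lucene.search.Query;
>   import org.apache.lucene.search.TopDocs;
>   import org.apache.lucene.search.similar.MoreLikeThis;
> 
>   public class MltSketch {
>     public static TopDocs similarDocs(IndexReader reader, int docId) throws IOException {
>       MoreLikeThis mlt = new MoreLikeThis(reader);
>       mlt.setFieldNames(new String[] {"text"}); // assumed content field
>       mlt.setMinTermFreq(2);  // drop terms that are rare in the source doc
>       mlt.setMinDocFreq(5);   // drop terms that are rare in the corpus
>       Query likeQuery = mlt.like(docId);        // boosted OR over the "important" terms
>       return new IndexSearcher(reader).search(likeQuery, 20);
>     }
>   }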
> 
> Has anyone used this on a very large index?
> 
> Thanks,
> Pat
> 
> On 3/11/12 10:45 AM, Pat Ferrel wrote:
>> MoreLikeThis looks exactly like what I need. I would probably create a new 
>> "like" method that takes a Mahout vector and builds a search from it. I build 
>> the vector by starting from a doc and reweighting certain terms. The prototype 
>> just reweights words, but I may experiment with Dirichlet clusters and 
>> reweighting an entire cluster of words so you could boost the importance of a 
>> topic in the results. Still, the result of either algorithm would be a Mahout 
>> vector.
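>> 
>> A very rough sketch of that "like" method (untested; the dictionary that maps 
>> Mahout term indexes back to strings and the "text" field are assumptions on my 
>> part):
>> 
>>   import java.util.Iterator;
>>   import org.apache.lucene.index.Term;
>>   import org.apache.lucene.search.BooleanClause;
>>   import org.apache.lucene.search.BooleanQuery;
>>   import org.apache.lucene.search.Query;
>>   import org.apache.lucene.search.TermQuery;
>>   import org.apache.mahout.math.Vector;
>> 
>>   public class VectorQueryBuilder {
>>     // dictionary[i] holds the term whose Mahout index is i (hypothetical lookup)
>>     public static Query like(Vector v, String[] dictionary) {
>>       BooleanQuery query = new BooleanQuery();
>>       Iterator<Vector.Element> it = v.iterateNonZero();
>>       while (it.hasNext()) {
>>         Vector.Element e = it.next();
>>         TermQuery tq = new TermQuery(new Term("text", dictionary[e.index()]));
>>         tq.setBoost((float) e.get()); // the reweighted term weight becomes the boost
>>         query.add(tq, BooleanClause.Occur.SHOULD);
>>       }
>>       return query;
>>     }
>>   }
>> 
>> Large vectors would probably need BooleanQuery.setMaxClauseCount() raised above 
>> the default of 1024.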
>> 
>> Is there a description of how this works somewhere? Is it basically an index 
>> lookup? I always thought the Google feature used precalculated results (and 
>> it probably does). I'm curious, but mainly asking to see how fast it is.
>> 
>> Thanks
>> Pat
>> 
>> On 3/11/12 8:36 AM, Paul Libbrecht wrote:
>>> Maybe that's exactly it but... given a document with n tokens of A and m 
>>> tokens of B, wouldn't a query A^n B^m find what you're looking for?
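>>> 
>>> Concretely, for a document with 3 occurrences of "mahout" and 2 of "lucene" 
>>> that would just be the boosted-term query below (a sketch only; the field 
>>> name and analyzer are whatever your index actually uses):
>>> 
>>>   import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>   import org.apache.lucene.queryParser.ParseException;
>>>   import org.apache.lucene.queryParser.QueryParser;
>>>   import org.apache.lucene.search.Query;
>>>   import org.apache.lucene.util.Version;
>>> 
>>>   public class BoostedTermQuery {
>>>     public static Query build() throws ParseException {
>>>       QueryParser parser =
>>>           new QueryParser(Version.LUCENE_35, "text", new StandardAnalyzer(Version.LUCENE_35));
>>>       return parser.parse("mahout^3 lucene^2"); // A^n B^m with n=3, m=2
>>>     }
>>>   }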
>>> 
>>> paul
>>> 
>>> PS I've always viewed queries as linear forms on the vector space and I'd 
>>> like to see this written down mathematically one day...
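>>> Roughly what I mean (my own shorthand, ignoring Lucene's length and idf 
>>> normalizations): if a document is the term-weight vector d = (w_d(t))_t, then 
>>> a query with boosts b_t acts as the linear functional
>>> 
>>>   score(q, d) = \sum_{t \in q} b_t \, w_d(t)
>>> 
>>> i.e. a dot product of the boost vector with the document vector.
>>> 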
>>> On Mar 11, 2012, at 07:23, Lance Norskog wrote:
>>> 
>>>> Look at the MoreLikeThis feature in Lucene. I believe it does roughly
>>>> what you describe.
>>>> 
>>>> On Sat, Mar 10, 2012 at 9:58 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>>> I have a case where I'd like to get the documents that most closely match a
>>>>> particular vector. Mahout's RowSimilarityJob is ideal for precalculating
>>>>> similarity between existing documents, but in my case the query is
>>>>> constructed at run time: the UI builds a vector to be used as a query. We
>>>>> have this running in a prototype using a run-time calculation of cosine
>>>>> similarity, but the implementation is not scalable to large doc stores.
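>>>>> 
>>>>> The prototype's scoring is essentially the brute-force cosine below (a
>>>>> sketch; it assumes all doc vectors are in memory, which is exactly the
>>>>> part that doesn't scale):
>>>>> 
>>>>>   import org.apache.mahout.math.Vector;
>>>>> 
>>>>>   public class BruteForceScorer {
>>>>>     /** cosine similarity between the query vector and one doc vector */
>>>>>     public static double cosine(Vector query, Vector doc) {
>>>>>       double denom = query.norm(2) * doc.norm(2);
>>>>>       return denom == 0.0 ? 0.0 : query.dot(doc) / denom;
>>>>>     }
>>>>>   }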
>>>>> 
>>>>> One thought is to calculate fairly small clusters. The UI will know which
>>>>> cluster to target for the vector query. So we might be able to narrow down
>>>>> the number of docs per query to a reasonable size.
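>>>>> 
>>>>> On top of the brute-force scorer above, the narrowing might look roughly
>>>>> like this (the clusterOf lookup and the in-memory doc map are assumptions,
>>>>> not anything we have built):
>>>>> 
>>>>>   import java.util.HashMap;
>>>>>   import java.util.Map;
>>>>>   import org.apache.mahout.math.Vector;
>>>>> 
>>>>>   public class ClusterScopedScorer {
>>>>>     /** score only the docs assigned to the query's target cluster */
>>>>>     public static Map<String, Double> score(Vector query, String targetCluster,
>>>>>                                              Map<String, Vector> docs,
>>>>>                                              Map<String, String> clusterOf) {
>>>>>       Map<String, Double> scores = new HashMap<String, Double>();
>>>>>       for (Map.Entry<String, Vector> e : docs.entrySet()) {
>>>>>         if (!targetCluster.equals(clusterOf.get(e.getKey()))) {
>>>>>           continue; // skip docs outside the target cluster
>>>>>         }
>>>>>         double denom = query.norm(2) * e.getValue().norm(2);
>>>>>         scores.put(e.getKey(), denom == 0.0 ? 0.0 : query.dot(e.getValue()) / denom);
>>>>>       }
>>>>>       return scores;
>>>>>     }
>>>>>   }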
>>>>> 
>>>>> It seems like a place for multiple hash functions, maybe? Could we use some
>>>>> kind of hack of Solr's boost feature, or some other approach?
>>>>> 
>>>>> Does anyone have a suggestion?
>>>> 
>>>> 
>>>> -- 
>>>> Lance Norskog
>>>> goks...@gmail.com
>>> 
