Maybe that's exactly it, but... given a document with n tokens A and m tokens
B, wouldn't the query A^n B^m find what you're looking for?
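
To make that concrete, here is a minimal sketch of turning a term-count vector
into such a query string. It assumes the Lucene query-parser syntax
`term^boost`; the function name `boosted_query` and the field name `text` are
made up for illustration, and terms containing query-syntax characters would
still need escaping.

```python
def boosted_query(term_counts, field="text"):
    """Build a Lucene-style query string, boosting each term by its count.

    `field` is a hypothetical field name; adjust it to the actual schema.
    """
    clauses = []
    # Sort terms so the generated query string is deterministic.
    for term, count in sorted(term_counts.items()):
        if count <= 0:
            continue  # a zero or negative weight contributes nothing, skip it
        clauses.append(f"{field}:{term}^{count}")
    # Clauses are space-separated, which the query parser treats as optional
    # (OR) clauses when OR is the default operator.
    return " ".join(clauses)


print(boosted_query({"A": 3, "B": 2}))  # text:A^3 text:B^2
```

Whether the resulting ranking is close enough to cosine similarity depends on
how the engine weights and normalizes terms, so this is only a starting point.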

paul

PS: I've always viewed queries as linear forms on the vector space, and I'd
like to see this written out properly in mathematical terms one day...
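
As a rough sketch of what I mean (an idealized picture, with symbols I am
introducing here: d is the document's term vector, q the query's weight vector,
and e_A, e_B the basis vectors for terms A and B), the query acts on documents
as the linear form

```latex
\[
  s_q(d) \;=\; \langle q, d \rangle \;=\; \sum_i q_i \, d_i ,
  \qquad
  \cos(q, d) \;=\; \frac{\langle q, d \rangle}{\lVert q \rVert \, \lVert d \rVert} .
\]
% For the query A^n B^m, take q = n e_A + m e_B, which gives
\[
  s_q(d) \;=\; n \, d_A + m \, d_B .
\]
```

Lucene's actual scoring also applies its own term weighting and normalization,
so this should be read as the underlying idea, not as the exact formula.
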
On 11 March 2012 at 07:23, Lance Norskog wrote:

> Look at the MoreLikeThis feature in Lucene. I believe it does roughly
> what you describe.
> 
> On Sat, Mar 10, 2012 at 9:58 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> I have a case where I'd like to get documents which most closely match a
>> particular vector. The RowSimilarityJob of Mahout is ideal for
>> precalculating similarity between existing documents but in my case the
>> query is constructed at run time. So the UI constructs a vector to be used
>> as a query. We have this running in prototype using a run time calculation
>> of cosine similarity but the implementation is not scalable to large doc
>> stores.
>> 
>> One thought is to calculate fairly small clusters. The UI will know which
>> cluster to target for the vector query. So we might be able to narrow down
>> the number of docs per query to a reasonable size.
>> 
>> It seems like a place for multiple hash functions maybe? Could we use some
>> kind of hack of the boost feature of Solr or some other approach?
>> 
>> Does anyone have a suggestion?
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com
