I have a case where I'd like to retrieve the documents that most closely match a particular vector. Mahout's RowSimilarityJob is ideal for precalculating similarity between existing documents, but in my case the query is constructed at run time: the UI builds a vector to be used as the query. We have this running in a prototype that computes cosine similarity at run time, but the implementation does not scale to large doc stores.
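For reference, the run-time approach amounts to something like the sketch below. This is not our actual prototype code; the sparse {term: weight} dict representation and the function names are assumptions made only for illustration. It shows why the cost grows linearly with the size of the doc store: every document is scored on every query.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_brute_force(query, docs, k=10):
    """Score every document against the query: O(number of docs) per query."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in docs.items())
    return heapq.nlargest(k, scored)
```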
One thought is to calculate fairly small clusters. The UI will know 
which cluster to target for the vector query. So we might be able to 
narrow down the number of docs per query to a reasonable size.
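A minimal sketch of the cluster idea, assuming the clusters and their centroids have already been computed offline (for example by a k-means job). The names `centroids`, `members` and `probes` are hypothetical. Here the target cluster is chosen by comparing the query to each centroid; if the UI already knows the cluster, that step can be skipped and only the final scoring is needed.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_in_nearest_clusters(query, centroids, members, docs, k=10, probes=1):
    """Score only the docs that belong to the `probes` clusters nearest the query.

    centroids: {cluster_id: centroid vector}
    members:   {cluster_id: list of doc ids in that cluster}
    docs:      {doc_id: document vector}
    """
    # Rank clusters by similarity between the query and each centroid.
    ranked = sorted(centroids,
                    key=lambda c: cosine(query, centroids[c]),
                    reverse=True)
    # Collect candidate docs from the best clusters only.
    candidates = [d for c in ranked[:probes] for d in members[c]]
    # Exact cosine scoring, but over a much smaller candidate set.
    scored = sorted(((cosine(query, docs[d]), d) for d in candidates),
                    reverse=True)
    return scored[:k]
```

Probing more than one cluster (`probes` > 1) trades some speed for a lower chance of missing a good match that sits just across a cluster boundary.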
This seems like it might be a place for multiple hash functions. Could we
hack Solr's boost feature somehow, or use some other approach?
Does anyone have a suggestion?
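One reading of the "multiple hash functions" idea is locality-sensitive hashing with random hyperplanes, which is a standard technique for cosine similarity. The sketch below is an illustration under assumptions, not a Mahout or Solr API: the class name `HyperplaneLSH`, the dense list-of-floats vectors, and the `bits` and `tables` defaults are all made up for the example. Each table hashes a vector to the pattern of signs of its dot products with a few random hyperplanes, so vectors separated by a small angle tend to land in the same bucket in at least one table.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Random-hyperplane LSH index for cosine similarity (illustrative only)."""

    def __init__(self, dim, bits=8, tables=4, seed=0):
        rng = random.Random(seed)
        # One independent set of `bits` random hyperplanes per hash table.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]
            for _ in range(tables)
        ]
        self.buckets = [defaultdict(set) for _ in range(tables)]

    @staticmethod
    def _signature(planes, vec):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in planes
        )

    def add(self, doc_id, vec):
        """Index a document vector in every table."""
        for planes, table in zip(self.planes, self.buckets):
            table[self._signature(planes, vec)].add(doc_id)

    def candidates(self, vec):
        """Return doc ids that share a bucket with `vec` in at least one table."""
        found = set()
        for planes, table in zip(self.planes, self.buckets):
            found |= table.get(self._signature(planes, vec), set())
        return found
```

The candidate set would then be re-ranked with exact cosine similarity. More bits per table makes buckets smaller and faster to scan; more tables reduces the chance of missing a true near neighbour.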
