Hi,

We are in the process of applying a scoring model to our search results. In
particular, we would like to add scores for documents per query and user
context.

For example, we want to assign scores from 500 down to 1 to the top 500
documents for the query “dog” for users who speak US English.

We believe storing these scores in Solr is infeasible, because we want to
update them regularly and the number of scores grows rapidly as we add user
attributes.

One solution we explored was to store these scores in a secondary data
store and use them at Solr query time with a boost function such as:

`bf=mul(termfreq(id,'ID-1'),500) mul(termfreq(id,'ID-2'),499) …
mul(termfreq(id,'ID-500'),1)`
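For concreteness, this is roughly how we would assemble that bf parameter at
query time. It is only a sketch: the secondary store is mocked as an
in-memory dict, and the collection URL, field names, and context key are
placeholders.

import requests

SOLR_SELECT = "http://localhost:8983/solr/docs/select"  # illustrative collection

# Mocked secondary store: (query, user_context) -> doc ids ordered by value,
# highest first. In reality this would be a lookup against our key-value store.
SCORE_STORE = {
    ("dog", "en-US"): ["ID-1", "ID-2", "ID-3"],  # ... up to 500 ids
}

def build_bf(query, user_context):
    ids = SCORE_STORE.get((query, user_context), [])
    n = len(ids)
    # score n for the first id, n-1 for the next, ..., 1 for the last
    terms = ["mul(termfreq(id,'%s'),%d)" % (doc_id, n - i)
             for i, doc_id in enumerate(ids)]
    return " ".join(terms)

def search(query, user_context, rows=10):
    params = {
        "defType": "edismax",
        "q": query,
        "bf": build_bf(query, user_context),
        "rows": rows,
        "fl": "id,score",
    }
    return requests.get(SOLR_SELECT, params=params).json()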

We have over a hundred thousand documents in one Solr collection, and about
fifty million in another. We have some queries that match roughly 80% of
the documents in a collection, although this is an edge case. We wanted to
know the worst-case performance, so we tested with such a query. For both
collections we found a message similar to the following in the SolrCloud
logs (tested on a laptop):

Elapsed time: 5020. Exceeded allowed search time: 5000 ms.

We then tried using the following boost, which seemed simpler:

`boost=if(query($qq), 10, 1)&qq=id:(ID-1 OR ID-2 OR … OR ID-500)`
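For reference, we build the qq parameter the same way; a minimal sketch
(again with placeholder ids, and assuming the edismax parser so that boost
is applied multiplicatively):

def build_qq(doc_ids):
    # produces e.g. "id:(ID-1 OR ID-2 OR ... OR ID-500)"
    return "id:(%s)" % " OR ".join(doc_ids)

params = {
    "defType": "edismax",
    "q": "dog",
    "boost": "if(query($qq),10,1)",
    "qq": build_qq(["ID-1", "ID-2", "ID-3"]),
}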

We then saw the following in the SolrCloud logs:

`The request took too long to iterate over terms.`

All responses above took over 5000 milliseconds to return.

We are considering Solr’s re-ranker, but we don’t know how we would use it
without pushing all the query-context-document scores to Solr.
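Our current understanding is that using the re-rank parser would look
roughly like the sketch below, with the per-document scores still having to
travel with every request as term boosts in the rerank query (reRankDocs
and the weights are placeholders):

def build_rerank_params(scored_ids):
    # scored_ids: list of (doc_id, score) pairs from our secondary store
    rqq = "id:(%s)" % " OR ".join(
        "%s^%d" % (doc_id, score) for doc_id, score in scored_ids)
    return {
        "rq": "{!rerank reRankQuery=$rqq reRankDocs=1000 reRankWeight=1}",
        "rqq": rqq,
    }

params = {"q": "dog", "rows": 10, "fl": "id,score"}
params.update(build_rerank_params([("ID-1", 500), ("ID-2", 499)]))  # ... up to 500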


The alternative solution we are currently considering involves issuing
multiple Solr queries.

First, we would make a request to Solr to fetch the top N results (id,
score) for the query, e.g. q=dog, fq=featureA:foo, fq=featureB:bar, limit=N.

A second request would use a filter query with the set of doc ids that we
know are high value for the user’s query, e.g. q=*:*, fq=featureA:foo,
fq=featureB:bar, fq=id:(d1 OR d2 OR d3), limit=N.

We would then do a reranking phase in our service layer.
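A rough sketch of that flow, with the service-layer rerank reduced to a toy
weighted merge (the filter fields, boost weight, and high-value id lookup
are all placeholders):

import requests

SOLR_SELECT = "http://localhost:8983/solr/docs/select"  # illustrative collection

def solr_query(q, fqs, rows):
    params = {"q": q, "fq": fqs, "rows": rows, "fl": "id,score"}
    docs = requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]
    return {d["id"]: d.get("score", 0.0) for d in docs}

def search(query, high_value_ids, n=100, boost=10.0):
    base_fqs = ["featureA:foo", "featureB:bar"]  # placeholder user-context filters

    # Request 1: organic top-N for the query
    organic = solr_query(query, base_fqs, rows=n)

    # Request 2: the known high-value docs that satisfy the same filters
    id_fq = "id:(%s)" % " OR ".join(high_value_ids)
    curated = solr_query("*:*", base_fqs + [id_fq], rows=n)

    # Service-layer rerank: here just a weighted merge of the two result sets
    merged = dict(organic)
    for doc_id in curated:
        merged[doc_id] = merged.get(doc_id, 0.0) + boost
    return sorted(merged, key=merged.get, reverse=True)[:n]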

Do you have any suggestions for known patterns of how we can store and
retrieve scores per user context and query?

Regards,
Ash & Spirit.
