I didn’t respond because it seemed like you were stuck on an approach that
would never be efficient in Solr. It requires massive amounts of data applied
to documents in a fine-grained way. Maybe it makes the math easier, but the
data management is impractical. I could not see any way to make that fast.

Here are four alternative approaches.

1. Instead of a 1-500 scale, use a 0/1 scale. Either this result is good for
this query term or not. That can be implemented with a single multivalued
field containing the query terms. If the query matches that field, it gets a
boost.
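
For example (a sketch; the field name is invented), define a multivalued
string field in the schema:

  <field name="good_for_queries" type="string" indexed="true" stored="false" multiValued="true"/>

Index the known-good query terms into that field for each document, then add
an edismax boost query:

  bq=good_for_queries:"dog"^10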

2. How much of that score is really different between query terms? Split out
a common document score that is independent of the query term. Use the boost
parameter with that document quality score and see how close it is to the
ideal ranking.
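
With edismax, that could be a multiplicative boost on a per-document quality
field (a sketch; doc_quality is an invented field name):

  q=dog&defType=edismax&boost=field(doc_quality)

The boost parameter multiplies the relevance score by the function value, so
a single document-level quality signal applies the same way to every query.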

3. Group your queries and documents into categories. Give each document a
score for each category. That could be boolean (in the category or not) or a
quality score for that category. Those scores can be stored in dynamic
fields: topic_score_1, topic_score_42, etc. A query for topic 42 fetches the
matching field. We did this for three different sets of topics, each with
thousands of categories. It was a ton of fields, but it ran really fast.
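
A sketch of that setup (names and field type are illustrative):

  <dynamicField name="topic_score_*" type="pfloat" docValues="true" indexed="true" stored="true"/>

A query for topic 42 then boosts on the matching field:

  q=dog&defType=edismax&boost=field(topic_score_42)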

You can categorize the query by seeing which documents it matches. Check the
category memberships of the first k results and choose the top-scoring
category. This is a kNN (k Nearest Neighbors) classifier. Then take that
category and run a second query using the category scores.
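
The two passes might look like this (a sketch; the categories field and k=20
are invented). First fetch the category memberships of the top k results and
tally them in your service layer:

  q=dog&fl=id,categories&rows=20

Then re-run the query, boosting with the winning category's score:

  q=dog&defType=edismax&boost=field(topic_score_42)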

4. Pre-calculate the top 50 results for each category with the slow algorithm
and use the elevate component to force that ranking for that term.
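
The mapping lives in elevate.xml; a sketch with invented document ids:

  <elevate>
    <query text="dog">
      <doc id="ID-1"/>
      <doc id="ID-2"/>
      <!-- ... through the precalculated top 50 -->
    </query>
  </elevate>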

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 18, 2020, at 9:27 PM, Ashwin Ramesh <ash...@canva.com.INVALID> wrote:
> 
> ping on this :)
> 
> On Tue, Feb 18, 2020 at 11:50 AM Ashwin Ramesh <ash...@canva.com> wrote:
> 
>> Hi,
>> 
>> We are in the process of applying a scoring model to our search results.
>> In particular, we would like to add scores for documents per query and user
>> context.
>> 
>> For example, we want to have a score from 500 to 1 for the top 500
>> documents for the query “dog” for users who speak US English.
>> 
>> We believe it becomes infeasible to store these scores in Solr because we
>> want to update the scores regularly, and the number of scores increases
>> rapidly with increased user attributes.
>> 
>> One solution we explored was to store these scores in a secondary data
>> store, and use this at Solr query time with a boost function such as:
>> 
>> `bf=mul(termfreq(id,'ID-1'),500) mul(termfreq(id,'ID-2'),499) …
>> mul(termfreq(id,'ID-500'),1)`
>> 
>> We have over a hundred thousand documents in one Solr collection, and
>> about fifty million in another Solr collection. We have some queries for
>> which roughly 80% of the results match, although this is an edge case. We
>> wanted to know the worst case performance, so we tested with such a query.
>> For both of these collections we found a message similar to the
>> following in the Solr cloud logs (tested on a laptop):
>> 
>> Elapsed time: 5020. Exceeded allowed search time: 5000 ms.
>> 
>> We then tried using the following boost, which seemed simpler:
>> 
>> `boost=if(query($qq), 10, 1)&qq=id:(ID-1 OR ID-2 OR … OR ID-500)`
>> 
>> We then saw the following in the Solr cloud logs:
>> 
>> `The request took too long to iterate over terms.`
>> 
>> All responses above took over 5000 milliseconds to return.
>> 
>> We are considering Solr’s re-ranker, but I don’t know how we would use
>> this without pushing all the query-context-document scores to Solr.
>> 
>> 
>> The alternative solution that we are currently considering involves
>> invoking multiple solr queries.
>> 
>> This means we would make a request to solr to fetch the top N results (id,
>> score) for the query. E.g. q=dog, fq=featureA:foo, fq=featureB:bar, limit=N.
>> 
>> Another request would be made using a filter query with a set of doc ids
>> that we know are high value for the user’s query. E.g. q=*:*,
>> fq=featureA:foo, fq=featureB:bar, fq=id:(d1, d2, d3), limit=N.
>> 
>> We would then do a reranking phase in our service layer.
>> 
>> Do you have any suggestions for known patterns of how we can store and
>> retrieve scores per user context and query?
>> 
>> Regards,
>> Ash & Spirit.
>> 