[GitHub] [lucene] jpountz opened a new pull request, #12457: Improve MaxScoreBulkScorer partitioning logic.

via GitHub Mon, 24 Jul 2023 05:37:51 -0700


jpountz opened a new pull request, #12457:
URL: https://github.com/apache/lucene/pull/12457


   Partitioning scorers is an optimization problem: the optimal set of 
non-essential scorers is the subset of scorers that maximizes the sum of costs, 
subject to its sum of max window scores being less than the minimum competitive 
score.
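
   As a rough formalization of the statement above (the symbols are 
assumptions introduced here, not notation from the PR): let scorer `i` have max 
window score `m_i` and cost `c_i`, and let `theta` be the minimum competitive 
score. The set `N` of non-essential scorers would then solve:

```latex
\max_{N \subseteq \{1,\dots,n\}} \; \sum_{i \in N} c_i
\quad \text{subject to} \quad
\sum_{i \in N} m_i < \theta
```

   Read this way, the problem has the shape of a 0/1 knapsack with costs as 
values and max scores as weights; the essential scorers are the complement of 
`N`.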
   
   The current approach sorts scorers by maximum score within the window and 
takes as non-essential clauses the longest prefix of scorers whose sum of max 
scores is less than the minimum competitive score, i.e. you cannot have a 
competitive hit by matching only non-essential clauses.
   
   This sorting logic works well in the common case when costs are inversely 
correlated with maximum scores, and it gives an optimal solution: the above 
algorithm will also maximize the cost of non-essential clauses and thus 
minimize the cost of essential clauses, in turn further improving query 
runtimes. But this isn't true for all queries. E.g. fuzzy queries compute 
scores based on artificial term statistics, so costs are no longer inversely 
correlated with maximum scores. This was especially visible with the query 
`titel~2` on the wikipedia dataset, as `title` matches this query and is a 
high-frequency term. Yet the score contribution of this term is of the same 
order as the contribution of most other terms, so query runtime improves a lot 
if this clause is considered non-essential rather than essential.
   
   This commit optimizes the partitioning logic a bit by sorting clauses by 
`max_score / cost` instead of just `max_score`. This will not change anything 
in the common case when max scores are inversely correlated with costs, but can 
significantly help otherwise. E.g. `titel~2` went from 41ms to 13ms on my 
machine and the wikimedium10m dataset with this change.
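
   To illustrate the effect of the sort key, here is a minimal, self-contained 
sketch. It is not the code from this PR: `PartitionSketch`, `Clause`, 
`nonEssential` and the numbers in `main` are all hypothetical, and the sketch 
assumes the non-essential set is the longest sorted prefix whose summed max 
score stays below the minimum competitive score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class PartitionSketch {
  /** Hypothetical stand-in for a scorer: its max score in the window and its cost. */
  static final class Clause {
    final String name;
    final double maxScore;
    final long cost;

    Clause(String name, double maxScore, long cost) {
      this.name = name;
      this.maxScore = maxScore;
      this.cost = cost;
    }
  }

  /** Sort key described as the current approach: max score only. */
  static final Comparator<Clause> BY_MAX_SCORE =
      Comparator.comparingDouble((Clause c) -> c.maxScore);

  /** Sort key described as the proposed approach: max score per unit of cost. */
  static final Comparator<Clause> BY_MAX_SCORE_PER_COST =
      Comparator.comparingDouble((Clause c) -> c.maxScore / c.cost);

  /**
   * Sorts a copy of the clauses and returns the longest prefix whose summed
   * max score is strictly less than minCompetitiveScore (assumed greedy rule).
   */
  static List<Clause> nonEssential(
      List<Clause> clauses, Comparator<Clause> order, double minCompetitiveScore) {
    List<Clause> sorted = new ArrayList<>(clauses);
    sorted.sort(order);
    List<Clause> result = new ArrayList<>();
    double sum = 0;
    for (Clause c : sorted) {
      if (sum + c.maxScore >= minCompetitiveScore) {
        break; // adding this clause could produce a competitive hit
      }
      sum += c.maxScore;
      result.add(c);
    }
    return result;
  }

  static long totalCost(List<Clause> clauses) {
    long total = 0;
    for (Clause c : clauses) {
      total += c.cost;
    }
    return total;
  }

  public static void main(String[] args) {
    // Made-up numbers: one frequent term whose max score is close to the rare ones.
    List<Clause> clauses = Arrays.asList(
        new Clause("title", 2.0, 1_000_000L),
        new Clause("rare1", 1.5, 1_000L),
        new Clause("rare2", 1.8, 2_000L));
    double minCompetitiveScore = 2.5;

    // Cost moved to the non-essential side under each ordering.
    System.out.println(totalCost(nonEssential(clauses, BY_MAX_SCORE, minCompetitiveScore)));
    // prints 1000
    System.out.println(totalCost(nonEssential(clauses, BY_MAX_SCORE_PER_COST, minCompetitiveScore)));
    // prints 1000000
  }
}
```

   With these made-up inputs, sorting by max score alone leaves the expensive 
`title` clause essential, while sorting by `max_score / cost` makes it 
non-essential, so the essential clauses left to iterate have a total cost of 
3000 instead of 1002000.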
   
   Relates #12456


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

