jpountz opened a new pull request, #12457: URL: https://github.com/apache/lucene/pull/12457
Partitioning scorers is an optimization problem: among all subsets of scorers whose sum of max window scores is less than the minimum competitive score, the optimal set of non-essential scorers is the one that maximizes the sum of costs.

The current approach sorts scorers by maximum score within the window and takes as non-essential clauses the first scorers whose sum of max scores is less than the minimum competitive score, i.e. you cannot have a competitive hit by matching only non-essential clauses. This sorting logic works well in the common case when costs are inversely correlated with maximum scores, and it gives an optimal solution there: the above algorithm also maximizes the cost of non-essential clauses and thus minimizes the cost of essential clauses, in turn further improving query runtimes.

But this isn't true for all queries. E.g. fuzzy queries compute scores based on artificial term statistics, so costs are no longer inversely correlated with maximum scores. This was especially visible with the query `titel~2` on the wikipedia dataset, as `title` matches this query and is a high-frequency term. Yet the score contribution of this term is of the same order as the contribution of most other terms, so query runtime improves a lot if this clause is considered non-essential rather than essential.

This commit optimizes the partitioning logic a bit by sorting clauses by `max_score / cost` instead of just `max_score`. This does not change anything in the common case when max scores are inversely correlated with costs, but can significantly help otherwise. E.g. `titel~2` went from 41ms to 13ms on my machine with the wikimedium10m dataset with this change.

Relates #12456
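To illustrate the difference between the two sort orders, here is a minimal sketch of the greedy partitioning described above. It is not the code from this PR: `Clause`, `PartitionSketch` and the numbers in the usage note are hypothetical stand-ins, and the real scorer handles float rounding and per-window bookkeeping that this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a scorer; these are not Lucene classes. */
final class Clause {
  final String name;
  final float maxWindowScore; // upper bound of the score within the current window
  final long cost;            // estimated number of matching documents

  Clause(String name, float maxWindowScore, long cost) {
    this.name = name;
    this.maxWindowScore = maxWindowScore;
    this.cost = cost;
  }
}

final class PartitionSketch {
  /** Previous order: lowest max score first. */
  static final Comparator<Clause> BY_MAX_SCORE =
      Comparator.comparingDouble(c -> c.maxWindowScore);

  /** Proposed order: lowest max_score / cost first. */
  static final Comparator<Clause> BY_MAX_SCORE_OVER_COST =
      Comparator.comparingDouble(c -> (double) c.maxWindowScore / Math.max(1L, c.cost));

  /**
   * Sorts the clauses with the given order and returns the longest prefix whose
   * max scores sum to less than minCompetitiveScore (the non-essential clauses).
   */
  static List<Clause> nonEssential(
      List<Clause> clauses, Comparator<Clause> order, float minCompetitiveScore) {
    List<Clause> sorted = new ArrayList<>(clauses);
    sorted.sort(order);
    List<Clause> result = new ArrayList<>();
    double sumOfMaxScores = 0;
    for (Clause clause : sorted) {
      sumOfMaxScores += clause.maxWindowScore;
      if (sumOfMaxScores >= minCompetitiveScore) {
        break; // matching only the prefix plus this clause could be competitive
      }
      result.add(clause);
    }
    return result;
  }

  /** Sum of costs, used to compare two partitions. */
  static long totalCost(List<Clause> clauses) {
    long total = 0;
    for (Clause clause : clauses) {
      total += clause.cost;
    }
    return total;
  }
}
```

With made-up clauses `title` (max score 2.0, cost 1,000,000), `titel` (1.8, cost 50) and `tiel` (1.9, cost 20) and a minimum competitive score of 3.0, `BY_MAX_SCORE` makes only `titel` non-essential, so the expensive `title` clause stays essential. `BY_MAX_SCORE_OVER_COST` puts `title` first and makes it the non-essential clause, leaving only the two cheap clauses as essential.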