jpountz commented on PR #12444: URL: https://github.com/apache/lucene/pull/12444#issuecomment-1637854931
Here is a table similar to the one above, but with low-cardinality clauses instead of high-cardinality clauses, in order to show how the overhead of the bitset manifests:

```
OrLow2: rivers sequence
OrLow3: rivers sequence opposite
OrLow4: rivers sequence opposite aug
OrLow6: rivers sequence opposite aug ross bronze
OrLow8: rivers sequence opposite aug ross bronze extension factor
OrLow12: rivers sequence opposite aug ross bronze extension factor migration maintained norwegian visited
OrLow16: rivers sequence opposite aug ross bronze extension factor migration maintained norwegian visited korean argentina developing billion
```

| Task | BooleanScorer | WANDScorer within DefaultBulkScorer | MaxScoreBulkScorer (main) | MaxScoreBulkScorer (patch) |
| -- | -- | -- | -- | -- |
| OrLow2 | 283.3 | 353.0 | 427.2 🔶 | 398.1 🔷 |
| OrLow3 | 210.3 | 278.6 🔶 | 270.0 | 220.1 🔷 |
| OrLow4 | 171.7 | 198.3 🔶 | 190.0 | 163.5 🔷 |
| OrLow6 | 124.5 | 114.7 🔶 | 112.3 | 108.5 🔷 |
| OrLow8 | 97.3 | 77.5 🔶 | 77.1 | 81.6 🔷 |
| OrLow12 | 68.2 | 44.7 🔶 | 50.1 | 56.5 🔷 |
| OrLow16 | 52.3 | 31.1 🔶 | 36.0 | 42.6 🔷 |

With high-frequency clauses, `MaxScoreBulkScorer` was consistently better in this PR than in the main branch. With low-frequency clauses, this now only holds for queries with 8 clauses or more. Also, WAND performs faster than MAXSCORE here with fewer than 8 clauses.

I'd like to avoid going too far in picking the optimal implementation based on the query, which could get quite messy. Maybe we could introduce simple heuristics in a follow-up, such as only using the bulk scorer if the cost is high enough that we'd expect more than X matches per 2048-bit window on average.

In general, this new `MaxScoreBulkScorer` feels like the best option to me, as it performs better on the slower queries that have high-frequency clauses, and its performance degrades more gracefully when the number of clauses increases.
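To make the follow-up heuristic mentioned in the comment more concrete, here is a minimal sketch of what "more than X matches per 2048-bit window on average" could look like. It is not code from this PR: the class name `BulkScorerHeuristic`, its methods, and the threshold value `32` are all hypothetical, and it assumes that `cost` is an estimate of the number of matching documents (such as the one a `DocIdSetIterator` reports) and that matches are spread uniformly over the doc ID space.

```java
// Hypothetical sketch, not part of Lucene or of this PR.
final class BulkScorerHeuristic {

    // Window size in doc IDs, taken from the "2048-bit window" in the comment.
    static final int WINDOW_SIZE = 2048;

    // Average number of matches per window, assuming matches are spread
    // uniformly across the [0, maxDoc) doc ID space.
    public static double expectedMatchesPerWindow(long cost, int maxDoc) {
        if (maxDoc <= 0) {
            return 0.0; // empty index: nothing can match
        }
        return (double) cost * WINDOW_SIZE / maxDoc;
    }

    // Use the bulk scorer only if enough matches are expected per window
    // to amortize the overhead of the bitset. The threshold is the "X"
    // left unspecified in the comment; callers must choose it.
    public static boolean useBulkScorer(long cost, int maxDoc, double minMatchesPerWindow) {
        return expectedMatchesPerWindow(cost, maxDoc) > minMatchesPerWindow;
    }

    public static void main(String[] args) {
        int maxDoc = 10_000_000;  // assumed index size
        double threshold = 32.0;  // assumed value for X

        // 50,000 * 2048 / 10,000,000 = 10.24 expected matches per window
        System.out.println(useBulkScorer(50_000L, maxDoc, threshold));
        // 2,000,000 * 2048 / 10,000,000 = 409.6 expected matches per window
        System.out.println(useBulkScorer(2_000_000L, maxDoc, threshold));
    }
}
```

A single ratio like this keeps the decision independent of the number of clauses, which fits the stated goal of not tailoring the implementation choice too closely to the query. A suitable value for X would have to come from benchmarks.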