[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349941#comment-17349941 ]
Zach Chen commented on LUCENE-9335:
-----------------------------------

Hi [~jpountz], I've tried out a few ideas over the last few days and they gave some improvements (although they also made OrMedMedMedMedMed worse). However, the scorer still does not perform as well as BMW on the MSMARCO passages dataset. The ideas I tried include:

# Move a scorer from the essential to the non-essential list when minCompetitiveScore increases (mentioned in the paper); see the sketch at the end of this comment
## commit: [https://github.com/apache/lucene/pull/101/commits/e5f10e31a84c0bab687fbac7d3f05274472a1288]
# Use scorer.score() instead of maxScore when evaluating a candidate doc against minCompetitiveScore, in order to prune more docs (reverting your previous optimization)
## commit: [https://github.com/apache/lucene/pull/101/commits/e5f10e31a84c0bab687fbac7d3f05274472a1288]
# Reduce the maxScore contribution from the non-essential list during candidate doc evaluation for scorers that cannot match
## commit: [https://github.com/apache/lucene/pull/101/commits/881dbf8fc1c04b8c5d2cb0f19e4e3e44ef595f3d]
# Use the maximum of each scorer's upTo as the maxScore boundary instead of the minimum (the opposite of what the paper suggests)
## commit: [https://github.com/apache/lucene/pull/101/commits/466a2d9292e300cbf00312f3477d95a14c41c188]
## This causes OrMedMedMedMedMed to degrade by 40%

Collectively, these changes gave a 70~90% performance boost for OrHighHigh, 60~150% for OrHighMed, and a smaller improvement for AndHighOrMedMed, but at the expense of OrMedMedMedMedMed performance (-40% with the #4 changes). For the MSMARCO passages dataset, they now give the following results (modified slightly from your version to show more percentiles and to add comma separators for readability):

*BMW Scorer*
{code:java}
AVG: 23,252,992.375
P25: 6,298,463
P50: 13,007,148
P75: 26,868,222
P90: 56,683,505
P95: 84,333,397
P99: 154,185,321
Collected AVG: 8,168.523
Collected P25: 1,548
Collected P50: 2,259
Collected P75: 3,735
Collected P90: 6,228
Collected P95: 13,063
Collected P99: 221,894
{code}

*BMM Scorer*
{code:java}
AVG: 41,970,641.638
P25: 8,654,210
P50: 21,553,366
P75: 51,519,172
P90: 109,510,378
P95: 154,534,017
P99: 266,141,446
Collected AVG: 16,810.392
Collected P25: 2,769
Collected P50: 7,159
Collected P75: 20,077
Collected P90: 43,031
Collected P95: 69,984
Collected P99: 135,253
{code}

I've also attached "JFR result for BMM scorer with optimizations May 22", which shows the profiling result for the BMM scorer with the latest changes. Overall, the larger number of docs collected by BMM is becoming a performance bottleneck: around 50% of the computation is spent in SimpleTopScoreDocCollector#collect / BlockMaxMaxscoreScorer#score computing scores for candidate docs, and around 34% is spent in BlockMaxMaxscoreScorer#nextDoc finding the next doc. If there's a way to prune more docs faster, it should improve BMM further.
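For reference, here is a minimal, standalone sketch of the partition logic behind idea #1 (not the actual BlockMaxMaxscoreScorer code): scorers are kept sorted by max score, and the non-essential list is the longest prefix whose summed max scores stay below minCompetitiveScore, so as minCompetitiveScore grows more scorers move out of the essential list and stop driving iteration. The ScorerMaxScore holder type and all names below are made up for illustration.

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MaxScorePartitionSketch {

  /** Hypothetical holder pairing a scorer label with its (block) max score. */
  record ScorerMaxScore(String name, float maxScore) {}

  /**
   * Returns how many scorers, counted from the lowest max scores, are non-essential:
   * a doc matching only these scorers cannot reach minCompetitiveScore, so they never
   * need to lead nextDoc() and only contribute when a candidate is being scored.
   */
  static int nonEssentialCount(List<ScorerMaxScore> sortedByMaxScore, float minCompetitiveScore) {
    double sum = 0;
    int count = 0;
    for (ScorerMaxScore s : sortedByMaxScore) {
      if (sum + s.maxScore() >= minCompetitiveScore) {
        break;
      }
      sum += s.maxScore();
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    List<ScorerMaxScore> scorers = new ArrayList<>(List.of(
        new ScorerMaxScore("the", 1.2f),
        new ScorerMaxScore("quick", 2.5f),
        new ScorerMaxScore("fox", 4.0f)));
    scorers.sort(Comparator.comparingDouble(ScorerMaxScore::maxScore));

    // With a low minCompetitiveScore, every scorer stays essential.
    System.out.println(nonEssentialCount(scorers, 0.5f)); // 0

    // As minCompetitiveScore increases (e.g. once the top-k heap is full),
    // low-impact scorers move to the non-essential list.
    System.out.println(nonEssentialCount(scorers, 2.0f)); // 1 ("the")
    System.out.println(nonEssentialCount(scorers, 4.5f)); // 2 ("the" + "quick")
  }
}
{code}

In the real scorer the same re-partitioning would be re-evaluated whenever the collector raises minCompetitiveScore, rather than on every doc.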
> Add a bulk scorer for disjunctions that does dynamic pruning
> ------------------------------------------------------------
>
>                 Key: LUCENE-9335
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9335
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: JFR result for BMM scorer with optimizations May 22.png, MSMarcoPassages.java, wikimedium.10M.nostopwords.tasks, wikimedium.10M.nostopwords.tasks.5OrMeds
>          Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and PISA at [https://tantivy-search.github.io/bench/] or against research prototypes in Table 1 of [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
> Given that top-level disjunctions of term queries are commonly used for benchmarking, it would be nice to optimize this case a bit more. I suspect that we could make fewer per-document decisions by implementing a BulkScorer instead of a Scorer.