[ https://issues.apache.org/jira/browse/LUCENE-10121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419385#comment-17419385 ]
Adrien Grand commented on LUCENE-10121:
---------------------------------------

I opened a pull request that tries to avoid this issue by looking at the floating-point scores as well, still in a way that is prone to rounding errors. Here are the results of luceneutil on wikibigall:

{noformat}
Task                    QPS baseline    StdDev   QPS patch    StdDev   Pct diff                   p-value
HighTermDayOfYearSort        2048.74   (29.0%)     1984.55   (30.6%)      -3.1%  ( -48% -  79%)   0.740
Fuzzy2                        100.30    (4.8%)       98.92    (5.2%)      -1.4%  ( -10% -   9%)   0.383
HighPhrase                     84.34    (2.6%)       83.47    (1.8%)      -1.0%  (  -5% -   3%)   0.141
OrHighLow                     576.29    (2.8%)      570.57    (2.4%)      -1.0%  (  -6% -   4%)   0.227
AndHighLow                    484.18    (3.8%)      480.17    (3.5%)      -0.8%  (  -7% -   6%)   0.473
OrHighMed                      95.16    (4.0%)       94.41    (3.8%)      -0.8%  (  -8% -   7%)   0.520
Respell                       177.03    (2.4%)      176.16    (2.4%)      -0.5%  (  -5% -   4%)   0.517
HighSloppyPhrase                3.50    (3.5%)        3.49    (4.2%)      -0.5%  (  -7% -   7%)   0.692
AndHighMed                    154.00    (4.0%)      153.27    (3.8%)      -0.5%  (  -8% -   7%)   0.704
Prefix3                       210.72   (12.9%)      209.87   (13.2%)      -0.4%  ( -23% -  29%)   0.922
HighTerm                     1546.28    (3.6%)     1540.74    (2.8%)      -0.4%  (  -6% -   6%)   0.727
HighTermMonthSort             116.31    (6.0%)      115.94    (4.9%)      -0.3%  ( -10% -  11%)   0.853
IntNRQ                        435.27    (1.9%)      434.13    (1.5%)      -0.3%  (  -3% -   3%)   0.622
Wildcard                      126.26   (12.5%)      125.93   (13.2%)      -0.3%  ( -23% -  28%)   0.950
Fuzzy1                        181.58    (8.5%)      181.12    (6.9%)      -0.3%  ( -14% -  16%)   0.917
LowPhrase                      60.14    (2.2%)       60.02    (2.0%)      -0.2%  (  -4% -   4%)   0.750
MedTerm                      1549.63    (2.5%)     1547.41    (3.2%)      -0.1%  (  -5% -   5%)   0.874
LowSpanNear                    13.72    (3.3%)       13.71    (2.8%)      -0.1%  (  -6% -   6%)   0.944
AndHighHigh                    73.67    (3.5%)       73.62    (2.9%)      -0.1%  (  -6% -   6%)   0.950
LowTerm                      2856.45    (3.4%)     2855.10    (4.3%)      -0.0%  (  -7% -   7%)   0.969
MedSpanNear                     5.15    (9.9%)        5.15    (9.1%)      -0.0%  ( -17% -  21%)   0.996
MedPhrase                      25.88    (2.4%)       25.87    (2.4%)      -0.0%  (  -4% -   5%)   0.987
LowSloppyPhrase                79.38    (3.8%)       79.48    (3.6%)       0.1%  (  -7% -   7%)   0.917
MedSloppyPhrase                12.25    (3.1%)       12.28    (3.4%)       0.2%  (  -6% -   6%)   0.817
HighSpanNear                    6.20    (4.5%)        6.22    (3.3%)       0.3%  (  -7% -   8%)   0.782
OrHighHigh                     20.94    (3.3%)       21.35    (4.2%)       2.0%  (  -5% -   9%)   0.098
{noformat}

There is a modest improvement to OrHighHigh (it is consistently reproducible, so I believe it is not noise) with no slowdown of other queries. This is expected since most blocks generally have different maximum scores. However, the (cab_color:y OR cab_color:g) query on the sorted sparse NYC Taxis goes from 80ms to 2ms.

> WANDScorer could skip more
> --------------------------
>
>                 Key: LUCENE-10121
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10121
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I was looking at the NYC Taxis benchmark recently and got puzzled by the fact
> that the query (cab_color:y OR cab_color:g) ran so slowly:
> http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#search_bq_qps.
> This is supposed to be a best-case scenario for WAND: there are only two
> possible scores for documents, so this query should return instantly in the
> sorted case.
> After digging I noticed that this is due to the scaling that we do in
> WANDScorer to avoid floating-point rounding errors: documents can be
> considered as possible matches according to the scaled scores (which are
> rounded) while they cannot possibly match according to the actual scores.
> This is especially visible when many blocks contain a document that has the
> maximum score across the entire postings list, for instance with any field
> indexed with indexOptions=DOCS or with constant-scoring queries.
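
To make the rounding effect described in the issue concrete, here is a minimal sketch. It is not the WANDScorer code: the class name, the helper methods, and the 16-bit scaling factor are all hypothetical, and the sketch assumes that maximum scores are rounded up and the minimum competitive score is rounded down when converted to scaled integers. It also assumes that once enough hits with the top score have been collected, the minimum competitive score becomes the next float above that score.

```java
public class ScaledScoreDemo {

    // Hypothetical fixed-point scaling that keeps `bits` fractional bits.
    // Maximum scores are rounded up so that the scaled value never
    // underestimates the real score.
    static long scaleUp(float score, int bits) {
        return (long) Math.ceil(Math.scalb((double) score, bits));
    }

    // The minimum competitive score is rounded down so that the scaled
    // threshold never overestimates the real threshold.
    static long scaleDown(float score, int bits) {
        return (long) Math.floor(Math.scalb((double) score, bits));
    }

    // Decision based on the rounded, scaled scores only.
    static boolean competitiveScaled(float blockMax, float minCompetitive, int bits) {
        return scaleUp(blockMax, bits) >= scaleDown(minCompetitive, bits);
    }

    // Decision based on the actual floating-point scores.
    static boolean competitiveExact(float blockMax, float minCompetitive) {
        return blockMax >= minCompetitive;
    }

    public static void main(String[] args) {
        // Every block contains a document with the global maximum score.
        float blockMax = 0.7f;
        // Assumed threshold once the top hits all have that maximum score.
        float minCompetitive = Math.nextUp(blockMax);
        int bits = 16;

        System.out.println("scaled says competitive: "
            + competitiveScaled(blockMax, minCompetitive, bits));
        System.out.println("exact says competitive:  "
            + competitiveExact(blockMax, minCompetitive));
    }
}
```

Under these assumptions the scaled comparison still treats the block as a possible match while the floating-point comparison rules it out, which is the situation where also consulting the floating-point scores allows more skipping.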