gsmiller commented on PR #12055: URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435332248
OK, I think I've addressed the previous feedback and also brought the same changes into `TermInSetQuery`. This should be ready for feedback @jpountz (whenever you have a free moment).

On some internal benchmarks (Amazon product search), we see throughput increases ranging from ~7% to 63%, depending on a number of factors. We have some situations where pre-processing of long postings (in the existing TiS implementation) takes up a large share of CPU time. These tend to be cases where the actual matches are sparse but the TiS terms match a large number of docs. Being able to "hold back" these long postings into a `DisjunctionDISIApproximation` while still pre-processing the shorter postings is a big win in these cases.

Here are some flame charts showing the impact (heavily redacted of course):

"Normal" TiS:
<img width="1469" alt="Screen Shot 2023-02-15 at 7 51 01 AM" src="https://user-images.githubusercontent.com/16479560/219803198-c28ce8f5-1807-4c86-b3ec-7d6e33096533.png">

TiS with this PR:
<img width="1415" alt="Screen Shot 2023-02-15 at 10 38 38 AM" src="https://user-images.githubusercontent.com/16479560/219803253-b785d652-cf37-42c6-bf7c-be6d18e69caf.png">

I also re-ran the `luceneutil` benchmarks (`wikimedium10m`) and see results consistent with the initial PR:

```
Task                        QPS baseline   StdDev  QPS candidate   StdDev  Pct diff                 p-value
BrowseMonthTaxoFacets              32.05  (14.3%)          30.06  (23.2%)     -6.2%  (-38% -  36%)  0.309
BrowseMonthSSDVFacets              14.85  (17.6%)          13.95   (2.3%)     -6.0%  (-22% -  16%)  0.127
BrowseRandomLabelTaxoFacets        23.18  (12.4%)          21.93  (19.8%)     -5.4%  (-33% -  30%)  0.303
BrowseDateTaxoFacets               31.04  (14.2%)          29.42  (22.7%)     -5.2%  (-36% -  37%)  0.383
BrowseDayOfYearTaxoFacets          31.29  (14.4%)          29.66  (22.9%)     -5.2%  (-37% -  37%)  0.389
BrowseDayOfYearSSDVFacets          14.17  (17.7%)          13.85  (14.0%)     -2.2%  (-28% -  35%)  0.659
IntNRQ                            156.59   (5.3%)         154.50   (7.1%)     -1.3%  (-12% -  11%)  0.498
HighTermTitleSort                 100.27   (2.6%)          99.43   (2.8%)     -0.8%  ( -6% -   4%)  0.329
HighTermMonthSort                2633.69   (3.3%)        2614.59   (3.1%)     -0.7%  ( -6% -   5%)  0.477
AndHighLow                       1085.57   (2.9%)        1080.12   (2.5%)     -0.5%  ( -5% -   5%)  0.559
OrNotHighHigh                     795.28   (3.1%)         791.61   (3.4%)     -0.5%  ( -6% -   6%)  0.656
HighPhrase                        145.14   (2.7%)         144.70   (2.8%)     -0.3%  ( -5% -   5%)  0.725
BrowseDateSSDVFacets                3.81   (7.7%)           3.80   (7.9%)     -0.3%  (-14% -  16%)  0.905
Fuzzy2                             57.08   (1.3%)          56.95   (1.2%)     -0.2%  ( -2% -   2%)  0.555
LowPhrase                         379.61   (2.4%)         378.74   (2.8%)     -0.2%  ( -5% -   5%)  0.783
Fuzzy1                             76.62   (1.5%)          76.45   (1.2%)     -0.2%  ( -2% -   2%)  0.621
MedPhrase                          19.20   (2.3%)          19.19   (2.2%)     -0.1%  ( -4% -   4%)  0.898
OrHighMedDayTaxoFacets             15.40   (3.4%)          15.40   (3.3%)     -0.0%  ( -6% -   6%)  0.978
OrNotHighLow                     1186.45   (3.0%)        1186.35   (2.2%)     -0.0%  ( -5% -   5%)  0.992
MedSpanNear                         8.28   (2.7%)           8.28   (2.6%)      0.0%  ( -5% -   5%)  0.996
TermDTSort                        104.78   (1.2%)         104.79   (1.3%)      0.0%  ( -2% -   2%)  0.984
AndHighMed                        204.19   (3.2%)         204.21   (3.7%)      0.0%  ( -6% -   7%)  0.992
MedTermDayTaxoFacets               51.30   (2.9%)          51.32   (3.0%)      0.0%  ( -5% -   6%)  0.970
HighTermTitleBDVSort               21.28   (4.7%)          21.28   (4.2%)      0.0%  ( -8% -   9%)  0.975
LowSpanNear                       179.68   (1.3%)         179.77   (1.6%)      0.0%  ( -2% -   3%)  0.919
AndHighMedDayTaxoFacets            25.84   (1.6%)          25.85   (1.7%)      0.0%  ( -3% -   3%)  0.923
OrHighMed                         106.73   (3.0%)         106.83   (3.4%)      0.1%  ( -6% -   6%)  0.932
OrNotHighMed                      329.39   (3.1%)         329.71   (3.4%)      0.1%  ( -6% -   6%)  0.925
Respell                            49.02   (0.9%)          49.09   (0.6%)      0.1%  ( -1% -   1%)  0.546
MedIntervalsOrdered                 4.40   (5.8%)           4.40   (5.5%)      0.1%  (-10% -  12%)  0.933
AndHighHighDayTaxoFacets           13.00   (1.7%)          13.02   (1.5%)      0.2%  ( -2% -   3%)  0.760
PKLookup                          181.72   (2.5%)         182.00   (2.2%)      0.2%  ( -4% -   4%)  0.833
HighSpanNear                       23.16   (1.5%)          23.20   (1.9%)      0.2%  ( -3% -   3%)  0.763
OrHighNotLow                      420.49   (3.3%)         421.43   (5.0%)      0.2%  ( -7% -   8%)  0.867
LowIntervalsOrdered                32.01   (3.6%)          32.08   (3.6%)      0.2%  ( -6% -   7%)  0.838
MedSloppyPhrase                   136.79   (2.9%)         137.27   (2.7%)      0.3%  ( -5% -   6%)  0.696
HighTermDayOfYearSort             233.88   (2.3%)         234.72   (2.9%)      0.4%  ( -4% -   5%)  0.663
LowSloppyPhrase                    24.70   (2.6%)          24.80   (2.5%)      0.4%  ( -4% -   5%)  0.604
OrHighNotMed                      477.53   (3.0%)         479.58   (5.0%)      0.4%  ( -7% -   8%)  0.742
OrHighNotHigh                     274.34   (3.4%)         275.55   (4.8%)      0.4%  ( -7% -   8%)  0.737
HighIntervalsOrdered                1.86   (3.4%)           1.87   (3.3%)      0.5%  ( -6% -   7%)  0.652
BrowseRandomLabelSSDVFacets        10.16   (8.9%)          10.22   (9.0%)      0.5%  (-15% -  20%)  0.850
OrHighLow                         426.87   (2.9%)         429.35   (4.0%)      0.6%  ( -6% -   7%)  0.599
MedTerm                           520.48   (3.7%)         524.12   (5.6%)      0.7%  ( -8% -  10%)  0.643
HighTerm                          522.07   (3.3%)         526.23   (5.2%)      0.8%  ( -7% -   9%)  0.562
HighSloppyPhrase                    9.98   (4.5%)          10.06   (4.2%)      0.8%  ( -7% -   9%)  0.562
OrHighHigh                         29.97   (4.5%)          30.24   (5.5%)      0.9%  ( -8% -  11%)  0.570
LowTerm                           732.69   (3.2%)         740.93   (4.5%)      1.1%  ( -6% -   9%)  0.362
AndHighHigh                        29.62   (5.1%)          29.99   (5.9%)      1.2%  ( -9% -  12%)  0.481
Prefix3                           138.32   (1.3%)         178.19   (1.9%)     28.8%  ( 25% -  32%)  0.000
Wildcard                          393.93   (1.5%)         759.76   (3.8%)     92.9%  ( 86% -  99%)  0.000
```
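
For readers who want a concrete picture of the "hold back" idea mentioned in the comment, here is a minimal plain-Java sketch. It is not Lucene's API and not the code in this PR: `HoldBackSketch`, `Cursor`, and the `maxHeldBack` parameter are hypothetical names, postings are modeled as sorted `int[]` arrays of unique doc IDs, and the rule "keep the N longest postings lazy" is an assumption chosen for illustration. The short postings are merged eagerly into a bitset, while the long ones are only advanced to the doc IDs a caller actually asks about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;

/** Illustrative stand-in only; names and selection rule are assumptions, not Lucene code. */
final class HoldBackSketch {

  /** Forward-only cursor over one term's sorted, unique doc IDs (stands in for a postings iterator). */
  static final class Cursor {
    final int[] docs;
    int pos = 0;

    Cursor(int[] docs) {
      this.docs = docs;
    }

    /** Moves to the first doc >= target; returns Integer.MAX_VALUE once exhausted. */
    int advance(int target) {
      // Binary search over the unread tail, loosely mimicking skipping in real postings.
      int i = Arrays.binarySearch(docs, pos, docs.length, target);
      pos = i >= 0 ? i : -i - 1;
      return pos < docs.length ? docs[pos] : Integer.MAX_VALUE;
    }
  }

  final BitSet eager = new BitSet();               // union of the short postings, built up front
  final List<Cursor> heldBack = new ArrayList<>(); // long postings, never fully read

  /** Keeps the maxHeldBack longest postings lazy and merges all others into the bitset. */
  HoldBackSketch(List<int[]> postings, int maxHeldBack) {
    List<int[]> sorted = new ArrayList<>(postings);
    sorted.sort(Comparator.comparingInt((int[] p) -> p.length).reversed());
    for (int i = 0; i < sorted.size(); i++) {
      if (i < maxHeldBack) {
        heldBack.add(new Cursor(sorted.get(i)));
      } else {
        for (int doc : sorted.get(i)) {
          eager.set(doc);
        }
      }
    }
  }

  /** First doc >= target (target must be non-negative) matching any term, or Integer.MAX_VALUE. */
  int advance(int target) {
    int next = eager.nextSetBit(target);
    int best = next < 0 ? Integer.MAX_VALUE : next;
    for (Cursor c : heldBack) {
      best = Math.min(best, c.advance(target));
    }
    return best;
  }
}
```

When another clause leads iteration and the real matches are sparse, `advance` is called for only a few targets, so the held-back postings are touched at a handful of positions instead of being read in full. Targets must be passed in non-decreasing order, since the cursors only move forward.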