zacharymorn commented on PR #12194: URL: https://github.com/apache/lucene/pull/12194#issuecomment-1461257536
Thanks @jpountz for the review and comment!

> Did you manage to observe some speedups with this change?

So far I have only been able to run `wikimedium10m`, and the implementation shows a slowdown of roughly 5-9% (listed below) for the full-text boolean queries `OrXXXNotYYYY`, due to the changes in `ReqExclScorer` and `Lucene90PostingsReader`. The facet tasks don't seem to exercise the changes, so their differences should be just random fluctuation. I'm still searching for existing benchmark tasks that can measure these targeted use cases:

> Is it actually common to have long runs of matches? For full-text indexes, maybe not so much, only stop words may have runs of adjacent matches. For string fields, this may happen if the field has a default value that is the value of most documents in the collection. Also it's possible for users to use index sorting in order to cluster similar documents together, which increases the likelihood to have long runs of adjacent matches.

Do you have any pointers on which benchmark task I could use? If there isn't one available, I could try to add some next.
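To make the "long runs of adjacent matches" case concrete, here is a minimal Python sketch that groups a sorted list of matching doc IDs into runs of consecutive IDs. It is not Lucene code: the function name `runs_of_adjacent_docs` and both doc ID lists are hypothetical, chosen only to contrast a scattered term with a value clustered by index sorting.

```python
from typing import Iterable, List, Tuple


def runs_of_adjacent_docs(doc_ids: Iterable[int]) -> List[Tuple[int, int]]:
    """Group sorted doc IDs into (first_doc, run_length) runs of consecutive IDs."""
    runs: List[Tuple[int, int]] = []
    start = prev = None
    for doc in doc_ids:
        if prev is not None and doc == prev + 1:
            # Still inside the current run of adjacent matches.
            prev = doc
            continue
        if start is not None:
            # The previous run ended; record it before starting a new one.
            runs.append((start, prev - start + 1))
        start = prev = doc
    if start is not None:
        runs.append((start, prev - start + 1))
    return runs


# Hypothetical matches for a term scattered across a full-text index.
scattered = [3, 17, 42, 43, 90]
# Hypothetical matches for a default field value after index sorting.
clustered = [100, 101, 102, 103, 104, 105, 200]

scattered_runs = runs_of_adjacent_docs(scattered)
clustered_runs = runs_of_adjacent_docs(clustered)
print(scattered_runs)
print(clustered_runs)
```

A benchmark task aimed at this change would presumably need postings that look like the second list, since the first shape offers almost no runs to skip over.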
```
Task                        QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                 p-value
BrowseRandomLabelTaxoFacets        39.06  (49.1%)                    33.42  (45.6%)    -14.4% ( -73% - 157%)  0.335
BrowseDateTaxoFacets               34.22  (29.5%)                    30.49  (27.9%)    -10.9% ( -52% - 65%)   0.229
BrowseDayOfYearTaxoFacets          34.30  (29.4%)                    30.57  (27.9%)    -10.9% ( -52% - 65%)   0.230
OrNotHighHigh                     426.33   (2.7%)                   388.27   (1.6%)     -8.9% ( -12% - -4%)   0.000
OrHighNotHigh                     621.42   (2.7%)                   573.32   (2.0%)     -7.7% ( -12% - -3%)   0.000
OrHighNotMed                      616.08   (3.8%)                   573.20   (2.7%)     -7.0% ( -13% - 0%)    0.000
OrHighNotLow                      562.98   (4.0%)                   525.80   (3.3%)     -6.6% ( -13% - 0%)    0.000
OrNotHighMed                      712.40   (2.6%)                   672.88   (2.4%)     -5.5% ( -10% - 0%)    0.000
HighTermTitleBDVSort               19.04   (7.9%)                    18.73   (8.3%)     -1.6% ( -16% - 15%)   0.534
HighIntervalsOrdered                1.89  (12.3%)                     1.86  (15.0%)     -1.6% ( -25% - 29%)   0.719
BrowseMonthTaxoFacets              32.23  (33.9%)                    31.77  (33.6%)     -1.4% ( -51% - 100%)  0.895
OrNotHighLow                     1719.50   (3.8%)                  1696.85   (4.6%)     -1.3% ( -9% - 7%)     0.326
HighTermTitleSort                 202.79   (2.9%)                   200.20   (2.8%)     -1.3% ( -6% - 4%)     0.155
AndHighHigh                        51.74   (5.8%)                    51.08   (5.4%)     -1.3% ( -11% - 10%)   0.475
Fuzzy1                             59.04   (2.7%)                    58.36   (3.2%)     -1.2% ( -6% - 4%)     0.214
MedTerm                          1364.68   (4.4%)                  1349.31   (3.4%)     -1.1% ( -8% - 6%)     0.362
Wildcard                          314.79   (2.8%)                   311.35   (3.5%)     -1.1% ( -7% - 5%)     0.277
LowTerm                          2087.86   (3.2%)                  2065.24   (3.8%)     -1.1% ( -7% - 6%)     0.334
MedIntervalsOrdered                22.66   (8.6%)                    22.42  (10.4%)     -1.0% ( -18% - 19%)   0.730
PKLookup                          331.54   (2.9%)                   328.12   (2.6%)     -1.0% ( -6% - 4%)     0.242
LowIntervalsOrdered               161.90   (9.5%)                   160.23  (11.6%)     -1.0% ( -20% - 22%)   0.758
Fuzzy2                            100.43   (1.5%)                    99.40   (3.0%)     -1.0% ( -5% - 3%)     0.169
Respell                            88.01   (2.0%)                    87.27   (2.4%)     -0.8% ( -5% - 3%)     0.223
BrowseDateSSDVFacets                4.89  (21.4%)                     4.85  (20.2%)     -0.8% ( -34% - 51%)   0.905
BrowseRandomLabelSSDVFacets        19.23   (7.1%)                    19.09   (6.2%)     -0.7% ( -13% - 13%)   0.728
AndHighMed                        114.49   (5.6%)                   113.77   (5.0%)     -0.6% ( -10% - 10%)   0.708
Prefix3                           376.91   (1.4%)                   374.65   (2.5%)     -0.6% ( -4% - 3%)     0.348
HighTermMonthSort                4250.83   (4.2%)                  4227.31   (3.6%)     -0.6% ( -7% - 7%)     0.653
OrHighMed                         209.50   (6.2%)                   208.61   (3.4%)     -0.4% ( -9% - 9%)     0.787
LowPhrase                          89.33   (3.0%)                    88.96   (2.2%)     -0.4% ( -5% - 4%)     0.623
BrowseDayOfYearSSDVFacets          24.82  (10.9%)                    24.75  (11.2%)     -0.3% ( -20% - 24%)   0.940
AndHighMedDayTaxoFacets           158.35   (1.6%)                   158.08   (1.8%)     -0.2% ( -3% - 3%)     0.756
HighTerm                         2076.24   (3.7%)                  2074.83   (2.9%)     -0.1% ( -6% - 6%)     0.949
AndHighHighDayTaxoFacets           14.81   (2.4%)                    14.81   (2.9%)     -0.0% ( -5% - 5%)     0.992
HighSpanNear                       11.02   (2.0%)                    11.02   (2.4%)      0.0% ( -4% - 4%)     0.951
LowSpanNear                       178.01   (1.7%)                   178.17   (1.8%)      0.1% ( -3% - 3%)     0.864
OrHighLow                         473.27   (6.2%)                   473.94   (3.3%)      0.1% ( -8% - 10%)    0.929
TermDTSort                        230.93   (4.7%)                   231.36   (3.3%)      0.2% ( -7% - 8%)     0.885
MedSloppyPhrase                    20.69   (3.0%)                    20.76   (3.0%)      0.3% ( -5% - 6%)     0.721
MedSpanNear                        80.38   (2.2%)                    80.66   (2.1%)      0.3% ( -3% - 4%)     0.618
MedPhrase                          53.03   (1.8%)                    53.23   (1.8%)      0.4% ( -3% - 3%)     0.520
AndHighLow                       2127.04   (4.1%)                  2136.55   (3.4%)      0.4% ( -6% - 8%)     0.706
HighTermDayOfYearSort             594.24   (6.9%)                   597.03   (6.3%)      0.5% ( -11% - 14%)   0.822
OrHighHigh                         53.90   (5.7%)                    54.21   (4.0%)      0.6% ( -8% - 10%)    0.709
MedTermDayTaxoFacets               41.47   (1.1%)                    41.75   (2.8%)      0.7% ( -3% - 4%)     0.311
HighPhrase                        119.60   (2.0%)                   120.53   (1.8%)      0.8% ( -2% - 4%)     0.195
LowSloppyPhrase                   166.00   (5.7%)                   167.36   (5.5%)      0.8% ( -9% - 12%)    0.644
HighSloppyPhrase                   43.02   (5.5%)                    43.40   (5.2%)      0.9% ( -9% - 12%)    0.610
BrowseMonthSSDVFacets              24.48   (8.6%)                    24.85  (11.6%)      1.5% ( -17% - 23%)   0.638
OrHighMedDayTaxoFacets              7.71   (3.7%)                     7.86   (6.0%)      1.9% ( -7% - 12%)    0.218
IntNRQ                            115.47  (14.3%)                   118.04  (13.2%)      2.2% ( -22% - 34%)   0.610
```

> You explored implementing this new API in several different places: BitSetIterator, doc-value iterator, postings, etc. and it's already a bit exhausting to review and will get worse when we add more tests. I think it would be helpful if we focused on a single thing for the initial PR that focuses on proving that this API is a good addition, adds good testing, and then implement the new API on other implementations of DocIdSetIterator in follow-up PRs.

For sure.
Once I'm able to benchmark this and observe a good speedup, and we are happy with the API, I will break this PR up into smaller pieces.
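As a quick sanity check on the benchmark table, here is a minimal Python sketch that recomputes the `Pct diff` column for the `Or*Not*` tasks from the two QPS columns. It assumes `Pct diff` is the relative change in mean QPS versus the baseline, which matches the listed values. The helper name `pct_diff` is hypothetical, and the parenthesized range and the p-value are not reproduced here.

```python
def pct_diff(baseline_qps: float, candidate_qps: float) -> float:
    """Relative change of candidate QPS versus baseline QPS, in percent."""
    return (candidate_qps - baseline_qps) / baseline_qps * 100.0


# (baseline QPS, modified QPS) copied from the benchmark table.
rows = {
    "OrNotHighHigh": (426.33, 388.27),
    "OrHighNotHigh": (621.42, 573.32),
    "OrHighNotMed": (616.08, 573.20),
    "OrHighNotLow": (562.98, 525.80),
    "OrNotHighMed": (712.40, 672.88),
}

# Round to one decimal place, as in the table.
diffs = {task: round(pct_diff(base, cand), 1) for task, (base, cand) in rows.items()}

for task, diff in diffs.items():
    print(f"{task:<15} {diff:+.1f}%")
```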