jpountz opened a new issue, #13106: URL: https://github.com/apache/lucene/issues/13106
### Description On top-k queries, Lucene is now competitive with Tantivy/PISA on https://tantivy-search.github.io/bench/, but it's still quite slower on counting queries. This made me want to run a similar experiment as https://github.com/Tony-X/search-benchmark-game/issues/44, though with a few more changes to how skipping works: - Single level of skip lists. - Skip data and impacts are inlined between blocks of postings. - Less overhead: - no separate SkipReader abstraction that gets lazily instantiated: the skipping logic is more lightweight and within the postings/impacts enum logic, - checking whether to skip and decode a new block is now a single check on `BlockDocsEnum` while it requires two different checks today. A hacky implementation of this can be found at https://github.com/apache/lucene/compare/main...jpountz:lucene:skip_experiment?expand=1: - It doesn't replace existings skip data, just adds additional skip data and impacts inlined between blocks. - Only `BlockDocsEnum` and `BlockImpactsDocsEnum` switched to this new skip data, other impls still use existing skip data. So term queries will see a change, but not phrase queries. It's quite naive, we could probably do something that is a bit more efficient. Yet results on wikibigall are interesting: ``` TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value OrHighNotLow 327.87 (7.9%) 183.13 (5.4%) -44.1% ( -53% - -33%) 0.000 OrHighNotMed 374.44 (6.7%) 242.58 (5.0%) -35.2% ( -43% - -25%) 0.000 HighTermMonthSort 4971.86 (1.6%) 3322.82 (1.1%) -33.2% ( -35% - -30%) 0.000 HighTerm 480.63 (6.5%) 343.22 (5.2%) -28.6% ( -37% - -18%) 0.000 OrHighNotHigh 226.71 (7.4%) 169.91 (6.2%) -25.1% ( -36% - -12%) 0.000 MedTerm 705.95 (5.8%) 617.27 (5.1%) -12.6% ( -22% - -1%) 0.000 OrNotHighHigh 183.15 (6.7%) 161.14 (6.3%) -12.0% ( -23% - 1%) 0.000 CountOrHighHigh 56.83 (16.6%) 53.32 (10.5%) -6.2% ( -28% - 25%) 0.160 OrHighHigh 73.45 (1.9%) 71.56 (1.3%) -2.6% ( -5% - 0%) 0.000 TermDTSort 259.01 (4.1%) 253.64 (5.8%) -2.1% ( -11% - 8%) 0.193 HighTermTitleBDVSort 15.83 (6.0%) 15.60 (4.9%) -1.4% ( -11% - 10%) 0.417 AndHighHigh 71.57 (2.2%) 70.60 (1.8%) -1.4% ( -5% - 2%) 0.032 CountTerm 14081.51 (4.0%) 13918.96 (3.9%) -1.2% ( -8% - 6%) 0.353 HighTermTitleSort 112.66 (5.3%) 112.15 (2.4%) -0.5% ( -7% - 7%) 0.724 Prefix3 340.35 (3.4%) 339.24 (2.9%) -0.3% ( -6% - 6%) 0.745 Wildcard 106.77 (3.2%) 106.51 (3.5%) -0.2% ( -6% - 6%) 0.824 PKLookup 282.16 (2.0%) 281.52 (1.3%) -0.2% ( -3% - 3%) 0.667 LowSpanNear 13.84 (2.8%) 13.81 (2.6%) -0.2% ( -5% - 5%) 0.839 CountPhrase 3.19 (8.0%) 3.19 (9.1%) -0.1% ( -15% - 18%) 0.974 HighPhrase 29.89 (2.9%) 29.90 (4.7%) 0.0% ( -7% - 7%) 0.985 MedSpanNear 9.92 (2.3%) 9.93 (2.3%) 0.0% ( -4% - 4%) 0.952 HighSpanNear 5.37 (3.7%) 5.37 (3.4%) 0.1% ( -6% - 7%) 0.923 LowTerm 1154.36 (4.2%) 1155.62 (3.6%) 0.1% ( -7% - 8%) 0.930 MedSloppyPhrase 25.88 (2.7%) 25.93 (1.8%) 0.2% ( -4% - 4%) 0.789 Respell 59.03 (1.6%) 59.16 (1.6%) 0.2% ( -2% - 3%) 0.661 LowPhrase 48.93 (2.4%) 49.10 (4.7%) 0.4% ( -6% - 7%) 0.763 HighTermDayOfYearSort 437.52 (1.2%) 440.03 (1.7%) 0.6% ( -2% - 3%) 0.227 CountOrHighMed 108.93 (11.3%) 109.65 (7.7%) 0.7% ( -16% - 22%) 0.830 MedPhrase 26.25 (2.6%) 26.44 (5.1%) 0.7% ( -6% - 8%) 0.580 HighSloppyPhrase 6.35 (3.5%) 6.40 (2.1%) 0.8% ( -4% - 6%) 0.361 Fuzzy2 72.65 (1.1%) 73.27 (1.2%) 0.9% ( -1% - 3%) 0.022 Fuzzy1 90.62 (1.1%) 91.49 (1.2%) 1.0% ( -1% - 3%) 0.008 LowIntervalsOrdered 15.56 (3.6%) 15.74 (3.4%) 1.2% ( -5% - 8%) 0.287 HighIntervalsOrdered 3.05 (5.0%) 3.10 (5.0%) 1.6% ( -8% - 12%) 0.323 LowSloppyPhrase 18.12 (5.2%) 18.41 (2.3%) 1.6% ( -5% - 9%) 0.217 OrHighLow 658.61 (2.4%) 670.64 (1.9%) 1.8% ( -2% - 6%) 0.008 MedIntervalsOrdered 18.21 (4.2%) 18.59 (3.7%) 2.1% ( -5% - 10%) 0.096 OrNotHighMed 337.44 (4.1%) 350.80 (4.5%) 4.0% ( -4% - 13%) 0.004 OrHighMed 147.53 (2.6%) 153.47 (2.1%) 4.0% ( 0% - 8%) 0.000 IntNRQ 130.99 (21.2%) 139.01 (21.2%) 6.1% ( -29% - 61%) 0.360 AndHighMed 270.68 (2.4%) 291.75 (2.6%) 7.8% ( 2% - 13%) 0.000 AndHighLow 903.88 (2.2%) 1000.22 (3.1%) 10.7% ( 5% - 16%) 0.000 OrNotHighLow 793.34 (2.3%) 935.56 (2.1%) 17.9% ( 13% - 22%) 0.000 CountAndHighHigh 43.99 (2.1%) 51.94 (3.3%) 18.1% ( 12% - 23%) 0.000 CountAndHighMed 129.42 (2.3%) 155.03 (3.4%) 19.8% ( 13% - 26%) 0.000 ``` `CountAndHighHigh` and `CountAndHighMed` became almost 20% faster! These are the main targets that I was targeting with this change, so it's good to see they are seeing a significant speedup. This confirms that we have some non-negligible overhead for skipping today though it's not easy to tell how much comes from the additional abstractions vs. multiple levels of skip lists. `OrNotHighLow` and `OrNotHighMed` are faster. This is because the bottleneck of these queries is advancing the `MUST_NOT` clause, which are not scoring. So it's very similar to the speedup we're seeing on the counting queries. `AndHighLow` and `AndHighMed` are 8%-11% faster. Again, I would attribute this to the faster skipping logic since this is about clauses that have different doc frequencies, so the higher frequency clause will need to do a lot of skipping to catch up with the leading clause. Interestingly, the fact that we are storing a single level of impact data doesn't hurt. `AndHighHigh` and `OrHighHigh` are slightly slower (or is it noise?). I could believe that there is a small performance hit on this one due to having a single level of impact data. This forces Lucene to use the maximum score across the entire doc ID space as a score upper bound for the clause that has the higher cost. Maybe it could be enough to compute global impacts to have better performance on these queries by having slightly better score upper bounds for the following clause. `HighTerm`, `MedTerm`, `OrHighNotLow`, `OrHighNotMed`, `HighTerm`, `OrHighNotHigh`, `OrNotHighHigh` are slower. This is expected as there are queries that have a single positive clause, which in-turn are queries where the score upper bounds that we compute are very close to the actual produced scores, which in-turn enables these queries to take advantage of the higher levels of impacts to skip more docs at once. `HighTermMonthSort` is slower. This is because the sort dynamically introduces a filter that is so selective that the term query can take advantage of skip data on higher levels to skip more docs at once. `CountOrHighHigh` is a bit slower because there's a bit more overhead to collect postings lists exhaustively now that skip data and impacts are inlined. It's not a net win, but this suggests that we have some room for improvement here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org