[I] Can we decrease the overhead of skipping? [lucene]

via GitHub Thu, 15 Feb 2024 00:53:57 -0800


jpountz opened a new issue, #13106:
URL: https://github.com/apache/lucene/issues/13106


   ### Description
   
   On top-k queries, Lucene is now competitive with Tantivy/PISA on 
https://tantivy-search.github.io/bench/, but it's still quite slower on 
counting queries. This made me want to run a similar experiment as 
https://github.com/Tony-X/search-benchmark-game/issues/44, though with a few 
more changes to how skipping works:
    - Single level of skip lists.
    - Skip data and impacts are inlined between blocks of postings.
    - Less overhead:
      - no separate SkipReader abstraction that gets lazily instantiated: the 
skipping logic is more lightweight and within the postings/impacts enum logic,
      - checking whether to skip and decode a new block is now a single check 
on `BlockDocsEnum` while it requires two different checks today.
   
   A hacky implementation of this can be found at 
https://github.com/apache/lucene/compare/main...jpountz:lucene:skip_experiment?expand=1:
    - It doesn't replace existings skip data, just adds additional skip data 
and impacts inlined between blocks.
    - Only `BlockDocsEnum` and `BlockImpactsDocsEnum` switched to this new skip 
data, other impls still use existing skip data. So term queries will see a 
change, but not phrase queries.
   
   It's quite naive, we could probably do something that is a bit more 
efficient. Yet results on wikibigall are interesting:
   
   ```
                               TaskQPS baseline      StdDevQPS 
my_modified_version      StdDev                Pct diff p-value
                       OrHighNotLow      327.87      (7.9%)      183.13      
(5.4%)  -44.1% ( -53% -  -33%) 0.000
                       OrHighNotMed      374.44      (6.7%)      242.58      
(5.0%)  -35.2% ( -43% -  -25%) 0.000
                  HighTermMonthSort     4971.86      (1.6%)     3322.82      
(1.1%)  -33.2% ( -35% -  -30%) 0.000
                           HighTerm      480.63      (6.5%)      343.22      
(5.2%)  -28.6% ( -37% -  -18%) 0.000
                      OrHighNotHigh      226.71      (7.4%)      169.91      
(6.2%)  -25.1% ( -36% -  -12%) 0.000
                            MedTerm      705.95      (5.8%)      617.27      
(5.1%)  -12.6% ( -22% -   -1%) 0.000
                      OrNotHighHigh      183.15      (6.7%)      161.14      
(6.3%)  -12.0% ( -23% -    1%) 0.000
                    CountOrHighHigh       56.83     (16.6%)       53.32     
(10.5%)   -6.2% ( -28% -   25%) 0.160
                         OrHighHigh       73.45      (1.9%)       71.56      
(1.3%)   -2.6% (  -5% -    0%) 0.000
                         TermDTSort      259.01      (4.1%)      253.64      
(5.8%)   -2.1% ( -11% -    8%) 0.193
               HighTermTitleBDVSort       15.83      (6.0%)       15.60      
(4.9%)   -1.4% ( -11% -   10%) 0.417
                        AndHighHigh       71.57      (2.2%)       70.60      
(1.8%)   -1.4% (  -5% -    2%) 0.032
                          CountTerm    14081.51      (4.0%)    13918.96      
(3.9%)   -1.2% (  -8% -    6%) 0.353
                  HighTermTitleSort      112.66      (5.3%)      112.15      
(2.4%)   -0.5% (  -7% -    7%) 0.724
                            Prefix3      340.35      (3.4%)      339.24      
(2.9%)   -0.3% (  -6% -    6%) 0.745
                           Wildcard      106.77      (3.2%)      106.51      
(3.5%)   -0.2% (  -6% -    6%) 0.824
                           PKLookup      282.16      (2.0%)      281.52      
(1.3%)   -0.2% (  -3% -    3%) 0.667
                        LowSpanNear       13.84      (2.8%)       13.81      
(2.6%)   -0.2% (  -5% -    5%) 0.839
                        CountPhrase        3.19      (8.0%)        3.19      
(9.1%)   -0.1% ( -15% -   18%) 0.974
                         HighPhrase       29.89      (2.9%)       29.90      
(4.7%)    0.0% (  -7% -    7%) 0.985
                        MedSpanNear        9.92      (2.3%)        9.93      
(2.3%)    0.0% (  -4% -    4%) 0.952
                       HighSpanNear        5.37      (3.7%)        5.37      
(3.4%)    0.1% (  -6% -    7%) 0.923
                            LowTerm     1154.36      (4.2%)     1155.62      
(3.6%)    0.1% (  -7% -    8%) 0.930
                    MedSloppyPhrase       25.88      (2.7%)       25.93      
(1.8%)    0.2% (  -4% -    4%) 0.789
                            Respell       59.03      (1.6%)       59.16      
(1.6%)    0.2% (  -2% -    3%) 0.661
                          LowPhrase       48.93      (2.4%)       49.10      
(4.7%)    0.4% (  -6% -    7%) 0.763
              HighTermDayOfYearSort      437.52      (1.2%)      440.03      
(1.7%)    0.6% (  -2% -    3%) 0.227
                     CountOrHighMed      108.93     (11.3%)      109.65      
(7.7%)    0.7% ( -16% -   22%) 0.830
                          MedPhrase       26.25      (2.6%)       26.44      
(5.1%)    0.7% (  -6% -    8%) 0.580
                   HighSloppyPhrase        6.35      (3.5%)        6.40      
(2.1%)    0.8% (  -4% -    6%) 0.361
                             Fuzzy2       72.65      (1.1%)       73.27      
(1.2%)    0.9% (  -1% -    3%) 0.022
                             Fuzzy1       90.62      (1.1%)       91.49      
(1.2%)    1.0% (  -1% -    3%) 0.008
                LowIntervalsOrdered       15.56      (3.6%)       15.74      
(3.4%)    1.2% (  -5% -    8%) 0.287
               HighIntervalsOrdered        3.05      (5.0%)        3.10      
(5.0%)    1.6% (  -8% -   12%) 0.323
                    LowSloppyPhrase       18.12      (5.2%)       18.41      
(2.3%)    1.6% (  -5% -    9%) 0.217
                          OrHighLow      658.61      (2.4%)      670.64      
(1.9%)    1.8% (  -2% -    6%) 0.008
                MedIntervalsOrdered       18.21      (4.2%)       18.59      
(3.7%)    2.1% (  -5% -   10%) 0.096
                       OrNotHighMed      337.44      (4.1%)      350.80      
(4.5%)    4.0% (  -4% -   13%) 0.004
                          OrHighMed      147.53      (2.6%)      153.47      
(2.1%)    4.0% (   0% -    8%) 0.000
                             IntNRQ      130.99     (21.2%)      139.01     
(21.2%)    6.1% ( -29% -   61%) 0.360
                         AndHighMed      270.68      (2.4%)      291.75      
(2.6%)    7.8% (   2% -   13%) 0.000
                         AndHighLow      903.88      (2.2%)     1000.22      
(3.1%)   10.7% (   5% -   16%) 0.000
                       OrNotHighLow      793.34      (2.3%)      935.56      
(2.1%)   17.9% (  13% -   22%) 0.000
                   CountAndHighHigh       43.99      (2.1%)       51.94      
(3.3%)   18.1% (  12% -   23%) 0.000
                    CountAndHighMed      129.42      (2.3%)      155.03      
(3.4%)   19.8% (  13% -   26%) 0.000
   ```
   
   `CountAndHighHigh` and `CountAndHighMed` became almost 20% faster! These are 
the main targets that I was targeting with this change, so it's good to see 
they are seeing a significant speedup. This confirms that we have some 
non-negligible overhead for skipping today though it's not easy to tell how 
much comes from the additional abstractions vs. multiple levels of skip lists.
   
   `OrNotHighLow` and `OrNotHighMed` are faster. This is because the bottleneck 
of these queries is advancing the `MUST_NOT` clause, which are not scoring. So 
it's very similar to the speedup we're seeing on the counting queries.
   
   `AndHighLow` and `AndHighMed` are 8%-11% faster. Again, I would attribute 
this to the faster skipping logic since this is about clauses that have 
different doc frequencies, so the higher frequency clause will need to do a lot 
of skipping to catch up with the leading clause. Interestingly, the fact that 
we are storing a single level of impact data doesn't hurt.
   
   `AndHighHigh` and `OrHighHigh` are slightly slower (or is it noise?). I 
could believe that there is a small performance hit on this one due to having a 
single level of impact data. This forces Lucene to use the maximum score across 
the entire doc ID space as a score upper bound for the clause that has the 
higher cost. Maybe it could be enough to compute global impacts to have better 
performance on these queries by having slightly better score upper bounds for 
the following clause.
   
   `HighTerm`, `MedTerm`, `OrHighNotLow`, `OrHighNotMed`, `HighTerm`, 
`OrHighNotHigh`, `OrNotHighHigh` are slower. This is expected as there are 
queries that have a single positive clause, which in-turn are queries where the 
score upper bounds that we compute are very close to the actual produced 
scores, which in-turn enables these queries to take advantage of the higher 
levels of impacts to skip more docs at once.
   
   `HighTermMonthSort` is slower. This is because the sort dynamically 
introduces a filter that is so selective that the term query can take advantage 
of skip data on higher levels to skip more docs at once.
   
   `CountOrHighHigh` is a bit slower because there's a bit more overhead to 
collect postings lists exhaustively now that skip data and impacts are inlined.
   
   It's not a net win, but this suggests that we have some room for improvement 
here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[I] Can we decrease the overhead of skipping? [lucene]

Reply via email to