Re: [PR] Random access term dictionary [lucene]

via GitHub Thu, 14 Dec 2023 23:41:00 -0800


Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1857417012


   Here is the even more interesting stuff. After all those allocation 
optimizations. I also implemented the on-paper more "efficient" algorithm to 
intersect FST and FSA for Terms.intersect(), which leverages the sorted nature 
of the FST arcs and FSA transitions from a given state (so at least we can 
binary search to advance with some skipping). FST in some cases have direct 
addressing which is exploited, too. As a side note -- it also uncovered a bug 
for the NFA impl which is tracked here 
https://github.com/apache/lucene/issues/12906.  
   
   But that change is not moving the needle at all for `WildCard` and `Prefix3` 
search tasks. 
   
   
   ```
   Before 
                               TaskQPS baseline      StdDevQPS 
my_modified_version      StdDev                Pct diff p-value
                           Wildcard       62.85      (1.8%)       12.66      
(0.6%)  -79.9% ( -80% -  -78%) 0.000
                             Fuzzy2       55.12      (1.0%)       18.92      
(0.8%)  -65.7% ( -66% -  -64%) 0.000
                             Fuzzy1       61.20      (0.8%)       22.55      
(0.8%)  -63.2% ( -64% -  -62%) 0.000
                            Respell       31.11      (0.8%)       11.95      
(0.6%)  -61.6% ( -62% -  -60%) 0.000
                            Prefix3      135.69      (2.0%)       65.06      
(0.7%)  -52.1% ( -53% -  -50%) 0.000
                  HighTermTitleSort      119.58      (0.9%)      111.03      
(1.7%)   -7.2% (  -9% -   -4%) 0.000
                             IntNRQ       22.25      (1.1%)       21.87      
(1.5%)   -1.7% (  -4% -    0%) 0.000
                         HighPhrase       25.82      (3.6%)       25.55      
(3.2%)   -1.1% (  -7% -    5%) 0.318
                          MedPhrase        7.41      (2.4%)        7.35      
(2.2%)   -0.8% (  -5% -    3%) 0.259
                        LowSpanNear        8.81      (1.9%)        8.74      
(2.1%)   -0.8% (  -4% -    3%) 0.202
             OrHighMedDayTaxoFacets        3.86      (5.8%)        3.83      
(4.8%)   -0.8% ( -10% -   10%) 0.636
                         TermDTSort      100.75      (2.9%)       99.98      
(2.0%)   -0.8% (  -5% -    4%) 0.336
               HighIntervalsOrdered        6.07      (2.1%)        6.03      
(2.4%)   -0.7% (  -5% -    3%) 0.342
                MedIntervalsOrdered       45.89      (2.0%)       45.61      
(2.4%)   -0.6% (  -4% -    3%) 0.389
                       HighSpanNear       10.73      (1.0%)       10.66      
(1.5%)   -0.6% (  -3% -    2%) 0.165
              HighTermDayOfYearSort      206.09      (1.8%)      204.93      
(1.9%)   -0.6% (  -4% -    3%) 0.338
                LowIntervalsOrdered        8.39      (2.3%)        8.37      
(2.5%)   -0.3% (  -5% -    4%) 0.654
                        MedSpanNear       66.00      (1.3%)       65.81      
(1.9%)   -0.3% (  -3% -    2%) 0.574
                            MedTerm      322.61      (4.7%)      321.89      
(4.5%)   -0.2% (  -9% -    9%) 0.878
            AndHighMedDayTaxoFacets       22.62      (1.0%)       22.58      
(1.2%)   -0.2% (  -2% -    2%) 0.617
                          LowPhrase       48.52      (1.3%)       48.46      
(1.4%)   -0.1% (  -2% -    2%) 0.745
                            LowTerm      403.54      (2.9%)      403.22      
(2.4%)   -0.1% (  -5% -    5%) 0.923
        BrowseRandomLabelTaxoFacets        3.20      (0.7%)        3.20      
(0.9%)   -0.0% (  -1% -    1%) 0.905
           AndHighHighDayTaxoFacets        8.06      (1.3%)        8.06      
(1.6%)    0.0% (  -2% -    2%) 0.962
          BrowseDayOfYearTaxoFacets        3.76      (0.6%)        3.76      
(0.6%)    0.0% (  -1% -    1%) 0.859
              BrowseMonthTaxoFacets        3.62      (1.0%)        3.62      
(1.0%)    0.1% (  -1% -    2%) 0.866
                      OrHighNotHigh      156.16      (6.0%)      156.26      
(5.9%)    0.1% ( -11% -   12%) 0.972
               BrowseDateTaxoFacets        3.73      (0.6%)        3.73      
(0.6%)    0.1% (  -1% -    1%) 0.722
                      OrNotHighHigh      144.55      (5.1%)      144.68      
(4.7%)    0.1% (  -9% -   10%) 0.957
               MedTermDayTaxoFacets       17.57      (2.8%)       17.59      
(2.8%)    0.2% (  -5% -    5%) 0.863
          BrowseDayOfYearSSDVFacets        3.22      (0.9%)        3.23      
(0.7%)    0.2% (  -1% -    1%) 0.424
                           HighTerm      401.13      (5.4%)      402.30      
(5.7%)    0.3% ( -10% -   12%) 0.868
              BrowseMonthSSDVFacets        3.27      (0.7%)        3.28      
(1.0%)    0.3% (  -1% -    2%) 0.202
                        AndHighHigh       20.87      (2.8%)       20.94      
(2.5%)    0.4% (  -4% -    5%) 0.670
                   HighSloppyPhrase        6.58      (3.3%)        6.61      
(3.4%)    0.4% (  -6% -    7%) 0.727
                         AndHighMed       89.34      (1.8%)       89.76      
(1.4%)    0.5% (  -2% -    3%) 0.355
               BrowseDateSSDVFacets        0.95      (2.8%)        0.95      
(4.0%)    0.5% (  -6% -    7%) 0.656
                       OrNotHighLow      420.17      (2.2%)      422.34      
(2.2%)    0.5% (  -3% -    4%) 0.452
                    LowSloppyPhrase        2.89      (2.4%)        2.91      
(1.9%)    0.6% (  -3% -    4%) 0.369
                       OrNotHighMed      219.50      (4.5%)      221.02      
(4.1%)    0.7% (  -7% -    9%) 0.611
                    MedSloppyPhrase       10.44      (2.2%)       10.52      
(1.4%)    0.7% (  -2% -    4%) 0.222
                       OrHighNotLow      288.48      (5.4%)      290.70      
(5.7%)    0.8% (  -9% -   12%) 0.663
                          OrHighMed       53.25      (3.6%)       53.66      
(3.5%)    0.8% (  -6% -    8%) 0.488
               HighTermTitleBDVSort        2.77      (7.2%)        2.79      
(7.0%)    0.9% ( -12% -   16%) 0.699
                       OrHighNotMed      270.38      (5.7%)      272.88      
(5.4%)    0.9% (  -9% -   12%) 0.601
                         OrHighHigh       20.86      (5.1%)       21.08      
(5.2%)    1.1% (  -8% -   12%) 0.519
                          OrHighLow      220.40      (4.2%)      223.52      
(5.6%)    1.4% (  -8% -   11%) 0.367
        BrowseRandomLabelSSDVFacets        2.32      (4.3%)        2.38      
(7.6%)    2.3% (  -9% -   14%) 0.240
                         AndHighLow      395.82      (2.9%)      405.36      
(2.8%)    2.4% (  -3% -    8%) 0.008
                  HighTermMonthSort     2375.71      (3.7%)     2555.72      
(5.0%)    7.6% (  -1% -   16%) 0.000
                           PKLookup      140.60      (1.8%)      178.02      
(1.3%)   26.6% (  23% -   30%) 0.000
   
   
   After
                               TaskQPS baseline      StdDevQPS 
my_modified_version      StdDev                Pct diff p-value
                           Wildcard       37.77      (2.7%)        5.70      
(1.3%)  -84.9% ( -86% -  -83%) 0.000
                            Prefix3       52.27      (2.7%)       22.71      
(1.8%)  -56.6% ( -59% -  -53%) 0.000
                             Fuzzy1       59.96      (1.6%)       54.80      
(2.3%)   -8.6% ( -12% -   -4%) 0.000
                  HighTermTitleSort      106.17      (2.1%)      101.30      
(1.5%)   -4.6% (  -7% -   -1%) 0.000
                             Fuzzy2       33.40      (1.3%)       32.14      
(1.6%)   -3.8% (  -6% -    0%) 0.000
                            MedTerm      273.84      (5.4%)      265.92      
(9.7%)   -2.9% ( -17% -   12%) 0.245
                           HighTerm      349.66      (5.2%)      341.63      
(8.9%)   -2.3% ( -15% -   12%) 0.320
                            LowTerm      356.23      (3.1%)      350.27      
(4.3%)   -1.7% (  -8% -    5%) 0.156
                    LowSloppyPhrase        4.47      (2.1%)        4.44      
(4.6%)   -0.8% (  -7% -    6%) 0.492
                       HighSpanNear        8.12      (2.1%)        8.06      
(2.5%)   -0.7% (  -5% -    4%) 0.331
                    MedSloppyPhrase       31.05      (3.6%)       30.83      
(4.1%)   -0.7% (  -8% -    7%) 0.559
                MedIntervalsOrdered        3.88      (3.3%)        3.86      
(3.3%)   -0.7% (  -7% -    6%) 0.519
                        MedSpanNear        8.94      (1.2%)        8.88      
(1.6%)   -0.7% (  -3% -    2%) 0.126
                LowIntervalsOrdered        7.40      (3.3%)        7.35      
(3.4%)   -0.7% (  -7% -    6%) 0.537
                        LowSpanNear       29.33      (2.0%)       29.15      
(2.2%)   -0.6% (  -4% -    3%) 0.374
                   HighSloppyPhrase        6.68      (3.6%)        6.64      
(3.5%)   -0.5% (  -7% -    6%) 0.624
               MedTermDayTaxoFacets        9.14      (3.0%)        9.11      
(8.2%)   -0.3% ( -11% -   11%) 0.861
                         HighPhrase      115.62      (3.9%)      115.24      
(4.0%)   -0.3% (  -7% -    7%) 0.798                                            
    [162/1927]
                        AndHighHigh       13.95      (4.2%)       13.92      
(4.5%)   -0.3% (  -8% -    8%) 0.847
              BrowseMonthSSDVFacets        3.30      (0.8%)        3.29      
(1.0%)   -0.3% (  -2% -    1%) 0.377
                         AndHighMed       85.40      (2.0%)       85.18      
(2.1%)   -0.2% (  -4% -    3%) 0.695
                             IntNRQ       16.65      (4.1%)       16.63      
(3.8%)   -0.1% (  -7% -    8%) 0.914
        BrowseRandomLabelSSDVFacets        2.30      (0.9%)        2.30      
(1.1%)   -0.1% (  -2% -    1%) 0.754
                         OrHighHigh       24.99      (6.1%)       24.97      
(5.3%)   -0.1% ( -10% -   12%) 0.957
           AndHighHighDayTaxoFacets        2.29      (2.8%)        2.29      
(2.4%)   -0.0% (  -5% -    5%) 0.977
            AndHighMedDayTaxoFacets       40.17      (1.4%)       40.20      
(1.4%)    0.1% (  -2% -    2%) 0.872
             OrHighMedDayTaxoFacets        3.15      (3.9%)        3.15      
(3.4%)    0.1% (  -6% -    7%) 0.946
                          LowPhrase       30.23      (2.5%)       30.26      
(2.4%)    0.1% (  -4% -    5%) 0.911
                       OrNotHighLow      201.78      (2.9%)      202.03      
(3.0%)    0.1% (  -5% -    6%) 0.896
        BrowseRandomLabelTaxoFacets        3.20      (3.2%)        3.21      
(4.1%)    0.1% (  -6% -    7%) 0.899
               HighIntervalsOrdered        0.42      (1.9%)        0.42      
(1.6%)    0.2% (  -3% -    3%) 0.699
                       OrNotHighMed      235.49      (5.5%)      236.01      
(4.7%)    0.2% (  -9% -   11%) 0.892
              BrowseMonthTaxoFacets        3.62      (1.0%)        3.63      
(1.0%)    0.2% (  -1% -    2%) 0.477
                      OrNotHighHigh      329.77      (4.9%)      330.79      
(5.5%)    0.3% (  -9% -   11%) 0.851
                          MedPhrase       35.79      (3.4%)       35.90      
(3.4%)    0.3% (  -6% -    7%) 0.771
                         TermDTSort      112.10      (3.2%)      112.45      
(3.4%)    0.3% (  -6% -    7%) 0.763
               BrowseDateSSDVFacets        0.97      (7.1%)        0.98     
(10.0%)    0.4% ( -15% -   18%) 0.897
          BrowseDayOfYearSSDVFacets        3.21      (2.2%)        3.22      
(1.6%)    0.4% (  -3% -    4%) 0.525
              HighTermDayOfYearSort      235.24      (2.1%)      236.16      
(1.6%)    0.4% (  -3% -    4%) 0.512
                          OrHighMed       70.60      (3.3%)       70.99      
(2.7%)    0.5% (  -5% -    6%) 0.571
                         AndHighLow      370.31      (3.2%)      372.60      
(3.4%)    0.6% (  -5% -    7%) 0.559
               HighTermTitleBDVSort        5.53      (4.1%)        5.56      
(4.5%)    0.6% (  -7% -    9%) 0.648
                       OrHighNotLow      263.18      (5.6%)      264.95      
(6.2%)    0.7% ( -10% -   13%) 0.717
                       OrHighNotMed      222.41      (5.8%)      224.06      
(5.8%)    0.7% ( -10% -   13%) 0.688
                      OrHighNotHigh      233.04      (5.5%)      234.89      
(5.8%)    0.8% (  -9% -   12%) 0.657
                          OrHighLow      463.17      (3.0%)      466.91      
(3.1%)    0.8% (  -5% -    7%) 0.403
          BrowseDayOfYearTaxoFacets        3.77      (0.6%)        3.84      
(8.9%)    1.9% (  -7% -   11%) 0.342
               BrowseDateTaxoFacets        3.74      (0.6%)        3.81      
(8.8%)    1.9% (  -7% -   11%) 0.332
                  HighTermMonthSort     2350.73      (4.0%)     2477.32      
(4.5%)    5.4% (  -2% -   14%) 0.000
                            Respell       30.81      (1.5%)       34.87      
(1.7%)   13.2% (   9% -   16%) 0.000
                           PKLookup      141.03      (1.9%)      177.54      
(2.0%)   25.9% (  21% -   30%) 0.000
   
   ```
   
   I tried to modify the bench task file and only run `WildCard` to understand 
where the time is spent.
   
   ### My version
   
![image](https://github.com/apache/lucene/assets/7917591/cf0589da-befd-4250-a4fb-6d2e4f65c0cf)
   
   ### mainline
   
![image](https://github.com/apache/lucene/assets/7917591/dd8b820d-4471-429f-909d-55b1d5ff35d9)
   
   
   So we can see that the most time is spent in actually reading out the FST 
arcs and FSA transitions... My intuitive explanation for why this is slower 
than the blocktree is that it has worse locality in its data access pattern. 
(@mikemccand maybe you can shed some light) Here are some relevant factors:
   * The FST is larger as it contains all terms. So there are more Arcs to 
visit. Blocktree (main) use the FST to index prefixes.
   * When binary-searching or directly address Arc/Transitions the target is 
somewhat random. 
   * The FST bytes are read backwards. (probably less of an issue if we read 
sequentially on modern HW)
   * Blocktree at a given node reads bytes sequentially and terms are sorted, 
too.
   
   
   Just out of curiosity I altered my code to load the FST on-heap to compare 
with the default off-heap option. It did not help much with `Wildcard` but 
PKLookup got substantially faster!
   
   The PKLookup task is a great proxy to FST performance, as both versions of 
the code visits the exact same number of Arcs. 
   
   ```
   
   Off heap
                           Wildcard       47.56      (1.7%)       10.13      
(0.4%)  -78.7% ( -79% -  -77%) 0.000
                           PKLookup      136.03      (2.4%)      147.93      
(2.3%)    8.8% (   3% -   13%) 0.000
   
   on heap
                           Wildcard       37.11      (1.5%)        8.35      
(0.3%)  -77.5% ( -78% -  -76%) 0.000
                           PKLookup      136.04      (3.3%)      269.60      
(9.0%)   98.2% (  83% -  114%) 0.000
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Random access term dictionary [lucene]

Reply via email to