[GitHub] [lucene] gautamworah96 opened a new pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyWriter

GitBox Mon, 14 Jun 2021 00:08:25 -0700


gautamworah96 opened a new pull request #179:
URL: https://github.com/apache/lucene/pull/179



   <!--
   _(If you are a project committer then you may remove some/all of the 
following template.)_
   
   Before creating a pull request, please file an issue in the ASF Jira system 
for Lucene:
   
   * https://issues.apache.org/jira/projects/LUCENE
   
   You will need to create an account in Jira in order to create an issue.
   
   The title of the PR should reference the Jira issue number in the form:
   
   * LUCENE-####: <short description of problem or changes>
   
   LUCENE must be fully capitalized. A short description helps people scanning 
pull requests for items they can work on.
   
   Properly referencing the issue in the title ensures that Jira is correctly 
updated with code review comments and commits. -->
   
   
   # Description
   
   In LUCENE-9450 we switched the Taxonomy index from Stored Fields to 
BinaryDocValues. In the resulting implementation of the getPath code, we create 
a new BinaryDocValues's values instance for each ordinal.
   It may happen that we may traverse over the same nodes over and over again 
if the getPath API is called multiple times for ordinals in the same 
segment/with the same readerIndex.
   
   This PR takes advantage of that fact by sorting ordinals and then trying to 
find out if some of the ordinals are present in the same segment/have the same 
readerIndex (by trying to advanceExact to the correct position and not failing) 
thereby allowing us to reuse the previous BinaryDocValues object.
   
   **This PR was originally proposed in the older `lucene-solr` repo 
[here](https://github.com/apache/lucene-solr/pull/2247). I discontinued it 
because we could not find a good "real" use case for it.**
   
   **This new PR uses the `getBulkPath` API inside `IntTaxonomyFacetCounts` and 
`FloatTaxonomyFacetCounts` classes.** In the `getTopChildren` method within 
these classes, we usually collect all the `top-n` ordinals first and then 
translate them into `FacetLabels` one by one using an iterative API. This call 
has been replaced with the `getBulkPath` method in this PR.
   
   # Solution
   
   Steps:
   
   1. Sort all ordinals and remember their position so as to store the path in 
the correct position
   2. Try to advanceExact to the correct position with the previously 
calculated readerIndex. If the operation fails, try to find the correct segment 
for the ordinal and then advanceExact to the desired position.
    3.  Store this position for future ordinals.
   
   
   # Tests
   
   1. Added a new randomized test for the API that compares the individual 
`getPath` results from ordinals with the bulk `FacetLabels` returned by the 
getBulkPath API
   2. Added a backwards compatibility test for the API that reads an older 
StoredFields based index
   
   Benchmarks:
   
   The stock `luceneutil` taxonomy tasks 
[call](https://github.com/mikemccand/luceneutil/blob/4d285784b41ea73f771aeffbc0addf7b626a3acf/src/main/perf/SearchTask.java#L208)
 `getTopChildren(10, ..)` for testing the performance of taxonomy facet 
counting classes. Since `FastTaxonomyFacetCounts` inherits from 
`IntTaxonomyFacetCounts` (and uses the same `getTopChildren` logic) the user 
should be able to reap the benefits of the bulk API directly.
   
   `luceneutil` that tests the top 10 children:
   
   ```
                      TaskQPS baseline      StdDevQPS my_modified_version      
StdDev                Pct diff p-value
                     Fuzzy1       44.07      (6.0%)       42.44      (7.2%)   
-3.7% ( -16% -   10%) 0.078
          HighTermMonthSort       22.69     (12.1%)       21.99     (11.3%)   
-3.1% ( -23% -   23%) 0.402
       HighTermTitleBDVSort       19.59     (12.4%)       18.99     (12.4%)   
-3.1% ( -24% -   24%) 0.434
      HighTermDayOfYearSort       54.33     (15.7%)       53.11     (16.3%)   
-2.3% ( -29% -   35%) 0.656
                 TermDTSort       15.58      (9.1%)       15.24     (10.9%)   
-2.2% ( -20% -   19%) 0.488
                   HighTerm     1010.56      (4.0%)      988.75      (5.2%)   
-2.2% ( -10% -    7%) 0.142
              OrHighNotHigh      422.64      (4.4%)      415.30      (5.6%)   
-1.7% ( -11% -    8%) 0.274
                    MedTerm      734.85      (5.5%)      722.45      (4.8%)   
-1.7% ( -11% -    9%) 0.301
                     Fuzzy2       14.51      (3.6%)       14.39      (2.9%)   
-0.9% (  -7% -    5%) 0.399
                AndHighHigh       18.22      (3.4%)       18.07      (2.6%)   
-0.8% (  -6% -    5%) 0.388
                 AndHighLow      181.77      (1.7%)      180.51      (2.3%)   
-0.7% (  -4% -    3%) 0.277
                    Respell       30.50      (2.0%)       30.36      (1.7%)   
-0.5% (  -4% -    3%) 0.438
               OrNotHighLow      347.94      (4.7%)      346.46      (3.6%)   
-0.4% (  -8% -    8%) 0.747
                LowSpanNear        3.83      (1.9%)        3.82      (2.2%)   
-0.3% (  -4% -    3%) 0.626
                 HighPhrase       39.95      (7.0%)       39.85      (7.3%)   
-0.3% ( -13% -   15%) 0.909
               OrHighNotMed      405.04      (4.9%)      404.00      (5.4%)   
-0.3% ( -10% -   10%) 0.874
                  OrHighLow      110.15      (3.0%)      109.87      (3.2%)   
-0.2% (  -6% -    6%) 0.800
                  OrHighMed       61.11      (3.1%)       60.96      (2.9%)   
-0.2% (  -6% -    5%) 0.795
            MedSloppyPhrase        5.47      (6.0%)        5.46      (3.1%)   
-0.2% (  -8% -    9%) 0.890
                MedSpanNear       12.25      (1.6%)       12.24      (2.0%)   
-0.1% (  -3% -    3%) 0.887
                     IntNRQ       25.39      (1.3%)       25.37      (2.0%)   
-0.1% (  -3% -    3%) 0.880
                 AndHighMed       64.96      (3.1%)       64.95      (2.7%)   
-0.0% (  -5% -    6%) 0.977
               HighSpanNear       17.37      (2.0%)       17.38      (3.7%)    
0.0% (  -5% -    5%) 0.963
            LowSloppyPhrase       55.39      (4.0%)       55.42      (2.6%)    
0.1% (  -6% -    6%) 0.958
                 OrHighHigh        9.85      (2.1%)        9.86      (2.9%)    
0.1% (  -4% -    5%) 0.918
                   Wildcard       29.06      (4.5%)       29.08      (3.4%)    
0.1% (  -7% -    8%) 0.942
       HighIntervalsOrdered        7.97      (2.5%)        7.98      (3.1%)    
0.1% (  -5% -    5%) 0.907
                    LowTerm      883.85      (4.4%)      884.97      (6.1%)    
0.1% (  -9% -   11%) 0.940
                    Prefix3       40.49      (7.1%)       40.55      (5.2%)    
0.2% ( -11% -   13%) 0.939
               OrHighNotLow      440.52      (5.1%)      441.49      (4.8%)    
0.2% (  -9% -   10%) 0.888
                  MedPhrase      138.15      (4.3%)      138.51      (4.7%)    
0.3% (  -8% -    9%) 0.855
                  LowPhrase      111.07      (2.9%)      111.40      (3.0%)    
0.3% (  -5% -    6%) 0.751
               OrNotHighMed      375.26      (6.7%)      376.64      (6.9%)    
0.4% ( -12% -   15%) 0.865
      BrowseMonthTaxoFacets        0.60      (5.2%)        0.60      (5.6%)    
0.4% (  -9% -   11%) 0.806
           HighSloppyPhrase        2.11      (5.7%)        2.12      (3.5%)    
0.5% (  -8% -   10%) 0.730
       BrowseDateTaxoFacets        0.57      (5.1%)        0.58      (5.5%)    
0.5% (  -9% -   11%) 0.746
   BrowseDayOfYearTaxoFacets        0.57      (5.2%)        0.58      (5.5%)    
0.6% (  -9% -   11%) 0.742
              OrNotHighHigh      391.99      (5.2%)      394.52      (5.3%)    
0.6% (  -9% -   11%) 0.701
                   PKLookup      113.96      (3.1%)      115.11      (3.4%)    
1.0% (  -5% -    7%) 0.320
      BrowseMonthSSDVFacets        2.62      (1.0%)        2.69      (1.8%)    
3.0% (   0% -    5%) 0.000
   BrowseDayOfYearSSDVFacets        2.46      (0.3%)        2.54      (2.7%)    
3.3% (   0% -    6%) 0.000
   ```
   
   Since, the previous results did not show any gain in the `Browse*TaxoFacets` 
tasks, I realized that maybe it was because the ordinalCache with a stock size 
of 4000 had all ordinals cached which led to the bulk API behaving the same as 
the iterative `getPath` call.
   
   The next benchmark tested the `getTopChildren` call with 10,000 children
   
   `luceneutil` that tests the top 10000 children:
   
   ```
                       TaskQPS baseline      StdDevQPS my_modified_version      
StdDev                Pct diff p-value
                     Fuzzy2       35.18      (4.4%)       34.81      (7.4%)   
-1.0% ( -12% -   11%) 0.588
          HighTermMonthSort       47.71     (12.1%)       47.28     (11.8%)   
-0.9% ( -22% -   26%) 0.812
                 AndHighMed       33.50      (3.4%)       33.35      (2.6%)   
-0.4% (  -6% -    5%) 0.648
                  OrHighMed       31.86      (2.2%)       31.75      (1.9%)   
-0.3% (  -4% -    3%) 0.595
       HighIntervalsOrdered        2.59      (1.8%)        2.58      (1.4%)   
-0.3% (  -3% -    2%) 0.540
                  OrHighLow      175.82      (2.7%)      175.43      (3.4%)   
-0.2% (  -6% -    5%) 0.816
                AndHighHigh       29.36      (3.9%)       29.33      (2.7%)   
-0.1% (  -6% -    6%) 0.915
            MedSloppyPhrase       18.63      (2.0%)       18.61      (2.6%)   
-0.1% (  -4% -    4%) 0.909
                 TermDTSort       37.97     (10.0%)       37.97     (10.2%)    
0.0% ( -18% -   22%) 0.997
                     Fuzzy1       24.97      (4.4%)       24.97      (4.1%)    
0.0% (  -8% -    8%) 0.983
                MedSpanNear       51.30      (2.0%)       51.35      (1.6%)    
0.1% (  -3% -    3%) 0.860
      HighTermDayOfYearSort       45.86     (16.5%)       45.92     (16.4%)    
0.1% ( -28% -   39%) 0.980
                 OrHighHigh        8.55      (2.1%)        8.56      (1.9%)    
0.1% (  -3% -    4%) 0.826
               HighSpanNear        9.30      (1.7%)        9.32      (1.7%)    
0.2% (  -3% -    3%) 0.766
       BrowseDateTaxoFacets        0.58      (4.7%)        0.58      (4.8%)    
0.2% (  -8% -   10%) 0.915
      BrowseMonthTaxoFacets        0.60      (4.9%)        0.60      (4.8%)    
0.2% (  -9% -   10%) 0.894
   BrowseDayOfYearTaxoFacets        0.57      (4.7%)        0.58      (4.8%)    
0.3% (  -8% -   10%) 0.861
                   PKLookup      116.98      (3.6%)      117.30      (3.1%)    
0.3% (  -6% -    7%) 0.794
            LowSloppyPhrase        5.58      (4.0%)        5.60      (3.9%)    
0.3% (  -7% -    8%) 0.785
                LowSpanNear       11.11      (1.6%)       11.15      (1.7%)    
0.4% (  -2% -    3%) 0.451
       HighTermTitleBDVSort       25.79      (7.2%)       25.89      (8.7%)    
0.4% ( -14% -   17%) 0.875
                     IntNRQ       25.27      (1.5%)       25.37      (1.3%)    
0.4% (  -2% -    3%) 0.347
                  LowPhrase       40.67      (2.2%)       40.84      (1.4%)    
0.4% (  -3% -    4%) 0.478
           HighSloppyPhrase        5.99      (4.1%)        6.01      (4.1%)    
0.4% (  -7% -    8%) 0.745
                   Wildcard       20.27      (3.5%)       20.41      (3.4%)    
0.7% (  -5% -    7%) 0.500
                    Respell       34.15      (2.1%)       34.41      (1.4%)    
0.8% (  -2% -    4%) 0.175
                 AndHighLow      358.53      (3.8%)      361.84      (3.8%)    
0.9% (  -6% -    8%) 0.438
               OrNotHighLow      369.77      (4.3%)      374.69      (2.7%)    
1.3% (  -5% -    8%) 0.239
                    MedTerm      980.66      (4.2%)      994.40      (5.2%)    
1.4% (  -7% -   11%) 0.347
              OrHighNotHigh      388.09      (4.6%)      394.10      (4.0%)    
1.5% (  -6% -   10%) 0.258
               OrNotHighMed      419.20      (3.9%)      426.30      (3.9%)    
1.7% (  -5% -    9%) 0.173
                  MedPhrase      281.59      (4.2%)      287.03      (3.3%)    
1.9% (  -5% -    9%) 0.107
                    LowTerm      758.64      (5.7%)      775.00      (4.9%)    
2.2% (  -8% -   13%) 0.200
                 HighPhrase       69.25      (6.5%)       70.85      (6.2%)    
2.3% (  -9% -   16%) 0.250
              OrNotHighHigh      517.94      (5.7%)      530.26      (3.5%)    
2.4% (  -6% -   12%) 0.111
               OrHighNotMed      438.39      (5.5%)      449.26      (3.8%)    
2.5% (  -6% -   12%) 0.096
                   HighTerm      650.61      (4.9%)      667.25      (4.4%)    
2.6% (  -6% -   12%) 0.081
               OrHighNotLow      308.34      (4.3%)      317.07      (5.0%)    
2.8% (  -6% -   12%) 0.053
                    Prefix3       88.13      (8.6%)       91.32      (8.5%)    
3.6% ( -12% -   22%) 0.181
      BrowseMonthSSDVFacets        2.61      (1.0%)        2.71      (1.4%)    
3.9% (   1% -    6%) 0.000
   BrowseDayOfYearSSDVFacets        2.46      (0.4%)        2.56      (2.1%)    
4.1% (   1% -    6%) 0.000
   ```
   
   I surprisingly see some gain in `BrowseDayOfYearSSDVFacets` and 
`BrowseMonthSSDVFacets` tasks but I think that is noise (because taxonomy 
facets have a different impl. as compared to `SortedSetDocValuesFacets`). 
   `Browse*TaxoFacets` tasks don't show gain.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gautamworah96 opened a new pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyWriter

Reply via email to