gautamworah96 opened a new pull request #179: URL: https://github.com/apache/lucene/pull/179
<!-- _(If you are a project committer then you may remove some/all of the following template.)_ Before creating a pull request, please file an issue in the ASF Jira system for Lucene: * https://issues.apache.org/jira/projects/LUCENE You will need to create an account in Jira in order to create an issue. The title of the PR should reference the Jira issue number in the form: * LUCENE-####: <short description of problem or changes> LUCENE must be fully capitalized. A short description helps people scanning pull requests for items they can work on. Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. --> # Description In LUCENE-9450 we switched the Taxonomy index from Stored Fields to BinaryDocValues. In the resulting implementation of the getPath code, we create a new BinaryDocValues's values instance for each ordinal. It may happen that we may traverse over the same nodes over and over again if the getPath API is called multiple times for ordinals in the same segment/with the same readerIndex. This PR takes advantage of that fact by sorting ordinals and then trying to find out if some of the ordinals are present in the same segment/have the same readerIndex (by trying to advanceExact to the correct position and not failing) thereby allowing us to reuse the previous BinaryDocValues object. **This PR was originally proposed in the older `lucene-solr` repo [here](https://github.com/apache/lucene-solr/pull/2247). I discontinued it because we could not find a good "real" use case for it.** **This new PR uses the `getBulkPath` API inside `IntTaxonomyFacetCounts` and `FloatTaxonomyFacetCounts` classes.** In the `getTopChildren` method within these classes, we usually collect all the `top-n` ordinals first and then translate them into `FacetLabels` one by one using an iterative API. This call has been replaced with the `getBulkPath` method in this PR. # Solution Steps: 1. Sort all ordinals and remember their position so as to store the path in the correct position 2. Try to advanceExact to the correct position with the previously calculated readerIndex. If the operation fails, try to find the correct segment for the ordinal and then advanceExact to the desired position. 3. Store this position for future ordinals. # Tests 1. Added a new randomized test for the API that compares the individual `getPath` results from ordinals with the bulk `FacetLabels` returned by the getBulkPath API 2. Added a backwards compatibility test for the API that reads an older StoredFields based index Benchmarks: The stock `luceneutil` taxonomy tasks [call](https://github.com/mikemccand/luceneutil/blob/4d285784b41ea73f771aeffbc0addf7b626a3acf/src/main/perf/SearchTask.java#L208) `getTopChildren(10, ..)` for testing the performance of taxonomy facet counting classes. Since `FastTaxonomyFacetCounts` inherits from `IntTaxonomyFacetCounts` (and uses the same `getTopChildren` logic) the user should be able to reap the benefits of the bulk API directly. `luceneutil` that tests the top 10 children: ``` TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value Fuzzy1 44.07 (6.0%) 42.44 (7.2%) -3.7% ( -16% - 10%) 0.078 HighTermMonthSort 22.69 (12.1%) 21.99 (11.3%) -3.1% ( -23% - 23%) 0.402 HighTermTitleBDVSort 19.59 (12.4%) 18.99 (12.4%) -3.1% ( -24% - 24%) 0.434 HighTermDayOfYearSort 54.33 (15.7%) 53.11 (16.3%) -2.3% ( -29% - 35%) 0.656 TermDTSort 15.58 (9.1%) 15.24 (10.9%) -2.2% ( -20% - 19%) 0.488 HighTerm 1010.56 (4.0%) 988.75 (5.2%) -2.2% ( -10% - 7%) 0.142 OrHighNotHigh 422.64 (4.4%) 415.30 (5.6%) -1.7% ( -11% - 8%) 0.274 MedTerm 734.85 (5.5%) 722.45 (4.8%) -1.7% ( -11% - 9%) 0.301 Fuzzy2 14.51 (3.6%) 14.39 (2.9%) -0.9% ( -7% - 5%) 0.399 AndHighHigh 18.22 (3.4%) 18.07 (2.6%) -0.8% ( -6% - 5%) 0.388 AndHighLow 181.77 (1.7%) 180.51 (2.3%) -0.7% ( -4% - 3%) 0.277 Respell 30.50 (2.0%) 30.36 (1.7%) -0.5% ( -4% - 3%) 0.438 OrNotHighLow 347.94 (4.7%) 346.46 (3.6%) -0.4% ( -8% - 8%) 0.747 LowSpanNear 3.83 (1.9%) 3.82 (2.2%) -0.3% ( -4% - 3%) 0.626 HighPhrase 39.95 (7.0%) 39.85 (7.3%) -0.3% ( -13% - 15%) 0.909 OrHighNotMed 405.04 (4.9%) 404.00 (5.4%) -0.3% ( -10% - 10%) 0.874 OrHighLow 110.15 (3.0%) 109.87 (3.2%) -0.2% ( -6% - 6%) 0.800 OrHighMed 61.11 (3.1%) 60.96 (2.9%) -0.2% ( -6% - 5%) 0.795 MedSloppyPhrase 5.47 (6.0%) 5.46 (3.1%) -0.2% ( -8% - 9%) 0.890 MedSpanNear 12.25 (1.6%) 12.24 (2.0%) -0.1% ( -3% - 3%) 0.887 IntNRQ 25.39 (1.3%) 25.37 (2.0%) -0.1% ( -3% - 3%) 0.880 AndHighMed 64.96 (3.1%) 64.95 (2.7%) -0.0% ( -5% - 6%) 0.977 HighSpanNear 17.37 (2.0%) 17.38 (3.7%) 0.0% ( -5% - 5%) 0.963 LowSloppyPhrase 55.39 (4.0%) 55.42 (2.6%) 0.1% ( -6% - 6%) 0.958 OrHighHigh 9.85 (2.1%) 9.86 (2.9%) 0.1% ( -4% - 5%) 0.918 Wildcard 29.06 (4.5%) 29.08 (3.4%) 0.1% ( -7% - 8%) 0.942 HighIntervalsOrdered 7.97 (2.5%) 7.98 (3.1%) 0.1% ( -5% - 5%) 0.907 LowTerm 883.85 (4.4%) 884.97 (6.1%) 0.1% ( -9% - 11%) 0.940 Prefix3 40.49 (7.1%) 40.55 (5.2%) 0.2% ( -11% - 13%) 0.939 OrHighNotLow 440.52 (5.1%) 441.49 (4.8%) 0.2% ( -9% - 10%) 0.888 MedPhrase 138.15 (4.3%) 138.51 (4.7%) 0.3% ( -8% - 9%) 0.855 LowPhrase 111.07 (2.9%) 111.40 (3.0%) 0.3% ( -5% - 6%) 0.751 OrNotHighMed 375.26 (6.7%) 376.64 (6.9%) 0.4% ( -12% - 15%) 0.865 BrowseMonthTaxoFacets 0.60 (5.2%) 0.60 (5.6%) 0.4% ( -9% - 11%) 0.806 HighSloppyPhrase 2.11 (5.7%) 2.12 (3.5%) 0.5% ( -8% - 10%) 0.730 BrowseDateTaxoFacets 0.57 (5.1%) 0.58 (5.5%) 0.5% ( -9% - 11%) 0.746 BrowseDayOfYearTaxoFacets 0.57 (5.2%) 0.58 (5.5%) 0.6% ( -9% - 11%) 0.742 OrNotHighHigh 391.99 (5.2%) 394.52 (5.3%) 0.6% ( -9% - 11%) 0.701 PKLookup 113.96 (3.1%) 115.11 (3.4%) 1.0% ( -5% - 7%) 0.320 BrowseMonthSSDVFacets 2.62 (1.0%) 2.69 (1.8%) 3.0% ( 0% - 5%) 0.000 BrowseDayOfYearSSDVFacets 2.46 (0.3%) 2.54 (2.7%) 3.3% ( 0% - 6%) 0.000 ``` Since, the previous results did not show any gain in the `Browse*TaxoFacets` tasks, I realized that maybe it was because the ordinalCache with a stock size of 4000 had all ordinals cached which led to the bulk API behaving the same as the iterative `getPath` call. The next benchmark tested the `getTopChildren` call with 10,000 children `luceneutil` that tests the top 10000 children: ``` TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value Fuzzy2 35.18 (4.4%) 34.81 (7.4%) -1.0% ( -12% - 11%) 0.588 HighTermMonthSort 47.71 (12.1%) 47.28 (11.8%) -0.9% ( -22% - 26%) 0.812 AndHighMed 33.50 (3.4%) 33.35 (2.6%) -0.4% ( -6% - 5%) 0.648 OrHighMed 31.86 (2.2%) 31.75 (1.9%) -0.3% ( -4% - 3%) 0.595 HighIntervalsOrdered 2.59 (1.8%) 2.58 (1.4%) -0.3% ( -3% - 2%) 0.540 OrHighLow 175.82 (2.7%) 175.43 (3.4%) -0.2% ( -6% - 5%) 0.816 AndHighHigh 29.36 (3.9%) 29.33 (2.7%) -0.1% ( -6% - 6%) 0.915 MedSloppyPhrase 18.63 (2.0%) 18.61 (2.6%) -0.1% ( -4% - 4%) 0.909 TermDTSort 37.97 (10.0%) 37.97 (10.2%) 0.0% ( -18% - 22%) 0.997 Fuzzy1 24.97 (4.4%) 24.97 (4.1%) 0.0% ( -8% - 8%) 0.983 MedSpanNear 51.30 (2.0%) 51.35 (1.6%) 0.1% ( -3% - 3%) 0.860 HighTermDayOfYearSort 45.86 (16.5%) 45.92 (16.4%) 0.1% ( -28% - 39%) 0.980 OrHighHigh 8.55 (2.1%) 8.56 (1.9%) 0.1% ( -3% - 4%) 0.826 HighSpanNear 9.30 (1.7%) 9.32 (1.7%) 0.2% ( -3% - 3%) 0.766 BrowseDateTaxoFacets 0.58 (4.7%) 0.58 (4.8%) 0.2% ( -8% - 10%) 0.915 BrowseMonthTaxoFacets 0.60 (4.9%) 0.60 (4.8%) 0.2% ( -9% - 10%) 0.894 BrowseDayOfYearTaxoFacets 0.57 (4.7%) 0.58 (4.8%) 0.3% ( -8% - 10%) 0.861 PKLookup 116.98 (3.6%) 117.30 (3.1%) 0.3% ( -6% - 7%) 0.794 LowSloppyPhrase 5.58 (4.0%) 5.60 (3.9%) 0.3% ( -7% - 8%) 0.785 LowSpanNear 11.11 (1.6%) 11.15 (1.7%) 0.4% ( -2% - 3%) 0.451 HighTermTitleBDVSort 25.79 (7.2%) 25.89 (8.7%) 0.4% ( -14% - 17%) 0.875 IntNRQ 25.27 (1.5%) 25.37 (1.3%) 0.4% ( -2% - 3%) 0.347 LowPhrase 40.67 (2.2%) 40.84 (1.4%) 0.4% ( -3% - 4%) 0.478 HighSloppyPhrase 5.99 (4.1%) 6.01 (4.1%) 0.4% ( -7% - 8%) 0.745 Wildcard 20.27 (3.5%) 20.41 (3.4%) 0.7% ( -5% - 7%) 0.500 Respell 34.15 (2.1%) 34.41 (1.4%) 0.8% ( -2% - 4%) 0.175 AndHighLow 358.53 (3.8%) 361.84 (3.8%) 0.9% ( -6% - 8%) 0.438 OrNotHighLow 369.77 (4.3%) 374.69 (2.7%) 1.3% ( -5% - 8%) 0.239 MedTerm 980.66 (4.2%) 994.40 (5.2%) 1.4% ( -7% - 11%) 0.347 OrHighNotHigh 388.09 (4.6%) 394.10 (4.0%) 1.5% ( -6% - 10%) 0.258 OrNotHighMed 419.20 (3.9%) 426.30 (3.9%) 1.7% ( -5% - 9%) 0.173 MedPhrase 281.59 (4.2%) 287.03 (3.3%) 1.9% ( -5% - 9%) 0.107 LowTerm 758.64 (5.7%) 775.00 (4.9%) 2.2% ( -8% - 13%) 0.200 HighPhrase 69.25 (6.5%) 70.85 (6.2%) 2.3% ( -9% - 16%) 0.250 OrNotHighHigh 517.94 (5.7%) 530.26 (3.5%) 2.4% ( -6% - 12%) 0.111 OrHighNotMed 438.39 (5.5%) 449.26 (3.8%) 2.5% ( -6% - 12%) 0.096 HighTerm 650.61 (4.9%) 667.25 (4.4%) 2.6% ( -6% - 12%) 0.081 OrHighNotLow 308.34 (4.3%) 317.07 (5.0%) 2.8% ( -6% - 12%) 0.053 Prefix3 88.13 (8.6%) 91.32 (8.5%) 3.6% ( -12% - 22%) 0.181 BrowseMonthSSDVFacets 2.61 (1.0%) 2.71 (1.4%) 3.9% ( 1% - 6%) 0.000 BrowseDayOfYearSSDVFacets 2.46 (0.4%) 2.56 (2.1%) 4.1% ( 1% - 6%) 0.000 ``` I surprisingly see some gain in `BrowseDayOfYearSSDVFacets` and `BrowseMonthSSDVFacets` tasks but I think that is noise (because taxonomy facets have a different impl. as compared to `SortedSetDocValuesFacets`). `Browse*TaxoFacets` tasks don't show gain. # Checklist Please review the following and check all that apply: - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability. - [x] I have created a Jira issue and added the issue ID to my pull request title. - [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended) - [x] I have developed this patch against the `main` branch. - [x] I have run `./gradlew check`. - [x] I have added tests for my changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org