[ https://issues.apache.org/jira/browse/LUCENE-9237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045265#comment-17045265 ]
Bruno Roustant edited comment on LUCENE-9237 at 2/28/20 9:01 AM: ----------------------------------------------------------------- So here is the wikimediumall benchmark for Lucene84 with FST on heap and UniformSplit with block of 26 terms (FST on heap). In this case UniformSplit takes 25.2 MB for the term dictionary (+6.1 MB for facets). Actually the effect of smaller blocks on fuzzy queries is not better. It slows down a little. I believe it is because we have a jump-to-next-accepted-term policy which suffers from having more blocks (as opposed to a browse-automaton-transitions policy which benefits from more smaller blocks to skip). TaskQPS Lucene84 StdDevQPS UniformSplit26 StdDev Pct diff Respell 47.21 (3.7%) 24.78 (1.6%) -47.5% ( -50% - -43%) Fuzzy1 55.44 (5.8%) 29.32 (2.0%) -47.1% ( -51% - -41%) Fuzzy2 52.63 (10.0%) 28.17 (4.0%) -46.5% ( -54% - -36%) PKLookup 179.58 (4.8%) 164.08 (6.3%) -8.6% ( -18% - 2%) Wildcard 14.96 (4.2%) 13.90 (5.8%) -7.0% ( -16% - 3%) OrHighNotHigh 588.73 (8.2%) 562.35 (8.0%) -4.5% ( -19% - 12%) BrowseDayOfYearSSDVFacets 3.64 (3.0%) 3.52 (4.9%) -3.4% ( -11% - 4%) HighSpanNear 4.03 (3.6%) 3.91 (5.0%) -3.0% ( -11% - 5%) LowSpanNear 12.70 (3.9%) 12.34 (4.9%) -2.8% ( -11% - 6%) BrowseMonthSSDVFacets 4.06 (3.1%) 3.95 (4.9%) -2.7% ( -10% - 5%) LowSloppyPhrase 16.87 (5.6%) 16.43 (6.0%) -2.6% ( -13% - 9%) MedSpanNear 12.14 (3.7%) 11.83 (4.6%) -2.6% ( -10% - 5%) OrHighMed 35.49 (4.0%) 34.62 (5.7%) -2.4% ( -11% - 7%) AndHighMed 131.58 (5.0%) 128.48 (6.4%) -2.4% ( -13% - 9%) AndHighHigh 33.98 (4.3%) 33.21 (6.5%) -2.3% ( -12% - 8%) MedSloppyPhrase 14.38 (4.6%) 14.06 (5.4%) -2.2% ( -11% - 8%) OrHighHigh 11.66 (3.6%) 11.41 (5.2%) -2.1% ( -10% - 6%) HighSloppyPhrase 6.04 (5.4%) 5.93 (6.2%) -1.9% ( -12% - 10%) OrNotHighLow 524.51 (7.6%) 514.55 (7.2%) -1.9% ( -15% - 14%) HighIntervalsOrdered 2.75 (3.3%) 2.70 (4.6%) -1.9% ( -9% - 6%) HighTermDayOfYearSort 32.36 (7.1%) 31.77 (9.8%) -1.8% ( -17% - 16%) HighTermMonthSort 17.40 (17.5%) 17.10 (16.9%) -1.7% ( -30% - 39%) OrNotHighMed 551.94 (6.1%) 542.75 (7.6%) -1.7% ( -14% - 12%) LowPhrase 26.37 (3.9%) 26.02 (5.9%) -1.3% ( -10% - 8%) OrHighNotLow 594.32 (8.6%) 586.44 (8.7%) -1.3% ( -17% - 17%) OrHighLow 311.94 (5.2%) 311.00 (6.5%) -0.3% ( -11% - 12%) OrHighNotMed 593.04 (7.0%) 593.24 (8.7%) 0.0% ( -14% - 16%) HighPhrase 196.99 (5.2%) 198.14 (6.2%) 0.6% ( -10% - 12%) AndHighLow 341.56 (6.0%) 344.35 (8.2%) 0.8% ( -12% - 15%) IntNRQ 96.54 (5.4%) 97.37 (6.6%) 0.9% ( -10% - 13%) OrNotHighHigh 605.77 (8.2%) 611.14 (9.5%) 0.9% ( -15% - 20%) LowTerm 1408.69 (7.5%) 1428.02 (7.9%) 1.4% ( -13% - 18%) MedPhrase 154.81 (7.0%) 157.66 (6.2%) 1.8% ( -10% - 16%) Prefix3 91.52 (6.5%) 94.01 (8.8%) 2.7% ( -11% - 19%) HighTerm 992.54 (7.2%) 1022.11 (8.3%) 3.0% ( -11% - 19%) MedTerm 1048.57 (8.0%) 1100.60 (8.3%) 5.0% ( -10% - 23%) BrowseDayOfYearTaxoFacets 0.92 (3.1%) 1.44 (8.2%) 55.9% ( 43% - 69%) BrowseDateTaxoFacets 0.93 (3.1%) 1.45 (7.9%) 56.7% ( 44% - 69%) BrowseMonthTaxoFacets 1.00 (2.8%) 1.67 (7.9%) 67.4% ( 55% - 80%) was (Author: broustant): So here is the wikimediumall benchmark for Lucene84 with FST on heap and UniformSplit with block of 26 terms (FST on heap). In this case UniformSplit takes 25.2 MB for the term dictionary (+6.1 MB for facets). Actually the effect of smaller blocks on fuzzy queries is not good. It slows down a little. I believe it is because we have a jump-to-next-accepted-term policy which suffers from having more blocks (as opposed to a browse-automaton-transitions policy which benefits from more smaller blocks to skip). TaskQPS Lucene84 StdDevQPS UniformSplit26 StdDev Pct diff Respell 47.21 (3.7%) 24.78 (1.6%) -47.5% ( -50% - -43%) Fuzzy1 55.44 (5.8%) 29.32 (2.0%) -47.1% ( -51% - -41%) Fuzzy2 52.63 (10.0%) 28.17 (4.0%) -46.5% ( -54% - -36%) PKLookup 179.58 (4.8%) 164.08 (6.3%) -8.6% ( -18% - 2%) Wildcard 14.96 (4.2%) 13.90 (5.8%) -7.0% ( -16% - 3%) OrHighNotHigh 588.73 (8.2%) 562.35 (8.0%) -4.5% ( -19% - 12%) BrowseDayOfYearSSDVFacets 3.64 (3.0%) 3.52 (4.9%) -3.4% ( -11% - 4%) HighSpanNear 4.03 (3.6%) 3.91 (5.0%) -3.0% ( -11% - 5%) LowSpanNear 12.70 (3.9%) 12.34 (4.9%) -2.8% ( -11% - 6%) BrowseMonthSSDVFacets 4.06 (3.1%) 3.95 (4.9%) -2.7% ( -10% - 5%) LowSloppyPhrase 16.87 (5.6%) 16.43 (6.0%) -2.6% ( -13% - 9%) MedSpanNear 12.14 (3.7%) 11.83 (4.6%) -2.6% ( -10% - 5%) OrHighMed 35.49 (4.0%) 34.62 (5.7%) -2.4% ( -11% - 7%) AndHighMed 131.58 (5.0%) 128.48 (6.4%) -2.4% ( -13% - 9%) AndHighHigh 33.98 (4.3%) 33.21 (6.5%) -2.3% ( -12% - 8%) MedSloppyPhrase 14.38 (4.6%) 14.06 (5.4%) -2.2% ( -11% - 8%) OrHighHigh 11.66 (3.6%) 11.41 (5.2%) -2.1% ( -10% - 6%) HighSloppyPhrase 6.04 (5.4%) 5.93 (6.2%) -1.9% ( -12% - 10%) OrNotHighLow 524.51 (7.6%) 514.55 (7.2%) -1.9% ( -15% - 14%) HighIntervalsOrdered 2.75 (3.3%) 2.70 (4.6%) -1.9% ( -9% - 6%) HighTermDayOfYearSort 32.36 (7.1%) 31.77 (9.8%) -1.8% ( -17% - 16%) HighTermMonthSort 17.40 (17.5%) 17.10 (16.9%) -1.7% ( -30% - 39%) OrNotHighMed 551.94 (6.1%) 542.75 (7.6%) -1.7% ( -14% - 12%) LowPhrase 26.37 (3.9%) 26.02 (5.9%) -1.3% ( -10% - 8%) OrHighNotLow 594.32 (8.6%) 586.44 (8.7%) -1.3% ( -17% - 17%) OrHighLow 311.94 (5.2%) 311.00 (6.5%) -0.3% ( -11% - 12%) OrHighNotMed 593.04 (7.0%) 593.24 (8.7%) 0.0% ( -14% - 16%) HighPhrase 196.99 (5.2%) 198.14 (6.2%) 0.6% ( -10% - 12%) AndHighLow 341.56 (6.0%) 344.35 (8.2%) 0.8% ( -12% - 15%) IntNRQ 96.54 (5.4%) 97.37 (6.6%) 0.9% ( -10% - 13%) OrNotHighHigh 605.77 (8.2%) 611.14 (9.5%) 0.9% ( -15% - 20%) LowTerm 1408.69 (7.5%) 1428.02 (7.9%) 1.4% ( -13% - 18%) MedPhrase 154.81 (7.0%) 157.66 (6.2%) 1.8% ( -10% - 16%) Prefix3 91.52 (6.5%) 94.01 (8.8%) 2.7% ( -11% - 19%) HighTerm 992.54 (7.2%) 1022.11 (8.3%) 3.0% ( -11% - 19%) MedTerm 1048.57 (8.0%) 1100.60 (8.3%) 5.0% ( -10% - 23%) BrowseDayOfYearTaxoFacets 0.92 (3.1%) 1.44 (8.2%) 55.9% ( 43% - 69%) BrowseDateTaxoFacets 0.93 (3.1%) 1.45 (7.9%) 56.7% ( 44% - 69%) BrowseMonthTaxoFacets 1.00 (2.8%) 1.67 (7.9%) 67.4% ( 55% - 80%) > Faster TermsEnum intersect for UniformSplit > ------------------------------------------- > > Key: LUCENE-9237 > URL: https://issues.apache.org/jira/browse/LUCENE-9237 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Bruno Roustant > Assignee: Bruno Roustant > Priority: Major > Time Spent: 4h 10m > Remaining Estimate: 0h > > New version of TermsEnum intersect for UniformSplit. It is 75% more efficient > than the previous version for FuzzyQuery. > Compared to BlockTree IntersectTermsEnum: > - It is still slower for FuzzyQuery (between -37% and -44% in our > benchmarks) but it is faster than the previous version (which was -65%). > - It is on par or slightly slower for WildcardQuery (between -5% and 0%). > - It is slightly faster for PrefixQuery (between +5% and +10%). > > When I debugged thoroughly to understand what was the limitation of the > previous approach we had (to compute the common prefix between two > consecutive block keys in the FST), I saw that actually for all FuzzyQuery > the common prefix matched so we entered all blocks. > I realized that the FuzzyQuery automaton accepts many variations for the > prefix, and the common prefix was not long enough to allow us to filter > correctly. > I looked at what VarGapFixedInterval did. It jumped all the time after each > term to find the next target term accepted by the automaton. And this was > sufficiently efficient thanks to a vital optimization that compared the > target term to the immediate following term, to actually not jump most of the > time. > So I applied the same idea to compute the next accepted term and jump, but > now with a first condition based on the number of consecutively rejected > terms, and by anticipating the comparison of the accepted term with the > immediate next term. This is the main factor of the improvement. We leverage > also other optimizations that speed up the automaton validation of each > sequential term in the block. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org