[jira] [Comment Edited] (LUCENE-9237) Faster TermsEnum intersect for UniformSplit

Bruno Roustant (Jira) Wed, 26 Feb 2020 01:05:52 -0800


    [ https://issues.apache.org/jira/browse/LUCENE-9237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045274#comment-17045274 ]


Bruno Roustant edited comment on LUCENE-9237 at 2/26/20 9:05 AM:
-----------------------------------------------------------------

Another benchmark on wikimediumall, comparing Lucene84 (FST on-heap) with 
UniformSplit (FST on-heap) using the default block size of 32 terms.

FST term dictionary is more compact (-20% size) but actually the speed for 
fuzzy query is better.
So my conclusion is that we should not try to reduce the block size for 
UniformSplit. With blocks of 32 terms we have a good balance between lookup 
speed and fuzzy query speed.

I notice that, compared to Lucene84 with FST on-heap, UniformSplit is about 
44% slower for fuzzy queries, on par for wildcard queries, and about 9% faster 
for prefix queries. I'll update the description of this Jira issue accordingly.

Task                       QPS Lucene84   StdDev  QPS UniformSplit26   StdDev  Pct diff
Fuzzy2                            56.74   (8.1%)               29.96   (3.1%)    -47.2%  ( -53% - -39%)
Fuzzy1                            77.44   (9.9%)               43.79   (3.4%)    -43.5%  ( -51% - -33%)
Respell                           51.63   (3.3%)               30.85   (2.4%)    -40.2%  ( -44% - -35%)
PKLookup                         178.85   (3.0%)              165.93   (3.3%)     -7.2%  ( -13% - 0%)
HighTermMonthSort                 53.25  (10.6%)               51.44  (13.0%)     -3.4%  ( -24% - 22%)
BrowseDayOfYearSSDVFacets          3.64   (2.9%)                3.59   (2.1%)     -1.3%  ( -6% - 3%)
HighSpanNear                       3.60   (3.2%)                3.55   (2.4%)     -1.3%  ( -6% - 4%)
LowSpanNear                       14.52   (3.5%)               14.33   (2.7%)     -1.3%  ( -7% - 5%)
BrowseMonthSSDVFacets              4.06   (2.6%)                4.01   (2.1%)     -1.2%  ( -5% - 3%)
MedSpanNear                        4.53   (2.8%)                4.48   (2.0%)     -1.0%  ( -5% - 3%)
LowSloppyPhrase                   18.44   (5.4%)               18.26   (4.0%)     -1.0%  ( -9% - 8%)
HighIntervalsOrdered               2.57   (3.4%)                2.55   (2.9%)     -1.0%  ( -7% - 5%)
HighTermDayOfYearSort             38.25   (7.3%)               37.89   (5.9%)     -0.9%  ( -13% - 13%)
IntNRQ                            20.81  (16.1%)               20.62  (15.9%)     -0.9%  ( -28% - 37%)
OrHighMed                         60.92   (3.9%)               60.54   (4.1%)     -0.6%  ( -8% - 7%)
MedSloppyPhrase                   10.53   (7.1%)               10.46   (5.7%)     -0.6%  ( -12% - 13%)
OrHighHigh                        10.99   (3.2%)               10.94   (3.4%)     -0.5%  ( -6% - 6%)
HighSloppyPhrase                   7.99   (8.1%)                7.95   (7.1%)     -0.5%  ( -14% - 15%)
Wildcard                          49.06  (10.8%)               49.00  (10.0%)     -0.1%  ( -18% - 23%)
OrHighLow                         51.28   (3.9%)               51.38   (3.7%)      0.2%  ( -7% - 8%)
MedPhrase                         31.21   (3.9%)               31.44   (4.1%)      0.7%  ( -7% - 9%)
AndHighHigh                       22.94   (5.0%)               23.13   (4.5%)      0.9%  ( -8% - 10%)
AndHighMed                        52.33   (4.0%)               52.79   (4.6%)      0.9%  ( -7% - 9%)
OrNotHighMed                     452.70   (5.5%)              462.21   (6.7%)      2.1%  ( -9% - 15%)
OrNotHighHigh                    503.99   (5.9%)              518.75   (6.4%)      2.9%  ( -8% - 16%)
OrHighNotLow                     575.63   (7.8%)              594.89   (6.8%)      3.3%  ( -10% - 19%)
LowPhrase                         94.01   (4.9%)               97.29   (5.3%)      3.5%  ( -6% - 14%)
OrHighNotHigh                    428.99   (5.9%)              444.99   (7.8%)      3.7%  ( -9% - 18%)
HighPhrase                        69.13   (6.8%)               72.63   (5.7%)      5.1%  ( -6% - 18%)
OrNotHighLow                     610.85   (5.7%)              644.82   (8.3%)      5.6%  ( -8% - 20%)
AndHighLow                       414.81   (5.9%)              438.07   (4.7%)      5.6%  ( -4% - 17%)
MedTerm                         1181.99   (6.3%)             1249.96   (5.4%)      5.8%  ( -5% - 18%)
OrHighNotMed                     592.99   (6.2%)              637.89   (7.4%)      7.6%  ( -5% - 22%)
HighTerm                         930.29   (5.8%)             1010.67   (7.2%)      8.6%  ( -4% - 22%)
Prefix3                          125.93   (8.3%)              137.78  (10.1%)      9.4%  ( -8% - 30%)
LowTerm                         1387.71   (7.7%)             1543.23   (8.1%)     11.2%  ( -4% - 29%)
BrowseDayOfYearTaxoFacets          0.92   (3.7%)                1.44   (6.6%)     56.2%  ( 44% - 69%)
BrowseDateTaxoFacets               0.93   (3.8%)                1.45   (6.4%)     57.1%  ( 45% - 69%)
BrowseMonthTaxoFacets              1.00   (4.0%)                1.68   (5.9%)     68.2%  ( 56% - 81%)
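
As a sanity check on the "Pct diff" column and on the -44% figure quoted for fuzzy queries, here is a minimal sketch. It assumes that "Pct diff" is the relative change in mean QPS from Lucene84 to UniformSplit, and that the -44% summary is the mean over the Fuzzy1, Fuzzy2 and Respell rows. The second point is only a guess, since the comment does not say how the number was derived. The class and method names are hypothetical.

```java
import java.util.Locale;

final class PctDiffSketch {

  /** Relative change of the candidate QPS versus the baseline QPS, in percent. */
  static double pctDiff(double baselineQps, double candidateQps) {
    return (candidateQps - baselineQps) / baselineQps * 100.0;
  }

  /** Arithmetic mean; an assumed way of summarizing several tasks in one number. */
  static double mean(double... values) {
    double sum = 0;
    for (double v : values) {
      sum += v;
    }
    return sum / values.length;
  }

  public static void main(String[] args) {
    // QPS values copied from the Fuzzy2, Fuzzy1 and Respell rows of the table.
    double fuzzy2 = pctDiff(56.74, 29.96);
    double fuzzy1 = pctDiff(77.44, 43.79);
    double respell = pctDiff(51.63, 30.85);
    System.out.printf(Locale.ROOT, "Fuzzy2  %.1f%%%n", fuzzy2);
    System.out.printf(Locale.ROOT, "Fuzzy1  %.1f%%%n", fuzzy1);
    System.out.printf(Locale.ROOT, "Respell %.1f%%%n", respell);
    System.out.printf(Locale.ROOT, "mean    %.1f%%%n", mean(fuzzy2, fuzzy1, respell));
  }
}
```

Under these assumptions the three rows reproduce the table's -47.2%, -43.5% and -40.2%, and their mean is about -43.6%, which is in line with the quoted -44%.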



> Faster TermsEnum intersect for UniformSplit
> -------------------------------------------
>
>                 Key: LUCENE-9237
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9237
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Bruno Roustant
>            Assignee: Bruno Roustant
>            Priority: Major
>          Time Spent: 4h
>  Remaining Estimate: 0h
>
> New version of TermsEnum intersect for UniformSplit. It is 75% more efficient 
> than the previous version for FuzzyQuery.
> Compared to BlockTree IntersectTermsEnum:
>  - It is still slower for FuzzyQuery (-37%) but it is faster than the 
> previous version (which was -65%).
>  - It is slightly slower for WildcardQuery (-5%).
>  - It is slightly faster for PrefixQuery (+5%). Benchmarks sometimes show a 
> larger improvement (I've seen up to +17% in about a quarter of the runs).
>  
> When I debugged thoroughly to understand the limitation of our previous 
> approach (computing the common prefix between two consecutive block keys 
> in the FST), I saw that for every FuzzyQuery the common prefix matched, so 
> we entered all blocks.
> I realized that the FuzzyQuery automaton accepts many variations of the 
> prefix, and the common prefix was not long enough to let us filter blocks 
> effectively.
> I looked at what VarGapFixedInterval did. After each term, it jumped to the 
> next target term accepted by the automaton. This was efficient enough 
> thanks to a vital optimization: it compared the target term to the 
> immediately following term, so that most of the time it did not actually 
> jump.
> So I applied the same idea of computing the next accepted term and jumping, 
> but now with a first condition based on the number of consecutively rejected 
> terms, and by anticipating the comparison of the accepted term with the 
> immediately following term. This is the main factor in the improvement. We 
> also leverage other optimizations that speed up the automaton validation of 
> each sequential term in the block.
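
The idea described above can be illustrated with a small, self-contained sketch. This is not the UniformSplit or Lucene code. The `Acceptor` interface, the `PrefixAcceptor` stand-in for an automaton, the `REJECTS_BEFORE_JUMP` threshold and the binary search that stands in for a seek through the block index are all hypothetical names and simplifications chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class IntersectSketch {

  /** Hypothetical stand-in for an automaton over terms. */
  interface Acceptor {
    boolean accepts(String term);

    /** Smallest accepted string that is >= term, or null if there is none. */
    String nextAccepted(String term);
  }

  /** Toy acceptor: accepts exactly the strings that start with a given prefix. */
  static final class PrefixAcceptor implements Acceptor {
    private final String prefix;

    PrefixAcceptor(String prefix) {
      this.prefix = prefix;
    }

    public boolean accepts(String term) {
      return term.startsWith(prefix);
    }

    public String nextAccepted(String term) {
      if (term.startsWith(prefix)) return term;
      // Every accepted string is >= prefix, so prefix is the next one if term sorts before it.
      return term.compareTo(prefix) < 0 ? prefix : null;
    }
  }

  /** Assumed threshold: scan sequentially until this many terms in a row are rejected. */
  static final int REJECTS_BEFORE_JUMP = 3;

  /** Returns the terms of a sorted, duplicate-free list that the acceptor accepts. */
  static List<String> intersect(List<String> sortedTerms, Acceptor acceptor) {
    List<String> matches = new ArrayList<>();
    int i = 0;
    int rejected = 0;
    while (i < sortedTerms.size()) {
      String term = sortedTerms.get(i);
      if (acceptor.accepts(term)) {
        matches.add(term);
        rejected = 0;
        i++;
        continue;
      }
      if (++rejected < REJECTS_BEFORE_JUMP) {
        i++; // first condition: too few consecutive rejections, keep scanning
        continue;
      }
      rejected = 0;
      String target = acceptor.nextAccepted(term);
      if (target == null) break; // nothing accepted after this term
      // Compare the target with the immediately following term: if the target
      // does not sort after it, stepping forward is enough and no jump is needed.
      if (i + 1 < sortedTerms.size() && target.compareTo(sortedTerms.get(i + 1)) <= 0) {
        i++;
        continue;
      }
      // Otherwise jump to the first term >= target.
      int pos = Collections.binarySearch(sortedTerms, target);
      i = pos >= 0 ? pos : -pos - 1;
    }
    return matches;
  }

  public static void main(String[] args) {
    List<String> terms = Arrays.asList("a", "b", "c", "d", "e", "ma", "mb", "z");
    System.out.println(intersect(terms, new PrefixAcceptor("m")));
  }
}
```

With the terms in `main` and the prefix "m", the scan rejects "a", "b" and "c", computes the target "m", sees that it sorts after the next term "d", and jumps directly to "ma". A real FuzzyQuery automaton accepts far more prefix variations than this toy acceptor, which is the reason the description gives for preferring this per-term check over the common-prefix filter.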



