[jira] [Updated] (LUCENE-9237) Faster TermsEnum intersect for UniformSplit

Bruno Roustant (Jira) Wed, 26 Feb 2020 01:09:52 -0800


     [ 
https://issues.apache.org/jira/browse/LUCENE-9237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Bruno Roustant updated LUCENE-9237:
-----------------------------------
    Description: 
New version of TermsEnum intersect for UniformSplit. It is 75% more efficient 
than the previous version for FuzzyQuery.

Compared to BlockTree IntersectTermsEnum:
 - It is still slower for FuzzyQuery (between -37% and -44% in our benchmarks) 
but it is faster than the previous version (which was -65%).
 - It is on par or slightly slower for WildcardQuery (between -5% and 0%).
 - It is slightly faster for PrefixQuery (between +5% and +10%).

 

When I debugged thoroughly to understand what was the limitation of the 
previous approach we had (to compute the common prefix between two consecutive 
block keys in the FST), I saw that actually for all FuzzyQuery the common 
prefix matched so we entered all blocks.
 I realized that the FuzzyQuery automaton accepts many variations for the 
prefix, and the common prefix was not long enough to allow us to filter 
correctly.

I looked at what VarGapFixedInterval did. It jumped all the time after each 
term to find the next target term accepted by the automaton. And this was 
sufficiently efficient thanks to a vital optimization that compared the target 
term to the immediate following term, to actually not jump most of the time.

So I applied the same idea to compute the next accepted term and jump, but now 
with a first condition based on the number of consecutively rejected terms, and 
by anticipating the comparison of the accepted term with the immediate next 
term. This is the main factor of the improvement. We leverage also other 
optimizations that speed up the automaton validation of each sequential term in 
the block.

  was:
New version of TermsEnum intersect for UniformSplit. It is 75% more efficient 
than the previous version for FuzzyQuery.

Compared to BlockTree IntersectTermsEnum:
 - It is still slower for FuzzyQuery (-37%) but it is faster than the previous 
version (which was -65%).
 - It is slightly slower for WildcardQuery (-5%).
 - It is slightly faster for PrefixQuery (+5%). Sometimes benchmarks show more 
improvement (I've seen up to +17% a fourth of the time).

 

When I debugged thoroughly to understand what was the limitation of the 
approach we had (to compute the common prefix between two consecutive block 
keys in the FST), I saw that actually for all FuzzyQuery the common prefix 
matched so we entered all blocks.
I realized that the FuzzyQuery automaton accepts many variations for the 
prefix, and the common prefix was not long enough to allow us to filter 
correctly.

I looked at what VarGapFixedInterval did. It jumped all the time after each 
term to find the next target term accepted by the automaton. And this was 
sufficiently efficient thanks to a vital optimization that compared the target 
term to the immediate following term, to actually not jump most of the time.

So I applied the same idea to compute the next accepted term and jump, but now 
with a first condition based on the number of consecutively rejected terms, and 
by anticipating the comparison of the accepted term with the immediate next 
term. This is the main factor of the improvement. We leverage also other 
optimizations that speed up the automaton validation of each sequential term in 
the block.


> Faster TermsEnum intersect for UniformSplit
> -------------------------------------------
>
>                 Key: LUCENE-9237
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9237
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Bruno Roustant
>            Assignee: Bruno Roustant
>            Priority: Major
>          Time Spent: 4h
>  Remaining Estimate: 0h
>
> New version of TermsEnum intersect for UniformSplit. It is 75% more efficient 
> than the previous version for FuzzyQuery.
> Compared to BlockTree IntersectTermsEnum:
>  - It is still slower for FuzzyQuery (between -37% and -44% in our 
> benchmarks) but it is faster than the previous version (which was -65%).
>  - It is on par or slightly slower for WildcardQuery (between -5% and 0%).
>  - It is slightly faster for PrefixQuery (between +5% and +10%).
>  
> When I debugged thoroughly to understand what was the limitation of the 
> previous approach we had (to compute the common prefix between two 
> consecutive block keys in the FST), I saw that actually for all FuzzyQuery 
> the common prefix matched so we entered all blocks.
>  I realized that the FuzzyQuery automaton accepts many variations for the 
> prefix, and the common prefix was not long enough to allow us to filter 
> correctly.
> I looked at what VarGapFixedInterval did. It jumped all the time after each 
> term to find the next target term accepted by the automaton. And this was 
> sufficiently efficient thanks to a vital optimization that compared the 
> target term to the immediate following term, to actually not jump most of the 
> time.
> So I applied the same idea to compute the next accepted term and jump, but 
> now with a first condition based on the number of consecutively rejected 
> terms, and by anticipating the comparison of the accepted term with the 
> immediate next term. This is the main factor of the improvement. We leverage 
> also other optimizations that speed up the automaton validation of each 
> sequential term in the block.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (LUCENE-9237) Faster TermsEnum intersect for UniformSplit

Reply via email to