[jira] [Commented] (LUCENE-9237) Faster TermsEnum intersect for UniformSplit

ASF subversion and git services (Jira) Fri, 28 Feb 2020 06:18:42 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-9237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047674#comment-17047674
 ]


ASF subversion and git services commented on LUCENE-9237:
---------------------------------------------------------

Commit 7302effd9c6d7dde203992df428f0c8d2389bfb3 in lucene-solr's branch 
refs/heads/branch_8x from Bruno Roustant
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7302eff ]

LUCENE-9237: Faster UniformSplit intersect TermsEnum.


> Faster TermsEnum intersect for UniformSplit
> -------------------------------------------
>
>                 Key: LUCENE-9237
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9237
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Bruno Roustant
>            Assignee: Bruno Roustant
>            Priority: Major
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> New version of TermsEnum intersect for UniformSplit. It is 75% more efficient 
> than the previous version for FuzzyQuery.
> Compared to BlockTree IntersectTermsEnum:
>  - It is still slower for FuzzyQuery (between -37% and -44% in our 
> benchmarks) but it is faster than the previous version (which was -65%).
>  - It is on par or slightly slower for WildcardQuery (between -5% and 0%).
>  - It is slightly faster for PrefixQuery (between +5% and +10%).
>  
> When I debugged thoroughly to understand the limitation of the previous 
> approach (computing the common prefix between two consecutive block keys 
> in the FST), I saw that for every FuzzyQuery the common prefix actually 
> matched, so we entered all blocks.
> I realized that the FuzzyQuery automaton accepts many variations for the 
> prefix, and the common prefix was not long enough to allow us to filter 
> correctly.
> I looked at what VarGapFixedInterval did. After each term, it jumped to 
> find the next target term accepted by the automaton. This was sufficiently 
> efficient thanks to a vital optimization that compared the target term to 
> the immediately following term, so that most of the time it did not 
> actually jump.
> So I applied the same idea to compute the next accepted term and jump, but 
> now with a first condition based on the number of consecutively rejected 
> terms, and by anticipating the comparison of the accepted term with the 
> immediate next term. This is the main factor of the improvement. We also 
> leverage other optimizations that speed up the automaton validation of each 
> sequential term in the block.
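
The jump strategy described above can be sketched in a few lines of Java. This is only an illustrative sketch under stated assumptions, not the UniformSplit code from the commit: TermAcceptor, nextAcceptedAtOrAfter, maxRejectsBeforeJump and the prefix-based acceptor are hypothetical stand-ins for the automaton, and a binary search over an in-memory sorted list of distinct terms stands in for the seek.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

final class IntersectSketch {

  /** Hypothetical stand-in for the query automaton. */
  interface TermAcceptor {
    boolean accepts(String term);

    /** Smallest accepted term that is >= the given term, or null if none. */
    String nextAcceptedAtOrAfter(String term);
  }

  /** Toy acceptor (assumption): accepts any term starting with one of the prefixes. */
  static TermAcceptor prefixes(String... prefixes) {
    final TreeSet<String> sorted = new TreeSet<>(Arrays.asList(prefixes));
    return new TermAcceptor() {
      @Override
      public boolean accepts(String term) {
        for (String p : sorted) {
          if (term.startsWith(p)) {
            return true;
          }
        }
        return false;
      }

      @Override
      public String nextAcceptedAtOrAfter(String term) {
        // For a rejected term, the smallest accepted term above it is the
        // smallest prefix that sorts after it.
        return accepts(term) ? term : sorted.ceiling(term);
      }
    };
  }

  /** Returns the accepted terms; sortedTerms must be sorted and distinct. */
  static List<String> intersect(
      List<String> sortedTerms, TermAcceptor acceptor, int maxRejectsBeforeJump) {
    List<String> matches = new ArrayList<>();
    int rejected = 0; // number of consecutively rejected terms
    int i = 0;
    while (i < sortedTerms.size()) {
      String term = sortedTerms.get(i++);
      if (acceptor.accepts(term)) {
        matches.add(term);
        rejected = 0;
        continue;
      }
      rejected++;
      // First condition: keep scanning sequentially until enough terms
      // have been rejected in a row.
      if (rejected < maxRejectsBeforeJump || i >= sortedTerms.size()) {
        continue;
      }
      String target = acceptor.nextAcceptedAtOrAfter(term);
      if (target == null) {
        break; // no later term can be accepted
      }
      // Anticipated comparison: if the immediate next term is already at or
      // past the target, a jump would land right here, so do not jump.
      if (sortedTerms.get(i).compareTo(target) >= 0) {
        continue;
      }
      // Jump: move to the first term >= target.
      int pos = Collections.binarySearch(sortedTerms, target);
      i = pos >= 0 ? pos : -pos - 1;
      rejected = 0;
    }
    return matches;
  }

  public static void main(String[] args) {
    List<String> terms = Arrays.asList(
        "apple", "apricot", "banana", "blueberry", "cherry",
        "cranberry", "date", "fig", "grape");
    System.out.println(intersect(terms, prefixes("c", "g"), 2));
  }
}
```

With a threshold of 1 the sketch tries to jump after every rejected term, as the description says VarGapFixedInterval did; a larger threshold favors sequential scanning inside runs of short rejections. The result is the same for any threshold, only the number of seeks changes.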



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
