[ https://issues.apache.org/jira/browse/LUCENE-9237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruno Roustant resolved LUCENE-9237. ------------------------------------ Fix Version/s: 8.5 Resolution: Fixed Thanks [~dsmiley] for reviewing! > Faster TermsEnum intersect for UniformSplit > ------------------------------------------- > > Key: LUCENE-9237 > URL: https://issues.apache.org/jira/browse/LUCENE-9237 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Bruno Roustant > Assignee: Bruno Roustant > Priority: Major > Fix For: 8.5 > > Time Spent: 4h 20m > Remaining Estimate: 0h > > New version of TermsEnum intersect for UniformSplit. It is 75% more efficient > than the previous version for FuzzyQuery. > Compared to BlockTree IntersectTermsEnum: > - It is still slower for FuzzyQuery (between -37% and -44% in our > benchmarks) but it is faster than the previous version (which was -65%). > - It is on par or slightly slower for WildcardQuery (between -5% and 0%). > - It is slightly faster for PrefixQuery (between +5% and +10%). > > When I debugged thoroughly to understand what was the limitation of the > previous approach we had (to compute the common prefix between two > consecutive block keys in the FST), I saw that actually for all FuzzyQuery > the common prefix matched so we entered all blocks. > I realized that the FuzzyQuery automaton accepts many variations for the > prefix, and the common prefix was not long enough to allow us to filter > correctly. > I looked at what VarGapFixedInterval did. It jumped all the time after each > term to find the next target term accepted by the automaton. And this was > sufficiently efficient thanks to a vital optimization that compared the > target term to the immediate following term, to actually not jump most of the > time. > So I applied the same idea to compute the next accepted term and jump, but > now with a first condition based on the number of consecutively rejected > terms, and by anticipating the comparison of the accepted term with the > immediate next term. This is the main factor of the improvement. We leverage > also other optimizations that speed up the automaton validation of each > sequential term in the block. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org