[I] MultiTermQueryConstantScoreBlendedWrapper#createWeight#rewriteInner performance optimization ideas [lucene]

via GitHub Thu, 27 Feb 2025 23:36:57 -0800


hanbj opened a new issue, #14313:
URL: https://github.com/apache/lucene/issues/14313

### Description

There are many implementations of MultiTermQuery, such as TermInSetQuery,
FuzzyQuery, WildcardQuery, PrefixQuery, TermRangeQuery, RegexpQuery, TermsQuery,
and AutomatonQuery, among others, so optimizing the performance of MultiTermQuery
brings corresponding performance improvements to all of these queries.

`The default logic is as follows:`
1. When the number of terms is less than or equal to 16, the query is rewritten
as a BooleanQuery.
2. When the number of terms is greater than 16, the posting list of each term
is traversed to collect document IDs:

> 2.1. If the document frequency of the term is less than or equal to 512, its
document IDs are recorded in otherTerms.

> 2.2. If the document frequency of the term is greater than 512, its posting
list is added to the priority queue highFrequencyTerms.

3. otherTerms and the 16 posting lists held in highFrequencyTerms are wrapped
into the collection subs.
4. subs is wrapped in a DisjunctionDISIApproximation, so that its members
jointly take part in collecting document IDs while the posting lists are merged.
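
The default logic above can be sketched without any Lucene dependency. This is only an illustrative model, not the actual Lucene code: the class, method, and constant names are hypothetical, terms are represented only by their document frequencies, and the handling of more than 16 high-frequency terms (folding the one with the smallest document frequency into otherTerms) is an assumption that the description above does not spell out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical, Lucene-free model of the default partitioning of terms. */
class DefaultBlendedSketch {
    static final int BOOLEAN_REWRITE_LIMIT = 16;   // step 1
    static final int DOC_FREQ_LIMIT = 512;         // steps 2.1 and 2.2
    static final int MAX_HIGH_FREQUENCY_TERMS = 16; // step 3

    /** Outcome of the partition; term indexes refer to positions in the input array. */
    static final class Plan {
        final boolean booleanRewrite;
        final List<Integer> otherTerms;
        final List<Integer> highFrequencyTerms;

        Plan(boolean booleanRewrite, List<Integer> otherTerms, List<Integer> highFrequencyTerms) {
            this.booleanRewrite = booleanRewrite;
            this.otherTerms = otherTerms;
            this.highFrequencyTerms = highFrequencyTerms;
        }
    }

    static Plan plan(final int[] docFreqs) {
        if (docFreqs.length <= BOOLEAN_REWRITE_LIMIT) {
            // Step 1: few terms, rewrite as a BooleanQuery.
            return new Plan(true, Collections.<Integer>emptyList(), Collections.<Integer>emptyList());
        }
        List<Integer> otherTerms = new ArrayList<>();
        // Min-heap on document frequency, so the cheapest high-frequency term is evicted first.
        PriorityQueue<Integer> highFrequencyTerms =
            new PriorityQueue<>(Comparator.comparingInt((Integer t) -> docFreqs[t]));
        for (int term = 0; term < docFreqs.length; term++) {
            if (docFreqs[term] <= DOC_FREQ_LIMIT) {
                otherTerms.add(term); // step 2.1: doc IDs go into the shared set
            } else {
                highFrequencyTerms.add(term); // step 2.2: keep the posting list separate
                if (highFrequencyTerms.size() > MAX_HIGH_FREQUENCY_TERMS) {
                    // Assumption: overflow is folded into otherTerms.
                    otherTerms.add(highFrequencyTerms.poll());
                }
            }
        }
        return new Plan(false, otherTerms, new ArrayList<>(highFrequencyTerms));
    }
}
```

In this model, `plan` only decides where each term goes; the later merge over subs (steps 3 and 4) is not simulated.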

`The optimization idea is as follows:`
1. Traverse the posting list of each term, but delay the actual processing, so
that the method can return early in the following situations:

> 1.1. A term matches all documents

> 1.2. A term matches all documents that contain the field

2. When the document frequency of a term is very high (less than or equal to
`reader.maxDoc() - 4096`), its posting list is large and reverse collection can
be used instead: collect the documents that the term does not match, then
traverse the posting lists of the other terms and delete their document IDs
from this reverse-collected set. If the reverse-collected set becomes empty,
all documents match and the method can return early. If it is not empty, it
still contains relatively few document IDs, so the later merge of the posting
lists is fast.
3. If, once the term iteration is complete, the number of terms turns out to be
equal to the number of terms contained in the field, all documents are
included, and there is no need to traverse the posting list of each term.
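
A minimal sketch of the early return from 1.1 and the reverse collection from point 2, using `java.util.BitSet` in place of Lucene's data structures. It is not the implementation mentioned in this issue: the class and method names are hypothetical, posting lists are modelled as arrays of distinct document IDs, deleted documents are ignored, and the trigger for reverse collection is passed in as a hypothetical `maxMissingDocs` parameter rather than hard-coded.

```java
import java.util.BitSet;

/** Hypothetical model of early return and reverse collection over posting lists. */
class ReverseCollectionSketch {

    /** Returns the doc IDs matched by at least one posting list. */
    static BitSet union(int maxDoc, int[][] postings, int maxMissingDocs) {
        int largest = -1;
        for (int t = 0; t < postings.length; t++) {
            if (postings[t].length == maxDoc) {
                return allDocs(maxDoc); // 1.1: one term matches every document
            }
            if (largest < 0 || postings[t].length > postings[largest].length) {
                largest = t;
            }
        }
        if (largest >= 0 && maxDoc - postings[largest].length <= maxMissingDocs) {
            // Reverse collection: start from the docs the largest term does NOT match.
            BitSet missing = allDocs(maxDoc);
            for (int doc : postings[largest]) {
                missing.clear(doc);
            }
            // Remove docs matched by the other terms; stop as soon as nothing is missing.
            for (int t = 0; t < postings.length && !missing.isEmpty(); t++) {
                if (t == largest) {
                    continue;
                }
                for (int doc : postings[t]) {
                    missing.clear(doc);
                }
            }
            BitSet result = allDocs(maxDoc);
            result.andNot(missing);
            return result;
        }
        // Forward collection: plain union of all posting lists.
        BitSet result = new BitSet(maxDoc);
        for (int[] posting : postings) {
            for (int doc : posting) {
                result.set(doc);
            }
        }
        return result;
    }

    static BitSet allDocs(int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        bits.set(0, maxDoc);
        return bits;
    }
}
```

The reverse path touches at most `maxMissingDocs` bits per remaining step, which is where the saving would come from when one posting list covers almost the whole index.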

I have already implemented this optimization myself, and it has been running in
our production environment for about half a year. So far no customers have
reported any issues, but the code changes are fairly substantial. Is the Lucene
community interested? If so, I will submit a PR.
The test results are as follows:

| Type                                               | Performance improvement |
|----------------------------------------------------|-------------------------|
| A term matches all docs                            | 80 times                |
| A term matches all documents containing that field | 70 times                |
| Contains all terms                                 | 80 times                |
| Reverse collection                                 | 8 times                 |
