hanbj opened a new issue, #14313: URL: https://github.com/apache/lucene/issues/14313
### Description

There are many implementations of `MultiTermQuery`, such as `TermInSetQuery`, `FuzzyQuery`, `WildcardQuery`, `PrefixQuery`, `TermRangeQuery`, `RegexpQuery`, `TermsQuery`, `AutomatonQuery`, and so on, so optimizing `MultiTermQuery` improves the performance of all of these queries.

`The default logic is as follows:`

1. When the number of terms is less than or equal to 16, rewrite the query as a `BooleanQuery`.
2. When the number of terms is greater than 16, traverse the posting list of each term to collect document IDs:
   - 2.1. If the term's document frequency is less than or equal to 512, record its document IDs in `otherTerms`.
   - 2.2. If the term's document frequency is greater than 512, add the term's posting list to the priority queue `highFrequencyTerms`.
3. Wrap `otherTerms` and the 16 posting lists held in `highFrequencyTerms` into the collection `subs`.
4. Use the `DisjunctionDISIApproximation` wrapper so that they jointly take part in collecting document IDs while the posting lists are merged.

`The optimization idea is as follows:`

1. Traverse the posting list of each term lazily, so that the query can return early in the following cases:
   - 1.1. A term matches all documents.
   - 1.2. A term matches all documents that contain the field.
2. When a term's document frequency is very high (less than or equal to `reader.maxDoc() - 4096`), its posting list is large and reverse collection can be used: collect the documents that the large posting list does not match, then traverse the posting lists of the other terms and remove their document IDs from this reverse-collected set. If the reverse-collected set becomes empty, all documents match and the query can return early. If it is not empty, it contains relatively few document IDs, so the later merge of the posting lists is fast.
3. When term iteration is complete and the number of terms equals the number of terms contained in the field, all documents containing the field match, and there is no need to traverse the posting list of each term.

I have already implemented this optimization, and it has been running in our production environment for about half a year. So far no customer issues have been reported, but the code change is fairly large. Is the Lucene community interested? If so, I will submit a PR.

The test results are as follows:

| Type | Performance improvement |
| --- | --- |
| A term matches all documents | 80 times |
| A term matches all documents containing that field | 70 times |
| Contains all terms | 80 times |
| Reverse collection | 8 times |
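
To make the early-exit and reverse-collection ideas from this issue concrete, here is a minimal, self-contained sketch. It is not Lucene code and not the implementation mentioned in the issue. It models posting lists as plain `int[]` arrays of unique document IDs in `[0, maxDoc)`, uses `java.util.BitSet` in place of Lucene's doc-ID set classes, and always reverse-collects the largest posting list instead of applying the `reader.maxDoc() - 4096` threshold. The class and method names (`ReverseCollectionSketch`, `matchingDocs`) are hypothetical.

```java
import java.util.BitSet;
import java.util.List;

/** Hypothetical illustration only; posting lists are modeled as int[] of unique doc IDs. */
final class ReverseCollectionSketch {

  /** Returns the union of all posting lists, using early exit and reverse collection. */
  static BitSet matchingDocs(int maxDoc, List<int[]> postings) {
    if (postings.isEmpty()) {
      return new BitSet(maxDoc); // no terms, no matches
    }
    BitSet all = new BitSet(maxDoc);
    all.set(0, maxDoc);

    int largest = 0;
    for (int i = 0; i < postings.size(); i++) {
      // Early exit: a term that matches every document decides the result.
      if (postings.get(i).length == maxDoc) {
        return all;
      }
      if (postings.get(i).length > postings.get(largest).length) {
        largest = i;
      }
    }

    // Reverse collection: start from the documents the largest list does NOT match.
    BitSet missing = (BitSet) all.clone();
    for (int doc : postings.get(largest)) {
      missing.clear(doc);
    }

    // Remove documents matched by the other terms; stop once nothing is missing.
    for (int i = 0; i < postings.size() && !missing.isEmpty(); i++) {
      if (i == largest) {
        continue;
      }
      for (int doc : postings.get(i)) {
        missing.clear(doc);
      }
    }
    if (missing.isEmpty()) {
      return all; // every document is matched by some term
    }

    BitSet result = (BitSet) all.clone();
    result.andNot(missing); // matches = all documents minus the still-missing ones
    return result;
  }

  public static void main(String[] args) {
    List<int[]> postings = List.of(new int[] {0, 1, 2, 3, 4, 5}, new int[] {6});
    System.out.println(matchingDocs(8, postings)); // prints {0, 1, 2, 3, 4, 5, 6}
  }
}
```

In this sketch the reverse-collected set (`missing`) stays small whenever one term matches most documents, so the remaining work depends on the size of the other posting lists instead of on the large one. A real implementation inside Lucene would work with posting-list iterators and per-segment readers, which this sketch deliberately leaves out.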