hanbj opened a new issue, #14313:
URL: https://github.com/apache/lucene/issues/14313

   ### Description
   
   There are many implementations of MultiTermQuery, such as TermInSetQuery, 
FuzzyQuery, WildcardQuery, PrefixQuery, TermRangeQuery, RegexpQuery, TermsQuery, 
and AutomatonQuery, so optimizing the performance of MultiTermQuery yields 
corresponding performance improvements for all of these queries.
   
   `The default logic is as follows:`
   1. When the number of terms is less than or equal to 16, rewrite the query 
as a BooleanQuery
   2. When the number of terms is greater than 16, traverse the posting list 
corresponding to each term to collect document IDs
   
   > 2.1. If the document frequency of the term is less than or equal to 512, 
record its document IDs in otherTerms
   
   > 2.2. If the document frequency of the term is greater than 512, add the 
term's posting list to the priority queue highFrequencyTerms
   
   3. Wrap otherTerms and the 16 posting lists held in highFrequencyTerms into 
the set subs
   4. Use the DisjunctionDISIApproximation wrapper so that they jointly 
participate in the collection of document IDs while the posting lists are merged
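
The default flow above can be sketched with plain JDK collections. This is only an illustration under stated assumptions, not Lucene's actual code: posting lists are modelled as sorted `int[]` arrays, `DefaultMultiTermSketch` and its constants are hypothetical names, the eviction of the shortest list when the queue exceeds 16 entries is an assumption the description does not spell out, and the final merge is done eagerly with a `BitSet` instead of lazily through DisjunctionDISIApproximation.

```java
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of the default logic; not Lucene's implementation.
final class DefaultMultiTermSketch {
    static final int BOOLEAN_REWRITE_THRESHOLD = 16; // step 1
    static final int LOW_DOC_FREQ_THRESHOLD = 512;   // steps 2.1 and 2.2
    static final int MAX_HIGH_FREQ_LISTS = 16;       // step 3

    // Step 1: small term sets would be rewritten as a BooleanQuery.
    static boolean usesBooleanRewrite(int termCount) {
        return termCount <= BOOLEAN_REWRITE_THRESHOLD;
    }

    // Steps 2 to 4: union of all posting lists (each a sorted array of doc IDs).
    static BitSet collect(List<int[]> postings, int maxDoc) {
        BitSet otherTerms = new BitSet(maxDoc);
        // Min-heap ordered by list length, so the shortest list is evicted first.
        PriorityQueue<int[]> highFrequencyTerms =
            new PriorityQueue<>(Comparator.comparingInt((int[] p) -> p.length));
        for (int[] p : postings) {
            if (p.length <= LOW_DOC_FREQ_THRESHOLD) {
                addAll(otherTerms, p);            // 2.1: record doc IDs directly
            } else {
                highFrequencyTerms.add(p);        // 2.2: keep the posting list
                if (highFrequencyTerms.size() > MAX_HIGH_FREQ_LISTS) {
                    // Assumption: overflow is folded into otherTerms.
                    addAll(otherTerms, highFrequencyTerms.poll());
                }
            }
        }
        // Steps 3 and 4, simplified: merge eagerly instead of lazily.
        BitSet result = (BitSet) otherTerms.clone();
        for (int[] p : highFrequencyTerms) {
            addAll(result, p);
        }
        return result;
    }

    private static void addAll(BitSet target, int[] docs) {
        for (int doc : docs) {
            target.set(doc);
        }
    }
}
```

Calling `DefaultMultiTermSketch.collect(postings, maxDoc)` returns the set of matching document IDs; the point of the sketch is only to show where the 16 and 512 thresholds apply.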
   
   `The optimization idea is as follows:`
   1. Traverse the posting list corresponding to each term but defer 
processing it, so that the query can return early in the following situations
   
   > 1.1. A term matches all documents
   
   > 1.2. A term matches all documents that contain the field
   
   2. If the document frequency of a term is very high (less than or equal to 
reader.maxDoc() - 4096), the term has a large posting list and reverse 
collection can be performed, that is, collecting the documents that the large 
posting list does not contain. The posting lists of the other terms are then 
traversed, and their document IDs are deleted from the reverse-collected set. 
If the reverse-collected set becomes empty, all documents match and the query 
can return early. If it is not empty, the reverse-collected set contains 
relatively few document IDs, so merging the posting lists later is fast
   3. When the term iteration is completed and the number of terms is found to 
be equal to the number of terms contained in the field, all documents are 
included, and there is no need to traverse the posting list of each term.
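
As a rough illustration of ideas 1.1 and 2 above, the following sketch models posting lists as sorted `int[]` arrays and uses a `BitSet` for the reverse-collected set. It is not the implementation described in this issue: `ReverseCollectionSketch` and `REVERSE_THRESHOLD` are hypothetical names, and the sketch assumes the threshold means that at most 4096 documents are missing from the large posting list, which is only one possible reading of the description. Ideas 1.2 and 3 are left out because they need per-field statistics.

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch of early exit plus reverse collection; not Lucene code.
final class ReverseCollectionSketch {
    // Assumption: reverse collection applies when at most this many docs are missing.
    static final int REVERSE_THRESHOLD = 4096;

    // Union of all posting lists; each list holds unique doc IDs in [0, maxDoc).
    static BitSet collect(List<int[]> postings, int maxDoc) {
        int[] largest = null;
        for (int[] p : postings) {
            if (p.length == maxDoc) {
                return allDocs(maxDoc);           // idea 1.1: a term matches all docs
            }
            if (largest == null || p.length > largest.length) {
                largest = p;
            }
        }
        if (largest != null && maxDoc - largest.length <= REVERSE_THRESHOLD) {
            // Idea 2: track the docs that are NOT matched so far.
            BitSet missing = allDocs(maxDoc);
            for (int doc : largest) {
                missing.clear(doc);
            }
            for (int[] p : postings) {
                if (p == largest) {
                    continue;
                }
                for (int doc : p) {
                    missing.clear(doc);
                }
                if (missing.isEmpty()) {
                    return allDocs(maxDoc);       // every doc matched: return early
                }
            }
            BitSet result = allDocs(maxDoc);
            result.andNot(missing);               // matches = all docs minus missing
            return result;
        }
        // Fallback: plain forward collection.
        BitSet result = new BitSet(maxDoc);
        for (int[] p : postings) {
            for (int doc : p) {
                result.set(doc);
            }
        }
        return result;
    }

    private static BitSet allDocs(int maxDoc) {
        BitSet all = new BitSet(maxDoc);
        all.set(0, maxDoc);
        return all;
    }
}
```

The design choice here is that the reverse-collected set starts small (at most 4096 entries under the stated assumption) and only shrinks, so the early-exit check stays cheap compared with building the full union.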
   
   I have already implemented this optimization, and it has been running in 
our production environment for about half a year. So far I have not received 
any customer-reported issues, but the code changes are fairly large. Is the 
Lucene community interested? If so, I will submit a PR.
   The test results are as follows:
   
   -----------------------------------------------------------------------------
   Type                                                | Performance improvement
   -----------------------------------------------------------------------------
   A term matches all docs                             | 80 times
   -----------------------------------------------------------------------------
   A term matches all documents containing that field  | 70 times
   -----------------------------------------------------------------------------
   Contains all terms                                  | 80 times
   -----------------------------------------------------------------------------
   Reverse collection                                  | 8 times
   -----------------------------------------------------------------------------


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


