hanbj opened a new pull request, #14352:
URL: https://github.com/apache/lucene/pull/14352

   ### Description
   
   In production environments we hit performance problems with terms queries 
on numeric fields: high query latency and very high CPU usage.
   
   Profiling showed that most of the time is spent in the build_store stage. 
The main task of build_store is to collect matching document IDs and set the 
corresponding bits in an in-memory bitset. When a large number of document IDs 
match, this consumes a lot of CPU.
   
   This change optimizes that path: in the low-cardinality case, terms queries 
on numeric fields become 5-10 times faster. The optimization applies only when 
all three of the following conditions hold:
   
   1. values.getDocCount() == reader.maxDoc(): ensures that every document 
contains this field.
   2. values.getDocCount() == values.size(): ensures that every document has 
exactly one value for this field.
   3. cost() > reader.maxDoc() / 2: ensures that the points being queried 
match more than half of the documents.
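   
   The three conditions can be read as a single eligibility check. The sketch 
below is illustrative only and is not the code in this PR: the class and 
method names are hypothetical, and the parameters are plain numbers standing 
in for values.size(), values.getDocCount(), reader.maxDoc() and cost().

```java
/** Hypothetical helper; not part of Lucene or of this PR. */
final class InverseEligibility {

    /**
     * @param size     stand-in for values.size(): total number of indexed points
     * @param docCount stand-in for values.getDocCount(): docs with at least one point
     * @param maxDoc   stand-in for reader.maxDoc()
     * @param cost     stand-in for cost(): estimated number of matching documents
     */
    static boolean canCollectInverse(long size, int docCount, int maxDoc, long cost) {
        // Condition 1: every document contains the field.
        boolean everyDocHasField = docCount == maxDoc;
        // Condition 2: one point per document, so doc IDs are never duplicated.
        boolean singleValued = docCount == size;
        // Condition 3: more than half of the documents are expected to match.
        boolean matchesMajority = cost > maxDoc / 2;
        return everyDocHasField && singleValued && matchesMajority;
    }
}
```

   For example, with 100 documents that each hold exactly one value, a query 
expected to match 51 documents qualifies, while one expected to match 50 does 
not.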
   
   When every document has exactly one value for this field, no document ID 
appears more than once in the BKD tree, so duplicate document IDs are never 
collected. The query therefore first estimates how many documents match all 
the points being queried. If that number exceeds half of the documents, it 
uses reverse collection, visiting the non-matching documents instead, to 
reduce the number of document IDs collected.
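   
   A minimal sketch of the idea, assuming a plain java.util.BitSet as a 
stand-in for the in-memory bitset and int arrays as a stand-in for the 
document IDs visited in the BKD tree. The class and method names are 
hypothetical and this is not the code in this PR. Direct collection sets one 
bit per matching document. Reverse collection starts with every bit set and 
clears one bit per non-matching document, which is less work when matches are 
the majority.

```java
import java.util.BitSet;

/** Hypothetical illustration; not part of Lucene or of this PR. */
final class InverseCollectSketch {

    /** Direct collection: one bit operation per matching document. */
    static BitSet collectDirect(int[] matchingDocs, int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        for (int doc : matchingDocs) {
            result.set(doc);
        }
        return result;
    }

    /**
     * Reverse collection: mark every document as a match, then clear the
     * non-matching ones. This is only correct when each document has exactly
     * one value, so a document is either matching or non-matching, never both.
     */
    static BitSet collectInverse(int[] nonMatchingDocs, int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        result.set(0, maxDoc); // bulk-set bits [0, maxDoc)
        for (int doc : nonMatchingDocs) {
            result.clear(doc);
        }
        return result;
    }
}
```

   With 10 documents of which 8 match, the direct path performs 8 single-bit 
updates, while the reverse path performs one bulk fill and 2 single-bit 
updates, and both produce the same bitset.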
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

