hanbj opened a new pull request, #14352: URL: https://github.com/apache/lucene/pull/14352
### Description

We ran into performance problems with terms queries on a number field in production: high query time and very high CPU usage. Analysis showed that most of the time is spent in the build_store stage. The main task of build_store is to collect document IDs and fill a bitset in memory, and when there are many document IDs this consumes a lot of CPU. With the optimization in this PR, terms queries on a number field are 5-10 times faster in the low-cardinality case.

The optimization is applied only when all three of the following conditions hold:

1. `values.getDocCount() == reader.maxDoc()`: ensures that every document contains this field.
2. `values.getDocCount() == values.size()`: ensures that each document has only one value for this field.
3. `cost() > reader.maxDoc() / 2`: ensures that more than half of the documents are matched by the points being queried.

When each document has only one value for this field, document IDs in the BKD tree are not duplicated, so no duplicate document IDs are collected. The query therefore first determines how many documents match all the points being queried. If that number exceeds half of the documents, it uses reverse collection (collecting the documents that do not match and inverting the result) to reduce the number of document IDs collected.
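The reverse collection idea can be illustrated with a minimal, self-contained sketch. This is not the code in this PR and does not use Lucene: it models the BKD tree as a plain map from value to document IDs, and every name in it (`InverseCollectionSketch`, `collectDirect`, `collectInverse`, `canCollectInverse`, `sampleIndex`) is hypothetical. It assumes Java 9 or later for `Set.of`.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class InverseCollectionSketch {

    /** Result bitset plus the number of doc IDs that had to be visited. */
    static final class Result {
        final BitSet bits;
        final int visited;
        Result(BitSet bits, int visited) { this.bits = bits; this.visited = visited; }
    }

    /** The three preconditions from the description, on plain numbers. */
    static boolean canCollectInverse(int docCount, long pointCount, int maxDoc, long cost) {
        return docCount == maxDoc       // every document has the field
            && docCount == pointCount   // exactly one value per document
            && cost > maxDoc / 2;       // more than half of the documents match
    }

    /** Number of doc IDs indexed under the queried values. */
    static long cost(Map<Integer, int[]> docsByValue, Set<Integer> queried) {
        long sum = 0;
        for (int value : queried) {
            int[] docs = docsByValue.get(value);
            if (docs != null) sum += docs.length;
        }
        return sum;
    }

    /** Direct collection: set one bit per matching document. */
    static Result collectDirect(Map<Integer, int[]> docsByValue, Set<Integer> queried, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        int visited = 0;
        for (Map.Entry<Integer, int[]> e : docsByValue.entrySet()) {
            if (!queried.contains(e.getKey())) continue;
            for (int doc : e.getValue()) { bits.set(doc); visited++; }
        }
        return new Result(bits, visited);
    }

    /** Reverse collection: start with all bits set, clear the non-matching documents. */
    static Result collectInverse(Map<Integer, int[]> docsByValue, Set<Integer> queried, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        bits.set(0, maxDoc);
        int visited = 0;
        for (Map.Entry<Integer, int[]> e : docsByValue.entrySet()) {
            if (queried.contains(e.getKey())) continue;
            for (int doc : e.getValue()) { bits.clear(doc); visited++; }
        }
        return new Result(bits, visited);
    }

    /** Toy low-cardinality index: 10 documents, one value each, 3 distinct values. */
    static Map<Integer, int[]> sampleIndex() {
        Map<Integer, int[]> index = new LinkedHashMap<>();
        index.put(1, new int[] {0, 1, 2, 3, 4, 5, 6});
        index.put(2, new int[] {7, 8});
        index.put(3, new int[] {9});
        return index;
    }

    public static void main(String[] args) {
        int maxDoc = 10;
        Map<Integer, int[]> index = sampleIndex();
        Set<Integer> queried = Set.of(1, 2);

        long cost = cost(index, queried);
        boolean inverse = canCollectInverse(maxDoc, maxDoc, maxDoc, cost);
        Result result = inverse
            ? collectInverse(index, queried, maxDoc)
            : collectDirect(index, queried, maxDoc);

        System.out.println("inverse=" + inverse
            + " visited=" + result.visited
            + " matches=" + result.bits);
    }
}
```

Both strategies produce the same bitset only because each document appears exactly once in the index. That is why conditions 1 and 2 are required before clearing bits is safe, while condition 3 decides whether reverse collection actually visits fewer document IDs.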