gashutos opened a new issue, #12720: URL: https://github.com/apache/lucene/issues/12720
### Description ### Problem With higher number of deleted entries in a segment, the sort query shows up to `10x` degradation after one point. We did this experiment using [nyc_taxis](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/nyc_taxis) workload and updated around 20% of documents in this workload. With this, the number of deleted documents in some segments rises up to 40%~. The `sort` query on this workload before update and after update of documents is showing arounf 10x degradation. ### Root cause Numeric field in Lucene utilizes BKD based skipping logic to optimize `sort` queries. The entire logic depends on `estimatedNumberOfPoints` logic which checks if we are able to drill down competitive hits `1/8th` [NumericComparator.java#L281C8](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java#L281C8-L281C8). It might be possible that, if segment has 30% deleted documents and most likely those deleted ones are the competitive, with that logic, we will never able to bring down `estimatedNumberOfMatches` below threshold and there wont be any skipping. This can cause problem for smaller amount of deleted documents as well, the skipping logic will be lesser and lesser optimized as the deleted document size goes higher, and at one point it will just not apply optimization at all. ### Open question to community Is there a way we can handle deleted documents scenario here ? I was thinking if we can directly substract `numDeletedDocuments` from `estimatedNumberOfMatches` so that we can cross the threshold, but that will make `[intersect](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java#L290C19-L290C28)` logic very costly as it has to iterate over those deleted entries. This was originaly caught while we were doing some merge policy improvement in OpenSearch https://github.com/opensearch-project/OpenSearch/issues/10340 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org