gashutos opened a new issue, #12720:
URL: https://github.com/apache/lucene/issues/12720

   ### Description
   
   ### Problem
   With higher number of deleted entries in a segment, the sort query shows up 
to `10x` degradation after one point. We did this experiment using 
[nyc_taxis](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/nyc_taxis)
 workload and updated around 20% of documents in this workload. With this, the 
number of deleted documents in some segments rises up to 40%~.  The `sort` 
query on this workload before update and after update of documents is showing 
arounf 10x degradation.
   
   ### Root cause
   Numeric field in Lucene utilizes BKD based skipping logic to optimize `sort` 
queries.  The entire logic depends on `estimatedNumberOfPoints` logic which 
checks if we are able to drill down competitive hits `1/8th` 
[NumericComparator.java#L281C8](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java#L281C8-L281C8).
  It might be possible that, if segment has 30% deleted documents and most 
likely those deleted ones are the competitive, with that logic, we will never 
able to bring down `estimatedNumberOfMatches` below threshold and there wont be 
any skipping. This can cause problem for smaller amount of deleted documents as 
well, the skipping logic will be lesser and lesser optimized as the deleted 
document size goes higher, and at one point it will just not apply optimization 
at all. 
   
   ### Open question to community
   Is there a way we can handle deleted documents scenario here ? 
   I was thinking if we can directly substract `numDeletedDocuments` from 
`estimatedNumberOfMatches` so that we can cross the threshold, but that will 
make 
`[intersect](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java#L290C19-L290C28)`
 logic very costly as it has to iterate over those deleted entries. 
   
   This was originaly caught while we were doing some merge policy improvement 
in OpenSearch https://github.com/opensearch-project/OpenSearch/issues/10340 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to