RS146BIJAY opened a new issue, #15352: URL: https://github.com/apache/lucene/issues/15352
### Description

## Background

We are working on a feature in OpenSearch to support context-aware segments. It involves maintaining multiple `IndexWriter` instances within a shard, one for each group, to colocate related data in the same segment or group of segments. The design is detailed in the following RFCs and LLD:

* [OpenSearch RFC](https://github.com/opensearch-project/OpenSearch/issues/18576)
* [Lucene RFC](https://github.com/apache/lucene/issues/13387)
* [OpenSearch LLD](https://github.com/opensearch-project/OpenSearch/issues/19530)

## Current Use Case

With context-aware segments, writes within a shard are routed to the respective group-specific `IndexWriter` instances. To keep versioning consistent across writers during an update operation, we perform a **hard delete** of the previous document version in the parent (accumulating) `IndexWriter` whenever a new version is added to a group-specific writer.

## Problem Description

Today, with only soft deletes enabled, OpenSearch's DocRep recovery [uses `SegmentReader.hardLiveDocs` to query live docs](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/common/lucene/Lucene.java#L942) from segments with hard deletes (which may have been introduced by the `IndexWriter` hitting non-aborting exceptions). The number of live docs is derived cheaply as:

`segmentReader.maxDoc() - segmentReader.getSegmentInfo().getDelCount()`

However, when both soft and hard deletes are performed on a Lucene index with context-aware segments enabled, this calculation breaks down, because `segmentReader.getSegmentInfo().getDelCount()` no longer provides an accurate delete count for the segment.
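To make the mismatch concrete, here is a minimal sketch that compares the count-based shortcut with a direct scan of the live-docs bits. It is not OpenSearch or Lucene code: the nested `Bits` interface is a stand-in that mirrors the `get(int)`/`length()` shape of `org.apache.lucene.util.Bits` so the example compiles without Lucene, and the class name, helper names, and the values in `main` (including the stale `recordedDelCount`) are hypothetical and chosen only for illustration.

```java
import java.util.BitSet;

public class LiveDocCountSketch {

    /** Stand-in for org.apache.lucene.util.Bits (assumption: same get/length shape). */
    public interface Bits {
        boolean get(int index);
        int length();
    }

    /** Wraps a java.util.BitSet as a fixed-length Bits view; a set bit means "live". */
    public static Bits fromBitSet(BitSet bits, int maxDoc) {
        return new Bits() {
            @Override public boolean get(int index) { return bits.get(index); }
            @Override public int length() { return maxDoc; }
        };
    }

    /** O(1) shortcut: only correct if delCount matches the number of cleared bits. */
    public static int liveByDelCount(int maxDoc, int delCount) {
        return maxDoc - delCount;
    }

    /** O(maxDoc) fallback: visit every doc ID and count the bits that are still set. */
    public static int liveByScan(Bits hardLiveDocs) {
        int live = 0;
        for (int docId = 0; docId < hardLiveDocs.length(); docId++) {
            if (hardLiveDocs.get(docId)) {
                live++;
            }
        }
        return live;
    }

    public static void main(String[] args) {
        int maxDoc = 10;
        BitSet live = new BitSet(maxDoc);
        live.set(0, maxDoc); // start with every document live
        live.clear(2);       // two documents are hard-deleted...
        live.clear(7);
        int recordedDelCount = 1; // ...but the recorded count (hypothetical) reflects only one

        System.out.println("shortcut: " + liveByDelCount(maxDoc, recordedDelCount));   // shortcut: 9
        System.out.println("scan:     " + liveByScan(fromBitSet(live, maxDoc)));       // scan:     8
    }
}
```

Whenever the recorded delete count and the bits disagree, the shortcut silently returns the wrong number, while the scan stays correct at the cost of touching every doc ID in the segment.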
Based on [Lucene's unit tests for mixed deletes](https://github.com/apache/lucene/blob/f2da05b25396a72adb07895c8858a15841c3c6a9/lucene/core/src/test/org/apache/lucene/index/TestSoftDeletesRetentionMergePolicy.java#L696), the only reliable way to get the live doc count appears to be iterating over `hardLiveDocs` and counting the set bits.

## Performance Impact

Counting bits this way takes time proportional to `maxDoc` for every segment. That is expensive for large segments and could cause significant performance regressions during shard recovery.

## Ask from this issue

Is there a more efficient, direct way to retrieve the count of live documents from a `SegmentReader`'s `hardLiveDocs` when a segment has undergone both hard and soft deletes?

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
