Re: [PR] perf(core): Implement sparse LiveDocs to reduce memory by up to 8x [lucene]

via GitHub Wed, 19 Nov 2025 18:05:48 -0800


jainankitk commented on code in PR #15413:
URL: https://github.com/apache/lucene/pull/15413#discussion_r2544098943



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90LiveDocsFormat.java:
##########
@@ -99,12 +101,66 @@ public Bits readLiveDocs(Directory dir, SegmentCommitInfo info, IOContext contex
     throw new AssertionError();
   }
 
+  /**
+   * Reads live docs from input and chooses between sparse and dense representation based on
+   * deletion rate.
+   */
+  private Bits readLiveDocs(IndexInput input, int maxDoc, double deletionRate, int expectedDelCount)
+      throws IOException {
+    Bits liveDocs;
+    int actualDelCount;
+
+    if (deletionRate <= SPARSE_DENSE_THRESHOLD) {
+      SparseFixedBitSet sparse = readSparseFixedBitSet(input, maxDoc);
+      actualDelCount = sparse.cardinality();
+      liveDocs = SparseLiveDocs.builder(sparse, maxDoc).withDeletedCount(actualDelCount).build();
+    } else {
+      FixedBitSet dense = readFixedBitSet(input, maxDoc);
+      actualDelCount = maxDoc - dense.cardinality();
+      liveDocs = DenseLiveDocs.builder(dense, maxDoc).withDeletedCount(actualDelCount).build();
+    }
+
+    if (actualDelCount != expectedDelCount) {
+      throw new CorruptIndexException(
+          "bits.deleted=" + actualDelCount + " info.delcount=" + expectedDelCount, input);
+    }
+
+    return liveDocs;
+  }
+
   private FixedBitSet readFixedBitSet(IndexInput input, int length) throws IOException {
     long[] data = new long[FixedBitSet.bits2words(length)];
     input.readLongs(data, 0, data.length);
     return new FixedBitSet(data, length);
   }
 
+  private SparseFixedBitSet readSparseFixedBitSet(IndexInput input, int length) throws IOException {
+    long[] data = new long[FixedBitSet.bits2words(length)];
+    input.readLongs(data, 0, data.length);
+
+    SparseFixedBitSet sparse = new SparseFixedBitSet(length);
+    for (int wordIndex = 0; wordIndex < data.length; wordIndex++) {
+      long word = data[wordIndex];
+      // Semantic inversion: disk format stores LIVE docs (bit=1 means live, bit=0 means deleted)
+      // but SparseLiveDocs stores DELETED docs (bit=1 means deleted).
+      // Skip words with all bits set (all docs live in disk format = no deletions to convert)
+      if (word == -1L) {
+        continue;
+      }
+      int baseDocId = wordIndex << 6;
+      int maxDocInWord = Math.min(baseDocId + 64, length);
+      for (int docId = baseDocId; docId < maxDocInWord; docId++) {
+        int bitIndex = docId & 63;
+        // If bit is 0 in disk format (deleted doc), set it in sparse representation (bit=1 means
+        // deleted)
+        if ((word & (1L << bitIndex)) == 0) {
+          sparse.set(docId);
+        }
+      }

Review Comment:
   I am wondering if we can directly initialize the `SparseFixedBitSet` from 
the long values that are not -1, instead of iterating over individual docIds? 
Looking at `SparseFixedBitSet`, I could not see a clear way of doing that, but 
it should be more efficient, I guess?
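
   To make the idea concrete, here is a minimal sketch of one intermediate 
option: invert each on-disk word and visit only its set bits with 
`Long.numberOfTrailingZeros`, so the work is proportional to the number of 
deletions rather than to 64 per word. This is not code from the PR. 
`java.util.BitSet` stands in for `SparseFixedBitSet` so the example runs 
without Lucene, and the class and method names are hypothetical. It assumes 
the array holds exactly `FixedBitSet.bits2words(maxDoc)` words.

```java
import java.util.BitSet;

class InvertedWordScan {
  /**
   * Collects deleted doc IDs from on-disk live-docs words (bit=1 means live).
   * Assumption: java.util.BitSet is a stand-in for SparseFixedBitSet, and
   * liveWords.length equals the number of 64-bit words needed for maxDoc bits.
   */
  static BitSet deletedFromLiveWords(long[] liveWords, int maxDoc) {
    BitSet deleted = new BitSet(maxDoc);
    for (int wordIndex = 0; wordIndex < liveWords.length; wordIndex++) {
      // Invert so that bit=1 means deleted; a fully live word becomes 0.
      long deletedBits = ~liveWords[wordIndex];
      int baseDocId = wordIndex << 6;
      int remaining = maxDoc - baseDocId;
      if (remaining < 64) {
        // Last word: ignore padding bits at or beyond maxDoc.
        deletedBits &= (1L << remaining) - 1;
      }
      // Visit only the set bits, lowest first.
      while (deletedBits != 0) {
        int bitIndex = Long.numberOfTrailingZeros(deletedBits);
        deleted.set(baseDocId + bitIndex);
        deletedBits &= deletedBits - 1; // clear the lowest set bit
      }
    }
    return deleted;
  }

  public static void main(String[] args) {
    // Word 0: all live. Word 1: docs 67 and 104 deleted. Word 2: 3 live docs plus padding.
    long[] liveWords = {-1L, ~((1L << 3) | (1L << 40)), 0b0111L};
    System.out.println(deletedFromLiveWords(liveWords, 131)); // prints {67, 104}
  }
}
```

   This still sets one bit at a time, so it does not answer whether whole 
words can be pushed into `SparseFixedBitSet` directly; it only removes the 
per-docId test for live bits.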



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

