salvatorecampagna commented on code in PR #15413:
URL: https://github.com/apache/lucene/pull/15413#discussion_r2544978274


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90LiveDocsFormat.java:
##########
@@ -99,12 +101,66 @@ public Bits readLiveDocs(Directory dir, SegmentCommitInfo info, IOContext contex
     throw new AssertionError();
   }
 
+  /**
+   * Reads live docs from input and chooses between sparse and dense representation based on
+   * deletion rate.
+   */
+  private Bits readLiveDocs(IndexInput input, int maxDoc, double deletionRate, int expectedDelCount)
+      throws IOException {
+    Bits liveDocs;
+    int actualDelCount;
+
+    if (deletionRate <= SPARSE_DENSE_THRESHOLD) {
+      SparseFixedBitSet sparse = readSparseFixedBitSet(input, maxDoc);
+      actualDelCount = sparse.cardinality();
+      liveDocs = SparseLiveDocs.builder(sparse, maxDoc).withDeletedCount(actualDelCount).build();
+    } else {
+      FixedBitSet dense = readFixedBitSet(input, maxDoc);
+      actualDelCount = maxDoc - dense.cardinality();
+      liveDocs = DenseLiveDocs.builder(dense, maxDoc).withDeletedCount(actualDelCount).build();
+    }
+
+    if (actualDelCount != expectedDelCount) {
+      throw new CorruptIndexException(
+          "bits.deleted=" + actualDelCount + " info.delcount=" + expectedDelCount, input);
+    }
+
+    return liveDocs;
+  }
+
   private FixedBitSet readFixedBitSet(IndexInput input, int length) throws IOException {
     long[] data = new long[FixedBitSet.bits2words(length)];
     input.readLongs(data, 0, data.length);
     return new FixedBitSet(data, length);
   }
 
+  private SparseFixedBitSet readSparseFixedBitSet(IndexInput input, int length) throws IOException {
+    long[] data = new long[FixedBitSet.bits2words(length)];
+    input.readLongs(data, 0, data.length);
+
+    SparseFixedBitSet sparse = new SparseFixedBitSet(length);
+    for (int wordIndex = 0; wordIndex < data.length; wordIndex++) {
+      long word = data[wordIndex];
+      // Semantic inversion: disk format stores LIVE docs (bit=1 means live, bit=0 means deleted)
+      // but SparseLiveDocs stores DELETED docs (bit=1 means deleted).
+      // Skip words with all bits set (all docs live in disk format = no deletions to convert)
+      if (word == -1L) {
+        continue;
+      }
+      int baseDocId = wordIndex << 6;
+      int maxDocInWord = Math.min(baseDocId + 64, length);
+      for (int docId = baseDocId; docId < maxDocInWord; docId++) {
+        int bitIndex = docId & 63;
+        // If bit is 0 in disk format (deleted doc), set it in sparse representation (bit=1 means
+        // deleted)
+        if ((word & (1L << bitIndex)) == 0) {
+          sparse.set(docId);
+        }
+      }

Review Comment:
   @jainankitk I see your point, and I think it is an interesting opportunity that is at least worth considering.
   
   Ideally `deletedDocs = ~liveDocs` (I understand this is your idea); the challenge, however, is to maintain sparsity and avoid unnecessary word allocations.
   
   `SparseFixedBitSet`'s key advantage is that it doesn't allocate memory for 
words with no bits set. The current bit-by-bit approach naturally preserves 
this: if a word on disk is all 1s (all live docs), we never call `sparse.set()` 
and no memory is allocated for that word/block.
   
   If we try to optimize by directly inserting inverted word values, we'd need to (see the sketch after this list):
   1. Check whether `~word != 0` (to skip words where all docs are live)
   2. Navigate `SparseFixedBitSet`'s two-level sparse structure
   3. Ensure blocks are allocated only when needed
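
   For illustration, a word-at-a-time variant might look roughly like this. It is only a sketch: it reuses `data`, `length`, and `sparse` from the snippet above, and since `SparseFixedBitSet` exposes no word-level insert, it still calls the public `set(int)` once per deleted doc, just iterating only the set bits of each inverted word:

   ```java
   for (int wordIndex = 0; wordIndex < data.length; wordIndex++) {
     long inverted = ~data[wordIndex]; // invert: bit=1 now means deleted
     if (inverted == 0L) {
       continue; // all docs in this word are live: nothing to allocate
     }
     int baseDocId = wordIndex << 6;
     while (inverted != 0L) {
       int docId = baseDocId + Long.numberOfTrailingZeros(inverted);
       if (docId >= length) {
         break; // padding bits of the last word invert to 1; ignore them
       }
       sparse.set(docId);
       inverted &= inverted - 1; // clear the lowest set bit
     }
   }
   ```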
   
   Going further and inserting whole inverted words directly would require exposing or duplicating `SparseFixedBitSet`'s internal allocation logic, adding complexity. Also, exposing a method like `setWord(...)` might be error-prone, as callers could end up setting or clearing bits incorrectly. Ideally this would happen as part of the construction logic, without exposing a new API.
   
   Moreover, the current approach skips words with no deletions (the `word == -1L` check), and for words with deletions we iterate the 64 bit positions but only allocate memory for the actual deleted docs. The bit-manipulation overhead is negligible compared to disk I/O.
   
   I think the current implementation strikes the right balance between simplicity, correctness, and performance. I might look at this anyway, just to see whether it makes sense, and maybe push a draft follow-up PR including benchmarks.
   
   If that approach turns out to be faster, it would make sense in the end, considering that this change adds a bit of overhead when loading data from disk and converting from the dense to the sparse representation.


