salvatorecampagna commented on code in PR #15413:
URL: https://github.com/apache/lucene/pull/15413#discussion_r2544978274
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90LiveDocsFormat.java:
##########
@@ -99,12 +101,66 @@ public Bits readLiveDocs(Directory dir, SegmentCommitInfo info, IOContext contex
     throw new AssertionError();
   }
+  /**
+   * Reads live docs from input and chooses between sparse and dense representation based on
+   * deletion rate.
+   */
+  private Bits readLiveDocs(IndexInput input, int maxDoc, double deletionRate, int expectedDelCount)
+      throws IOException {
+    Bits liveDocs;
+    int actualDelCount;
+
+    if (deletionRate <= SPARSE_DENSE_THRESHOLD) {
+      SparseFixedBitSet sparse = readSparseFixedBitSet(input, maxDoc);
+      actualDelCount = sparse.cardinality();
+      liveDocs = SparseLiveDocs.builder(sparse, maxDoc).withDeletedCount(actualDelCount).build();
+    } else {
+      FixedBitSet dense = readFixedBitSet(input, maxDoc);
+      actualDelCount = maxDoc - dense.cardinality();
+      liveDocs = DenseLiveDocs.builder(dense, maxDoc).withDeletedCount(actualDelCount).build();
+    }
+
+    if (actualDelCount != expectedDelCount) {
+      throw new CorruptIndexException(
+          "bits.deleted=" + actualDelCount + " info.delcount=" + expectedDelCount, input);
+    }
+
+    return liveDocs;
+  }
+
   private FixedBitSet readFixedBitSet(IndexInput input, int length) throws IOException {
     long[] data = new long[FixedBitSet.bits2words(length)];
     input.readLongs(data, 0, data.length);
     return new FixedBitSet(data, length);
   }
+  private SparseFixedBitSet readSparseFixedBitSet(IndexInput input, int length) throws IOException {
+    long[] data = new long[FixedBitSet.bits2words(length)];
+    input.readLongs(data, 0, data.length);
+
+    SparseFixedBitSet sparse = new SparseFixedBitSet(length);
+    for (int wordIndex = 0; wordIndex < data.length; wordIndex++) {
+      long word = data[wordIndex];
+      // Semantic inversion: disk format stores LIVE docs (bit=1 means live, bit=0 means deleted)
+      // but SparseLiveDocs stores DELETED docs (bit=1 means deleted).
+      // Skip words with all bits set (all docs live in disk format = no deletions to convert)
+      if (word == -1L) {
+        continue;
+      }
+      int baseDocId = wordIndex << 6;
+      int maxDocInWord = Math.min(baseDocId + 64, length);
+      for (int docId = baseDocId; docId < maxDocInWord; docId++) {
+        int bitIndex = docId & 63;
+        // If bit is 0 in disk format (deleted doc), set it in sparse representation (bit=1 means
+        // deleted)
+        if ((word & (1L << bitIndex)) == 0) {
+          sparse.set(docId);
+        }
+      }
Review Comment:
@jainankitk I see your point, and I think it is an interesting opportunity worth considering.
Ideally `deletedDocs = ~liveDocs` (which, I understand, is your idea); the challenge, however, is to maintain sparsity and avoid unnecessary word allocations.
`SparseFixedBitSet`'s key advantage is that it doesn't allocate memory for
words with no bits set. The current bit-by-bit approach naturally preserves
this: if a word on disk is all 1s (all live docs), we never call `sparse.set()`
and no memory is allocated for that word/block.
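To make this concrete, here is a tiny standalone demo (illustrative only, not part of this PR; class and variable names are made up) showing that a `SparseFixedBitSet` over a large `maxDoc` grows its reported memory only for the blocks that actually receive set bits:

```java
import org.apache.lucene.util.SparseFixedBitSet;

public class SparsityDemo {
  public static void main(String[] args) {
    int maxDoc = 10_000_000;
    SparseFixedBitSet deleted = new SparseFixedBitSet(maxDoc);
    long before = deleted.ramBytesUsed();
    // Mark a handful of "deleted" docs clustered in a single region.
    for (int doc = 1_000; doc < 1_010; doc++) {
      deleted.set(doc);
    }
    long after = deleted.ramBytesUsed();
    // After construction, memory grows only for blocks that actually contain
    // set bits, not in proportion to maxDoc.
    System.out.println("before=" + before + " bytes, after=" + after + " bytes");
  }
}
```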
If we try to optimize by directly inserting inverted word values, we'd need to:
1. Check whether `~word != 0` (to skip words where all docs are live)
2. Navigate `SparseFixedBitSet`'s two-level sparse structure
3. Ensure blocks are allocated only when needed

This would require exposing or duplicating `SparseFixedBitSet`'s internal allocation logic, possibly adding complexity. Also, I think exposing a method like `setWord(...)` might be a bit error-prone, as it could end up setting or clearing bits incorrectly. IMO (probably what you are suggesting too, if I understand correctly) the ideal would be to do this as part of the construction logic without exposing a new API.
Moreover, the current approach already skips words with no deletions (the `word == -1L` check), and for words with deletions we iterate the 64 bit positions but only allocate memory for the docs that are actually deleted. The bit-manipulation overhead is negligible compared to disk I/O.
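Just to make the idea concrete, something along these lines is roughly what a word-level conversion could look like — a sketch only (the helper name is made up and this is not code from this PR), using nothing beyond the public `set(int)` API and `Long.numberOfTrailingZeros` so that only the deleted bits of each word are visited:

```java
// Sketch (not in this PR): turn disk words (bit=1 means live) into deleted-doc
// bits, touching only the deleted positions of each word.
private static void addDeletedBits(SparseFixedBitSet sparse, long[] data, int maxDoc) {
  for (int wordIndex = 0; wordIndex < data.length; wordIndex++) {
    long deleted = ~data[wordIndex]; // invert: bit=1 now means deleted
    if (deleted == 0L) {
      continue; // all docs in this word are live, nothing to allocate
    }
    int baseDocId = wordIndex << 6;
    while (deleted != 0L) {
      int docId = baseDocId + Long.numberOfTrailingZeros(deleted);
      if (docId >= maxDoc) {
        break; // ignore padding bits past maxDoc in the last word
      }
      sparse.set(docId);
      deleted &= deleted - 1; // clear the lowest set bit
    }
  }
}
```

Whether this actually beats the current per-bit loop is something benchmarks would need to show.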
I think the current implementation strikes the right balance between simplicity, correctness, and performance. I might still look into this just to see whether it makes sense, and maybe push a draft follow-up PR including benchmarks.
If that approach turns out to be faster, it would be worth adopting, considering that this change already adds a bit of overhead when loading data from disk and converting from dense to sparse.