julespi opened a new issue, #16250:
URL: https://github.com/apache/lucene/issues/16250
### Description
### Summary
`RoaringDocIdSet.Iterator#intoBitSet` throws a `NullPointerException` when
it is invoked on an already-exhausted iterator. When the iterator is exhausted,
`advance(...)` and `firstDocFromNextBlock()` set `sub = null` together with
`doc = NO_MORE_DOCS`, but `intoBitSet` does not guard for that state and
dereferences `sub`.
This is reached in production through `LRUQueryCache` while materializing
a filter into a `RoaringDocIdSet` (`cacheIntoRoaringDocIdSet` →
`DenseConjunctionBulkScorer#scoreWindowUsingBitSet`
→ `RoaringDocIdSet.Iterator#intoBitSet`), producing failed queries (HTTP
500s) on dense filter conjunctions.
### Stack trace (Lucene 10.4.0)
```
java.lang.NullPointerException: Cannot invoke "org.apache.lucene.search.DocIdSetIterator.intoBitSet(int, org.apache.lucene.util.FixedBitSet, int)" because "this.sub" is null
    at org.apache.lucene.util.RoaringDocIdSet$Iterator.intoBitSet(RoaringDocIdSet.java:325)
    at org.apache.lucene.search.DenseConjunctionBulkScorer.scoreWindowUsingBitSet(DenseConjunctionBulkScorer.java:233)
    at org.apache.lucene.search.DenseConjunctionBulkScorer.scoreWindow(DenseConjunctionBulkScorer.java:202)
    at org.apache.lucene.search.DenseConjunctionBulkScorer.score(DenseConjunctionBulkScorer.java:133)
    at org.apache.lucene.search.LRUQueryCache.cacheIntoRoaringDocIdSet(LRUQueryCache.java:575)
    at org.apache.lucene.search.LRUQueryCache.cacheImpl(LRUQueryCache.java:532)
    at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight$1.iterator(LRUQueryCache.java:828)
    at org.apache.lucene.search.ConstantScoreScorerSupplier.bulkScorer(ConstantScoreScorerSupplier.java:80)
    at org.apache.lucene.search.IndexSearcher.searchLeaf(IndexSearcher.java:842)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:809)
```
### Root cause
In `RoaringDocIdSet.Iterator#intoBitSet`, once the iterator has been
exhausted `sub` is `null` (set by `advance`/`firstDocFromNextBlock` alongside
`doc = NO_MORE_DOCS`). If `intoBitSet` is then called with an `upTo` that falls
in or beyond the exhausted block (`subUpto >= 0`), it reaches
`sub.intoBitSet(...)` and NPEs. Sibling methods in the same class already guard
for the exhausted state; `intoBitSet` does not.
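The `subUpto >= 0` condition can be checked by hand with the numbers used in the reproduction. The following is a minimal sketch, assuming (this report does not state it) that the exhausted iterator's `block` ends up equal to the number of blocks; `blockCount` and `exhaustedBlock` are illustrative names, not Lucene fields.

```java
// Arithmetic sketch only; it does not touch Lucene.
class SubUptoSketch {
  /** Mirrors the expression `upTo - (block << 16)` used by intoBitSet. */
  static int subUpto(int upTo, int block) {
    return upTo - (block << 16);
  }

  public static void main(String[] args) {
    int maxDoc = 3 << 16;                      // 196608 docs
    int blockCount = (maxDoc + 0xFFFF) >>> 16; // 3 blocks of 65536 docs each
    int exhaustedBlock = blockCount;           // ASSUMPTION: block index once exhausted

    // upTo == maxDoc: not negative, so the loop goes on to dereference sub.
    System.out.println(subUpto(maxDoc, exhaustedBlock));
    // A smaller upTo: negative, so the loop breaks before touching sub.
    System.out.println(subUpto(1 << 16, exhaustedBlock));
  }
}
```

Under that assumption, only calls whose `upTo` reaches the end of the last block hit the null `sub`, which would fit the failure showing up on some queries rather than all of them.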
### Minimal reproduction
```java
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.FixedBitSet;
import org.apache.lucene.util.RoaringDocIdSet;

// 3 blocks; docs live only in block 0, later blocks are empty
int maxDoc = 3 << 16;
RoaringDocIdSet set = new RoaringDocIdSet.Builder(maxDoc).add(5).add(10).add(20).build();
DocIdSetIterator it = set.iterator();
while (it.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
  // exhaust the iterator -> sub becomes null, doc becomes NO_MORE_DOCS
}
FixedBitSet bitSet = new FixedBitSet(maxDoc);
it.intoBitSet(maxDoc, bitSet, 0); // throws NPE: this.sub is null
```
### Suggested fix
A null-guard at the start of the loop is sufficient and changes no results
(when `sub == null` the iterator is already at `NO_MORE_DOCS`, so there is
nothing left to write):
```java
public void intoBitSet(int upTo, FixedBitSet bitSet, int offset) throws IOException {
  for (; ; ) {
    if (sub == null) { // iterator exhausted (doc == NO_MORE_DOCS)
      break;
    }
    int subUpto = upTo - (block << 16);
    if (subUpto < 0) {
      break;
    }
    ...
  }
}
```
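To illustrate the failure mode and the effect of the guard without a Lucene dependency, here is a toy model. It is a sketch, not Lucene's implementation: `ExhaustedIteratorSketch`, `ToyIterator`, the `guard` parameter and the use of `java.util.BitSet` are all hypothetical stand-ins that only mirror the state described in this report (`sub == null` together with `doc == NO_MORE_DOCS`).

```java
import java.util.BitSet;

// Toy stand-in for a block-based doc ID iterator. NOT Lucene's RoaringDocIdSet;
// every name here is hypothetical.
class ExhaustedIteratorSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  static final class ToyIterator {
    private final int[][] blocks;   // sorted low 16 bits per block; null = empty block
    private int block = -1;
    private int[] sub = new int[0]; // docs of the current block; null once exhausted
    private int idx = 0;            // position of the current doc inside sub
    int doc = -1;

    ToyIterator(int[][] blocks) {
      this.blocks = blocks;
    }

    int nextDoc() {
      if (++idx < sub.length) {
        return doc = (block << 16) | sub[idx];
      }
      return firstDocFromNextBlock();
    }

    // Assumes every non-null block holds at least one doc.
    private int firstDocFromNextBlock() {
      while (true) {
        block++;
        if (block >= blocks.length) {
          sub = null; // exhausted: sub is cleared together with doc
          return doc = NO_MORE_DOCS;
        }
        if (blocks[block] != null) {
          sub = blocks[block];
          idx = 0;
          return doc = (block << 16) | sub[0];
        }
      }
    }

    /** Sets bits for docs in [doc, upTo); `guard` toggles the proposed null check. */
    void intoBitSet(int upTo, BitSet bits, int offset, boolean guard) {
      for (; ; ) {
        if (guard && sub == null) {
          break; // nothing left to write
        }
        int subUpto = upTo - (block << 16);
        if (subUpto < 0) {
          break;
        }
        // Without the guard, sub.length throws here when sub == null.
        while (idx < sub.length && sub[idx] < subUpto) {
          bits.set(((block << 16) | sub[idx]) - offset);
          idx++;
        }
        if (idx < sub.length) {
          doc = (block << 16) | sub[idx];
          break;
        }
        if (firstDocFromNextBlock() == NO_MORE_DOCS) {
          break;
        }
      }
    }
  }

  public static void main(String[] args) {
    int maxDoc = 3 << 16;
    ToyIterator it = new ToyIterator(new int[][] {{5, 10, 20}, null, null});
    while (it.nextDoc() != NO_MORE_DOCS) {
      // exhaust the iterator
    }
    BitSet bits = new BitSet(maxDoc);
    try {
      it.intoBitSet(maxDoc, bits, 0, false);
    } catch (NullPointerException e) {
      System.out.println("unguarded: NullPointerException");
    }
    it.intoBitSet(maxDoc, bits, 0, true);
    System.out.println("guarded: " + bits); // prints "guarded: {}"
  }
}
```

In this model the guard only takes effect once `sub` is already `null`, so a live iterator writes the same bits with or without it.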
### Note on `main`
This specific code path no longer exists on `main`: the LRU cache
materialization was reworked to use `LeafCollector#collectRange` /
`DocIdStream` instead of the `intoBitSet` / `DenseConjunctionBulkScorer` path
(PRs #16081 and #16178). However, those changes are not part of any released
version, and **10.4.0 (the latest release) is affected**. A targeted guard
would allow fixing it in a 10.4.x release without backporting the larger
refactor.
### Version and environment details
Lucene 10.4.0 (latest released version).
JVM: Amazon Corretto 25 (Java 25). OS: Linux (container).
Reached at search time via IndexSearcher + LRUQueryCache with an aggressive
caching policy that caches filters on first use, on a large index (~600M docs).