[PR] Add SIMD-accelerated bulk range evaluation for dense numeric doc values [lucene]

via GitHub Mon, 11 May 2026 23:30:08 -0700


sgup432 opened a new pull request, #16050:
URL: https://github.com/apache/lucene/pull/16050


   ### Description
   
   Numeric range queries on dense fields use DocValuesRangeIterator, which is a 
TwoPhaseIterator that uses SkipBlockRangeIterator as an approximation. This 
works well, but for MAYBE blocks (where values partially overlap the query 
range), it still falls back to per-doc evaluation: each doc is checked 
individually via values.advance(doc) + values.longValue() + range comparison.
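
To make the YES / NO / MAYBE distinction concrete, here is a minimal sketch of how a block's min/max summary might be classified against an inclusive query range, and what the per-doc fallback for a MAYBE block looks like. The class, enum, and method names are hypothetical, and a plain `long[]` stands in for the doc values; this is not the actual SkipBlockRangeIterator code.

```java
/** Illustrative sketch only; names are hypothetical, not Lucene's API. */
final class BlockRangeSketch {
  enum Match { YES, NO, MAYBE }

  /** Classifies a block from its min/max summary against an inclusive query range. */
  static Match classify(long blockMin, long blockMax, long queryMin, long queryMax) {
    if (blockMax < queryMin || blockMin > queryMax) {
      return Match.NO; // no value in the block can match
    }
    if (blockMin >= queryMin && blockMax <= queryMax) {
      return Match.YES; // every value in the block matches
    }
    return Match.MAYBE; // partial overlap: each doc has to be checked
  }

  /** Per-doc evaluation of a MAYBE block: one range comparison per document. */
  static int countMatches(long[] values, int fromDoc, int toDoc, long queryMin, long queryMax) {
    int count = 0;
    for (int doc = fromDoc; doc < toDoc; doc++) { // toDoc is exclusive (assumption)
      long v = values[doc];
      if (v >= queryMin && v <= queryMax) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println(classify(0, 100, 10, 20)); // prints MAYBE
    System.out.println(countMatches(new long[] {5, 12, 30, 15}, 0, 4, 10, 20)); // prints 2
  }
}
```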
   
   Since DocValuesRangeIterator is a TwoPhaseIterator, 
`DenseConjunctionBulkScorer` routes it through the leap-frog path (see 
[here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java#L201C5-L208C6))
 and `intoBitSet()` is never called. This means SIMD is never used for MAYBE 
block evaluation, even though the underlying storage for dense fields is a 
packed long[] that's ideal for vectorized comparison.
   
   ### PR changes
   
   For dense singleton numeric fields with a skip index, replace 
DocValuesRangeIterator with a new `BatchDocValuesRangeIterator`, which is a 
plain DocIdSetIterator (not a TwoPhaseIterator). This forces 
DenseConjunctionBulkScorer to call intoBitSet() on it directly, enabling the 
bitset intersection path. I am open to suggestions on whether this is the right approach.
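
As a rough illustration of what a bitset intersection path buys, the sketch below ANDs per-clause window bitsets one 64-bit word at a time instead of advancing iterators doc by doc. It assumes each clause has already written its matches for the same doc window into a `long[]`; the class and method names are hypothetical and this is not the DenseConjunctionBulkScorer implementation.

```java
/** Illustrative sketch only; names are hypothetical, not Lucene's API. */
final class WindowIntersectionSketch {
  /** Intersects clause matches into acc, 64 docs per AND instruction. */
  static void and(long[] acc, long[] clause) {
    for (int i = 0; i < acc.length; i++) {
      acc[i] &= clause[i];
    }
  }

  /** Number of docs that survived the intersection. */
  static int cardinality(long[] bits) {
    int count = 0;
    for (long word : bits) {
      count += Long.bitCount(word);
    }
    return count;
  }

  public static void main(String[] args) {
    long[] acc = {0b1110L};    // clause A matches docs 1, 2, 3 of the window
    long[] clause = {0b0111L}; // clause B matches docs 0, 1, 2 of the window
    and(acc, clause);
    System.out.println(cardinality(acc)); // prints 2 (docs 1 and 2)
  }
}
```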
   
   This PR also adds SIMD-accelerated bulk range evaluation for MAYBE (partial 
overlap) blocks, which appear to be the most expensive case when running range 
queries through doc values.
   
   The changes are:
   
   - Add `NumericDocValues.rangeIntoBitSet(fromDoc, toDoc, minValue, maxValue, 
bitSet, offset)`: a new bulk API whose default implementation falls back to 
per-doc evaluation. Lucene90DocValuesProducer overrides this for dense fields 
to dispatch to the vectorization layer.
   
   - Add a **DocValuesRangeSupport** interface with two implementations:
   
      - **PanamaDocValuesRangeSupport** — SIMD implementation using the Panama 
Vector API (LongVector.SPECIES_PREFERRED). Evaluates multiple values per CPU 
instruction using vectorized range comparisons.
      - **DefaultDocValuesRangeSupport** — scalar tight loop fallback.
      
   - `VectorizationProvider.getDocValuesRangeSupport()` returns the appropriate 
implementation at startup.
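
For readers who want a feel for the scalar fallback, here is a minimal sketch of a tight-loop range evaluation that writes matches into a bitset. It is written under several assumptions that the description does not state: a plain `long[]` stands in for the dense values, a `long[]` of 64-bit words stands in for Lucene's bitset, `toDoc` is exclusive, and bit `doc - offset` is set for each match. The class name is hypothetical; this is not the PR's `DefaultDocValuesRangeSupport`.

```java
/** Illustrative sketch only; names and semantics are assumptions, not the PR's code. */
final class ScalarRangeSketch {
  /**
   * Sets bit (doc - offset) in bits for every doc in [fromDoc, toDoc) whose value
   * lies in the inclusive range [minValue, maxValue].
   */
  static void rangeIntoBitSet(
      long[] values, int fromDoc, int toDoc, long minValue, long maxValue, long[] bits, int offset) {
    for (int doc = fromDoc; doc < toDoc; doc++) {
      long v = values[doc];
      if (v >= minValue && v <= maxValue) {
        int bit = doc - offset;
        bits[bit >>> 6] |= 1L << (bit & 63); // word index = bit / 64, position = bit % 64
      }
    }
  }

  public static void main(String[] args) {
    long[] values = {5, 12, 30, 15, 20, 7};
    long[] bits = new long[1];
    rangeIntoBitSet(values, 0, values.length, 10, 20, bits, 0);
    System.out.println(Long.toBinaryString(bits[0])); // prints 11010 (docs 1, 3, 4)
  }
}
```

A vectorized variant would load several values at once and produce the match mask for a whole lane group per comparison, which is the part the Panama implementation replaces.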
   
   
   
   ### Benchmarks 
   ```
   MultiFieldDocValuesRangeBenchmark (c5.2xlarge, AVX-512)
   Mode: Throughput (ops/s, higher is better)
   JVM args: --add-modules=jdk.incubator.vector
   Warmup: 3 x 3s, Measurement: 5 x 5s, Fork: 1
   ```
   
   
   Data Pattern | docCount | Fields | Baseline (ops/s) | Optimized (ops/s) | Change
   -------------|----------|--------|------------------|-------------------|-------
   random       |       1M |      1 |            59.99 |            208.27 |  +247%
   random       |       1M |      3 |            34.83 |             69.30 |   +99%
   random       |       1M |      5 |            29.40 |             65.10 |  +121%
   random       |      10M |      1 |             6.12 |             25.16 |  +311%
   random       |      10M |      3 |             3.41 |              8.38 |  +146%
   random       |      10M |      5 |             2.82 |              7.45 |  +164%
   clustered    |       1M |      1 |          6231.86 |           8584.63 |   +38%
   clustered    |       1M |      3 |          9142.82 |          35488.66 |  +288%
   clustered    |       1M |      5 |          7072.30 |          32583.89 |  +361%
   clustered    |      10M |      1 |           685.27 |           1253.04 |   +83%
   clustered    |      10M |      3 |          8314.53 |          23913.65 |  +188%
   clustered    |      10M |      5 |          8855.14 |          12703.13 |   +43%
   
   
   Throughput improves in every configuration, from +38% (clustered, 1M docs, 1 field) to +361% (clustered, 1M docs, 5 fields).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

