[PR] Help collectors take advantage of bulk-retrieval of doc values. [lucene]

via GitHub Thu, 11 Sep 2025 23:52:15 -0700


jpountz opened a new pull request, #15173:
URL: https://github.com/apache/lucene/pull/15173


   We added `NumericDocValues#longValues` in #15149 to retrieve multiple values 
in a single call and better amortize the cost of virtual function calls. 
However, it is not easy to take advantage of this API in collectors today. The 
best option is to buffer doc IDs in `LeafCollector#collect`, but that 
`LeafCollector#collect` call may itself be virtual, so we may still be doing at 
least one virtual call per collected doc.
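   
   To make the buffering workaround concrete, here is a minimal, Lucene-free 
sketch of it. `BulkLongValues` and `BufferingSumCollector` are hypothetical 
stand-ins invented for this illustration, not Lucene types; the bulk read is 
only modeled on the idea behind `NumericDocValues#longValues`. Note that 
`collect(int)` is still invoked once per doc, which is the cost described above.
   
   ```java
   /** Hypothetical stand-in for a bulk doc-values reader; not a Lucene type. */
   interface BulkLongValues {
     /** Fills out[0..size) with the values of docs[0..size). */
     void longValues(int size, int[] docs, long[] out);
   }
   
   /** Buffers doc IDs passed one at a time, then fetches their values in bulk. */
   final class BufferingSumCollector {
     private final BulkLongValues values;
     private final int[] docBuffer = new int[64];
     private final long[] valueBuffer = new long[docBuffer.length];
     private int buffered;
     long sum;
   
     BufferingSumCollector(BulkLongValues values) {
       this.values = values;
     }
   
     /** Called once per matching doc; this per-doc call is the one that may be virtual. */
     void collect(int doc) {
       docBuffer[buffered++] = doc;
       if (buffered == docBuffer.length) {
         flush();
       }
     }
   
     /** Must be called after the last doc to drain a partially filled buffer. */
     void finish() {
       flush();
     }
   
     private void flush() {
       // One bulk call covers every buffered doc.
       values.longValues(buffered, docBuffer, valueBuffer);
       for (int i = 0; i < buffered; i++) {
         sum += valueBuffer[i];
       }
       buffered = 0;
     }
   }
   ```
   
   The bulk read is amortized over up to 64 docs here, but the per-doc 
`collect` call is not.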
   
   This PR adds `DocIdStream#intoArray` to help collectors collect batches of 
doc IDs at once and retrieve numeric doc values for them. This way, you can 
compute facets/aggregations with fewer than one virtual call per collected 
document.
   
   I benchmarked on the geonames dataset, computing the average of the 
elevation field with the following collector:
   
   ```java
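   import java.io.IOException;
   import org.apache.lucene.index.DocValues;
   import org.apache.lucene.index.LeafReaderContext;
   import org.apache.lucene.index.NumericDocValues;
   import org.apache.lucene.index.SortedNumericDocValues;
   import org.apache.lucene.search.CollectionTerminatedException;
   import org.apache.lucene.search.Collector;
   import org.apache.lucene.search.DocIdStream;
   import org.apache.lucene.search.LeafCollector;
   import org.apache.lucene.search.Scorable;
   import org.apache.lucene.search.ScoreMode;
   
   class AverageCollector implements Collector {
   
     long sum; // sum of the values of collected docs
     long count; // number of collected docs that have a value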
   
     @Override
     public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
       SortedNumericDocValues sortedValues = context.reader().getSortedNumericDocValues("elevation");
       if (sortedValues == null) {
         // This segment has no doc values for the field, so there is nothing to collect.
         throw new CollectionTerminatedException();
       }
       NumericDocValues values = DocValues.unwrapSingleton(sortedValues);
       if (values == null) { // the field is expected to be single-valued; null means it is not
         throw new Error();
       }
       return new LeafCollector() {
   
         int[] docBuffer = new int[64];
         long[] valueBuffer = new long[docBuffer.length];
   
         @Override
         public void setScorer(Scorable scorer) throws IOException {
           // Scores are not needed to compute an average.
         }
   
         @Override
         public void collect(DocIdStream stream) throws IOException {
           // Drain the stream in batches of up to docBuffer.length doc IDs.
           for (int n = stream.intoArray(docBuffer); n != 0; n = stream.intoArray(docBuffer)) {
             // Long.MIN_VALUE is the default value reported for docs that have no value.
             values.longValues(n, docBuffer, valueBuffer, Long.MIN_VALUE);
   
             for (int i = 0; i < n; ++i) {
               long v = valueBuffer[i];
               if (v != Long.MIN_VALUE) {
                 // Only docs that have a value contribute to the sum and the count.
                 sum += v;
                 count++;
               }
             }
           }
         }
   
         @Override
         public void collect(int doc) throws IOException {
           if (values.advanceExact(doc)) {
             sum += values.longValue();
             count += 1;
           }
         }
       };
     }
   
     @Override
     public ScoreMode scoreMode() {
       return ScoreMode.COMPLETE_NO_SCORES;
     }
   
   }
   ```
   
   The benchmark first runs a few queries with different collectors, so that 
the `LeafCollector#collect` call site is polymorphic. I get the following numbers:
   
   | Query | Notes | Latency without `DocIdStream#intoArray` (ms) | Latency with `DocIdStream#intoArray` (ms) |
   | - | - | - | - |
   | `MatchAllDocsQuery` | Uses `RangeDocIdStream` under the hood | 81 | 65 |
   | `featureClass:(S P)` | Matches spots or cities (5M docs), uses `BitSetDocIdStream` under the hood | 54 | 43 |
   
   Note that `NumericDocValues#advanceExact` and `NumericDocValues#longValue` 
are not polymorphic in this benchmark since all segments use the same impl on 
the `elevation` field. The difference would be bigger if they were.
   
   My profiler suggests that retrieving doc values is the bottleneck when 
computing the average on a `MatchAllDocsQuery`, while the bottleneck is a mix 
of retrieving doc values and extracting set bits from the bit set when 
computing the average on `featureClass:(S P)`.
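   
   For readers unfamiliar with the second cost, here is a minimal sketch of 
extracting set bits from a bit set stored as 64-bit words into an `int[]`. This 
is a generic illustration, not the actual `BitSetDocIdStream` code; the 
`BitExtraction` class and `extractSetBits` method are hypothetical names, and 
the sketch assumes `out` is large enough to hold every set bit.
   
   ```java
   final class BitExtraction {
     /**
      * Writes the index of every set bit of {@code words} into {@code out}, in
      * increasing order, and returns how many indices were written.
      */
     static int extractSetBits(long[] words, int[] out) {
       int count = 0;
       for (int w = 0; w < words.length; w++) {
         long word = words[w];
         while (word != 0) {
           int bit = Long.numberOfTrailingZeros(word); // position of the lowest set bit
           out[count++] = (w << 6) | bit; // each word covers 64 consecutive indices
           word &= word - 1; // clear the lowest set bit
         }
       }
       return count;
     }
   }
   ```
   
   The inner loop runs once per set bit, so its cost grows with the number of 
matching docs rather than with the size of the bit set alone.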

