jpountz opened a new pull request, #15173:
URL: https://github.com/apache/lucene/pull/15173
In #15149, we added `NumericDocValues#longValues` to retrieve multiple values
in a single call and better amortize the cost of virtual function calls.
However, it is not easy today to take advantage of this API in collectors. The
best option is to buffer doc IDs in `LeafCollector#collect`, but this is not
great since the `LeafCollector#collect` call itself may be virtual, so we may
still be doing at least one virtual call per collected doc.
This PR adds `DocIdStream#intoArray` to help collectors collect batches of
doc IDs at once and retrieve numeric doc values for them. This way, you can
compute facets/aggregations with fewer than one virtual call per collected
document.
I benchmarked on the geonames dataset, computing the average of the
elevation field with the following collector:
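```java
import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.SortedNumericDocValues;
import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.DocIdStream;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorable;
import org.apache.lucene.search.ScoreMode;

class AverageCollector implements Collector {
  long sum;
  long count; // number of collected docs that have a value; average = sum / count

  @Override
  public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    SortedNumericDocValues sortedValues =
        context.reader().getSortedNumericDocValues("elevation");
    if (sortedValues == null) {
      // This segment has no doc values for the field: skip it.
      throw new CollectionTerminatedException();
    }
    NumericDocValues values = DocValues.unwrapSingleton(sortedValues);
    if (values == null) {
      // unwrapSingleton returns null for multi-valued fields; the benchmark expects a single-valued field.
      throw new Error();
    }
    return new LeafCollector() {
      int[] docBuffer = new int[64];
      long[] valueBuffer = new long[docBuffer.length];

      @Override
      public void setScorer(Scorable scorer) throws IOException {}

      @Override
      public void collect(DocIdStream stream) throws IOException {
        // Drain the stream in batches of up to docBuffer.length doc IDs.
        for (int size = stream.intoArray(docBuffer);
            size != 0;
            size = stream.intoArray(docBuffer)) {
          // Long.MIN_VALUE is the default for docs that have no value.
          values.longValues(size, docBuffer, valueBuffer, Long.MIN_VALUE);
          for (int i = 0; i < size; ++i) {
            long v = valueBuffer[i];
            if (v != Long.MIN_VALUE) {
              sum += v;
              count++;
            }
          }
        }
      }

      @Override
      public void collect(int doc) throws IOException {
        if (values.advanceExact(doc)) {
          sum += values.longValue();
          count += 1;
        }
      }
    };
  }

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE_NO_SCORES;
  }
}
```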
The benchmark runs a few queries with different collectors first, so that
`LeafCollector#collect` is polymorphic. I get the following numbers:
| Query | Notes | Latency without `DocIdStream#intoArray` (ms) | Latency with `DocIdStream#intoArray` (ms) |
| - | - | - | - |
| `MatchAllDocsQuery` | Uses `RangeDocIdStream` under the hood | 81 | 65 |
| `featureClass:(S P)` | Matches spots or cities (5M docs), uses `BitSetDocIdStream` under the hood | 54 | 43 |
Note that `NumericDocValues#advanceExact` and `NumericDocValues#longValue`
are not polymorphic in this benchmark since all segments use the same impl on
the `elevation` field. The difference would be bigger if they were.
My profiler suggests that retrieving doc values is the bottleneck when
computing the average on a `MatchAllDocsQuery`, while the bottleneck is a mix
of retrieving doc values and extracting set bits from the bit set when
computing the average on `featureClass:(S P)`.
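
To illustrate what "extracting set bits from the bit set" involves, here is a minimal, stdlib-only sketch of how a bit-set-backed stream could fill an `int[]` with doc IDs. The class `BitSetStreamSketch` and its layout are hypothetical and are not Lucene's `BitSetDocIdStream`; only the `intoArray(int[])` contract (return the number of doc IDs written, 0 once exhausted) mirrors the usage in the collector above.

```java
/** Hypothetical bit-set-backed stream, for illustration only. */
final class BitSetStreamSketch {
  // Bit i of words[w] is set when doc (w * 64 + i) matches.
  private final long[] words;
  private int wordIndex; // index of the word currently being drained
  private long current;  // bits of words[wordIndex] not yet returned

  BitSetStreamSketch(long[] words) {
    this.words = words;
    this.current = words.length == 0 ? 0L : words[0];
  }

  /**
   * Writes up to buffer.length doc IDs into buffer, in increasing order,
   * and returns how many were written. Returns 0 once the stream is
   * exhausted (or if buffer is empty).
   */
  int intoArray(int[] buffer) {
    int size = 0;
    while (size < buffer.length) {
      if (current == 0L) {
        // Current word is drained: move to the next one, or stop.
        if (wordIndex + 1 >= words.length) {
          break;
        }
        current = words[++wordIndex];
        continue;
      }
      int bit = Long.numberOfTrailingZeros(current);
      buffer[size++] = (wordIndex << 6) | bit; // wordIndex * 64 + bit
      current &= current - 1; // clear the lowest set bit
    }
    return size;
  }
}
```

Each call costs one bit-scan and one bit-clear per returned doc ID, which is consistent with set-bit extraction showing up in a profile when the matching set is large, though the real implementation may be organized differently.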