peteralfonsi opened a new issue, #15321:
URL: https://github.com/apache/lucene/issues/15321

   ### Description
   
   In [this blog post](https://blunders.io/posts/es-benchmark-4-inlining) and 
its related [PR](https://github.com/apache/lucene/pull/13149), we saw large 
latency improvements on BKD range queries by batching together ~512 per-doc 
virtual calls into 1 per-leaf virtual call. Today this is done in the various 
`readInts16/32/etc` methods in `DocIdsWriter`, and also in the 
`CELL_CROSSES_QUERY` case in `BKDReader.visitDocValues()` 
([link](https://github.com/apache/lucene/blob/fcbeefef9bb510f65fc1d320e3b03fdff492bcee/lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java#L860)).
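
   For reference, the existing bulk hook on `PointValues.IntersectVisitor` is shaped roughly like this (a paraphrase, not the exact Lucene source); callers that have a whole leaf's worth of matching doc IDs can hand them over in a single virtual call, and visitors can override it to avoid the per-doc loop entirely:

   ```
   // Paraphrase of the existing bulk-visit default; the fallback still makes one
   // virtual call per doc, but an overriding visitor gets the whole batch at once.
   default void visit(IntsRef ref) throws IOException {
     for (int i = ref.offset; i < ref.offset + ref.length; i++) {
       visit(ref.ints[i]);
     }
   }
   ```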
 
   
   For the `CELL_CROSSES_QUERY` case we currently call `visit(int docID, byte[] packedValue)` in a loop over each doc in the intersecting leaf. In flamegraphs I saw that resolving this goes through a virtual call stub (this example is from an OpenSearch `date_histogram` aggregation, which internally becomes essentially a bunch of range queries):
   
   <img width="2492" height="1177" alt="Image" 
src="https://github.com/user-attachments/assets/a5686053-688f-4214-a402-a357bed921e4";
 />
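
   For context, the hot loop is shaped roughly like this (a paraphrase of the `CELL_CROSSES_QUERY` path, not the exact `BKDReader` code; `readPackedValue` and `packedBytesLength` are stand-ins for the real decode logic and state):

   ```
   // Paraphrased shape only; not the actual BKDReader code.
   void visitCrossingLeaf(PointValues.IntersectVisitor visitor, int[] docIDs, int count)
       throws IOException {
     byte[] scratchPackedValue = new byte[packedBytesLength]; // reused for every point
     for (int i = 0; i < count; i++) {
       readPackedValue(i, scratchPackedValue); // decode the i-th point into the scratch buffer
       visitor.visit(docIDs[i], scratchPackedValue); // one virtual call per doc: the stub above
     }
   }
   ```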
   
   To check whether this had any actual impact, I did what the blog did: I ran my query of interest in isolation, so that it could be inlined, and then at the end of a longer benchmark, so that it would be a virtual call. I picked a narrow 2D range query, as I figured this would maximize the number of relevant docs that live in an intersecting cell versus a totally-inside cell.
   
   I also wrote a proof-of-concept change for this, adding a new default method to `PointValues.IntersectVisitor` similar to the existing one that takes an `IntsRef`:
   
   ```
   /**
    * Similar to {@link IntersectVisitor#visit(DocIdSetIterator, byte[])}, but the second argument
    * provides a supplier for a packedValue byte[] for each point. This method may offer a speedup
    * over repeated calls to {@link IntersectVisitor#visit(int, byte[])} because virtual calls are
    * reduced.
    */
   default void visit(IntsRef ref, LeafPackedValuesSupplier packedValuesSupplier) throws IOException {
     for (int i = ref.offset; i < ref.offset + ref.length; i++) {
       // TODO: for now the caller is responsible for ensuring the supplier can provide a
       // byte[] for each value in the IntsRef.
       boolean advanced = packedValuesSupplier.next();
       assert advanced; // keep the side effect outside the assert so -da doesn't skip it
       visit(ref.ints[i], packedValuesSupplier.packedValue());
     }
   }

   /** WIP interface for providing packedValues in bulk. */
   interface LeafPackedValuesSupplier {
     /** Move to the next packed value; return false if exhausted. */
     boolean next() throws IOException;

     /** Return the current packed value (scratch buffer, reused). */
     byte[] packedValue();
   }
   ```
   
   Individual call sites such as `BKDReader.visitCompressedDocValues()` could then define their own final `LeafPackedValuesSupplier` implementation based on whatever logic they previously used to populate `scratchPackedValue`. They'd only ever call `visit()` with arguments of that final class, so that each call to `next()` could be inlined. (Implementations would reuse the same byte[] each time, as before; see the sketch after this paragraph.)
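
   To make that concrete, here's an illustrative supplier (hypothetical, not my actual PoC; the class name, fields, constructor, and decode step are placeholders, and the real thing would live inside `BKDReader` and reuse its existing decode logic):

   ```
   import java.io.IOException;
   import org.apache.lucene.index.PointValues.IntersectVisitor.LeafPackedValuesSupplier;
   import org.apache.lucene.store.IndexInput;

   /** Hypothetical call-site supplier; the decode logic here is a placeholder. */
   final class ExampleLeafPackedValuesSupplier implements LeafPackedValuesSupplier {
     private final IndexInput in;             // source of this leaf's packed values
     private final byte[] scratchPackedValue; // reused for every point, as before
     private int remaining;                   // points left in this leaf

     ExampleLeafPackedValuesSupplier(IndexInput in, int count, int packedBytesLength) {
       this.in = in;
       this.remaining = count;
       this.scratchPackedValue = new byte[packedBytesLength];
     }

     @Override
     public boolean next() throws IOException {
       if (remaining == 0) {
         return false;
       }
       // Stand-in for the real per-point decode into the scratch buffer.
       in.readBytes(scratchPackedValue, 0, scratchPackedValue.length);
       remaining--;
       return true;
     }

     @Override
     public byte[] packedValue() {
       return scratchPackedValue;
     }
   }
   ```

   Because the supplier class is final and is the only implementation passed at that call site, the `next()` call inside the default `visit()` should stay monomorphic and inlineable.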
   
   Compared to running the range query at the end of a benchmark, I did see moderate speedups both for the inlined (beginning-of-benchmark) case and for my PoC implementation:
   
   | percentile | End-of-benchmark latency (ms) | Beginning-of-benchmark latency (ms) | End-of-benchmark latency with contender (ms) |
   |--|--|--|--|
   | p50 | 207.8 | 176.8 | 183.5 |
   | p90 | 210.9 | 181.1 | 186.1 |
   | p99 | 283.9 | 206.6 | 246.4 |
   
   (This is all on nyc_taxis with OpenSearch Benchmark; my workload branch is [here](https://github.com/peteralfonsi/opensearch-benchmark-workloads/tree/megamorphism-testing).)
   
   Note that the speedup only showed up on a very small 2-core `c5.large` instance type; it didn't show up at all on an 8-core `c5.2xl` instance for some reason. I'd guess the dispatch overhead matters more on smaller instances, but it's still strange.
   
   Curious what others think! I'm not attached to the specific implementation (it's quite clunky to have to define a new supplier for each call site), so if anyone has better ideas I'd appreciate them!
   
   

