nodece opened a new pull request, #26010: URL: https://github.com/apache/pulsar/pull/26010
### Motivation `TripleLongPriorityQueue` is the core data structure for Pulsar's delayed message delivery, storing `(deliverAt, ledgerId, entryId)` tuples in a min-heap backed by `SegmentedLongArray` (direct memory, 16MB segments). CPU profiling shows `siftUp`/`siftDown` consuming ~37% of hot thread time during delayed message processing. The original implementation has two performance issues: 1. **Redundant readLong calls**: `compare(tupleIdx1, tupleIdx2)` reads 6 longs per comparison, but the current node's values are already known from the prior iteration — they're re-read from `SegmentedLongArray` unnecessarily. 2. **Swap-based sift**: Each heap layer swap performs 6 readLong + 6 writeLong, but a hole-based approach only needs 3 writeLong per layer. Each `readLong` on `SegmentedLongArray` incurs 3 layers of indirection (division + `ArrayList.get()` + `ByteBuf.getLong()`), so reducing readLong calls directly translates to throughput gains. ### Modifications Rewrote `siftUp` and `siftDown` using the hole-based algorithm with register-cached values: - **`siftUp(tupleIdx, n1, n2, n3)`**: Passes the inserted element's values as parameters. Each layer only reads the parent's 3 longs (was: 6 longs for self + parent), then writes the parent down. The displaced value lives in local variables and is written once at the final position. - **`siftDown(tupleIdx, val0, val1, val2)`**: Pop reads the last tuple and passes it directly. Each layer reads left + right child's 3 longs each (was: 6 longs for self + left + right), promotes the smaller child, and writes the displaced value at the leaf. - **`add()`**: Writes tuple to array directly, passes cached values to `siftUp`. - **`pop()`**: Reads last tuple, passes to `siftDown(0, n1, n2, n3)`. - Removed `compare()`, `swap()`, `put()` methods. - `>>> 1` instead of `/ 2` for unsigned shift. Per-layer readLong reduction: | Operation | Before | After | |-----------|--------|-------| | siftUp per layer | 12 | 3 | | siftDown per layer | 24 | 6 | Benchmark results on `SegmentedLongArray`: | Dataset size | Before (AVG) | After (AVG) | Improvement | |---|---|---|---| | 50K | 38,690 us | 34,220 us | **11.5%** | | 500K | 525,517 us | 461,174 us | **12.2%** | | 2M | 2,136,791 us | 1,704,437 us | **20.2%** | ### Further optimization The remaining performance gap is dominated by `SegmentedLongArray`'s per-readLong indirection cost. Replacing `SegmentedLongArray` with a flat `long[]` eliminates the segment lookup entirely — each `readLong` becomes a single array load with hardware prefetcher support. Benchmark comparison shows `long[]`-backed implementation is **~3x faster** than `SegmentedLongArray` at the same heap operations. This could be considered as a follow-up if the direct memory trade-off (GC pressure vs. off-heap) is acceptable for the delayed delivery use case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
