RamakrishnaChilaka commented on PR #15110: URL: https://github.com/apache/lucene/pull/15110#issuecomment-3223985494
In this PR, we targeted optimising the scalar path by rearranging loops so that memory accesses are sequential rather than strided, which gave a measurable performance improvement (see [comment](https://github.com/apache/lucene/pull/15110#issue-3344186479); the JMH benchmark there only exercises the scalar path). I applied the same optimization to the vectorized path, but it didn't yield the expected improvement: decodeVector performance actually regressed compared to baseline (as also verified by Adrian). I hadn't run the microbenchmarks for it prior to this. I suspect this is because the mainline vectorized version loads from the memory segment only for the lane being processed; to achieve the same contiguous write pattern as the scalar version, we would need to load the entire vector from memory at once, and that may be what causes the regression. I have reproduced this behavior consistently. Please check decodeVector and decodeAndPrefixSumVector below.

```
contender-1
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.397 ± 1.220  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.414 ± 0.049  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  21.056 ± 0.179  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  37.523 ± 0.666  ops/us

contender-2
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.140 ± 0.370  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.415 ± 0.042  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  21.322 ± 0.048  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  37.132 ± 1.084  ops/us

baseline
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  39.971 ± 0.040  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  20.836 ± 0.039  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  24.911 ± 3.529  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  62.210 ± 1.306  ops/us

contender-3
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.313 ± 0.614  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.406 ± 0.037  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  17.476 ± 0.302  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  31.087 ± 1.201  ops/us
```

Summary of the perf changes:

| Benchmark                 | Baseline (ops/us) | Contender-1 (ops/us, %Δ) | Contender-2 (ops/us, %Δ) | Contender-3 (ops/us, %Δ) |
|---------------------------|------------------:|-------------------------:|-------------------------:|-------------------------:|
| decode                    | 39.971            | 46.397 (+16.1%)          | 46.140 (+15.4%)          | 46.313 (+15.9%)          |
| decodeAndPrefixSum        | 20.836            | 22.414 (+7.6%)           | 22.415 (+7.6%)           | 22.406 (+7.5%)           |
| decodeAndPrefixSumVector  | 24.911 ± 3.529    | 21.056 (−15.5%)          | 21.322 (−14.4%)          | 17.476 (−29.9%)          |
| decodeVector              | 62.210            | 37.523 (−39.7%)          | 37.132 (−40.3%)          | 31.087 (−50.0%)          |

I also ran the benchmarks across all bpvs with this PR (only the PostingDecodingUtil changes, with MemorySegmentDecodingUtil reverted). As expected, the scalar version shows consistent improvements, while the vectorized version remains mostly unchanged.
```
contender
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        2  thrpt   15  56.685 ± 0.596  ops/us
PostingIndexInputBenchmark.decode                        3  thrpt   15  50.723 ± 0.091  ops/us
PostingIndexInputBenchmark.decode                        4  thrpt   15  53.631 ± 2.356  ops/us
PostingIndexInputBenchmark.decode                        5  thrpt   15  45.242 ± 0.193  ops/us
PostingIndexInputBenchmark.decode                        6  thrpt   15  47.290 ± 0.061  ops/us
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.388 ± 0.637  ops/us
PostingIndexInputBenchmark.decode                        8  thrpt   15  85.133 ± 0.560  ops/us
PostingIndexInputBenchmark.decode                        9  thrpt   15  32.661 ± 0.499  ops/us
PostingIndexInputBenchmark.decode                       10  thrpt   15  37.306 ± 0.350  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            2  thrpt   15  32.170 ± 0.085  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            3  thrpt   15  30.107 ± 0.039  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            4  thrpt   15  25.955 ± 0.058  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            5  thrpt   15  22.113 ± 0.014  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            6  thrpt   15  21.684 ± 0.006  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.410 ± 0.074  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            8  thrpt   15  29.921 ± 0.047  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            9  thrpt   15  21.974 ± 0.034  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum           10  thrpt   15  22.544 ± 0.063  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      2  thrpt   15  35.892 ± 0.182  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      3  thrpt   15  34.986 ± 0.097  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      4  thrpt   15  33.011 ± 0.312  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      5  thrpt   15  32.070 ± 0.044  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      6  thrpt   15  30.406 ± 0.118  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  29.563 ± 0.322  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      8  thrpt   15  33.238 ± 0.159  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      9  thrpt   15  22.095 ± 0.058  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector     10  thrpt   15  26.039 ± 0.085  ops/us
PostingIndexInputBenchmark.decodeVector                  2  thrpt   15  85.634 ± 5.048  ops/us
PostingIndexInputBenchmark.decodeVector                  3  thrpt   15  74.220 ± 2.586  ops/us
PostingIndexInputBenchmark.decodeVector                  4  thrpt   15  91.992 ± 0.821  ops/us
PostingIndexInputBenchmark.decodeVector                  5  thrpt   15  62.661 ± 0.321  ops/us
PostingIndexInputBenchmark.decodeVector                  6  thrpt   15  69.889 ± 1.028  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  61.817 ± 0.806  ops/us
PostingIndexInputBenchmark.decodeVector                  8  thrpt   15  85.631 ± 0.214  ops/us
PostingIndexInputBenchmark.decodeVector                  9  thrpt   15  34.308 ± 0.251  ops/us
PostingIndexInputBenchmark.decodeVector                 10  thrpt   15  49.525 ± 0.180  ops/us

baseline
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        2  thrpt   15  56.139 ± 0.686  ops/us
PostingIndexInputBenchmark.decode                        3  thrpt   15  47.809 ± 0.121  ops/us
PostingIndexInputBenchmark.decode                        4  thrpt   15  56.922 ± 0.150  ops/us
PostingIndexInputBenchmark.decode                        5  thrpt   15  44.117 ± 0.125  ops/us
PostingIndexInputBenchmark.decode                        6  thrpt   15  41.455 ± 0.139  ops/us
PostingIndexInputBenchmark.decode                        7  thrpt   15  39.891 ± 0.182  ops/us
PostingIndexInputBenchmark.decode                        8  thrpt   15  85.428 ± 0.268  ops/us
PostingIndexInputBenchmark.decode                        9  thrpt   15  29.763 ± 0.259  ops/us
PostingIndexInputBenchmark.decode                       10  thrpt   15  30.598 ± 0.064  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            2  thrpt   15  31.135 ± 0.313  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            3  thrpt   15  27.499 ± 0.056  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            4  thrpt   15  25.458 ± 0.090  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            5  thrpt   15  23.068 ± 0.054  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            6  thrpt   15  22.404 ± 0.023  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  20.858 ± 0.008  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            8  thrpt   15  24.736 ± 0.091  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            9  thrpt   15  19.946 ± 0.073  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum           10  thrpt   15  20.841 ± 0.011  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      2  thrpt   15  33.515 ± 3.208  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      3  thrpt   15  33.492 ± 2.343  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      4  thrpt   15  33.055 ± 0.338  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      5  thrpt   15  31.688 ± 0.247  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      6  thrpt   15  28.542 ± 3.090  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  29.425 ± 0.039  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      8  thrpt   15  33.136 ± 0.037  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      9  thrpt   15  22.099 ± 0.026  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector     10  thrpt   15  25.956 ± 0.064  ops/us
PostingIndexInputBenchmark.decodeVector                  2  thrpt   15  85.568 ± 4.920  ops/us
PostingIndexInputBenchmark.decodeVector                  3  thrpt   15  74.447 ± 2.404  ops/us
PostingIndexInputBenchmark.decodeVector                  4  thrpt   15  87.720 ± 5.573  ops/us
PostingIndexInputBenchmark.decodeVector                  5  thrpt   15  62.400 ± 0.683  ops/us
PostingIndexInputBenchmark.decodeVector                  6  thrpt   15  67.609 ± 0.525  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  62.465 ± 0.409  ops/us
PostingIndexInputBenchmark.decodeVector                  8  thrpt   15  85.587 ± 0.141  ops/us
PostingIndexInputBenchmark.decodeVector                  9  thrpt   15  33.910 ± 0.332  ops/us
PostingIndexInputBenchmark.decodeVector                 10  thrpt   15  49.282 ± 0.400  ops/us
```

Summarising the perf results:

| bpv | decode (Baseline → Contender, %Δ) | decodeAndPrefixSum (Baseline → Contender, %Δ) | decodeAndPrefixSumVector (Baseline → Contender, %Δ) | decodeVector (Baseline → Contender, %Δ) |
|-----|----------------------------------|-----------------------------------------------|----------------------------------------------------|----------------------------------------|
| 2   | 56.139 → 56.685 (+1.0%)          | 31.135 → 32.170 (+3.3%)                       | 33.515 → 35.892 (+7.1%)                            | 85.568 → 85.634 (+0.08%)               |
| 3   | 47.809 → 50.723 (+6.1%)          | 27.499 → 30.107 (+9.5%)                       | 33.492 → 34.986 (+4.4%)                            | 74.447 → 74.220 (−0.3%)                |
| 4   | 56.922 → 53.631 (−5.8%)          | 25.458 → 25.955 (+2.0%)                       | 33.055 → 33.011 (−0.1%)                            | 87.720 → 91.992 (+4.9%)                |
| 5   | 44.117 → 45.242 (+2.6%)          | 23.068 → 22.113 (−4.1%)                       | 31.688 → 32.070 (+1.2%)                            | 62.400 → 62.661 (+0.4%)                |
| 6   | 41.455 → 47.290 (+14.1%)         | 22.404 → 21.684 (−3.2%)                       | 28.542 → 30.406 (+6.5%)                            | 67.609 → 69.889 (+3.4%)                |
| 7   | 39.891 → 46.388 (+16.3%)         | 20.858 → 22.410 (+7.4%)                       | 29.425 → 29.563 (+0.5%)                            | 62.465 → 61.817 (−1.0%)                |
| 8   | 85.428 → 85.133 (−0.3%)          | 24.736 → 29.921 (+21.0%)                      | 33.136 → 33.238 (+0.3%)                            | 85.587 → 85.631 (+0.05%)               |
| 9   | 29.763 → 32.661 (+9.7%)          | 19.946 → 21.974 (+10.2%)                      | 22.099 → 22.095 (−0.02%)                           | 33.910 → 34.308 (+1.2%)                |
| 10  | 30.598 → 37.306 (+21.9%)         | 20.841 → 22.544 (+8.2%)                       | 25.956 → 26.039 (+0.3%)                            | 49.282 → 49.525 (+0.5%)                |

#### TLDR; Key takeaways:

* Scalar path optimizations → measurable speedups (a minimal sketch of the loop rearrangement is at the end of this comment).
* Vectorized path → no change, because the memory access pattern hasn't changed.
* Regression in decodeVector and decodeAndPrefixSumVector is likely due to lane-wise memory loads vs contiguous vector loads.
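For concreteness, here is a minimal sketch of what "sequential rather than strided" means on the scalar side. It is illustrative only, not the code in this PR: the class/method names (SplitOrderSketch, splitStrided, splitSequential) and the parameters (count, b, bShift, dec, bMask, c, cMask) are simplified stand-ins loosely modeled on PostingDecodingUtil, and the reads from the memory segment are replaced by a plain int[] so the example is self-contained.

```java
import java.util.Arrays;

// Illustrative only -- not the PR's actual code. All names are hypothetical,
// loosely modeled on PostingDecodingUtil; input comes from an int[] instead of
// a memory segment so the sketch compiles and runs on its own.
public final class SplitOrderSketch {

  /** Baseline-style order: for each input value, emit all of its bit-slices.
   *  Consecutive writes in the inner loop are {@code count} apart (strided). */
  static void splitStrided(int count, int[] b, int bShift, int dec, int bMask, int[] c, int cMask) {
    int maxIter = (bShift - 1) / dec;
    for (int i = 0; i < count; ++i) {
      int l = c[i];
      b[i] = (l >>> bShift) & bMask;
      for (int j = 1; j <= maxIter; ++j) {
        b[count * j + i] = (l >>> (bShift - j * dec)) & bMask; // stride of `count`
      }
      c[i] = l & cMask;
    }
  }

  /** Rearranged order (the idea behind the scalar change): loop over the shift
   *  first, then over the inputs, so every write to {@code b} is sequential. */
  static void splitSequential(int count, int[] b, int bShift, int dec, int bMask, int[] c, int cMask) {
    int maxIter = (bShift - 1) / dec;
    int o = 0;
    for (int j = 0; j <= maxIter; ++j) {
      int shift = bShift - j * dec;
      for (int i = 0; i < count; ++i) {
        b[o++] = (c[i] >>> shift) & bMask; // sequential writes to b
      }
    }
    for (int i = 0; i < count; ++i) {
      c[i] &= cMask; // keep the remaining low bits, as in the strided version
    }
  }

  public static void main(String[] args) {
    // Tiny demo with made-up numbers: 4 inputs, two 8-bit slices per input.
    int count = 4;
    int[] c = {0x123456, 0xABCDEF, 0x0F0F0F, 0x00FF00};
    int[] b1 = new int[8], b2 = new int[8];
    int[] c1 = c.clone(), c2 = c.clone();
    splitStrided(count, b1, 16, 8, 0xFF, c1, 0xFF);
    splitSequential(count, b2, 16, 8, 0xFF, c2, 0xFF);
    System.out.println(Arrays.equals(b1, b2) && Arrays.equals(c1, c2)); // true
  }
}
```

Both variants produce identical contents for b and c; only the order of the writes to b differs, which is what appears to matter for the scalar path. The vector path is deliberately not sketched here, because (per the hypothesis above) matching this write order there would require contiguous whole-vector loads from the memory segment instead of lane-wise loads.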
