RamakrishnaChilaka commented on PR #15110: URL: https://github.com/apache/lucene/pull/15110#issuecomment-3223985494
In this PR, we targeted optimising the scalar path by rearranging loops so that memory accesses are sequential rather than strided, which gave a measurable performance improvement (see [comment](https://github.com/apache/lucene/pull/15110#issue-3344186479); the JMH benchmark there only exercises the scalar path). I applied the same optimization to the vectorized path, but it didn't yield the expected improvement: decodeVector performance actually regressed compared to baseline (as also verified by Adrian). I hadn't run the microbenchmarks for it prior to this. I suspect this is because the mainline vectorized version loads from the memory segment only for the lane being processed; to achieve the same contiguous write pattern as the scalar version, we would need to load the entire vector from memory at once, and that may be what causes the regression. I have reproduced this behavior consistently. Please check decodeVector and decodeAndPrefixSumVector below.

```
contender-1
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.397 ± 1.220  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.414 ± 0.049  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  21.056 ± 0.179  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  37.523 ± 0.666  ops/us

contender-2
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.140 ± 0.370  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.415 ± 0.042  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  21.322 ± 0.048  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  37.132 ± 1.084  ops/us

baseline
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  39.971 ± 0.040  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  20.836 ± 0.039  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  24.911 ± 3.529  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  62.210 ± 1.306  ops/us

contender-3
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.313 ± 0.614  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.406 ± 0.037  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  17.476 ± 0.302  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  31.087 ± 1.201  ops/us
```

Summary of the perf changes:

| Benchmark                 | Baseline (ops/us) | Contender-1 (ops/us, %Δ) | Contender-2 (ops/us, %Δ) | Contender-3 (ops/us, %Δ) |
|---------------------------|------------------:|-------------------------:|-------------------------:|-------------------------:|
| decode                    | 39.971            | 46.397 (+16.1%)          | 46.140 (+15.4%)          | 46.313 (+15.9%)          |
| decodeAndPrefixSum        | 20.836            | 22.414 (+7.6%)           | 22.415 (+7.6%)           | 22.406 (+7.5%)           |
| decodeAndPrefixSumVector  | 24.911 ± 3.529    | 21.056 (−15.5%)          | 21.322 (−14.4%)          | 17.476 (−29.9%)          |
| decodeVector              | 62.210            | 37.523 (−39.7%)          | 37.132 (−40.3%)          | 31.087 (−50.0%)          |

I also ran the benchmarks across all bpvs with this PR (only the PostingDecodingUtil changes, with MemorySegmentDecodingUtil reverted). As expected, the scalar version shows consistent improvements, while the vectorized version remains mostly unchanged.
```
contender
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        2  thrpt   15  56.685 ± 0.596  ops/us
PostingIndexInputBenchmark.decode                        3  thrpt   15  50.723 ± 0.091  ops/us
PostingIndexInputBenchmark.decode                        4  thrpt   15  53.631 ± 2.356  ops/us
PostingIndexInputBenchmark.decode                        5  thrpt   15  45.242 ± 0.193  ops/us
PostingIndexInputBenchmark.decode                        6  thrpt   15  47.290 ± 0.061  ops/us
PostingIndexInputBenchmark.decode                        7  thrpt   15  46.388 ± 0.637  ops/us
PostingIndexInputBenchmark.decode                        8  thrpt   15  85.133 ± 0.560  ops/us
PostingIndexInputBenchmark.decode                        9  thrpt   15  32.661 ± 0.499  ops/us
PostingIndexInputBenchmark.decode                       10  thrpt   15  37.306 ± 0.350  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            2  thrpt   15  32.170 ± 0.085  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            3  thrpt   15  30.107 ± 0.039  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            4  thrpt   15  25.955 ± 0.058  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            5  thrpt   15  22.113 ± 0.014  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            6  thrpt   15  21.684 ± 0.006  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  22.410 ± 0.074  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            8  thrpt   15  29.921 ± 0.047  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            9  thrpt   15  21.974 ± 0.034  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum           10  thrpt   15  22.544 ± 0.063  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      2  thrpt   15  35.892 ± 0.182  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      3  thrpt   15  34.986 ± 0.097  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      4  thrpt   15  33.011 ± 0.312  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      5  thrpt   15  32.070 ± 0.044  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      6  thrpt   15  30.406 ± 0.118  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  29.563 ± 0.322  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      8  thrpt   15  33.238 ± 0.159  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      9  thrpt   15  22.095 ± 0.058  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector     10  thrpt   15  26.039 ± 0.085  ops/us
PostingIndexInputBenchmark.decodeVector                  2  thrpt   15  85.634 ± 5.048  ops/us
PostingIndexInputBenchmark.decodeVector                  3  thrpt   15  74.220 ± 2.586  ops/us
PostingIndexInputBenchmark.decodeVector                  4  thrpt   15  91.992 ± 0.821  ops/us
PostingIndexInputBenchmark.decodeVector                  5  thrpt   15  62.661 ± 0.321  ops/us
PostingIndexInputBenchmark.decodeVector                  6  thrpt   15  69.889 ± 1.028  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  61.817 ± 0.806  ops/us
PostingIndexInputBenchmark.decodeVector                  8  thrpt   15  85.631 ± 0.214  ops/us
PostingIndexInputBenchmark.decodeVector                  9  thrpt   15  34.308 ± 0.251  ops/us
PostingIndexInputBenchmark.decodeVector                 10  thrpt   15  49.525 ± 0.180  ops/us

baseline
Benchmark                                            (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode                        2  thrpt   15  56.139 ± 0.686  ops/us
PostingIndexInputBenchmark.decode                        3  thrpt   15  47.809 ± 0.121  ops/us
PostingIndexInputBenchmark.decode                        4  thrpt   15  56.922 ± 0.150  ops/us
PostingIndexInputBenchmark.decode                        5  thrpt   15  44.117 ± 0.125  ops/us
PostingIndexInputBenchmark.decode                        6  thrpt   15  41.455 ± 0.139  ops/us
PostingIndexInputBenchmark.decode                        7  thrpt   15  39.891 ± 0.182  ops/us
PostingIndexInputBenchmark.decode                        8  thrpt   15  85.428 ± 0.268  ops/us
PostingIndexInputBenchmark.decode                        9  thrpt   15  29.763 ± 0.259  ops/us
PostingIndexInputBenchmark.decode                       10  thrpt   15  30.598 ± 0.064  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            2  thrpt   15  31.135 ± 0.313  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            3  thrpt   15  27.499 ± 0.056  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            4  thrpt   15  25.458 ± 0.090  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            5  thrpt   15  23.068 ± 0.054  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            6  thrpt   15  22.404 ± 0.023  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            7  thrpt   15  20.858 ± 0.008  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            8  thrpt   15  24.736 ± 0.091  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum            9  thrpt   15  19.946 ± 0.073  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSum           10  thrpt   15  20.841 ± 0.011  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      2  thrpt   15  33.515 ± 3.208  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      3  thrpt   15  33.492 ± 2.343  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      4  thrpt   15  33.055 ± 0.338  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      5  thrpt   15  31.688 ± 0.247  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      6  thrpt   15  28.542 ± 3.090  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      7  thrpt   15  29.425 ± 0.039  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      8  thrpt   15  33.136 ± 0.037  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector      9  thrpt   15  22.099 ± 0.026  ops/us
PostingIndexInputBenchmark.decodeAndPrefixSumVector     10  thrpt   15  25.956 ± 0.064  ops/us
PostingIndexInputBenchmark.decodeVector                  2  thrpt   15  85.568 ± 4.920  ops/us
PostingIndexInputBenchmark.decodeVector                  3  thrpt   15  74.447 ± 2.404  ops/us
PostingIndexInputBenchmark.decodeVector                  4  thrpt   15  87.720 ± 5.573  ops/us
PostingIndexInputBenchmark.decodeVector                  5  thrpt   15  62.400 ± 0.683  ops/us
PostingIndexInputBenchmark.decodeVector                  6  thrpt   15  67.609 ± 0.525  ops/us
PostingIndexInputBenchmark.decodeVector                  7  thrpt   15  62.465 ± 0.409  ops/us
PostingIndexInputBenchmark.decodeVector                  8  thrpt   15  85.587 ± 0.141  ops/us
PostingIndexInputBenchmark.decodeVector                  9  thrpt   15  33.910 ± 0.332  ops/us
PostingIndexInputBenchmark.decodeVector                 10  thrpt   15  49.282 ± 0.400  ops/us
```

Summarising the perf results:

| bpv | decode (Baseline → Contender, %Δ) | decodeAndPrefixSum (Baseline → Contender, %Δ) | decodeAndPrefixSumVector (Baseline → Contender, %Δ) | decodeVector (Baseline → Contender, %Δ) |
|-----|----------------------------------|-----------------------------------------------|----------------------------------------------------|----------------------------------------|
| 2   | 56.139 → 56.685 (+1.0%)          | 31.135 → 32.170 (+3.3%)                       | 33.515 → 35.892 (+7.1%)                            | 85.568 → 85.634 (+0.08%)               |
| 3   | 47.809 → 50.723 (+6.1%)          | 27.499 → 30.107 (+9.5%)                       | 33.492 → 34.986 (+4.4%)                            | 74.447 → 74.220 (−0.3%)                |
| 4   | 56.922 → 53.631 (−5.8%)          | 25.458 → 25.955 (+2.0%)                       | 33.055 → 33.011 (−0.1%)                            | 87.720 → 91.992 (+4.9%)                |
| 5   | 44.117 → 45.242 (+2.6%)          | 23.068 → 22.113 (−4.1%)                       | 31.688 → 32.070 (+1.2%)                            | 62.400 → 62.661 (+0.4%)                |
| 6   | 41.455 → 47.290 (+14.1%)         | 22.404 → 21.684 (−3.2%)                       | 28.542 → 30.406 (+6.5%)                            | 67.609 → 69.889 (+3.4%)                |
| 7   | 39.891 → 46.388 (+16.3%)         | 20.858 → 22.410 (+7.4%)                       | 29.425 → 29.563 (+0.5%)                            | 62.465 → 61.817 (−1.0%)                |
| 8   | 85.428 → 85.133 (−0.3%)          | 24.736 → 29.921 (+21.0%)                      | 33.136 → 33.238 (+0.3%)                            | 85.587 → 85.631 (+0.05%)               |
| 9   | 29.763 → 32.661 (+9.7%)          | 19.946 → 21.974 (+10.2%)                      | 22.099 → 22.095 (−0.02%)                           | 33.910 → 34.308 (+1.2%)                |
| 10  | 30.598 → 37.306 (+21.9%)         | 20.841 → 22.544 (+8.2%)                       | 25.956 → 26.039 (+0.3%)                            | 49.282 → 49.525 (+0.5%)                |

#### TLDR; Key takeaways:

* Scalar path optimizations → measurable speedups (a minimal sketch of the loop rearrangement is at the end of this comment).
* Vectorized path → no change, because the memory access pattern hasn't changed.
* Regression in decodeVector and decodeAndPrefixSumVector is likely due to lane-wise memory loads vs contiguous vector loads.
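For concreteness, here is a minimal sketch of what "sequential rather than strided" means on the scalar side. It is illustrative only, not the code in this PR: the class/method names (SplitOrderSketch, splitStrided, splitSequential) and the parameters (count, b, bShift, dec, bMask, c, cMask) are simplified stand-ins loosely modeled on PostingDecodingUtil, and the reads from the memory segment are replaced by a plain int[] so the example is self-contained.

```java
import java.util.Arrays;

// Illustrative only -- not the PR's actual code. All names are hypothetical,
// loosely modeled on PostingDecodingUtil; input comes from an int[] instead of
// a memory segment so the sketch compiles and runs on its own.
public final class SplitOrderSketch {

  /** Baseline-style order: for each input value, emit all of its bit-slices.
   *  Consecutive writes in the inner loop are {@code count} apart (strided). */
  static void splitStrided(int count, int[] b, int bShift, int dec, int bMask, int[] c, int cMask) {
    int maxIter = (bShift - 1) / dec;
    for (int i = 0; i < count; ++i) {
      int l = c[i];
      b[i] = (l >>> bShift) & bMask;
      for (int j = 1; j <= maxIter; ++j) {
        b[count * j + i] = (l >>> (bShift - j * dec)) & bMask; // stride of `count`
      }
      c[i] = l & cMask;
    }
  }

  /** Rearranged order (the idea behind the scalar change): loop over the shift
   *  first, then over the inputs, so every write to {@code b} is sequential. */
  static void splitSequential(int count, int[] b, int bShift, int dec, int bMask, int[] c, int cMask) {
    int maxIter = (bShift - 1) / dec;
    int o = 0;
    for (int j = 0; j <= maxIter; ++j) {
      int shift = bShift - j * dec;
      for (int i = 0; i < count; ++i) {
        b[o++] = (c[i] >>> shift) & bMask; // sequential writes to b
      }
    }
    for (int i = 0; i < count; ++i) {
      c[i] &= cMask; // keep the remaining low bits, as in the strided version
    }
  }

  public static void main(String[] args) {
    // Tiny demo with made-up numbers: 4 inputs, two 8-bit slices per input.
    int count = 4;
    int[] c = {0x123456, 0xABCDEF, 0x0F0F0F, 0x00FF00};
    int[] b1 = new int[8], b2 = new int[8];
    int[] c1 = c.clone(), c2 = c.clone();
    splitStrided(count, b1, 16, 8, 0xFF, c1, 0xFF);
    splitSequential(count, b2, 16, 8, 0xFF, c2, 0xFF);
    System.out.println(Arrays.equals(b1, b2) && Arrays.equals(c1, c2)); // true
  }
}
```

Both variants produce identical contents for b and c; only the order of the writes to b differs, which is what appears to matter for the scalar path. The vector path is deliberately not sketched here, because (per the hypothesis above) matching this write order there would require contiguous whole-vector loads from the memory segment instead of lane-wise loads.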
