gf2121 commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2651208564

   Thanks for the feedback! I implemented the fixed-size inner loop and printed 
the assembly for all benchmarks. 
[perf_asm.log](https://github.com/user-attachments/files/18752147/perf_asm.log)
   
   * With profiling enabled, `#current countVariable=true` and `#current 
countVariable=false` run at the same (slow) speed. It seems that profiling 
prevented some optimization.
   
   * According to the assembly, `#current bpv=16` is not auto-vectorized. 
`#current bpv=24` is vectorized in the shift loop, but not in the remainder 
loop.
   
   * According to the assembly, the inner loop is auto-vectorized, but it is 
still slower than the Vector API version.
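
For readers without the PR open, here is a minimal sketch of what a variable trip count versus a fixed-size inner loop can look like for a 16-bit decode. This is not the code from the PR: the class name, the `CHUNK` size, and the packing layout (high halves first, low halves second, even `count`) are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Illustrative sketch only; names and layout are hypothetical, not taken from the PR. */
final class Bpv16DecodeSketch {

  /** Hypothetical constant trip count for the inner loop. */
  static final int CHUNK = 64;

  /** Variable trip count: the loop bound is only known at run time. Assumes count is even. */
  static void decodeVariable(int[] packed, int count, int[] out) {
    int half = count >> 1;
    for (int i = 0; i < half; i++) {
      int v = packed[i];
      out[i] = v >>> 16;          // high 16 bits go to the first half
      out[half + i] = v & 0xFFFF; // low 16 bits go to the second half
    }
  }

  /** Same decode, but the hot loop has a compile-time constant trip count. */
  static void decodeInnerLoop(int[] packed, int count, int[] out) {
    int half = count >> 1;
    int i = 0;
    // Full chunks: the inner loop always runs exactly CHUNK iterations.
    for (; i + CHUNK <= half; i += CHUNK) {
      for (int j = 0; j < CHUNK; j++) {
        int v = packed[i + j];
        out[i + j] = v >>> 16;
        out[half + i + j] = v & 0xFFFF;
      }
    }
    // Scalar tail for whatever does not fill a whole chunk.
    for (; i < half; i++) {
      int v = packed[i];
      out[i] = v >>> 16;
      out[half + i] = v & 0xFFFF;
    }
  }

  public static void main(String[] args) {
    int[] packed = {(1 << 16) | 4, (2 << 16) | 5, (3 << 16) | 6};
    int[] out = new int[6];
    decodeInnerLoop(packed, 6, out);
    System.out.println(Arrays.toString(out)); // [1, 2, 3, 4, 5, 6]
  }
}
```

Both methods produce the same output; whether C2 vectorizes either one depends on the JVM version and the CPU, so the assembly still has to be checked per platform.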
   
   **MAC M2**
   ```
   Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
   BKDCodecBenchmark.current           16             true  thrpt    5  103.490 ±  6.785  ops/ms
   BKDCodecBenchmark.current           16            false  thrpt    5  212.488 ±  5.383  ops/ms
   BKDCodecBenchmark.current           24             true  thrpt    5   91.203 ±  1.023  ops/ms
   BKDCodecBenchmark.current           24            false  thrpt    5  149.742 ±  1.953  ops/ms
   BKDCodecBenchmark.currentVector     16             true  thrpt    5  213.162 ±  1.598  ops/ms
   BKDCodecBenchmark.currentVector     16            false  thrpt    5  216.529 ±  2.518  ops/ms
   BKDCodecBenchmark.currentVector     24             true  thrpt    5  153.970 ±  1.101  ops/ms
   BKDCodecBenchmark.currentVector     24            false  thrpt    5  140.103 ±  3.001  ops/ms
   BKDCodecBenchmark.innerLoop         16             true  thrpt    5  129.281 ±  0.471  ops/ms
   BKDCodecBenchmark.innerLoop         16            false  thrpt    5  131.083 ±  8.775  ops/ms
   BKDCodecBenchmark.innerLoop         24             true  thrpt    5   99.597 ±  2.850  ops/ms
   BKDCodecBenchmark.innerLoop         24            false  thrpt    5   96.235 ± 14.875  ops/ms
   BKDCodecBenchmark.legacy            16             true  thrpt    5  104.314 ±  0.557  ops/ms
   BKDCodecBenchmark.legacy            16            false  thrpt    5  202.175 ± 10.863  ops/ms
   BKDCodecBenchmark.legacy            24             true  thrpt    5   86.016 ±  1.315  ops/ms
   BKDCodecBenchmark.legacy            24            false  thrpt    5   85.609 ±  5.733  ops/ms
   ```
   
   **Linux X86 AVX512 profiling disabled**
   ```
   Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
   BKDCodecBenchmark.current           16             true  thrpt    5   41.138 ±  1.770  ops/ms
   BKDCodecBenchmark.current           16            false  thrpt    5  142.277 ±  0.943  ops/ms
   BKDCodecBenchmark.current           24             true  thrpt    5   43.104 ±  0.066  ops/ms
   BKDCodecBenchmark.current           24            false  thrpt    5   42.760 ±  0.496  ops/ms
   BKDCodecBenchmark.currentVector     16             true  thrpt    5   86.565 ±  0.904  ops/ms
   BKDCodecBenchmark.currentVector     16            false  thrpt    5   86.624 ±  0.395  ops/ms
   BKDCodecBenchmark.currentVector     24             true  thrpt    5   80.064 ±  2.604  ops/ms
   BKDCodecBenchmark.currentVector     24            false  thrpt    5   76.638 ± 18.692  ops/ms
   BKDCodecBenchmark.innerLoop         16             true  thrpt    5   43.810 ±  1.096  ops/ms
   BKDCodecBenchmark.innerLoop         16            false  thrpt    5   42.485 ±  0.073  ops/ms
   BKDCodecBenchmark.innerLoop         24             true  thrpt    5   37.255 ±  0.994  ops/ms
   BKDCodecBenchmark.innerLoop         24            false  thrpt    5   37.243 ±  0.593  ops/ms
   BKDCodecBenchmark.legacy            16             true  thrpt    5   41.415 ±  0.079  ops/ms
   BKDCodecBenchmark.legacy            16            false  thrpt    5  145.713 ±  0.381  ops/ms
   BKDCodecBenchmark.legacy            24             true  thrpt    5   27.758 ±  4.210  ops/ms
   BKDCodecBenchmark.legacy            24            false  thrpt    5   28.519 ±  1.839  ops/ms
   ```
   
   **Linux X86 AVX512 profiling enabled**
   ```
   Benchmark                            (bpv)  (countVariable)   Mode  Cnt   Score   Error   Units
   BKDCodecBenchmark.current               16             true  thrpt    5  29.878 ± 0.130  ops/ms
   BKDCodecBenchmark.current:asm           16             true  thrpt          NaN             ---
   BKDCodecBenchmark.current               16            false  thrpt    5  29.314 ± 0.229  ops/ms
   BKDCodecBenchmark.current:asm           16            false  thrpt          NaN             ---
   BKDCodecBenchmark.current               24             true  thrpt    5  34.874 ± 0.320  ops/ms
   BKDCodecBenchmark.current:asm           24             true  thrpt          NaN             ---
   BKDCodecBenchmark.current               24            false  thrpt    5  33.987 ± 0.055  ops/ms
   BKDCodecBenchmark.current:asm           24            false  thrpt          NaN             ---
   BKDCodecBenchmark.currentVector         16             true  thrpt    5  79.717 ± 5.983  ops/ms
   BKDCodecBenchmark.currentVector:asm     16             true  thrpt          NaN             ---
   BKDCodecBenchmark.currentVector         16            false  thrpt    5  81.924 ± 3.799  ops/ms
   BKDCodecBenchmark.currentVector:asm     16            false  thrpt          NaN             ---
   BKDCodecBenchmark.currentVector         24             true  thrpt    5  65.615 ± 8.901  ops/ms
   BKDCodecBenchmark.currentVector:asm     24             true  thrpt          NaN             ---
   BKDCodecBenchmark.currentVector         24            false  thrpt    5  74.759 ± 2.173  ops/ms
   BKDCodecBenchmark.currentVector:asm     24            false  thrpt          NaN             ---
   BKDCodecBenchmark.innerLoop             16             true  thrpt    5  40.869 ± 3.407  ops/ms
   BKDCodecBenchmark.innerLoop:asm         16             true  thrpt          NaN             ---
   BKDCodecBenchmark.innerLoop             16            false  thrpt    5  41.825 ± 1.644  ops/ms
   BKDCodecBenchmark.innerLoop:asm         16            false  thrpt          NaN             ---
   BKDCodecBenchmark.innerLoop             24             true  thrpt    5  37.251 ± 3.447  ops/ms
   BKDCodecBenchmark.innerLoop:asm         24             true  thrpt          NaN             ---
   BKDCodecBenchmark.innerLoop             24            false  thrpt    5  37.419 ± 1.238  ops/ms
   BKDCodecBenchmark.innerLoop:asm         24            false  thrpt          NaN             ---
   BKDCodecBenchmark.legacy                16             true  thrpt    5  28.477 ± 3.747  ops/ms
   BKDCodecBenchmark.legacy:asm            16             true  thrpt          NaN             ---
   BKDCodecBenchmark.legacy                16            false  thrpt    5  29.838 ± 0.163  ops/ms
   BKDCodecBenchmark.legacy:asm            16            false  thrpt          NaN             ---
   BKDCodecBenchmark.legacy                24             true  thrpt    5  28.295 ± 1.224  ops/ms
   BKDCodecBenchmark.legacy:asm            24             true  thrpt          NaN             ---
   BKDCodecBenchmark.legacy                24            false  thrpt    5  27.915 ± 0.911  ops/ms
   BKDCodecBenchmark.legacy:asm            24            false  thrpt          NaN             ---
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

