gf2121 commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2651208564
Thanks for the feedback! I implemented the fixed-size inner loop and printed the assembly for all variants: [perf_asm.log](https://github.com/user-attachments/files/18752147/perf_asm.log)

* With profiling enabled, `#current countVariable=true` and `#current countVariable=false` run at the same (slow) speed. It seems that profiling prevented some optimization.
* According to the assembly, `#current bpv=16` does not get auto-vectorized. `#current bpv=24` gets vectorized in the shift loop, but not in the remainder loop.
* According to the assembly, the inner loop gets auto-vectorized, but it is slower than the vector API.

**MAC M2**

```
Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5  103.490 ±  6.785  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  212.488 ±  5.383  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   91.203 ±  1.023  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5  149.742 ±  1.953  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5  213.162 ±  1.598  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5  216.529 ±  2.518  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5  153.970 ±  1.101  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5  140.103 ±  3.001  ops/ms
BKDCodecBenchmark.innerLoop         16             true  thrpt    5  129.281 ±  0.471  ops/ms
BKDCodecBenchmark.innerLoop         16            false  thrpt    5  131.083 ±  8.775  ops/ms
BKDCodecBenchmark.innerLoop         24             true  thrpt    5   99.597 ±  2.850  ops/ms
BKDCodecBenchmark.innerLoop         24            false  thrpt    5   96.235 ± 14.875  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5  104.314 ±  0.557  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  202.175 ± 10.863  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   86.016 ±  1.315  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   85.609 ±  5.733  ops/ms
```

**Linux X86 AVX512 profiling disabled**

```
Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5   41.138 ±  1.770  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  142.277 ±  0.943  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   43.104 ±  0.066  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5   42.760 ±  0.496  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5   86.565 ±  0.904  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5   86.624 ±  0.395  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5   80.064 ±  2.604  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5   76.638 ± 18.692  ops/ms
BKDCodecBenchmark.innerLoop         16             true  thrpt    5   43.810 ±  1.096  ops/ms
BKDCodecBenchmark.innerLoop         16            false  thrpt    5   42.485 ±  0.073  ops/ms
BKDCodecBenchmark.innerLoop         24             true  thrpt    5   37.255 ±  0.994  ops/ms
BKDCodecBenchmark.innerLoop         24            false  thrpt    5   37.243 ±  0.593  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5   41.415 ±  0.079  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  145.713 ±  0.381  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   27.758 ±  4.210  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   28.519 ±  1.839  ops/ms
```

**Linux X86 AVX512 profiling enabled**

```
Benchmark                            (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current               16             true  thrpt    5   29.878 ±  0.130  ops/ms
BKDCodecBenchmark.current:asm           16             true  thrpt            NaN             ---
BKDCodecBenchmark.current               16            false  thrpt    5   29.314 ±  0.229  ops/ms
BKDCodecBenchmark.current:asm           16            false  thrpt            NaN             ---
BKDCodecBenchmark.current               24             true  thrpt    5   34.874 ±  0.320  ops/ms
BKDCodecBenchmark.current:asm           24             true  thrpt            NaN             ---
BKDCodecBenchmark.current               24            false  thrpt    5   33.987 ±  0.055  ops/ms
BKDCodecBenchmark.current:asm           24            false  thrpt            NaN             ---
BKDCodecBenchmark.currentVector         16             true  thrpt    5   79.717 ±  5.983  ops/ms
BKDCodecBenchmark.currentVector:asm     16             true  thrpt            NaN             ---
BKDCodecBenchmark.currentVector         16            false  thrpt    5   81.924 ±  3.799  ops/ms
BKDCodecBenchmark.currentVector:asm     16            false  thrpt            NaN             ---
BKDCodecBenchmark.currentVector         24             true  thrpt    5   65.615 ±  8.901  ops/ms
BKDCodecBenchmark.currentVector:asm     24             true  thrpt            NaN             ---
BKDCodecBenchmark.currentVector         24            false  thrpt    5   74.759 ±  2.173  ops/ms
BKDCodecBenchmark.currentVector:asm     24            false  thrpt            NaN             ---
BKDCodecBenchmark.innerLoop             16             true  thrpt    5   40.869 ±  3.407  ops/ms
BKDCodecBenchmark.innerLoop:asm         16             true  thrpt            NaN             ---
BKDCodecBenchmark.innerLoop             16            false  thrpt    5   41.825 ±  1.644  ops/ms
BKDCodecBenchmark.innerLoop:asm         16            false  thrpt            NaN             ---
BKDCodecBenchmark.innerLoop             24             true  thrpt    5   37.251 ±  3.447  ops/ms
BKDCodecBenchmark.innerLoop:asm         24             true  thrpt            NaN             ---
BKDCodecBenchmark.innerLoop             24            false  thrpt    5   37.419 ±  1.238  ops/ms
BKDCodecBenchmark.innerLoop:asm         24            false  thrpt            NaN             ---
BKDCodecBenchmark.legacy                16             true  thrpt    5   28.477 ±  3.747  ops/ms
BKDCodecBenchmark.legacy:asm            16             true  thrpt            NaN             ---
BKDCodecBenchmark.legacy                16            false  thrpt    5   29.838 ±  0.163  ops/ms
BKDCodecBenchmark.legacy:asm            16            false  thrpt            NaN             ---
BKDCodecBenchmark.legacy                24             true  thrpt    5   28.295 ±  1.224  ops/ms
BKDCodecBenchmark.legacy:asm            24             true  thrpt            NaN             ---
BKDCodecBenchmark.legacy                24            false  thrpt    5   27.915 ±  0.911  ops/ms
BKDCodecBenchmark.legacy:asm            24            false  thrpt            NaN             ---
```
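For readers who have not opened the PR, here is a rough sketch of what "fixed-size inner loop" means for the `bpv=16` case. It is not the benchmark code: the class name `Unpack16Sketch`, the block size of 128, and the packing layout (two 16-bit values per `int`, high halves written first) are all assumptions made for illustration. The idea is that a constant trip count gives the JIT a loop it may be able to auto-vectorize, while a short scalar loop handles the remainder.

```java
import java.util.Arrays;

// Hypothetical illustration only; names, layout and block size are assumptions.
final class Unpack16Sketch {
  // Assumed constant trip count for the inner loop.
  static final int BLOCK = 128;

  // Variable trip count: the loop bound depends on the runtime value of count.
  // Assumes count is even and packed holds at least count / 2 ints.
  static void unpackVariable(int[] packed, int count, int[] out) {
    int half = count >>> 1;
    for (int i = 0; i < half; i++) {
      int v = packed[i];
      out[i] = v >>> 16;          // high 16 bits
      out[half + i] = v & 0xFFFF; // low 16 bits
    }
  }

  // Fixed-size inner loop: full blocks of BLOCK ints run with a constant
  // bound, then a scalar loop processes whatever is left over.
  static void unpackInnerLoop(int[] packed, int count, int[] out) {
    int half = count >>> 1;
    int i = 0;
    for (; i + BLOCK <= half; i += BLOCK) {
      for (int j = 0; j < BLOCK; j++) {
        int v = packed[i + j];
        out[i + j] = v >>> 16;
        out[half + i + j] = v & 0xFFFF;
      }
    }
    for (; i < half; i++) { // remainder, fewer than BLOCK ints
      int v = packed[i];
      out[i] = v >>> 16;
      out[half + i] = v & 0xFFFF;
    }
  }

  public static void main(String[] args) {
    int[] packed = new int[300];
    for (int i = 0; i < packed.length; i++) {
      packed[i] = (((i * 7) & 0xFFFF) << 16) | ((i * 13) & 0xFFFF);
    }
    int[] a = new int[600];
    int[] b = new int[600];
    unpackVariable(packed, 600, a);
    unpackInnerLoop(packed, 600, b);
    // Both variants should decode to the same values.
    System.out.println(Arrays.equals(a, b));
  }
}
```

Whether HotSpot actually vectorizes either loop depends on the JDK version and CPU, so the generated assembly is the thing to check, as was done for the numbers above.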