Re: [PR] Speed up vectorutil float scalar methods, unroll properly, use fma where possible [lucene]

via GitHub Tue, 31 Oct 2023 11:16:52 -0700


uschindler commented on PR #12737:
URL: https://github.com/apache/lucene/pull/12737#issuecomment-1787740992


   Hi,
   on my older Ryzen it is faster with FMA enabled. (I downgraded your branch
   and also verified that the system prints "FMA enabled".) Here is my full
   benchmark:
   
   ```
   main, AMD Ryzen 7 3700X 8-Core Processor
   INFO: Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   1.155 ± 0.012  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt    5  10.602 ± 0.213  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   3.675 ± 0.010  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  18.656 ± 0.109  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   2.598 ± 0.023  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  20.843 ± 0.205  ops/us
   
   Robert latest, no FMA, AMD Ryzen 7 3700X 8-Core Processor
   INFO: Java vector incubator API enabled; uses preferredBitSize=256
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   1.487 ± 0.015  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt    5  11.810 ± 0.336  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   3.910 ± 0.126  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  18.885 ± 0.238  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   3.067 ± 0.049  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  16.874 ± 0.325  ops/us
   
   Robert 93fed5fe22bec39a7a3683f50b4632756f4b1c13, enforced FMA, AMD Ryzen 7 3700X 8-Core Processor
   INFO: Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   1.562 ± 0.016  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt    5  10.669 ± 0.188  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   3.206 ± 0.018  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  18.474 ± 0.071  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   3.125 ± 0.119  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  20.247 ± 0.415  ops/us
   ```
   
   For completeness, here are the results from my Intel laptop:
   
   ```
   main, Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 1992 MHz
   INFORMATION: Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0,712 ± 0,354  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt    5   9,134 ± 0,204  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   2,383 ± 0,125  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  14,116 ± 1,304  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   1,663 ± 0,062  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  14,735 ± 0,258  ops/us
   
   Robert, Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 1992 MHz
   INFORMATION: Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   1,840 ± 0,021  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt    5   9,253 ± 0,296  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   3,160 ± 0,449  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  13,943 ± 1,281  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   2,140 ± 1,072  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  12,089 ± 4,551  ops/us
   ```
   
   So in my opinion, FMA is fine (as it is more precise). Just because one
   CPU gets slower with it, we cannot say "all AMD CPUs are bad".
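
To illustrate the "more precise" point: `Math.fma(a, b, c)` computes `a * b + c` with a single rounding, whereas the plain expression rounds the product to `float` first. Below is a minimal sketch of a four-way unrolled scalar dot product in both styles, plus a tiny case where the two roundings differ. The class and method names are hypothetical and chosen for illustration only; this is not the code from the PR or from `VectorUtil`.

```java
// Hypothetical illustration, not Lucene's implementation.
final class FmaDotProductSketch {

  // Plain multiply-add: every product is rounded to float before it is added.
  static float dotProductPlain(float[] a, float[] b) {
    float acc0 = 0f, acc1 = 0f, acc2 = 0f, acc3 = 0f;
    int bound = a.length & ~3; // largest multiple of 4 that fits
    int i = 0;
    for (; i < bound; i += 4) {
      acc0 += a[i] * b[i];
      acc1 += a[i + 1] * b[i + 1];
      acc2 += a[i + 2] * b[i + 2];
      acc3 += a[i + 3] * b[i + 3];
    }
    float res = (acc0 + acc1) + (acc2 + acc3);
    for (; i < a.length; i++) { // scalar tail
      res += a[i] * b[i];
    }
    return res;
  }

  // Fused multiply-add: product and sum are rounded once per step.
  static float dotProductFma(float[] a, float[] b) {
    float acc0 = 0f, acc1 = 0f, acc2 = 0f, acc3 = 0f;
    int bound = a.length & ~3;
    int i = 0;
    for (; i < bound; i += 4) {
      acc0 = Math.fma(a[i], b[i], acc0);
      acc1 = Math.fma(a[i + 1], b[i + 1], acc1);
      acc2 = Math.fma(a[i + 2], b[i + 2], acc2);
      acc3 = Math.fma(a[i + 3], b[i + 3], acc3);
    }
    float res = (acc0 + acc1) + (acc2 + acc3);
    for (; i < a.length; i++) { // scalar tail
      res = Math.fma(a[i], b[i], res);
    }
    return res;
  }

  // x * x - c with the product rounded to float first.
  static float residualPlain(float x, float c) {
    return x * x - c;
  }

  // x * x - c computed with a single rounding.
  static float residualFma(float x, float c) {
    return Math.fma(x, x, -c);
  }

  public static void main(String[] args) {
    float[] a = {1f, 2f, 3f, 4f, 5f};
    float[] b = {5f, 4f, 3f, 2f, 1f};
    System.out.println("plain dot = " + dotProductPlain(a, b));
    System.out.println("fma dot   = " + dotProductFma(a, b));

    // x = 1 + 2^-12, so x * x = 1 + 2^-11 + 2^-24 exactly;
    // the 2^-24 term is lost when the product is rounded to float.
    float x = 1f + Math.scalb(1f, -12);
    float c = 1f + Math.scalb(1f, -11);
    System.out.println("plain residual = " + residualPlain(x, c));
    System.out.println("fma residual   = " + residualFma(x, c));
  }
}
```

Whether the fused variant is also faster depends on the CPU and on what the JIT emits, which is what the benchmark numbers above are measuring; the sketch only shows the difference in rounding.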


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

