Re: [PR] Speed up vectorutil float scalar methods, unroll properly, use fma where possible [lucene]

via GitHub Fri, 03 Nov 2023 23:51:43 -0700


rmuir commented on PR #12737:
URL: https://github.com/apache/lucene/pull/12737#issuecomment-1793362865


   I tweaked the FMA logic for AMD CPUs so that it avoids the high-latency scalar FMA only where necessary. That should appease the Germans who want that extra ulp or whatever.
   
   The sysprops default to "auto", so you can override them however you want, without fear of involving BigDecimal :)
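
For readers unfamiliar with the trade-off, here is a minimal sketch of a tri-state ("auto"/"true"/"false") switch between a fused and an unfused scalar dot product. It is not the code from this PR: the property name `example.vectorutil.scalarFMA`, the class name, and the `autoChoice` flag (a stand-in for whatever CPU detection the real code performs) are all hypothetical. Only `Math.fma` and `System.getProperty` are real JDK calls.

```java
final class FmaDotProductSketch {
  // Hypothetical property name, for illustration only; it is not taken from the PR.
  static final String FMA_PROP = "example.vectorutil.scalarFMA";

  /**
   * Resolves a tri-state setting. "auto" defers to autoChoice, which stands in
   * for whatever CPU-based detection the real code performs (an assumption here).
   */
  static boolean resolve(String setting, boolean autoChoice) {
    switch (setting) {
      case "auto":
        return autoChoice;
      case "true":
        return true;
      case "false":
        return false;
      default:
        throw new IllegalArgumentException("expected auto, true or false: " + setting);
    }
  }

  static void checkLengths(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ");
    }
  }

  /** Dot product unrolled over four independent accumulators, using fused multiply-add. */
  static float dotProductFma(float[] a, float[] b) {
    checkLengths(a, b);
    float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    int i = 0;
    int limit = a.length & ~3; // largest multiple of 4 not exceeding the length
    for (; i < limit; i += 4) {
      acc0 = Math.fma(a[i], b[i], acc0);
      acc1 = Math.fma(a[i + 1], b[i + 1], acc1);
      acc2 = Math.fma(a[i + 2], b[i + 2], acc2);
      acc3 = Math.fma(a[i + 3], b[i + 3], acc3);
    }
    float res = (acc0 + acc1) + (acc2 + acc3);
    for (; i < a.length; i++) { // scalar tail
      res = Math.fma(a[i], b[i], res);
    }
    return res;
  }

  /** Same loop shape, but with a separate multiply and add (two roundings per step). */
  static float dotProductMulAdd(float[] a, float[] b) {
    checkLengths(a, b);
    float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    int i = 0;
    int limit = a.length & ~3;
    for (; i < limit; i += 4) {
      acc0 += a[i] * b[i];
      acc1 += a[i + 1] * b[i + 1];
      acc2 += a[i + 2] * b[i + 2];
      acc3 += a[i + 3] * b[i + 3];
    }
    float res = (acc0 + acc1) + (acc2 + acc3);
    for (; i < a.length; i++) { // scalar tail
      res += a[i] * b[i];
    }
    return res;
  }

  static float dotProduct(float[] a, float[] b, boolean useFma) {
    return useFma ? dotProductFma(a, b) : dotProductMulAdd(a, b);
  }

  public static void main(String[] args) {
    // "auto" is the default when the (hypothetical) property is not set.
    boolean useFma = resolve(System.getProperty(FMA_PROP, "auto"), true);
    float[] a = {1f, 2f, 3f, 4f, 5f};
    float[] b = {2f, 2f, 2f, 2f, 2f};
    System.out.println(dotProduct(a, b, useFma));
  }
}
```

The two paths differ only in rounding: `Math.fma` rounds once per multiply-add, while the unfused form rounds the product and the sum separately, which is where the "extra ulp" comes from. Which one is faster depends on the CPU, hence the override.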
   
   I can test the Intel and Arm families the same way and try to tighten it up tomorrow.
   
   AMD Zen4: EPYC 9R14 (family 0x19)
   ```
   Main:
   Benchmark                                  (size)   Mode  Cnt   Score    Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   0.842 ±  0.001  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  13.497 ±  0.171  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.540 ±  0.002  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.441 ±  0.424  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.540 ±  0.008  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  16.655 ±  0.575  ops/us
   
   Patch:
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.763 ± 0.001  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  13.477 ± 0.168  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.583 ± 0.003  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.438 ± 0.493  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.560 ± 0.009  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  15.778 ± 0.114  ops/us
   ```
   
   AMD Zen3: EPYC 7R13 (family 0x19)
   ```
   Main:
   Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar         1024  thrpt   15   0.982 ± 0.001  ops/us
   VectorUtilBenchmark.floatCosineVector         1024  thrpt   75  10.476 ± 0.026  ops/us
   VectorUtilBenchmark.floatDotProductScalar     1024  thrpt   15   3.246 ± 0.015  ops/us
   VectorUtilBenchmark.floatDotProductVector     1024  thrpt   75  16.959 ± 0.480  ops/us
   VectorUtilBenchmark.floatSquareScalar         1024  thrpt   15   2.298 ± 0.010  ops/us
   VectorUtilBenchmark.floatSquareVector         1024  thrpt   75  16.342 ± 0.508  ops/us
   
   Patch:
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.344 ± 0.001  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  10.445 ± 0.048  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.405 ± 0.006  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.486 ± 0.374  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.995 ± 0.002  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  16.374 ± 0.462  ops/us
   ```
   
   AMD Zen2: EPYC 7R32 (family 0x17)
   ```
   Main:
   Benchmark                                   (size)   Mode  Cnt   Score    Error   Units
   VectorUtilBenchmark.floatCosineScalar         1024  thrpt   15   0.922 ±  0.005  ops/us
   VectorUtilBenchmark.floatCosineVector         1024  thrpt   75   8.519 ±  0.020  ops/us
   VectorUtilBenchmark.floatDotProductScalar     1024  thrpt   15   2.968 ±  0.020  ops/us
   VectorUtilBenchmark.floatDotProductVector     1024  thrpt   75  15.950 ±  0.486  ops/us
   VectorUtilBenchmark.floatSquareScalar         1024  thrpt   15   2.015 ±  0.012  ops/us
   VectorUtilBenchmark.floatSquareVector         1024  thrpt   75  15.894 ±  0.331  ops/us
   
   Patch:
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.200 ± 0.005  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   8.520 ± 0.018  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.114 ± 0.021  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  15.671 ± 0.439  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.490 ± 0.030  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  15.189 ± 0.170  ops/us
   ```
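
To read the tables as relative speedups, one can divide each Patch score by the matching Main score (both are throughput in ops/us, so a ratio above 1 means the patch is faster). A small sketch, using the three scalar rows of the Zen4 table as input; the class and method names are illustrative only, and the error bars are ignored:

```java
import java.util.Locale;

final class SpeedupSketch {
  /** Ratio of patched throughput to baseline throughput (both in ops/us). */
  static double speedup(double mainScore, double patchScore) {
    if (mainScore <= 0) {
      throw new IllegalArgumentException("baseline score must be positive");
    }
    return patchScore / mainScore;
  }

  public static void main(String[] args) {
    // Scalar rows copied from the Zen4 (EPYC 9R14) table above.
    String[] names = {"floatCosineScalar", "floatDotProductScalar", "floatSquareScalar"};
    double[] baseline = {0.842, 3.540, 2.540}; // Main
    double[] patched = {1.763, 3.583, 3.560}; // Patch
    for (int i = 0; i < names.length; i++) {
      System.out.printf(
          Locale.ROOT, "%-24s %.2fx%n", names[i], speedup(baseline[i], patched[i]));
    }
  }
}
```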


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
