Re: [PR] Speed up vectorutil float scalar methods, unroll properly, use fma where possible [lucene]

via GitHub Sat, 04 Nov 2023 09:25:40 -0700


rmuir commented on PR #12737:
URL: https://github.com/apache/lucene/pull/12737#issuecomment-1793488056


   Here are the ARM results. I had to tweak the ARM path to use FMA more
   aggressively to fully utilize the Gravitons. The only problem there is
   Apple silicon, so it is good that we did not move forward with benchmarks
   based solely on some Macs. You may not like my detector, but I think it is
   quite practical and prevents slow execution.
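
For readers following along, here is a minimal sketch of what "unroll properly, use FMA where possible" can look like for a scalar dot product. It is not the code from this PR: the class and method names are hypothetical, the unroll factor of four is an assumption, and the detector mentioned above is not shown. `Math.fma` is a real JDK method (Java 9+); where the JIT cannot map it to a hardware instruction it can be very slow, which is presumably the kind of slow execution a detector would guard against.

```java
// Hypothetical illustration only; not the implementation from the PR.
final class ScalarDotProductSketch {
  private ScalarDotProductSketch() {}

  /** Baseline: a single accumulator, so each add depends on the previous one. */
  static float dotProductPlain(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ");
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Unrolled by four with independent accumulators, each updated via Math.fma. */
  static float dotProductFma(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ");
    }
    float acc0 = 0f, acc1 = 0f, acc2 = 0f, acc3 = 0f;
    int i = 0;
    int upper = a.length & ~3; // largest multiple of 4 that is <= length
    for (; i < upper; i += 4) {
      // Four independent dependency chains let the CPU overlap the operations.
      acc0 = Math.fma(a[i], b[i], acc0);
      acc1 = Math.fma(a[i + 1], b[i + 1], acc1);
      acc2 = Math.fma(a[i + 2], b[i + 2], acc2);
      acc3 = Math.fma(a[i + 3], b[i + 3], acc3);
    }
    // Handle the remaining 0..3 elements.
    float tail = 0f;
    for (; i < a.length; i++) {
      tail = Math.fma(a[i], b[i], tail);
    }
    return (acc0 + acc1) + (acc2 + acc3) + tail;
  }
}
```

Because FMA rounds once per step and the unrolled version sums in a different order, the two methods can differ in the last bits for general inputs; they agree exactly only when every intermediate value is exactly representable.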
   
   Graviton 3
   ```
   Main:
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   0.682 ± 0.001  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   5.500 ± 0.004  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   2.411 ± 0.037  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  11.522 ± 0.234  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   2.169 ± 0.005  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt   75   8.632 ± 0.084  ops/us
   
   Patch:
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.422 ± 0.001  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   6.911 ± 0.039  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.751 ± 0.007  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  11.498 ± 0.418  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.202 ± 0.007  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  10.795 ± 0.154  ops/us
   ```
   
   
   Graviton 2
   ```
   Main:
   Benchmark                                  (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15  0.647 ± 0.002  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  2.599 ± 0.002  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15  1.430 ± 0.007  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  6.192 ± 0.098  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15  1.194 ± 0.003  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  4.797 ± 0.088  ops/us
   
   Patch:
   Benchmark                                  (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15  1.571 ± 0.001  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  5.408 ± 0.013  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15  2.055 ± 0.066  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  6.673 ± 0.260  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15  1.753 ± 0.001  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  6.179 ± 0.070  ops/us
   ```
   
   Mac M1
   ```
   Main:
   Benchmark                                  (size)   Mode  Cnt   Score   
Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   1.077 ± 
0.002  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   7.651 ± 
0.032  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   3.606 ± 
0.032  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.296 ± 
0.268  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.197 ± 
0.001  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  14.185 ± 
0.099  ops/us
   
   Patch:
   Benchmark                                  (size)   Mode  Cnt   Score   
Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt   15   2.062 ± 
0.006  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   7.644 ± 
0.030  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt   15   4.273 ± 
0.003  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.110 ± 
0.283  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt   15   3.770 ± 
0.007  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  14.184 ± 
0.100  ops/us
   ```
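
To read the tables as relative changes rather than raw throughput, one can divide each Patch score by the matching Main score. The helper below is a hypothetical sketch (the class and method names are not from the PR); the scores in `main` are copied from the `floatCosineScalar` rows above, and the error margins are ignored.

```java
// Hypothetical helper for comparing JMH throughput scores (ops/us, higher is better).
final class SpeedupSketch {
  private SpeedupSketch() {}

  /** Ratio of patch throughput to main throughput; values above 1.0 mean the patch is faster. */
  static double speedup(double mainScore, double patchScore) {
    if (mainScore <= 0.0) {
      throw new IllegalArgumentException("main score must be positive");
    }
    return patchScore / mainScore;
  }

  public static void main(String[] args) {
    // floatCosineScalar, size 1024, scores taken from the tables above.
    System.out.printf("Graviton 3: %.2fx%n", speedup(0.682, 1.422));
    System.out.printf("Graviton 2: %.2fx%n", speedup(0.647, 1.571));
    System.out.printf("Mac M1:     %.2fx%n", speedup(1.077, 2.062));
  }
}
```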


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


