ChrisHegarty commented on PR #12743:
URL: https://github.com/apache/lucene/pull/12743#issuecomment-1788889191

   I see reasonable speedups on both x64 and ARM, but sadly no vectorization
   in the disassembly. The speedup seems to come from issuing 2x the
   instructions (better pipelining?) per strip-mined loop iteration, rather
   than from any auto-vectorization, which is not happening. [ I included the
   Panama/Vector output just for comparison purposes ]
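
For readers without the diff at hand, here is a minimal sketch of the kind of change described above: a scalar byte dot product next to a variant that does two multiply-adds per loop iteration. The class name, method names, and the use of two separate accumulators are assumptions made for illustration; the actual PR code may be structured differently.

```java
// Hypothetical illustration only; this is not the code from the PR.
final class UnrolledDotProduct {

  // Baseline scalar loop: one multiply-add per iteration.
  static int dotProduct(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ");
    }
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i]; // bytes are promoted to int before multiplying
    }
    return total;
  }

  // Unrolled by two: each iteration performs two independent multiply-adds.
  // Assumption: two accumulators are used so the additions do not depend on
  // each other, which gives the CPU more room to overlap them.
  static int dotProductUnrolled(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ");
    }
    int acc0 = 0;
    int acc1 = 0;
    int i = 0;
    int bound = a.length & ~1; // largest even value <= a.length
    for (; i < bound; i += 2) {
      acc0 += a[i] * b[i];
      acc1 += a[i + 1] * b[i + 1];
    }
    // Tail: at most one element remains when the length is odd.
    for (; i < a.length; i++) {
      acc0 += a[i] * b[i];
    }
    return acc0 + acc1;
  }
}
```

Both methods return the same value for any pair of equal-length inputs. Whether the second form is faster depends on the JIT and the hardware, so the benchmark figures in this comment remain the evidence to rely on.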
   
   Mac M2:
   
   main:
   ```
   VectorUtilBenchmark.binaryDotProductScalar    1024  thrpt    5  3.151 ± 0.044  ops/us
   VectorUtilBenchmark.binaryDotProductVector    1024  thrpt    5  7.112 ± 0.030  ops/us
   ```
   
   PR branch:
   ```
   VectorUtilBenchmark.binaryDotProductScalar    1024  thrpt    5  4.722 ± 0.031  ops/us
   VectorUtilBenchmark.binaryDotProductVector    1024  thrpt    5  7.127 ± 0.042  ops/us
   ```
   
   I see a reasonable speedup on Linux / x64 as well:
   
   main:
   ```
   VectorUtilBenchmark.binaryDotProductScalar    1024  thrpt    5   2.121 ± 0.010  ops/us
   VectorUtilBenchmark.binaryDotProductVector    1024  thrpt    5  21.341 ± 0.039  ops/us
   ```
   
   PR branch:
   ```
   VectorUtilBenchmark.binaryDotProductScalar    1024  thrpt    5   2.799 ± 0.022  ops/us
   VectorUtilBenchmark.binaryDotProductVector    1024  thrpt    5  21.337 ± 0.014  ops/us
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

