ChrisHegarty commented on PR #12743: URL: https://github.com/apache/lucene/pull/12743#issuecomment-1788889191
I see reasonable speedups for the scalar benchmark on both x64 and ARM (about 1.5x on the Mac M2 and about 1.3x on Linux / x64), but sadly see no vectorization in the disassembly. The speedup seems to come from the 2x instructions/pipelining (?) per strip-mined loop iteration, rather than from any vectorization (which is not happening).

[ I included the Panama/Vector output just for comparison purposes ]

Mac M2:

main:
```
VectorUtilBenchmark.binaryDotProductScalar    1024  thrpt    5   3.151 ± 0.044  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt    5   7.112 ± 0.030  ops/us
```

PR branch:
```
VectorUtilBenchmark.binaryDotProductScalar    1024  thrpt    5   4.722 ± 0.031  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt    5   7.127 ± 0.042  ops/us
```

I see a reasonable speedup on Linux / x64 as well.

main:
```
VectorUtilBenchmark.binaryDotProductScalar    1024  thrpt    5   2.121 ± 0.010  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt    5  21.341 ± 0.039  ops/us
```

PR branch:
```
VectorUtilBenchmark.binaryDotProductScalar    1024  thrpt    5   2.799 ± 0.022  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt    5  21.337 ± 0.014  ops/us
```
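
For anyone unfamiliar with the "2x per loop iteration" remark, here is a minimal sketch of what a 2x-unrolled scalar byte dot product with two independent accumulators might look like. This is not the code from the PR: the class `UnrolledDotSketch` and both method names are hypothetical and only illustrate the general technique. The assumption is that two independent dependency chains give the CPU more instruction-level parallelism even when the JIT does not auto-vectorize the loop.

```java
// Hypothetical illustration only; not the implementation in the PR.
public final class UnrolledDotSketch {
  private UnrolledDotSketch() {}

  /** Baseline scalar loop: one multiply-add per iteration, one accumulator. */
  public static int dotProductSimple(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ");
    }
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i]; // bytes are widened to int before multiplying
    }
    return total;
  }

  /** 2x unrolled loop with two independent accumulators, plus a scalar tail. */
  public static int dotProductUnrolled2(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ");
    }
    int acc0 = 0;
    int acc1 = 0;
    int i = 0;
    int bound = a.length & ~1; // largest even index bound <= length
    for (; i < bound; i += 2) {
      // The two updates do not depend on each other, so they can overlap
      // in the pipeline.
      acc0 += a[i] * b[i];
      acc1 += a[i + 1] * b[i + 1];
    }
    for (; i < a.length; i++) { // handle an odd trailing element
      acc0 += a[i] * b[i];
    }
    return acc0 + acc1;
  }

  public static void main(String[] args) {
    byte[] x = {1, 2, 3};
    byte[] y = {4, 5, 6};
    System.out.println(dotProductSimple(x, y));
    System.out.println(dotProductUnrolled2(x, y));
  }
}
```

Both methods return the same result for the same inputs; only the shape of the loop differs. Whether this form actually runs faster depends on the JIT and the hardware, so it would need to be measured with a benchmark such as the one above.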