rmuir opened a new pull request, #12731: URL: https://github.com/apache/lucene/pull/12731
The Intel FMA is nice, and it's easier to reason about when looking at the assembly. We basically reduce the error for free where it's available. Along with another change (reducing the unrolling for cosine, since it has 3 FMA ops already), we can speed up cosine from 6 to 8 ops/us.

On ARM the FMA leads to slight slowdowns, so we don't use it there. It's not much, just something like 10%, but it seems like the wrong tradeoff.

If you run the code with `-XX:-UseFMA` there's no slowdown, but no speedup either. And obviously, no changes for ARM here.

```
Skylake AVX-256

Main:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0.624 ± 0.041  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt    5   5.988 ± 0.111  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   1.959 ± 0.032  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  12.058 ± 0.920  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   1.422 ± 0.018  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt    5   9.837 ± 0.154  ops/us

Patch:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0.638 ± 0.006  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt    5   8.164 ± 0.084  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   1.997 ± 0.027  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  12.486 ± 0.163  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   1.445 ± 0.014  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  11.682 ± 0.129  ops/us

Patch (with -jvmArgsAppend '-XX:-UseFMA'):
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0.641 ± 0.005  ops/us
VectorUtilBenchmark.floatCosineVector        1024  thrpt    5   6.102 ± 0.053  ops/us
VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   1.997 ± 0.007  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  12.177 ± 0.170  ops/us
VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   1.450 ± 0.027  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  10.464 ± 0.154  ops/us
```
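To make the "reduce the error for free" point concrete, here is a minimal scalar sketch of a dot product that either fuses the multiply and add with `Math.fma` or keeps them as two separately rounded operations. It is an illustration only, not the vectorized code touched by this PR: the class name `FmaDotProductSketch`, the `useFma` parameter, and the `looksLikeArm` architecture check are all hypothetical names made up for this example.

```java
// Illustrative sketch only; all names here are hypothetical and this is
// not the vectorized implementation changed by the pull request.
final class FmaDotProductSketch {

  /** Scalar dot product; when useFma is true each step is a single fused multiply-add. */
  static float dotProduct(float[] a, float[] b, boolean useFma) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ");
    }
    float acc = 0f;
    for (int i = 0; i < a.length; i++) {
      // Math.fma rounds once (after multiply and add); the plain form
      // rounds the product first and then rounds the sum.
      acc = useFma ? Math.fma(a[i], b[i], acc) : a[i] * b[i] + acc;
    }
    return acc;
  }

  /** Assumed heuristic: treat any os.arch value starting with "arm" or "aarch64" as ARM. */
  static boolean looksLikeArm() {
    String arch = System.getProperty("os.arch", "");
    return arch.startsWith("aarch64") || arch.startsWith("arm");
  }

  public static void main(String[] args) {
    // x * x is exactly 1 + 2^-11 + 2^-24, which does not fit in a float.
    float x = 1f + 0x1p-12f;
    float[] a = {1f + 0x1p-11f, x};
    float[] b = {-1f, x};

    // Mirror the tradeoff described above: use FMA except on ARM.
    boolean useFma = !looksLikeArm();
    System.out.println("useFma = " + useFma);
    System.out.println("fused   = " + dotProduct(a, b, true));
    System.out.println("unfused = " + dotProduct(a, b, false));
  }
}
```

With these inputs the fused path keeps the low-order 2^-24 term of the product, while the unfused path rounds it away before the add and returns zero. Whether the fused form is also faster depends on the hardware and JIT, which is what the benchmark numbers above measure.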