[PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

via GitHub Fri, 27 Oct 2023 20:50:05 -0700


rmuir opened a new pull request, #12731:
URL: https://github.com/apache/lucene/pull/12731


   The Intel FMA is nice, and it's easier to reason about when looking at 
assembly. We basically reduce the error for free where it's available. Along 
with another change (reducing the unrolling for cosine, since it has 3 FMA ops 
already), we can speed up cosine from 6 -> 8 ops/us.
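
For readers who want to see where the three FMA ops come from, here is a minimal scalar sketch. It is not the vectorized code from this PR; the class name `FmaCosineSketch` and the method `cosine` are made up for illustration. Each loop iteration feeds three accumulators (the dot product and the two squared norms), and each update is a single `Math.fma` call.

```java
public final class FmaCosineSketch {
  private FmaCosineSketch() {}

  /** Cosine similarity in one pass, using one fused multiply-add per accumulator. */
  public static float cosine(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "vector dimensions differ: " + a.length + " != " + b.length);
    }
    float sum = 0f;   // running a . b
    float norm1 = 0f; // running a . a
    float norm2 = 0f; // running b . b
    for (int i = 0; i < a.length; i++) {
      // Three FMA ops per element: multiply and add with a single rounding each.
      sum = Math.fma(a[i], b[i], sum);
      norm1 = Math.fma(a[i], a[i], norm1);
      norm2 = Math.fma(b[i], b[i], norm2);
    }
    return (float) (sum / Math.sqrt((double) norm1 * (double) norm2));
  }

  public static void main(String[] args) {
    float[] x = {1f, 2f, 3f};
    float[] y = {2f, 4f, 6f};
    // Parallel vectors, so the similarity should come out as 1.0.
    System.out.println(cosine(x, y));
  }
}
```

Because the loop body already carries three independent dependency chains, there is less to gain from additional unrolling than in a plain dot product, which has only one.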
   
   On ARM, the FMA leads to slight slowdowns, so we don't use it. It's not 
much, just something like 10%, but it seems like the wrong tradeoff.
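
To make concrete what kind of error is at stake in that tradeoff, here is a small sketch (not part of the PR; the class and method names are hypothetical). It compares `a * a - c`, which rounds after the multiply and again after the subtract, with `Math.fma(a, a, -c)`, which rounds only once. The inputs are chosen so that the exact product falls halfway between two floats.

```java
public final class FmaRoundingSketch {
  private FmaRoundingSketch() {}

  /** Computes a*a - c with two roundings: one after the multiply, one after the subtract. */
  static float plainResidual(float a, float c) {
    return a * a - c;
  }

  /** Computes a*a - c with a single rounding at the end, via fused multiply-add. */
  static float fusedResidual(float a, float c) {
    return Math.fma(a, a, -c);
  }

  public static void main(String[] args) {
    float a = 0x1.001p0f; // 1 + 2^-12
    float c = 0x1.002p0f; // 1 + 2^-11
    System.out.println("plain: " + plainResidual(a, c));
    System.out.println("fused: " + fusedResidual(a, c));
  }
}
```

The exact product here is 1 + 2^-11 + 2^-24. The plain version rounds it to 1 + 2^-11 before subtracting and so returns 0, while the fused version keeps the 2^-24 term.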
   
   If you run the code with `-XX:-UseFMA` there's no slowdown relative to main, 
but no speedup either. And obviously, no changes for ARM here.
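
One way to picture "use FMA where fast and available" is a gate like the one below. This is only a sketch under stated assumptions, not the detection logic in this PR: the class `FmaGateSketch` and its methods are hypothetical, the architecture check is a plain `os.arch` string comparison, and the `UseFMA` flag is read through `HotSpotDiagnosticMXBean`, which is only present on HotSpot-based JVMs.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.Locale;

public final class FmaGateSketch {
  private FmaGateSketch() {}

  /** Hypothetical helper: true for the usual x86-64 values of the os.arch property. */
  static boolean isX86_64(String osArch) {
    String arch = osArch.toLowerCase(Locale.ROOT);
    return arch.equals("amd64") || arch.equals("x86_64");
  }

  /** Reads the HotSpot UseFMA flag; falls back to false when it cannot be read. */
  static boolean useFmaFlag() {
    try {
      HotSpotDiagnosticMXBean bean =
          ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
      if (bean == null) {
        return false;
      }
      return Boolean.parseBoolean(bean.getVMOption("UseFMA").getValue());
    } catch (RuntimeException | LinkageError e) {
      // Non-HotSpot JVM or unknown option: assume FMA is not worth using.
      return false;
    }
  }

  /** Assumption for this sketch: FMA counts as "fast" only on x86-64 with UseFMA on. */
  static final boolean FAST_FMA = isX86_64(System.getProperty("os.arch", "")) && useFmaFlag();

  /** Uses the fused op when the gate is open, otherwise separate multiply and add. */
  static float mulAdd(float a, float b, float c) {
    return FAST_FMA ? Math.fma(a, b, c) : a * b + c;
  }

  public static void main(String[] args) {
    System.out.println("FAST_FMA = " + FAST_FMA);
    System.out.println(mulAdd(2f, 3f, 4f));
  }
}
```

Running this with and without `-XX:-UseFMA` on an x86-64 HotSpot JVM should flip the reported gate, which mirrors the third benchmark run below.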
   
   ```
   Skylake AVX-256
   
   Main:
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0.624 ± 0.041  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt    5   5.988 ± 0.111  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   1.959 ± 0.032  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  12.058 ± 0.920  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   1.422 ± 0.018  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt    5   9.837 ± 0.154  ops/us
   
   Patch:
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0.638 ± 0.006  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt    5   8.164 ± 0.084  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   1.997 ± 0.027  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  12.486 ± 0.163  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   1.445 ± 0.014  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  11.682 ± 0.129  ops/us
   
   Patch (with -jvmArgsAppend '-XX:-UseFMA'):
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0.641 ± 0.005  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt    5   6.102 ± 0.053  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   1.997 ± 0.007  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  12.177 ± 0.170  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   1.450 ± 0.027  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  10.464 ± 0.154  ops/us
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

