Re: [PR] Speed up vectorutil float scalar methods, unroll properly, use fma where possible [lucene]

via GitHub Fri, 03 Nov 2023 16:18:25 -0700


rmuir commented on PR #12737:
URL: https://github.com/apache/lucene/pull/12737#issuecomment-1793231388


   Vector results for this AMD CPU are unchanged by this PR.
   
   Float-relevant performance info from avxturbo.
   This CPU doesn't downclock, but 512-bit FMA is 2x as slow as 256-bit FMA (3700 vs 7402 Mops), so I did some experiments...
   ```
   Cores | ID                  | Description                       | OVRLP3 | Mops | A/M-ratio | A/M-MHz | M/tsc-ratio
   1     | avx128_fma_t        | 128-bit parallel DP FMAs          |  1.000 | 7402 |      1.42 |    3700 |        1.00
   1     | avx256_fma_t        | 256-bit parallel DP FMAs          |  1.000 | 7402 |      1.42 |    3700 |        1.00
   1     | avx512_fma_t        | 512-bit parallel DP FMAs          |  1.000 | 3700 |      1.42 |    3700 |        1.00
   ```
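
   As a rough cross-check of the table above, the instruction rates can be converted to per-lane rates. This is a minimal sketch, assuming "Mops" means millions of FMA instructions per second and that DP lanes are 64 bits wide; the class and method names are hypothetical and not part of Lucene or avxturbo.

   ```java
   // Hypothetical helper: converts an instruction rate into a per-lane rate.
   final class FmaLaneThroughput {

       // Lane-FMAs per microsecond = instructions per microsecond * lanes per instruction.
       // Assumes mops is "millions of instructions per second" (i.e. per microsecond).
       static long laneFmasPerMicro(long mops, int vectorBits, int laneBits) {
           return mops * (vectorBits / laneBits);
       }

       public static void main(String[] args) {
           int[] widths = {128, 256, 512};
           long[] mops = {7402, 7402, 3700}; // Mops column from the avxturbo output
           for (int i = 0; i < widths.length; i++) {
               System.out.printf("%d-bit: %d double lanes/us%n",
                   widths[i], laneFmasPerMicro(mops[i], widths[i], 64));
           }
       }
   }
   ```

   Under those assumptions, 256-bit and 512-bit FMA retire about the same number of double lanes per microsecond (29608 vs 29600), so the wider vectors add no raw FMA throughput on this CPU.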
   
   Float:
   INFO: Java vector incubator API enabled; uses preferredBitSize=512; FMA enabled
   ```
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  13.397 ± 0.205  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.226 ± 0.434  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  16.147 ± 0.394  ops/us
   ```
   
   Float (avoiding AVX-512 entirely by passing -XX:MaxVectorSize=32)
   INFO: Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled
   ```
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  11.234 ± 0.041  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  17.045 ± 0.436  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  16.876 ± 0.351  ops/us
   ```
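
   When reading the two float runs side by side, it can help to check whether the reported intervals overlap before calling one configuration faster. A minimal sketch, treating each JMH result as the closed interval Score ± Error; the class and method names are hypothetical and the numbers are copied from the tables above.

   ```java
   // Hypothetical helper for comparing two "Score ± Error" results.
   final class ScoreIntervals {

       // True if [s1 - e1, s1 + e1] and [s2 - e2, s2 + e2] share at least one point.
       static boolean overlap(double s1, double e1, double s2, double e2) {
           return s1 - e1 <= s2 + e2 && s2 - e2 <= s1 + e1;
       }

       public static void main(String[] args) {
           // 512-bit run first, 256-bit run second.
           System.out.println("cosine overlaps:     " + overlap(13.397, 0.205, 11.234, 0.041));
           System.out.println("dotProduct overlaps: " + overlap(16.226, 0.434, 17.045, 0.436));
           System.out.println("square overlaps:     " + overlap(16.147, 0.394, 16.876, 0.351));
       }
   }
   ```

   By this check the dot product and square intervals overlap slightly between the two runs, while the cosine intervals do not, so only the cosine difference is clearly outside the reported error.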
   
   Binary-relevant performance info from avxturbo:
   ```
   Cores | ID                  | Description                       | OVRLP3 | Mops | A/M-ratio | A/M-MHz | M/tsc-ratio
   1     | avx128_imul         | 128-bit integer muls (vpmuldq)    |  1.000 | 1233 |      1.42 |    3700 |        1.00
   1     | avx256_imul         | 256-bit integer muls (vpmuldq)    |  1.000 | 1233 |      1.42 |    3700 |        1.00
   1     | avx512_imul         | 512-bit integer muls (vpmuldq)    |  1.000 | 1233 |      1.42 |    3700 |        1.00
   ```
   
   Binary:
   INFO: Java vector incubator API enabled; uses preferredBitSize=512; FMA enabled
   ```
   Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.binaryCosineVector        1024  thrpt   15   8.769 ± 0.083  ops/us
   VectorUtilBenchmark.binaryDotProductVector    1024  thrpt   15  22.362 ± 0.054  ops/us
   VectorUtilBenchmark.binarySquareVector        1024  thrpt   15  18.080 ± 0.171  ops/us
   ```
   
   Binary (512-bit vectors but disabling Intel-specific downclock-protection / doing 32-bit vpmul)
   INFO: Java vector incubator API enabled; uses preferredBitSize=512; FMA enabled
   ```
   Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.binaryCosineVector        1024  thrpt   15  10.669 ± 0.242  ops/us
   VectorUtilBenchmark.binaryDotProductVector    1024  thrpt   15  21.148 ± 0.087  ops/us
   VectorUtilBenchmark.binarySquareVector        1024  thrpt   15  18.048 ± 0.142  ops/us
   ```
   
   Binary (avoiding AVX-512 entirely by passing -XX:MaxVectorSize=32)
   INFO: Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled
   ```
   Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.binaryCosineVector        1024  thrpt   15   8.773 ± 0.006  ops/us
   VectorUtilBenchmark.binaryDotProductVector    1024  thrpt   15  17.484 ± 0.022  ops/us
   VectorUtilBenchmark.binarySquareVector        1024  thrpt   15  14.930 ± 0.018  ops/us
   ```
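
   To put the binary runs on a common scale, the scores can be expressed as throughput ratios against the 256-bit run. A minimal sketch with hypothetical names; the scores are copied from the default 512-bit and the 256-bit tables above, and the ratio is plain candidate/baseline with no error propagation.

   ```java
   // Hypothetical helper: throughput ratio of one run against a baseline run.
   final class BinarySpeedup {

       // Both scores are in ops/us, so a result above 1.0 means the candidate is faster.
       static double ratio(double candidateScore, double baselineScore) {
           return candidateScore / baselineScore;
       }

       public static void main(String[] args) {
           // Candidate: preferredBitSize=512 (default). Baseline: -XX:MaxVectorSize=32.
           System.out.printf("cosine:     %.3f%n", ratio(8.769, 8.773));
           System.out.printf("dotProduct: %.3f%n", ratio(22.362, 17.484));
           System.out.printf("square:     %.3f%n", ratio(18.080, 14.930));
       }
   }
   ```

   With these numbers the 512-bit run comes out roughly 1.28x faster for dot product and 1.21x for square, and essentially level for cosine.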


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


