ChrisHegarty commented on issue #14042: URL: https://github.com/apache/lucene/issues/14042#issuecomment-2546160112
HotSpot will unroll the loops that use the Vector API to do floating-point arithmetic. On my Intel box `dotProductBody` gets unrolled 4x, and since it is already hand-unrolled 4x, we effectively get 16x unrolling. For example:

```
;; B49: # out( B49 B50 ) <- in( B48 B49 ) Loop( B49-B49 inner main of N191 strip mined) Freq: 272.018
0x000079609ff2fa11: vmovdqu32 zmm0,ZMMWORD PTR [rdx+rax*4+0x310]
0x000079609ff2fa1c: vmovdqu32 zmm1,ZMMWORD PTR [rdx+rax*4+0x210]
0x000079609ff2fa27: vmovdqu32 zmm4,ZMMWORD PTR [rdx+rax*4+0xd0]
0x000079609ff2fa32: vfmadd231ps zmm6,zmm4,ZMMWORD PTR [rcx+rax*4+0xd0]
0x000079609ff2fa3d: vmovdqu32 zmm4,ZMMWORD PTR [rdx+rax*4+0x1d0]
0x000079609ff2fa48: vfmadd231ps zmm6,zmm4,ZMMWORD PTR [rcx+rax*4+0x1d0]
0x000079609ff2fa53: vmovdqu32 zmm4,ZMMWORD PTR [rdx+rax*4+0x2d0]
0x000079609ff2fa5e: vfmadd231ps zmm6,zmm4,ZMMWORD PTR [rcx+rax*4+0x2d0]
0x000079609ff2fa69: vmovdqu32 zmm4,ZMMWORD PTR [rdx+rax*4+0x3d0]
0x000079609ff2fa74: vfmadd231ps zmm6,zmm4,ZMMWORD PTR [rcx+rax*4+0x3d0]
0x000079609ff2fa7f: vmovdqu32 zmm4,ZMMWORD PTR [rdx+rax*4+0x90]
0x000079609ff2fa8a: vfmadd231ps zmm5,zmm4,ZMMWORD PTR [rcx+rax*4+0x90]
0x000079609ff2fa95: vmovdqu32 zmm4,ZMMWORD PTR [rdx+rax*4+0x190]
0x000079609ff2faa0: vfmadd231ps zmm5,zmm4,ZMMWORD PTR [rcx+rax*4+0x190]
0x000079609ff2faab: vmovdqu32 zmm4,ZMMWORD PTR [rdx+rax*4+0x290]
0x000079609ff2fab6: vfmadd231ps zmm5,zmm4,ZMMWORD PTR [rcx+rax*4+0x290]
0x000079609ff2fac1: vmovdqu32 zmm4,ZMMWORD PTR [rdx+rax*4+0x390]
0x000079609ff2facc: vfmadd231ps zmm5,zmm4,ZMMWORD PTR [rcx+rax*4+0x390]
0x000079609ff2fad7: vmovdqu32 zmm4,ZMMWORD PTR [rdx+rax*4+0x50]
0x000079609ff2fae2: vfmadd231ps zmm3,zmm4,ZMMWORD PTR [rcx+rax*4+0x50]
0x000079609ff2faed: vmovdqu32 zmm4,ZMMWORD PTR [rdx+rax*4+0x150]
0x000079609ff2faf8: vfmadd231ps zmm3,zmm4,ZMMWORD PTR [rcx+rax*4+0x150]
0x000079609ff2fb03: vmovdqu32 zmm4,ZMMWORD PTR [rdx+rax*4+0x250]
0x000079609ff2fb0e: vfmadd231ps zmm3,zmm4,ZMMWORD PTR [rcx+rax*4+0x250]
0x000079609ff2fb19: vmovdqu32 zmm4,ZMMWORD PTR [rdx+rax*4+0x350]
0x000079609ff2fb2f: vmovdqu32 zmm4,ZMMWORD PTR [rdx+rax*4+0x10]
0x000079609ff2fb3a: vfmadd231ps zmm2,zmm4,ZMMWORD PTR [rcx+rax*4+0x10]
0x000079609ff2fb45: vmovdqu32 zmm4,ZMMWORD PTR [rdx+rax*4+0x110]
0x000079609ff2fb50: vfmadd231ps zmm2,zmm4,ZMMWORD PTR [rcx+rax*4+0x110]
0x000079609ff2fb5b: vfmadd231ps zmm2,zmm1,ZMMWORD PTR [rcx+rax*4+0x210]
0x000079609ff2fb66: vfmadd231ps zmm2,zmm0,ZMMWORD PTR [rcx+rax*4+0x310]
0x000079609ff2fb71: add eax,0x100
0x000079609ff2fb76: cmp eax,ebp
0x000079609ff2fb78: jl 0x000079609ff2fa11
;; B50: # out( B48 B51 ) <- in( B49 ) Freq: 16.0002
```

Reducing the hand-unrolling of `dotProductBody` to 2x (e.g. draft PR #14071) gives me a bit of an improvement.

Linux

```
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units

main
VectorUtilBenchmark.floatDotProductVector     768  thrpt   75  31.888 ± 0.812  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  26.240 ± 0.550  ops/us

reduce unroll to x2
VectorUtilBenchmark.floatDotProductVector     768  thrpt   75  35.129 ± 0.749  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  28.060 ± 0.619  ops/us

reduce unroll to x2 AND first is mul (rather than FMA)
VectorUtilBenchmark.floatDotProductVector     768  thrpt   75  37.100 ± 0.726  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  29.172 ± 0.514  ops/us
```
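
To make the shape of the "x2, first is mul" variant concrete, here is a minimal scalar sketch. It is not the code from the draft PR and does not use the Vector API; the class and method names (`UnrolledDotSketch`, `dot2xFirstMul`, `vectorsPerIteration`) are made up for illustration, and reading "first is mul" as "seed each accumulator with a plain multiply instead of an FMA onto zero" is my assumption. The helper also checks the unrolling arithmetic from the listing: the loop advances `eax` by `0x100` floats per iteration, and a 512-bit zmm register holds 16 floats, which gives 16 vector steps per iteration.

```java
// Hypothetical illustration only: scalar stand-in for the vectorized loop.
final class UnrolledDotSketch {

    // Number of full vector steps covered by one loop iteration.
    // For the listing: 0x100 floats per iteration, 512-bit vectors.
    static int vectorsPerIteration(int floatsPerIteration, int vectorBits) {
        int floatsPerVector = vectorBits / Float.SIZE; // 512 / 32 = 16
        return floatsPerIteration / floatsPerVector;
    }

    // Dot product hand-unrolled 2x with two independent accumulators.
    // Assumption: "first is mul" means the accumulators start from a
    // plain multiply of the first elements rather than fma(a, b, 0).
    static float dot2xFirstMul(float[] a, float[] b) {
        int i = 0;
        int limit = a.length & ~1; // largest even bound <= length
        float acc1 = 0f;
        float acc2 = 0f;
        if (limit >= 2) {
            acc1 = a[0] * b[0];
            acc2 = a[1] * b[1];
            i = 2;
        }
        for (; i < limit; i += 2) {
            acc1 = Math.fma(a[i], b[i], acc1);
            acc2 = Math.fma(a[i + 1], b[i + 1], acc2);
        }
        float res = acc1 + acc2;
        // Tail: at most one leftover element.
        for (; i < a.length; i++) {
            res = Math.fma(a[i], b[i], res);
        }
        return res;
    }
}
```

With only two accumulators in the source, any further unrolling is left to the JIT, which is the trade-off the benchmark numbers above are probing.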