https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899
Bug ID: 95899 Summary: -funroll-loops does not duplicate accumulators when calculating reductions, failing to break up dependency chains Product: gcc Version: 10.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: elrodc at gmail dot com Target Milestone: --- Created attachment 48784 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48784&action=edit cc -march=skylake-avx512 -mprefer-vector-width=512 -Ofast -funroll-loops -S dot.c -o dot.s Sample code: ``` double dot(double* a, double* b, long N){ double s = 0.0; for (long n = 0; n < N; n++){ s += a[n] * b[n]; } return s; } ``` Relevant part of the asm: ``` .L4: vmovupd (%rdi,%r11), %zmm8 vmovupd 64(%rdi,%r11), %zmm9 vfmadd231pd (%rsi,%r11), %zmm8, %zmm0 vmovupd 128(%rdi,%r11), %zmm10 vmovupd 192(%rdi,%r11), %zmm11 vmovupd 256(%rdi,%r11), %zmm12 vmovupd 320(%rdi,%r11), %zmm13 vfmadd231pd 64(%rsi,%r11), %zmm9, %zmm0 vmovupd 384(%rdi,%r11), %zmm14 vmovupd 448(%rdi,%r11), %zmm15 vfmadd231pd 128(%rsi,%r11), %zmm10, %zmm0 vfmadd231pd 192(%rsi,%r11), %zmm11, %zmm0 vfmadd231pd 256(%rsi,%r11), %zmm12, %zmm0 vfmadd231pd 320(%rsi,%r11), %zmm13, %zmm0 vfmadd231pd 384(%rsi,%r11), %zmm14, %zmm0 vfmadd231pd 448(%rsi,%r11), %zmm15, %zmm0 addq $512, %r11 cmpq %r8, %r11 jne .L4 ``` Skylake-AVX512's vfmaddd should have a throughput of 2/cycle, but a latency of 4 cycles. Because each unrolled instance accumulates into `%zmm0`, we are limited by the dependency chain to 1 fma every 4 cycles. It should use separate accumulators. Additionally, if the loads are aligned, it would have a throughput of 2 loads/cycle. Because we need 2 loads per fma, that limits us to only 1 fma per cycle. If the dependency chain were the primary motivation for unrolling, we'd only want to unroll by 4, not 8. 4 cycles of latency, 1 fma per cycle -> 4 simultaneous / OoO fmas. Something like a sum (1 load per add) would perform better with the 8x unrolling seen here (at least, from 100 or so elements until it becomes memory bound).