https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899
Bug ID: 95899
Summary: -funroll-loops does not duplicate accumulators when
calculating reductions, failing to break up dependency
chains
Product: gcc
Version: 10.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: elrodc at gmail dot com
Target Milestone: ---
Created attachment 48784
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48784&action=edit
cc -march=skylake-avx512 -mprefer-vector-width=512 -Ofast -funroll-loops -S
dot.c -o dot.s
Sample code:
```
double dot(double* a, double* b, long N){
    double s = 0.0;
    for (long n = 0; n < N; n++){
        s += a[n] * b[n];
    }
    return s;
}
```
Relevant part of the asm:
```
.L4:
vmovupd (%rdi,%r11), %zmm8
vmovupd 64(%rdi,%r11), %zmm9
vfmadd231pd (%rsi,%r11), %zmm8, %zmm0
vmovupd 128(%rdi,%r11), %zmm10
vmovupd 192(%rdi,%r11), %zmm11
vmovupd 256(%rdi,%r11), %zmm12
vmovupd 320(%rdi,%r11), %zmm13
vfmadd231pd 64(%rsi,%r11), %zmm9, %zmm0
vmovupd 384(%rdi,%r11), %zmm14
vmovupd 448(%rdi,%r11), %zmm15
vfmadd231pd 128(%rsi,%r11), %zmm10, %zmm0
vfmadd231pd 192(%rsi,%r11), %zmm11, %zmm0
vfmadd231pd 256(%rsi,%r11), %zmm12, %zmm0
vfmadd231pd 320(%rsi,%r11), %zmm13, %zmm0
vfmadd231pd 384(%rsi,%r11), %zmm14, %zmm0
vfmadd231pd 448(%rsi,%r11), %zmm15, %zmm0
addq $512, %r11
cmpq %r8, %r11
jne .L4
```
Skylake-AVX512's vfmadd231pd should have a throughput of 2/cycle, but a latency of
4 cycles.
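Because each unrolled instance accumulates into `%zmm0`, every fma has to wait
for the previous one, so the dependency chain limits us to 1 fma every 4 cycles
(an eighth of the 2/cycle peak).
The unrolled copies should use separate accumulators that are combined after
the loop.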
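For illustration, a hand-written sketch of the transformation I'd expect is
below. The name `dot4` and the choice of four accumulators are mine, not
something GCC emits; in the vectorized loop each of `s0`..`s3` would correspond
to its own zmm register. Splitting the sum changes the order of the additions,
which -Ofast already permits.

```c
/* Hypothetical hand-unrolled variant of dot() with four independent
   accumulators, so consecutive multiply-adds do not depend on each other. */
double dot4(const double *a, const double *b, long N)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long n = 0;

    /* Main loop: each statement feeds a different accumulator. */
    for (; n + 4 <= N; n += 4) {
        s0 += a[n]     * b[n];
        s1 += a[n + 1] * b[n + 1];
        s2 += a[n + 2] * b[n + 2];
        s3 += a[n + 3] * b[n + 3];
    }

    /* Combine the partial sums once, after the loop. */
    double s = (s0 + s1) + (s2 + s3);

    /* Remainder loop for the last N % 4 elements. */
    for (; n < N; n++) {
        s += a[n] * b[n];
    }
    return s;
}
```

The four partial sums only meet after the loop, so up to four fmas can be in
flight at once instead of one.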
Additionally, even if the loads are aligned, the core sustains at most 2
loads/cycle. Because we need 2 loads per fma, that limits us to only 1 fma per
cycle. If breaking the dependency chain were the primary motivation for
unrolling, we'd only want to unroll by 4, not 8: 4 cycles of latency x 1 fma
per cycle -> 4 simultaneous / OoO fmas.
Something like a sum (1 load per add) would perform better with the 8x
unrolling seen here (at least, from 100 or so elements until it becomes memory
bound).
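The arithmetic behind the 4 vs. 8 figures can be written down as a small
helper. This is only a back-of-the-envelope sketch: `accumulators_needed` and
its parameters are names I made up, and the 4-cycle latency and 2/cycle add
rate used for the sum case are my assumption that the add behaves like the fma
above.

```c
/* Hypothetical helper: number of independent accumulators needed to keep
   the pipeline busy = latency * sustained ops per cycle, where the
   sustained rate is capped by how fast the loads can be issued. */
int accumulators_needed(double latency_cycles, double peak_ops_per_cycle,
                        double loads_per_cycle, double loads_per_op)
{
    double sustained = peak_ops_per_cycle;

    if (loads_per_op > 0.0) {
        double load_bound = loads_per_cycle / loads_per_op;
        if (load_bound < sustained)
            sustained = load_bound;
    }

    double chains = latency_cycles * sustained;

    /* Round up to a whole number of accumulators, at least one. */
    int n = (int)chains;
    if ((double)n < chains)
        n++;
    return n < 1 ? 1 : n;
}
```

With the numbers from this report, `accumulators_needed(4, 2, 2, 2)` models the
dot product (2 loads per fma) and `accumulators_needed(4, 2, 2, 1)` models a
plain sum (1 load per add), which is where the 4 and the 8 come from.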