https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899

            Bug ID: 95899
           Summary: -funroll-loops does not duplicate accumulators when
                    calculating reductions, failing to break up dependency
                    chains
           Product: gcc
           Version: 10.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: elrodc at gmail dot com
  Target Milestone: ---

Created attachment 48784
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48784&action=edit
cc -march=skylake-avx512 -mprefer-vector-width=512 -Ofast -funroll-loops -S
dot.c -o dot.s

Sample code:

```
double dot(double* a, double* b, long N){
  double s = 0.0;
  for (long n = 0; n < N; n++){
    s += a[n] * b[n];
  }
  return s;
}
```

Relevant part of the asm:
```
.L4:
        vmovupd (%rdi,%r11), %zmm8
        vmovupd 64(%rdi,%r11), %zmm9
        vfmadd231pd     (%rsi,%r11), %zmm8, %zmm0
        vmovupd 128(%rdi,%r11), %zmm10
        vmovupd 192(%rdi,%r11), %zmm11
        vmovupd 256(%rdi,%r11), %zmm12
        vmovupd 320(%rdi,%r11), %zmm13
        vfmadd231pd     64(%rsi,%r11), %zmm9, %zmm0
        vmovupd 384(%rdi,%r11), %zmm14
        vmovupd 448(%rdi,%r11), %zmm15
        vfmadd231pd     128(%rsi,%r11), %zmm10, %zmm0
        vfmadd231pd     192(%rsi,%r11), %zmm11, %zmm0
        vfmadd231pd     256(%rsi,%r11), %zmm12, %zmm0
        vfmadd231pd     320(%rsi,%r11), %zmm13, %zmm0
        vfmadd231pd     384(%rsi,%r11), %zmm14, %zmm0
        vfmadd231pd     448(%rsi,%r11), %zmm15, %zmm0
        addq    $512, %r11
        cmpq    %r8, %r11
        jne     .L4

```

Skylake-AVX512's vfmaddd should have a throughput of 2/cycle, but a latency of
4 cycles.

Because each unrolled instance accumulates into `%zmm0`, we are limited by the
dependency chain to 1 fma every 4 cycles.

It should use separate accumulators.

Additionally, if the loads are aligned, it would have a throughput of 2
loads/cycle. Because we need 2 loads per fma, that limits us to only 1 fma per
cycle. If the dependency chain were the primary motivation for unrolling, we'd
only want to unroll by 4, not 8. 4 cycles of latency, 1 fma per cycle -> 4
simultaneous / OoO fmas.

Something like a sum (1 load per add) would perform better with the 8x
unrolling seen here (at least, from 100 or so elements until it becomes memory
bound).

Reply via email to