https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97147

            Bug ID: 97147
           Summary: GCC uses vhaddpd which is bad for latency
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

typedef double v2df __attribute__((vector_size(16)));
double foo (v2df x, double y)
{
  return x[0] + x[1] + y;
}
double bar (v2df x, double y)
{
  return y + x[0] + x[1];
}

with -O2 -mavx2 -mtune=znver2 ends up generating

foo:
.LFB0:
        .cfi_startproc
        vhaddpd %xmm0, %xmm0, %xmm0
        vaddsd  %xmm1, %xmm0, %xmm0
        ret

bar:
.LFB1:
        .cfi_startproc
        vmovapd %xmm0, %xmm2
        vaddsd  %xmm1, %xmm0, %xmm0
        vunpckhpd       %xmm2, %xmm2, %xmm2
        vaddsd  %xmm2, %xmm0, %xmm0
        ret

where bar should be a _lot_ better according to Agner Fog's instruction
tables, which say that vhaddpd has 4 uops, a latency of 7 cycles and a
throughput of only one per two cycles, while both vunpckhpd and vaddsd fare
much better.  Coffee Lake isn't much better for vhaddpd either.

Maybe we want to disable the V2DF instructions for most tunings somehow?
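
For illustration, here is a minimal sketch of the shuffle-plus-scalar-add
shape written by hand with SSE2 intrinsics.  The helper names hsum_shuffle,
foo_shuffle and check_hsum are hypothetical and not part of GCC.  Whether a
given GCC version keeps this sequence or folds it back into vhaddpd is not
guaranteed; this only shows the intended instruction shape, with a plain
scalar fallback on targets without SSE2.

```c
#include <assert.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

typedef double v2df __attribute__((vector_size(16)));

/* Hypothetical helper (not part of GCC): x[0] + x[1] via a high-lane
   shuffle and a scalar add, i.e. the unpckhpd + addsd shape.  */
static double
hsum_shuffle (v2df x)
{
#if defined(__SSE2__)
  __m128d v = (__m128d) x;
  __m128d hi = _mm_unpackhi_pd (v, v);       /* { x[1], x[1] } */
  return _mm_cvtsd_f64 (_mm_add_sd (v, hi)); /* low lane: x[0] + x[1] */
#else
  return x[0] + x[1];                        /* portable fallback */
#endif
}

/* Same association as foo above: (x[0] + x[1]) + y.  */
static double
foo_shuffle (v2df x, double y)
{
  return hsum_shuffle (x) + y;
}

/* Self-check with exactly representable values.  */
static void
check_hsum (void)
{
  v2df x = { 1.5, 2.25 };
  assert (hsum_shuffle (x) == 3.75);
  assert (foo_shuffle (x, 0.25) == 4.0);
}
```

Note that foo and bar associate the additions differently, (x[0] + x[1]) + y
versus (y + x[0]) + x[1], so their results can differ in the last bit for
inputs that are not exactly representable.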
