https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97147
Bug ID: 97147
Summary: GCC uses vhaddpd which is bad for latency
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rguenth at gcc dot gnu.org
Target Milestone: ---
typedef double v2df __attribute__((vector_size(16)));

double foo (v2df x, double y)
{
  return x[0] + x[1] + y;
}

double bar (v2df x, double y)
{
  return y + x[0] + x[1];
}
With -O2 -mavx2 -mtune=znver2 this ends up generating
foo:
.LFB0:
	.cfi_startproc
	vhaddpd	%xmm0, %xmm0, %xmm0
	vaddsd	%xmm1, %xmm0, %xmm0
	ret
bar:
.LFB1:
	.cfi_startproc
	vmovapd	%xmm0, %xmm2
	vaddsd	%xmm1, %xmm0, %xmm0
	vunpckhpd	%xmm2, %xmm2, %xmm2
	vaddsd	%xmm2, %xmm0, %xmm0
	ret
where bar should be a _lot_ better according to Agner Fog's instruction
tables, which say that vhaddpd takes 4 uops, has a latency of 7 cycles and a
throughput of only one per two cycles, while both vunpckhpd and vaddsd fare
a lot better here. Coffee Lake isn't much better for vhaddpd either.
Maybe we want to disable the V2DF horizontal-add instructions (vhaddpd) for
most tunings somehow?
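
For reference, a minimal source-level sketch of the two summation orders, not a fix for the tuning problem itself. The function names sum_pair_first and sum_chain are made up for illustration. The point is that floating-point addition is not associative, so without -ffast-math the compiler has to keep the order written in the source: x[0] + x[1] first matches a horizontal add, while y + x[0] first leaves only scalar adds plus an extract of the high lane. The codegen described in the comments is the one quoted above for -O2 -mavx2 -mtune=znver2; other options or compiler versions may differ.

```c
/* Uses the GCC/Clang vector extension, as in the testcase above. */
typedef double v2df __attribute__((vector_size(16)));

/* Same association as foo: (x[0] + x[1]) + y.
   The inner sum is a horizontal add of both lanes, which is what
   was turned into vhaddpd in the output quoted above. */
double sum_pair_first(v2df x, double y)
{
    double pair = x[0] + x[1];
    return pair + y;
}

/* Same association as bar: (y + x[0]) + x[1].
   The high-lane extract does not depend on the first add, so it can
   overlap with it; in the quoted output this became
   vaddsd / vunpckhpd / vaddsd. */
double sum_chain(v2df x, double y)
{
    double hi = x[1];        /* independent of the add below */
    double acc = y + x[0];
    return acc + hi;
}
```

Both orders give the same result whenever every intermediate sum is exactly representable, but they may round differently in general, so rewriting one form into the other is a decision for the source author (or for -ffast-math), not something the compiler can do at plain -O2.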