https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97147
Bug ID: 97147
Summary: GCC uses vhaddpd which is bad for latency
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rguenth at gcc dot gnu.org
Target Milestone: ---

typedef double v2df __attribute__((vector_size(16)));
double foo (v2df x, double y)
{
  return x[0] + x[1] + y;
}
double bar (v2df x, double y)
{
  return y + x[0] + x[1];
}

with -O2 -mavx2 -mtune=znver2 ends up generating

foo:
.LFB0:
        .cfi_startproc
        vhaddpd %xmm0, %xmm0, %xmm0
        vaddsd  %xmm1, %xmm0, %xmm0
        ret
bar:
.LFB1:
        .cfi_startproc
        vmovapd %xmm0, %xmm2
        vaddsd  %xmm1, %xmm0, %xmm0
        vunpckhpd       %xmm2, %xmm2, %xmm2
        vaddsd  %xmm2, %xmm0, %xmm0
        ret

where bar should be a _lot_ better according to Agner, who says that
vhaddpd is 4 uops, with a latency of 7 cycles and a throughput of only
one per two cycles, while both vunpckhpd and vaddsd fare a lot better here.
Coffee Lake isn't much better here. Maybe we want to disable the V2DF
instructions for most tunings somehow?
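
For anyone wanting to reproduce this, here is a minimal sketch of the testcase as a standalone file. The file name t.c and the helper check_small_integers are hypothetical and not part of the report. Compiling with something like "gcc -O2 -mavx2 -mtune=znver2 -S t.c" should allow the two sequences to be compared, though the exact output will depend on the GCC revision. Note that foo evaluates (x[0] + x[1]) + y while bar evaluates (y + x[0]) + x[1]. Without -ffast-math the compiler may not reassociate, which is presumably why only foo presents a horizontal add of the two lanes.

```c
#include <assert.h>

/* GNU C vector extension: two doubles in one 16-byte vector. */
typedef double v2df __attribute__((vector_size(16)));

/* (x[0] + x[1]) + y: the lane sum comes first, matching a horizontal add. */
double foo(v2df x, double y)
{
  return x[0] + x[1] + y;
}

/* (y + x[0]) + x[1]: the lanes are never added to each other directly. */
double bar(v2df x, double y)
{
  return y + x[0] + x[1];
}

/* Hypothetical helper (not in the report): sanity-check both variants on
   inputs that are exactly representable, so the different association
   order cannot change the result.  Returns 1 on success. */
int check_small_integers(void)
{
  v2df x = { 1.0, 2.0 };
  assert(foo(x, 4.0) == 7.0);
  assert(bar(x, 4.0) == 7.0);
  return 1;
}
```

The two functions agree on these inputs only because every intermediate sum is exact. In general they can differ in the last bit, so the two instruction sequences are not interchangeable for the same source expression.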