https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89582
Bug ID: 89582
Summary: Suboptimal code generated for floating point struct in -O3 compare to -O2
Product: gcc
Version: 8.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

When testing the code for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89581, I noticed that the generated code is worse when compiled with -O3 than with -O2 on Linux x64.

```
typedef struct {
    double x1;
    double x2;
} vdouble __attribute__((aligned(16)));

vdouble f(vdouble x, vdouble y)
{
    return (vdouble){x.x1 + y.x1, x.x2 + y.x2};
}
```

Compiling with `-O2` produces

```
f:
        addsd   %xmm3, %xmm1
        addsd   %xmm2, %xmm0
        ret
```

With `-O3` or `-Ofast`, however, the code produced is

```
f:
        movq    %xmm0, -40(%rsp)
        movq    %xmm1, -32(%rsp)
        movapd  -40(%rsp), %xmm4
        movq    %xmm2, -24(%rsp)
        movq    %xmm3, -16(%rsp)
        addpd   -24(%rsp), %xmm4
        movaps  %xmm4, -40(%rsp)
        movsd   -32(%rsp), %xmm1
        movsd   -40(%rsp), %xmm0
        ret
```

It seems that GCC tries to use a packed vector instruction (`addpd`) but has to go through the stack to pack the arguments into vectors and to unpack the result. I did a quick benchmark, which confirms that the -O3 version is much slower than the -O2 version.

Clang produces the following whenever any optimization is enabled, which seems appropriate:

```
f:
        addsd   %xmm2, %xmm0
        addsd   %xmm3, %xmm1
        retq
```
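
The benchmark itself is not included in the report. For anyone who wants to reproduce the comparison, the following is a minimal sketch of how one might be written. It is not the reporter's code: the helpers `accumulate` and `time_accumulate`, the use of `clock_gettime`, and the `noinline` attribute (added so that the measured code is the body of `f` rather than an inlined copy) are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime / CLOCK_MONOTONIC */
#include <assert.h>
#include <time.h>

typedef struct {
    double x1;
    double x2;
} vdouble __attribute__((aligned(16)));

/* Same function as in the report. noinline is an addition: without it the
   compiler may inline f into the loop and the code under test disappears. */
__attribute__((noinline))
vdouble f(vdouble x, vdouble y)
{
    return (vdouble){x.x1 + y.x1, x.x2 + y.x2};
}

/* Hypothetical helper: chain `iters` calls to f so that each call depends
   on the previous result and cannot be dropped as dead code. */
vdouble accumulate(long iters)
{
    vdouble acc = {0.0, 0.0};
    vdouble step = {1.0, 2.0};
    for (long i = 0; i < iters; i++)
        acc = f(acc, step);
    return acc;
}

/* Hypothetical helper: return the wall-clock seconds spent in
   accumulate(iters) and store its result in *out. */
double time_accumulate(long iters, vdouble *out)
{
    struct timespec t0, t1;

    assert(iters >= 0 && out != 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *out = accumulate(iters);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

To use it, call `time_accumulate` from a `main` with a large iteration count, print the elapsed time together with the returned sums, and build the file once with `-O2` and once with `-O3`. Whether the difference shows up, and how large it is, will depend on the compiler version and the CPU.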