https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121230
Bug ID: 121230
Summary: x86: Inefficient code generation with -m3dnow -msse since GCC 12
Product: gcc
Version: 15.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: manx-bugzilla at problemloesungsmaschine dot de
Target Milestone: ---

Consider the following C code:
```
typedef struct {
    float a;
    float b;
} f32_2;

f32_2 add32_2(f32_2 x, f32_2 y) {
    return (f32_2){ x.a + y.a, x.b + y.b};
}
```
Godbolt link: https://godbolt.org/z/T6To8qbe1

GCC 15.1 -m32 -march=athlon-xp -std=c11 -O3 generates:
```
(top left)
add32_2:
        sub     esp, 12
        fld     DWORD PTR [esp+28]
        fadd    DWORD PTR [esp+20]
        mov     eax, DWORD PTR [esp+16]
        fstp    DWORD PTR [esp+4]
        fld     DWORD PTR [esp+32]
        fadd    DWORD PTR [esp+24]
        movss   xmm0, DWORD PTR [esp+4]
        fstp    DWORD PTR [esp+4]
        movss   xmm1, DWORD PTR [esp+4]
        unpcklps        xmm0, xmm1
        movlps  QWORD PTR [eax], xmm0
        add     esp, 12
        ret     4
```
which unnecessarily channels the return value through an XMM register: each x87 result is spilled to the stack, reloaded into an XMM register, and the two halves are then packed and stored with a single 64-bit store.

This does not happen with GCC 15.1 -m32 -march=pentium3 -std=c11 -O3:
```
(top center)
add32_2:
        fld     DWORD PTR [esp+20]
        fadd    DWORD PTR [esp+12]
        mov     eax, DWORD PTR [esp+4]
        fld     DWORD PTR [esp+16]
        fadd    DWORD PTR [esp+8]
        fstp    DWORD PTR [eax]
        fstp    DWORD PTR [eax+4]
        ret     4
```
or with GCC 11.4 -m32 -march=athlon-xp -std=c11 -O3:
```
(top right)
add32_2:
        fld     DWORD PTR [esp+20]
        fadd    DWORD PTR [esp+12]
        mov     eax, DWORD PTR [esp+4]
        fld     DWORD PTR [esp+16]
        fadd    DWORD PTR [esp+8]
        fstp    DWORD PTR [eax]
        fstp    DWORD PTR [eax+4]
        ret     4
```

Note: Athlon-XP supports MMX, 3DNOW, and SSE1, while Pentium3 supports MMX and SSE1. GCC apparently chooses -mfpmath=387 instead of -mfpmath=sse for both (which is probably fine, and not the subject of this issue).
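
To double-check which ISA extensions and which scalar floating-point unit a given set of flags actually selects, the compiler's predefined macros can be inspected. The following is a minimal sketch, assuming GCC's documented x86 macros `__3dNOW__`, `__SSE__`, and `__SSE_MATH__`; the helper names (`has_3dnow`, `has_sse`, `uses_sse_math`, `report_target`) are made up for illustration and are not part of the original report.

```c
#include <assert.h>
#include <stdio.h>

/* 1 if 3DNow! code generation is enabled for this translation unit
 * (e.g. via -m3dnow or an -march value that implies it), else 0. */
static int has_3dnow(void) {
#ifdef __3dNOW__
    return 1;
#else
    return 0;
#endif
}

/* 1 if SSE code generation is enabled (e.g. via -msse), else 0. */
static int has_sse(void) {
#ifdef __SSE__
    return 1;
#else
    return 0;
#endif
}

/* 1 if scalar float arithmetic is done in SSE registers
 * (-mfpmath=sse, or the x86-64 default), else 0 (x87 or non-x86). */
static int uses_sse_math(void) {
#ifdef __SSE_MATH__
    return 1;
#else
    return 0;
#endif
}

/* Print a one-line summary of the settings the compiler reports. */
static void report_target(FILE *out) {
    assert(out != NULL);
    fprintf(out, "3DNow!: %d, SSE: %d, scalar float math: %s\n",
            has_3dnow(), has_sse(),
            uses_sse_math() ? "sse" : "387 or other");
}
```

Calling `report_target(stdout)` from a small `main` built with the flag sets from this report should show whether `-march=athlon-xp` and `-march=pentium3` differ only in the 3DNow! setting.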

Even if I force -mfpmath=sse, the code generation still looks a bit weird:

GCC 15.1 -m32 -march=i686 -mmmx -m3dnow -msse -msse2 -mfpmath=sse -std=c11 -O3:
```
(bottom left)
add32_2:
        movss   xmm0, DWORD PTR [esp+8]
        movss   xmm1, DWORD PTR [esp+12]
        addss   xmm0, DWORD PTR [esp+16]
        addss   xmm1, DWORD PTR [esp+20]
        mov     eax, DWORD PTR [esp+4]
        unpcklps        xmm0, xmm1
        movlps  QWORD PTR [eax], xmm0
        ret     4
```
compared to the same flags without -m3dnow, GCC 15.1 -m32 -march=i686 -mmmx -msse -msse2 -mfpmath=sse -std=c11 -O3:
```
(bottom center)
add32_2:
        movss   xmm0, DWORD PTR [esp+12]
        movss   xmm1, DWORD PTR [esp+8]
        mov     eax, DWORD PTR [esp+4]
        addss   xmm0, DWORD PTR [esp+20]
        addss   xmm1, DWORD PTR [esp+16]
        movss   DWORD PTR [eax+4], xmm0
        movss   DWORD PTR [eax], xmm1
        ret     4
```

Clang, for comparison, defaults to generating SSE1 instructions, and does not even support -mfpmath=387 with -msse, or -m3dnow at all.

Clang 20.1.0 -m32 -march=i686 -mmmx -msse -msse2 -std=c11 -O3:
```
(bottom right)
add32_2:
        mov     eax, dword ptr [esp + 4]
        movsd   xmm0, qword ptr [esp + 8]
        movsd   xmm1, qword ptr [esp + 16]
        addps   xmm1, xmm0
        movlps  qword ptr [eax], xmm1
        ret     4
```

As far as I know, GCC does not generate 3DNOW instructions by itself, which makes it even stranger that -m3dnow appears to influence (and worsen) code generation of both x87 and SSE1 instructions.

The problem appears to have been introduced in GCC 12.
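
For anyone comparing the flag sets locally rather than on Godbolt, a self-checking version of the test case may be convenient. This is only a sketch: `add32_2` and `f32_2` are taken from the report, while the `check_add32_2` helper and its input values are assumptions added here for illustration. The inputs are chosen so that the sums are exactly representable in `float`, so the comparison should hold under both x87 and SSE arithmetic.

```c
#include <assert.h>

typedef struct {
    float a;
    float b;
} f32_2;

/* The struct is expected to be two packed floats, which is what allows
 * a single 64-bit store (movlps) of the whole result. */
static_assert(sizeof(f32_2) == 2 * sizeof(float), "f32_2 has padding");

/* Function under test, as given in the report. */
f32_2 add32_2(f32_2 x, f32_2 y) {
    return (f32_2){ x.a + y.a, x.b + y.b };
}

/* Hypothetical helper: returns 1 if add32_2 produces the expected sums. */
static int check_add32_2(void) {
    f32_2 r = add32_2((f32_2){ 1.0f, 2.0f }, (f32_2){ 3.0f, 4.5f });
    return r.a == 4.0f && r.b == 6.5f;
}
```

Compiling this file once per flag set with `-S -masm=intel` added should reproduce the listings above, and calling `check_add32_2()` from a small `main` confirms that every variant still computes the same result.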