This is spinoff #1 of PR 17619:

Take this simple piece of code:
---------------------
float a[2],b[2];

float foobar () {
  return a[0] * b[0] + a[1] * b[1];
}
---------------------

Compiled with
  -O3 -funroll-loops -msse3 -mtune=pentium4 -march=pentium4 -mfpmath=387
we get this code:
---------------------
        pushl   %ebp
        movl    %esp, %ebp
        flds    b
        fmuls   a
        flds    b+4
        fmuls   a+4
        faddp   %st, %st(1)
        popl    %ebp
        ret
---------------------
That's certainly optimal.

On the other hand, if we let the compiler use SSE registers as well
(we do not force it to, we simply want the most efficient code),
the code we get with flags
  -O3 -funroll-loops -msse3 -mtune=pentium4 -march=pentium4 -mfpmath=387,sse
looks like this:
---------------------
        pushl   %ebp
        movl    %esp, %ebp
        subl    $4, %esp
        flds    b
        fmuls   a
        movss   b+4, %xmm0
        mulss   a+4, %xmm0
        movss   %xmm0, -4(%ebp)
        flds    -4(%ebp)
        faddp   %st, %st(1)
        leave
        ret
---------------------
The code is almost equivalent, except that the second product is computed
in %xmm0 and then takes one additional store to and load from the stack
to get it onto the x87 stack, since the system ABI requires that return
values are passed through st(0).

In essence, the compiler should just generate the first code sequence,
even if given the flag -mfpmath=387,sse.

W.
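For anyone trying to reproduce this, a minimal self-checking variant of the
test case might look like the sketch below. The helper name check_foobar and
the input values are illustrative assumptions and not part of the original
report; foobar itself is unchanged apart from the (void) parameter list.

```c
#include <assert.h>

float a[2], b[2];

/* Same function as in the report: a two-element dot product. */
float foobar (void) {
  return a[0] * b[0] + a[1] * b[1];
}

/* Hypothetical helper (not in the original report): fills the arrays with
   small values whose products and sum are exactly representable in float,
   then checks that foobar returns the expected result. */
float check_foobar (void) {
  a[0] = 1.0f; a[1] = 2.0f;
  b[0] = 3.0f; b[1] = 4.0f;
  float r = foobar ();
  assert (r == 11.0f);   /* 1*3 + 2*4 */
  return r;
}
```

Calling check_foobar from a main function and building with each of the two
flag sets should give the same result either way; only the generated code for
foobar differs. Adding -S to the command line stops gcc after emitting
assembly, which is probably the easiest way to compare the two sequences. On
a 64-bit host, -m32 would presumably also be needed to target ia32.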
--
           Summary: Inefficient code with -mfpmath=387,sse
           Product: gcc
           Version: 4.0.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: bangerth at dealii dot org
                CC: gcc-bugs at gcc dot gnu dot org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18766