When scanning an array of float values for minima and maxima (among other tasks), the compiler correctly uses the "minss" instruction; however, it does so as:

    movaps  %samplereg,%tempreg
    minss   %minreg,%tempreg
    movaps  %tempreg,%minreg

This could be done simply as:

    minss   %samplereg,%minreg

without a temporary register and the extra move latency it brings.

Code is roughly:

    {
        float minreg = ...
        float maxreg = ...
        float sumreg = 0.0;
        int sumcount = 0;

        loopconstruct {
            float samplereg = source[idx];
            minreg = (minreg > samplereg) ? samplereg : minreg;
            maxreg = (maxreg < samplereg) ? samplereg : maxreg;
            sumreg += samplereg;
            ++sumcount;
        }
        // ... use results
    }

Oddly, the addition to sumreg is done without the register shuffles mentioned above.

Furthermore, the math library functions fminf()/fmaxf() (and fmin()/fmax() for double) would benefit from being mapped to intrinsic minss/maxss processing. Currently they generate calls into the math library, where they are implemented with minss/maxss.

A further optimization would be to unroll that loop and use packed float values in xmm registers to do up to 4 operations in parallel. At the C level, minreg/maxreg/sumreg could be declared as:

    float minreg[4];

and the code would have an explicit loop from 0 to 3 processing each set of samples.

-- 
Summary: floating point optimizations needlessly shuffle xmm registers
Product: gcc
Version: 4.4.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: matti dot aarnio--gcc-bugs at zmailer dot org
GCC build triplet: x86_64-redhat-linux
GCC host triplet: x86_64-redhat-linux
GCC target triplet: x86_64-redhat-linux

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42682
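
The packed variant suggested in the report (float minreg[4] with an explicit loop from 0 to 3) might look roughly like the sketch below. It is only an illustration: the names scan_lanes, struct stats and LANES are invented here and do not come from the original code, and it assumes the sample count is a nonzero multiple of 4. It does not show whether any particular GCC version actually turns the inner loop into packed instructions. Summing per lane also reorders the additions, so the rounded sum can differ from that of the scalar loop.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* one xmm register holds 4 packed floats */

/* Hypothetical result type, introduced for this sketch only. */
struct stats {
    float  min;
    float  max;
    float  sum;
    size_t count;
};

/* Assumption: n > 0 and n is a multiple of LANES.
 * A real version would need a scalar tail loop for leftovers. */
struct stats scan_lanes(const float *source, size_t n)
{
    float minreg[LANES], maxreg[LANES], sumreg[LANES];
    struct stats r;
    size_t idx;
    int k;

    assert(n > 0 && n % LANES == 0);

    /* Seed every lane from the first group of samples. */
    for (k = 0; k < LANES; ++k) {
        minreg[k] = source[k];
        maxreg[k] = source[k];
        sumreg[k] = source[k];
    }

    /* Each pass handles LANES independent samples, one per lane. */
    for (idx = LANES; idx < n; idx += LANES) {
        for (k = 0; k < LANES; ++k) {
            float samplereg = source[idx + k];
            minreg[k] = (minreg[k] > samplereg) ? samplereg : minreg[k];
            maxreg[k] = (maxreg[k] < samplereg) ? samplereg : maxreg[k];
            sumreg[k] += samplereg;
        }
    }

    /* Horizontal reduction: fold the lanes into scalar results. */
    r.min = minreg[0];
    r.max = maxreg[0];
    r.sum = sumreg[0];
    for (k = 1; k < LANES; ++k) {
        r.min = (r.min > minreg[k]) ? minreg[k] : r.min;
        r.max = (r.max < maxreg[k]) ? maxreg[k] : r.max;
        r.sum += sumreg[k];
    }
    r.count = n;
    return r;
}
```

Calling scan_lanes(data, 8) on an 8-element float array returns the overall minimum, maximum, sum and sample count. The min/max expressions are kept in the same ternary form as in the scalar loop, so each lane performs the same comparison as the original code.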