When scanning an array of float values for minima and maxima, among other
tasks, the compiler correctly uses the "minss" instruction, but it does so
like this:

    movaps %samplereg,%tempreg
    minss  %minreg,%tempreg
    movaps %tempreg,%minreg

This could be done simply as:

    minss %samplereg,%minreg

without the need for a temporary register and the latency of the extra
register-to-register moves.
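For reference, Intel documents MINSS/MAXSS as returning the second (source) operand when the operands compare equal (e.g. +0.0 and -0.0) or when either is a NaN. In AT&T syntax, `minss %samplereg,%minreg` therefore computes `minreg = (minreg < samplereg) ? minreg : samplereg`. The sketch below writes the update in that operand order; the helper names `min_update`/`max_update` are hypothetical, and whether a given GCC version then emits a single minss/maxss for them is an assumption not verified here.

```c
/* Hypothetical helpers: the update is written in the operand order that
 * MINSS/MAXSS implement when the accumulator is the destination:
 *   min: dest = (dest < src) ? dest : src
 *   max: dest = (dest > src) ? dest : src
 * On a tie or a NaN, the sample (the source operand) is returned. */
static inline float min_update(float minreg, float samplereg)
{
    return (minreg < samplereg) ? minreg : samplereg;
}

static inline float max_update(float maxreg, float samplereg)
{
    return (maxreg > samplereg) ? maxreg : samplereg;
}
```

The expression in the code below, `(minreg > samplereg) ? samplereg : minreg`, instead keeps minreg on ties and NaNs. That matches minss with the sample register as the destination, which is likely why the result is computed in a temporary and copied back.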

The code is roughly:

{
   float minreg = ...
   float maxreg = ...
   float sumreg = 0.0;
   int   sumcount = 0;
   loopconstruct {
      float samplereg = source[idx];
      minreg = (minreg > samplereg) ? samplereg : minreg;
      maxreg = (maxreg < samplereg) ? samplereg : maxreg;
      sumreg += samplereg;
      ++sumcount;
   }
   // ... use results
}
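A self-contained version of the loop above, as a sketch for reproducing the issue. The following are assumptions not in the original: `loopconstruct` becomes a plain `for` loop over `n >= 1` samples, min/max start from the first sample (the original elides the initial values), and the names `scan_floats`/`scan_result` are hypothetical. Compiling it with `gcc -O2 -S` on x86_64 and reading the generated `.s` file is one way to look for the movaps/minss/movaps pattern.

```c
#include <stddef.h>

/* Results of one pass over the samples. */
struct scan_result {
    float min, max, sum;
    int   count;
};

/* Assumes n >= 1; min/max are seeded from the first sample. */
struct scan_result scan_floats(const float *source, size_t n)
{
    float minreg = source[0];
    float maxreg = source[0];
    float sumreg = 0.0f;
    int   sumcount = 0;

    for (size_t idx = 0; idx < n; ++idx) {
        float samplereg = source[idx];
        minreg = (minreg > samplereg) ? samplereg : minreg;
        maxreg = (maxreg < samplereg) ? samplereg : maxreg;
        sumreg += samplereg;
        ++sumcount;
    }

    struct scan_result r = { minreg, maxreg, sumreg, sumcount };
    return r;
}
```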


Oddly, the addition to sumreg is done without the register shuffles mentioned
above.

Furthermore, the math-library functions fminf()/fmaxf() (and fmin()/fmax() for
double) would benefit from being mapped to inline minss/maxss instructions
(minsd/maxsd for double). Currently they compile to math library calls, and the
library implements them with minss/maxss anyway.
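As a possible workaround on x86 until such a mapping exists, the SSE intrinsics from `<xmmintrin.h>` correspond directly to minss/maxss. This is a sketch, not a drop-in replacement for fminf()/fmaxf(): C99 specifies that those return the non-NaN argument when exactly one argument is a NaN, whereas these helpers follow the instruction semantics and return the second argument. The helper names `min_ss`/`max_ss` are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics; x86/x86_64 only */

/* Operates on the low lane only; expected to become a single minss.
 * If either argument is a NaN, b is returned (MINSS semantics). */
static inline float min_ss(float a, float b)
{
    return _mm_cvtss_f32(_mm_min_ss(_mm_set_ss(a), _mm_set_ss(b)));
}

/* Same for maxss: if either argument is a NaN, b is returned. */
static inline float max_ss(float a, float b)
{
    return _mm_cvtss_f32(_mm_max_ss(_mm_set_ss(a), _mm_set_ss(b)));
}
```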

Another optimization opportunity would be to unroll that loop and use packed
float values in the xmm registers to do up to 4 operations in parallel. At the C
level, minreg/maxreg/sumreg could be described as arrays:

    float minreg[4];

and the code would have an explicit inner loop from 0 to 3 processing each set
of four samples.
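The four-lane layout described above might look like the following sketch, which a vectorizer could in principle map onto packed minps/maxps/addps. Assumptions: `n` is a nonzero multiple of 4 (no tail handling), each lane is seeded from its first sample, and the names `scan_floats4`/`scan_result4` are hypothetical.

```c
#include <stddef.h>

struct scan_result4 {
    float min, max, sum;
    int   count;
};

/* Sketch of the four-lane layout; assumes n is a nonzero multiple of 4. */
struct scan_result4 scan_floats4(const float *source, size_t n)
{
    float minreg[4], maxreg[4];
    float sumreg[4] = { 0.0f, 0.0f, 0.0f, 0.0f };

    /* Seed each lane with its first sample. */
    for (int k = 0; k < 4; ++k)
        minreg[k] = maxreg[k] = source[k];

    /* Each iteration handles one set of four samples, one per lane. */
    for (size_t idx = 0; idx < n; idx += 4) {
        for (int k = 0; k < 4; ++k) {
            float samplereg = source[idx + k];
            minreg[k] = (minreg[k] > samplereg) ? samplereg : minreg[k];
            maxreg[k] = (maxreg[k] < samplereg) ? samplereg : maxreg[k];
            sumreg[k] += samplereg;
        }
    }

    /* Horizontal reduction of the four lanes to scalar results. */
    struct scan_result4 r = { minreg[0], maxreg[0], sumreg[0], (int)n };
    for (int k = 1; k < 4; ++k) {
        r.min = (r.min > minreg[k]) ? minreg[k] : r.min;
        r.max = (r.max < maxreg[k]) ? maxreg[k] : r.max;
        r.sum += sumreg[k];
    }
    return r;
}
```

Splitting the sum into four partial sums reassociates the float additions, so the result can differ in rounding from the scalar loop. That is why a compiler generally may not make this change to the scalar code on its own without flags such as -ffast-math.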


-- 
           Summary: floating point optimizations needlessly shuffle xmm
                    registers
           Product: gcc
           Version: 4.4.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: matti dot aarnio--gcc-bugs at zmailer dot org
 GCC build triplet: x86_64-redhat-linux
  GCC host triplet: x86_64-redhat-linux
GCC target triplet: x86_64-redhat-linux


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42682
