On the fatigue benchmark of the Polyhedron benchmark suite, gfortran spends ~8%
of the total runtime on the line

 40215  7.8856     :      generalized_constitutive_tensor(:,:) = 0.0_LONGreal

(The numbers are oprofile data: sample count and percentage of total samples.)

generalized_constitutive_tensor is a local 6x6 double precision array. Thanks
to a patch by Roger Sayle
(http://gcc.gnu.org/ml/gcc-patches/2006-12/msg01271.html), the double loop is
now replaced by a call to __builtin_memset. This reduces the time spent
significantly compared to a double loop using packed SSE2 instructions
(compiled with -mfpmath=sse -ftree-vectorize -funroll-loops -ffast-math
-march=athlon64), which takes ~15% of the runtime on the same line:

 96576 15.1487  :      generalized_constitutive_tensor(:,:) = 0.0_LONGreal

However, a certain commercial Fortran compiler still spends less than half the
time that we do:

 17144  2.4210  :      generalized_constitutive_tensor(:,:) = 0.0_LONGreal

Looking at the asm profile, gcc expands the __builtin_memset to

                               :      generalized_constitutive_tensor(:,:) = 0.0_LONGreal
   600  0.1200     0       0   : 8048b09:       lea    0xfffffe68(%ebp),%edx
                               : 8048b0f:       mov    0x18(%ecx),%ecx
   710  0.1420     0       0   : 8048b12:       mov    %edx,0xfffffe34(%ebp)
                               : 8048b18:       mov    0xfffffe34(%ebp),%edi
                               : 8048b1e:       mov    %esi,0xfffffe54(%ebp)
   664  0.1328     0       0   : 8048b24:       mov    %eax,%esi
                               : 8048b26:       mov    %ecx,0xfffffe50(%ebp)
                               : 8048b2c:       sub    %ecx,%esi
   651  0.1302     0       0   : 8048b2e:       xor    %eax,%eax
                               : 8048b30:       mov    $0x48,%ecx
 38704  7.7411     0       0   : 8048b35:       rep stos %eax,%es:(%edi)

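i.e. the classical "rep stos", here storing 0x48 (72) 4-byte words, which is
the 288 bytes of the array. The commercial compiler instead emits a loop of
aligned 16-byte SSE2 stores: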

                               :      generalized_constitutive_tensor(:,:) = 0.0_LONGreal
    71  0.0100     0       0   : 80518ca:       pxor   %xmm0,%xmm0
   608  0.0859     0       0   : 80518ce:       mov    0x14(%ebp),%esi
  2388  0.3372     0       0   : 80518d1:       movapd %xmm0,0x80bb780(%ebx)
  3856  0.5445     0       0   : 80518d9:       movapd %xmm0,0x80bb790(%ebx)
  5494  0.7758     0       0   : 80518e1:       add    $0x20,%ebx
  3237  0.4571     0       0   : 80518e4:       cmp    $0x120,%ebx
  2098  0.2963     0       0   : 80518ea:       jb     80518d1 <perdida_m_mp_generalized_hookes_law_.+0x13>
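
For reference, the two strategies can be sketched in C roughly as follows. This
is only an illustration, not code from either compiler: the function names
zero_memset and zero_sse2 are hypothetical, the tensor is flattened to 36
doubles, and the SSE2 variant assumes a 16-byte aligned array (as movapd
requires). It mirrors the loop above: two 16-byte stores per iteration, nine
iterations for 0x120 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: _mm_setzero_pd, _mm_store_pd */
#endif

#define N 6   /* 6x6 tensor: 36 doubles = 288 bytes (0x120) */

/* Roughly what gfortran emits now: a single memset over the whole array,
   which gcc then expands inline (to "rep stos" in the profile above). */
static void zero_memset(double t[N * N])
{
    memset(t, 0, N * N * sizeof(double));
}

/* Hypothetical hand-written counterpart of the other compiler's loop.
   Assumption: t is 16-byte aligned, since _mm_store_pd maps to movapd. */
static void zero_sse2(double t[N * N])
{
    assert((uintptr_t)t % 16 == 0);
#if defined(__SSE2__)
    const __m128d zero = _mm_setzero_pd();      /* pxor %xmm0,%xmm0 */
    for (size_t i = 0; i < N * N; i += 4) {     /* 32 bytes per iteration */
        _mm_store_pd(t + i, zero);              /* doubles i, i+1 */
        _mm_store_pd(t + i + 2, zero);          /* doubles i+2, i+3 */
    }
#else
    /* Portable fallback when SSE2 is not available at compile time. */
    for (size_t i = 0; i < N * N; i++)
        t[i] = 0.0;
#endif
}
```

Both functions leave the array all zeros; the difference is only in the store
sequence the compiler ends up emitting.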


-- 
           Summary: Suboptimal builtin_memset on x86 with SSE
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: middle-end
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: jb at gcc dot gnu dot org
GCC target triplet: i686-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31750
