On the fatigue benchmark of the polyhedron benchmark suite, gfortran spends ~8% of the total runtime on the line

 40215  7.8856 :      generalized_constitutive_tensor(:,:) = 0.0_LONGreal

(The numbers are oprofile data.)

generalized_constitutive_tensor is a local 6x6 double precision array.

Now, thanks to a patch by Roger Sayle

http://gcc.gnu.org/ml/gcc-patches/2006-12/msg01271.html

the double loop is replaced by a call to __builtin_memset. This reduces the time significantly compared to a double loop using packed SSE2 instructions (compiled with -mfpmath=sse -ftree-vectorize -funroll-loops -ffast-math -march=athlon64), which spends ~15% of the runtime on the same line:

 96576 15.1487 :      generalized_constitutive_tensor(:,:) = 0.0_LONGreal

However, a certain commercial Fortran compiler still spends less than half the time that we do:

 17144  2.4210 :      generalized_constitutive_tensor(:,:) = 0.0_LONGreal

Looking at the asm profile, gcc expands the __builtin_memset to

                           :      generalized_constitutive_tensor(:,:) = 0.0_LONGreal
   600  0.1200     0     0 : 8048b09:       lea    0xfffffe68(%ebp),%edx
                           : 8048b0f:       mov    0x18(%ecx),%ecx
   710  0.1420     0     0 : 8048b12:       mov    %edx,0xfffffe34(%ebp)
                           : 8048b18:       mov    0xfffffe34(%ebp),%edi
                           : 8048b1e:       mov    %esi,0xfffffe54(%ebp)
   664  0.1328     0     0 : 8048b24:       mov    %eax,%esi
                           : 8048b26:       mov    %ecx,0xfffffe50(%ebp)
                           : 8048b2c:       sub    %ecx,%esi
   651  0.1302     0     0 : 8048b2e:       xor    %eax,%eax
                           : 8048b30:       mov    $0x48,%ecx
 38704  7.7411     0     0 : 8048b35:       rep stos %eax,%es:(%edi)

i.e. the classical "rep stos".
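
To make the count in %ecx concrete, here is a minimal C sketch (not taken from the benchmark or from gcc itself) that works out how many stores are needed to clear the tensor. It assumes sizeof(double) == 8, as on i686 and x86-64; the function names tensor_bytes, stores_needed and zero_tensor are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { N = 6 };  /* the report describes a local 6x6 array */

/* Size of the tensor in bytes: 6 * 6 * sizeof(double).
 * Assumption: sizeof(double) == 8, which gives 288 = 0x120. */
size_t tensor_bytes(void)
{
    return (size_t)N * N * sizeof(double);
}

/* Number of stores of `width` bytes needed to cover the tensor.
 * width = 4 models a doubleword "stos"; width = 16 models one
 * 16-byte SSE2 store. */
size_t stores_needed(size_t width)
{
    assert(width != 0 && tensor_bytes() % width == 0);
    return tensor_bytes() / width;
}

/* Rough C counterpart of the Fortran assignment
 *     generalized_constitutive_tensor(:,:) = 0.0_LONGreal
 * All-bits-zero is 0.0 for IEEE 754 doubles, so memset is valid here. */
void zero_tensor(double t[N][N])
{
    memset(t, 0, sizeof(double[N][N]));
}
```

Under that assumption, stores_needed(4) is 72 = 0x48, which matches the value loaded into %ecx before the "rep stos", while stores_needed(16) is 18, i.e. the same 288 bytes could in principle be cleared with 18 aligned 16-byte stores.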

What the commercial compiler does is:

                           :      generalized_constitutive_tensor(:,:) = 0.0_LONGreal
    71  0.0100     0     0 : 80518ca:       pxor   %xmm0,%xmm0
   608  0.0859     0     0 : 80518ce:       mov    0x14(%ebp),%esi
  2388  0.3372     0     0 : 80518d1:       movapd %xmm0,0x80bb780(%ebx)
  3856  0.5445     0     0 : 80518d9:       movapd %xmm0,0x80bb790(%ebx)
  5494  0.7758     0     0 : 80518e1:       add    $0x20,%ebx
  3237  0.4571     0     0 : 80518e4:       cmp    $0x120,%ebx
  2098  0.2963     0     0 : 80518ea:       jb     80518d1 <perdida_m_mp_generalized_hookes_law_.+0x13>


-- 
           Summary: Suboptimal builtin_memset on x86 with SSE
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: middle-end
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: jb at gcc dot gnu dot org
GCC target triplet: i686-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31750