https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69274
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Samples: 2M of event 'cycles', Event count (approx.): 1928893785632
  36.40%  gromacs_base.am  gromacs_base.amd64-m64-gcc42-nn  [.] inl1130_
  28.60%  gromacs_peak.am  gromacs_peak.amd64-m64-gcc42-nn  [.] inl1130_
   7.51%  gromacs_base.am  gromacs_base.amd64-m64-gcc42-nn  [.] search_neighbour
   7.38%  gromacs_peak.am  gromacs_peak.amd64-m64-gcc42-nn  [.] search_neighbour
   2.00%  gromacs_base.am  gromacs_base.amd64-m64-gcc42-nn  [.] inl1100_
   2.00%  gromacs_peak.am  gromacs_peak.amd64-m64-gcc42-nn  [.] inl1100_

so that's innerf.f

Ok, so I spot one non-scheduling/RA difference:

-       vmovss  52(%rsp), %xmm6
-       vsubss  -4(%r13,%rdi,4), %xmm6, %xmm6
-.LVL250:
+       vsubss  -4(%r13,%rdi,4), %xmm6, %xmm4
+.LVL253:
        leaq    (%r15,%rsi,4), %r12
        .loc 1 662 0
-       vmulss  %xmm4, %xmm4, %xmm2
-       vmovss  %xmm4, 24(%rsp)
-       vmovss  %xmm5, 20(%rsp)
-       vmovss  %xmm6, 16(%rsp)
-       vfmadd231ss     %xmm5, %xmm5, %xmm2
-       vfmadd231ss     %xmm6, %xmm6, %xmm2
-.LVL251:
+       vmulss  %xmm2, %xmm2, %xmm1
+       vmovss  %xmm2, 24(%rsp)
+       vmovss  %xmm3, 20(%rsp)
+       vmovss  %xmm4, 16(%rsp)
+       vfmadd231ss     %xmm3, %xmm3, %xmm1
+       vmovaps %xmm1, %xmm7
+       vfmadd231ss     %xmm4, %xmm4, %xmm7
+.LVL254:

thus there seems to be some more spilling.