http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51497
--- Comment #1 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-12-10 18:39:15 UTC ---

The profiles are without -flto:

+ 34.6%, nf3dprecon.2105.constprop.1, a.out
| 34.6%, nf2dprecon.2116, a.out
33.5%, spmmult.2139, a.out
+ 29.8%, nfcg_, a.out
| + 7.6%, nf3dprecon.2105.constprop.1, a.out
| | 0.4%, nf2dprecon.2116, a.out
| 0.4%, nf2dprecon.2116, a.out
0.9%, mattest_, a.out

and with -flto:

+ 37.7%, nf3dprecon.2105.2457.constprop.1.2435, a.out
| 37.7%, nf2dprecon.2116.2442.2436, a.out
32.7%, spmmult.2139.2426.2446, a.out
+ 27.6%, nfcg_, a.out
| + 7.0%, nf3dprecon.2105.2457.constprop.1.2435, a.out
| | 0.4%, nf2dprecon.2116.2442.2436, a.out
| 0.4%, nf2dprecon.2116.2442.2436, a.out
| 0.0%, free, libSystem.B.dylib
0.8%, mattest_, a.out

So the slow routines are nf2dprecon, accounting for ~1.2s, and spmmult,
accounting for ~0.5s.

If I am reading the assembly correctly, in nf2dprecon, the implicit loop

    x(i:i+nx-1) = x(i:i+nx-1) - au2(i-nx:i-1)*x(i-nx:i-1)

is unrolled eight times without -flto and four times with -flto.

In spmmult, the implicit loop

    b = ad*x

is unrolled four times and vectorized without -flto, and unrolled eight
times, but not vectorized, with -flto.

Note that --param max-unroll-times=4 does not change the times.
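For anyone wanting to look at the two implicit loops in isolation, here is a minimal sketch of a reduced test case. Only the two array statements are taken from the comment above. The module and routine names (loops_sketch, precon_sketch, spmmult_sketch), the array sizes, the initial values and the bounds of the enclosing do loop are assumptions made for illustration, so it may well not reproduce the unrolling and vectorization differences seen in the real routines.

```fortran
! Hypothetical reduced test case: only the two array statements are
! taken from the comment; all names, sizes and loop bounds are assumed.
module loops_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Stand-in for the implicit loop reported in nf2dprecon.
  subroutine precon_sketch(x, au2, nx, n)
    integer, intent(in) :: nx, n
    real(dp), intent(inout) :: x(n)
    real(dp), intent(in) :: au2(n)
    integer :: i
    ! Assumed outer loop: walk the vector in blocks of nx elements,
    ! each block being updated from the block just before it.
    do i = nx + 1, n - nx + 1, nx
      x(i:i+nx-1) = x(i:i+nx-1) - au2(i-nx:i-1)*x(i-nx:i-1)
    end do
  end subroutine precon_sketch

  ! Stand-in for the implicit loop reported in spmmult.
  subroutine spmmult_sketch(b, ad, x)
    real(dp), intent(out) :: b(:)
    real(dp), intent(in) :: ad(:), x(:)
    b = ad*x
  end subroutine spmmult_sketch

end module loops_sketch

program main
  use loops_sketch
  implicit none
  integer, parameter :: nx = 64, n = nx*nx
  real(dp) :: x(n), au2(n), ad(n), b(n)

  x = 1.0_dp
  au2 = 0.5_dp
  ad = 2.0_dp
  call precon_sketch(x, au2, nx, n)
  call spmmult_sketch(b, ad, x)
  ! Print something that depends on the results so the calls are kept.
  print *, sum(b)
end program main
```

One possible way to use it is to build the file twice with the same optimization options, once with and once without -flto, and compare the disassembly of the two executables for the unroll factor and for vector instructions. Since the routines are tiny and sit in the same file as the caller, they may be inlined, which could hide the effect.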