https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matz at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so re-running perf gives me a more reasonable result (-march=native on
Haswell):

Overhead  Samples  Command          Shared Object                   Symbol
  15.59%   754868  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] forms_
  15.55%   749452  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] forms_
  10.77%   496796  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] twotff_
   7.58%   377894  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] dirfck_
   7.57%   375587  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] dirfck_
   7.01%   328685  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] twotff_
   4.98%   243101  gamess_base.amd  gamess_base.amd64-m64-gcc42-nn  [.] xyzint_
   4.03%   197815  gamess_peak.amd  gamess_peak.amd64-m64-gcc42-nn  [.] xyzint_

with the already noticed loop where there are apparently not enough iterations
to warrant the vectorization and the cost model check gets in the way.
xyzint_ looks similar. Note that

      DO 30 MK=1,NOC
      DO 30 ML=1,MK
         MKL = MKL+1
         XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *      VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
         XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *      VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   30 CONTINUE

shows the inner loop will first iterate once, then twice, then ... That makes
hoisting the cost model check impossible, and it also makes the alias check
not invariant in the outer loop. That would mean if we'd code-generate the
iteration cost-model then loop splitting might get the idea of splitting the
outer loop ... (but loop splitting runs before vectorization, of course).

So in this very case, if we analyze the scalar evolution of the niter of the
loop we want to vectorize, we get back {0, +, 1}_5, which is certainly
something we could factor in when computing the vectorization cost.
It would increase the prologue/epilogue cost but it wouldn't make vectorization never profitable (we know nothing about the upper bound of the number of iterations).
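
To make the triangular iteration pattern concrete, here is a minimal Python sketch of how the inner trip count evolves and how much of the work a vector body could cover. It is a toy model, not GCC's cost model: the function names are hypothetical, the vectorization factor `vf` is an assumed parameter (4 would correspond to 256-bit vectors of doubles), and the alias check and any prologue peeling are ignored.

```python
def inner_trip_counts(noc):
    """Trip counts of the inner ML loop for MK = 1..noc (ML runs 1..MK)."""
    return list(range(1, noc + 1))


def latch_counts(noc):
    """Latch executions per inner loop (trip count minus one).

    This is the sequence 0, 1, 2, ... that the evolution {0, +, 1}
    describes: start at 0, add 1 per outer iteration.
    """
    return [trips - 1 for trips in inner_trip_counts(noc)]


def split_work(noc, vf):
    """Split all inner iterations into vector-covered and leftover ones.

    Assumption: an inner loop with `trips` iterations runs
    (trips // vf) full vector iterations and handles the remainder
    in a scalar epilogue.
    """
    vector_iters = 0
    scalar_iters = 0
    for trips in inner_trip_counts(noc):
        full = (trips // vf) * vf
        vector_iters += full
        scalar_iters += trips - full
    return vector_iters, scalar_iters
```

For small `noc` almost all iterations land in the scalar remainder, while the share covered by the vector body grows with `noc`. That matches the point above: the evolving trip count raises the relative prologue/epilogue cost without making vectorization unprofitable in general.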