https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Looks like we peel for alignment, which for this loop is quite pointless as it only runs 5 times, so for AVX256 we're likely running into peel for alignment, no vector iteration, epilogue. Need to tame down that damn alignment peeling more ...

It peels 'x' btw.

block_solver.f:178:0: note: Cost model analysis:
  Vector inside of loop cost: 76
  Vector prologue cost: 61
  Vector epilogue cost: 62
  Scalar iteration cost: 28
  Scalar outside cost: 7
  Vector outside cost: 123
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 5
block_solver.f:178:0: note:   Runtime profitability threshold = 4
block_solver.f:178:0: note:   Static estimate profitability threshold = 5

but that doesn't take into account that we may spend up to 3 scalar iterations in the alignment prologue, and thus with niter < 7 we may never enter the vector loop. The static estimate is similarly affected by this.