https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37150
--- Comment #21 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ok, so fixing the accounting to disregard obviously dead loads gets us to

t.f90:158:0: note: Cost model analysis:
  Vector inside of basic block cost: 1224
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 616
t.f90:158:0: note: not vectorized: vectorization is not profitable.

That still doesn't account for the redundant ones (we still emit those, so we conservatively assume no CSE here).

I suppose the "simple" way of costing permutations might be the real issue here though. Permutations like { 58, 58, 58, 58 } are also vectorized badly (and costed accordingly). Likewise { 4, 5, 4, 5 } is costed as a permutation. Not counting non-permutations improves things to

t.f90:158:0: note: Cost model analysis:
  Vector inside of basic block cost: 1080
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 616
t.f90:158:0: note: not vectorized: vectorization is not profitable.

So there is room for improvement, but these were the "easy" parts (the rest also requires more analysis). Likely there's some CSE in between the SLP instances involved.
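
To make the selector shapes mentioned above concrete, here is a minimal sketch in Python (not GCC code) that sorts a permutation selector into rough categories and replays the profitability comparison with the numbers from the two dumps. The names classify_selector and bb_profitable and the category labels are hypothetical. Reading "non-permutation" as an identity selector is an assumption, as is the rule that the block is profitable only when the summed vector cost is strictly below the scalar cost; both are inferred from the comment and the dump output rather than taken from the vectorizer source.

```python
def classify_selector(sel):
    """Classify a permutation selector (a list of element indices).

    Categories are illustrative labels, not GCC terminology.
    """
    n = len(sel)
    if n == 0:
        raise ValueError("empty selector")
    # Identity: lanes are consecutive and in order, so no shuffle is needed.
    if all(sel[i] == sel[0] + i for i in range(n)):
        return "identity"
    # Splat: every lane reads the same element, e.g. { 58, 58, 58, 58 }.
    if all(x == sel[0] for x in sel):
        return "splat"
    # Repeated group: a short consecutive run duplicated, e.g. { 4, 5, 4, 5 }.
    for g in range(2, n // 2 + 1):
        if n % g == 0:
            group = sel[:g]
            consecutive = all(group[i] == group[0] + i for i in range(g))
            if consecutive and all(sel[i] == group[i % g] for i in range(n)):
                return "repeated-group"
    # Anything else needs a real shuffle.
    return "general"


def bb_profitable(vec_inside, vec_prologue, vec_epilogue, scalar):
    """Assumed rule: vectorize only if total vector cost beats scalar cost."""
    return vec_inside + vec_prologue + vec_epilogue < scalar


if __name__ == "__main__":
    for sel in ([58, 58, 58, 58], [4, 5, 4, 5], [0, 1, 2, 3], [3, 1, 2, 0]):
        print(sel, classify_selector(sel))
    # Costs taken from the two dumps in the comment.
    print(bb_profitable(1224, 0, 0, 616))
    print(bb_profitable(1080, 0, 0, 616))
```

Under these assumptions, both dumps stay on the "not profitable" side: skipping the non-permutations lowers the vector cost by 144 (1224 to 1080), which is still well above the scalar cost of 616. A cost model that charged splats and repeated groups less than a general shuffle would need per-category costs, which this sketch deliberately does not invent.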