https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
So strided stores are costed as

      /* Costs of the stores.  */
      if (memory_access_type == VMAT_ELEMENTWISE
          || memory_access_type == VMAT_GATHER_SCATTER)
        {
          /* N scalar stores plus extracting the elements.  */
          unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
          inside_cost += record_stmt_cost (body_cost_vec,
                                           ncopies * assumed_nunits,
                                           scalar_store, stmt_info, 0,
                                           vect_body);
        }
...
      if (memory_access_type == VMAT_ELEMENTWISE
          || memory_access_type == VMAT_STRIDED_SLP)
        {
          /* N scalar stores plus extracting the elements.  */
          unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
          inside_cost += record_stmt_cost (body_cost_vec,
                                           ncopies * assumed_nunits,
                                           vec_to_scalar, stmt_info, 0,
                                           vect_body);
        }

and there's the issue of "overloading" vec_to_scalar with extraction.  It is
costed as a generic sse_op, which IMHO is reasonable here (vextract*).

The scalar cost is 12 for each of the following stmts:

  _66 = *_150[_65];
  d1.76_67 = d1;
  _160 = d1.76_67 * _73;
  _74 = _66 * _160;
  *_150[_65] = _74;

The vector variant adds the construction/extraction cost compared to the
scalar variant and wins because the two multiplications are costed once
instead of four times.  We don't actually factor in the "win" from hoisting
the vectorized load of 'd1', which happens only in the vector case.

With AVX2 the vectorized version becomes even "cheaper".  And we of course
peel the epilogue completely.

Ideally we'd interchange this specific loop, but interchange doesn't do
anything here because we get niters that might be zero.  Later, dependences
would probably wreck things, but here this also is a missed optimization.
We have two paths running into the loop, both loading ng1 and checking it
against zero properly, but the PHI result doesn't have this range info
merged (well, VRP sets the info, but it needs LIM / PRE to expose the
opportunity, so it's only set by late VRP).
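
For illustration only (this is not the testcase from this PR; the array 'a',
the function name, the bounds and the parameters ni/nj/c are made up), the
stmts above correspond to a loop shape roughly like the following C: the
vectorized inner loop strides through the array, d1 * c is loop-invariant,
and interchanging the two loops would make the accesses contiguous.

  #define N 128

  double a[N][N];
  double d1;

  void
  scale (int ni, int nj, double c)
  {
    /* The inner loop varies the first array index, so consecutive
       iterations touch elements N doubles apart - the vectorizer sees
       strided (elementwise) loads and stores.  d1 * c is invariant in
       both loops.  */
    for (int j = 0; j < nj; j++)
      for (int i = 0; i < ni; i++)
        a[i][j] = a[i][j] * (d1 * c);
  }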