https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107451
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Peeling for gaps also isn't a good fix here.  One could envision a case with
loads reaching even three iterations ahead, with

  for(i = 0; i < n; i++) {
    dot[0] += x[ix] * y[ix];
    dot[1] += x[ix] * y[ix];
    dot[2] += x[ix] * y[ix];
    dot[3] += x[ix] * y[ix];
    ix += inc_x;
  }

or similar.

The root cause is how we generate code for VMAT_STRIDED_SLP, where we first
generate loads to fill a contiguous output vector and only then create the
permute using the pieces that are actually necessary.  We could simply fail
if 'nloads' is bigger than 'vf', or cap 'nloads' and fail if we then cannot
generate the permutation.

When we force VMAT_ELEMENTWISE the very same issue arises, but later
optimization will eliminate the unnecessary loads, avoiding the problem:

  _62 = *ivtmp_64;
  _61 = MEM[(const double *)ivtmp_64 + 8B];
  ivtmp_60 = ivtmp_64 + _75;
  _59 = *ivtmp_60;
  _58 = MEM[(const double *)ivtmp_60 + 8B];
  ivtmp_57 = ivtmp_60 + _75;
  vect_cst__48 = {_62, _61, _59, _58};
  vect__4.12_47 = VEC_PERM_EXPR <vect_cst__48, vect_cst__48, { 1, 0, 1, 0 }>;

that just becomes

  _62 = MEM[(const double *)ivtmp_64];
  _61 = MEM[(const double *)ivtmp_64 + 8B];
  ivtmp_60 = ivtmp_64 + _75;
  vect__4.12_47 = {_61, _62, _61, _62};

With cost modeling and VMAT_ELEMENTWISE we fall back to SSE vectorization,
which works fine.

I fear the proper fix is to integrate load emission with
vect_transform_slp_perm_load somehow; we shouldn't rely on followup
simplifications to fix what the vectorizer emits here.  Since we have no
fallback, detecting the situation and avoiding it completely would mean not
vectorizing the code (with AVX).
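
To make the difference between the two lowerings concrete, here is a minimal
scalar sketch in plain C.  It is not GCC internals: the function names, the
group size of two doubles, the stride parameter and the V4DF-like width of
four lanes are all illustrative assumptions matching the GIMPLE above.

  /* Scalar model of the VMAT_STRIDED_SLP lowering: fill a contiguous
     vector with nloads loads first, then permute.  The {1, 0, 1, 0}
     permute only uses lanes 0 and 1, yet lanes 2 and 3 are still
     loaded one stride ahead of what the scalar loop accesses, which
     can read past the end of the data.  */
  #include <stddef.h>

  static void strided_slp_model (const double *base, ptrdiff_t stride,
                                 double out[4])
  {
    double lanes[4];
    lanes[0] = base[0];
    lanes[1] = base[1];
    lanes[2] = base[stride];      /* loaded but unused by the permute */
    lanes[3] = base[stride + 1];  /* loaded but unused by the permute */

    /* VEC_PERM_EXPR <lanes, lanes, { 1, 0, 1, 0 }>  */
    out[0] = lanes[1];
    out[1] = lanes[0];
    out[2] = lanes[1];
    out[3] = lanes[0];
  }

  /* What is left of the VMAT_ELEMENTWISE form after the followup
     simplification shown above: the dead loads are gone and nothing
     beyond the current group is read.  */
  static void elementwise_model (const double *base, double out[4])
  {
    double a = base[0], b = base[1];
    out[0] = b;
    out[1] = a;
    out[2] = b;
    out[3] = a;
  }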