https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107451

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Peeling for gaps also isn't a good fix here.  One could envision a case with
loads reaching even three iterations ahead, e.g. with

        for(i = 0; i < n; i++) {
                dot[0] += x[ix]   * y[ix]   ;
                dot[1] += x[ix+2] * y[ix+2] ;
                dot[2] += x[ix]   * y[ix+2] ;
                dot[3] += x[ix+2] * y[ix]   ;
                ix += inc_x ;
        }

or similar.  The root cause is how we generate code for VMAT_STRIDED_SLP,
where we first generate loads to fill a contiguous output vector and only
then create the permute using the pieces that are actually necessary.
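
To make that concrete, here is a rough plain-C sketch of the shape such a
lowering takes for one vector iteration, assuming 4-double vectors and the
{ 1, 0, 1, 0 } permutation from the dump further below; fill_and_permute,
perm and stride are made-up names, not vectorizer internals:

  /* Illustrative only: the strided-SLP lowering first fills a contiguous
     output vector from the strided locations -- including pieces the
     permutation never uses and that may lie beyond the last element the
     scalar code touches -- and only afterwards permutes out the lanes that
     are actually needed.  */
  static const int perm[4] = { 1, 0, 1, 0 };  /* cf. VEC_PERM_EXPR <..., { 1, 0, 1, 0 }> */

  static void
  fill_and_permute (const double *p, long stride, double out[4])
  {
    double fill[4];
    fill[0] = p[0];              /* two loads at the first strided location ...  */
    fill[1] = p[1];
    fill[2] = p[stride];         /* ... two more at the next, never used below  */
    fill[3] = p[stride + 1];
    for (int j = 0; j < 4; j++)
      out[j] = fill[perm[j]];    /* only now the unneeded pieces are dropped  */
  }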

We could simply fail if 'nloads' is bigger than 'vf', or cap 'nloads' and
fail if we then cannot generate the permutation.
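
A minimal sketch of such a guard, with placeholder names (strided_slp_ok_p,
can_permute_from_pieces) rather than actual GCC internals:

  #include <stdbool.h>

  /* Hypothetical helper: can the required load permutation still be
     generated from only 'nloads' loaded pieces?  Stubbed out here.  */
  static bool
  can_permute_from_pieces (unsigned nloads)
  {
    return nloads > 0;
  }

  /* Refuse (or cap) the VMAT_STRIDED_SLP lowering when filling the vector
     takes more loads than the vectorization factor covers.  */
  static bool
  strided_slp_ok_p (unsigned nloads, unsigned vf)
  {
    if (nloads > vf)
      {
        nloads = vf;                            /* cap 'nloads' ...  */
        if (!can_permute_from_pieces (nloads))
          return false;                         /* ... or fail outright  */
      }
    return true;
  }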

When we force VMAT_ELEMENTWISE the very same issue arises, but later
optimization eliminates the unnecessary loads, avoiding the problem:

  _62 = *ivtmp_64;
  _61 = MEM[(const double *)ivtmp_64 + 8B];
  ivtmp_60 = ivtmp_64 + _75;
  _59 = *ivtmp_60;
  _58 = MEM[(const double *)ivtmp_60 + 8B];
  ivtmp_57 = ivtmp_60 + _75;
  vect_cst__48 = {_62, _61, _59, _58};
  vect__4.12_47 = VEC_PERM_EXPR <vect_cst__48, vect_cst__48, { 1, 0, 1, 0 }>;

that just becomes

  _62 = MEM[(const double *)ivtmp_64];
  _61 = MEM[(const double *)ivtmp_64 + 8B];
  ivtmp_60 = ivtmp_64 + _75;
  vect__4.12_47 = {_61, _62, _61, _62};

With cost modeling and VMAT_ELEMENTWISE we fall back to SSE vectorization,
which works fine.

I fear the proper fix is to integrate load emission with
vect_transform_slp_perm_load somehow; we shouldn't rely on followup
simplifications to fix what the vectorizer emits here.
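
Conceptually (again just an illustrative plain-C sketch with made-up names),
driving load emission from the permutation would mean emitting only the
strided pieces the permutation actually refers to:

  /* Illustrative only: when load emission is driven by the SLP load
     permutation, the { 1, 0, 1, 0 } case above loads just the two pieces at
     the first strided location, so no unnecessary (potentially out-of-bounds)
     loads are emitted in the first place.  */
  static const int lperm[4] = { 1, 0, 1, 0 };

  static void
  load_for_permutation (const double *p, long stride, double out[4])
  {
    for (int j = 0; j < 4; j++)
      {
        int piece = lperm[j];                  /* which strided element is needed  */
        out[j] = p[(piece / 2) * stride + piece % 2];
      }
  }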

Since we have no fallback, detecting the situation and avoiding it completely
would mean not vectorizing the code (with AVX).
