https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> So when the vectorizer has the need to use strided stores it would be
> cheapest to spill the vector and do N element loads and stores?  I guess
> we can easily get bottlenecked by the load/store op bandwidth here?
> That is, the vectorizer needs
> 
>   for (lane)
>     dest[stride * lane] = vector[lane];
> 
> thus store a specific (constant) lane of a vector to memory, for each
> vector lane.  (We could use a scatter store here, but only AVX512 has that,
> and building the index vector could be tricky and is not supported for all
> element types.)

Indeed, ICC seems to spill for both AVX and AVX512 for this test case:

typedef int vsi __attribute__((vector_size(SIZE)));
void foo (vsi v, int *p, int *o)
{
  for (int i = 0; i < sizeof(vsi)/4; ++i)
    p[o[i]] = v[i];
}
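
For reference, a minimal sketch of what "spill the vector and do N element
stores" amounts to at the source level. This is only an illustration, not
what GCC or ICC actually emit: the helper name strided_store_via_spill is
hypothetical, and the vector size is fixed at 16 bytes here as an
assumption (the test case above leaves SIZE open).

```c
#include <string.h>

/* Assumption: a fixed 16-byte vector (4 ints) instead of the SIZE macro. */
typedef int vsi __attribute__((vector_size(16)));

enum { LANES = sizeof(vsi) / sizeof(int) };

/* Hypothetical helper: write the whole vector to a stack temporary (the
   "spill"), then issue one scalar load and one scalar store per lane.  */
void strided_store_via_spill(vsi v, int *dest, int stride)
{
    int tmp[LANES];

    memcpy(tmp, &v, sizeof tmp);              /* spill vector to memory */
    for (int lane = 0; lane < LANES; ++lane)
        dest[stride * lane] = tmp[lane];      /* N element loads/stores */
}
```

This costs one vector store plus N scalar loads and N scalar stores, which
is where the load/store bandwidth concern from comment #1 comes from.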
