https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> So when the vectorizer has the need to use strided stores it would be
> cheapest to spill the vector and do N element loads and stores?  I guess
> we can easily get bottle-necked by the load/store op bandwidth here?
> That is, the vectorizer needs
>
>   for (lane)
>     dest[stride * lane] = vector[lane];
>
> thus store a specific (constant) lane of a vector to memory, for each
> vector lane.  (we could use a scatter store here but only AVX512 has that
> and building the index vector could be tricky and not supported for all
> element types)

Indeed ICC seems to spill for AVX and AVX512 for

typedef int vsi __attribute__((vector_size(SIZE)));
void foo (vsi v, int *p, int *o)
{
  for (int i = 0; i < sizeof(vsi)/4; ++i)
    p[o[i]] = v[i];
}
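
For illustration, the "spill the vector and do N element loads and stores" strategy from the quoted comment could be written by hand roughly as follows. This is only a sketch under assumptions: the helper name `strided_store_via_spill`, the `LANES` macro and the fixed 4 x int vector width are made up for the example, and it is not what the vectorizer or ICC actually emits.

```c
#include <string.h>

#define LANES 4  /* assumption: 4 x int, i.e. a 16-byte vector */

/* GCC/Clang vector extension type, as in the testcase above. */
typedef int vsi __attribute__((vector_size(LANES * sizeof(int))));

/* Hypothetical helper: performs dest[stride * lane] = v[lane] for every
   lane by spilling the vector to the stack first. */
void strided_store_via_spill(vsi v, int *dest, int stride)
{
    int spill[LANES];

    /* Copy the whole vector into a stack buffer (the "spill"). */
    memcpy(spill, &v, sizeof spill);

    /* One scalar load and one scalar store per lane; this is where the
       load/store bandwidth concern comes from. */
    for (int lane = 0; lane < LANES; ++lane)
        dest[stride * lane] = spill[lane];
}
```

Each lane costs a scalar load plus a scalar store on top of the single vector store, so the sequence is bound by load/store throughput rather than by shuffle or extract operations.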