https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65832
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Testcase simulating all qi vector cases the vectorizer may create:

char a[1024];
char b[1024];

void foobar (int s)
{
  for (int i = 0; i < 16; ++i)
    {
      b[i] = a[s*i];
    }
}

void foo (int s)
{
  for (int i = 0; i < 8; ++i)
    {
      b[2*i] = a[s*i];
      b[2*i + 1] = a[s*i + 1];
    }
}

void bar (int s)
{
  for (int i = 0; i < 4; ++i)
    {
      b[4*i] = a[s*i];
      b[4*i + 1] = a[s*i + 1];
      b[4*i + 2] = a[s*i + 2];
      b[4*i + 3] = a[s*i + 3];
    }
}

void baz (int s)
{
  for (int i = 0; i < 2; ++i)
    {
      b[8*i] = a[s*i];
      b[8*i + 1] = a[s*i + 1];
      b[8*i + 2] = a[s*i + 2];
      b[8*i + 3] = a[s*i + 3];
      b[8*i + 4] = a[s*i + 4];
      b[8*i + 5] = a[s*i + 5];
      b[8*i + 6] = a[s*i + 6];
      b[8*i + 7] = a[s*i + 7];
    }
}

Compile with -fdisable-tree-cunrolli.

foobar creates abysmal code and baz needlessly goes through the stack.

With plain -msse2 none of the code generation is great except for foo,
which ends up using pinsrw. baz fails to use pinsrq and foobar fails to
use pinsrq with -msse4.
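For illustration only, here is a rough sketch of how the baz case could be built by hand without a round trip through the stack: two 64-bit loads merged in a register, then a single 16-byte store. This is not what GCC emits and not a proposed fix. The names baz_ref and baz_by_hand, the extra out parameter (used so the two variants can be compared), and the bounds assert are assumptions made for this example. On targets without SSE2 the sketch falls back to two 8-byte copies.

```c
#include <assert.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

char a[1024];
char b[1024];

/* Reference: the baz loop from the testcase, written as a nested loop and
   storing to OUT instead of the global b (assumption for this sketch).  */
void baz_ref (int s, char *out)
{
  for (int i = 0; i < 2; ++i)
    for (int j = 0; j < 8; ++j)
      out[8*i + j] = a[s*i + j];
}

/* Hypothetical hand-written variant: load a[0..7] and a[s..s+7] as two
   64-bit halves and combine them into one 16-byte vector.  */
void baz_by_hand (int s, char *out)
{
  /* Precondition added for the sketch: both 8-byte loads stay inside a.  */
  assert (s >= 0 && s + 8 <= (int) sizeof a);
#if defined(__SSE2__)
  /* Each load fills the low 64 bits of a vector and zeroes the rest.  */
  __m128i lo = _mm_loadl_epi64 ((const __m128i *) &a[0]);
  __m128i hi = _mm_loadl_epi64 ((const __m128i *) &a[s]);
  /* Low half of the result comes from lo, high half from hi.  */
  _mm_storeu_si128 ((__m128i *) out, _mm_unpacklo_epi64 (lo, hi));
#else
  /* Portable fallback: two 8-byte copies.  */
  memcpy (out, &a[0], 8);
  memcpy (out + 8, &a[s], 8);
#endif
}
```

Calling both functions with the same s and comparing the two 16-byte results with memcmp is a quick way to check that the sketch matches the scalar loop.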