https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68655
--- Comment #11 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 3 Dec 2015, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68655
>
> Jakub Jelinek <jakub at gcc dot gnu.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>   Attachment #36897|0                           |1
>         is obsolete|                            |
>
> --- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Created attachment 36898
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36898&action=edit
> gcc6-pr68655.patch
>
> Second attempt, this one tries it only for a single insn if we couldn't get a
> single insn otherwise, and then as a final fallback if nothing else worked.
> Running the same command, I see only beneficial changes this time:
> vshuf-v16qi.c -msse2 test_2, scalar to punpcklqdq
> vshuf-v64qi.c -mavx512bw test_2
> -	vpermi2w	%zmm1, %zmm1, %zmm0
> -	vpshufb	.LC3(%rip), %zmm0, %zmm1
> -	vpshufb	.LC4(%rip), %zmm0, %zmm0
> -	vporq	%zmm0, %zmm1, %zmm0
> +	vpermi2q	%zmm1, %zmm1, %zmm0
> vshuf-v8hi.c -msse2 test_2, scalar to punpcklqdq
> and that is it.
> Also on:
> typedef unsigned short v8hi __attribute__((vector_size(16)));
> typedef int v4si __attribute__((vector_size(16)));
> typedef long long v2di __attribute__((vector_size(16)));
>
> v2di
> f1 (v2di a, v2di b)
> {
>   return __builtin_shuffle (a, b, (v2di) { 0, 2 });
> }
>
> v8hi
> f2 (v8hi a, v8hi b)
> {
>   return __builtin_shuffle (a, b, (v8hi) { 0, 1, 2, 3, 8, 9, 10, 11 });
> }
>
> v4si
> f3 (v4si a, v4si b)
> {
>   return __builtin_shuffle (a, b, (v4si) { 0, 1, 4, 5 });
> }
>
> v8hi
> f4 (v8hi a, v8hi b)
> {
>   return __builtin_shuffle (a, b, (v8hi) { 0, 1, 2, 3, 12, 13, 14, 15 });
> }
>
> with -O2 -msse2 (or -msse3) the diff in f2 is scalar code to punpcklqdq,
> with -O2 -mssse3 the diff is f2:
> -	punpcklwd	%xmm1, %xmm0
> -	pshufb	.LC0(%rip), %xmm0
> +	punpcklqdq	%xmm1, %xmm0
> f4:
> -	palignr	$8, %xmm1, %xmm0
> -	palignr	$8, %xmm0, %xmm0
> +	shufpd	$2, %xmm1, %xmm0
> for -O2 -msse4 just the f2 change, for each of -O2 -mavx{,2,512f,512vl} f2:
> -	vpunpcklwd	%xmm1, %xmm0, %xmm1
> -	vpshufb	.LC0(%rip), %xmm1, %xmm0
> +	vpunpcklqdq	%xmm1, %xmm0, %xmm0

LGTM
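
(Editor's note, not part of the original comment: the following is a minimal
scalar sketch of the two-operand __builtin_shuffle semantics the quoted
testcases rely on; the helper name shuffle_u16 is hypothetical. It shows why
the f2 mask { 0, 1, 2, 3, 8, 9, 10, 11 } selects the low 64 bits of each
operand, which is exactly the interleave a single punpcklqdq performs.)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Scalar model of the two-operand __builtin_shuffle: element i of the
   result is a[mask[i]] when mask[i] < n, else b[mask[i] - n], where n
   is the element count.  Hypothetical helper, for illustration only.  */
static void
shuffle_u16 (uint16_t *res, const uint16_t *a, const uint16_t *b,
             const int *mask, int n)
{
  for (int i = 0; i < n; i++)
    res[i] = mask[i] < n ? a[mask[i]] : b[mask[i] - n];
}

int
main (void)
{
  uint16_t a[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
  uint16_t b[8] = { 8, 9, 10, 11, 12, 13, 14, 15 };
  /* The f2 mask from the testcase: the low four halfwords of a, then
     the low four halfwords of b, i.e. the low quadword of each
     operand.  */
  int mask[8] = { 0, 1, 2, 3, 8, 9, 10, 11 };
  uint16_t res[8];

  shuffle_u16 (res, a, b, mask, 8);

  /* punpcklqdq reference: low 64 bits of a, then low 64 bits of b.  */
  uint16_t ref[8];
  memcpy (ref, a, 8);      /* low quadword of a */
  memcpy (ref + 4, b, 8);  /* low quadword of b */

  printf ("match: %d\n", !memcmp (res, ref, sizeof res));
  return 0;
}

Built with any C compiler at -O0 or above, this prints "match: 1",
confirming the equivalence the diffs above exploit.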