https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68655
--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I guess it needs analysis.  Some examples of changes:

vshuf-v16qi.c -msse2 test_2: scalar code vs. punpcklqdq, a clear win.

vshuf-v16qi.c -msse4 test_2: pshufb -> punpcklqdq (is this a win or not?)
(similarly for -mavx, -mavx2, -mavx512f, -mavx512bw)

vshuf-v16si.c -mavx512{f,bw} test_2 looks like a pessimization:
-	vpermi2d	%zmm1, %zmm1, %zmm0
+	vmovdqa64	.LC2(%rip), %zmm0
+	vpermi2q	%zmm1, %zmm1, %zmm0

vshuf-v32hi.c -mavx512bw test_2, a similar pessimization:
-	vpermi2w	%zmm1, %zmm1, %zmm0
+	vmovdqa64	.LC2(%rip), %zmm0
+	vpermi2q	%zmm1, %zmm1, %zmm0

vshuf-v4si.c -msse2 test_183, another pessimization:
-	pshufd	$78, %xmm0, %xmm1
+	movdqa	%xmm0, %xmm1
 	movd	b(%rip), %xmm4
 	pshufd	$255, %xmm0, %xmm2
+	shufpd	$1, %xmm0, %xmm1

vshuf-v4si.c -msse4 test_183, another pessimization:
-	pshufd	$78, %xmm1, %xmm0
+	movdqa	%xmm1, %xmm0
+	palignr	$8, %xmm0, %xmm0

vshuf-v4si.c -mavx test_183:
-	vpshufd	$78, %xmm1, %xmm0
+	vpalignr	$8, %xmm1, %xmm1, %xmm0

vshuf-v64qi.c -mavx512bw, a desirable change:
-	vpermi2w	%zmm1, %zmm1, %zmm0
-	vpshufb	.LC3(%rip), %zmm0, %zmm1
-	vpshufb	.LC4(%rip), %zmm0, %zmm0
-	vporq	%zmm0, %zmm1, %zmm0
+	vpermi2q	%zmm1, %zmm1, %zmm0

vshuf-v8hi.c -msse2 test_1: another scalar sequence turned into punpcklqdq, a win.

vshuf-v8hi.c -msse4 test_2 (supposedly a win):
-	pshufb	.LC3(%rip), %xmm0
+	punpcklqdq	%xmm0, %xmm0

vshuf-v8hi.c -mavx test_2, similarly:
-	vpshufb	.LC3(%rip), %xmm0, %xmm0
+	vpunpcklqdq	%xmm0, %xmm0, %xmm0

vshuf-v8si.c -mavx2 test_2, another win:
-	vmovdqa	a(%rip), %ymm0
-	vperm2i128	$0, %ymm0, %ymm0, %ymm0
+	vpermq	$68, a(%rip), %ymm0

vshuf-v8si.c -mavx2 test_5, another win:
-	vmovdqa	.LC6(%rip), %ymm0
-	vmovdqa	.LC7(%rip), %ymm1
-	vmovdqa	%ymm0, -48(%rbp)
 	vmovdqa	a(%rip), %ymm0
-	vpermd	%ymm0, %ymm1, %ymm1
-	vpshufb	.LC8(%rip), %ymm0, %ymm3
-	vpshufb	.LC10(%rip), %ymm0, %ymm0
-	vmovdqa	%ymm1, c(%rip)
-	vmovdqa	b(%rip), %ymm1
-	vpermq	$78, %ymm3, %ymm3
-	vpshufb	.LC9(%rip), %ymm1, %ymm2
-	vpshufb	.LC11(%rip), %ymm1, %ymm1
-	vpor	%ymm3, %ymm0, %ymm0
-	vpermq	$78, %ymm2, %ymm2
-	vpor	%ymm2, %ymm1, %ymm1
-	vpor	%ymm1, %ymm0, %ymm0
+	vmovdqa	.LC7(%rip), %ymm2
+	vmovdqa	.LC6(%rip), %ymm1
+	vpermd	%ymm0, %ymm2, %ymm2
+	vpermd	b(%rip), %ymm1, %ymm3
+	vmovdqa	%ymm1, -48(%rbp)
+	vmovdqa	%ymm2, c(%rip)
+	vpermd	%ymm0, %ymm1, %ymm0
+	vmovdqa	.LC8(%rip), %ymm2
+	vpand	%ymm2, %ymm1, %ymm1
+	vpcmpeqd	%ymm2, %ymm1, %ymm1
+	vpblendvb	%ymm1, %ymm3, %ymm0, %ymm0

vshuf-v8si.c -mavx512f test_2, another win?
-	vmovdqa	a(%rip), %ymm0
-	vperm2i128	$0, %ymm0, %ymm0, %ymm0
+	vpermq	$68, a(%rip), %ymm0

The above does not list all the changes: I've often ignored further changes in a file when, say, one change adds or removes a .LC* constant and everything else gets renumbered, and it doesn't always show cases where the same or a similar change appears with multiple ISAs.

So the results are clearly mixed.  Perhaps I should just try doing this at the end of expand_vec_perm_1 (i.e. if we (most likely) couldn't get a single insn normally, see if we would get one otherwise), and at the end of ix86_expand_vec_perm_const_1 (as the fallback after all the multi-insn sequences).  It won't catch some beneficial one-insn to one-insn changes, though (e.g. where the original insn needs a constant operand in memory).
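
For context, the test_* cases above are constant-mask __builtin_shuffle tests.  A minimal standalone example of the kind of shuffle that wins above (the mask here is illustrative, not necessarily the testsuite's exact test_2 mask):

/* A constant-mask shuffle duplicating the low half of the vector;
   with -O2 -msse2 this can be emitted as a single
   punpcklqdq %xmm0, %xmm0.  */
typedef unsigned char V16 __attribute__ ((vector_size (16)));

V16
dup_low_half (V16 x)
{
  return __builtin_shuffle (x, (V16) { 0, 1, 2, 3, 4, 5, 6, 7,
                                       0, 1, 2, 3, 4, 5, 6, 7 });
}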
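
Stripped of the i386 backend details, the retry being considered amounts to: when every pair of adjacent narrow elements moves together, the constant mask is really a permutation of double-width elements, so the single-insn expander can be asked again in the wider mode.  A standalone sketch of just that mask check (widen_perm and everything around it is illustrative, not actual GCC code):

#include <stdbool.h>
#include <stdio.h>

/* Return true and fill wide_perm[] when the narrow permutation moves
   adjacent element pairs together, i.e. it is really a permutation of
   elements twice as wide (nelt must be even).  */
static bool
widen_perm (const unsigned char *perm, unsigned nelt,
            unsigned char *wide_perm)
{
  for (unsigned i = 0; i < nelt; i += 2)
    {
      if ((perm[i] & 1) != 0 || perm[i + 1] != perm[i] + 1)
        return false;
      wide_perm[i / 2] = perm[i] / 2;
    }
  return true;
}

int
main (void)
{
  /* The V8HI mask {4,5,6,7,0,1,2,3} is really the V4SI mask {2,3,0,1},
     i.e. the pshufd $78 seen in the test_183 diffs above.  */
  unsigned char perm[8] = { 4, 5, 6, 7, 0, 1, 2, 3 };
  unsigned char wide[4];
  if (widen_perm (perm, 8, wide))
    printf ("widens to { %d, %d, %d, %d }\n",
            wide[0], wide[1], wide[2], wide[3]);
  return 0;
}

Whether the widened form is actually cheaper is exactly the mixed bag above: widening v8hi to v4si can yield a plain pshufd or punpcklqdq, while widening v4si to v2di turned a single pshufd $78 into the shufpd/palignr sequences shown in test_183.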