https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101846
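--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
But when the upper bits are not used, the vpmovdw version seems better.

/* GCC vector-extension types used by the testcase. */
typedef short v8hi __attribute__ ((vector_size (16)));
typedef short v4hi __attribute__ ((vector_size (8)));

v4hi
bar_dw_128 (v8hi x)
{
  return __builtin_shufflevector (x, x, 0, 2, 4, 6);// 4, 5, 6, 7);
}

-       vpshufb .LC2(%rip), %xmm0, %xmm0
+       vpmovdw %xmm0, %xmm0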
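x86 is little-endian, so 16-bit lane 2*i of an xmm register is the low half of 32-bit lane i. That means selecting word lanes 0, 2, 4, 6 gives the same four words as truncating each dword to a word, which is what vpmovdw does. The scalar sketch below checks that equivalence. It assumes a little-endian host, and the helper names are hypothetical, not GCC internals. It models only the four result words. It does not model what ends up in the upper half of %xmm0.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of __builtin_shufflevector (x, x, 0, 2, 4, 6)
   on a v8hi: keep the even-indexed 16-bit lanes. */
static void select_even_words (const uint16_t x[8], uint16_t out[4])
{
  for (int i = 0; i < 4; i++)
    out[i] = x[2 * i];
}

/* Hypothetical scalar model of vpmovdw xmm -> xmm: view the same 16 bytes
   as four 32-bit lanes and keep the low 16 bits of each.
   ASSUMPTION: little-endian host (as on x86). */
static void truncate_dwords (const uint16_t x[8], uint16_t out[4])
{
  uint32_t d[4];
  memcpy (d, x, sizeof d);          /* reinterpret the bytes as dwords */
  for (int i = 0; i < 4; i++)
    out[i] = (uint16_t) d[i];       /* truncate to the low word */
}

/* Returns 1 if both models produce the same four words for x. */
static int shuffle_matches_vpmovdw (const uint16_t x[8])
{
  uint16_t a[4], b[4];
  select_even_words (x, a);
  truncate_dwords (x, b);
  return memcmp (a, b, sizeof a) == 0;
}
```

On a big-endian host the memcpy would place x[2*i] in the high half of d[i], and the check would fail. The equivalence depends on x86 lane ordering.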