https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112384
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|aarch64                     |aarch64, x86_64-*-*
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2023-11-06

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  Note that for f2 the target needs to support .VEC_EXTRACT with
a variable index.  OTOH we fail to transform

  i_4 = VIEW_CONVERT_EXPR<int[4]>(t)[i_2];
  tt_5 = {i_4, i_4, i_4, i_4};

into

  tt_3 = {i_2, i_2, i_2, i_2};
  r_6 = VEC_PERM_EXPR <t_4(D), t_4(D), tt_3>;

but the complication is that 't' isn't in SSA form (which is also why it
goes through memory here).

On x86_64 with SSE4.1 we get

f1:
.LFB0:
        .cfi_startproc
        andl    $3, %edi
        movd    %edi, %xmm2
        pshufd  $0, %xmm2, %xmm1
        pslld   $2, %xmm1
        pshufb  .LC1(%rip), %xmm1
        paddb   .LC2(%rip), %xmm1
        pshufb  %xmm1, %xmm0
        ret

f2:
.LFB1:
        .cfi_startproc
        andl    $3, %edi
        movaps  %xmm0, -24(%rsp)
        movd    -24(%rsp,%rdi,4), %xmm1
        pshufd  $0, %xmm1, %xmm0
        ret

I suspect the memory case is actually faster.  With AVX512VL this improves
to

f1:
.LFB0:
        .cfi_startproc
        andl    $3, %edi
        vmovdqa %xmm0, %xmm1
        vpbroadcastd    %edi, %xmm0
        vpermi2d        %xmm1, %xmm1, %xmm0
        ret

f2:
.LFB1:
        .cfi_startproc
        andl    $3, %edi
        vmovdqa %xmm0, -24(%rsp)
        vpbroadcastd    -24(%rsp,%rdi,4), %xmm0
        ret

AVX2 has the odd

f1:
.LFB0:
        .cfi_startproc
        andl    $3, %edi
        vinserti128     $1, %xmm0, %ymm0, %ymm0
        vmovd   %edi, %xmm2
        vpbroadcastd    %xmm2, %xmm1
        vinserti128     $1, %xmm1, %ymm1, %ymm1
        vpermd  %ymm0, %ymm1, %ymm0
        vzeroupper
        ret

where something feels wrong.  For AVX2, f2 is similar to the AVX512 code.
It's not clear whether the f1 IL is better in the end.
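
For readers without the original testcase at hand, the two forms being compared can be sketched at the source level. This is an assumed reconstruction, not the reporter's code: the function names, the `v4si` typedef and the self-check helper are hypothetical. Judging from the assembly above, f1 appears to correspond to the permute form and f2 to the extract-then-broadcast form. `__builtin_shuffle` and vector subscripting are GNU C extensions, so this needs GCC.

```c
#include <assert.h>

/* Hypothetical typedef: four 32-bit ints in one 128-bit vector (GNU C). */
typedef int v4si __attribute__((vector_size(16)));

/* Permute form (assumed to match f1): broadcast the index into a
   selector vector, then permute t with it (VEC_PERM_EXPR in GIMPLE). */
v4si splat_via_shuffle(v4si t, int i)
{
    i &= 3;
    v4si sel = {i, i, i, i};
    return __builtin_shuffle(t, sel);
}

/* Extract form (assumed to match f2): read one element with a variable
   index, then broadcast that scalar into every lane. */
v4si splat_via_extract(v4si t, int i)
{
    int e = t[i & 3];
    v4si r = {e, e, e, e};
    return r;
}

/* Hypothetical helper: asserts that both forms yield t[i] in every lane
   for each of the four valid indices. */
void check_splat_variants(v4si t)
{
    for (int i = 0; i < 4; i++) {
        v4si a = splat_via_shuffle(t, i);
        v4si b = splat_via_extract(t, i);
        for (int k = 0; k < 4; k++) {
            assert(a[k] == t[i]);
            assert(b[k] == t[i]);
        }
    }
}
```

Both functions compute the same value, which is why the comment treats the choice between them as a question of which IL the target expands better.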