https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106081

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |96208

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
PR96208 is the SLP of non-grouped loads.  We now can convert short -> double
and we get with the grouped load hacked and -march=znver3:

.L2:
        vmovdqu (%rax), %ymm0
        vpermq  $27, -24(%rdi), %ymm10
        addq    $32, %rax
        subq    $32, %rdi
        vpshufb %ymm7, %ymm0, %ymm0
        vpermpd $85, %ymm10, %ymm9
        vpermpd $170, %ymm10, %ymm8
        vpermpd $255, %ymm10, %ymm6
        vpmovsxwd       %xmm0, %ymm1
        vextracti128    $0x1, %ymm0, %xmm0
        vbroadcastsd    %xmm10, %ymm10
        vcvtdq2pd       %xmm1, %ymm11
        vextracti128    $0x1, %ymm1, %xmm1
        vpmovsxwd       %xmm0, %ymm0
        vcvtdq2pd       %xmm1, %ymm1
        vfmadd231pd     %ymm10, %ymm11, %ymm5
        vfmadd231pd     %ymm9, %ymm1, %ymm2
        vcvtdq2pd       %xmm0, %ymm1
        vextracti128    $0x1, %ymm0, %xmm0
        vcvtdq2pd       %xmm0, %ymm0
        vfmadd231pd     %ymm8, %ymm1, %ymm4
        vfmadd231pd     %ymm6, %ymm0, %ymm3
        cmpq    %rax, %rdx
        jne     .L2

that is, the 'short' data type forces a higher VF to us and the splat
codegen I hacked in is sub-optimal still.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96208
[Bug 96208] non-grouped load can be SLP vectorized for 2-element vectors case
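
As a rough illustration of the remark in comment #5 that the 'short' data type forces a higher VF, the lane arithmetic can be sketched as below. This is a simplified model, not GCC's actual VF selection logic: it assumes a single 32-byte (256-bit) vector size, as with the %ymm registers in the listing, and that the narrowest element type in the loop sets the vectorization factor. The helper names are hypothetical.

```c
#include <stddef.h>

/* Number of elements of elem_bytes that fit in one vector of vec_bytes. */
static size_t lanes(size_t vec_bytes, size_t elem_bytes)
{
    return vec_bytes / elem_bytes;
}

/* Assumption: the narrowest element type fills exactly one vector per
   vector iteration, so its lane count is the vectorization factor (VF). */
static size_t vectorization_factor(size_t vec_bytes, size_t narrowest_elem_bytes)
{
    return lanes(vec_bytes, narrowest_elem_bytes);
}

/* Number of vectors of a wider element type needed to cover VF scalar
   iterations in one vector iteration. */
static size_t vectors_per_iteration(size_t vf, size_t vec_bytes, size_t elem_bytes)
{
    return vf / lanes(vec_bytes, elem_bytes);
}
```

Under these assumptions, 2-byte shorts in a 32-byte vector give VF = 16, which needs two vectors of 4-byte ints and four vectors of 8-byte doubles per iteration. That is consistent with the single vmovdqu, the two vpmovsxwd, and the four vcvtdq2pd / vfmadd231pd in the loop body above.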