https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |101926

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Your benchmark confirms the vectorized variant is slower; on a 7900X it is
both the memory round trip and the gpr->xmm moves that cost. perf shows

      | turn_into_struct():
    1 |   movd       %edi,%xmm1
    3 |   movd       %esi,%xmm4
    4 |   movd       %edx,%xmm0
   95 |   movd       %ecx,%xmm3
    6 |   punpckldq  %xmm4,%xmm1
    2 |   punpckldq  %xmm3,%xmm0
    1 |   movdqa     %xmm1,%xmm2
      |   punpcklqdq %xmm0,%xmm2
    5 |   movaps     %xmm2,-0x18(%rsp)
   63 |   mov        -0x18(%rsp),%rdi
   70 |   mov        -0x10(%rsp),%rsi
   47 |   jmp        400630 <do_smth_with_4_u32>

Note the situation is difficult to rectify - ideally the vectorizer would see
that we require two 64-bit register pieces, but it doesn't - it only sees that
we store into memory.

I'll note the non-vectorized code is also far from optimal. clang produces the
following, which is faster by more than the delta by which the vectorized
version is slower than GCC's scalar variant.

turn_into_struct:                       # @turn_into_struct
        .cfi_startproc
# %bb.0:
                                        # kill: def $ecx killed $ecx def $rcx
                                        # kill: def $esi killed $esi def $rsi
        shlq    $32, %rsi
        movl    %edi, %edi
        orq     %rsi, %rdi
        shlq    $32, %rcx
        movl    %edx, %esi
        orq     %rcx, %rsi
        jmp     do_smth_with_4_u32              # TAILCALL

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101926
[Bug 101926] [meta-bug] struct/complex/other argument passing and return
should be improved
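
For reference, the benchmark presumably has roughly this shape (a minimal
sketch; the struct layout, the void return type, and the extern callee are
assumptions - only the two function names appear in the disassembly above):

#include <stdint.h>

struct four_u32 { uint32_t a, b, c, d; };

/* Assumed to live in another TU so the call is not inlined. */
extern void do_smth_with_4_u32(struct four_u32 s);

/* Four scalar u32 arguments are packed into a struct and passed on; the
   codegen question is how the four GPR values end up in the two 64-bit
   argument registers the SysV ABI uses for this struct. */
void turn_into_struct(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    struct four_u32 s = { a, b, c, d };
    do_smth_with_4_u32(s);
}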
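
For comparison, clang's sequence amounts to combining each pair of u32
arguments into one 64-bit value with a shift and an or, with no stack round
trip. A rough C equivalent of that packing, using a hypothetical callee
do_smth_with_4_u32_raw that takes the two eightbytes directly (illustrative
only; the real code still passes the struct per the ABI):

#include <stdint.h>

/* Hypothetical callee taking the two eightbytes directly. */
extern void do_smth_with_4_u32_raw(uint64_t lo, uint64_t hi);

void turn_into_struct_packed(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint64_t lo = (uint64_t)a | ((uint64_t)b << 32);  /* first eightbyte, %rdi */
    uint64_t hi = (uint64_t)c | ((uint64_t)d << 32);  /* second eightbyte, %rsi */
    do_smth_with_4_u32_raw(lo, hi);
}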