https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |101926

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Your benchmark confirms the vectorized variant is slower; on a 7900X both
the memory round trip and the gpr->xmm moves contribute to it.  perf shows

       |    turn_into_struct():
     1 |      movd       %edi,%xmm1
     3 |      movd       %esi,%xmm4
     4 |      movd       %edx,%xmm0
    95 |      movd       %ecx,%xmm3
     6 |      punpckldq  %xmm4,%xmm1
     2 |      punpckldq  %xmm3,%xmm0
     1 |      movdqa     %xmm1,%xmm2
       |      punpcklqdq %xmm0,%xmm2
     5 |      movaps     %xmm2,-0x18(%rsp)
    63 |      mov        -0x18(%rsp),%rdi
    70 |      mov        -0x10(%rsp),%rsi
    47 |      jmp        400630 <do_smth_with_4_u32>
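
For reference, a minimal sketch of the kind of testcase involved (the names
match the symbols above, but the exact struct layout and signatures are my
assumption; the original benchmark source isn't quoted here):

#include <stdint.h>

/* Four u32 fields: passed in two 64-bit GPRs (%rdi/%rsi) per the
   SysV x86-64 ABI.  */
struct s4u32 { uint32_t a, b, c, d; };

extern void do_smth_with_4_u32 (struct s4u32);

void
turn_into_struct (uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
  struct s4u32 s = { a, b, c, d };
  do_smth_with_4_u32 (s);  /* compiled to the tail call seen above */
}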

note the situation is difficult to rectify: ideally the vectorizer
would see that we require two 64-bit register pieces, but it doesn't - it
only sees that we store into memory.

I'll note the non-vectorized code is also far from optimal.  clang
produces the following, which is faster than GCC's scalar variant by more
than the delta by which the vectorized version is slower than it.

turn_into_struct:                       # @turn_into_struct
        .cfi_startproc
# %bb.0:
                                        # kill: def $ecx killed $ecx def $rcx
                                        # kill: def $esi killed $esi def $rsi
        shlq    $32, %rsi
        movl    %edi, %edi
        orq     %rsi, %rdi
        shlq    $32, %rcx
        movl    %edx, %esi
        orq     %rcx, %rsi
        jmp     do_smth_with_4_u32      # TAILCALL
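
In C terms, clang's sequence packs each pair of 32-bit arguments into one
64-bit register half directly, never touching xmm registers or the stack
(a sketch, assuming the struct layout above):

uint64_t lo = (uint64_t) b << 32 | a;  /* shlq $32,%rsi; orq %rsi,%rdi */
uint64_t hi = (uint64_t) d << 32 | c;  /* shlq $32,%rcx; orq %rcx,%rsi */
/* %rdi/%rsi then already hold the struct, ready for the tail call.  */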


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101926
[Bug 101926] [meta-bug] struct/complex/other argument passing and return should
be improved
