https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104412

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
On x86_64 we get

        movq    %rsi, %xmm0
        movq    %rdi, %xmm1
        punpcklqdq      %xmm1, %xmm0
        ret

> ./cc1 -quiet t.c -I include -O2 -fopt-info-vec 
t.c:10:7: optimized: basic block part vectorized using 16 byte vectors

and at -O1

        movq    %rsi, -24(%rsp)
        movq    %rdi, -16(%rsp)
        movdqa  -24(%rsp), %xmm0
        ret

costing is a bit difficult since we get

t.c:10:7: note: Cost model analysis:
i2_4(D) 1 times scalar_store costs 12 in body
i1_6(D) 1 times scalar_store costs 12 in body
i2_4(D) 1 times vector_store costs 12 in body
<unknown> 1 times vec_construct costs 8 in prologue
t.c:10:7: note: Cost model analysis for part in loop 0:
  Vector cost: 20
  Scalar cost: 24

as we have no idea how costly the construction is (that depends on the
calling convention), nor whether the return of d.v allows us to elide the
store and the load.
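
For reference, t.c is not quoted in this comment. A minimal sketch of a
function with the same shape, reconstructed only from the assembly and the
dump above, might look like the following. The names v2di, union u and foo,
and the element type, are assumptions and not taken from the report.

```c
#include <assert.h>

/* Hypothetical reconstruction: the real t.c is not quoted in the comment. */
typedef long long v2di __attribute__((vector_size(16))); /* GNU C vector extension */

/* One 16-byte vector, matching the "16 byte vectors" note from -fopt-info-vec. */
static_assert(sizeof(v2di) == 16, "v2di should be a 16-byte vector");

union u {
    v2di v;
    long long a[2];
};

v2di foo(long long i1, long long i2)
{
    union u d;
    d.a[0] = i2;   /* first scalar store in the dump (i2_4(D)) */
    d.a[1] = i1;   /* second scalar store in the dump (i1_6(D)) */
    return d.v;    /* reading the union back as a vector (GNU C type punning) */
}
```

With a function like this, the two scalar stores plus the vector reload are
what the -O1 code does literally, while the vectorized form builds the vector
directly from the two incoming argument registers.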
