https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104412
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
On x86_64 we get

        movq    %rsi, %xmm0
        movq    %rdi, %xmm1
        punpcklqdq      %xmm1, %xmm0
        ret

> ./cc1 -quiet t.c -I include -O2 -fopt-info-vec
t.c:10:7: optimized: basic block part vectorized using 16 byte vectors

and at -O1

        movq    %rsi, -24(%rsp)
        movq    %rdi, -16(%rsp)
        movdqa  -24(%rsp), %xmm0
        ret

costing is a bit difficult since we get

t.c:10:7: note: Cost model analysis:
i2_4(D) 1 times scalar_store costs 12 in body
i1_6(D) 1 times scalar_store costs 12 in body
i2_4(D) 1 times vector_store costs 12 in body
<unknown> 1 times vec_construct costs 8 in prologue
t.c:10:7: note: Cost model analysis for part in loop 0:
  Vector cost: 20
  Scalar cost: 24

as we do not have an idea how costly the construction is (depends on
calling conventions) or how the return d.v allows us to elide the store
and the load.
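
For reference, t.c itself is not quoted in this comment. Below is a minimal sketch of a function that would be consistent with the assembly and the cost dump above, assuming a union of two 64-bit scalars and one 16-byte vector. The names v2di, union u, a and build are made up for illustration; only i1, i2 and d.v appear in the dump and the text, and the actual testcase may differ.

```c
/* Hypothetical reconstruction, NOT the testcase attached to the bug.
   Uses the GCC vector extension; long long is assumed to be 64 bits.  */
typedef long long v2di __attribute__ ((vector_size (16)));

union u
{
  long long a[2];   /* two scalar lanes */
  v2di v;           /* the same 16 bytes viewed as one vector */
};

v2di
build (long long i1, long long i2)
{
  union u d;
  d.a[0] = i2;      /* scalar store #1: low lane  (%rsi -> -24(%rsp) at -O1) */
  d.a[1] = i1;      /* scalar store #2: high lane (%rdi -> -16(%rsp) at -O1) */
  return d.v;       /* 16-byte load of what the two stores just wrote */
}
```

With this shape the two scalar_store entries (12 + 12 = 24) are the scalar side of the dump, and the single vector_store plus vec_construct (12 + 8 = 20) are the vector side. Whether this exact sketch reproduces the same dump line for line has not been checked.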