https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91546
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Yes, I believe this is done on purpose. With -Os we generate:
test2:
.LFB5270:
	.cfi_startproc
	vmovd	%edx, %xmm2
	vmovd	%edi, %xmm3
	vpinsrd	$1, %ecx, %xmm2, %xmm1
	vpinsrd	$1, %esi, %xmm3, %xmm0
	movl	%edi, -16(%rsp)
	movl	%edx, -12(%rsp)
	vpunpcklqdq	%xmm1, %xmm0, %xmm0
	ret
eh...
For -Os the variant with three vpinsrd would be 2 bytes shorter. I think
both Intel and AMD have two pipes capable of doing vpinsrd. The code is
also latency bound, at least on Zen, where both movd and pinsrd have a latency
of 3 cycles, so it's 6 cycles (movd + pinsrd) plus the unpack in the GCC
variant, compared to 12 (movd + three dependent pinsrd) in the clang variant.
The ISA is certainly lacking a bit here (an insert-multiple from a contiguous
GPR range; at least two inputs should be doable easily with a destructive
init).
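
To make the latency arithmetic above concrete, here is a minimal sketch in C
of the two critical paths. It only models dependency chains, not throughput
or port pressure. The 3-cycle figures are the Zen numbers quoted above; the
function names and the lat_unpck parameter are hypothetical, since no unpack
latency is given in this comment.

```c
/* Latencies quoted above for Zen; treated as assumptions of this sketch. */
enum { LAT_MOVD = 3, LAT_PINSRD = 3 };

static int max_int(int a, int b) { return a > b ? a : b; }

/* GCC variant: two independent movd+pinsrd chains, joined by one unpack.
   lat_unpck is a hypothetical parameter for the vpunpcklqdq latency. */
static int gcc_variant_latency(int lat_unpck)
{
    int chain_lo = LAT_MOVD + LAT_PINSRD;   /* %edi/%esi half */
    int chain_hi = LAT_MOVD + LAT_PINSRD;   /* %edx/%ecx half */
    return max_int(chain_lo, chain_hi) + lat_unpck;
}

/* Serial variant: one movd followed by three dependent pinsrd. */
static int serial_variant_latency(void)
{
    return LAT_MOVD + 3 * LAT_PINSRD;
}
```

With these numbers the serial chain comes out at 12 cycles, while the split
variant is 6 plus whatever the unpack costs, so it stays ahead as long as the
unpack takes fewer than 6 cycles.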