https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91546
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Yes, I believe this is done on purpose.  With -Os we generate

test2:
.LFB5270:
        .cfi_startproc
        vmovd   %edx, %xmm2
        vmovd   %edi, %xmm3
        vpinsrd $1, %ecx, %xmm2, %xmm1
        vpinsrd $1, %esi, %xmm3, %xmm0
        movl    %edi, -16(%rsp)
        movl    %edx, -12(%rsp)
        vpunpcklqdq     %xmm1, %xmm0, %xmm0
        ret

eh...  For -Os the variant with three vpinsrd would be 2 bytes shorter.

I think both Intel and AMD have two pipes capable of doing vpinsrd.  The
code is also latency bound, at least on Zen, where both movd and pinsrd have
a latency of 3 cycles, so it's 6 + unpck in the GCC variant compared to 12
in the clang variant.

The ISA is certainly lacking a bit here (an insert-multiple from a
contiguous GPR range; at least two inputs should be doable easily with a
destructive init).
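
The latency figures quoted above can be reproduced with a small critical-path model. This is only a sketch: it assumes the clang variant is one movd followed by three dependent vpinsrd, and UNPCK_LATENCY is a placeholder value, since the comment does not give a number for vpunpcklqdq. The function names are made up for illustration.

```c
#include <assert.h>

/* Latencies in cycles. The value 3 for movd and pinsrd on Zen comes from the
   comment above; UNPCK_LATENCY is an assumed placeholder, not a measurement. */
enum { MOVD_LATENCY = 3, PINSRD_LATENCY = 3, UNPCK_LATENCY = 1 };

/* Latency of one movd followed by n_inserts dependent pinsrd instructions. */
static int insert_chain_latency(int n_inserts)
{
    assert(n_inserts >= 0);
    return MOVD_LATENCY + n_inserts * PINSRD_LATENCY;
}

/* GCC variant: two independent movd + pinsrd chains that can run in
   parallel, merged by a single vpunpcklqdq at the end. */
static int gcc_variant_latency(void)
{
    int low_half = insert_chain_latency(1);
    int high_half = insert_chain_latency(1);
    int longest = low_half > high_half ? low_half : high_half;
    return longest + UNPCK_LATENCY;
}

/* Serial variant (assumed to match the clang output): one movd followed by
   three pinsrd, each depending on the previous result. */
static int serial_variant_latency(void)
{
    return insert_chain_latency(3);
}
```

With these inputs, each half of the GCC variant costs 3 + 3 = 6 cycles before the unpack, while the serial chain costs 3 + 3 * 3 = 12 cycles, matching the "6 + unpck" versus "12" comparison.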