https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91546

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Yes, I believe this is done on purpose.  With -Os we generate

test2:
.LFB5270:
        .cfi_startproc
        vmovd   %edx, %xmm2
        vmovd   %edi, %xmm3
        vpinsrd $1, %ecx, %xmm2, %xmm1
        vpinsrd $1, %esi, %xmm3, %xmm0
        movl    %edi, -16(%rsp)
        movl    %edx, -12(%rsp)
        vpunpcklqdq     %xmm1, %xmm0, %xmm0
        ret

eh...

For -Os the variant with three vpinsrd would be 2 bytes shorter.  I think
both Intel and AMD have two pipes capable of doing vpinsrd.  The code is
also latency bound, at least on Zen, where both movd and pinsrd have a latency
of 3 cycles: the GCC variant runs two movd + pinsrd chains in parallel, so it
is 6 cycles plus the unpck, compared to 12 cycles for the serial movd + three
pinsrd chain in the clang variant.  The ISA is certainly lacking a bit here
(an insert-multiple from a contiguous GPR range would help; at least two
inputs should be easily doable with a destructive init).
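
As a rough illustration of the latency arithmetic above, here is a minimal C
sketch of the two dependency chains.  The 3-cycle movd and pinsrd latencies
are the Zen figures quoted in this comment; the 1-cycle unpack latency and
all function and constant names are assumptions made for the example, not
measured values or anything taken from GCC.

```c
#include <assert.h>

/* Latencies in cycles.  movd/pinsrd = 3 are the Zen numbers quoted above;
   LAT_UNPCK = 1 is an assumption for illustration only. */
enum { LAT_MOVD = 3, LAT_PINSRD = 3, LAT_UNPCK = 1 };

int max_int(int a, int b) { return a > b ? a : b; }

/* Serial chain: one movd followed by (n_elems - 1) dependent pinsrd.
   With n_elems == 4 this models the movd + three pinsrd variant. */
int chain_latency(int n_elems) {
    assert(n_elems >= 1);
    return LAT_MOVD + (n_elems - 1) * LAT_PINSRD;
}

/* Split variant: two independent two-element chains that can execute in
   parallel, merged by a single unpack at the end. */
int split_latency(void) {
    int lo = chain_latency(2);  /* movd + pinsrd for the low half  */
    int hi = chain_latency(2);  /* movd + pinsrd for the high half */
    return max_int(lo, hi) + LAT_UNPCK;
}
```

Under these assumptions chain_latency(4) gives 12 cycles and split_latency()
gives 6 cycles plus the unpack, which matches the comparison above.  The
model ignores port contention and the extra stores in the -Os output.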
