rguenth at gcc dot gnu.org Mon, 26 Aug 2019 03:30:14 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91546


--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Yes, I believe this is done on purpose.  With -Os we generate

test2:
.LFB5270:
        .cfi_startproc
        vmovd   %edx, %xmm2
        vmovd   %edi, %xmm3
        vpinsrd $1, %ecx, %xmm2, %xmm1
        vpinsrd $1, %esi, %xmm3, %xmm0
        movl    %edi, -16(%rsp)
        movl    %edx, -12(%rsp)
        vpunpcklqdq     %xmm1, %xmm0, %xmm0
        ret

eh...
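
The source of test2 is not quoted in this comment. As a rough sketch only, a function of the following shape would need this kind of four-element vector init. It assumes the GNU C vector extension, and the type name v4si and the parameter names are hypothetical. The element order follows the listing above, with %edi and %esi in the low half and %edx and %ecx in the high half.

```c
/* Hypothetical reconstruction: the real test2 source is not shown in this comment. */
typedef int v4si __attribute__((vector_size(16)));  /* GNU C vector extension */

v4si test2(int a, int b, int c, int d)
{
    /* a, b land in the low 64 bits and c, d in the high 64 bits,
       matching the two movd/pinsrd pairs joined by vpunpcklqdq. */
    return (v4si){ a, b, c, d };
}
```

Building something like this with -Os and -mavx (or -msse4.1 for the non-VEX forms) should exercise the same vector-init expansion, though the exact instruction selection depends on the compiler version.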

For -Os the variant with three vpinsrd would be 2 bytes shorter.  I think
both Intel and AMD have two pipes capable of doing vpinsrd.  The code is
also latency bound, at least on Zen, where both movd and pinsrd have a latency
of 3 cycles: the GCC variant has two independent movd/pinsrd chains, so it
costs 6 cycles plus the unpck, compared to 12 cycles for the single dependent
chain in the clang variant.  The ISA is certainly lacking a bit here
(an insert-multiple from a contiguous GPR range would help; at least two
inputs should be doable easily with a destructive init).
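
The latency comparison above can be written out as a small worked calculation. This is only a sketch: the 3-cycle figures for movd and pinsrd are the Zen numbers quoted above, while the 1-cycle unpck latency is an assumption for illustration, not a number from this comment.

```c
/* Latencies in cycles. MOVD and PINSRD are the Zen figures quoted above;
   UNPCK_LATENCY is an assumed placeholder, not a measured value. */
enum { MOVD_LATENCY = 3, PINSRD_LATENCY = 3, UNPCK_LATENCY = 1 };

/* GCC variant: two independent movd -> pinsrd chains that can overlap,
   joined at the end by a single vpunpcklqdq. */
int gcc_variant_latency(void)
{
    return MOVD_LATENCY + PINSRD_LATENCY + UNPCK_LATENCY;
}

/* Three-vpinsrd variant: one movd followed by three inserts into the
   same register, so every instruction waits for the previous one. */
int pinsrd_chain_latency(void)
{
    return MOVD_LATENCY + 3 * PINSRD_LATENCY;
}
```

Under these assumptions the split variant comes out well ahead on the critical path, which is the trade-off against the 2 bytes of code size.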

[Bug target/91546] Better solution for VEC_INIT under TARGET_SSE4_1 since PINSRB/PINSRD/PINSRQ
