https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
There's another thing - we end up with

        vmovq   %rax, %xmm3
        vpinsrq $1, %rdx, %xmm3, %xmm0

but that has way worse latency than the alternative you'd get w/o SSE 4.1:

        vmovq   %rax, %xmm3
        vmovq   %rdx, %xmm7
        punpcklqdq  %xmm7, %xmm3

for example on Zen3 vmovq and vpinsrq have latencies of 3 while punpcklqdq
has a latency of only one.  The two vmovq in the second variant are
independent and can execute in parallel, so its critical path is
3 + 1 = 4 cycles vs. 3 + 3 = 6 for the first, i.e. 2 cycles less latency.

Testcase:

typedef long v2di __attribute__((vector_size(16)));

/* With SSE4.1 enabled this is emitted as vmovq + vpinsrq.  */
v2di foo (long a, long b)
{
  return (v2di){a, b};
}

Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 5 vs. 3.  Not
sure if we should do this late (as a peephole or splitter) since it
requires one more %xmm register.
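
For comparison, here is a minimal intrinsics sketch (my own illustration,
not taken from the PR) of the two construction strategies; it assumes an
x86-64 target and the SSE2 intrinsics from <emmintrin.h>.  _mm_set_epi64x
typically ends up as the vmovq + vpinsrq form when SSE4.1/AVX is enabled,
while spelling out the unpack corresponds to the punpcklqdq form:

#include <emmintrin.h>

/* Variant 1: let the compiler build the vector; with SSE4.1 this is
   usually vmovq + vpinsrq.  */
__m128i
make_v2di_insert (long long a, long long b)
{
  return _mm_set_epi64x (b, a);   /* element 0 = a, element 1 = b */
}

/* Variant 2: move both GPRs into separate xmm registers and merge them
   with punpcklqdq, trading one extra %xmm register for less latency.  */
__m128i
make_v2di_unpack (long long a, long long b)
{
  __m128i lo = _mm_cvtsi64_si128 (a);   /* GPR -> xmm move */
  __m128i hi = _mm_cvtsi64_si128 (b);   /* GPR -> xmm move */
  return _mm_unpacklo_epi64 (lo, hi);   /* punpcklqdq */
}

Whether the compiler keeps the two-move form or canonicalizes it back to
vpinsrq depends on the target flags; the sketch is only meant to make the
latency comparison concrete.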
