[Bug target/87599] Broadcasting scalar to vector uses stack unnecessarily on x86

amonakov at gcc dot gnu.org Sat, 13 Oct 2018 03:37:22 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87599


--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
I think we should use punpcklqdq here rather than movddup, because (at least on
Intel) it has the same latency and the same or better throughput. It may be OK
to use movddup when broadcasting from a memory source, but for reg-to-reg
broadcasting we really should prefer punpcklqdq.

Why isn't IRA using the first alternative? If I tweak the testcase like this I
get the expected code, so why isn't it working properly without the asm?

typedef long T __attribute__((vector_size(16)));
T f(long v)
{
    asm("# %0" :: "x"(v));
    return (T){v, v};
}

gcc -O2 -mtune=intel -msse3

f:
        movq    %rdi, %xmm0
#APP
        # %xmm0
#NO_APP
        punpcklqdq      %xmm0, %xmm0
        ret
