https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87599
--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
I think we should use punpcklqdq here rather than movddup, because (at least on
Intel) it has the same latency and the same or better throughput. It may be OK
to use movddup when broadcasting from a memory source, but for reg-to-reg
broadcasting we really should prefer punpcklqdq.
Why isn't IRA using the first alternative? If I tweak the testcase like this I
get the expected code, so why isn't it working properly without the asm?
typedef long T __attribute__((vector_size(16)));
T f(long v)
{
    asm("# %0" :: "x"(v));
    return (T){v, v};
}
gcc -O2 -mtune=intel -msse3
f:
        movq    %rdi, %xmm0
#APP
        # %xmm0
#NO_APP
        punpcklqdq      %xmm0, %xmm0
        ret