https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87599
--- Comment #6 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Alexander Monakov from comment #5)
> I think we should use punpcklqdq here rather than movddup, because (at least
> on Intel) it has same latency, and same-or-better throughput. It may be ok
> to use movddup when broadcasting from a memory source, but for reg-to-reg
> broadcasting we really should prefer punpcklqdq.
>
> Why isn't IRA using the first alternative? If I tweak the testcase like this
> I get the expected code, so why isn't it working properly without the asm?
>
> typedef long T __attribute__((vector_size(16)));
> T f(long v)
> {
>   asm("# %0" :: "x"(v));
>   return (T){v, v};
> }
>
> gcc -O2 -mtune=intel -msse3
>
> f:
>         movq    %rdi, %xmm0
> #APP
>         # %xmm0
> #NO_APP
>         punpcklqdq      %xmm0, %xmm0
>         ret

When SSE3 is enabled, a memory source has a lower cost,
since the SSE3 alternative doesn't allow a register
source.
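
One way to compare the two cases side by side is a minimal sketch like the one below. The function names f_reg and f_mem are hypothetical and not part of the original testcase; f_mem is only an assumed variant that broadcasts from a memory source. The generated code depends on the GCC version and tuning, so no expected assembly is shown.

```c
/* Sketch only: f_reg and f_mem are illustrative names, not from the report. */
typedef long T __attribute__((vector_size(16)));

/* Broadcast from a register source (the case discussed in comment #5,
   without the asm statement). */
T f_reg(long v)
{
  return (T){v, v};
}

/* Broadcast from a memory source (assumed variant for comparison). */
T f_mem(const long *p)
{
  return (T){*p, *p};
}
```

Compiling this with the options from comment #5 (gcc -O2 -mtune=intel -msse3) plus -S lets the instruction chosen for each function be inspected directly.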