https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87599
--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
I think we should use punpcklqdq here rather than movddup, because (at least on
Intel) it has same latency, and same-or-better throughput. It may be ok to use
movddup when broadcasting from a memory source, but for reg-to-reg broadcasting
we really should prefer punpcklqdq.

Why isn't IRA using the first alternative? If I tweak the testcase like this I
get the expected code, so why isn't it working properly without the asm?

typedef long T __attribute__((vector_size(16)));
T f(long v)
{
    asm("# %0" :: "x"(v));
    return (T){v, v};
}

gcc -O2 -mtune=intel -msse3

f:
        movq    %rdi, %xmm0
#APP
        # %xmm0
#NO_APP
        punpcklqdq      %xmm0, %xmm0
        ret
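
For side-by-side comparison, here is a minimal sketch that puts both variants in one file. It assumes the original testcase is the same function without the asm statement; the names broadcast_plain, broadcast_pinned and check_broadcast are hypothetical and do not come from the PR. Both functions return the same value, so only the generated instructions should differ.

```c
#include <assert.h>

/* Sketch for comparing code generation. The function names are
   hypothetical and are not taken from the PR. */
typedef long T __attribute__((vector_size(16)));

/* Broadcast with no hint to the register allocator (assumed to match
   the original testcase). */
T broadcast_plain(long v)
{
    return (T){v, v};
}

/* Same broadcast, but the asm operand uses the "x" constraint, so v must
   already be in an SSE register before the vector is built. */
T broadcast_pinned(long v)
{
#if defined(__x86_64__)
    __asm__("# %0" :: "x"(v));
#endif
    return (T){v, v};
}

/* Sanity check: both variants must put v in the first two lanes. */
void check_broadcast(long v)
{
    T a = broadcast_plain(v);
    T b = broadcast_pinned(v);
    assert(a[0] == v && a[1] == v);
    assert(b[0] == v && b[1] == v);
}
```

Compiling this with the flags quoted above plus -S and reading the two function bodies should show which alternative was picked in each case; the exact output will depend on the compiler version.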