[Bug target/87599] Broadcasting scalar to vector uses stack unnecessarily on x86

hjl.tools at gmail dot com Sat, 13 Oct 2018 19:09:01 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87599


--- Comment #6 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Alexander Monakov from comment #5)
> I think we should use punpcklqdq here rather than movddup, because (at least
> on Intel) it has same latency, and same-or-better throughput. It may be ok
> to use movddup when broadcasting from a memory source, but for reg-to-reg
> broadcasting we really should prefer punpcklqdq.
> 
> Why isn't IRA using the first alternative? If I tweak the testcase like this
> I get the expected code, so why isn't it working properly without the asm?
> 
> typedef long T __attribute__((vector_size(16)));
> T f(long v)
> {
>     asm("# %0" :: "x"(v));
>     return (T){v, v};
> }
> 
> gcc -O2 -mtune=intel -msse3
> 
> f:
>         movq    %rdi, %xmm0
> #APP
>         # %xmm0
> #NO_APP
>         punpcklqdq      %xmm0, %xmm0
>         ret

When SSE3 is enabled, the memory source has a
lower cost because the SSE3 alternative does
not allow a register source.
