https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108141

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Yeah.
For the PR64110:
typedef short V __attribute__((vector_size (32)));
V
foo (short x)
{
  return (V) { x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x };
}
we emit with -m64 -O2 -mavx2
        vmovd   %edi, %xmm0
        vpbroadcastw    %xmm0, %ymm0
which I think is right, and for -m32 -O2 -mavx2
        vpbroadcastw    8(%ebp), %ymm0
which again is optimal.
Though, in the pr64110.c test we went with -O3 -march=core-avx2 -m32
from r219682:
        movzwl  (%ecx), %esi
        movw    %si, -28(%ebp)
        vpbroadcastw    -28(%ebp), %ymm1
        vmovdqa %ymm1, -88(%ebp)
to r219683:
        movzwl  (%ecx), %esi
        vmovd   %esi, %xmm1
        vpbroadcastw    %xmm1, %ymm1
        vmovdqa %ymm1, -88(%ebp)
to r13-4726:
        movzwl  8(%ebp), %edi
        vmovd   %edi, %xmm1
        vpbroadcastw    %xmm1, %xmm1
        vmovdqa %xmm1, 32(%esp)
to r13-4727:
        movzwl  8(%ebp), %eax
        movw    %ax, 30(%esp)
        vpbroadcastw    30(%esp), %xmm2
        vmovdqa %xmm2, (%esp)
while
        vpbroadcastw    8(%ebp), %xmm2
        vmovdqa %xmm2, (%esp)
would be best.  From this POV I think r13-4727 is actually a step backwards,
because previously we were at least loading it into a GPR, moving it to SSE and
broadcasting there, while now we move it into a GPR, spill it to memory and
broadcast from memory.
Before combine we have:
(insn 2 8 3 2 (set (reg:SI 120 [ x ])
        (mem/c:SI (reg/f:SI 16 argp) [2 x+0 S4 A32])) "pr64110.c":11:1 83 {*movsi_internal}
     (nil))
(insn 3 2 4 2 (set (reg/v:HI 119 [ x ])
        (subreg:HI (reg:SI 120 [ x ]) 0)) "pr64110.c":11:1 84 {*movhi_internal}
     (expr_list:REG_DEAD (reg:SI 120 [ x ])
        (nil)))
...
and in another bb
(insn 63 140 35 3 (set (reg:V8HI 140)
        (vec_duplicate:V8HI (reg/v:HI 119 [ x ]))) "pr64110.c":16:7 7985 {*vec_dupv8hi}
     (nil))
(insn 35 63 18 3 (set (reg:V16HI 141 [ vect_cst__52 ])
        (vec_duplicate:V16HI (reg/v:HI 119 [ x ]))) 7984 {*vec_dupv16hi}
     (nil))
so I bet that (pseudo 119 being set in one bb and used by two vec_duplicates
in another) is the reason why combine doesn't merge those into just the
broadcast.
As for the xmm vs. ymm, it is only loop invariant motion that moves those 2
dups (insn 63 and 35) next to each other, and the question is what kind of
optimization pass could figure out that insn 35 is a superset of insn 63 and
change insn 63 into a lowpart subreg of insn 35's result, i.e. set pseudo 140
from the low half of 141.
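
To illustrate the "superset" observation at the source level, here is a
minimal sketch (not taken from pr64110.c; the names V8, V16 and
low_half_matches are made up for this example).  It only checks that the low
8 lanes of a 16-lane broadcast hold the same bits as an 8-lane broadcast of
the same value, which is the property such a pass would have to rely on:

```c
#include <string.h>

/* Hypothetical typedefs for this sketch: 8 and 16 lanes of short.  */
typedef short V8 __attribute__((vector_size (16)));
typedef short V16 __attribute__((vector_size (32)));

/* Return 1 if the low half of a 16-lane broadcast of X is bit-identical
   to an 8-lane broadcast of X, 0 otherwise.  */
int
low_half_matches (short x)
{
  V16 wide = { x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x };
  V8 narrow = { x, x, x, x, x, x, x, x };
  V8 low;

  /* Copy the first 16 bytes (lanes 0..7) of the wide vector; this plays
     the role of the lowpart subreg in the RTL above.  */
  memcpy (&low, &wide, sizeof (low));
  return memcmp (&low, &narrow, sizeof (low)) == 0;
}
```

Whether GCC actually shares the two broadcasts for code like this depends on
the options and the revision; the sketch demonstrates only the equivalence,
not the optimization.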
