https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108141
--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Yeah.
For the PR64110:
typedef short V __attribute__((vector_size (32)));
V foo (short x) { return (V) { x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x }; }
we emit with -m64 -O2 -mavx2
        vmovd   %edi, %xmm0
        vpbroadcastw    %xmm0, %ymm0
which is I think right and for -m32 -O2 -mavx2
        vpbroadcastw    8(%ebp), %ymm0
which again is optimal.
Though, in the pr64110.c test we went with -O3 -march=core-avx2 -m32 from
r219682:
        movzwl  (%ecx), %esi
        movw    %si, -28(%ebp)
        vpbroadcastw    -28(%ebp), %ymm1
        vmovdqa %ymm1, -88(%ebp)
to r219683:
        movzwl  (%ecx), %esi
        vmovd   %esi, %xmm1
        vpbroadcastw    %xmm1, %ymm1
        vmovdqa %ymm1, -88(%ebp)
to r13-4726:
        movzwl  8(%ebp), %edi
        vmovd   %edi, %xmm1
        vpbroadcastw    %xmm1, %xmm1
        vmovdqa %xmm1, 32(%esp)
to r13-4727:
        movzwl  8(%ebp), %eax
        movw    %ax, 30(%esp)
        vpbroadcastw    30(%esp), %xmm2
        vmovdqa %xmm2, (%esp)
while
        vpbroadcastw    8(%ebp), %xmm2
        vmovdqa %xmm2, (%esp)
would be best.
From this POV I think r13-4727 is actually a step backwards because previously
we were at least loading it into GPR, moving to SSE and broadcasting there,
while now we move into GPR, spill to memory and broadcast from memory.

Before combine we have:
(insn 2 8 3 2 (set (reg:SI 120 [ x ])
        (mem/c:SI (reg/f:SI 16 argp) [2 x+0 S4 A32])) "pr64110.c":11:1 83 {*movsi_internal}
     (nil))
(insn 3 2 4 2 (set (reg/v:HI 119 [ x ])
        (subreg:HI (reg:SI 120 [ x ]) 0)) "pr64110.c":11:1 84 {*movhi_internal}
     (expr_list:REG_DEAD (reg:SI 120 [ x ])
        (nil)))
...
and in another bb
(insn 63 140 35 3 (set (reg:V8HI 140)
        (vec_duplicate:V8HI (reg/v:HI 119 [ x ]))) "pr64110.c":16:7 7985 {*vec_dupv8hi}
     (nil))
(insn 35 63 18 3 (set (reg:V16HI 141 [ vect_cst__52 ])
        (vec_duplicate:V16HI (reg/v:HI 119 [ x ]))) 7984 {*vec_dupv16hi}
     (nil))
so I bet that is the reason why combine doesn't merge those into just the
broadcast.

As for the xmm vs. ymm, it is only loop-invariant that moves those 2 dups
(insn 63 and 35) next to each other, and the question is what kind of
optimization pass could figure out that insn 35 is a superset of insn 63 and
change it into insn 35 + lowpart subreg to set pseudo 140 from low half of 141.
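
To make the "insn 35 + lowpart subreg" idea concrete, here is a minimal source-level sketch of the same equivalence, written with GNU C vector extensions. It is not the pr64110.c test and not anything GCC does today; the names dup8, dup16, dup8_from_dup16, low_half_matches and check_low_half are hypothetical, and memcpy is only an assumed stand-in for the lowpart subreg.

```c
#include <assert.h>
#include <string.h>

/* GNU C vector extension types; they correspond to the V8HI and V16HI
   modes in the RTL above.  */
typedef short V8 __attribute__((vector_size (16)));
typedef short V16 __attribute__((vector_size (32)));

/* Results are stored through pointers so the sketch also compiles
   without -mavx2 and without vector-ABI warnings.  */
static void
dup8 (short x, V8 *out)
{
  /* Narrow broadcast, like insn 63.  */
  *out = (V8) { x, x, x, x, x, x, x, x };
}

static void
dup16 (short x, V16 *out)
{
  /* Wide broadcast, like insn 35.  */
  *out = (V16) { x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x };
}

/* Hand-written form of the suggested rewrite: perform only the wide
   broadcast and take its first 16 bytes (elements 0..7) as the narrow
   result.  */
static void
dup8_from_dup16 (short x, V8 *out)
{
  V16 wide;
  dup16 (x, &wide);
  memcpy (out, &wide, sizeof *out);
}

/* Returns 1 if the direct and the derived narrow broadcasts agree.  */
static int
low_half_matches (short x)
{
  V8 direct, derived;
  dup8 (x, &direct);
  dup8_from_dup16 (x, &derived);
  return memcmp (&direct, &derived, sizeof direct) == 0;
}

/* Aborts if the two forms ever disagree for X.  */
static void
check_low_half (short x)
{
  assert (low_half_matches (x));
}
```

This only shows that the two forms compute the same value; whether a given GCC revision emits one broadcast or two for such code would have to be checked in the generated assembly, e.g. with -O2 -mavx2 -S.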