https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108141
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #3)
[...]
> ... From this POV I think r13-4727 is actually a step backwards
> because previously we were at least loading it into GPR, moving to SSE and
> broadcasting there,
> while now we move into GPR, spill to memory and broadcast from memory.
> Before combine we have:
> (insn 2 8 3 2 (set (reg:SI 120 [ x ])
>         (mem/c:SI (reg/f:SI 16 argp) [2 x+0 S4 A32])) "pr64110.c":11:1 83 {*movsi_internal}
>      (nil))
> (insn 3 2 4 2 (set (reg/v:HI 119 [ x ])
>         (subreg:HI (reg:SI 120 [ x ]) 0)) "pr64110.c":11:1 84 {*movhi_internal}
>      (expr_list:REG_DEAD (reg:SI 120 [ x ])
>         (nil)))
> ...
> and in another bb
> (insn 63 140 35 3 (set (reg:V8HI 140)
>         (vec_duplicate:V8HI (reg/v:HI 119 [ x ]))) "pr64110.c":16:7 7985 {*vec_dupv8hi}
>      (nil))
> (insn 35 63 18 3 (set (reg:V16HI 141 [ vect_cst__52 ])
>         (vec_duplicate:V16HI (reg/v:HI 119 [ x ]))) 7984 {*vec_dupv16hi}
>      (nil))
> so I bet that is the reason why combine doesn't merge those into just the
> broadcast.

Yep.  And probably fwprop doesn't consider MEMs (or even two defs) at all.
I suppose we don't want to combine insn 2 + 3 into a HImode MEM by itself?
OTOH there's no fwprop after combine.

> As for the xmm vs. ymm, it is only loop-invariant that moves those 2 dups
> (insn 63 and 35) next to each other, and the question is what kind of
> optimization pass could figure out that insn 35 is a superset of insn 63 and
> change it into insn 35 + lowpart subreg to set pseudo 140 from low half of
> 141.

There's only a peephole or alternatively a scheduling heuristic + CSE (we
need the V16HI duplicate before the V8HI one) I can think of.  CSE could
also tentatively record "larger" computations and modify the earlier stmt
if uses of that larger computation appear.
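
For illustration only, the "superset" relation between the two duplicates
can be sketched in C with GCC vector extensions.  This is not the
pr64110.c testcase; the type and function names (v8hi, v16hi, dup8, dup16,
low_half, low_half_matches_dup8) are made up for the example.  It merely
checks that the low 16 bytes of a 16-lane broadcast equal the 8-lane
broadcast of the same value, which is what a lowpart subreg of pseudo 141
would provide for pseudo 140.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names mirroring the V8HI / V16HI modes in the RTL dump. */
typedef int16_t v8hi  __attribute__((vector_size(16)));
typedef int16_t v16hi __attribute__((vector_size(32)));

/* Model of (vec_duplicate:V8HI x): broadcast x into 8 lanes. */
static void dup8(int16_t x, v8hi *out)
{
    for (int i = 0; i < 8; i++)
        (*out)[i] = x;
}

/* Model of (vec_duplicate:V16HI x): broadcast x into 16 lanes. */
static void dup16(int16_t x, v16hi *out)
{
    for (int i = 0; i < 16; i++)
        (*out)[i] = x;
}

/* Model of a lowpart subreg: take the low 16 bytes of the wide vector. */
static void low_half(const v16hi *wide, v8hi *out)
{
    memcpy(out, wide, sizeof *out);
}

/* Returns 1 if the low half of the wide broadcast equals the narrow one. */
int low_half_matches_dup8(int16_t x)
{
    v16hi wide;
    v8hi narrow, low;

    dup16(x, &wide);
    dup8(x, &narrow);
    low_half(&wide, &low);
    return memcmp(&narrow, &low, sizeof narrow) == 0;
}
```

Since every lane holds the same value, the result does not depend on
which half is taken; the sketch says nothing about which pass should
perform the rewrite.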