https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109072
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to rsand...@gcc.gnu.org from comment #1)
> (In reply to Tamar Christina from comment #0)
> > The SLP costs went from:
> >
> > Vector cost: 2
> > Scalar cost: 4
> >
> > to:
> >
> > Vector cost: 12
> > Scalar cost: 4
> >
> > it looks like it's no longer costing it as a duplicate but instead 4 vec
> > inserts.
> We do cost it as a duplicate, but we only try to vectorize up to
> the stores, rather than up to the load back.  So we're costing
> the difference between:
>
>         fmov    s1, s0
>         stp     s1, s1, [x0]
>         stp     s1, s1, [x0, 8]
>
> (no idea why we have an fmov, pretend we don't) and:
>
>         fmov    s1, s0
>         dup     v1.4s, v1.s[0]
>         str     q1, [x0]
>
> If we want the latter as a general principle, the PR is
> easy to fix.  But if we don't, we'd need to make the
> vectoriser start at the load or (alternatively) fold
> to a constructor independently of vectorisation.

Just to clarify, the vectorizer sees

  <bb 2> [local count: 1073741824]:
  data[0] = res_2(D);
  data[1] = res_2(D);
  data[2] = res_2(D);
  data[3] = res_2(D);
  _7 = MEM <__Float32x4_t> [(float * {ref-all})&data];
  data ={v} {CLOBBER(eol)};
  return _7;

and indeed the SLP vectorizer does not consider vector typed loads as
"sinks" to start SLP discovery from.  We could handle those the same as
CONSTRUCTOR, but then SLP discovery isn't prepared to follow "memory
edges" (for must-aliases).  The question here would be whether for
example SRA could have elided 'data', materializing the vector load as
a CONSTRUCTOR (I also have an old VN patch that would do this, but it
has profitability issues so I never pushed it).  Whatever you do with
cost heuristics you'll find a testcase where that regresses.