https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109072

rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2023-03-09
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #1 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #0)
> The SLP costs went from:
>
> Vector cost: 2
> Scalar cost: 4
>
> to:
>
> Vector cost: 12
> Scalar cost: 4
>
> it looks like it's no longer costing it as a duplicate but instead 4 vec
> inserts.

We do cost it as a duplicate, but we only try to vectorize up to the
stores, rather than up to the load back.  So we're costing the
difference between:

        fmov    s1, s0
        stp     s1, s1, [x0]
        stp     s1, s1, [x0, 8]

(no idea why we have an fmov, pretend we don't) and:

        fmov    s1, s0
        dup     v1.4s, v1.s[0]
        str     q1, [x0]

If we want the latter as a general principle, the PR is easy to fix.
But if we don't, we'd need to make the vectoriser start at the load
or (alternatively) fold to a constructor independently of vectorisation.
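
For readers following along, here is a minimal C sketch of the two store shapes being compared above. It is not the testcase from the PR: the function names store_scalar and store_dup are hypothetical, and the mapping to the stp and dup/str sequences is only an assumption about what a compiler might emit for each shape.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical illustration only; not the testcase from the PR. */

/* Scalar shape: four separate element stores of the same value.
   Roughly what the two stp instructions in the comment perform. */
void store_scalar(float *dst, float x)
{
    dst[0] = x;
    dst[1] = x;
    dst[2] = x;
    dst[3] = x;
}

/* Vector-like shape: build a four-lane duplicate first, then write it
   out with one 16-byte copy. Roughly the dup followed by str q1. */
void store_dup(float *dst, float x)
{
    float lanes[4] = { x, x, x, x };
    memcpy(dst, lanes, sizeof lanes);
}
```

Both functions leave the same four floats in memory, so the choice between them is purely a cost question, which is what the SLP cost comparison quoted in the comment is deciding.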