https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109072

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to rsand...@gcc.gnu.org from comment #1)
> (In reply to Tamar Christina from comment #0)
> > The SLP costs went from:
> > 
> >   Vector cost: 2
> >   Scalar cost: 4
> > 
> > to:
> > 
> >   Vector cost: 12
> >   Scalar cost: 4
> > 
> > it looks like it's no longer costing it as a duplicate but instead 4 vec
> > inserts.
> We do cost it as a duplicate, but we only try to vectorize up to
> the stores, rather than up to the load back.  So we're costing
> the difference between:
> 
>         fmov    s1, s0
>         stp     s1, s1, [x0]
>         stp     s1, s1, [x0, 8]
> 
> (no idea why we have an fmov, pretend we don't) and:
> 
>         fmov    s1, s0
>         dup     v1.4s, v1.s[0]
>         str     q1, [x0]
> 
> If we want the latter as a general principle, the PR is
> easy to fix.  But if we don't, we'd need to make the
> vectoriser start at the load or (alternatively) fold
> to a constructor independently of vectorisation.

Just to clarify, the vectorizer sees

  <bb 2> [local count: 1073741824]:
  data[0] = res_2(D);
  data[1] = res_2(D);
  data[2] = res_2(D);
  data[3] = res_2(D);
  _7 = MEM <__Float32x4_t> [(float * {ref-all})&data];
  data ={v} {CLOBBER(eol)};
  return _7;
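For reference, source code of roughly the following shape produces that GIMPLE: four scalar stores to a stack array followed by a vector-typed load back. This is a hypothetical reconstruction (the actual testcase is attached to the PR), written with GCC's generic vector extension rather than arm_neon.h so it is target-independent:

```c
/* Sketch of the kind of code behind this PR: the name 'splat' and
   the use of generic vectors instead of float32x4_t are assumptions. */
typedef float v4sf __attribute__((vector_size(16)));

v4sf splat(float res)
{
    float data[4];
    /* Four scalar stores -- what SLP discovery starts from.  */
    data[0] = res;
    data[1] = res;
    data[2] = res;
    data[3] = res;
    /* Vector-typed load back -- not treated as an SLP "sink".  */
    v4sf v;
    __builtin_memcpy(&v, data, sizeof v);
    return v;
}
```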

and indeed the SLP vectorizer does not consider vector-typed loads as
"sinks" to start SLP discovery from.  We could handle those the same
way as CONSTRUCTORs, but then SLP discovery isn't prepared to follow
"memory edges" (for must-aliases).  The question here would be
whether, for example, SRA could have elided 'data', materializing
the vector load as a CONSTRUCTOR (I also have an old VN patch that
would do this, but it has profitability issues, so I never pushed it).
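If SRA (or VN) did elide 'data', the result would be equivalent to building the vector directly with a CONSTRUCTOR, which expands to a single dup/splat. A minimal sketch of that folded form, again using generic vectors (function name is an assumption):

```c
/* Hypothetical folded form: the stack array is gone and the vector
   is built directly, so no store/load round-trip remains to cost.  */
typedef float v4sf __attribute__((vector_size(16)));

v4sf splat_folded(float res)
{
    return (v4sf){res, res, res, res};
}
```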

Whatever you do with the cost heuristics, you'll find a testcase where
that choice regresses.
