https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110935

--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 5 Sep 2023, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110935
>
> --- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
> If we were going to do this in vect_optimize_slp_pass, I think
> we'd need a node for the reduction in the pass's internal graph.
> We could then record that all input layouts have zero cost.
>
> What's the reason for not having an SLP node for the reduction?
> Isn't it a similar kind of sink to a store or constructor?

The difference is that the reduction reduces the number of incoming
lanes (to one). For a loop SLP reduction chain we also do not have
an SLP node for that part (because it's in the epilog). For a loop
SLP reduction there isn't a reduction operation. In both cases we
manage to elide permutes into them. I wondered how we do that in the
new code and whether we can leverage that for the BB reduction case.

I did think of representing the reduction op but wondered how to do
that in the most sensible way. It's a kind of permute node with an
associated operation. Or, if we use .REDUC_*_SCAL, a regular node with
a scalar vectype? I'm not sure we want to overload the VEC_PERM_EXPR
SLP node further. But for example on x86 we have a SAD operation with
4 incoming lanes in op0, 16 incoming lanes in op1 and 4 outgoing lanes.

That said, currently the reduction node is implicit in the instance
root stmt and can be identified by the SLP instance kind only.
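
The argument above rests on one property. A reduction that collapses all lanes into a single scalar does not care about the order of its input lanes, so a permute feeding it is redundant. A lane-reducing operation such as SAD has the same property as long as both of its byte inputs are permuted consistently. Its individual output lanes change, but the final reduction over them does not. What follows is a minimal Python model of that property, not GCC code, and it rests on several assumptions:

- `reduc_plus`, `vec_perm`, `sad_16_to_4` and `permute_is_redundant` are hypothetical names.
- Lanes are plain integers. With floating-point lanes the equivalence only holds when reassociation is allowed, e.g. under `-fassociative-math`.
- The 4-lane accumulator and the grouping of 16 input lanes into 4 output lanes in `sad_16_to_4` are arbitrary illustrative choices. They do not describe the exact x86 instruction layout.

```python
def reduc_plus(lanes):
    """Model of a full reduction: all lanes -> one scalar
    (roughly what a .REDUC_PLUS_SCAL-style operation computes)."""
    return sum(lanes)


def vec_perm(lanes, sel):
    """Model of a lane shuffle: out[i] = lanes[sel[i]]."""
    return [lanes[i] for i in sel]


def sad_16_to_4(acc, a, b):
    """Hypothetical lane-reducing SAD: a 4-lane accumulator plus two
    16-lane byte inputs in, 4 lanes out.  Input lanes 4*j..4*j+3 are
    summed into output lane j; this grouping is illustrative only."""
    assert len(acc) == 4 and len(a) == len(b) == 16
    out = list(acc)
    for i in range(16):
        out[i // 4] += abs(a[i] - b[i])
    return out


def permute_is_redundant(lanes, sel):
    """True if dropping the permute in front of a full reduction
    leaves the reduction result unchanged."""
    return reduc_plus(vec_perm(lanes, sel)) == reduc_plus(lanes)


if __name__ == "__main__":
    rev = list(range(15, -1, -1))  # reverse-lane permutation
    v = [3 * i - 7 for i in range(16)]
    print(reduc_plus(v), reduc_plus(vec_perm(v, rev)))  # 248 248

    a = [17 * i for i in range(16)]
    b = [255 - 5 * i for i in range(16)]
    plain = sad_16_to_4([0] * 4, a, b)
    shuffled = sad_16_to_4([0] * 4, vec_perm(a, rev), vec_perm(b, rev))
    # Output lanes differ, the final reduction over them does not.
    print(plain, shuffled)  # [888, 536, 184, 168] [168, 184, 536, 888]
    print(reduc_plus(plain), reduc_plus(shuffled))  # 1776 1776
```

In layout-cost terms, this is why recording "all input layouts have zero cost" for a full reduction is safe. A lane-reducing node like SAD only reaches that state once its output is itself fully reduced.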