[Bug tree-optimization/110935] Missed BB reduction vectorization because of missed eliding of a permute

rguenther at suse dot de via Gcc-bugs Tue, 12 Sep 2023 00:43:49 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110935

--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 5 Sep 2023, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110935
> 
> --- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
> If we were going to do this in vect_optimize_slp_pass, I think
> we'd need a node for the reduction in the pass's internal graph.
> We could then record that all input layouts have zero cost.
> 
> What's the reason for not having an SLP node for the reduction?
> Isn't it a similar kind of sink to a store or constructor?

The difference is that the reduction reduces the number of incoming
lanes (to one).  For a loop SLP reduction chain we also do not have an SLP
node for that part (because it's in the epilog).  For a loop SLP
reduction there isn't a reduction operation.  For both cases we manage
to elide permutes into them - I wondered how we do that in the new code
and if we can leverage that for the BB reduction case.
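To illustrate why a permute feeding a reduction can be elided, here is a
minimal scalar sketch.  It is not the testcase from this bug, and the
function name reduce_add_permuted is hypothetical.  It assumes an
unsigned integer add reduction, where any lane order gives the same
result; a floating-point reduction would only allow this when
reassociation is permitted.

```c
#include <stddef.h>

/* Hypothetical scalar model of a lane reduction: sum the N lanes of V,
   visiting them in the order given by PERM (a permutation of 0..N-1).
   Unsigned addition is associative and commutative, so the result does
   not depend on PERM.  */
static unsigned
reduce_add_permuted (const unsigned *v, const size_t *perm, size_t n)
{
  unsigned sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += v[perm[i]];
  return sum;
}
```

Calling it with the identity order {0, 1, 2, 3} and with a swapped order
such as {1, 0, 3, 2} returns the same sum, which is the property that
lets the input layout of the reduction be treated as free.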

I did think of representing the reduction op but wondered how to do
that in the most sensible way.  It's kind of a permute node with
an associated operation.  Or, if we use .REDUC_*_SCAL, a regular
node with a scalar vectype?  I'm not sure we want to overload
the VEC_PERM_EXPR SLP node further.  But for example with x86
we have a SAD operation with 4 incoming lanes in op0, 16 incoming
lanes in op1 and 4 outgoing lanes.
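As a rough picture of such a lane-reducing operation, here is a
simplified scalar model of a SAD-like op that folds 16 narrow lanes into
4 wide lanes.  The name sad_model, the operand order and the grouping of
four adjacent narrow lanes per wide lane are all assumptions made for
illustration; the real lane mapping is up to the target.

```c
#include <stdint.h>
#include <stdlib.h>

enum { NARROW_LANES = 16, WIDE_LANES = 4 };

/* Hypothetical model: OUT[j] = ACC[j] plus the sum of |A[i] - B[i]| over
   the narrow lanes i assigned to wide lane j.  The grouping used here
   (four adjacent narrow lanes per wide lane) is an assumption.  */
static void
sad_model (const uint8_t *a, const uint8_t *b,
           const uint32_t *acc, uint32_t *out)
{
  for (int j = 0; j < WIDE_LANES; j++)
    out[j] = acc[j];
  for (int i = 0; i < NARROW_LANES; i++)
    out[i / (NARROW_LANES / WIDE_LANES)]
      += (uint32_t) abs ((int) a[i] - (int) b[i]);
}
```

The point is only that the operands and the result have different lane
counts, so a single per-node lane count or layout does not describe the
operation.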

That said, currently the reduction node is implicit in the
instance root stmt and can be identified by the SLP instance kind only.
