On Thu, 23 Oct 2025, Richard Biener wrote:

> On Wed, 22 Oct 2025, Richard Sandiford wrote:
> 
> > Richard Biener <[email protected]> writes:
> > > The following tries to address one shortcoming of the SLP permute
> > > optimization phase, which is that it refuses layouts that
> > > cannot revert to layout 0 on an edge even if layout 0 isn't actually
> > > valid.  This hurts in particular on VLA vector targets, where many
> > > permutes cannot be code generated.  While for fixed-length targets
> > > costs eventually sort things out in a way that chooses a "no permutation"
> > > setting, the above check prevents VLA targets from arriving there.
> > 
> > I'm not sure this is the right way to go.  The subpass is supposed
> > to be an optimisation pass rather than a legitimisation/"make something
> > vectorisable" pass.  The cost function is purely aimed at reducing
> > permutations rather than at turning unvectorisable graphs into
> > vectorisable ones.  If the pass happens to make something vectorisable,
> > that happens by chance rather than by design.
> 
> Note that while the change turns something not vectorizable into something
> vectorizable it does so by enabling permute optimization to see that
> a load permutation can be fully elided.  This currently does not work
> because of the imposed "the reverse permute has to be code generatable"
> check(s).
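> 
> As an illustration - a minimal hypothetical kernel, not the testcase
> from the PR:
> 
>   double f (double *a, int n)
>   {
>     double s = 0.0;
>     for (int i = 0; i < n; ++i)
>       {
>         /* Two-lane reduction chain; the group from a[] is loaded as
>            {a[2*i+1], a[2*i]}, i.e. with load permutation {1, 0}.
>            Since the adds are associatable, the lane order does not
>            matter and the permutation is semantically redundant.  */
>         s += a[2*i + 1];
>         s += a[2*i];
>       }
>     return s;
>   }
> 
> Here permute optimization can choose the lane layout so that the load
> permutation becomes the identity and is elided, rather than requiring
> the target to be able to code generate it.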
> 
> The existing handling of [in_layout_i == 0 &&] out_layout_i == 0 for
> a not code generatable permute made me think that the fact that
> layout 0 isn't code generatable shouldn't prevent us from arriving
> at the optimal (just from a cost perspective) layout configuration.
> 
> To that end, do you agree with the proposed change to
> internal_node_cost to anticipate redundant permute eliding?
> One could argue that vect_transform_slp_perm_load_1 should
> not fail for those - in this case it fails for
> group_size == dr_group_size == 16 but nunits [4,4], so
> a !repeating_p permute, which cannot be handled for
> variable-length vectors.
> 
> > I can see that a similar type of pass might be useful for pushing
> > permutations around until the graph is vectorisable.  But I think
> > that should be a separate subpass that runs first, with a different goal
> > function.  Perhaps it would share enough code with the optimisation pass
> > that they could both be subclasses of a common base class (not sure).
> 
> I'll note that at least for permutes on loads we're in the difficult
> situation that permute lowering only handles interleaving patterns
> right now; other permutes are not lowered but left to
> vectorizable_load.  Trying to legitimize those earlier (anticipating
> what vectorizable_load would do) would be nice but is also necessarily
> dependent on the VF (permute lowering also looks at the VF, and at a
> point where it's not yet final ...)
> 
> But as you say, legitimizing should be separate from optimizing; the
> issue at hand is that optimizing a not-yet-legitimized graph ties our
> hands because of the reverse permute requirement ...
> 
> > > The patch fixes up internal_node_cost to anticipate redundant permute
> > > eliding and short-cuts the above requirement on edges to partitions
> > > that do not have a preferred layout, with the idea that we shouldn't
> > > need to revert to layout zero for those.
> > >
> > > The downside is that we can now arrive at invalid layout configurations,
> > > which we deal with by simply giving up.
> > 
> > Right.  I think that for other testcases that are currently successfully
> > optimised, and that are vectorisable with and without the optimisation,
> > we might lose optimisations by doing this, because the pass could get
> > stuck in a dead end.  The risk is probably higher for VLA testcases,
> > due to the same property that motivates this patch.
> > 
> > The change therefore feels like a bit of a hack to me.
> 
> It probably is, esp. singling out that -1 layout.  The question is:
> what's the alternative?
> 
> What's the purpose of rejecting the layout at
> 
>                     /* Reject the layout if it would make layout 0 impossible
>                        for later partitions.  This amounts to testing that the
>                        target supports reversing the layout change on edges
>                        to later partitions.
>
>                        In principle, it might be possible to push a layout
>                        change all the way down a graph, so that it never
>                        needs to be reversed and so that the target doesn't
>                        need to support the reverse operation.  But it would
>                        be awkward to bail out if we hit a partition that
>                        does not support the new layout, especially since
>                        we are not dealing with a lattice.  */
>                     is_possible &= edge_layout_cost (ud, other_node_i, 0,
>                                                      layout_i).is_possible ();
> 
> ?  It seems we want to enable the backward pass to undo everything the
> forward pass did, because that might have gone "too far"?  But we
> do not check that we can "undo", i.e., go back to the initially chosen
> layout (which is not always 0 but matches the reverse load permutation?).
> My alternative idea was to not require the ability to go back when
> layout 0 is "impossible" in terms of internal_node_cost.
> 
> > I've seen the emails about the PR go past but haven't had a chance to look
> > at them properly.  I wanted to reply to this first in case I missed the boat.
> 
> For the case in question that I looked at WRT hitting the assert in the
> backward pass, what happens is that the backward pass computes that the
> layout chosen by the forward pass is invalid (the forward pass pushed the
> invalid load permute down a bit, eventually hitting a fixed-layout store).
> But the minimum-cost layout the backward pass would have chosen actually
> matched that of the forward pass (zero).  So I suppose a first obvious
> thing would be to relax the assert to
> 
>       gcc_assert (min_layout_i == 0 || min_layout_cost.is_possible ());
> 
> or, alternatively or in addition, to min_layout_i == partition.layout.
> The combined
> 
>       gcc_assert ((min_layout_i == 0 && partition.layout == min_layout_i)
>                   || min_layout_cost.is_possible ());
> 
> still runs into ICEs.  On x86-64, for min_layout_i == partition.layout == 0
> we have gcc.dg/vect/pr51000.c.  For gcc.dg/vect/vect-pr122370.c we
> have a case where the forward pass assigns layout 1; I suspect that with
> the way the code is written, a
> 
>       gcc_assert (min_layout_i == 0
>                   || min_layout_cost.is_possible ());
> 
> assert would always work out since we start trying with layout == 0
> (instead of the layout chosen by the forward pass for example).
> 
> If we were not special-casing layout 0 we'd run into the very same
> validation issue of course, and this should get us to choose layout
> 0 everywhere as a fallback, similar to what I do with bailing out.
> Note the patch doesn't keep the layout changes but instead uses
> layout 0 everywhere in such a case, which _should_ be what we'd arrive
> at anyway without the proposed change.

So trying with just adjusting costs, anticipating permute eliding
for the backward pass via

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index f2ffd1f54e8..a5dbabccac3 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -6686,6 +6686,16 @@ vect_optimize_slp_pass::change_layout_cost (slp_tree node,
   if (from_layout_i == to_layout_i)
     return 0;
 
+  if (from_layout_i == 0
+      && !SLP_TREE_LOAD_PERMUTATION (node).is_empty ())
+    {
+      auto_load_permutation_t tem (SLP_TREE_LANES (node));
+      tem.safe_splice (SLP_TREE_LOAD_PERMUTATION (node));
+      vect_slp_permute (m_perms[to_layout_i], tem, true);
+      if (is_redundant_permutation (node, tem))
+       return 0;
+    }
+
   auto_vec<slp_tree, 1> children (1);
   children.quick_push (node);
   auto_lane_permutation_t perm (SLP_TREE_LANES (node));

doesn't fix this either.  The forward pass arrives at using layout
zero everywhere, but while the backward pass computes both layout
zero and layout one as valid for partition 1, they have the same
cost and thus we stay at zero, making partition 0 use layout 0
since layout 1 is then computed as not possible.  We are also re-using
layout costs from the forward pass but then accumulate onto those
costs based on the current layout?

I think for the case in question (a permuted load feeding an
associatable SLP reduction chain) the forward pass is the one
that should pick layout 1 for the load, since the backward pass,
for the reduction SCC, has an arbitrary choice with no cost
difference (its costs for choosing layout 0 do not take into account
the "impossible" cost for partition 0 - that would be factored
in only when processing partition 0).
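
As a concrete sketch of that shape - a simplified illustration, not
the actual testcase (that is gcc.dg/vect/vect-reduc-chain-4.c):

  int f (int *restrict a, int *restrict b, int n)
  {
    int sum = 0;
    for (int i = 0; i < n; ++i)
      {
        /* The two-lane grouped loads from a[] and b[] carry load
           permutation {1, 0}; the associatable reduction chain accepts
           either lane layout, so only costs can break the tie.  */
        sum += a[2*i + 1] * b[2*i + 1];
        sum += a[2*i] * b[2*i];
      }
    return sum;
  }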

Richard.

> Richard.
> 
> 
> > Thanks,
> > Richard
> > 
> > > A testcase for this is gcc.dg/vect/vect-reduc-chain-4.c.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.  I verified that
> > > with RVV we can now use VLA vectors for the testcase.  The patch
> > > will run through the riscv CI; some aarch64 testing is appreciated.
> > >
> > > Comments on the patch are welcome as well; I did try to understand
> > > the whole operation of SLP permute optimization but likely failed to
> > > capture some essential part ;)
> > >
> > > Thanks,
> > > Richard.
> > >
> > >   PR tree-optimization/120687
> > >   * tree-vect-slp.cc (vect_optimize_slp_pass::is_redundant_permutation):
> > >   New function, split out from ...
> > >   (vect_optimize_slp_pass::remove_redundant_permutations): ... here.
> > >   Use it.
> > >   (vect_optimize_slp_pass::backward_pass): Return whether
> > >   all layouts are possible.
> > >   (vect_optimize_slp_pass::internal_node_cost): When the effective
> > >   load permutation is redundant assume zero cost.
> > >   (vect_optimize_slp_pass::forward_pass): For a forward edge
> > >   to a partition without a preferred layout do not require layout
> > >   zero to be valid.
> > >   (vect_optimize_slp_pass::run): When not all layouts are possible
> > >   after the backward_pass, do nothing.
> > > ---
> > >  gcc/tree-vect-slp.cc | 131 +++++++++++++++++++++++++------------------
> > >  1 file changed, 75 insertions(+), 56 deletions(-)
> > >
> > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > > index 9698709f567..4f276771d57 100644
> > > --- a/gcc/tree-vect-slp.cc
> > > +++ b/gcc/tree-vect-slp.cc
> > > @@ -6267,13 +6267,14 @@ private:
> > >    slpg_layout_cost forward_cost (graph_edge *, unsigned int, unsigned int);
> > >    slpg_layout_cost backward_cost (graph_edge *, unsigned int, unsigned int);
> > >    void forward_pass ();
> > > -  void backward_pass ();
> > > +  bool backward_pass ();
> > >  
> > >    /* Rematerialization.  */
> > >    slp_tree get_result_with_layout (slp_tree, unsigned int);
> > >    void materialize ();
> > >  
> > >    /* Clean-up.  */
> > > +  bool is_redundant_permutation (slp_tree, const load_permutation_t&);
> > >    void remove_redundant_permutations ();
> > >  
> > >    /* Masked load lanes discovery.  */
> > > @@ -6748,6 +6749,46 @@ change_vec_perm_layout (slp_tree node, lane_permutation_t &perm,
> > >      vect_slp_permute (m_perms[out_layout_i], perm, true);
> > >  }
> > >  
> > > +/* Return whether PERM is a redundant load permutation for NODE.  */
> > > +
> > > +bool
> > > +vect_optimize_slp_pass::is_redundant_permutation (slp_tree node,
> > > +                                                  const load_permutation_t &perm)
> > > +{
> > > +  if (!is_a <loop_vec_info> (m_vinfo))
> > > +    return false;
> > > +  loop_vec_info loop_vinfo = as_a<loop_vec_info> (m_vinfo);
> > > +  bool this_load_permuted = false;
> > > +  for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> > > +    if (perm[j] != j)
> > > +      {
> > > +        this_load_permuted = true;
> > > +        break;
> > > +      }
> > > +  /* When this isn't a grouped access we know it's single element
> > > +     and contiguous.  */
> > > +  if (!STMT_VINFO_GROUPED_ACCESS (SLP_TREE_SCALAR_STMTS (node)[0]))
> > > +    {
> > > +      if (!this_load_permuted
> > > +          && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
> > > +              || SLP_TREE_LANES (node) == 1))
> > > +        return true;
> > > +      return false;
> > > +    }
> > > +  stmt_vec_info first_stmt_info
> > > +    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]);
> > > +  if (!this_load_permuted
> > > +      /* The load requires permutation when unrolling exposes
> > > +         a gap either because the group is larger than the SLP
> > > +         group-size or because there is a gap between the groups.  */
> > > +      && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
> > > +          || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
> > > +              && DR_GROUP_GAP (first_stmt_info) == 0)))
> > > +    return true;
> > > +  return false;
> > > +}
> > > +
> > >  /* Check whether the target allows NODE to be rearranged so that the node's
> > >     output has layout OUT_LAYOUT_I.  Return the cost of the change if so,
> > >     in the same arbitrary units as for change_layout_cost.  Return -1 otherwise.
> > > @@ -6831,9 +6872,11 @@ vect_optimize_slp_pass::internal_node_cost (slp_tree node, int in_layout_i,
> > >        poly_uint64 vf = 1;
> > >        if (auto loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo))
> > >          vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > > -      unsigned int n_perms;
> > > -      if (!vect_transform_slp_perm_load_1 (m_vinfo, node, tmp_perm, vNULL,
> > > -                                           nullptr, vf, true, false, &n_perms))
> > > +      unsigned int n_perms = 0;
> > > +      if (!is_redundant_permutation (node, tmp_perm)
> > > +          && !vect_transform_slp_perm_load_1 (m_vinfo, node, tmp_perm, vNULL,
> > > +                                              nullptr, vf, true, false,
> > > +                                              &n_perms))
> > >          {
> > >            auto rep = SLP_TREE_REPRESENTATIVE (node);
> > >            if (out_layout_i == 0)
> > > @@ -7338,19 +7381,22 @@ vect_optimize_slp_pass::forward_pass ()
> > >                 else
> > >                   layout_costs.in_cost.add_parallel_cost (cost);
> > >               }
> > > -           else
> > > -             /* Reject the layout if it would make layout 0 impossible
> > > -                for later partitions.  This amounts to testing that the
> > > -                target supports reversing the layout change on edges
> > > -                to later partitions.
> > > -
> > > -                In principle, it might be possible to push a layout
> > > -                change all the way down a graph, so that it never
> > > -                needs to be reversed and so that the target doesn't
> > > -                need to support the reverse operation.  But it would
> > > -                be awkward to bail out if we hit a partition that
> > > -                does not support the new layout, especially since
> > > -                we are not dealing with a lattice.  */
> > > +           /* Reject the layout if it would make layout 0 impossible
> > > +              for later partitions.  This amounts to testing that the
> > > +              target supports reversing the layout change on edges
> > > +              to later partitions.
> > > +
> > > +              In principle, it might be possible to push a layout
> > > +              change all the way down a graph, so that it never
> > > +              needs to be reversed and so that the target doesn't
> > > +              need to support the reverse operation.  But it would
> > > +              be awkward to bail out if we hit a partition that
> > > +              does not support the new layout, especially since
> > > +              we are not dealing with a lattice.
> > > +
> > > +              At least when the other partition doesn't have a
> > > +              preferred layout, do not require the reverse permute.  */
> > > +           else if (m_partitions[other_vertex.partition].layout != -1)
> > >               is_possible &= edge_layout_cost (ud, other_node_i, 0,
> > >                                                layout_i).is_possible ();
> > >           };
> > > @@ -7426,7 +7472,7 @@ vect_optimize_slp_pass::forward_pass ()
> > >  /* Make a backward pass through the partitions, accumulating output costs.
> > >     Make a final choice of layout for each partition.  */
> > >  
> > > -void
> > > +bool
> > >  vect_optimize_slp_pass::backward_pass ()
> > >  {
> > >    for (unsigned int partition_i = m_partitions.length (); partition_i-- > 0;)
> > > @@ -7498,9 +7544,11 @@ vect_optimize_slp_pass::backward_pass ()
> > >       }
> > >   }
> > >  
> > > -      gcc_assert (min_layout_cost.is_possible ());
> > > +      if (!min_layout_cost.is_possible ())
> > > +        return false;
> > >        partition.layout = min_layout_i;
> > >      }
> > > +  return true;
> > >  }
> > >  
> > >  /* Return a node that applies layout TO_LAYOUT_I to the original form of NODE.
> > > @@ -7760,39 +7808,8 @@ vect_optimize_slp_pass::remove_redundant_permutations ()
> > >   }
> > >        else
> > >   {
> > > -          loop_vec_info loop_vinfo = as_a<loop_vec_info> (m_vinfo);
> > > -          stmt_vec_info load_info;
> > > -          bool this_load_permuted = false;
> > > -          unsigned j;
> > > -          FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info)
> > > -            if (SLP_TREE_LOAD_PERMUTATION (node)[j] != j)
> > > -              {
> > > -                this_load_permuted = true;
> > > -                break;
> > > -              }
> > > -          /* When this isn't a grouped access we know it's single element
> > > -             and contiguous.  */
> > > -          if (!STMT_VINFO_GROUPED_ACCESS (SLP_TREE_SCALAR_STMTS (node)[0]))
> > > -            {
> > > -              if (!this_load_permuted
> > > -                  && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
> > > -                      || SLP_TREE_LANES (node) == 1))
> > > -                SLP_TREE_LOAD_PERMUTATION (node).release ();
> > > -              continue;
> > > -            }
> > > -          stmt_vec_info first_stmt_info
> > > -            = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]);
> > > -          if (!this_load_permuted
> > > -              /* The load requires permutation when unrolling exposes
> > > -                 a gap either because the group is larger than the SLP
> > > -                 group-size or because there is a gap between the groups.  */
> > > -              && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
> > > -                  || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
> > > -                      && DR_GROUP_GAP (first_stmt_info) == 0)))
> > > -            {
> > > -              SLP_TREE_LOAD_PERMUTATION (node).release ();
> > > -              continue;
> > > -            }
> > > +          if (is_redundant_permutation (node, SLP_TREE_LOAD_PERMUTATION (node)))
> > > +            SLP_TREE_LOAD_PERMUTATION (node).release ();
> > >   }
> > >      }
> > >  }
> > > @@ -7995,10 +8012,12 @@ vect_optimize_slp_pass::run ()
> > >    if (m_perms.length () > 1)
> > >      {
> > >        forward_pass ();
> > > -      backward_pass ();
> > > -      if (dump_enabled_p ())
> > > -        dump ();
> > > -      materialize ();
> > > +      if (backward_pass ())
> > > +        {
> > > +          if (dump_enabled_p ())
> > > +            dump ();
> > > +          materialize ();
> > > +        }
> > >        while (!m_perms.is_empty ())
> > >          m_perms.pop ().release ();
> > >      }
> > 
> 
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
