Richard Biener <rguent...@suse.de> writes:
> The following implements masked load-lane discovery for SLP.  The
> challenge here is that a masked load has a full-width mask with
> group-size number of elements; when this becomes a masked load-lanes
> instruction, one mask element gates all group members.  We already
> have some discovery hints in place, namely STMT_VINFO_SLP_VECT_ONLY
> to guard non-uniform masks, but we need to choose a way for SLP
> discovery to handle possible masked load-lanes SLP trees.
>
> I have this time chosen to handle load-lanes discovery where we
> have performed permute optimization already and conveniently got
> the graph with predecessor edges built.  This is because, unlike
> non-masked loads, masked loads with a load_permutation are never
> produced by SLP discovery (because load permutation handling doesn't
> handle un-permuting the mask) and thus the load-permutation lowering
> which handles non-masked load-lanes discovery doesn't trigger.
>
> With this, SLP discovery for a possible masked load-lanes, thus
> a masked load with a uniform mask, produces a splat of a single-lane
> sub-graph as the mask SLP operand.  This is a representation that
> shouldn't pessimize the mask load case and allows the masked
> load-lanes transform to simply elide this splat.

It's been too long since I did significant work on the vectoriser
for me to make a sensible comment on this, but FWIW, I agree the
representation of a splatted mask sounds good.
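For concreteness, the kind of loop I take this to target is a
conditional access to interleaved data, where if-conversion gives every
member of the load group the same IFN_MASK_LOAD mask.  A reduced sketch
in the spirit of the mask_struct_load*.c tests (not copied from the
testsuite; the function name is made up):

/* Both loads from SRC form a group of size two and are gated by the
   same condition, so the mask is uniform across the group and the
   splat VEC_PERM representation applies.  */
void
f (int *__restrict dest, int *__restrict src,
   int *__restrict cond, int n)
{
  for (int i = 0; i < n; ++i)
    if (cond[i])
      dest[i] = src[i * 2] + src[i * 2 + 1];
}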
> This fixes the aarch64-sve.exp mask_struct_load*.c testcases with
> --param vect-force-slp=1
>
> Bootstrap and regtest running on x86_64-unknown-linux-gnu.
>
> I realize we are still quite inconsistent in how we do SLP
> discovery - mainly because of my idea to only apply minimal
> changes at this point.  I would expect that permuted masked loads
> miss the interleaving lowering performed by load permutation
> lowering.  And if we fix that, we again have to decide whether
> to interleave or load-lane at the same time.  I'm also not sure
> how much good the optimize_slp pass does to VEC_PERMs in the
> SLP graph and what stops working when there are no longer any
> load_permutations in there.

Yeah, I'm also not sure about that.  The code only considers candidate
layouts that would undo a load permutation or a bijective single-input
VEC_PERM_EXPR.  It won't do anything for 2-to-1 permutes or
single-input packs.  The current layout selection is probably quite
outdated at this point.

Thanks,
Richard

> Richard.
>
>         PR tree-optimization/116575
>         * tree-vect-slp.cc (vect_get_and_check_slp_defs): Handle
>         gaps, aka NULL scalar stmt.
>         (vect_build_slp_tree_2): Allow gaps in the middle of a
>         grouped mask load.  When the mask of a grouped mask load
>         is uniform do single-lane discovery for the mask and
>         insert a splat VEC_PERM_EXPR node.
>         (vect_optimize_slp_pass::decide_masked_load_lanes): New
>         function.
>         (vect_optimize_slp_pass::run): Call it.
> ---
>  gcc/tree-vect-slp.cc | 138 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 135 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index fca9ae86d2e..037098a96cb 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -641,6 +641,16 @@ vect_get_and_check_slp_defs (vec_info *vinfo, unsigned char swap,
>    unsigned int commutative_op = -1U;
>    bool first = stmt_num == 0;
>
> +  if (!stmt_info)
> +    {
> +      for (auto oi : *oprnds_info)
> +        {
> +          oi->def_stmts.quick_push (NULL);
> +          oi->ops.quick_push (NULL_TREE);
> +        }
> +      return 0;
> +    }
> +
>    if (!is_a<gcall *> (stmt_info->stmt)
>        && !is_a<gassign *> (stmt_info->stmt)
>        && !is_a<gphi *> (stmt_info->stmt))
> @@ -2029,9 +2039,11 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
>            has_gaps = true;
>        /* We cannot handle permuted masked loads directly, see
>           PR114375.  We cannot handle strided masked loads or masked
> -         loads with gaps.  */
> +         loads with gaps unless the mask is uniform.  */
>        if ((STMT_VINFO_GROUPED_ACCESS (stmt_info)
> -            && (DR_GROUP_GAP (first_stmt_info) != 0 || has_gaps))
> +            && (DR_GROUP_GAP (first_stmt_info) != 0
> +                || (has_gaps
> +                    && STMT_VINFO_SLP_VECT_ONLY (first_stmt_info))))
>            || STMT_VINFO_STRIDED_P (stmt_info))
>          {
>            load_permutation.release ();
> @@ -2054,7 +2066,12 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
>            unsigned i = 0;
>            for (stmt_vec_info si = first_stmt_info;
>                 si; si = DR_GROUP_NEXT_ELEMENT (si))
> -            stmts2[i++] = si;
> +            {
> +              if (si != first_stmt_info)
> +                for (unsigned k = 1; k < DR_GROUP_GAP (si); ++k)
> +                  stmts2[i++] = NULL;
> +              stmts2[i++] = si;
> +            }
>            bool *matches2 = XALLOCAVEC (bool, dr_group_size);
>            slp_tree unperm_load
>              = vect_build_slp_tree (vinfo, stmts2, dr_group_size,
> @@ -2719,6 +2736,43 @@ out:
>            continue;
>          }
>
> +      /* When we have a masked load with uniform mask discover this
> +         as a single-lane mask with a splat permute.  This way we can
> +         recognize this as a masked load-lane by stripping the splat.  */
> +      if (is_a <gcall *> (STMT_VINFO_STMT (stmt_info))
> +          && gimple_call_internal_p (STMT_VINFO_STMT (stmt_info),
> +                                     IFN_MASK_LOAD)
> +          && STMT_VINFO_GROUPED_ACCESS (stmt_info)
> +          && ! STMT_VINFO_SLP_VECT_ONLY (DR_GROUP_FIRST_ELEMENT (stmt_info)))
> +        {
> +          vec<stmt_vec_info> def_stmts2;
> +          def_stmts2.create (1);
> +          def_stmts2.quick_push (oprnd_info->def_stmts[0]);
> +          child = vect_build_slp_tree (vinfo, def_stmts2, 1,
> +                                       &this_max_nunits,
> +                                       matches, limit,
> +                                       &this_tree_size, bst_map);
> +          if (child)
> +            {
> +              slp_tree pnode = vect_create_new_slp_node (1, VEC_PERM_EXPR);
> +              SLP_TREE_VECTYPE (pnode) = SLP_TREE_VECTYPE (child);
> +              SLP_TREE_LANES (pnode) = group_size;
> +              SLP_TREE_SCALAR_STMTS (pnode).create (group_size);
> +              SLP_TREE_LANE_PERMUTATION (pnode).create (group_size);
> +              for (unsigned k = 0; k < group_size; ++k)
> +                {
> +                  SLP_TREE_SCALAR_STMTS (pnode).quick_push (def_stmts2[0]);
> +                  SLP_TREE_LANE_PERMUTATION (pnode)
> +                    .quick_push (std::make_pair (0u, 0u));
> +                }
> +              SLP_TREE_CHILDREN (pnode).quick_push (child);
> +              pnode->max_nunits = child->max_nunits;
> +              children.safe_push (pnode);
> +              oprnd_info->def_stmts = vNULL;
> +              continue;
> +            }
> +        }
> +
>        if ((child = vect_build_slp_tree (vinfo, oprnd_info->def_stmts,
>                                          group_size, &this_max_nunits,
>                                          matches, limit,
> @@ -5498,6 +5552,9 @@ private:
>    /* Clean-up.  */
>    void remove_redundant_permutations ();
>
> +  /* Masked load lanes discovery.  */
> +  void decide_masked_load_lanes ();
> +
>    void dump ();
>
>    vec_info *m_vinfo;
> @@ -7126,6 +7183,80 @@ vect_optimize_slp_pass::dump ()
>      }
>  }
>
> +/* Masked load lanes discovery.  */
> +
> +void
> +vect_optimize_slp_pass::decide_masked_load_lanes ()
> +{
> +  for (auto v : m_vertices)
> +    {
> +      slp_tree node = v.node;
> +      if (SLP_TREE_DEF_TYPE (node) != vect_internal_def
> +          || SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> +        continue;
> +      stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
> +      if (! STMT_VINFO_GROUPED_ACCESS (stmt_info)
> +          /* The mask has to be uniform.  */
> +          || STMT_VINFO_SLP_VECT_ONLY (stmt_info)
> +          || ! is_a <gcall *> (STMT_VINFO_STMT (stmt_info))
> +          || ! gimple_call_internal_p (STMT_VINFO_STMT (stmt_info),
> +                                       IFN_MASK_LOAD))
> +        continue;
> +      stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
> +      if (STMT_VINFO_STRIDED_P (stmt_info)
> +          || compare_step_with_zero (m_vinfo, stmt_info) <= 0
> +          || vect_load_lanes_supported (SLP_TREE_VECTYPE (node),
> +                                        DR_GROUP_SIZE (stmt_info),
> +                                        true) == IFN_LAST)
> +        continue;
> +
> +      /* Uniform masks need to be suitably represented.  */
> +      slp_tree mask = SLP_TREE_CHILDREN (node)[0];
> +      if (SLP_TREE_CODE (mask) != VEC_PERM_EXPR
> +          || SLP_TREE_CHILDREN (mask).length () != 1)
> +        continue;
> +      bool match = true;
> +      for (auto perm : SLP_TREE_LANE_PERMUTATION (mask))
> +        if (perm.first != 0 || perm.second != 0)
> +          {
> +            match = false;
> +            break;
> +          }
> +      if (!match)
> +        continue;
> +
> +      /* Now see if the consumer side matches.  */
> +      for (graph_edge *pred = m_slpg->vertices[node->vertex].pred;
> +           pred; pred = pred->pred_next)
> +        {
> +          slp_tree pred_node = m_vertices[pred->src].node;
> +          /* All consumers should be a permute with a single outgoing lane.  */
> +          if (SLP_TREE_CODE (pred_node) != VEC_PERM_EXPR
> +              || SLP_TREE_LANES (pred_node) != 1)
> +            {
> +              match = false;
> +              break;
> +            }
> +          gcc_assert (SLP_TREE_CHILDREN (pred_node).length () == 1);
> +        }
> +      if (!match)
> +        continue;
> +      /* Now we can mark the nodes as to use load lanes.  */
> +      node->ldst_lanes = true;
> +      for (graph_edge *pred = m_slpg->vertices[node->vertex].pred;
> +           pred; pred = pred->pred_next)
> +        m_vertices[pred->src].node->ldst_lanes = true;
> +      /* The catch is we have to massage the mask.  We have arranged
> +         analyzed uniform masks to be represented by a splat VEC_PERM
> +         which we can now simply elide as we cannot easily re-do SLP
> +         discovery here.  */
> +      slp_tree new_mask = SLP_TREE_CHILDREN (mask)[0];
> +      SLP_TREE_REF_COUNT (new_mask)++;
> +      SLP_TREE_CHILDREN (node)[0] = new_mask;
> +      vect_free_slp_tree (mask);
> +    }
> +}
> +
>  /* Main entry point for the SLP graph optimization pass.  */
>
>  void
> @@ -7146,6 +7277,7 @@ vect_optimize_slp_pass::run ()
>      }
>    else
>      remove_redundant_permutations ();
> +  decide_masked_load_lanes ();
>    free_graph (m_slpg);
>  }
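As a sanity check on my reading of the STMT_VINFO_SLP_VECT_ONLY test in
decide_masked_load_lanes: a non-uniform counterpart of the loop above,
where each group member has its own guard, should presumably be left
alone by the new path.  A hypothetical contrast case (again not from
the testsuite; names made up):

/* Each load from SRC is gated by a different condition, so no single
   mask element can gate the whole group; the flag set for non-uniform
   masks should keep this off the load-lanes path.  */
void
g (int *__restrict dest, int *__restrict src,
   int *__restrict cond1, int *__restrict cond2, int n)
{
  for (int i = 0; i < n; ++i)
    {
      int x = cond1[i] ? src[i * 2] : 0;
      int y = cond2[i] ? src[i * 2 + 1] : 0;
      dest[i] = x + y;
    }
}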