Richard Biener <rguent...@suse.de> writes:
> On Tue, 4 Jun 2024, Richard Sandiford wrote:
>
>> Richard Biener <rguent...@suse.de> writes:
>> > The following emulates classical interleaving for SLP load permutes
>> > that we are unlikely handling natively.  This is to handle cases
>> > where interleaving (or load/store-lanes) is the optimal choice for
>> > vectorizing even when we are doing that within SLP.  An example
>> > would be
>> >
>> > void foo (int * __restrict a, int * b)
>> > {
>> >   for (int i = 0; i < 16; ++i)
>> >     {
>> >       a[4*i + 0] = b[4*i + 0] * 3;
>> >       a[4*i + 1] = b[4*i + 1] + 3;
>> >       a[4*i + 2] = (b[4*i + 2] * 3 + 3);
>> >       a[4*i + 3] = b[4*i + 3] * 3;
>> >     }
>> > }
>> >
>> > where currently the SLP store is merging four single-lane SLP
>> > sub-graphs but none of the loads in it can be code-generated
>> > with V4SImode vectors and a VF of four as the permutes would need
>> > three vectors.
>>
>> Nice!
>>
>> > The patch introduces a lowering phase after SLP discovery but
>> > before SLP pattern recognition or permute optimization that
>> > analyzes all loads from the same dataref group and creates an
>> > interleaving scheme starting from an unpermuted load.
>> >
>> > What can be handled is quite restrictive, matching only a subset
>> > of the non-SLP interleaving cases (the power-of-two group size
>> > ones, in addition only cases without gaps).  The interleaving
>> > vectorization in addition can handle size 3 and 5 - but I am not
>> > sure if it's possible to do that in a VL agnostic way.  It
>> > should be still possible to set up the SLP graph in a way that
>> > a load-lane could be matched from SLP pattern recognition.
>>
>> Yeah, I don't think it would be possible to decompose a 3- or
>> 5-lane grouped load into a series of VLA 2-input permutes.
>> But (as I think you're saying) it seems like a load-3-lanes would just
>> be a load with a LANE_PERMUTATION of N, N+3, N+6, N+9, ... for lane N.
>> Is that right?
>
> Yes, that's how it looks without this patch.  I think we'd need
> a load node loading N, N+1, N+2, ... and then permute nodes
> with N, N+3, ... and N+1, N+4, ... and N+2, N+5 ... so we generate
> one .LOAD_LANES from the load node and the permutes pick up the
> correct vector defs?  I'm not sure yet how classification and
> code generation would work for this.
>
> The store side is already on trunk with the single SLP store node
> getting lanes via permutes.
>
> It might be we want a load/store node with N inputs/outputs as the
> best representation and use lane_permutation to indicate the
> input (for stores) and output (for loads) "permute".
>
>> > As said gaps are currently not handled - for SLP we have a
>> > representational issue that SLP_TREE_SCALAR_STMTS for "gap lanes"
>> > would need to be filled in some way (even if we just push NULL).
>> >
>> > The patch misses multi-level even/odd handling as well as CSEing
>> > intermediate generated permutes.  Both is quite straight-forward
>> > to add, but eventually there's a better or more general strategy
>> > for lowering?  The main goal of the patch is to avoid falling
>> > back to non-SLP for cases the interleaving code handles.
>>
>> Does the multi-level thing include examples like:
>>
>>   int a[2 * 16];
>>   int b[8 * 16];
>>   void f()
>>   {
>>     for (int i = 0; i < 16; ++i)
>>       {
>>         a[i * 2 + 0] += b[i * 8 + 0] + b[i * 8 + 1] + b[i * 8 + 2] + b[i * 8 + 3];
>>         a[i * 2 + 1] += b[i * 8 + 4] + b[i * 8 + 5] + b[i * 8 + 6] + b[i * 8 + 7];
>>       }
>>   }
>>
>> ?
>> For that we generate:
>>
>>   _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
>>   _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
>>   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
>>   _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
>>   _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
>>   _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
>>   _53 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 1, 4, 5 }>;
>>   _52 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 1, 4, 5 }>;
>>   _51 = VEC_PERM_EXPR <_53, _52, { 1, 3, 5, 7 }>;
>>   _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;
>>
>> (two even level 1, one even level 2, one odd level 1), whereas
>> preferring 2xeven + 2xodd would avoid the third set of first-level
>> permutes:
>>
>>   _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
>>   _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
>>   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
>>   _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
>>   _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
>>   _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
>>   _51 = VEC_PERM_EXPR <_45, _44, { 0, 2, 4, 6 }>;
>>   _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;
>
> The multi-level issue is more when a single reduction to N/2
> inputs still doesn't get you to the point where you can do
> the permute with two inputs.  I think the above is more because
> each load is handled individually, not taking into account
> redundancies across loads when you have freedom in the
> even/odd, level combinations (which you usually have).
>
> I suppose it should be possible to handle this to some extent,
> not sure what the best strategy is when trying to avoid brute-force
> searching for an optimal set (esp. when multi-level interleaving
> will be involved).
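For reference, the permute structure in the first dump can be written
as a small stand-alone GNU C sketch; the helper name and the assumption
that v0..v3 are the four V4SI loads covering b[0..15] in order are
invented for illustration, this is not vectorizer output:

  typedef int v4si __attribute__ ((vector_size (16)));

  /* Pick the even ({ 0, 2, 4, 6 }) or odd ({ 1, 3, 5, 7 }) elements of
     the concatenation of A and B - the building block used at every
     level of the reduction.  */
  static v4si
  even_odd_pick (v4si a, v4si b, int odd)
  {
    v4si m = odd ? (v4si) { 1, 3, 5, 7 } : (v4si) { 0, 2, 4, 6 };
    return __builtin_shuffle (a, b, m);
  }

  /* If v0..v3 cover b[0..15] in order, then
       t0 = even_odd_pick (v0, v1, 1);   // _45 = { b1, b3, b5, b7 }
       t1 = even_odd_pick (v2, v3, 1);   // _44 = { b9, b11, b13, b15 }
       r  = even_odd_pick (t0, t1, 1);   // _43 = { b3, b7, b11, b15 }
     and reusing t0/t1 (and their even counterparts) for both the even
     and the odd second-level picks is what drops the third set of
     first-level permutes in the second sequence above.  */
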
Was wondering about picking the smaller of ctz_hwi (even) and
ctz_hwi (odd) (when both are nonzero).  I.e. something like:

  // HOST_BITS_PER_WIDE_INT for zero.
  auto even_bits = ctz_hwi (even);
  auto odd_bits = ctz_hwi (odd);
  if (even_bits <= odd_bits)
    {
      unsigned level = 1 << even_bits;
      gcc_assert (group_lanes % (2 * level) == 0);
      ...
    }
  else if (odd)
    {
      unsigned level = 1 << odd_bits;
      gcc_assert (group_lanes % (2 * level) == 0);
      ...
    }
  else
    gcc_unreachable ();

Picking the smaller level should guarantee that it fits.

That handles this case (and is how I generated the above), but maybe
it's worse for something else.

(And yeah, like you said in the follow-up, it's definitely better than
what we generated for GCC 14. :) )

>> > Comments and suggestions welcome, esp. what representation
>> > you'd think is suitable for SLP pattern matching to
>> > load/store-lane and how to represent that?  Maybe this lowering
>> > should happen directly in vect_lower_load_permutations?
>>
>> If the load-lanes representation is as simple as above, it sounds like
>> it could be deferred to pattern matching.  Not sure what the result
>> would look like though.  It would be nice if (at least for costing
>> purposes) we could have a single node for all lanes of the load-lanes,
>> rather than create a separate node for each lane and rely on later CSE.
>> (Or do we already have a good representation for this?  It's been too
>> long, sorry.)
>
> Yeah, as said above having a load-lane node with multiple outputs
> would be the best match, similar for store-lane.  It's probably
> easiest to generate those right from this lowering until we
> re-implement SLP discovery from scratch.

OK.

>> Bit of trivia below:
>>
>> > Thanks,
>> > Richard.
>> >
>> >     * tree-vect-slp.cc (vllp_cmp): New function.
>> >     (vect_lower_load_permutations): Likewise.
>> >     (vect_analyze_slp): Call it.
>> > ---
>> >  gcc/tree-vect-slp.cc | 279 +++++++++++++++++++++++++++++++++++++++++++
>> >  1 file changed, 279 insertions(+)
>> >
>> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>> > index 7e3d0107b4e..766b773452f 100644
>> > --- a/gcc/tree-vect-slp.cc
>> > +++ b/gcc/tree-vect-slp.cc
>> > @@ -3839,6 +3839,279 @@ vect_analyze_slp_instance (vec_info *vinfo,
>> >    return res;
>> >  }
>> >
>> > +/* qsort comparator ordering SLP load nodes.  */
>> > +
>> > +static int
>> > +vllp_cmp (const void *a_, const void *b_)
>> > +{
>> > +  const slp_tree a = *(const slp_tree *)a_;
>> > +  const slp_tree b = *(const slp_tree *)b_;
>> > +  stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0];
>> > +  stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0];
>> > +  if (STMT_VINFO_GROUPED_ACCESS (a0)
>> > +      && STMT_VINFO_GROUPED_ACCESS (b0)
>> > +      && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
>> > +    {
>> > +      /* Same group, order after lanes used.  */
>> > +      if (SLP_TREE_LANES (a) < SLP_TREE_LANES (b))
>> > +        return 1;
>> > +      else if (SLP_TREE_LANES (a) > SLP_TREE_LANES (b))
>> > +        return -1;
>> > +      else
>> > +        {
>> > +          /* Try to order loads using the same lanes together, breaking
>> > +             the tie with the lane number that first differs.  */
>> > +          if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
>> > +              && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
>> > +            return 0;
>>
>> Does the comparison need to be "stable", with a further tie-breaker
>> when a != b?  Or does our qsort not rely on that?
>
> It doesn't, there's stable_sort if we'd rely on a specific order
> on the consumer side.
>> > +          else if (SLP_TREE_LOAD_PERMUTATION (a).exists ()
>> > +                   && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
>> > +            return 1;
>> > +          else if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
>> > +                   && SLP_TREE_LOAD_PERMUTATION (b).exists ())
>> > +            return -1;
>> > +          else
>> > +            {
>> > +              for (unsigned i = 0; i < SLP_TREE_LANES (a); ++i)
>> > +                if (SLP_TREE_LOAD_PERMUTATION (a)[i]
>> > +                    != SLP_TREE_LOAD_PERMUTATION (b)[i])
>> > +                  {
>> > +                    /* In-order lane first, that's what the above case for
>> > +                       no permutation does.  */
>> > +                    if (SLP_TREE_LOAD_PERMUTATION (a)[i] == i)
>> > +                      return -1;
>> > +                    else if (SLP_TREE_LOAD_PERMUTATION (b)[i] == i)
>> > +                      return 1;
>> > +                    else if (SLP_TREE_LOAD_PERMUTATION (a)[i]
>> > +                             < SLP_TREE_LOAD_PERMUTATION (b)[i])
>> > +                      return -1;
>> > +                    else
>> > +                      return 1;
>> > +                  }
>> > +              return 0;
>> > +            }
>> > +        }
>> > +    }
>> > +  else /* Different groups or non-groups.  */
>> > +    {
>> > +      /* Order groups as their first element to keep them together.  */
>> > +      if (STMT_VINFO_GROUPED_ACCESS (a0))
>> > +        a0 = DR_GROUP_FIRST_ELEMENT (a0);
>> > +      if (STMT_VINFO_GROUPED_ACCESS (b0))
>> > +        b0 = DR_GROUP_FIRST_ELEMENT (b0);
>> > +      if (a0 == b0)
>> > +        return 0;
>> > +      /* Tie using UID.  */
>> > +      else if (gimple_uid (STMT_VINFO_STMT (a0))
>> > +               < gimple_uid (STMT_VINFO_STMT (b0)))
>> > +        return -1;
>> > +      else
>> > +        {
>> > +          gcc_assert (gimple_uid (STMT_VINFO_STMT (a0))
>> > +                      != gimple_uid (STMT_VINFO_STMT (b0)));
>> > +          return 1;
>> > +        }
>> > +    }
>> > +}
>> > +
>> > +/* Process the set of LOADS that are all from the same dataref group.  */
>> > +
>> > +static void
>> > +vect_lower_load_permutations (loop_vec_info loop_vinfo,
>> > +                              scalar_stmts_to_slp_tree_map_t *bst_map,
>> > +                              const array_slice<slp_tree> &loads)
>> > +{
>> > +  /* We at this point want to lower without a fixed VF or vector
>> > +     size in mind which means we cannot actually compute whether we
>> > +     need three or more vectors for a load permutation yet.  So always
>> > +     lower.  */
>> > +  stmt_vec_info first
>> > +    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
>> > +
>> > +  /* ??? In principle we have to consider a gap up to the next full
>> > +     vector, but we have to actually represent a scalar stmt for the
>> > +     gaps value so delay handling this.  The same is true for
>> > +     inbetween gaps which the load places in the load-permutation
>> > +     represent.  It's probably not worth trying an intermediate packing
>> > +     to vectors without gap even if that might handle some more cases.
>> > +     Instead get the gap case correct in some way.  */
>> > +  unsigned group_lanes = 0;
>> > +  for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
>> > +    {
>> > +      if ((s == first && DR_GROUP_GAP (s) != 0)
>> > +          || (s != first && DR_GROUP_GAP (s) != 1))
>> > +        return;
>> > +      group_lanes++;
>> > +    }
>> > +  /* Only a power-of-two number of lanes matches interleaving.  */
>> > +  if (exact_log2 (group_lanes) == -1)
>> > +    return;
>> > +
>> > +  for (slp_tree load : loads)
>> > +    {
>> > +      /* Leave masked or gather loads alone for now.  */
>> > +      if (!SLP_TREE_CHILDREN (load).is_empty ())
>> > +        continue;
>> > +
>> > +      /* We need to lower only loads of less than half of the groups
>> > +         lanes, including duplicate lanes.  */
>> > +      if (SLP_TREE_LANES (load) >= group_lanes / 2)
>> > +        continue;
>> > +
>> > +      /* Lower by reducing the group to half its size using an
>> > +         interleaving scheme.  For this try to compute whether all
>> > +         elements needed for this loads are in even or odd elements of
>>
>> this load (or these loads)
>>
>> > +         a even/odd decomposition with N consecutive elements.
>>
>> an even/odd
>>
>> > +         Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
>> > +         with N == 2.  */
>> > +      unsigned even = (1 << ceil_log2 (DR_GROUP_SIZE (first))) - 1;
>>
>> Is this DR_GROUP_SIZE (first) conceptually different from group_lanes
>> above?  If not, I think it'd be a bit easier to follow if this line reused
>> the exact_log2 result above.
>
> Once we look at groups with gaps it's different - the load permutation
> lane indices have gaps represented, so an a[0], a[2], a[3] group
> would have a load permutation of { 0, 2, 3 }.  group_lanes is the
> number of lanes in the output of the load which has unused/gap lanes
> stripped.
>
> I've short-cut handling of groups with intermediate gaps and also with
> gaps at the end for simplicity as I have to decide what to put into
> SLP_TREE_SCALAR_STMTS for the unpermuted SLP load node which would
> have those gaps "represented" (I'm quite sure a NULL ICEs left and
> right, duplicating the previous lane sounds appealing even though
> it's wrong ...).  As said, this is more a proof-of-concept ;)
>
> For the next iteration I'm going to add some test coverage, esp.
> also for the multi-level case and will see to handle gaps.

OK, thanks, sounds good.

Richard
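
As a stand-alone illustration of the even/odd classification and of
picking the smaller level discussed above: the function and variable
names below are invented, and this is a sketch for a power-of-two group
without gaps, not the patch's code.  A lane L is in the "even" half of
the decomposition with N = 1 << K consecutive elements iff bit K of L is
clear, and in the "odd" half iff it is set, so accumulating which bits
are clear (resp. set) for every lane a load needs yields a mask of the
usable levels; ctz of that mask gives the smallest one, which for a
power-of-two group always satisfies group_lanes % (2 * level) == 0.

  #include <stdio.h>

  static void
  classify (unsigned group_lanes, const unsigned *lanes, unsigned n)
  {
    unsigned even = group_lanes - 1, odd = group_lanes - 1;
    for (unsigned i = 0; i < n; ++i)
      {
        even &= ~lanes[i];  /* bit K survives iff every lane has bit K clear */
        odd &= lanes[i];    /* bit K survives iff every lane has bit K set */
      }
    if (even)
      printf ("even decomposition, level %u\n", 1u << __builtin_ctz (even));
    if (odd)
      printf ("odd decomposition, level %u\n", 1u << __builtin_ctz (odd));
  }

  int
  main ()
  {
    /* For the f() example above: lanes 0-3 resp. 4-7 of an 8-lane group
       give "even, level 4" resp. "odd, level 4".  Lanes { 2, 6 } give
       both "even, level 1" and "odd, level 2"; the suggestion above is
       to pick the smaller of the two levels.  */
    unsigned l0[] = { 0, 1, 2, 3 }, l1[] = { 4, 5, 6, 7 }, l2[] = { 2, 6 };
    classify (8, l0, 4);
    classify (8, l1, 4);
    classify (8, l2, 2);
    return 0;
  }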