Re: [PATCH 1/5] Do single-lane SLP discovery for reductions

Richard Sandiford Wed, 29 May 2024 02:09:06 -0700

Richard Biener <[email protected]> writes:
> On Fri, 24 May 2024, Richard Biener wrote:
>
>> This is the second merge proposed from the SLP vectorizer branch.
>> I have again managed without adding and using --param vect-single-lane-slp
>> but instead this provides always enabled functionality.
>> 
>> This makes us use SLP reductions (a group of reductions) for the
>> case where the group size is one.  This basically means we try
>> to use SLP for all reductions.
>> 
>> I've kept the series close to changes how they are on the branch
>> but in the end I'll squash it, having separate commits for review
>> eventually helps identifying common issues we will run into.  In
>> particular we lack full SLP support for several reduction kinds
>> and the branch has more enabling patches than in this series.
>> For example 4/5 makes sure we use shifts and direct opcode
>> reductions in the reduction epilog for SLP reductions but doesn't
>> bother to try covering the general case but enables it only
>> for the single-element group case to avoid regressions
>> in gcc.dg/vect/reduc-{mul,or}_[12].c testcases.
>> 
>> Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
>> successfully built SPEC CPU 2017.  This posting should trigger
>> arm & riscv pre-checkin CI.
>> 
>> There's one ICE in gcc.target/i386/pr51235.c I discovered late
>> that I will investigate and address after the weekend.
>
> I've fixed this now.
>
> On aarch64 and arm there's
>
> FAIL: gcc.dg/vect/slp-reduc-3.c scan-tree-dump-times vect "VEC_PERM_EXPR" 
> 0
>
> which is a testism, I _think_ due to a bogus vect_load_lanes check
> in that line.  The code is as expected not using a SLP reduction of
> two lanes due to the widen-sum pattern used.  It might be that we
> somehow fail to use load-lanes when vectorizing the load with SLP
> which means that for SLP reductions we fail to consider
> load-lanes as override.  I think we should leave this FAIL, we need to
> work to get load-lanes vectorization from SLP anyway.  To fix this
> the load-permutation followup I have in the works will be necessary.


Sounds good to me FWIW.

> I also see
>
> FAIL: gcc.target/aarch64/sve/dot_1.c scan-assembler-times \\twhilelo\\t 8
> FAIL: gcc.target/aarch64/sve/reduc_4.c scan-assembler-not \\tfadd\\t
> FAIL: gcc.target/aarch64/sve/sad_1.c scan-assembler-times 
> \\tudot\\tz[0-9]+\\.s, z[0-9]+\\.b, z[0-9]+\\.b\\n 2
>
> but scan-assemblers are not my favorite.  For example dot_1.c has
> twice as many whilelo, but I'm not sure what goes wrong.
>
> There are quite some regressions reported for RISC-V, I looked at the
> ICEs and fixed them but I did not investigate any of the assembly
> scanning FAILs.
>
> I'll re-spin the series with the fixes tomorrow.
> If anybody wants to point out something I should investigate please
> speak up.

Thanks for checking the aarch64 results.  I'll look at the three SVE
failures once the patch is in.  Many of the tests are supposed to ensure
that we generate correct code for a given set of choices.  Sometimes
it's necessary to update the flags to retain the same of choices,
e.g. due to costing changes or general vectoriser improvements.

That is, the point of these tests isn't necessarily to make sure that we
get the "best" SVE code for the source -- especially since there isn't
really an abstract, objective "best" that applies to all targets.
The tests are instead reognising that we have mulitple techniques for
doing some things, and are trying to make sure that each of those
techniques works individually.

Realise that kind of test isn't popular with everyone.  The quid
pro quo is that we (AArch64 folks) get to look at the tests when
failures show up :)

Richard

>
> Thanks,
> Richard.
>
>> This change should be more straight-forward than the previous one,
>> still comments are of course welcome.  After pushed I will followup
>> with changes to enable single-lane SLP reductions for various
>> COND_EXPR reductions as well as double-reduction support and
>> in-order reduction support (also all restricted to single-lane
>> for the moment).
>>
>> Thanks,
>> Richard.
>> 
>> --
>> 
>> The following performs single-lane SLP discovery for reductions.
>> This exposes a latent issue with reduction SLP in outer loop
>> vectorization and makes gcc.dg/vect/vect-outer-4[fgkl].c FAIL
>> execution.
>> 
>>      * tree-vect-slp.cc (vect_build_slp_tree_2): Only multi-lane
>>      discoveries are reduction chains and need special backedge
>>      treatment.
>>      (vect_analyze_slp): Fall back to single-lane SLP discovery
>>      for reductions.  Make sure to try single-lane SLP reduction
>>      for all reductions as fallback.
>> ---
>>  gcc/tree-vect-slp.cc | 71 +++++++++++++++++++++++++++++++++-----------
>>  1 file changed, 54 insertions(+), 17 deletions(-)
>> 
>> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>> index c7ed520b629..73cc69d85ce 100644
>> --- a/gcc/tree-vect-slp.cc
>> +++ b/gcc/tree-vect-slp.cc
>> @@ -1907,7 +1907,8 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
>>          /* Reduction chain backedge defs are filled manually.
>>             ???  Need a better way to identify a SLP reduction chain PHI.
>>             Or a better overall way to SLP match those.  */
>> -        if (all_same && def_type == vect_reduction_def)
>> +        if (stmts.length () > 1
>> +            && all_same && def_type == vect_reduction_def)
>>            skip_args[loop_latch_edge (loop)->dest_idx] = true;
>>        }
>>      else if (def_type != vect_internal_def)
>> @@ -3905,9 +3906,10 @@ vect_analyze_slp (vec_info *vinfo, unsigned 
>> max_tree_size)
>>        }
>>  
>>        /* Find SLP sequences starting from groups of reductions.  */
>> -      if (loop_vinfo->reductions.length () > 1)
>> +      if (loop_vinfo->reductions.length () > 0)
>>      {
>> -      /* Collect reduction statements.  */
>> +      /* Collect reduction statements we can combine into
>> +         a SLP reduction.  */
>>        vec<stmt_vec_info> scalar_stmts;
>>        scalar_stmts.create (loop_vinfo->reductions.length ());
>>        for (auto next_info : loop_vinfo->reductions)
>> @@ -3920,25 +3922,60 @@ vect_analyze_slp (vec_info *vinfo, unsigned 
>> max_tree_size)
>>                   reduction path.  In that case we'd have to reverse
>>                   engineer that conversion stmt following the chain using
>>                   reduc_idx and from the PHI using reduc_def.  */
>> -              && STMT_VINFO_DEF_TYPE (next_info) == vect_reduction_def
>> -              /* Do not discover SLP reductions for lane-reducing ops, that
>> -                 will fail later.  */
>> -              && (!(g = dyn_cast <gassign *> (STMT_VINFO_STMT (next_info)))
>> +              && STMT_VINFO_DEF_TYPE (next_info) == vect_reduction_def)
>> +            {
>> +              /* Do not discover SLP reductions combining lane-reducing
>> +                 ops, that will fail later.  */
>> +              if (!(g = dyn_cast <gassign *> (STMT_VINFO_STMT (next_info)))
>>                    || (gimple_assign_rhs_code (g) != DOT_PROD_EXPR
>>                        && gimple_assign_rhs_code (g) != WIDEN_SUM_EXPR
>> -                      && gimple_assign_rhs_code (g) != SAD_EXPR)))
>> -            scalar_stmts.quick_push (next_info);
>> +                      && gimple_assign_rhs_code (g) != SAD_EXPR))
>> +                scalar_stmts.quick_push (next_info);
>> +              else
>> +                {
>> +                  /* Do SLP discovery for single-lane reductions.  */
>> +                  vec<stmt_vec_info> stmts;
>> +                  vec<stmt_vec_info> roots = vNULL;
>> +                  vec<tree> remain = vNULL;
>> +                  stmts.create (1);
>> +                  stmts.quick_push (next_info);
>> +                  vect_build_slp_instance (vinfo,
>> +                                           slp_inst_kind_reduc_group,
>> +                                           stmts, roots, remain,
>> +                                           max_tree_size, &limit,
>> +                                           bst_map, NULL);
>> +                }
>> +            }
>>          }
>> -      if (scalar_stmts.length () > 1)
>> +      /* Save for re-processing on failure.  */
>> +      vec<stmt_vec_info> saved_stmts = scalar_stmts.copy ();
>> +      vec<stmt_vec_info> roots = vNULL;
>> +      vec<tree> remain = vNULL;
>> +      if (scalar_stmts.length () <= 1
>> +          || !vect_build_slp_instance (loop_vinfo,
>> +                                       slp_inst_kind_reduc_group,
>> +                                       scalar_stmts, roots, remain,
>> +                                       max_tree_size, &limit, bst_map,
>> +                                       NULL))
>>          {
>> -          vec<stmt_vec_info> roots = vNULL;
>> -          vec<tree> remain = vNULL;
>> -          vect_build_slp_instance (loop_vinfo, slp_inst_kind_reduc_group,
>> -                                   scalar_stmts, roots, remain,
>> -                                   max_tree_size, &limit, bst_map, NULL);
>> +          if (scalar_stmts.length () <= 1)
>> +            scalar_stmts.release ();
>> +          /* Do SLP discovery for single-lane reductions.  */
>> +          for (auto stmt_info : saved_stmts)
>> +            {
>> +              vec<stmt_vec_info> stmts;
>> +              vec<stmt_vec_info> roots = vNULL;
>> +              vec<tree> remain = vNULL;
>> +              stmts.create (1);
>> +              stmts.quick_push (vect_stmt_to_vectorize (stmt_info));
>> +              vect_build_slp_instance (vinfo,
>> +                                       slp_inst_kind_reduc_group,
>> +                                       stmts, roots, remain,
>> +                                       max_tree_size, &limit,
>> +                                       bst_map, NULL);
>> +            }
>> +          saved_stmts.release ();
>>          }
>> -      else
>> -        scalar_stmts.release ();
>>      }
>>      }
>>  
>>

Re: [PATCH 1/5] Do single-lane SLP discovery for reductions

Reply via email to