On Tue, 30 Jul 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Thursday, July 18, 2024 10:00 AM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Sandiford
> > <richard.sandif...@arm.com>
> > Subject: RE: [RFC][middle-end] SLP Early break and control flow support in 
> > GCC
> > 
> > On Wed, 17 Jul 2024, Tamar Christina wrote:
> > 
> > > > -----Original Message-----
> > > > From: Richard Biener <rguent...@suse.de>
> > > > Sent: Tuesday, July 16, 2024 4:08 PM
> > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Sandiford
> > > > <richard.sandif...@arm.com>
> > > > Subject: Re: [RFC][middle-end] SLP Early break and control flow support 
> > > > in GCC
> > > >
> > > > On Mon, 15 Jul 2024, Tamar Christina wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > This RFC document covers at a high level how to extend early break 
> > > > > support in
> > > > > GCC to support SLP and how this will be extended in the future to 
> > > > > support
> > > > > full control flow in GCC.
> > > > >
> > > > > The basic idea in this is based on the paper "All You Need Is 
> > > > > Superword-Level
> > > > > Parallelism: Systematic Control-Flow Vectorization with SLP"[1] but 
> > > > > it is
> > > > > adapted to fit within the GCC vectorizer.
> > > >
> > > > An interesting read - I think the approach is viable for loop
> > > > vectorization where we schedule the whole vectorized loop but difficult
> > > > for basic-block vectorization as they seem to re-build the whole 
> > > > function
> > > > from scratch.  They also do not address how to code-generate predicated
> > > > not vectorized code or how they decide to handle mixed vector/non-vector
> > > > code at all.  For example I don't think any current CPU architecture
> > > > supports a full set of predicated _scalar_ ops and not every scalar
> > > > op would have a vector equivalent in case one would use single-lane
> > > > vectors.
> > >
> > > Hmm, I'm guessing you mean here that they don't address, for BB vectorization,
> > > how to deal with the fact that you may not always be able to vectorize the
> > > entire function up from the seed?  I thought today we dealt with that by
> > > splitting during discovery?  Can we not do the same? i.e. treat them as
> > > externals?
> > 
> > Currently not all scalar stmts are even reached by the seeds we use.
> > They seem to simply rewrite the whole function into predicated form
> > and I don't see how that is a transform that gets you back 1:1 the old
> > code (or code of at least the same quality) if not all statements end
> > up vectorized?
> > 
> > Sure we could build up single-lane SLP instances for "everything",
> > but then how do you code-generate these predicated single-lane SLP
> > instances?
> > 
> > > >
> > > > For GCCs basic-block vectorization the main outstanding issue is one
> > > > of scheduling and dependences with scalar code (live lane extracts,
> > > > vector builds from scalars) as well.
> > > >
> > > > > What will not be covered is the support for First-Faulting Loads nor
> > > > > alignment peeling as these are done by a different engineer.
> > > > >
> > > > > == Concept and Theory ==
> > > > >
> > > > > Supporting Early Break in SLP requires the same theory as general 
> > > > > control flow,
> > > > > the only difference is that early break can be supported for 
> > > > > non-masked
> > > > > architectures while full control flow support requires a fully masked
> > > > > architecture.
> > > > >
> > > > > In GCC 14 early break was added for non-SLP by teaching the 
> > > > > vectorizer to branch
> > > > > to a scalar epilogue whenever an early exit is taken.  This means the 
> > > > > vectorizer
> > > > > itself doesn't need to know how to perform side-effects or final 
> > > > > reductions.
> > > >
> > > > With current GCC we avoid the need of predicated stmts by using the 
> > > > scalar
> > > > epilog and a branch-on-all-true/false stmt.  To make that semantically
> > > > valid stmts are moved around.
> > > >
> > > > > The additional benefit of this is that it works for any target 
> > > > > providing a
> > > > > vector cbranch optab implementation since we don't require masking 
> > > > > support.  In
> > > > > order for this to work we need to have code motion that moves side 
> > > > > effects
> > > > > (primarily stores) into the latch basic block.  i.e. we delay any 
> > > > > side-effects
> > > > > up to a point where we know the full vector iteration would have been
> > > > performed.
> > > > > For this to work, however, we had to disable support for epilogue 
> > > > > vectorization as
> > > > > when the scalar statements are moved we break the link to the 
> > > > > original BB they
> > > > > were in.  This means that the stmt_vinfo for the stores that need to 
> > > > > be moved
> > > > > will no longer be valid for epilogue vectorization and the only way 
> > > > > to recover
> > > > > this would be to perform a full dataflow analysis again.  We decided 
> > > > > against
> > > > > this as the plan of record was to switch to SLP.
> > > > >
> > > > > -- Predicated SSA --
> > > > >
> > > > > The core of the proposal is to support a form of Predicated SSA 
> > > > > (PSSA) in the
> > > > > vectorizer [2]. The idea is to assign a control predicate to every SSA
> > > > > statement.  This control predicate denotes their dependencies wrt 
> > > > > the BB
> > they
> > > > > are in and defines the dependencies between statements.
> > > > >
> > > > > As an example the following early break code:
> > > > >
> > > > > unsigned test4(unsigned x)
> > > > > {
> > > > >  unsigned ret = 0;
> > > > >  for (int i = 0; i < N; i++)
> > > > >  {
> > > > >    vect_b[i] = x + i;
> > > > >    if (vect_a[i]*2 != x)
> > > > >      break;
> > > > >    vect_a[i] = x;
> > > > >
> > > > >  }
> > > > >  return ret;
> > > > > }
> > > > >
> > > > > is transformed into
> > > > >
> > > > > unsigned test4(unsigned x)
> > > > > {
> > > > >  unsigned ret = 0;
> > > > >  for (int i = 0; i < N; i++)
> > > > >  {
> > > > >    vect_b[i] = x + i;      :: !(vect_a[i]*2 != x)
> > > >
> > > > why's this not 'true'?
> > > >
> > >
> > > I think there are three cases here:
> > >
> > > 1. control flow with no exit == true, in this case we
> > >     can perform statements with side-effects in place as
> > >     you never exit the loop early.
> > > 2. early break non-masked, this one requires the store
> > >      to be moved to the latch, or the block with the last
> > >      early break.  This is what we do today, and the predicate
> > >      would cause it to be moved.
> > > 3.  early break masked: in this case I think we need an exit
> > >      block that performs any side effects masked.  Since on every
> > >      exit you must still perform the stores, but based on the union
> > >      of the masks you have so far.  The final one should still be moved
> > >      to the latch block.
> > >
> > > > >    if (vect_a[i]*2 != x)   :: true
> > > > >      break;                :: vect_a[i]*2 != x
> > > > >    vect_a[i] = x;          :: !(vect_a[i]*2 != x)
> > > > >
> > > > >  }
> > > > >  return ret;
> > > > > }
> > > > >
> > > > > A more complicated example:
> > > > >
> > > > > int foo (int n)
> > > > > {
> > > > >     int res = 0;
> > > > >     for (int i = 0; i < n; i++)
> > > > >       {
> > > > >         y[i] = x[i] * 2;      :: true
> > > > >         res += x[i] + y[i];   :: true
> > > > >
> > > > >         if (x[i] > 5)         :: true
> > > > >           break;              :: x[i] > 5
> > > > >
> > > > >         if (z[i] > 5)         :: !(x[i] > 5)
> > > > >           break;              :: !(x[i] > 5) && (z[i] > 5)
> > > > >       }
> > > > >
> > > > >     return res;
> > > > > }
> > > > >
> > > > > the predicate of any statement in a block guarded by a non-true 
> > > > > predicate can be simplified against its block's predicate, so
> > > > > the above can be turned into
> > > > >
> > > > > int foo (int n)
> > > > > {
> > > > >     int res = 0;
> > > > >     for (int i = 0; i < n; i++)
> > > > >       {
> > > > >         y[i] = x[i] * 2;      :: true
> > > > >         res += x[i] + y[i];   :: true
> > > > >
> > > > >         if (x[i] > 5)         :: true
> > > > >           break;              :: x[i] > 5
> > > > >
> > > > >         if (z[i] > 5)         :: !(x[i] > 5)
> > > > >           break;              :: (z[i] > 5)
> > > > >       }
> > > > >
> > > > >     return res;
> > > > > }
> > > > >
> > > > > if we make the predicates a *gimple, where true is just NULL;
> > > >
> > > > I'd make predicates a SSA name instead (or possibly abstract this
> > > > so we can more easily change our minds) - the disadvantage is
> > > > that gcond * doesn't have a SSA def.
> > > >
> > > > We might want to look into tree-if-conv.cc (remotely possibly into
> > > > gimple-predicate-analysis.{h,cc}) since there are multiple places in
> > > > GCC where we need to maintain a composition of individual predicates
> > > > (and possibly compare / simplify such compositions).
> > > >
> > >
> > > Aha, good shout.
> > >
> > > > > Lastly examples such as:
> > > > >
> > > > > unsigned test4(unsigned x, unsigned y)
> > > > > {
> > > > >  unsigned ret = 0;
> > > > >  unsigned sum = 0;
> > > > >  for (int i = 0; i < N; i++)
> > > > >  {
> > > > >    vect_b[i] = x + i;
> > > > >    if (vect_a[i] > x)
> > > > >      return vect_a[i];
> > > > >
> > > > >   vect_b[i] += x + i;
> > > > >   if (vect_a[i] > y)
> > > > >       return sum;
> > > > >   sum += vect_a[i];
> > > > >   vect_a[i] = x;
> > > > >  }
> > > > >  return ret;
> > > > > }
> > > > >
> > > > > are tagged as:
> > > > >
> > > > > unsigned test4(unsigned x, unsigned y)
> > > > > {
> > > > >  unsigned ret = 0;
> > > > >  unsigned sum = 0;
> > > > >  for (int i = 0; i < N; i++)
> > > > >  {
> > > > >    vect_b[i] = x + i;    :: !(vect_a[i] > y)
> > > > >    if (vect_a[i] > x)    :: true
> > > > >      return vect_a[i];   :: vect_a[i] > x
> > > > >
> > > > >   vect_b[i] += x + i;    :: !(vect_a[i] > y)
> > > > >   if (vect_a[i] > y)     :: !(vect_a[i] > x)
> > > > >       return sum;        :: vect_a[i] > y
> > > > >   sum += vect_a[i];      :: !(vect_a[i] > y)
> > > > >   vect_a[i] = x;         :: !(vect_a[i] > y)
> > > > >  }
> > > > >  return ret;
> > > > > }
> > > >
> > > > I'll note you have all early-exit examples.  Whatever we build should
> > > > ideally be able to handle cases we currently if-convert.
> > > >
> > >
> > > Yeah, I think cases we currently if-convert and don't if-convert can be
> > > handled approximately the same way.  Though we might want to have
> > > some kind of lowering phase that transforms the block into conditional
> > > statements when we can.  Probably in optimize_slp?
> > 
> > But isn't this handled by the rewrite to predicated form?  Of course
> > you would need to handle PHIs then (with early exit we have/allow no
> > in-loop PHIs), but those are simply COND_EXPRs (or blends when
> > vectorized).
> > 
> > > > > This representation has a number of benefits:
> > > > >
> > > > > 1. code-motion becomes just SLP scheduling, where scheduling has to 
> > > > > take
> > > > account
> > > > >    not to schedule blocks across each other which have a data 
> > > > > dependency. This
> > > > >    becomes simpler when we build the SLP tree we can merge blocks 
> > > > > with the
> > > > same
> > > > >    control dependencies.
> > > > > 2. this representation also helps for the case where we want to stay 
> > > > > inside the
> > > > >    vector loop, as the predicate tells us which mask we need to use 
> > > > > to compute
> > > > >    the reduction values.
> > > > > 3. this representation is easily extendable to general control flow 
> > > > > support
> > > > >    since the statements will all be explicitly guarded by a 
> > > > > predicate.  The PHI
> > > > >    node can be given a special type, to indicate to codegen that a 
> > > > > branch
> > should
> > > > >    be generated.
> > > >
> > > > Indeed.
> > > >
> > > > Now, we're for example currently missing the dependence on the loop
> > > > mask computation for fully-masked loops as we're not representing its
> > > > computation nor the implicit uses on SLP nodes.
> > > >
> > > > Each predicate would be a dependence on the predicate computation 
> > > > itself,
> > > > is that predicate computation supposed to be a SLP node?
> > >
> > > This is where I was hoping to get some feedback into Richard S's plans.
> > > My understanding is that he'll be working on BB SLP support for VLA and it
> > > looks to me like we could use the same bookkeeping here.
> > 
> > I think you are talking about allowing a subset of active lanes from the
> > real vector?  For BB SLP the static constant mask is kind-of implicit
> > with the number of lanes in the SLP node, the main issue we have here
> > currently is that we have to mask (or fill with safe values) the inactive
> > lanes when we have a "gap in the vector".  But sure, we could trigger
> > the masking logic itself by adding a all-ones mask (all lanes in the
> > SLP node are active), but the main issue will be to support "gaps"
> > at the end of vectors.
> > 
> > > My initial thought was to indeed just have a new node similar to how
> > > TWO_OPERANDS is implemented.  Except instead of having a permute node
> > > it would have a mask node.  This should also be usable when we're 
> > > building
> > > an SLP tree from statements where we could build the tree if masking is 
> > > applied:
> > > i.e.
> > >
> > > a[i] = b[i]
> > > a[i+1] = b[i+1] * 5
> > >
> > > should be SLP'able by masking out the second operand.
> > >
> > > I think an SLP node makes sense because we then don't need to keep a 
> > > separate
> > CFG.
> > 
> > Good.
> > 
> > > >
> > > > > conditions which read from the same sources can be merged (SLP'ed) if 
> > > > > the
> > > > > statements in the early exit block are the same.
> > > > >
> > > > > That is, the above can be handled as:
> > > > >
> > > > > unsigned test4(unsigned x, unsigned y)
> > > > > {
> > > > >  unsigned ret = 0;
> > > > >  unsigned sum = 0;
> > > > >  for (int i = 0; i < N; i++)
> > > > >  {
> > > > >    if (vect_a[i] > x)    :: true
> > > > >      return vect_a[i];   :: vect_a[i] > x
> > > > >
> > > > >    if (vect_a[i] > y)    :: !(vect_a[i] > x)
> > > > >      return sum;         :: vect_a[i] > y
> > > > >
> > > > >    vect_b[i] = x + i;    :: !(vect_a[i] > y)
> > > > >    vect_b[i] += x + i;   :: !(vect_a[i] > y)
> > > > >    sum += vect_a[i];     :: !(vect_a[i] > y)
> > > > >    vect_a[i] = x;        :: !(vect_a[i] > y)
> > > > >  }
> > > > >  return ret;
> > > > > }
> > > > >
> > > > > But the following:
> > > > >
> > > > > unsigned test4(unsigned x, unsigned y)
> > > > > {
> > > > >  unsigned ret = 0;
> > > > >  unsigned sum = 0;
> > > > >  for (int i = 0; i < N; i++)
> > > > >  {
> > > > >    vect_b[i] = x + i;    :: !(vect_a[i] > y)
> > > > >    if (vect_a[i] > x)    :: true
> > > > >      break;              :: vect_a[i] > x
> > > > >
> > > > >   vect_b[i] += x + i;    :: !(vect_a[i] > y)
> > > > >   if (vect_a[i] > y)     :: !(vect_a[i] > x)
> > > > >      break;              :: vect_a[i] > y
> > > > >   sum += vect_a[i];      :: !(vect_a[i] > y)
> > > > >   vect_a[i] = x;         :: !(vect_a[i] > y)
> > > > >  }
> > > > >  return ret;
> > > > > }
> > > > >
> > > > > becomes:
> > > > >
> > > > > unsigned test4(unsigned x, unsigned y)
> > > > > {
> > > > >  unsigned ret = 0;
> > > > >  unsigned sum = 0;
> > > > >  for (int i = 0; i < N; i++)
> > > > >  {
> > > > >    if (vect_a[i] > x || vect_a[i] > y)    :: true
> > > > >      break;              :: vect_a[i] > x
> > > > >
> > > > >    vect_b[i] = x + i;    :: !(vect_a[i] > y)
> > > > >    vect_b[i] += x + i;   :: !(vect_a[i] > y)
> > > > >    sum += vect_a[i];     :: !(vect_a[i] > y)
> > > > >    vect_a[i] = x;        :: !(vect_a[i] > y)
> > > > >  }
> > > > >  return ret;
> > > > > }
> > > > >
> > > > > In addition for this to work the vectorizer must be changed to 
> > > > > generate code
> > > > > solely based on the SLP tree and not directly from the BB.
> > > >
> > > > Yep - see my comments on the paper above and about the existing issues
> > > > with the BB vectorizer.  I'll remind myself that in future SLP build
> > > > should start with a single-lane SLP node for each SSA def so we'd have
> > > > SLP nodes for both the scalar and the vector part of a BB which might
> > > > "solve" this if we chose to never "schedule" scalar SLP nodes
> > > > (we currently tend to create cyclic dependences in some cases).
> > > >
> > > > > Note that the control predicates should be easy to determine 
> > > > > through
> > the
> > > > > post-order dominance tree.
> > > >
> > > > The paper (and the talk) talks about loops and the need to represent
> > > > them with a special "AST" node.  I think without doing that this 
> > > > restricts
> > > > us to non-loopy CFGs (like no outer loop vectorization)?
> > >
> > > Yeah, the outer loop vectorization approach from the paper also looked 
> > > interesting.
> > > The approach would theoretically allow for loop merging as well.  Though 
> > > I think
> > that
> > > requires you to try BB SLP before loop vectorization.
> > 
> > Note in the talk (I didn't read the paper thoroughly) they "cheat" and
> > perform loop fusion after rewriting into conditional IL and only then
> > perform SLP rather than doing the fusion on-demand.
> > 
> > As said I'd first see predication as a way to get rid of loop
> > if-conversion, merging that with the vectorizer itself, and of course
> > to get early break supported with SLP.  I'd look at expanding BB
> > vectorization later - similar for the idea of rewriting SLP discovery
> > (which predication will likely require to some extent).  I'm not sure
> > we want to go fully no-CFG, I'd rather have some SLP nodes "tied" to
> > a spot in the CFG for now and get "CFG aware SLP scheduling" this way.
> 
> I guess the problem here is that this way we won't be able to "SLP" the CFG.
> At the moment, in GIMPLE, compound expressions get broken up into
> multiple exits.  So if (a[i] != 0 || a[i+1] != 0) return; becomes two exits.
> 
> The idea with going CFG-less is that we should be able to SLP the control
> flow.  We should also be able to re-arrange the order of exits to e.g. test
> the cheapest one first.

Note that this is an optimization problem even for scalar code, having
a unified view of conditions with control flow and (partially)
if-converted sub-conditions and the ability to re-associate that and
emit control dependent statements in correct places is something we
cannot do there either.

> So I think I agree that at least for version 0 we should
> stick with requiring the CFG; no explicit CFG is probably better for future
> versions?

It's something worth exploring.

> > For example dependence analysis relies on a vector load to be emitted
> > at the earliest scalar load position of the set of loads vectorized
> > and for the store to be emitted at the latest scalar store position
> > of the set of stores vectorized.  Without also representing memory
> > data dependences in the SLP graph we have to preserve that
> > (you'd have to add SLP edges for virtual operands).  The paper doesn't
> > really cover this detail very explicitly AFAICS (how memory dependences
> > are preserved when re-writing the IL).
> 
> Yeah, I noticed this gap as well, they don't seem to handle statements with
> side-effects.   So for that I had extended it with my own scheme. I believe
> this essentially just requires different predicates for statements with 
> side-effects.
> 
> At least on paper it works out, so it's a simple extension of the idea.

Sure - you just need to model all kinds of dependences as extra predicates
or data dependences.  The simplified .MEM_n SSA is a linear number of
edges but "true" dependences are of course quadratic.

> > 
> > >
> > > >
> > > > > == Implementation ==
> > > > >
> > > > > The high-level steps for this implementation, and the 
> > > > > order I'll
> > > > > work on them, are:
> > > > >
> > > > > 1. Detach SLP codegen from BB
> > > > > 2. Teach build SLP to assign control predicates to relevant 
> > > > > statements. It might
> > > > >    be easier to do this during loop analysis, but I don't think 
> > > > > that'll work
> > > > >    for BB SLP. Thoughts?
> > > > > 3. Teach optimize SLP to merge blocks with similar control predicates
> > > > > 4. Teach SLP scheduling to move blocks
> > > > > 5. Teach vectorizable_early_break to produce new CFG
> > > >
> > > > Let me ask you to start with 0. the minimum viable "hack" to make
> > > > SLP of early break work.  For 1. to 5. I'd first concentrate on loop SLP
> > > > vectorization because there "detached codegen" is trivial (ha, joking
> > > > of course).  I'd probably go as far as to re-do SLP discovery between
> > > > 0. and tackling 1. - SLP discovery for loops, that is.
> > >
> > > Sure, that's fair enough.  I have some Arm specific patches to get out 
> > > first,
> > > which I'll do this week and can start on this next Tuesday.
> > >
> > > I just wanted to outline what I'm thinking for the longer term plan.
> > 
> > Sounds good.  Just to re-iterate my plan is to separate SLP discovery
> > for loop vectorization and BB vectorization - I'll motivate that
> > with some of the SLP-only patches that are queued to after when I
> > have managed to make myself happy with the load/store permute handling ...
> > 
> > > >
> > > > I think for 0. we want to be able to attach a [set of] predicates to
> > > > each SLP node.  With that we can represent loop masking and early
> > > > break conditions.  Those [set of] predicates would be an alternate
> > > > unsorted vector of "SLP children" and the SLP children would be
> > > > the predicate computation SLP nodes.  We'd have the chance to
> > > > combine the set of predicates down to one by adding intermediate
> > > > merging SLP nodes (in the end code generation just accepts one
> > > > predicate but discovery for simplicity(?) would add many, or two).
> > > >
> > >
> > > So if I understood correctly, during discovery do we maintain breaks
> > > in the same instance or do we represent them using these merge nodes
> > > which are essentially a sort of PHI node?  So they represent your branch
> > > in control flow and the predicates the jump locations?
> > 
> > I think stmts after a break will pick up the predicate SLP nodes from
> > the break but yes, the break stmt itself (the cbranch) would be the
> > seed of a separate SLP instance.  Caching of the predicate SLP
> > nodes should probably be tied to the basic-block (similar to how
> > if-conversion does this).  Whether to build SLP nodes to "and"
> > two exit predicates or whether to record them in a vector of
> > and_predicates in each SLP node is probably something that experiments
> > will tell - having a representation of predicates not "lowered" to
> > SLP operations is probably more convenient for simplification and/or
> > merging with loop mask predicates, but for codegen and costing having
> > the actual predicate merging ops spelled out is preferred.
> > 
> > > > For early break and without 1. we'd have to "fix" the position of
> > > > some SLP nodes in the CFG, namely the "cbranch" SLP node (maybe
> > > > that's just a predicate compute node of special kind).  I think
> > > > I mentioned adding an optional gimple_stmt_iterator to SLP nodes
> > > > to support this (or special-case the cbranch node during scheduling
> > > > by looking at the gcond * that's inevitably there).  The loop
> > > > mask compute would get a gsi_after_labels for example.
> > > >
> > > > In principle if we have this as 0. then 1. should become just
> > > > an engineering exercise.
> > >
> > > OK, and so if I understand correctly, code motion becomes a matter of
> > > pushing nodes down the instance until the merge predicate simplifies?
> > >
> > > Or do we keep them in place, and just materialize them in the correct
> > > BB?
> > >
> > > I imagine that for 0 not doing 1 would be easier as we can just re-use
> > > the scalar BB as is today and not have to go down the path of being
> > > able to generate new CFG yet.
> > 
> > Yes, as said I'd first do only 0 and record fixed scheduling points
> > for SLP nodes where that's necessary (like for the cbranch node).
> > It might be that the code motion that's required for correctness(!)
> > happens "magically", but I'm not sure we want to rely on that ;)
> > ISTR we only push stmts down after the exits, so predicating them
> > with the in-loop block predicate should get them scheduled appropriately?
> > 
> > > >
> > > > I'm trying to make you focus on 0. as I'm really eager to make
> > > > the only-SLP transition this stage1 and doing all the neat stuff
> > > > will take longer.
> > >
> > > That is fair, I'll move it up my priority list. I'm hoping to get as much 
> > > of
> > > this done during this stage-1 (spent a lot of time thinking about it).
> > >
> > > So I'll get my AArch64 patches out and start on 0.  I'll continue on
> > > IV opts in the background.
> > 
> > Thanks a lot.
> > 
> > > >
> > > > > == Switch vectorization support ==
> > > > >
> > > > > The following code
> > > > >
> > > > > short a[100];
> > > > >
> > > > > int foo(int n, int counter)
> > > > > {
> > > > >    for (int i = 0; i < n; i++)
> > > > >      {
> > > > >         if (a[i] == 1 || a[i] == 2 || a[i] == 7 || a[i] == 4)
> > > > >           return 1;
> > > > >      }
> > > > >     return 0;
> > > > > }
> > > > >
> > > > > fails to vectorize at -O3 -march=armv9-a because in GIMPLE the if is 
> > > > > rewritten
> > > > > into a switch statement:
> > > > >
> > > > >   <bb 2> [local count: 114863530]:
> > > > >   if (n_6(D) > 0)
> > > > >     goto <bb 6>; [94.50%]
> > > > >   else
> > > > >     goto <bb 11>; [5.50%]
> > > > >
> > > > >   <bb 6> [local count: 108546036]:
> > > > >
> > > > >   <bb 3> [local count: 1014686025]:
> > > > >   # i_3 = PHI <i_8(7), 0(6)>
> > > > >   _1 = a[i_3];
> > > > >   switch (_1) <default: <L9> [94.50%], case 1 ... 2: <L11> [5.50%], 
> > > > > case 4: <L11> [5.50%], case 7: <L11> [5.50%]>
> > > > >
> > > > >   <bb 9> [local count: 55807731]:
> > > > > <L11>:
> > > > >   goto <bb 5>; [100.00%]
> > > > >
> > > > >   <bb 4> [local count: 958878295]:
> > > > > <L9>:
> > > > >   i_8 = i_3 + 1;
> > > > >   if (n_6(D) > i_8)
> > > > >     goto <bb 7>; [94.50%]
> > > > >   else
> > > > >     goto <bb 12>; [5.50%]
> > > > >
> > > > >   <bb 12> [local count: 52738306]:
> > > > >   goto <bb 8>; [100.00%]
> > > > >
> > > > >   <bb 7> [local count: 906139989]:
> > > > >   goto <bb 3>; [100.00%]
> > > > >
> > > > >   <bb 11> [local count: 6317494]:
> > > > >
> > > > >   <bb 8> [local count: 59055800]:
> > > > >
> > > > >   <bb 5> [local count: 114863531]:
> > > > >   # _5 = PHI <1(9), 0(8)>
> > > > > <L10>:
> > > > >   return _5;
> > > > >
> > > > > However such switch statements, where all the entries lead to the same
> > > > > destination, are easy to vectorize.  In SVE we have the MATCH 
> > > > > instruction that
> > > > > can be used here, and for others we can duplicate the constants and 
> > > > > lower
> > > > > the switch to a series of compare and ORRs.
> > > > >
> > > > > This is similar to what's done when the values don't fit inside a 
> > > > > switch:
> > > > >
> > > > > short a[100];
> > > > >
> > > > > int foo(int n, int counter, short x, short b, short c)
> > > > > {
> > > > >    for (int i = 0; i < n; i++)
> > > > >      {
> > > > >         if (a[i] == x || a[i] == b || a[i] == 7 || a[i] == c)
> > > > >           return 1;
> > > > >      }
> > > > >     return 0;
> > > > > }
> > > > >
> > > > > is vectorized as:
> > > > >
> > > > > .L4:
> > > > >         incb    x5
> > > > >         incw    z30.s, all, mul #2
> > > > >         cmp     w6, w1
> > > > >         bcc     .L15
> > > > > .L6:
> > > > >         ld1h    z31.h, p7/z, [x5]
> > > > >         cmpeq   p15.h, p7/z, z31.h, z27.h
> > > > >         cmpeq   p11.h, p7/z, z31.h, z28.h
> > > > >         cmpeq   p14.h, p7/z, z31.h, #7
> > > > >         cmpeq   p12.h, p7/z, z31.h, z29.h
> > > > >         orr     p15.b, p7/z, p15.b, p11.b
> > > > >         orr     p14.b, p7/z, p14.b, p12.b
> > > > >         inch    x1
> > > > >         orr     p15.b, p7/z, p15.b, p14.b
> > > > >         ptest   p13, p15.b
> > > > >         b.none  .L4
> > > > >         umov    w1, v30.s[0]
> > > > > .L3:
> > > > >         sxtw    x1, w1
> > > > >         b       .L7
> > > > >
> > > > > which is great! but it should be an SVE MATCH instruction as well.
> > > > >
> > > > > This kind of code shows up through the use of std::find_if as well:
> > > > >
> > > > > #include <bits/stdc++.h>
> > > > > using namespace std;
> > > > >
> > > > > bool pred(int i) { return i == 1 || i == 2 || i == 7 || i == 4; }
> > > > >
> > > > > int foo(vector<int> vec)
> > > > > {
> > > > >     vector<int>::iterator it;
> > > > >
> > > > >     it = find_if(vec.begin(), vec.end(), pred);
> > > > >
> > > > >     return *it;
> > > > > }
> > > > >
> > > > > and once the unrolled loop is gone we should be able to vectorize it.
> > > > >
> > > > > My proposal is to create two new IFNs and optabs:
> > > > >
> > > > > IFN_MATCH, IFN_NMATCH and have a vectorizer pattern recognize the ORR
> > > > reduction
> > > > > cases into MATCH and have ifcvt lower the switch into ORR statements.
> > > > >
> > > > > Such switch statements only have two branches and so are identical to
> > cbranch
> > > > > for codegen and vectorization.
> > > > >
> > > > > The unrolled ORRs are replaced by MATCH and so the gimple becomes:
> > > > >
> > > > > a = .MATCH (...)
> > > > > if (a != {0,..})
> > > > > and so no other change is needed for codegen.
> > > >
> > > > This sounds separate from the rest, I think teaching if-conversion to
> > > > lower (small, or simple by some measure) switches is the way to go
> > > > for now.  There's existing code to produce ifs from switch and those
> > > > can be if-converted in the classical way (in the if (vectorized) loop
> > > > copy only, of course).
> > >
> > > Ack, it was just something we noticed in some cases which blocked 
> > > vectorization
> > > which seemed easy to address 😊
> > >
> > > >
> > > > An alternative with IFN_MATCH sounds good in principle.
> > > >
> > > > > == Better cbranch support ==
> > > > >
> > > > > At the moment, one of the big downsides of re-using the existing 
> > > > > cbranch is
> > that
> > > > > in the target we can't tell whether the result of the cbranch is 
> > > > > actually needed
> > > > > or not.
> > > > >
> > > > > i.e. for SVE we can't tell if the predicate is needed.  For the cases 
> > > > > where we
> > > > > don't stay inside the vector loop we can generate more efficient code 
> > > > > if we
> > know
> > > > > that the loop only cares about any or all bits set and doesn't need 
> > > > > to know
> > > > > which one.
> > > > >
> > > > > For this reason I propose adding new optabs cbranch_any_ and 
> > > > > cbranch_all_ and
> > > > > have emit_cmp_and_jump_insns lower to these when appropriate.
> > > >
> > > > Hmm, but isn't this then more a vec_cmpeq_any that produces a scalar
> > > > rather than a vector and then a regular scalar compare-and-jump?  That 
> > > > is,
> > > > does SVE have such a compare instruction?  Can you show the current
> > > > code-gen and how the improved one would look like?
> > >
> > > Ah so the point here is that we don't need the scalar results, the optabs 
> > > would
> > only
> > > care about the CC flags.
> > >
> > > Consider the following loop:
> > >
> > > #ifndef N
> > > #define N 800
> > > #endif
> > > unsigned vect_a[N];
> > > unsigned vect_b[N];
> > >
> > > unsigned test4(unsigned x)
> > > {
> > >  unsigned ret = 0;
> > >  for (int i = 0; i < N; i++)
> > >  {
> > >    vect_b[i] = x + i;
> > >    if (vect_a[i]*2 != x)
> > >      break;
> > >    vect_a[i] = x;
> > >
> > >  }
> > >  return ret;
> > > }
> > >
> > > For early break, where we branch to scalar code, we don't care about
> > > the result of the mask, as it's unused:
> > >
> > > so in gimple we have:
> > >
> > >   mask_patt_6.15_69 = vect_cst__60 != vect__4.14_67;
> > >   vec_mask_and_72 = loop_mask_64 & mask_patt_6.15_69;
> > >   if (vec_mask_and_72 != { 0, ... })
> > >
> > > which generates:
> > >
> > >         cmpne   p14.s, p7/z, z30.s, z29.s
> > 
> > p7 is the loop mask here?  p14.s the output of the masked compare?
> > 
> > >         ptest   p15, p14.b
> > 
> > and p15 is all-zero?
> 
> p15 here is all true; in SVE, cbranch emits a ptest, and in this case a
> ptrue is fine since we only care about 0 and != 0.
> 
> I'll explain more below.
> 
> > 
> > >         b.none  .L3
> > >
> > > Notice the ptest.  In SVE, if the size of the predicate of an
> > > operation does not match that of the governing predicate, a ptest is
> > > required to reshape the flags.
> > >
> > > However, for early breaks all we care about is any or all.  So the
> > > exact shape of the predicate register doesn't matter, and since
> > > vector compares already set flags, we already have the answer we
> > > need.  So the above should be:
> > >
> > >         cmpne   p14.s, p7/z, z30.s, z29.s
> > >         b.none  .L3
> > 
> > Hmm, I don't really see anything missing on the GIMPLE side - doesn't
> > the required info all translate to RTL and thus instruction combine
> > should be able to handle this?
> 
> For SVE only we could potentially do it, but for using SVE instructions
> for Adv. SIMD we can't.  Adv. SIMD requires us to compress the vector
> results into an integer and compare that with 0 to get the CC flags.
> SVE doesn't have this, because in SVE a vector compare sets flags.  So
> in this case, we don't care about p14 at all.
> 
> But if we, say, want to use an SVE compare to replace an Adv. SIMD
> compare, the sequence of instructions will be longer than what combine
> can match.  And it becomes very hairy; I've tried it before, and
> because of the various combinations of compares (the vector and the
> flag ones) it ends up with a lot of complicated patterns.
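The difference described above can be modelled in plain C. The sketch below is purely illustrative and the function names are made up: the Adv. SIMD-style variant has to reduce the per-lane compare results to a scalar before an integer compare can set the flags, while the SVE-style variant assumes the compare itself sets the "any" condition, so the lane mask is never needed as a value.

```c
#include <stdint.h>

#define LANES 4

/* Adv. SIMD-style early-break test: the per-lane compare results must
   first be reduced ("compressed") into a scalar, and only the scalar
   compare with 0 sets the condition flags for the branch.  */
static int
any_lane_ne_advsimd (const uint32_t *a, uint32_t x)
{
  uint32_t lane_mask[LANES];
  for (int i = 0; i < LANES; i++)
    lane_mask[i] = (a[i] * 2 != x) ? UINT32_MAX : 0;  /* vector cmne */

  uint32_t reduced = 0;
  for (int i = 0; i < LANES; i++)
    reduced |= lane_mask[i];        /* umaxp/fmov-style lane reduction */

  return reduced != 0;              /* scalar compare sets the flags */
}

/* SVE-style early-break test: the vector compare sets the condition
   flags as a side effect, so the branch (b.any/b.none) can consume them
   directly and the predicate result itself is dead.  */
static int
any_lane_ne_sve (const uint32_t *a, uint32_t x)
{
  int any = 0;                      /* models the CC "any" condition */
  for (int i = 0; i < LANES; i++)
    any |= (a[i] * 2 != x);         /* cmpne sets flags while comparing */
  return any;                       /* b.any branches on this */
}
```

Both functions return nonzero exactly when some lane satisfies `a[i]*2 != x`; the point is only that the first needs the lane mask materialized as a value while the second does not.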

Is that something worth optimizing?

> > 
> > > This makes a big difference in loops for performance.  We have a
> > > small CC optimization pass that tries to optimize such cases:
> > > https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=5a783f42d77b2f00a1ed171c119b020e8df8e521
> > >
> > > But this case is hard to detect, as you need to know that the
> > > predicate is not used and that we only care about any/all.  This is
> > > information we have in GIMPLE, so we can more readily do this in
> > > expand.
> > 
> > But the ptest p15, p14.b is the same as vec_mask_and_72 != { 0, ... }
> > so I don't see where information is missing?  I probably don't
> > understand the "shape" thing.
> 
> So for SVE the usages of the compares for early break don't care about
> the mask value; that is, we don't care about the predicate result of
> vec_mask_and_72 != { 0, ... }.
> 
> Because for SVE the vector compare sets the CC flags indicating one of
> 3 options:
> 1. none == the result predicate is all false
> 2. first == the first element in the predicate is true
> 3. !last == the last element in the predicate is not true
> 
> This gives us the ability for early break to just branch based on the
> CC, since we just care about none or !none, which is abstracted as the
> pseudo branch instructions b.any and b.none.
> 
> The ptrue is used to reshape the CC into the correct form when the size
> of the governing predicate doesn't match that of the compare results.
> 
> For none and any this doesn't matter, but it changes the meaning for
> first and !last.  But because for early break we only require none and
> any, the ptrue is never needed.
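The width dependence of first/!last versus any/none can be modelled in a few lines of C. The helper below is hypothetical, purely for illustration: viewing the same predicate result under a wider governing predicate (with the trailing lanes inactive) leaves "none" unchanged but can flip "!last", which is why reshaping is only needed when first/!last matter.

```c
/* Toy model of the SVE condition bits computed from a predicate result:
   the first 'active' lanes hold real compare results; lanes beyond
   that, up to 'width' (the governing predicate's size), are inactive
   and read as 0.  */
struct cc_bits { int none, first, notlast; };

static struct cc_bits
cc_from_pred (const int *pred, int active, int width)
{
  struct cc_bits f = { 1, 0, 1 };
  for (int i = 0; i < width; i++)
    {
      int bit = i < active ? pred[i] : 0;
      if (bit)
        f.none = 0;             /* some element is true -> not "none" */
      if (i == 0)
        f.first = bit;          /* is the first element true?  */
      if (i == width - 1)
        f.notlast = !bit;       /* is the last element not true?  */
    }
  return f;
}
```

With pred = {0, 0, 0, 1} and active = 4, width 4 gives none = 0 and !last = 0, while width 8 still gives none = 0 but !last = 1: any/none is invariant under the governing predicate's width, first/!last is not.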

So why's it not "DCEd" then?

I guess I still don't get it - it seems like an architectural detail
GIMPLE should know nothing about and that you need to solve in the
backend.

Richard.

> Now for Adv. SIMD, what we typically do when an SVE instruction can be
> used in place of an Adv. SIMD sequence is deal with it at expand time.
> But for early break that's not possible, because cbranch doesn't see
> the compare.  So we are forced to expand to the Adv. SIMD equivalent
> and later try to match it up with an SVE instruction.  This results in
> more instructions than combine can handle.  The new optab would allow
> us to do this easily, as we don't ever have to emit the Adv. SIMD code
> first.
> 
> And for SVE it will tell us as well that the result value of the
> comparison is not needed.  I.e. it conveys that all we care about is
> branching based on a compare; how you do it is up to you.
> 
> Cheers,
> Tamar
> 
> > 
> > Richard.
> > 
> > > The additional benefit of the optab is that it allows us to more
> > > easily use SVE instructions to do the flag setting when we choose
> > > Adv. SIMD but SVE is available.  Since we know we don't need to
> > > materialize an Adv. SIMD boolean vector, we can avoid materializing
> > > the conversion from an SVE predicate to an Adv. SIMD mask vector.
> > >
> > > Cheers,
> > > Tamar
> > >
> > > >
> > > > > Are the general idea and steps OK?
> > > >
> > > > See above.  Thanks for the write-up.
> > > >
> > > > Richard.
> > > >
> > > > > If so I'll start implementation now.
> > > > >
> > > > > Thanks,
> > > > > Tamar
> > > > >
> > > > > [1] Yishen Chen, Charith Mendis, and Saman Amarasinghe. 2022. All
> > > > > You Need Is Superword-Level Parallelism: Systematic Control-Flow
> > > > > Vectorization with SLP. In Proceedings of the 43rd ACM SIGPLAN
> > > > > International Conference on Programming Language Design and
> > > > > Implementation (PLDI '22), June 13-17, 2022, San Diego, CA, USA.
> > > > > ACM, New York, NY, USA, 15 pages.
> > > > > https://doi.org/10.1145/3519939.3523701
> > > > >
> > > > > [2] Predicated Static Single Assignment.  Lori Carter, Beth Simon,
> > > > > Brad Calder, Larry Carter, Jeanne Ferrante.  Department of Computer
> > > > > Science and Engineering, University of California, San Diego.
> > > > > flcarter,esimon,calder,carter,ferran...@cs.ucsd.edu
> > > > >
> > > >
> > > > --
> > > > Richard Biener <rguent...@suse.de>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> > >
> > 
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
