On Wed, 4 Jun 2025, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Wednesday, June 4, 2025 8:04 AM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: Richard Biener <richard.guent...@gmail.com>; Richard Sandiford
> > <richard.sandif...@arm.com>; Pengfei Li <pengfei....@arm.com>; gcc-
> > patc...@gcc.gnu.org; ktkac...@nvidia.com
> > Subject: RE: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors
> > 
> > On Tue, 3 Jun 2025, Tamar Christina wrote:
> > 
> > > > -----Original Message-----
> > > > From: Richard Biener <richard.guent...@gmail.com>
> > > > Sent: Tuesday, June 3, 2025 2:12 PM
> > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > Cc: Richard Biener <rguent...@suse.de>; Richard Sandiford
> > > > <richard.sandif...@arm.com>; Pengfei Li <pengfei....@arm.com>; gcc-
> > > > patc...@gcc.gnu.org; ktkac...@nvidia.com
> > > > Subject: Re: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors
> > > >
> > > > On Fri, May 9, 2025 at 4:05 PM Tamar Christina <tamar.christ...@arm.com>
> > > > wrote:
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Richard Biener <rguent...@suse.de>
> > > > > > Sent: Friday, May 9, 2025 2:44 PM
> > > > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > > > Cc: Richard Sandiford <richard.sandif...@arm.com>; Pengfei Li
> > > > > > <pengfei....@arm.com>; gcc-patches@gcc.gnu.org; ktkac...@nvidia.com
> > > > > > Subject: RE: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors
> > > > > >
> > > > > > On Fri, 9 May 2025, Tamar Christina wrote:
> > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Richard Biener <rguent...@suse.de>
> > > > > > > > Sent: Friday, May 9, 2025 11:08 AM
> > > > > > > > To: Richard Sandiford <richard.sandif...@arm.com>
> > > > > > > > Cc: Pengfei Li <pengfei....@arm.com>; gcc-patches@gcc.gnu.org;
> > > > > > > > ktkac...@nvidia.com
> > > > > > > > Subject: Re: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors
> > > > > > > >
> > > > > > > > On Fri, 9 May 2025, Richard Sandiford wrote:
> > > > > > > >
> > > > > > > > > Richard Biener <rguent...@suse.de> writes:
> > > > > > > > > > On Thu, 8 May 2025, Pengfei Li wrote:
> > > > > > > > > >
> > > > > > > > > >> This patch improves the auto-vectorization for loops with
> > > > > > > > > >> known small trip counts by enabling the use of subvectors -
> > > > > > > > > >> bit fields of original wider vectors.  A subvector must
> > > > > > > > > >> have the same vector element type as the original vector
> > > > > > > > > >> and enough bits for all vector elements to be processed in
> > > > > > > > > >> the loop.  Using subvectors is beneficial because machine
> > > > > > > > > >> instructions operating on narrower vectors usually show
> > > > > > > > > >> better performance.
> > > > > > > > > >>
> > > > > > > > > >> To enable this optimization, this patch introduces a new
> > > > > > > > > >> target hook.  This hook allows the vectorizer to query the
> > > > > > > > > >> backend for a suitable subvector type given the original
> > > > > > > > > >> vector type and the number of elements to be processed in
> > > > > > > > > >> the small-trip-count loop.  The target hook also has a
> > > > > > > > > >> could_trap parameter to say if the subvector is allowed to
> > > > > > > > > >> have more bits than needed.
> > > > > > > > > >>
> > > > > > > > > >> This optimization is currently enabled for AArch64 only.
> > > > > > > > > >> The example below shows how it uses AdvSIMD vectors as
> > > > > > > > > >> subvectors of SVE vectors for higher instruction throughput.
> > > > > > > > > >>
> > > > > > > > > >> Consider this loop operating on an array of 16-bit integers:
> > > > > > > > > >>
> > > > > > > > > >>      for (int i = 0; i < 5; i++) {
> > > > > > > > > >>        a[i] = a[i] < 0 ? -a[i] : a[i];
> > > > > > > > > >>      }
> > > > > > > > > >>
> > > > > > > > > >> Before this patch, the generated AArch64 code would be:
> > > > > > > > > >>
> > > > > > > > > >>      ptrue   p7.h, vl5
> > > > > > > > > >>      ptrue   p6.b, all
> > > > > > > > > >>      ld1h    z31.h, p7/z, [x0]
> > > > > > > > > >>      abs     z31.h, p6/m, z31.h
> > > > > > > > > >>      st1h    z31.h, p7, [x0]
> > > > > > > > > >
> > > > > > > > > > p6.b has all lanes active - why is the abs then not
> > > > > > > > > > simply unmasked?
> > > > > > > > >
> > > > > > > > > There is no unpredicated abs for SVE.  The predicate has to 
> > > > > > > > > be there,
> > > > > > > > > and so expand introduces one even when the gimple stmt is
> > > > unconditional.
> > > > > > > > >
> > > > > > > > > >> After this patch, it is optimized to:
> > > > > > > > > >>
> > > > > > > > > >>      ptrue   p7.h, vl5
> > > > > > > > > >>      ld1h    z31.h, p7/z, [x0]
> > > > > > > > > >>      abs     v31.8h, v31.8h
> > > > > > > > > >>      st1h    z31.h, p7, [x0]
> > > > > > > > > >
> > > > > > > > > > Help me decipher this - I suppose z31 and v31 "overlap" in 
> > > > > > > > > > the
> > > > > > > > > > register file?  And z31 is a variable-length vector but
> > > > > > > > > > z31.8h is an 8-element fixed-length vector?  How can we
> > > > > > > > >
> > > > > > > > > v31.8h, but otherwise yes.
> > > > > > > > >
> > > > > > > > > > end up with just 8 elements here?  From the upper iteration
> > > > > > > > > > bound?
> > > > > > > > >
> > > > > > > > > Yeah.
> > > > > > > > >
> > > > > > > > > > I'm not sure why you need any target hook here.  It seems 
> > > > > > > > > > you
> > > > > > > > > > do already have suitable vector modes so why not just ask
> > > > > > > > > > for a suitable vector?  Is it because you need to have
> > > > > > > > > > that register overlap guarantee (otherwise you'd get
> > > > > > > > > > a move)?
> > > > > > > > >
> > > > > > > > > Yeah, the optimisation only makes sense for overlaid vector 
> > > > > > > > > registers.
> > > > > > > > >
> > > > > > > > > > Why do we not simply use fixed-length SVE here in the first 
> > > > > > > > > > place?
> > > > > > > > >
> > > > > > > > > Fixed-length SVE is restricted to cases where the exact 
> > > > > > > > > runtime length
> > > > > > > > > is known: the compile-time length is both a minimum and a 
> > > > > > > > > maximum.
> > > > > > > > > In contrast, the code above would work even for 256-bit SVE.
> > > > > > > > >
> > > > > > > > > > To me doing this in this way in the vectorizer looks
> > > > > > > > > > somewhat out-of-place.
> > > > > > > > > >
> > > > > > > > > > That said, we already have unmasked ABS in the IL:
> > > > > > > > > >
> > > > > > > > > >   vect__1.6_15 = .MASK_LOAD (&a, 16B, { -1, -1, -1, -1, -1,
> > > > > > > > > > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... }, { 0, ... });
> > > > > > > > > >   vect__2.7_16 = ABSU_EXPR <vect__1.6_15>;
> > > > > > > > > >   vect__3.8_17 = VIEW_CONVERT_EXPR<vector([8,8]) short int>(vect__2.7_16);
> > > > > > > > > >   .MASK_STORE (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0,
> > > > > > > > > > 0, 0, 0, 0, 0, 0, ... }, vect__3.8_17); [tail call]
> > > > > > > > > >
> > > > > > > > > > so what's missing here?  I suppose having a constant
> > > > > > > > > > masked ABSU here would allow RTL expansion to select a
> > > > > > > > > > fixed-size mode?
> > > > > > > > > >
> > > > > > > > > > And the vectorizer could simply use the existing
> > > > > > > > > > related_vector_mode hook instead?
> > > > > > > > >
> > > > > > > > > I agree it's a bit awkward.  The problem is that we want
> > > > > > > > > conflicting things.  On the one hand, it would make
> > > > > > > > > conceptual sense to use SVE instructions to provide
> > > > > > > > > conditional optabs for Advanced SIMD vector modes.
> > > > > > > > > E.g. SVE's LD1W could act as a predicated load for an
> > > > > > > > > Advanced SIMD int32x4_t vector.  The main problem with that
> > > > > > > > > is that Advanced SIMD's native boolean vector type is an
> > > > > > > > > integer vector of 0s and -1s, rather than an SVE predicate.
> > > > > > > > > For some (native Advanced SIMD) operations we'd want one
> > > > > > > > > type of boolean, for some (SVE emulating Advanced SIMD)
> > > > > > > > > operations we'd want the other type of boolean.
> > > > > > > > >
> > > > > > > > > The patch goes the other way and treats using Advanced SIMD
> > > > > > > > > as an optimisation for SVE loops.
> > > > > > > > >
> > > > > > > > > related_vector_mode suffers from the same problem.  If we
> > > > > > > > > ask for a vector mode of >=5 halfwords for a load or store,
> > > > > > > > > we want the SVE mode, since that can be conditional on an
> > > > > > > > > SVE predicate.  But if we ask for a vector mode of >=5
> > > > > > > > > halfwords for an integer absolute operation, we want the
> > > > > > > > > Advanced SIMD mode.  So I suppose the new hook is
> > > > > > > > > effectively providing context.  Perhaps we could do that
> > > > > > > > > using an extra parameter to related_vector_mode, if that
> > > > > > > > > seems better.
> > > > > > > > >
> > > > > > > > > It's somewhat difficult to recover this information after
> > > > > > > > > vectorisation, since like you say, the statements are often
> > > > > > > > > unconditional and operate on all lanes.
> > > > > > > >
> > > > > > > > So it seems we want to query if there's a lowpart fixed-size 
> > > > > > > > vector
> > > > > > > > mode available for a given other mode.  It seems to me that we
> > > > > > > > should have a way to query for this already without having a new
> > > > > > > > target hook using general code?
> > > > > >
> > > > > > So any answer to this?  You should be able to iterate over all
> > > > > > vector modes, look for those with fixed size and fitting the
> > > > > > lane constraint, and then ask whether the modes are tieable
> > > > > > or whatever else is the correct way to verify the constraint?
> > > > > >
> > > > > > So sth as simple as
> > > > > >
> > > > > >  mode = mode_for_vector (GET_MODE_INNER (vmode), ceil_pow2 (const_nunits));
> > > > > >  if (targetm.modes_tieable_p (mode, vmode))
> > > > > >    return mode;
> > > > > >
> > > > > > ?  Why do we need a target hook for this?  What's the "hidden"
> > > > > > constraint I'm missing?
> > > > > >
> > > > >
> > > > > Richard can correct me if I'm wrong (probably) but the problem with
> > > > > this is that it won't work with VLS e.g. -msve-vector-bits because
> > > > > the SVE modes are fixed size then.  Secondly it'll have issues
> > > > > respecting --param aarch64-autovec-preference= as this is intended
> > > > > to only affect autovec where mode_for_vector is general.
> > > > >
> > > > > The core of this optimization is that it must change to Adv. SIMD
> > > > > over SVE modes.
> > > >
> > > > OK, so SVE VLS -msve-vector-bits=128 modes are indistinguishable
> > > > from Adv. SIMD modes by the middle-end?
> > >
> > > I believe so, the ACLE types have an annotation on them to lift some of 
> > > the
> > > restrictions but the modes are the same.
> > >
> > > > Is there a way to distinguish them, say, by cost
> > > > (target_reg_cost?)?  Since any use of a SVE reg will require a 
> > > > predicate reg?
> > > >
> > >
> > > We do have unpredicated SVE instructions, but yes costing could work.
> > > Essentially what we're trying to do is find the cheapest mode to perform
> > > the operation on.
> > >
> > > This could work...  But how would we incorporate it into the costing?
> > > Part of the problem is that iterating over similar modes with the same
> > > element size likely requires some target input, no?  Or are you saying
> > > we should only iterate over fixed-size modes?
> > >
> > > Regards,
> > > Tamar
> > >
> > > > I think we miss a critical bit of information in the middle-end here 
> > > > and I'm
> > > > trying to see what piece of information that actually is.  
> > > > "find_subvector_type"
> > > > doesn't seem to be it, it's maybe using that hidden bit of information 
> > > > for
> > > > one specific use-case but it would be far better to have a way for the 
> > > > target
> > > > to communicate the missing bit of information in a more generic way?
> > > > We can then wrap a "find_subvector_type" around that.
> > 
> > So for this one sth like targetm.mode_requires_predication ()?  But
> > as Tamar says above this really depends on the operation.  But the
> > optabs do _not_ expose this requirement (we have non-.COND_ADD for
> > SVE modes), but you want to take advantage of this difference.
> > Can we access insn attributes from optab entries?  Could we add
> > some "standard" attribute noting that an insn requires a predicate?
> > But of course that likely depends on the alternative?
> > 
> 
> We'd likely also require the mask that would be used, because I think 
> otherwise
> targetm.mode_requires_predication would be a bit ambiguous for non-flag 
> setting
> instructions or instructions that don’t do cross lane operations.
> 
> e.g. SVE has both COND_ADD and ADD. But the key here is that if we know we'll
> access the bottom 64 or 128 bits we could use an Adv. SIMD ADD.

But SVE ADD still requires a predicate register (with all lanes enabled),
no?  That's the whole point of the optimization we're discussing?
I see the only problem with -msve-vector-bits=N, where GET_MODE_SIZE
is no longer a POLY_INT - otherwise that would be the easy
way to identify Adv. SIMD vs. SVE and to heuristically prefer
fixed-size modes in the vectorizer when possible (for small known
niter <= the number of lanes of the fixed-size mode).  But with
-msve-vector-bits=128 GET_MODE_SIZE for Adv. SIMD and SVE is equal(?),
so we need another way to distinguish.  Even with
-msve-vector-bits=128 you need the predicate register appropriately
set up - as I understand it you are not altering the SVE HW config,
which would also be possible(?) - but I'm not sure that would make it
possible to have a predicate-register-less ADD instruction.

Which SVE-register-taking machine instructions do not, explicitly or
implicitly, use one of the SVE predicate registers?

> Where the operation would be beneficial for longer-VL cores where the
> Adv. SIMD vector pipes are multiplexed on the SVE ones, such as
> Neoverse-V1, but not Neoverse-V2.
> 
> Without the predicate being considered (or niters) SVE would have to return 
> false
> for the hook.
> 
> Which is why an attribute may be tricky.
> 
> Regards,
> Tamar
> 
> > Richard.
> > 
> > 
> > > > Thanks,
> > > > Richard.
> > > >
> > > > > > > > It doesn't really fit related_vector_mode I guess.
> > > > > > > >
> > > > > > >
> > > > > > > Yeah I don't think it would work unless, as Richard mentioned, we
> > > > > > > have an argument to indicate which SIMD class you want to end up with.
> > > > > > >
> > > > > > > > I also wonder if we can take it as a given that SVE and neon
> > > > > > > > inter-operate efficiently for all implementations, without some
> > > > > > > > kind of "rewriting" penalty.  Like in the above example we
> > > > > > > > set NEON v31 and use it as source for a SVE store via z31.
> > > > > > > >
> > > > > > >
> > > > > > > It's an architectural requirement that the register files overlap.
> > > > > > > This is described in section B1.2 (Registers in AArch64 execution
> > > > > > > state).  Any core that ever implements non-overlapping registers
> > > > > > > where they would need a rewiring penalty has bigger problems to
> > > > > > > worry about; as an example, the architecture describes that
> > > > > > > writing to the lower part of an Adv. SIMD register will clear the
> > > > > > > top parts up to VL of the SVE "view".
> > > > > > >
> > > > > > > Any such rewiring would mean that Adv. SIMD and Scalar FPR
> > > > > > > instructions become useless due to the defined state of the
> > > > > > > larger views on the register.
> > > > > >
> > > > > > I'm aware of the architectural requirement - it's just that I could
> > > > > > think of the HW re-configuring itself for masked operations and thus
> > > > > > switching back and forth might incur some penalty.  If it is common
> > > > > > practice to mix SVE and NEON this way it's of course unlikely such
> > > > > > uarch would be successful.  But then, powering off mask & high vector
> > > > > > part handling logic when facing NEON might be a possibility.
> > > > > >
> > > > >
> > > > > Yeah the reason for this optimization has more to do with how the
> > > > > vector pipes are split between Adv. SIMD and SVE.  An easy example
> > > > > is reductions: the bigger the VL, the more expensive in-order
> > > > > reductions like addv become.  But Adv. SIMD reductions have a fixed
> > > > > cost, and if we know we only need to reduce the bottom N bits it'll
> > > > > always beat SVE reductions.
> > > > >
> > > > > Others like MUL just have a higher throughput in Adv. SIMD vs SVE
> > > > > on e.g. VL 256 bit cores.  So it's not just the masking but vector
> > > > > length in general.
> > > > >
> > > > > And the reason we don't pick Adv. SIMD for such loops is that SVE
> > > > > allows partial masking, so for e.g. MUL it's ok for us to multiply
> > > > > with an unknown-valued lane, since predication makes the usages of
> > > > > the result safe.
> > > > >
> > > > > Thanks,
> > > > > Tamar
> > > > >
> > > > > > Richard.
> > > > > >
> > > > > > > Thanks,
> > > > > > > Tamar
> > > > > > >
> > > > > > > > Richard.
> > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Richard
> > > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Richard Biener <rguent...@suse.de>
> > > > > > > > SUSE Software Solutions Germany GmbH,
> > > > > > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> > > > > > >
> > > > > >
> > >
> > 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
