> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Wednesday, June 4, 2025 8:34 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: Richard Biener <richard.guent...@gmail.com>; Richard Sandiford <richard.sandif...@arm.com>; Pengfei Li <pengfei....@arm.com>; gcc-patc...@gcc.gnu.org; ktkac...@nvidia.com
> Subject: RE: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors
> 
> On Wed, 4 Jun 2025, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <rguent...@suse.de>
> > > Sent: Wednesday, June 4, 2025 8:04 AM
> > > To: Tamar Christina <tamar.christ...@arm.com>
> > > Cc: Richard Biener <richard.guent...@gmail.com>; Richard Sandiford <richard.sandif...@arm.com>; Pengfei Li <pengfei....@arm.com>; gcc-patc...@gcc.gnu.org; ktkac...@nvidia.com
> > > Subject: RE: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors
> > >
> > > On Tue, 3 Jun 2025, Tamar Christina wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener <richard.guent...@gmail.com>
> > > > > Sent: Tuesday, June 3, 2025 2:12 PM
> > > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > > Cc: Richard Biener <rguent...@suse.de>; Richard Sandiford <richard.sandif...@arm.com>; Pengfei Li <pengfei....@arm.com>; gcc-patc...@gcc.gnu.org; ktkac...@nvidia.com
> > > > > Subject: Re: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors
> > > > >
> > > > > On Fri, May 9, 2025 at 4:05 PM Tamar Christina <tamar.christ...@arm.com> wrote:
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Richard Biener <rguent...@suse.de>
> > > > > > > Sent: Friday, May 9, 2025 2:44 PM
> > > > > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > > > > Cc: Richard Sandiford <richard.sandif...@arm.com>; Pengfei Li <pengfei....@arm.com>; gcc-patches@gcc.gnu.org; ktkac...@nvidia.com
> > > > > > > Subject: RE: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors
> > > > > > >
> > > > > > > On Fri, 9 May 2025, Tamar Christina wrote:
> > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Richard Biener <rguent...@suse.de>
> > > > > > > > > Sent: Friday, May 9, 2025 11:08 AM
> > > > > > > > > To: Richard Sandiford <richard.sandif...@arm.com>
> > > > > > > > > Cc: Pengfei Li <pengfei....@arm.com>; gcc-patches@gcc.gnu.org; ktkac...@nvidia.com
> > > > > > > > > Subject: Re: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors
> > > > > > > > >
> > > > > > > > > On Fri, 9 May 2025, Richard Sandiford wrote:
> > > > > > > > >
> > > > > > > > > > Richard Biener <rguent...@suse.de> writes:
> > > > > > > > > > > On Thu, 8 May 2025, Pengfei Li wrote:
> > > > > > > > > > >
> > > > > > > > > > >> This patch improves the auto-vectorization for loops with
> > > > > > > > > > >> known small trip counts by enabling the use of subvectors -
> > > > > > > > > > >> bit fields of original wider vectors. A subvector must have
> > > > > > > > > > >> the same vector element type as the original vector and
> > > > > > > > > > >> enough bits for all vector elements to be processed in the
> > > > > > > > > > >> loop. Using subvectors is beneficial because machine
> > > > > > > > > > >> instructions operating on narrower vectors usually show
> > > > > > > > > > >> better performance.
> > > > > > > > > > >>
> > > > > > > > > > >> To enable this optimization, this patch introduces a new
> > > > > > > > > > >> target hook. This hook allows the vectorizer to query the
> > > > > > > > > > >> backend for a suitable subvector type given the original
> > > > > > > > > > >> vector type and the number of elements to be processed in
> > > > > > > > > > >> the small-trip-count loop. The target hook also has a
> > > > > > > > > > >> could_trap parameter to say if the subvector is allowed to
> > > > > > > > > > >> have more bits than needed.
> > > > > > > > > > >>
> > > > > > > > > > >> This optimization is currently enabled for AArch64 only.
> > > > > > > > > > >> Below example shows how it uses AdvSIMD vectors as
> > > > > > > > > > >> subvectors of SVE vectors for higher instruction throughput.
> > > > > > > > > > >>
> > > > > > > > > > >> Consider this loop operating on an array of 16-bit integers:
> > > > > > > > > > >>
> > > > > > > > > > >>      for (int i = 0; i < 5; i++) {
> > > > > > > > > > >>        a[i] = a[i] < 0 ? -a[i] : a[i];
> > > > > > > > > > >>      }
> > > > > > > > > > >>
> > > > > > > > > > >> Before this patch, the generated AArch64 code would be:
> > > > > > > > > > >>
> > > > > > > > > > >>      ptrue   p7.h, vl5
> > > > > > > > > > >>      ptrue   p6.b, all
> > > > > > > > > > >>      ld1h    z31.h, p7/z, [x0]
> > > > > > > > > > >>      abs     z31.h, p6/m, z31.h
> > > > > > > > > > >>      st1h    z31.h, p7, [x0]
> > > > > > > > > > >
> > > > > > > > > > > p6.b has all lanes active - why is the abs then not
> > > > > > > > > > > simply unmasked?
> > > > > > > > > >
> > > > > > > > > > There is no unpredicated abs for SVE.  The predicate has to be
> > > > > > > > > > there, and so expand introduces one even when the gimple stmt
> > > > > > > > > > is unconditional.
> > > > > > > > > >
> > > > > > > > > > >> After this patch, it is optimized to:
> > > > > > > > > > >>
> > > > > > > > > > >>      ptrue   p7.h, vl5
> > > > > > > > > > >>      ld1h    z31.h, p7/z, [x0]
> > > > > > > > > > >>      abs     v31.8h, v31.8h
> > > > > > > > > > >>      st1h    z31.h, p7, [x0]
> > > > > > > > > > >
> > > > > > > > > > > Help me decipher this - I suppose z31 and v31 "overlap" in
> > > > > > > > > > > the register file?  And z31 is a variable-length vector but
> > > > > > > > > > > z31.8h is an 8-element fixed-length vector?  How can we
> > > > > > > > > >
> > > > > > > > > > v31.8h, but otherwise yes.
> > > > > > > > > >
> > > > > > > > > > > end up with just 8 elements here?  From the upper iteration
> > > > > > > > > > > bound?
> > > > > > > > > >
> > > > > > > > > > Yeah.
> > > > > > > > > >
> > > > > > > > > > > I'm not sure why you need any target hook here.  It seems 
> > > > > > > > > > > you
> > > > > > > > > > > do already have suitable vector modes so why not just ask
> > > > > > > > > > > for a suitable vector?  Is it because you need to have
> > > > > > > > > > > that register overlap guarantee (otherwise you'd get
> > > > > > > > > > > a move)?
> > > > > > > > > >
> > > > > > > > > > Yeah, the optimisation only makes sense for overlaid vector
> > > > > > > > > > registers.
> > > > > > > > > >
> > > > > > > > > > > Why do we not simply use fixed-length SVE here in the 
> > > > > > > > > > > first place?
> > > > > > > > > >
> > > > > > > > > > Fixed-length SVE is restricted to cases where the exact
> > > > > > > > > > runtime length is known: the compile-time length is both a
> > > > > > > > > > minimum and a maximum.  In contrast, the code above would work
> > > > > > > > > > even for 256-bit SVE.
> > > > > > > > > >
> > > > > > > > > > > To me doing this in this way in the vectorizer looks
> > > > > > > > > > > somewhat out-of-place.
> > > > > > > > > > >
> > > > > > > > > > > That said, we already have unmasked ABS in the IL:
> > > > > > > > > > >
> > > > > > > > > > >   vect__1.6_15 = .MASK_LOAD (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... }, { 0, ... });
> > > > > > > > > > >   vect__2.7_16 = ABSU_EXPR <vect__1.6_15>;
> > > > > > > > > > >   vect__3.8_17 = VIEW_CONVERT_EXPR<vector([8,8]) short int>(vect__2.7_16);
> > > > > > > > > > >   .MASK_STORE (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... }, vect__3.8_17); [tail call]
> > > > > > > > > > >
> > > > > > > > > > > so what's missing here?  I suppose having a constant masked
> > > > > > > > > > > ABSU here would allow RTL expansion to select a fixed-size
> > > > > > > > > > > mode?
> > > > > > > > > > >
> > > > > > > > > > > And the vectorizer could simply use the existing
> > > > > > > > > > > related_vector_mode hook instead?
> > > > > > > > > >
> > > > > > > > > > I agree it's a bit awkward.  The problem is that we want
> > > > > > > > > > conflicting things.  On the one hand, it would make conceptual
> > > > > > > > > > sense to use SVE instructions to provide conditional optabs
> > > > > > > > > > for Advanced SIMD vector modes.  E.g. SVE's LD1W could act as
> > > > > > > > > > a predicated load for an Advanced SIMD int32x4_t vector.  The
> > > > > > > > > > main problem with that is that Advanced SIMD's native boolean
> > > > > > > > > > vector type is an integer vector of 0s and -1s, rather than an
> > > > > > > > > > SVE predicate.  For some (native Advanced SIMD) operations
> > > > > > > > > > we'd want one type of boolean, for some (SVE emulating
> > > > > > > > > > Advanced SIMD) operations we'd want the other type of boolean.
> > > > > > > > > >
> > > > > > > > > > The patch goes the other way and treats using Advanced SIMD
> > > > > > > > > > as an optimisation for SVE loops.
> > > > > > > > > >
> > > > > > > > > > related_vector_mode suffers from the same problem.  If we ask
> > > > > > > > > > for a vector mode of >=5 halfwords for a load or store, we
> > > > > > > > > > want the SVE mode, since that can be conditional on an SVE
> > > > > > > > > > predicate.  But if we ask for a vector mode of >=5 halfwords
> > > > > > > > > > for an integer absolute operation, we want the Advanced SIMD
> > > > > > > > > > mode.  So I suppose the new hook is effectively providing
> > > > > > > > > > context.  Perhaps we could do that using an extra parameter to
> > > > > > > > > > related_vector_mode, if that seems better.
> > > > > > > > > >
> > > > > > > > > > It's somewhat difficult to recover this information after
> > > > > > > > > > vectorisation, since like you say, the statements are often
> > > > > > > > > > unconditional and operate on all lanes.
> > > > > > > > >
> > > > > > > > > So it seems we want to query if there's a lowpart fixed-size 
> > > > > > > > > vector
> > > > > > > > > mode available for a given other mode.  It seems to me that we
> > > > > > > > > should have a way to query for this already without having a 
> > > > > > > > > new
> > > > > > > > > target hook using general code?
> > > > > > >
> > > > > > > So any answer to this?  You should be able to iterate over all
> > > > > > > vector modes, look for those with fixed size and fitting the
> > > > > > > lane constraint and then asking whether the modes are tieable
> > > > > > > or whatever else is the correct way to verify the constraint?
> > > > > > >
> > > > > > > So sth as simple as
> > > > > > >
> > > > > > >  mode = mode_for_vector (GET_MODE_INNER (vmode),
> > > > > > >                          ceil_pow2 (const-nunits));
> > > > > > >  if (targetm.modes_tieable_p (mode, vmode))
> > > > > > >    return mode;
> > > > > > >
> > > > > > > ?  Why do we need a target hook for this?  What's the "hidden"
> > > > > > > constraint I'm missing?
> > > > > > >
> > > > > >
> > > > > > Richard can correct me if I'm wrong (probably) but the problem with
> > > > > > this is that it won't work with VLS e.g. -msve-vector-bits because
> > > > > > the SVE modes are fixed size then.  Secondly it'll have issues
> > > > > > respecting --param aarch64-autovec-preference= as this is intended
> > > > > > to only affect autovec where mode_for_vector is general.
> > > > > >
> > > > > > The core of this optimization is that it must change to Adv. SIMD
> > > > > > over SVE modes.
> > > > >
> > > > > OK, so SVE VLS -msve-vector-bits=128 modes are indistinguishable
> > > > > from Adv. SIMD modes by the middle-end?
> > > >
> > > > I believe so, the ACLE types have an annotation on them to lift some of 
> > > > the
> > > > restrictions but the modes are the same.
> > > >
> > > > > Is there a way to distinguish them, say, by cost
> > > > > (target_reg_cost?)?  Since any use of a SVE reg will require a 
> > > > > predicate reg?
> > > > >
> > > >
> > > > We do have unpredicated SVE instructions, but yes costing could work.
> > > > Essentially what we're trying to do is find the cheapest mode to perform
> > > > the operation on.
> > > >
> > > > This could work.  But how would we incorporate it into the costing?
> > > > Part of the problem is that to iterate over similar modes with the
> > > > same element size likely requires some target input, no?  Or are you
> > > > saying we should only iterate over fixed size modes?
> > > >
> > > > Regards,
> > > > Tamar
> > > >
> > > > > I think we miss a critical bit of information in the middle-end
> > > > > here and I'm trying to see what piece of information that actually
> > > > > is.  "find_subvector_type" doesn't seem to be it, it's maybe using
> > > > > that hidden bit of information for one specific use-case but it
> > > > > would be far better to have a way for the target to communicate the
> > > > > missing bit of information in a more generic way?  We can then wrap
> > > > > a "find_subvector_type" around that.
> > >
> > > So for this one sth like targetm.mode_requires_predication ()?  But
> > > as Tamar says above this really depends on the operation.  But the
> > > optabs do _not_ expose this requirement (we have non-.COND_ADD for
> > > SVE modes), but you want to take advantage of this difference.
> > > Can we access insn attributes from optab entries?  Could we add
> > > some "standard" attribute noting that an insn requires a predicate?
> > > But of course that likely depends on the alternative?
> > >
> >
> > We'd likely also require the mask that would be used, because I think
> > otherwise targetm.mode_requires_predication would be a bit ambiguous
> > for non-flag-setting instructions or instructions that don't do
> > cross-lane operations.
> >
> > e.g. SVE has both COND_ADD and ADD.  But the key here is that if we
> > know we'll access the bottom 64 or 128 bits we could use an Adv. SIMD
> > ADD.
> 
> But SVE ADD still requires a predicate register (with all lanes enabled),
> no?  That's the whole point of the optimization we're discussing?
> I see the only problem with -msve-vector-bits=N where GET_MODE_SIZE
> is no longer a POLY_INT - otherwise that would be the easy
> way to identify Adv. SIMD vs. SVE and heuristically prefer
> fixed-size modes in the vectorizer when possible (for small known
> niter <= the fixed-size mode number of lanes).  But with
> -msve-vector-bits=128 GET_MODE_SIZE for Adv. SIMD and SVE is equal(?),
> so we need another way to distinguish.  Because even with
> -msve-vector-bits=128 you need the predicate register appropriately
> set up as I understand you are not altering the SVE HW config which
> would be also possible(?), but I'm not sure that would make it
> possible to have a predicate-register-less ADD instruction.
> 
> What SVE register taking machine instructions do not explicitly/implicitly
> use one of the SVE predicate registers?

Many; ADD, for instance, is this one:
https://developer.arm.com/documentation/ddi0602/2025-03/SVE-Instructions/ADD--vectors--unpredicated---Add-vectors--unpredicated--

And SVE2 added many more.  GCC already takes advantage of this and drops
predicates entirely when it can, to avoid the dependency on the predicate
pipe.

These are actually different instructions, not just aliases.

BIC, for instance:
https://developer.arm.com/documentation/ddi0602/2025-03/SVE-Instructions/BIC--vectors--unpredicated---Bitwise-clear-vectors--unpredicated--

Regards,
Tamar

> 
> > Where the operation would be beneficial for longer-VL cores where the
> > Adv. SIMD vector pipes are multiplexed on the SVE ones, such as
> > Neoverse-V1, but not Neoverse-V2.
> >
> > Without the predicate being considered (or niters) SVE would have to
> > return false for the hook.
> >
> > Which is why an attribute may be tricky.
> >
> > Regards,
> > Tamar
> >
> > > Richard.
> > >
> > >
> > > > > Thanks,
> > > > > Richard.
> > > > >
> > > > > > > > > It doesn't really fit related_vector_mode I guess.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yeah, I don't think it would work unless, as Richard mentioned,
> > > > > > > > we have an argument to indicate which SIMD class you want to
> > > > > > > > end up with.
> > > > > > > >
> > > > > > > > > I also wonder if we can take it as a given that SVE and NEON
> > > > > > > > > inter-operate efficiently for all implementations, without
> > > > > > > > > some kind of "rewiring" penalty.  Like in the above example we
> > > > > > > > > set NEON v31 and use it as source for an SVE store via z31.
> > > > > > > > >
> > > > > > > >
> > > > > > > > It's an architectural requirement that the register files
> > > > > > > > overlap.  This is described in section B1.2 (Registers in
> > > > > > > > AArch64 execution state).  Any core that ever implements
> > > > > > > > non-overlapping registers where they would need a rewiring
> > > > > > > > penalty has bigger problems to worry about; as an example, the
> > > > > > > > architecture describes that writing to the lower part of an
> > > > > > > > Adv. SIMD register will clear the top parts up to VL of the SVE
> > > > > > > > "view".
> > > > > > > >
> > > > > > > > Any such rewiring would mean that Adv. SIMD and scalar FPR
> > > > > > > > instructions become useless due to the defined state of the
> > > > > > > > larger views on the register.
> > > > > > >
> > > > > > > I'm aware of the architectural requirement - it's just that I
> > > > > > > could think of the HW re-configuring itself for masked operations
> > > > > > > and thus switching back and forth might incur some penalty.  If
> > > > > > > it is common practice to mix SVE and NEON this way it's of course
> > > > > > > unlikely such a uarch would be successful.  But then, powering
> > > > > > > off mask & high vector part handling logic when facing NEON might
> > > > > > > be a possibility.
> > > > > > >
> > > > > >
> > > > > > Yeah, the reason for this optimization has more to do with how the
> > > > > > vector pipes are split between Adv. SIMD and SVE.  An easy one is,
> > > > > > say, reductions: the bigger the VL, the more expensive in-order
> > > > > > reductions like addv become.  But Adv. SIMD reductions have a
> > > > > > fixed cost, and if we know we only need to reduce the bottom N
> > > > > > bits it'll always beat SVE reductions.
> > > > > >
> > > > > > Others like MUL just have a higher throughput in Adv. SIMD vs SVE
> > > > > > on e.g. VL 256-bit cores.  So it's not just the masking but vector
> > > > > > length in general.
> > > > > >
> > > > > > And the reason we don't pick Adv. SIMD for such loops is that SVE
> > > > > > allows partial masking, so for e.g. MUL it's ok for us to multiply
> > > > > > with an unknown-valued lane since predication makes the usages of
> > > > > > the result safe.
> > > > > >
> > > > > > Thanks,
> > > > > > Tamar
> > > > > >
> > > > > > > Richard.
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Tamar
> > > > > > > >
> > > > > > > > > Richard.
> > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Richard
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Richard Biener <rguent...@suse.de>
> > > > > > > > > SUSE Software Solutions Germany GmbH,
> > > > > > > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > > > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> > > > > > > >
> > > > > > >
> > > >
> > >
> >
> 
