> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Wednesday, June 4, 2025 8:04 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: Richard Biener <richard.guent...@gmail.com>; Richard Sandiford <richard.sandif...@arm.com>; Pengfei Li <pengfei....@arm.com>; gcc-patc...@gcc.gnu.org; ktkac...@nvidia.com
> Subject: RE: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors
> 
> On Tue, 3 Jun 2025, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <richard.guent...@gmail.com>
> > > Sent: Tuesday, June 3, 2025 2:12 PM
> > > To: Tamar Christina <tamar.christ...@arm.com>
> > > Cc: Richard Biener <rguent...@suse.de>; Richard Sandiford <richard.sandif...@arm.com>; Pengfei Li <pengfei....@arm.com>; gcc-patc...@gcc.gnu.org; ktkac...@nvidia.com
> > > Subject: Re: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors
> > >
> > > On Fri, May 9, 2025 at 4:05 PM Tamar Christina <tamar.christ...@arm.com>
> > > wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener <rguent...@suse.de>
> > > > > Sent: Friday, May 9, 2025 2:44 PM
> > > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > > Cc: Richard Sandiford <richard.sandif...@arm.com>; Pengfei Li <pengfei....@arm.com>; gcc-patches@gcc.gnu.org; ktkac...@nvidia.com
> > > > > Subject: RE: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors
> > > > >
> > > > > On Fri, 9 May 2025, Tamar Christina wrote:
> > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Richard Biener <rguent...@suse.de>
> > > > > > > Sent: Friday, May 9, 2025 11:08 AM
> > > > > > > To: Richard Sandiford <richard.sandif...@arm.com>
> > > > > > > Cc: Pengfei Li <pengfei....@arm.com>; gcc-patches@gcc.gnu.org; ktkac...@nvidia.com
> > > > > > > Subject: Re: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors
> > > > > > >
> > > > > > > On Fri, 9 May 2025, Richard Sandiford wrote:
> > > > > > >
> > > > > > > > Richard Biener <rguent...@suse.de> writes:
> > > > > > > > > On Thu, 8 May 2025, Pengfei Li wrote:
> > > > > > > > >
> > > > > > > > >> This patch improves the auto-vectorization for loops with known small
> > > > > > > > >> trip counts by enabling the use of subvectors - bit fields of original
> > > > > > > > >> wider vectors. A subvector must have the same vector element type as
> > > > > > > > >> the original vector and enough bits for all vector elements to be
> > > > > > > > >> processed in the loop. Using subvectors is beneficial because machine
> > > > > > > > >> instructions operating on narrower vectors usually show better
> > > > > > > > >> performance.
> > > > > > > > >>
> > > > > > > > >> To enable this optimization, this patch introduces a new target hook.
> > > > > > > > >> This hook allows the vectorizer to query the backend for a suitable
> > > > > > > > >> subvector type given the original vector type and the number of
> > > > > > > > >> elements to be processed in the small-trip-count loop. The target hook
> > > > > > > > >> also has a could_trap parameter to say if the subvector is allowed to
> > > > > > > > >> have more bits than needed.
> > > > > > > > >>
> > > > > > > > >> This optimization is currently enabled for AArch64 only. Below example
> > > > > > > > >> shows how it uses AdvSIMD vectors as subvectors of SVE vectors for
> > > > > > > > >> higher instruction throughput.
> > > > > > > > >>
> > > > > > > > >> Consider this loop operating on an array of 16-bit integers:
> > > > > > > > >>
> > > > > > > > >>      for (int i = 0; i < 5; i++) {
> > > > > > > > >>        a[i] = a[i] < 0 ? -a[i] : a[i];
> > > > > > > > >>      }
> > > > > > > > >>
> > > > > > > > >> Before this patch, the generated AArch64 code would be:
> > > > > > > > >>
> > > > > > > > >>      ptrue   p7.h, vl5
> > > > > > > > >>      ptrue   p6.b, all
> > > > > > > > >>      ld1h    z31.h, p7/z, [x0]
> > > > > > > > >>      abs     z31.h, p6/m, z31.h
> > > > > > > > >>      st1h    z31.h, p7, [x0]
> > > > > > > > >
> > > > > > > > > p6.b has all lanes active - why is the abs then not
> > > > > > > > > simply unmasked?
> > > > > > > >
> > > > > > > > > > There is no unpredicated abs for SVE.  The predicate has to be there,
> > > > > > > > > > and so expand introduces one even when the gimple stmt is unconditional.
> > > > > > > >
> > > > > > > > >> After this patch, it is optimized to:
> > > > > > > > >>
> > > > > > > > >>      ptrue   p7.h, vl5
> > > > > > > > >>      ld1h    z31.h, p7/z, [x0]
> > > > > > > > >>      abs     v31.8h, v31.8h
> > > > > > > > >>      st1h    z31.h, p7, [x0]
> > > > > > > > >
> > > > > > > > > Help me decipher this - I suppose z31 and v31 "overlap" in the
> > > > > > > > > register file?  And z31 is a variable-length vector but
> > > > > > > > > z31.8h is a 8 element fixed length vector?  How can we
> > > > > > > >
> > > > > > > > v31.8h, but otherwise yes.
> > > > > > > >
> > > > > > > > > end up with just 8 elements here?  From the upper iteration
> > > > > > > > > bound?
> > > > > > > >
> > > > > > > > Yeah.
> > > > > > > >
> > > > > > > > > I'm not sure why you need any target hook here.  It seems you
> > > > > > > > > do already have suitable vector modes so why not just ask
> > > > > > > > > for a suitable vector?  Is it because you need to have
> > > > > > > > > that register overlap guarantee (otherwise you'd get
> > > > > > > > > a move)?
> > > > > > > >
> > > > > > > > Yeah, the optimisation only makes sense for overlaid vector 
> > > > > > > > registers.
> > > > > > > >
> > > > > > > > > Why do we not simply use fixed-length SVE here in the first 
> > > > > > > > > place?
> > > > > > > >
> > > > > > > > Fixed-length SVE is restricted to cases where the exact runtime 
> > > > > > > > length
> > > > > > > > is known: the compile-time length is both a minimum and a 
> > > > > > > > maximum.
> > > > > > > > In contrast, the code above would work even for 256-bit SVE.
> > > > > > > >
> > > > > > > > > To me doing this in this way in the vectorizer looks
> > > > > > > > > somewhat out-of-place.
> > > > > > > > >
> > > > > > > > > That said, we already have unmasked ABS in the IL:
> > > > > > > > >
> > > > > > > > >   vect__1.6_15 = .MASK_LOAD (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... }, { 0, ... });
> > > > > > > > >   vect__2.7_16 = ABSU_EXPR <vect__1.6_15>;
> > > > > > > > >   vect__3.8_17 = VIEW_CONVERT_EXPR<vector([8,8]) short int>(vect__2.7_16);
> > > > > > > > >   .MASK_STORE (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... }, vect__3.8_17); [tail call]
> > > > > > > > >
> > > > > > > > > so what's missing here?  I suppose having a constant masked ABSU here
> > > > > > > > > would allow RTL expansion to select a fixed-size mode?
> > > > > > > > >
> > > > > > > > > And the vectorizer could simply use the existing
> > > > > > > > > related_vector_mode hook instead?
> > > > > > > >
> > > > > > > > > > I agree it's a bit awkward.  The problem is that we want conflicting
> > > > > > > > > > things.  On the one hand, it would make conceptual sense to use SVE
> > > > > > > > > > instructions to provide conditional optabs for Advanced SIMD vector modes.
> > > > > > > > > > E.g. SVE's LD1W could act as a predicated load for an Advanced SIMD
> > > > > > > > > > int32x4_t vector.  The main problem with that is that Advanced SIMD's
> > > > > > > > > > native boolean vector type is an integer vector of 0s and -1s, rather
> > > > > > > > > > than an SVE predicate.  For some (native Advanced SIMD) operations we'd
> > > > > > > > > > want one type of boolean, for some (SVE emulating Advanced SIMD)
> > > > > > > > > > operations we'd want the other type of boolean.
> > > > > > > >
> > > > > > > > > > The patch goes the other way and treats using Advanced SIMD as an
> > > > > > > > > > optimisation for SVE loops.
> > > > > > > >
> > > > > > > > > > related_vector_mode suffers from the same problem.  If we ask for a
> > > > > > > > > > vector mode of >=5 halfwords for a load or store, we want the SVE mode,
> > > > > > > > > > since that can be conditional on an SVE predicate.  But if we ask for
> > > > > > > > > > a vector mode of >=5 halfwords for an integer absolute operation,
> > > > > > > > > > we want the Advanced SIMD mode.  So I suppose the new hook is
> > > > > > > > > > effectively providing context.  Perhaps we could do that using an
> > > > > > > > > > extra parameter to related_vector_mode, if that seems better.
> > > > > > > >
> > > > > > > > > > It's somewhat difficult to recover this information after vectorisation,
> > > > > > > > > > since like you say, the statements are often unconditional and operate
> > > > > > > > > > on all lanes.
> > > > > > >
> > > > > > > So it seems we want to query if there's a lowpart fixed-size vector
> > > > > > > mode available for a given other mode.  It seems to me that we
> > > > > > > should have a way to query for this already without having a new
> > > > > > > target hook using general code?
> > > > >
> > > > > So any answer to this?  You should be able to iterate over all
> > > > > vector modes, look for those with fixed size and fitting the
> > > > > lane constraint and then asking whether the modes are tieable
> > > > > or whatever else is the correct way to verify the constraint?
> > > > >
> > > > > So sth as simple as
> > > > >
> > > > >  mode = mode_for_vector (GET_MODE_INNER (vmode), ceil_pow2 (const-nunits));
> > > > >  if (targetm.modes_tieable_p (mode, vmode))
> > > > >    return mode;
> > > > >
> > > > > ?  Why do we need a target hook for this?  What's the "hidden"
> > > > > constraint I'm missing?
> > > > >
> > > >
> > > > Richard can correct me if I'm wrong (probably) but the problem with this
> > > > is that it won't work with VLS e.g. -msve-vector-bits because the SVE modes
> > > > are fixed size then.  Secondly it'll have issues respecting --param
> > > > aarch64-autovec-preference= as this is intended to only affect autovec
> > > > where mode_for_vector is general.
> > > >
> > > > The core of this optimization is that it must change to Adv. SIMD over SVE modes.
> > >
> > > OK, so SVE VLS -msve-vector-bits=128 modes are indistinguishable from
> > > Adv. SIMD modes by the middle-end?
> >
> > I believe so, the ACLE types have an annotation on them to lift some of the
> > restrictions but the modes are the same.
> >
> > > Is there a way to distinguish them, say, by cost
> > > (target_reg_cost?)?  Since any use of a SVE reg will require a predicate 
> > > reg?
> > >
> >
> > We do have unpredicated SVE instructions, but yes costing could work.
> > Essentially what we're trying to do is find the cheapest mode to perform
> > the operation on.
> >
> > This could work.. But how would we incorporate it into the costing? Part of
> > the problem is that to iterate over similar modes with the same element size
> > likely requires some target input no?  Or are you saying we should only
> > iterate over fixed size modes?
> >
> > Regards,
> > Tamar
> >
> > > I think we miss a critical bit of information in the middle-end here and I'm
> > > trying to see what piece of information that actually is.  "find_subvector_type"
> > > doesn't seem to be it, it's maybe using that hidden bit of information for
> > > one specific use-case but it would be far better to have a way for the target
> > > to communicate the missing bit of information in a more generic way?
> > > We can then wrap a "find_subvector_type" around that.
> 
> So for this one sth like targetm.mode_requires_predication ()?  But
> as Tamar says above this really depends on the operation.  But the
> optabs do _not_ expose this requirement (we have non-.COND_ADD for
> SVE modes), but you want to take advantage of this difference.
> Can we access insn attributes from optab entries?  Could we add
> some "standard" attribute noting that an insn requires a predicate?
> But of course that likely depends on the alternative?
> 

We'd likely also require the mask that would be used, because I think otherwise
targetm.mode_requires_predication would be a bit ambiguous for non-flag-setting
instructions or instructions that don't do cross-lane operations.

e.g. SVE has both COND_ADD and ADD.  But the key here is that if we know we'll
only access the bottom 64 or 128 bits we could use an Adv. SIMD ADD.  That would
be beneficial on longer-VL cores where the Adv. SIMD vector pipes are multiplexed
onto the SVE ones, such as Neoverse-V1, but not Neoverse-V2.
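
To make that concrete, here's a minimal sketch (my own example, analogous to the
abs case from the patch; the function name and trip count are made up):

     /* Only a[0..3] and b[0..3] are live, i.e. the low 64 bits of each
        vector, so the unconditional add in the loop body could be done as
        an Adv. SIMD "add v.4h" on the subvector instead of an SVE "add z.h"
        on the full vector length.  */
     void
     add_small (short *a, short *b)
     {
       for (int i = 0; i < 4; i++)
         a[i] += b[i];
     }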

Without the predicate (or niters) being considered, SVE would have to return false
for the hook.

Which is why an attribute may be tricky.

Regards,
Tamar

> Richard.
> 
> 
> > > Thanks,
> > > Richard.
> > >
> > > > > > > It doesn't really fit related_vector_mode I guess.
> > > > > > >
> > > > > >
> > > > > > Yeah I don't think it would work unless as Richard mentioned we have
> > > > > > an argument to indicate which SIMD class you want to end up with.
> > > > > >
> > > > > > > I also wonder if we can take it as a given that SVE and neon
> > > > > > > inter-operate efficiently for all implementations, without some
> > > > > > > kind of "rewriting" penalty.  Like in the above example we
> > > > > > > set NEON v31 and use it as source for a SVE store via z31.
> > > > > > >
> > > > > >
> > > > > > It's an architectural requirement that the register files overlap.
> > > > > > This is described in section B1.2 (Registers in AArch64 execution state).
> > > > > > Any core that ever implements non-overlapping register where they
> > > > > > would need a rewiring penalty has bigger problems to worry about,
> > > > > > as an example the architecture describes that writing to the lower part of
> > > > > > an Adv. SIMD register will clear the top parts up to VL of the SVE "view".
> > > > > >
> > > > > > Any such rewiring would mean that Adv. SIMD and Scalar FPR instructions
> > > > > > become useless due to the defined state of the larger views on the register.
> > > > >
> > > > > I'm aware of the architectural requirement - it's just that I could
> > > > > think of the HW re-configuring itself for masked operations and thus
> > > > > switching back and forth might incur some penalty.  If it is common
> > > > > practice to mix SVE and NEON this way it's of course unlikely such
> > > > > uarch would be successful.  But then, powering off mask & high vector
> > > > > part handling logic when facing NEON might be a possibility.
> > > > >
> > > >
> > > > Yeah the reason for this optimization has more to do with how the vector
> > > > pipes are split between Adv. SIMD and SVE. An easy one is say 
> > > > reductions,
> > > > the bigger VL the more expensive in-order reductions like addv become.
> > > > But Adv. SIMD reductions have a fixed cost, and if we know we only need 
> > > > to
> > > > reduce the bottom N-bits it'll always beat SVE reductions.
> > > >
> > > > Others like MUL just have a higher throughput in Adv. SIMD vs SVE on
> > > > e.g. VL 256 bit cores.  So it's not just the masking but vector length in general.
> > > >
> > > > And the reason we don't pick Adv. SIMD for such loops is that SVE allows
> > > > partial masking, so for e.g. MUL it's ok for us to multiply with an unknown
> > > > valued lane since predication makes the usages of the result safe.
> > > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > > Richard.
> > > > >
> > > > > > Thanks,
> > > > > > Tamar
> > > > > >
> > > > > > > Richard.
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Richard
> > > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Richard Biener <rguent...@suse.de>
> > > > > > > SUSE Software Solutions Germany GmbH,
> > > > > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> > > > > >
> > > > >
> > > > > --
> > > > > Richard Biener <rguent...@suse.de>
> > > > > SUSE Software Solutions Germany GmbH,
> > > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> >
> 
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
