On Fri, May 9, 2025 at 4:05 PM Tamar Christina <tamar.christ...@arm.com> wrote:
>
> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Friday, May 9, 2025 2:44 PM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: Richard Sandiford <richard.sandif...@arm.com>; Pengfei Li
> > <pengfei....@arm.com>; gcc-patches@gcc.gnu.org; ktkac...@nvidia.com
> > Subject: RE: [PATCH] vect: Improve vectorization for small-trip-count
> > loops using subvectors
> >
> > On Fri, 9 May 2025, Tamar Christina wrote:
> >
> > > > -----Original Message-----
> > > > From: Richard Biener <rguent...@suse.de>
> > > > Sent: Friday, May 9, 2025 11:08 AM
> > > > To: Richard Sandiford <richard.sandif...@arm.com>
> > > > Cc: Pengfei Li <pengfei....@arm.com>; gcc-patches@gcc.gnu.org;
> > > > ktkac...@nvidia.com
> > > > Subject: Re: [PATCH] vect: Improve vectorization for small-trip-count
> > > > loops using subvectors
> > > >
> > > > On Fri, 9 May 2025, Richard Sandiford wrote:
> > > >
> > > > > Richard Biener <rguent...@suse.de> writes:
> > > > > > On Thu, 8 May 2025, Pengfei Li wrote:
> > > > > >
> > > > > >> This patch improves the auto-vectorization of loops with known small
> > > > > >> trip counts by enabling the use of subvectors - bit fields of the
> > > > > >> original wider vectors. A subvector must have the same vector element
> > > > > >> type as the original vector and enough bits for all vector elements
> > > > > >> to be processed in the loop. Using subvectors is beneficial because
> > > > > >> machine instructions operating on narrower vectors usually show
> > > > > >> better performance.
> > > > > >>
> > > > > >> To enable this optimization, this patch introduces a new target hook.
> > > > > >> This hook allows the vectorizer to query the backend for a suitable
> > > > > >> subvector type given the original vector type and the number of
> > > > > >> elements to be processed in the small-trip-count loop. The target
> > > > > >> hook also has a could_trap parameter to say whether the subvector is
> > > > > >> allowed to have more bits than needed.
> > > > > >>
> > > > > >> This optimization is currently enabled for AArch64 only. The example
> > > > > >> below shows how it uses AdvSIMD vectors as subvectors of SVE vectors
> > > > > >> for higher instruction throughput.
> > > > > >>
> > > > > >> Consider this loop operating on an array of 16-bit integers:
> > > > > >>
> > > > > >>   for (int i = 0; i < 5; i++) {
> > > > > >>     a[i] = a[i] < 0 ? -a[i] : a[i];
> > > > > >>   }
> > > > > >>
> > > > > >> Before this patch, the generated AArch64 code would be:
> > > > > >>
> > > > > >>   ptrue   p7.h, vl5
> > > > > >>   ptrue   p6.b, all
> > > > > >>   ld1h    z31.h, p7/z, [x0]
> > > > > >>   abs     z31.h, p6/m, z31.h
> > > > > >>   st1h    z31.h, p7, [x0]
> > > > > >
> > > > > > p6.b has all lanes active - why is the abs then not
> > > > > > simply unmasked?
> > > > >
> > > > > There is no unpredicated abs for SVE. The predicate has to be there,
> > > > > and so expand introduces one even when the gimple stmt is
> > > > > unconditional.
> > > > >
> > > > > >> After this patch, it is optimized to:
> > > > > >>
> > > > > >>   ptrue   p7.h, vl5
> > > > > >>   ld1h    z31.h, p7/z, [x0]
> > > > > >>   abs     v31.8h, v31.8h
> > > > > >>   st1h    z31.h, p7, [x0]
> > > > > >
> > > > > > Help me decipher this - I suppose z31 and v31 "overlap" in the
> > > > > > register file?
> > > > > > And z31 is a variable-length vector but z31.8h is an 8-element
> > > > > > fixed-length vector? How can we
> > > > >
> > > > > v31.8h, but otherwise yes.
> > > > >
> > > > > > end up with just 8 elements here? From the upper iteration
> > > > > > bound?
> > > > >
> > > > > Yeah.
> > > > >
> > > > > > I'm not sure why you need any target hook here. It seems you
> > > > > > do already have suitable vector modes so why not just ask
> > > > > > for a suitable vector? Is it because you need to have
> > > > > > that register overlap guarantee (otherwise you'd get
> > > > > > a move)?
> > > > >
> > > > > Yeah, the optimisation only makes sense for overlaid vector registers.
> > > > >
> > > > > > Why do we not simply use fixed-length SVE here in the first place?
> > > > >
> > > > > Fixed-length SVE is restricted to cases where the exact runtime length
> > > > > is known: the compile-time length is both a minimum and a maximum.
> > > > > In contrast, the code above would work even for 256-bit SVE.
> > > > >
> > > > > > To me doing this in this way in the vectorizer looks
> > > > > > somewhat out-of-place.
> > > > > >
> > > > > > That said, we already have unmasked ABS in the IL:
> > > > > >
> > > > > >   vect__1.6_15 = .MASK_LOAD (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0,
> > > > > >     0, 0, 0, 0, 0, 0, 0, 0, ... }, { 0, ... });
> > > > > >   vect__2.7_16 = ABSU_EXPR <vect__1.6_15>;
> > > > > >   vect__3.8_17 = VIEW_CONVERT_EXPR<vector([8,8]) short int>(vect__2.7_16);
> > > > > >   .MASK_STORE (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0,
> > > > > >     0, 0, 0, ... }, vect__3.8_17); [tail call]
> > > > > >
> > > > > > so what's missing here? I suppose having a constant masked ABSU here
> > > > > > would allow RTL expansion to select a fixed-size mode?
> > > > > >
> > > > > > And the vectorizer could simply use the existing
> > > > > > related_vector_mode hook instead?
> > > > >
> > > > > I agree it's a bit awkward. The problem is that we want conflicting
> > > > > things. On the one hand, it would make conceptual sense to use SVE
> > > > > instructions to provide conditional optabs for Advanced SIMD vector
> > > > > modes. E.g. SVE's LD1W could act as a predicated load for an Advanced
> > > > > SIMD int32x4_t vector. The main problem with that is that Advanced
> > > > > SIMD's native boolean vector type is an integer vector of 0s and -1s,
> > > > > rather than an SVE predicate. For some (native Advanced SIMD)
> > > > > operations we'd want one type of boolean, for some (SVE emulating
> > > > > Advanced SIMD) operations we'd want the other type of boolean.
> > > > >
> > > > > The patch goes the other way and treats using Advanced SIMD as an
> > > > > optimisation for SVE loops.
> > > > >
> > > > > related_vector_mode suffers from the same problem. If we ask for a
> > > > > vector mode of >=5 halfwords for a load or store, we want the SVE
> > > > > mode, since that can be conditional on an SVE predicate. But if we
> > > > > ask for a vector mode of >=5 halfwords for an integer absolute
> > > > > operation, we want the Advanced SIMD mode. So I suppose the new hook
> > > > > is effectively providing context. Perhaps we could do that using an
> > > > > extra parameter to related_vector_mode, if that seems better.
> > > > > It's somewhat difficult to recover this information after
> > > > > vectorisation, since like you say, the statements are often
> > > > > unconditional and operate on all lanes.
> > > >
> > > > So it seems we want to query if there's a lowpart fixed-size vector
> > > > mode available for a given other mode. It seems to me that we
> > > > should have a way to query for this already, using general code,
> > > > without having a new target hook?
> >
> > So any answer to this? You should be able to iterate over all
> > vector modes, look for those with fixed size and fitting the
> > lane constraint, and then ask whether the modes are tieable
> > or whatever else is the correct way to verify the constraint?
> >
> > So sth as simple as
> >
> >   mode = mode_for_vector (GET_MODE_INNER (vmode), ceil_pow2 (const_nunits));
> >   if (targetm.modes_tieable_p (mode, vmode))
> >     return mode;
> >
> > ? Why do we need a target hook for this? What's the "hidden"
> > constraint I'm missing?
>
> Richard can correct me if I'm wrong (probably) but the problem with this
> is that it won't work with VLS, e.g. -msve-vector-bits, because the SVE
> modes are fixed size then. Secondly, it'll have issues respecting
> --param aarch64-autovec-preference=, since that is intended to affect
> only autovec, whereas mode_for_vector is general.
>
> The core of this optimization is that it must change to Adv. SIMD over
> SVE modes.
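Spelled out with existing middle-end interfaces, the generic query Richard sketches above might look roughly like the following. This is only a sketch: find_fixed_lowpart_mode is a hypothetical helper, not part of the posted patch, and per Tamar's objection it would still happily return VLS SVE modes once -msve-vector-bits makes them fixed-size:

  /* Hypothetical helper, for illustration only: look for a fixed-size
     vector mode with the same element mode as VMODE, at least NUNITS
     lanes, and registers tieable with VMODE's (so no move is needed).  */
  static opt_machine_mode
  find_fixed_lowpart_mode (machine_mode vmode, unsigned HOST_WIDE_INT nunits)
  {
    machine_mode mode;
    FOR_EACH_MODE_IN_CLASS (mode, GET_MODE_CLASS (vmode))
      if (GET_MODE_INNER (mode) == GET_MODE_INNER (vmode)
          && GET_MODE_SIZE (mode).is_constant ()
          && known_ge (GET_MODE_NUNITS (mode), nunits)
          && targetm.modes_tieable_p (mode, vmode))
        return mode;
    return opt_machine_mode ();
  }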
OK, so SVE VLS -msve-vector-bits=128 modes are indistinguishable from
Adv. SIMD modes by the middle-end? Is there a way to distinguish them,
say, by cost (target_reg_cost?)? Since any use of an SVE reg will
require a predicate reg?

I think we miss a critical bit of information in the middle-end here and
I'm trying to see what piece of information that actually is.
"find_subvector_type" doesn't seem to be it; it's maybe using that hidden
bit of information for one specific use-case, but it would be far better
to have a way for the target to communicate the missing bit of
information in a more generic way. We can then wrap a
"find_subvector_type" around that.

Thanks,
Richard.

> > > > It doesn't really fit related_vector_mode I guess.
> > > >
> > >
> > > Yeah, I don't think that would work unless, as Richard mentioned, we
> > > have an argument to indicate which SIMD class you want to end up with.
> > >
> > > > I also wonder if we can take it as a given that SVE and neon
> > > > inter-operate efficiently for all implementations, without some
> > > > kind of "rewriting" penalty. Like in the above example we
> > > > set NEON v31 and use it as source for a SVE store via z31.
> > > >
> > >
> > > It's an architectural requirement that the register files overlap.
> > > This is described in section B1.2 (Registers in AArch64 execution
> > > state). Any core that ever implements non-overlapping registers,
> > > where it would need a rewiring penalty, has bigger problems to worry
> > > about; as an example, the architecture describes that writing to the
> > > lower part of an Adv. SIMD register will clear the top parts up to VL
> > > of the SVE "view".
> > >
> > > Any such rewiring would mean that Adv. SIMD and scalar FPR
> > > instructions become useless due to the defined state of the larger
> > > views on the register.
> >
> > I'm aware of the architectural requirement - it's just that I could
> > think of the HW re-configuring itself for masked operations and thus
> > switching back and forth might incur some penalty. If it is common
> > practice to mix SVE and NEON this way, it's of course unlikely such a
> > uarch would be successful. But then, powering off mask & high vector
> > part handling logic when facing NEON might be a possibility.
> >
>
> Yeah, the reason for this optimization has more to do with how the
> vector pipes are split between Adv. SIMD and SVE. An easy one is, say,
> reductions: the bigger the VL, the more expensive in-order reductions
> like addv become. But Adv. SIMD reductions have a fixed cost, and if we
> know we only need to reduce the bottom N bits, it'll always beat SVE
> reductions.
>
> Others like MUL just have a higher throughput in Adv. SIMD vs SVE on
> e.g. 256-bit VL cores. So it's not just the masking but vector length
> in general.
>
> And the reason we don't pick Adv. SIMD for such loops is that SVE allows
> partial masking, so for e.g. MUL it's ok for us to multiply with an
> unknown-valued lane since predication makes the usages of the result
> safe.
>
> Thanks,
> Tamar
>
> > Richard.
> >
> > > Thanks,
> > > Tamar
> > >
> > > > Richard.
> > > >
> > > > > Thanks,
> > > > > Richard
> > > >
> > > > --
> > > > Richard Biener <rguent...@suse.de>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> >
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
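For reference, the before/after sequence at the top of the thread corresponds closely to what one can write by hand with the ACLE NEON-SVE bridge. A minimal sketch, assuming a compiler that provides <arm_neon_sve_bridge.h> (the function name abs5 is made up for illustration):

  #include <arm_sve.h>
  #include <arm_neon.h>
  #include <arm_neon_sve_bridge.h>

  void
  abs5 (int16_t *a)
  {
    svbool_t pg = svptrue_pat_b16 (SV_VL5);   /* ptrue p7.h, vl5 */
    svint16_t z = svld1_s16 (pg, a);          /* predicated SVE load */
    int16x8_t v = svget_neonq_s16 (z);        /* low 128 bits, no move */
    v = vabsq_s16 (v);                        /* unpredicated Adv. SIMD abs */
    z = svset_neonq_s16 (z, v);               /* back into the SVE view */
    svst1_s16 (pg, a, z);                     /* predicated SVE store */
  }

svget_neonq/svset_neonq reinterpret the low 128 bits of an SVE register in place, which is exactly the register-file overlap guarantee the optimisation relies on.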