RE: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors

Richard Biener Fri, 09 May 2025 10:32:37 -0700

On Fri, 9 May 2025, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: Friday, May 9, 2025 11:08 AM
> > To: Richard Sandiford <[email protected]>
> > Cc: Pengfei Li <[email protected]>; [email protected];
> > [email protected]
> > Subject: Re: [PATCH] vect: Improve vectorization for small-trip-count loops 
> > using
> > subvectors
> > 
> > On Fri, 9 May 2025, Richard Sandiford wrote:
> > 
> > > Richard Biener <[email protected]> writes:
> > > > On Thu, 8 May 2025, Pengfei Li wrote:
> > > >
> > > >> This patch improves the auto-vectorization for loops with known small
> > > >> trip counts by enabling the use of subvectors - bit fields of original
> > > >> wider vectors. A subvector must have the same vector element type as 
> > > >> the
> > > >> original vector and enough bits for all vector elements to be processed
> > > >> in the loop. Using subvectors is beneficial because machine 
> > > >> instructions
> > > >> operating on narrower vectors usually show better performance.
> > > >>
> > > >> To enable this optimization, this patch introduces a new target hook.
> > > >> This hook allows the vectorizer to query the backend for a suitable
> > > >> subvector type given the original vector type and the number of 
> > > >> elements
> > > >> to be processed in the small-trip-count loop. The target hook also has 
> > > >> a
> > > >> could_trap parameter to say if the subvector is allowed to have more
> > > >> bits than needed.
> > > >>
> > > >> This optimization is currently enabled for AArch64 only. Below example
> > > >> shows how it uses AdvSIMD vectors as subvectors of SVE vectors for
> > > >> higher instruction throughput.
> > > >>
> > > >> Consider this loop operating on an array of 16-bit integers:
> > > >>
> > > >>        for (int i = 0; i < 5; i++) {
> > > >>          a[i] = a[i] < 0 ? -a[i] : a[i];
> > > >>        }
> > > >>
> > > >> Before this patch, the generated AArch64 code would be:
> > > >>
> > > >>        ptrue   p7.h, vl5
> > > >>        ptrue   p6.b, all
> > > >>        ld1h    z31.h, p7/z, [x0]
> > > >>        abs     z31.h, p6/m, z31.h
> > > >>        st1h    z31.h, p7, [x0]
> > > >
> > > > p6.b has all lanes active - why is the abs then not
> > > > simply unmasked?
> > >
> > > There is no unpredicated abs for SVE.  The predicate has to be there,
> > > and so expand introduces one even when the gimple stmt is unconditional.
> > >
> > > >> After this patch, it is optimized to:
> > > >>
> > > >>        ptrue   p7.h, vl5
> > > >>        ld1h    z31.h, p7/z, [x0]
> > > >>        abs     v31.8h, v31.8h
> > > >>        st1h    z31.h, p7, [x0]
> > > >
> > > > Help me decipher this - I suppose z31 and v31 "overlap" in the
> > > > register file?  And z31 is a variable-length vector but
> > > > z31.8h is a 8 element fixed length vector?  How can we
> > >
> > > v31.8h, but otherwise yes.
> > >
> > > > end up with just 8 elements here?  From the upper interation
> > > > bound?
> > >
> > > Yeah.
> > >
> > > > I'm not sure why you need any target hook here.  It seems you
> > > > do already have suitable vector modes so why not just ask
> > > > for a suitable vector?  Is it because you need to have
> > > > that register overlap guarantee (otherwise you'd get
> > > > a move)?
> > >
> > > Yeah, the optimisation only makes sense for overlaid vector registers.
> > >
> > > > Why do we not simply use fixed-length SVE here in the first place?
> > >
> > > Fixed-length SVE is restricted to cases where the exact runtime length
> > > is known: the compile-time length is both a minimum and a maximum.
> > > In contrast, the code above would work even for 256-bit SVE.
> > >
> > > > To me doing this in this way in the vectorizer looks
> > > > somewhat out-of-place.
> > > >
> > > > That said, we already have unmasked ABS in the IL:
> > > >
> > > >   vect__1.6_15 = .MASK_LOAD (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 
> > > > 0,
> > > > 0, 0, 0, 0, 0, 0, ... }, { 0, ... });
> > > >   vect__2.7_16 = ABSU_EXPR <vect__1.6_15>;
> > > >   vect__3.8_17 = VIEW_CONVERT_EXPR<vector([8,8]) short
> > int>(vect__2.7_16);
> > > >   .MASK_STORE (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > > > 0, 0, ... }, vect__3.8_17); [tail call]
> > > >
> > > > so what's missing here?  I suppose having a constant masked ABSU here
> > > > would allow RTL expansion to select a fixed-size mode?
> > > >
> > > > And the vectorizer could simply use the existing
> > > > related_vector_mode hook instead?
> > >
> > > I agree it's a bit awkward.  The problem is that we want conflicting
> > > things.  On the one hand, it would make conceptual sense to use SVE
> > > instructions to provide conditional optabs for Advanced SIMD vector modes.
> > > E.g. SVE's LD1W could act as a predicated load for an Advanced SIMD
> > > int32x4_t vector.  The main problem with that is that Advanced SIMD's
> > > native boolean vector type is an integer vector of 0s and -1s, rather
> > > than an SVE predicate.  For some (native Advanced SIMD) operations we'd
> > > want one type of boolean, for some (SVE emulating Advanced SIMD)
> > > operations we'd want the other type of boolean.
> > >
> > > The patch goes the other way and treats using Advanced SIMD as an
> > > optimisation for SVE loops.
> > >
> > > related_vector_mode suffers from the same problem.  If we ask for a
> > > vector mode of >=5 halfwords for a load or store, we want the SVE mode,
> > > since that can be conditional on an SVE predicate.  But if we ask for
> > > a vector mode of >=5 halfwords for an integer absolute operation,
> > > we want the Advanced SIMD mode.  So I suppose the new hook is effectively
> > > providing context.  Perhaps we could do that using an extra parameter to
> > > related_vector_mode, if that seems better.
> > >
> > > It's somewhat difficult to recover this information after vectorisation,
> > > since like you say, the statements are often unconditional and operate
> > > on all lanes.
> > 
> > So it seems we want to query if there's a lowpart fixed-size vector
> > mode available for a given other mode.  It seems to me that we
> > should have a way to query for this already without having a new
> > target hook using general code?


So any answer to this?  You should be able to iterate over all
vector modes, look for those with fixed size and fitting the
lane constraint and then asking whether the modes are tieable
or whatever else is the correct way to verify the constraint?

So sth as simple as

 mode = mode_for_vector (GET_MODE_INNER (vmode), ceil_pow2 (const-nunits));
 if (targetm.modes_tieable_p (mode, vmode))
   return mode;

?  Why do we need a target hook for this?  What's the "hidden"
constraint I'm missing?

> > It doesn't really fit related_vector_mode I guess.
> > 
> 
> Yeah I don't think would work unless as Richard mentioned we have
> an argument to indicate which SIMD class you want to end up with.
> 
> > I also wonder if we can take it as a given that SVE and neon
> > inter-operate efficiently for all implementations, without some
> > kind of "rewriting" penalty.  Like in the above example we
> > set NEON v31 and use it as source for a SVE store via z31.
> > 
> 
> It's an architectural requirement that the register files overlap.
> This is described section B1.2 (Registers in AArch64 execution state).
> Any core that ever implements non-overlapping register where they
> would need a rewiring penalty has bigger problems to worry about,
> as an example the architecture describes that writing to the lower part of
> a Adv. SIMD register will clear the top parts up to VL of the SVE "view".
> 
> Any such rewiring would mean that Adv. SIMD and Scalar FPR instructions
> become useless due to the defined state of the larger views on the register.

I'm aware of the architectural requirement - it's just that I could
think of the HW re-configuring itself for masked operations and thus
switching back and forth might incur some penalty.  If it is common
practice to mix SVE and NEON this way it's of course unlikely such
uarch would be sucessful.  But then, powering off mask & high vector
part handling logic when facing NEON might be a possibility.

Richard.

> Thanks,
> Tamar
> 
> > Richard.
> > 
> > > Thanks,
> > > Richard
> > >
> > 
> > --
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

RE: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors

Reply via email to