Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Biener <rguent...@suse.de>
>> Sent: Friday, May 9, 2025 11:08 AM
>> To: Richard Sandiford <richard.sandif...@arm.com>
>> Cc: Pengfei Li <pengfei....@arm.com>; gcc-patches@gcc.gnu.org;
>> ktkac...@nvidia.com
>> Subject: Re: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors
>> 
>> On Fri, 9 May 2025, Richard Sandiford wrote:
>> 
>> > Richard Biener <rguent...@suse.de> writes:
>> > > On Thu, 8 May 2025, Pengfei Li wrote:
>> > >
>> > >> This patch improves the auto-vectorization for loops with known small
>> > >> trip counts by enabling the use of subvectors - bit fields of original
>> > >> wider vectors. A subvector must have the same vector element type as the
>> > >> original vector and enough bits for all vector elements to be processed
>> > >> in the loop. Using subvectors is beneficial because machine instructions
>> > >> operating on narrower vectors usually show better performance.
>> > >>
>> > >> To enable this optimization, this patch introduces a new target hook.
>> > >> This hook allows the vectorizer to query the backend for a suitable
>> > >> subvector type given the original vector type and the number of elements
>> > >> to be processed in the small-trip-count loop. The target hook also has a
>> > >> could_trap parameter to say if the subvector is allowed to have more
>> > >> bits than needed.
>> > >>
>> > >> This optimization is currently enabled for AArch64 only. Below example
>> > >> shows how it uses AdvSIMD vectors as subvectors of SVE vectors for
>> > >> higher instruction throughput.
>> > >>
>> > >> Consider this loop operating on an array of 16-bit integers:
>> > >>
>> > >>         for (int i = 0; i < 5; i++) {
>> > >>           a[i] = a[i] < 0 ? -a[i] : a[i];
>> > >>         }
>> > >>
>> > >> Before this patch, the generated AArch64 code would be:
>> > >>
>> > >>         ptrue   p7.h, vl5
>> > >>         ptrue   p6.b, all
>> > >>         ld1h    z31.h, p7/z, [x0]
>> > >>         abs     z31.h, p6/m, z31.h
>> > >>         st1h    z31.h, p7, [x0]
>> > >
>> > > p6.b has all lanes active - why is the abs then not
>> > > simply unmasked?
>> >
>> > There is no unpredicated abs for SVE.  The predicate has to be there,
>> > and so expand introduces one even when the gimple stmt is unconditional.
>> >
>> > >> After this patch, it is optimized to:
>> > >>
>> > >>         ptrue   p7.h, vl5
>> > >>         ld1h    z31.h, p7/z, [x0]
>> > >>         abs     v31.8h, v31.8h
>> > >>         st1h    z31.h, p7, [x0]
>> > >
>> > > Help me decipher this - I suppose z31 and v31 "overlap" in the
>> > > register file?  And z31 is a variable-length vector but
>> > > z31.8h is an 8-element fixed-length vector?  How can we
>> >
>> > v31.8h, but otherwise yes.
>> >
>> > > end up with just 8 elements here?  From the upper iteration
>> > > bound?
>> >
>> > Yeah.
>> >
>> > > I'm not sure why you need any target hook here.  It seems you
>> > > do already have suitable vector modes so why not just ask
>> > > for a suitable vector?  Is it because you need to have
>> > > that register overlap guarantee (otherwise you'd get
>> > > a move)?
>> >
>> > Yeah, the optimisation only makes sense for overlaid vector registers.
>> >
>> > > Why do we not simply use fixed-length SVE here in the first place?
>> >
>> > Fixed-length SVE is restricted to cases where the exact runtime length
>> > is known: the compile-time length is both a minimum and a maximum.
>> > In contrast, the code above would work even for 256-bit SVE.
>> >
>> > > To me doing this in this way in the vectorizer looks
>> > > somewhat out-of-place.
>> > >
>> > > That said, we already have unmasked ABS in the IL:
>> > >
>> > >   vect__1.6_15 = .MASK_LOAD (&a, 16B,
>> > >       { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... }, { 0, ... });
>> > >   vect__2.7_16 = ABSU_EXPR <vect__1.6_15>;
>> > >   vect__3.8_17 = VIEW_CONVERT_EXPR<vector([8,8]) short int>(vect__2.7_16);
>> > >   .MASK_STORE (&a, 16B,
>> > >       { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... }, vect__3.8_17); [tail call]
>> > >
>> > > so what's missing here?  I suppose having a constant masked ABSU here
>> > > would allow RTL expansion to select a fixed-size mode?
>> > >
>> > > And the vectorizer could simply use the existing
>> > > related_vector_mode hook instead?
>> >
>> > I agree it's a bit awkward.  The problem is that we want conflicting
>> > things.  On the one hand, it would make conceptual sense to use SVE
>> > instructions to provide conditional optabs for Advanced SIMD vector modes.
>> > E.g. SVE's LD1W could act as a predicated load for an Advanced SIMD
>> > int32x4_t vector.  The main problem with that is that Advanced SIMD's
>> > native boolean vector type is an integer vector of 0s and -1s, rather
>> > than an SVE predicate.  For some (native Advanced SIMD) operations we'd
>> > want one type of boolean, for some (SVE emulating Advanced SIMD)
>> > operations we'd want the other type of boolean.
>> >
>> > The patch goes the other way and treats using Advanced SIMD as an
>> > optimisation for SVE loops.
>> >
>> > related_vector_mode suffers from the same problem.  If we ask for a
>> > vector mode of >=5 halfwords for a load or store, we want the SVE mode,
>> > since that can be conditional on an SVE predicate.  But if we ask for
>> > a vector mode of >=5 halfwords for an integer absolute operation,
>> > we want the Advanced SIMD mode.  So I suppose the new hook is effectively
>> > providing context.  Perhaps we could do that using an extra parameter to
>> > related_vector_mode, if that seems better.
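To make the two-answers-for-one-query problem concrete, here is a toy model of the "extra context parameter" idea (every name here is made up for illustration; this is not GCC's real hook API):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context kinds: does the statement have to honour an
   SVE predicate (loads/stores), or is it unconditional arithmetic?  */
enum vec_op_context { CTX_PREDICATED_MEMORY, CTX_UNCONDITIONAL_ARITH };

/* Toy stand-ins for the two candidate AArch64 modes.  */
enum toy_mode { TOY_SVE_VNx8HI, TOY_NEON_V8HI };

/* Given a request for >= NELEMS halfword lanes, pick the SVE mode
   when the statement must be predicated (or does not fit the 128-bit
   lowpart), and the fixed-width Advanced SIMD mode otherwise.  */
static enum toy_mode
related_mode_with_context (size_t nelems, enum vec_op_context ctx)
{
  if (ctx == CTX_PREDICATED_MEMORY || nelems > 8)
    return TOY_SVE_VNx8HI;
  return TOY_NEON_V8HI;
}
```

The same 5-element request yields different answers depending on the context argument, which is exactly the information a context-free related_vector_mode query cannot express.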
>> >
>> > It's somewhat difficult to recover this information after vectorisation,
>> > since like you say, the statements are often unconditional and operate
>> > on all lanes.
>> 
>> So it seems we want to query whether there's a lowpart fixed-size
>> vector mode available for a given other mode.  It seems to me that
>> we should already have a way to query for this in general code,
>> without adding a new target hook?
>> 
>> It doesn't really fit related_vector_mode I guess.
>> 
>
> Yeah, I don't think that would work unless, as Richard mentioned, we
> have an argument to indicate which SIMD class you want to end up with.
>
>> I also wonder if we can take it as a given that SVE and Neon
>> inter-operate efficiently on all implementations, without some
>> kind of "rewiring" penalty.  Like in the above example, we
>> set Neon v31 and use it as the source for an SVE store via z31.
>> 
>
> It's an architectural requirement that the register files overlap.
> This is described in section B1.2 (Registers in AArch64 execution state).
> Any core that implements non-overlapping registers, where it would
> need a rewiring penalty, has bigger problems to worry about; as an
> example, the architecture specifies that writing to the lower part of
> an Adv. SIMD register will clear the top parts, up to VL, of the SVE "view".

Yeah, sorry for the confusion.  Reading it back, I see that Richard
was asking whether the target hook was to cope with AArch64 implementations
that don't overlay the SVE and Advanced SIMD registers (which would
contravene the architecture), whereas I was answering on the basis
"we need this hook because vector registers of different modes might
not be overlaid on other (non-AArch64) targets".
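For anyone skimming the thread, the overlap-and-zeroing rule Tamar cites can be modelled with a tiny self-contained C sketch (purely illustrative: the 256-bit VL and all names below are made up for the sketch, and these are of course not intrinsics):

```c
#include <string.h>

#define VL_BYTES 32   /* assume a 256-bit SVE implementation */
#define NEON_BYTES 16 /* the Advanced SIMD view is the low 128 bits */

/* Model of one overlaid register: v31 is the low 16 bytes of z31.  */
struct zreg { unsigned char b[VL_BYTES]; };

/* Per the architecture, writing the Advanced SIMD (Neon) view sets
   the low 128 bits and zeroes the remaining bits up to VL, so a later
   SVE read of the full register sees well-defined contents.  */
static void
write_neon_view (struct zreg *z, const unsigned char v[NEON_BYTES])
{
  memcpy (z->b, v, NEON_BYTES);
  memset (z->b + NEON_BYTES, 0, VL_BYTES - NEON_BYTES);
}
```

In the patch's example this is why the Neon `abs v31.8h` can feed the SVE `st1h z31.h` directly: the low lanes hold the result and the upper lanes are architecturally zero, with no move or rewiring in between.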

Richard
