Re: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors

Richard Biener Fri, 09 May 2025 03:27:46 -0700

On Fri, 9 May 2025, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Thu, 8 May 2025, Pengfei Li wrote:
> >
> >> This patch improves the auto-vectorization for loops with known small
> >> trip counts by enabling the use of subvectors - bit fields of original
> >> wider vectors. A subvector must have the same vector element type as the
> >> original vector and enough bits for all vector elements to be processed
> >> in the loop. Using subvectors is beneficial because machine instructions
> >> operating on narrower vectors usually show better performance.
> >> 
> >> To enable this optimization, this patch introduces a new target hook.
> >> This hook allows the vectorizer to query the backend for a suitable
> >> subvector type given the original vector type and the number of elements
> >> to be processed in the small-trip-count loop. The target hook also has a
> >> could_trap parameter to say if the subvector is allowed to have more
> >> bits than needed.
> >> 
> >> This optimization is currently enabled for AArch64 only. Below example
> >> shows how it uses AdvSIMD vectors as subvectors of SVE vectors for
> >> higher instruction throughput.
> >> 
> >> Consider this loop operating on an array of 16-bit integers:
> >> 
> >>    for (int i = 0; i < 5; i++) {
> >>      a[i] = a[i] < 0 ? -a[i] : a[i];
> >>    }
> >> 
> >> Before this patch, the generated AArch64 code would be:
> >> 
> >>    ptrue   p7.h, vl5
> >>    ptrue   p6.b, all
> >>    ld1h    z31.h, p7/z, [x0]
> >>    abs     z31.h, p6/m, z31.h
> >>    st1h    z31.h, p7, [x0]
> >
> > p6.b has all lanes active - why is the abs then not
> > simply unmasked?
> 
> There is no unpredicated abs for SVE.  The predicate has to be there,
> and so expand introduces one even when the gimple stmt is unconditional.
> 
> >> After this patch, it is optimized to:
> >> 
> >>    ptrue   p7.h, vl5
> >>    ld1h    z31.h, p7/z, [x0]
> >>    abs     v31.8h, v31.8h
> >>    st1h    z31.h, p7, [x0]
> >
> > Help me decipher this - I suppose z31 and v31 "overlap" in the
> > register file?  And z31 is a variable-length vector but
> > z31.8h is a 8 element fixed length vector?  How can we
> 
> v31.8h, but otherwise yes.
> 
> > end up with just 8 elements here?  From the upper interation
> > bound?
> 
> Yeah.
> 
> > I'm not sure why you need any target hook here.  It seems you
> > do already have suitable vector modes so why not just ask
> > for a suitable vector?  Is it because you need to have
> > that register overlap guarantee (otherwise you'd get
> > a move)?
> 
> Yeah, the optimisation only makes sense for overlaid vector registers.
> 
> > Why do we not simply use fixed-length SVE here in the first place?
> 
> Fixed-length SVE is restricted to cases where the exact runtime length
> is known: the compile-time length is both a minimum and a maximum.
> In contrast, the code above would work even for 256-bit SVE.
>
> > To me doing this in this way in the vectorizer looks
> > somewhat out-of-place.
> >
> > That said, we already have unmasked ABS in the IL:
> >
> >   vect__1.6_15 = .MASK_LOAD (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 
> > 0, 0, 0, 0, 0, 0, ... }, { 0, ... });
> >   vect__2.7_16 = ABSU_EXPR <vect__1.6_15>;
> >   vect__3.8_17 = VIEW_CONVERT_EXPR<vector([8,8]) short int>(vect__2.7_16);
> >   .MASK_STORE (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> > 0, 0, ... }, vect__3.8_17); [tail call]
> >
> > so what's missing here?  I suppose having a constant masked ABSU here
> > would allow RTL expansion to select a fixed-size mode?
> >
> > And the vectorizer could simply use the existing
> > related_vector_mode hook instead?
> 
> I agree it's a bit awkward.  The problem is that we want conflicting
> things.  On the one hand, it would make conceptual sense to use SVE
> instructions to provide conditional optabs for Advanced SIMD vector modes.
> E.g. SVE's LD1W could act as a predicated load for an Advanced SIMD
> int32x4_t vector.  The main problem with that is that Advanced SIMD's
> native boolean vector type is an integer vector of 0s and -1s, rather
> than an SVE predicate.  For some (native Advanced SIMD) operations we'd
> want one type of boolean, for some (SVE emulating Advanced SIMD)
> operations we'd want the other type of boolean.
> 
> The patch goes the other way and treats using Advanced SIMD as an
> optimisation for SVE loops.
> 
> related_vector_mode suffers from the same problem.  If we ask for a
> vector mode of >=5 halfwords for a load or store, we want the SVE mode,
> since that can be conditional on an SVE predicate.  But if we ask for
> a vector mode of >=5 halfwords for an integer absolute operation,
> we want the Advanced SIMD mode.  So I suppose the new hook is effectively
> providing context.  Perhaps we could do that using an extra parameter to
> related_vector_mode, if that seems better.
> 
> It's somewhat difficult to recover this information after vectorisation,
> since like you say, the statements are often unconditional and operate
> on all lanes.


So it seems we want to query if there's a lowpart fixed-size vector
mode available for a given other mode.  It seems to me that we
should have a way to query for this already without having a new
target hook using general code?

It doesn't really fit related_vector_mode I guess.

I also wonder if we can take it as a given that SVE and neon
inter-operate efficiently for all implementations, without some
kind of "rewriting" penalty.  Like in the above example we
set NEON v31 and use it as source for a SVE store via z31.

Richard.

> Thanks,
> Richard
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Re: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors

Reply via email to