On Fri, 9 May 2025, Richard Sandiford wrote: > Richard Biener <rguent...@suse.de> writes: > > On Thu, 8 May 2025, Pengfei Li wrote: > > > >> This patch improves the auto-vectorization for loops with known small > >> trip counts by enabling the use of subvectors - bit fields of original > >> wider vectors. A subvector must have the same vector element type as the > >> original vector and enough bits for all vector elements to be processed > >> in the loop. Using subvectors is beneficial because machine instructions > >> operating on narrower vectors usually show better performance. > >> > >> To enable this optimization, this patch introduces a new target hook. > >> This hook allows the vectorizer to query the backend for a suitable > >> subvector type given the original vector type and the number of elements > >> to be processed in the small-trip-count loop. The target hook also has a > >> could_trap parameter to say if the subvector is allowed to have more > >> bits than needed. > >> > >> This optimization is currently enabled for AArch64 only. Below example > >> shows how it uses AdvSIMD vectors as subvectors of SVE vectors for > >> higher instruction throughput. > >> > >> Consider this loop operating on an array of 16-bit integers: > >> > >> for (int i = 0; i < 5; i++) { > >> a[i] = a[i] < 0 ? -a[i] : a[i]; > >> } > >> > >> Before this patch, the generated AArch64 code would be: > >> > >> ptrue p7.h, vl5 > >> ptrue p6.b, all > >> ld1h z31.h, p7/z, [x0] > >> abs z31.h, p6/m, z31.h > >> st1h z31.h, p7, [x0] > > > > p6.b has all lanes active - why is the abs then not > > simply unmasked? > > There is no unpredicated abs for SVE. The predicate has to be there, > and so expand introduces one even when the gimple stmt is unconditional. > > >> After this patch, it is optimized to: > >> > >> ptrue p7.h, vl5 > >> ld1h z31.h, p7/z, [x0] > >> abs v31.8h, v31.8h > >> st1h z31.h, p7, [x0] > > > > Help me decipher this - I suppose z31 and v31 "overlap" in the > > register file? And z31 is a variable-length vector but > > z31.8h is a 8 element fixed length vector? How can we > > v31.8h, but otherwise yes. > > > end up with just 8 elements here? From the upper interation > > bound? > > Yeah. > > > I'm not sure why you need any target hook here. It seems you > > do already have suitable vector modes so why not just ask > > for a suitable vector? Is it because you need to have > > that register overlap guarantee (otherwise you'd get > > a move)? > > Yeah, the optimisation only makes sense for overlaid vector registers. > > > Why do we not simply use fixed-length SVE here in the first place? > > Fixed-length SVE is restricted to cases where the exact runtime length > is known: the compile-time length is both a minimum and a maximum. > In contrast, the code above would work even for 256-bit SVE. > > > To me doing this in this way in the vectorizer looks > > somewhat out-of-place. > > > > That said, we already have unmasked ABS in the IL: > > > > vect__1.6_15 = .MASK_LOAD (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, > > 0, 0, 0, 0, 0, 0, ... }, { 0, ... }); > > vect__2.7_16 = ABSU_EXPR <vect__1.6_15>; > > vect__3.8_17 = VIEW_CONVERT_EXPR<vector([8,8]) short int>(vect__2.7_16); > > .MASK_STORE (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, > > 0, 0, ... }, vect__3.8_17); [tail call] > > > > so what's missing here? I suppose having a constant masked ABSU here > > would allow RTL expansion to select a fixed-size mode? > > > > And the vectorizer could simply use the existing > > related_vector_mode hook instead? > > I agree it's a bit awkward. The problem is that we want conflicting > things. On the one hand, it would make conceptual sense to use SVE > instructions to provide conditional optabs for Advanced SIMD vector modes. > E.g. SVE's LD1W could act as a predicated load for an Advanced SIMD > int32x4_t vector. The main problem with that is that Advanced SIMD's > native boolean vector type is an integer vector of 0s and -1s, rather > than an SVE predicate. For some (native Advanced SIMD) operations we'd > want one type of boolean, for some (SVE emulating Advanced SIMD) > operations we'd want the other type of boolean. > > The patch goes the other way and treats using Advanced SIMD as an > optimisation for SVE loops. > > related_vector_mode suffers from the same problem. If we ask for a > vector mode of >=5 halfwords for a load or store, we want the SVE mode, > since that can be conditional on an SVE predicate. But if we ask for > a vector mode of >=5 halfwords for an integer absolute operation, > we want the Advanced SIMD mode. So I suppose the new hook is effectively > providing context. Perhaps we could do that using an extra parameter to > related_vector_mode, if that seems better. > > It's somewhat difficult to recover this information after vectorisation, > since like you say, the statements are often unconditional and operate > on all lanes.
So it seems we want to query if there's a lowpart fixed-size vector mode available for a given other mode. It seems to me that we should have a way to query for this already without having a new target hook using general code? It doesn't really fit related_vector_mode I guess. I also wonder if we can take it as a given that SVE and neon inter-operate efficiently for all implementations, without some kind of "rewriting" penalty. Like in the above example we set NEON v31 and use it as source for a SVE store via z31. Richard. > Thanks, > Richard > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)