Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Biener <rguent...@suse.de>
>> Sent: Friday, May 9, 2025 11:08 AM
>> To: Richard Sandiford <richard.sandif...@arm.com>
>> Cc: Pengfei Li <pengfei....@arm.com>; gcc-patches@gcc.gnu.org;
>> ktkac...@nvidia.com
>> Subject: Re: [PATCH] vect: Improve vectorization for small-trip-count
>> loops using subvectors
>>
>> On Fri, 9 May 2025, Richard Sandiford wrote:
>>
>> > Richard Biener <rguent...@suse.de> writes:
>> > > On Thu, 8 May 2025, Pengfei Li wrote:
>> > >
>> > >> This patch improves the auto-vectorization for loops with known small
>> > >> trip counts by enabling the use of subvectors - bit fields of original
>> > >> wider vectors.  A subvector must have the same vector element type as
>> > >> the original vector and enough bits for all vector elements to be
>> > >> processed in the loop.  Using subvectors is beneficial because machine
>> > >> instructions operating on narrower vectors usually show better
>> > >> performance.
>> > >>
>> > >> To enable this optimization, this patch introduces a new target hook.
>> > >> This hook allows the vectorizer to query the backend for a suitable
>> > >> subvector type given the original vector type and the number of
>> > >> elements to be processed in the small-trip-count loop.  The target
>> > >> hook also has a could_trap parameter to say if the subvector is
>> > >> allowed to have more bits than needed.
>> > >>
>> > >> This optimization is currently enabled for AArch64 only.  The example
>> > >> below shows how it uses AdvSIMD vectors as subvectors of SVE vectors
>> > >> for higher instruction throughput.
>> > >>
>> > >> Consider this loop operating on an array of 16-bit integers:
>> > >>
>> > >>   for (int i = 0; i < 5; i++) {
>> > >>     a[i] = a[i] < 0 ? -a[i] : a[i];
>> > >>   }
>> > >>
>> > >> Before this patch, the generated AArch64 code would be:
>> > >>
>> > >>   ptrue   p7.h, vl5
>> > >>   ptrue   p6.b, all
>> > >>   ld1h    z31.h, p7/z, [x0]
>> > >>   abs     z31.h, p6/m, z31.h
>> > >>   st1h    z31.h, p7, [x0]
>> > >
>> > > p6.b has all lanes active - why is the abs then not
>> > > simply unmasked?
>> >
>> > There is no unpredicated abs for SVE.  The predicate has to be there,
>> > and so expand introduces one even when the gimple stmt is unconditional.
>> >
>> > >> After this patch, it is optimized to:
>> > >>
>> > >>   ptrue   p7.h, vl5
>> > >>   ld1h    z31.h, p7/z, [x0]
>> > >>   abs     v31.8h, v31.8h
>> > >>   st1h    z31.h, p7, [x0]
>> > >
>> > > Help me decipher this - I suppose z31 and v31 "overlap" in the
>> > > register file?  And z31 is a variable-length vector but
>> > > z31.8h is an 8-element fixed-length vector?  How can we
>> >
>> > v31.8h, but otherwise yes.
>> >
>> > > end up with just 8 elements here?  From the upper iteration
>> > > bound?
>> >
>> > Yeah.
>> >
>> > > I'm not sure why you need any target hook here.  It seems you
>> > > do already have suitable vector modes so why not just ask
>> > > for a suitable vector?  Is it because you need to have
>> > > that register overlap guarantee (otherwise you'd get
>> > > a move)?
>> >
>> > Yeah, the optimisation only makes sense for overlaid vector registers.
>> >
>> > > Why do we not simply use fixed-length SVE here in the first place?
>> >
>> > Fixed-length SVE is restricted to cases where the exact runtime length
>> > is known: the compile-time length is both a minimum and a maximum.
>> > In contrast, the code above would work even for 256-bit SVE.
>> >
>> > > To me doing this in this way in the vectorizer looks
>> > > somewhat out-of-place.
>> > >
>> > > That said, we already have unmasked ABS in the IL:
>> > >
>> > >   vect__1.6_15 = .MASK_LOAD (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0,
>> > >       0, 0, 0, 0, 0, 0, 0, 0, ... }, { 0, ... });
>> > >   vect__2.7_16 = ABSU_EXPR <vect__1.6_15>;
>> > >   vect__3.8_17 = VIEW_CONVERT_EXPR<vector([8,8]) short int>(vect__2.7_16);
>> > >   .MASK_STORE (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0,
>> > >       0, 0, 0, ... }, vect__3.8_17); [tail call]
>> > >
>> > > so what's missing here?  I suppose having a constant masked ABSU here
>> > > would allow RTL expansion to select a fixed-size mode?
>> > >
>> > > And the vectorizer could simply use the existing
>> > > related_vector_mode hook instead?
>> >
>> > I agree it's a bit awkward.  The problem is that we want conflicting
>> > things.  On the one hand, it would make conceptual sense to use SVE
>> > instructions to provide conditional optabs for Advanced SIMD vector
>> > modes.  E.g. SVE's LD1W could act as a predicated load for an Advanced
>> > SIMD int32x4_t vector.  The main problem with that is that Advanced
>> > SIMD's native boolean vector type is an integer vector of 0s and -1s,
>> > rather than an SVE predicate.  For some (native Advanced SIMD)
>> > operations we'd want one type of boolean, for some (SVE emulating
>> > Advanced SIMD) operations we'd want the other type of boolean.
>> >
>> > The patch goes the other way and treats using Advanced SIMD as an
>> > optimisation for SVE loops.
>> >
>> > related_vector_mode suffers from the same problem.  If we ask for a
>> > vector mode of >=5 halfwords for a load or store, we want the SVE mode,
>> > since that can be conditional on an SVE predicate.  But if we ask for
>> > a vector mode of >=5 halfwords for an integer absolute operation,
>> > we want the Advanced SIMD mode.  So I suppose the new hook is
>> > effectively providing context.  Perhaps we could do that using an extra
>> > parameter to related_vector_mode, if that seems better.
>> >
>> > It's somewhat difficult to recover this information after vectorisation,
>> > since like you say, the statements are often unconditional and operate
>> > on all lanes.
>>
>> So it seems we want to query if there's a lowpart fixed-size vector
>> mode available for a given other mode.  It seems to me that we
>> should have a way to query for this already, using general code,
>> without having a new target hook?
>>
>> It doesn't really fit related_vector_mode I guess.
>>
>
> Yeah, I don't think that would work unless, as Richard mentioned, we have
> an argument to indicate which SIMD class you want to end up with.
>
>> I also wonder if we can take it as a given that SVE and Neon
>> inter-operate efficiently for all implementations, without some
>> kind of "rewriting" penalty.  Like in the above example we
>> set Neon v31 and use it as the source for an SVE store via z31.
>>
>
> It's an architectural requirement that the register files overlap.
> This is described in section B1.2 (Registers in AArch64 execution state).
> Any core that ever implements non-overlapping registers, where it would
> need a rewiring penalty, has bigger problems to worry about.  As an
> example, the architecture specifies that writing to the lower part of an
> Adv. SIMD register will clear the top parts up to VL of the SVE "view".
Yeah, sorry for the confusion.  Reading it back, I see that Richard was
asking whether the target hook was to cope with AArch64 implementations
that don't overlay the SVE and Advanced SIMD registers (which would
contravene the architecture), whereas I was answering on the basis "we
need this hook because vector registers of different modes might not be
overlaid on other (non-AArch64) targets".

Richard