On Fri, May 9, 2025 at 4:05 PM Tamar Christina <tamar.christ...@arm.com> wrote:
>
> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Friday, May 9, 2025 2:44 PM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: Richard Sandiford <richard.sandif...@arm.com>; Pengfei Li
> > <pengfei....@arm.com>; gcc-patches@gcc.gnu.org; ktkac...@nvidia.com
> > Subject: RE: [PATCH] vect: Improve vectorization for small-trip-count
> > loops using subvectors
> >
> > On Fri, 9 May 2025, Tamar Christina wrote:
> >
> > > > -----Original Message-----
> > > > From: Richard Biener <rguent...@suse.de>
> > > > Sent: Friday, May 9, 2025 11:08 AM
> > > > To: Richard Sandiford <richard.sandif...@arm.com>
> > > > Cc: Pengfei Li <pengfei....@arm.com>; gcc-patches@gcc.gnu.org;
> > > > ktkac...@nvidia.com
> > > > Subject: Re: [PATCH] vect: Improve vectorization for small-trip-count
> > > > loops using subvectors
> > > >
> > > > On Fri, 9 May 2025, Richard Sandiford wrote:
> > > >
> > > > > Richard Biener <rguent...@suse.de> writes:
> > > > > > On Thu, 8 May 2025, Pengfei Li wrote:
> > > > > >
> > > > > >> This patch improves the auto-vectorization of loops with known small
> > > > > >> trip counts by enabling the use of subvectors - bit fields of the
> > > > > >> original wider vectors. A subvector must have the same vector element
> > > > > >> type as the original vector and enough bits for all vector elements
> > > > > >> to be processed in the loop. Using subvectors is beneficial because
> > > > > >> machine instructions operating on narrower vectors usually show
> > > > > >> better performance.
> > > > > >>
> > > > > >> To enable this optimization, this patch introduces a new target hook.
> > > > > >> This hook allows the vectorizer to query the backend for a suitable
> > > > > >> subvector type given the original vector type and the number of
> > > > > >> elements to be processed in the small-trip-count loop. The target
> > > > > >> hook also has a could_trap parameter to say whether the subvector is
> > > > > >> allowed to have more bits than needed.
> > > > > >>
> > > > > >> This optimization is currently enabled for AArch64 only. The example
> > > > > >> below shows how it uses AdvSIMD vectors as subvectors of SVE vectors
> > > > > >> for higher instruction throughput.
> > > > > >>
> > > > > >> Consider this loop operating on an array of 16-bit integers:
> > > > > >>
> > > > > >>   for (int i = 0; i < 5; i++) {
> > > > > >>     a[i] = a[i] < 0 ? -a[i] : a[i];
> > > > > >>   }
> > > > > >>
> > > > > >> Before this patch, the generated AArch64 code would be:
> > > > > >>
> > > > > >>   ptrue   p7.h, vl5
> > > > > >>   ptrue   p6.b, all
> > > > > >>   ld1h    z31.h, p7/z, [x0]
> > > > > >>   abs     z31.h, p6/m, z31.h
> > > > > >>   st1h    z31.h, p7, [x0]
> > > > > >
> > > > > > p6.b has all lanes active - why is the abs then not
> > > > > > simply unmasked?
> > > > >
> > > > > There is no unpredicated abs for SVE. The predicate has to be there,
> > > > > and so expand introduces one even when the gimple stmt is
> > > > > unconditional.
> > > > >
> > > > > >> After this patch, it is optimized to:
> > > > > >>
> > > > > >>   ptrue   p7.h, vl5
> > > > > >>   ld1h    z31.h, p7/z, [x0]
> > > > > >>   abs     v31.8h, v31.8h
> > > > > >>   st1h    z31.h, p7, [x0]
> > > > > >
> > > > > > Help me decipher this - I suppose z31 and v31 "overlap" in the
> > > > > > register file?
> > > > > > And z31 is a variable-length vector but z31.8h is an 8-element
> > > > > > fixed-length vector? How can we
> > > > >
> > > > > v31.8h, but otherwise yes.
> > > > >
> > > > > > end up with just 8 elements here? From the upper iteration
> > > > > > bound?
> > > > >
> > > > > Yeah.
> > > > >
> > > > > > I'm not sure why you need any target hook here. It seems you
> > > > > > do already have suitable vector modes so why not just ask
> > > > > > for a suitable vector? Is it because you need to have
> > > > > > that register overlap guarantee (otherwise you'd get
> > > > > > a move)?
> > > > >
> > > > > Yeah, the optimisation only makes sense for overlaid vector registers.
> > > > >
> > > > > > Why do we not simply use fixed-length SVE here in the first place?
> > > > >
> > > > > Fixed-length SVE is restricted to cases where the exact runtime length
> > > > > is known: the compile-time length is both a minimum and a maximum.
> > > > > In contrast, the code above would work even for 256-bit SVE.
> > > > >
> > > > > > To me doing this in this way in the vectorizer looks
> > > > > > somewhat out-of-place.
> > > > > >
> > > > > > That said, we already have unmasked ABS in the IL:
> > > > > >
> > > > > >   vect__1.6_15 = .MASK_LOAD (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0,
> > > > > >     0, 0, 0, 0, 0, 0, 0, 0, ... }, { 0, ... });
> > > > > >   vect__2.7_16 = ABSU_EXPR <vect__1.6_15>;
> > > > > >   vect__3.8_17 = VIEW_CONVERT_EXPR<vector([8,8]) short int>(vect__2.7_16);
> > > > > >   .MASK_STORE (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0,
> > > > > >     0, 0, 0, ... }, vect__3.8_17); [tail call]
> > > > > >
> > > > > > so what's missing here? I suppose having a constant masked ABSU here
> > > > > > would allow RTL expansion to select a fixed-size mode?
> > > > > >
> > > > > > And the vectorizer could simply use the existing
> > > > > > related_vector_mode hook instead?
> > > > >
> > > > > I agree it's a bit awkward. The problem is that we want conflicting
> > > > > things. On the one hand, it would make conceptual sense to use SVE
> > > > > instructions to provide conditional optabs for Advanced SIMD vector
> > > > > modes. E.g. SVE's LD1W could act as a predicated load for an Advanced
> > > > > SIMD int32x4_t vector. The main problem with that is that Advanced
> > > > > SIMD's native boolean vector type is an integer vector of 0s and -1s,
> > > > > rather than an SVE predicate. For some (native Advanced SIMD)
> > > > > operations we'd want one type of boolean, for some (SVE emulating
> > > > > Advanced SIMD) operations we'd want the other type of boolean.
> > > > >
> > > > > The patch goes the other way and treats using Advanced SIMD as an
> > > > > optimisation for SVE loops.
> > > > >
> > > > > related_vector_mode suffers from the same problem. If we ask for a
> > > > > vector mode of >=5 halfwords for a load or store, we want the SVE
> > > > > mode, since that can be conditional on an SVE predicate. But if we
> > > > > ask for a vector mode of >=5 halfwords for an integer absolute
> > > > > operation, we want the Advanced SIMD mode. So I suppose the new hook
> > > > > is effectively providing context. Perhaps we could do that using an
> > > > > extra parameter to related_vector_mode, if that seems better.
> > > > > It's somewhat difficult to recover this information after
> > > > > vectorisation, since like you say, the statements are often
> > > > > unconditional and operate on all lanes.
> > > >
> > > > So it seems we want to query if there's a lowpart fixed-size vector
> > > > mode available for a given other mode. It seems to me that we
> > > > should have a way to query for this already, using general code,
> > > > without having a new target hook?
> >
> > So any answer to this? You should be able to iterate over all
> > vector modes, look for those with fixed size and fitting the
> > lane constraint, and then ask whether the modes are tieable
> > or whatever else is the correct way to verify the constraint?
> >
> > So sth as simple as
> >
> >   mode = mode_for_vector (GET_MODE_INNER (vmode), ceil_pow2 (const_nunits));
> >   if (targetm.modes_tieable_p (mode, vmode))
> >     return mode;
> >
> > ? Why do we need a target hook for this? What's the "hidden"
> > constraint I'm missing?
>
> Richard can correct me if I'm wrong (probably) but the problem with this
> is that it won't work with VLS, e.g. -msve-vector-bits, because the SVE
> modes are fixed size then. Secondly, it'll have issues respecting
> --param aarch64-autovec-preference=, since that is intended to affect
> only autovec, whereas mode_for_vector is general.
>
> The core of this optimization is that it must change to Adv. SIMD over
> SVE modes.
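Spelled out with existing middle-end interfaces, the generic query Richard sketches above might look roughly like the following. This is only a sketch: find_fixed_lowpart_mode is a hypothetical helper, not part of the posted patch, and per Tamar's objection it would still happily return VLS SVE modes once -msve-vector-bits makes them fixed-size:

  /* Hypothetical helper, for illustration only: look for a fixed-size
     vector mode with the same element mode as VMODE, at least NUNITS
     lanes, and registers tieable with VMODE's (so no move is needed).  */
  static opt_machine_mode
  find_fixed_lowpart_mode (machine_mode vmode, unsigned HOST_WIDE_INT nunits)
  {
    machine_mode mode;
    FOR_EACH_MODE_IN_CLASS (mode, GET_MODE_CLASS (vmode))
      if (GET_MODE_INNER (mode) == GET_MODE_INNER (vmode)
          && GET_MODE_SIZE (mode).is_constant ()
          && known_ge (GET_MODE_NUNITS (mode), nunits)
          && targetm.modes_tieable_p (mode, vmode))
        return mode;
    return opt_machine_mode ();
  }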
OK, so SVE VLS -msve-vector-bits=128 modes are indistinguishable from
Adv. SIMD modes by the middle-end? Is there a way to distinguish them,
say, by cost (target_reg_cost?)? Since any use of an SVE reg will
require a predicate reg?

I think we miss a critical bit of information in the middle-end here and
I'm trying to see what piece of information that actually is.
"find_subvector_type" doesn't seem to be it; it's maybe using that hidden
bit of information for one specific use-case, but it would be far better
to have a way for the target to communicate the missing bit of
information in a more generic way. We can then wrap a
"find_subvector_type" around that.

Thanks,
Richard.

> > > > It doesn't really fit related_vector_mode I guess.
> > > >
> > >
> > > Yeah, I don't think that would work unless, as Richard mentioned, we
> > > have an argument to indicate which SIMD class you want to end up with.
> > >
> > > > I also wonder if we can take it as a given that SVE and neon
> > > > inter-operate efficiently for all implementations, without some
> > > > kind of "rewriting" penalty. Like in the above example we
> > > > set NEON v31 and use it as source for a SVE store via z31.
> > > >
> > >
> > > It's an architectural requirement that the register files overlap.
> > > This is described in section B1.2 (Registers in AArch64 execution
> > > state). Any core that ever implements non-overlapping registers,
> > > where it would need a rewiring penalty, has bigger problems to worry
> > > about; as an example, the architecture describes that writing to the
> > > lower part of an Adv. SIMD register will clear the top parts up to VL
> > > of the SVE "view".
> > >
> > > Any such rewiring would mean that Adv. SIMD and scalar FPR
> > > instructions become useless due to the defined state of the larger
> > > views on the register.
> >
> > I'm aware of the architectural requirement - it's just that I could
> > think of the HW re-configuring itself for masked operations and thus
> > switching back and forth might incur some penalty. If it is common
> > practice to mix SVE and NEON this way, it's of course unlikely such a
> > uarch would be successful. But then, powering off mask & high vector
> > part handling logic when facing NEON might be a possibility.
> >
>
> Yeah, the reason for this optimization has more to do with how the
> vector pipes are split between Adv. SIMD and SVE. An easy one is, say,
> reductions: the bigger the VL, the more expensive in-order reductions
> like addv become. But Adv. SIMD reductions have a fixed cost, and if we
> know we only need to reduce the bottom N bits, it'll always beat SVE
> reductions.
>
> Others like MUL just have a higher throughput in Adv. SIMD vs SVE on
> e.g. 256-bit VL cores. So it's not just the masking but vector length
> in general.
>
> And the reason we don't pick Adv. SIMD for such loops is that SVE allows
> partial masking, so for e.g. MUL it's ok for us to multiply with an
> unknown-valued lane since predication makes the usages of the result
> safe.
>
> Thanks,
> Tamar
>
> > Richard.
> >
> > > Thanks,
> > > Tamar
> > >
> > > > Richard.
> > > >
> > > > > Thanks,
> > > > > Richard
> > > >
> > > > --
> > > > Richard Biener <rguent...@suse.de>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> >
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
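For reference, the before/after sequence at the top of the thread corresponds closely to what one can write by hand with the ACLE NEON-SVE bridge. A minimal sketch, assuming a compiler that provides <arm_neon_sve_bridge.h> (the function name abs5 is made up for illustration):

  #include <arm_sve.h>
  #include <arm_neon.h>
  #include <arm_neon_sve_bridge.h>

  void
  abs5 (int16_t *a)
  {
    svbool_t pg = svptrue_pat_b16 (SV_VL5);   /* ptrue p7.h, vl5 */
    svint16_t z = svld1_s16 (pg, a);          /* predicated SVE load */
    int16x8_t v = svget_neonq_s16 (z);        /* low 128 bits, no move */
    v = vabsq_s16 (v);                        /* unpredicated Adv. SIMD abs */
    z = svset_neonq_s16 (z, v);               /* back into the SVE view */
    svst1_s16 (pg, a, z);                     /* predicated SVE store */
  }

svget_neonq/svset_neonq reinterpret the low 128 bits of an SVE register in place, which is exactly the register-file overlap guarantee the optimisation relies on.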