On Wed, 4 Jun 2025, Tamar Christina wrote: > > -----Original Message----- > > From: Richard Biener <rguent...@suse.de> > > Sent: Wednesday, June 4, 2025 8:34 AM > > To: Tamar Christina <tamar.christ...@arm.com> > > Cc: Richard Biener <richard.guent...@gmail.com>; Richard Sandiford > > <richard.sandif...@arm.com>; Pengfei Li <pengfei....@arm.com>; gcc- > > patc...@gcc.gnu.org; ktkac...@nvidia.com > > Subject: RE: [PATCH] vect: Improve vectorization for small-trip-count loops > > using > > subvectors > > > > On Wed, 4 Jun 2025, Tamar Christina wrote: > > > > > > -----Original Message----- > > > > From: Richard Biener <rguent...@suse.de> > > > > Sent: Wednesday, June 4, 2025 8:04 AM > > > > To: Tamar Christina <tamar.christ...@arm.com> > > > > Cc: Richard Biener <richard.guent...@gmail.com>; Richard Sandiford > > > > <richard.sandif...@arm.com>; Pengfei Li <pengfei....@arm.com>; gcc- > > > > patc...@gcc.gnu.org; ktkac...@nvidia.com > > > > Subject: RE: [PATCH] vect: Improve vectorization for small-trip-count > > > > loops > > using > > > > subvectors > > > > > > > > On Tue, 3 Jun 2025, Tamar Christina wrote: > > > > > > > > > > -----Original Message----- > > > > > > From: Richard Biener <richard.guent...@gmail.com> > > > > > > Sent: Tuesday, June 3, 2025 2:12 PM > > > > > > To: Tamar Christina <tamar.christ...@arm.com> > > > > > > Cc: Richard Biener <rguent...@suse.de>; Richard Sandiford > > > > > > <richard.sandif...@arm.com>; Pengfei Li <pengfei....@arm.com>; gcc- > > > > > > patc...@gcc.gnu.org; ktkac...@nvidia.com > > > > > > Subject: Re: [PATCH] vect: Improve vectorization for > > > > > > small-trip-count loops > > > > using > > > > > > subvectors > > > > > > > > > > > > On Fri, May 9, 2025 at 4:05 PM Tamar Christina > > <tamar.christ...@arm.com> > > > > > > wrote: > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: Richard Biener <rguent...@suse.de> > > > > > > > > Sent: Friday, May 9, 2025 2:44 PM > > > > > > > > To: Tamar Christina <tamar.christ...@arm.com> > > > > > > > > Cc: Richard Sandiford <richard.sandif...@arm.com>; Pengfei Li > > > > > > > > <pengfei....@arm.com>; gcc-patches@gcc.gnu.org; > > ktkac...@nvidia.com > > > > > > > > Subject: RE: [PATCH] vect: Improve vectorization for > > > > > > > > small-trip-count > > loops > > > > > > using > > > > > > > > subvectors > > > > > > > > > > > > > > > > On Fri, 9 May 2025, Tamar Christina wrote: > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > > From: Richard Biener <rguent...@suse.de> > > > > > > > > > > Sent: Friday, May 9, 2025 11:08 AM > > > > > > > > > > To: Richard Sandiford <richard.sandif...@arm.com> > > > > > > > > > > Cc: Pengfei Li <pengfei....@arm.com>; > > > > > > > > > > gcc-patches@gcc.gnu.org; > > > > > > > > > > ktkac...@nvidia.com > > > > > > > > > > Subject: Re: [PATCH] vect: Improve vectorization for > > > > > > > > > > small-trip-count > > > > loops > > > > > > > > using > > > > > > > > > > subvectors > > > > > > > > > > > > > > > > > > > > On Fri, 9 May 2025, Richard Sandiford wrote: > > > > > > > > > > > > > > > > > > > > > Richard Biener <rguent...@suse.de> writes: > > > > > > > > > > > > On Thu, 8 May 2025, Pengfei Li wrote: > > > > > > > > > > > > > > > > > > > > > > > >> This patch improves the auto-vectorization for loops > > > > > > > > > > > >> with known > > > > small > > > > > > > > > > > >> trip counts by enabling the use of subvectors - bit > > > > > > > > > > > >> fields of > > original > > > > > > > > > > > >> wider vectors. 
A subvector must have the same vector > > > > > > > > > > > >> element > > type > > > > as > > > > > > the > > > > > > > > > > > >> original vector and enough bits for all vector > > > > > > > > > > > >> elements to be > > > > processed > > > > > > > > > > > >> in the loop. Using subvectors is beneficial because > > > > > > > > > > > >> machine > > > > instructions > > > > > > > > > > > >> operating on narrower vectors usually show better > > > > > > > > > > > >> performance. > > > > > > > > > > > >> > > > > > > > > > > > >> To enable this optimization, this patch introduces a > > > > > > > > > > > >> new target > > > > hook. > > > > > > > > > > > >> This hook allows the vectorizer to query the backend > > > > > > > > > > > >> for a > > suitable > > > > > > > > > > > >> subvector type given the original vector type and the > > > > > > > > > > > >> number of > > > > > > elements > > > > > > > > > > > >> to be processed in the small-trip-count loop. The > > > > > > > > > > > >> target hook > > also > > > > has a > > > > > > > > > > > >> could_trap parameter to say if the subvector is > > > > > > > > > > > >> allowed to have > > > > more > > > > > > > > > > > >> bits than needed. > > > > > > > > > > > >> > > > > > > > > > > > >> This optimization is currently enabled for AArch64 > > > > > > > > > > > >> only. Below > > > > example > > > > > > > > > > > >> shows how it uses AdvSIMD vectors as subvectors of SVE > > vectors > > > > for > > > > > > > > > > > >> higher instruction throughput. > > > > > > > > > > > >> > > > > > > > > > > > >> Consider this loop operating on an array of 16-bit > > > > > > > > > > > >> integers: > > > > > > > > > > > >> > > > > > > > > > > > >> for (int i = 0; i < 5; i++) { > > > > > > > > > > > >> a[i] = a[i] < 0 ? -a[i] : a[i]; > > > > > > > > > > > >> } > > > > > > > > > > > >> > > > > > > > > > > > >> Before this patch, the generated AArch64 code would be: > > > > > > > > > > > >> > > > > > > > > > > > >> ptrue p7.h, vl5 > > > > > > > > > > > >> ptrue p6.b, all > > > > > > > > > > > >> ld1h z31.h, p7/z, [x0] > > > > > > > > > > > >> abs z31.h, p6/m, z31.h > > > > > > > > > > > >> st1h z31.h, p7, [x0] > > > > > > > > > > > > > > > > > > > > > > > > p6.b has all lanes active - why is the abs then not > > > > > > > > > > > > simply unmasked? > > > > > > > > > > > > > > > > > > > > > > There is no unpredicated abs for SVE. The predicate has > > > > > > > > > > > to be there, > > > > > > > > > > > and so expand introduces one even when the gimple stmt is > > > > > > unconditional. > > > > > > > > > > > > > > > > > > > > > > >> After this patch, it is optimized to: > > > > > > > > > > > >> > > > > > > > > > > > >> ptrue p7.h, vl5 > > > > > > > > > > > >> ld1h z31.h, p7/z, [x0] > > > > > > > > > > > >> abs v31.8h, v31.8h > > > > > > > > > > > >> st1h z31.h, p7, [x0] > > > > > > > > > > > > > > > > > > > > > > > > Help me decipher this - I suppose z31 and v31 "overlap" > > > > > > > > > > > > in the > > > > > > > > > > > > register file? And z31 is a variable-length vector but > > > > > > > > > > > > z31.8h is a 8 element fixed length vector? How can we > > > > > > > > > > > > > > > > > > > > > > v31.8h, but otherwise yes. > > > > > > > > > > > > > > > > > > > > > > > end up with just 8 elements here? From the upper > > > > > > > > > > > > interation > > > > > > > > > > > > bound? > > > > > > > > > > > > > > > > > > > > > > Yeah. > > > > > > > > > > > > > > > > > > > > > > > I'm not sure why you need any target hook here. 
It > > > > > > > > > > > > seems you > > > > > > > > > > > > do already have suitable vector modes so why not just > > > > > > > > > > > > ask > > > > > > > > > > > > for a suitable vector? Is it because you need to have > > > > > > > > > > > > that register overlap guarantee (otherwise you'd get > > > > > > > > > > > > a move)? > > > > > > > > > > > > > > > > > > > > > > Yeah, the optimisation only makes sense for overlaid > > > > > > > > > > > vector > > registers. > > > > > > > > > > > > > > > > > > > > > > > Why do we not simply use fixed-length SVE here in the > > > > > > > > > > > > first place? > > > > > > > > > > > > > > > > > > > > > > Fixed-length SVE is restricted to cases where the exact > > > > > > > > > > > runtime > > length > > > > > > > > > > > is known: the compile-time length is both a minimum and a > > maximum. > > > > > > > > > > > In contrast, the code above would work even for 256-bit > > > > > > > > > > > SVE. > > > > > > > > > > > > > > > > > > > > > > > To me doing this in this way in the vectorizer looks > > > > > > > > > > > > somewhat out-of-place. > > > > > > > > > > > > > > > > > > > > > > > > That said, we already have unmasked ABS in the IL: > > > > > > > > > > > > > > > > > > > > > > > > vect__1.6_15 = .MASK_LOAD (&a, 16B, { -1, -1, -1, -1, > > > > > > > > > > > > -1, 0, 0, 0, > > 0, > > > > 0, > > > > > > > > > > > > 0, 0, 0, 0, 0, 0, ... }, { 0, ... }); > > > > > > > > > > > > vect__2.7_16 = ABSU_EXPR <vect__1.6_15>; > > > > > > > > > > > > vect__3.8_17 = VIEW_CONVERT_EXPR<vector([8,8]) short > > > > > > > > > > int>(vect__2.7_16); > > > > > > > > > > > > .MASK_STORE (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, > > > > > > > > > > > > 0, 0, 0, 0, 0, 0, > > > > > > > > > > > > 0, 0, ... }, vect__3.8_17); [tail call] > > > > > > > > > > > > > > > > > > > > > > > > so what's missing here? I suppose having a constant > > > > > > > > > > > > masked > > ABSU > > > > here > > > > > > > > > > > > would allow RTL expansion to select a fixed-size mode? > > > > > > > > > > > > > > > > > > > > > > > > And the vectorizer could simply use the existing > > > > > > > > > > > > related_vector_mode hook instead? > > > > > > > > > > > > > > > > > > > > > > I agree it's a bit awkward. The problem is that we want > > > > > > > > > > > conflicting > > > > > > > > > > > things. On the one hand, it would make conceptual sense > > > > > > > > > > > to use > > SVE > > > > > > > > > > > instructions to provide conditional optabs for Advanced > > > > > > > > > > > SIMD > > vector > > > > > > modes. > > > > > > > > > > > E.g. SVE's LD1W could act as a predicated load for an > > > > > > > > > > > Advanced > > SIMD > > > > > > > > > > > int32x4_t vector. The main problem with that is that > > > > > > > > > > > Advanced > > SIMD's > > > > > > > > > > > native boolean vector type is an integer vector of 0s and > > > > > > > > > > > -1s, rather > > > > > > > > > > > than an SVE predicate. For some (native Advanced SIMD) > > operations > > > > we'd > > > > > > > > > > > want one type of boolean, for some (SVE emulating Advanced > > SIMD) > > > > > > > > > > > operations we'd want the other type of boolean. > > > > > > > > > > > > > > > > > > > > > > The patch goes the other way and treats using Advanced > > > > > > > > > > > SIMD as > > an > > > > > > > > > > > optimisation for SVE loops. > > > > > > > > > > > > > > > > > > > > > > related_vector_mode suffers from the same problem. 
If we > > > > > > > > > > > ask for > > a > > > > > > > > > > > vector mode of >=5 halfwords for a load or store, we want > > > > > > > > > > > the SVE > > > > mode, > > > > > > > > > > > since that can be conditional on an SVE predicate. But > > > > > > > > > > > if we ask for > > > > > > > > > > > a vector mode of >=5 halfwords for an integer absolute > > > > > > > > > > > operation, > > > > > > > > > > > we want the Advanced SIMD mode. So I suppose the new > > > > > > > > > > > hook is > > > > > > effectively > > > > > > > > > > > providing context. Perhaps we could do that using an > > > > > > > > > > > extra > > parameter > > > > to > > > > > > > > > > > related_vector_mode, if that seems better. > > > > > > > > > > > > > > > > > > > > > > It's somewhat difficult to recover this information after > > vectorisation, > > > > > > > > > > > since like you say, the statements are often > > > > > > > > > > > unconditional and > > operate > > > > > > > > > > > on all lanes. > > > > > > > > > > > > > > > > > > > > So it seems we want to query if there's a lowpart > > > > > > > > > > fixed-size vector > > > > > > > > > > mode available for a given other mode. It seems to me that > > > > > > > > > > we > > > > > > > > > > should have a way to query for this already without having > > > > > > > > > > a new > > > > > > > > > > target hook using general code? > > > > > > > > > > > > > > > > So any answer to this? You should be able to iterate over all > > > > > > > > vector modes, look for those with fixed size and fitting the > > > > > > > > lane constraint and then asking whether the modes are tieable > > > > > > > > or whatever else is the correct way to verify the constraint? > > > > > > > > > > > > > > > > So sth as simple as > > > > > > > > > > > > > > > > mode = mode_for_vector (GET_MODE_INNER (vmode), ceil_pow2 > > (const- > > > > > > > > nunits)); > > > > > > > > if (targetm.modes_tieable_p (mode, vmode)) > > > > > > > > return mode; > > > > > > > > > > > > > > > > ? Why do we need a target hook for this? What's the "hidden" > > > > > > > > constraint I'm missing? > > > > > > > > > > > > > > > > > > > > > > Richard can correct me if I'm wrong (probably) but the problem > > > > > > > with this > > > > > > > is that it won't work with VLS e.g. -msve-vector-bits because the > > > > > > > SVE > > modes > > > > > > > are fixed size then. Secondly it'll have issues respecting > > > > > > > --param aarch64- > > > > > > autovec-preference= > > > > > > > as this is intended to only affect autovec where mode_for_vector > > > > > > > is > > general. > > > > > > > > > > > > > > The core of this optimization is that it must change to Adv. SIMD > > > > > > > over SVE > > > > > > modes. > > > > > > > > > > > > OK, so SVE VLS -msve-vector-bits=128 modes are indistinguishable > > > > > > from > > Adv. > > > > > > SIMD > > > > > > modes by the middle-end? > > > > > > > > > > I believe so, the ACLE types have an annotation on them to lift some > > > > > of the > > > > > restrictions but the modes are the same. > > > > > > > > > > > Is there a way to distinguish them, say, by cost > > > > > > (target_reg_cost?)? Since any use of a SVE reg will require a > > > > > > predicate reg? > > > > > > > > > > > > > > > > We do have unpredicated SVE instructions, but yes costing could work. > > > > > Essentially what we're trying to do is find the cheapest mode to > > > > > perform > > > > > the operation on. > > > > > > > > > > This could work.. But how would we incorporate it into the costing? 
> > > > > Part of > > > > > the problem is that to iterate over similar modes with the same > > > > > element size > > > > > likely requires some target input no? Or are you saying we should > > > > > only > > > > > iterate over fixed size modes? > > > > > > > > > > Regards, > > > > > Tamar > > > > > > > > > > > I think we miss a critical bit of information in the middle-end > > > > > > here and I'm > > > > > > trying to see what piece of information that actually is. > > "find_subvector_type" > > > > > > doesn't seem to be it, it's maybe using that hidden bit of > > > > > > information for > > > > > > one specific use-case but it would be far better to have a way for > > > > > > the target > > > > > > to communicate the missing bit of information in a more generic way? > > > > > > We can then wrap a "find_subvector_type" around that. > > > > > > > > So for this one sth like targetm.mode_requires_predication ()? But > > > > as Tamar says above this really depends on the operation. But the > > > > optabs do _not_ expose this requirement (we have non-.COND_ADD for > > > > SVE modes), but you want to take advantage of this difference. > > > > Can we access insn attributes from optab entries? Could we add > > > > some "standard" attribute noting that an insn requires a predicate? > > > > But of course that likely depends on the alternative? > > > > > > > > > > We'd likely also require the mask that would be used, because I think > > > otherwise > > > targetm.mode_requires_predication would be a bit ambiguous for non-flag > > setting > > > instructions or instructions that don’t do cross lane operations. > > > > > > e.g. SVE has both COND_ADD and ADD. But the key here is that if we know > > > we'll > > > access the bottom 64 or 128 bits we could use an Adv. SIMD ADD. > > > > But SVE ADD still requires a predicate register (with all lanes enabled), > > no? That's the whole point of the optimization we're discussing? > > I see the only problem with -msve-vector-bits=N where GET_MODE_SIZE > > is no longer a POLY_INT - otherwise that would be the easy > > way to identify Adv. SIMD vs. SVE and heuristically prefer > > fixed-size modes in the vectorizer when possible (for small known > > niter <= the fixed-size mode number of lanes). But with > > -msve-vector-bits=128 GET_MODE_SIZE for Adv. SIMD and SVE is equal(?), > > so we need another way to distinguish. Because even with > > -msve-vector-bits=128 you need the predicate register appropriately > > set up as I understand you are not altering the SVE HW config which > > would be also possible(?), but I'm not sure that would make it > > possible to have a predicate register less ADD instruction. > > > > What SVE register taking machine instructions do not explicitly/implicitly > > use one of the SVE predicate registers? > > Many, ADD for instance is this > https://developer.arm.com/documentation/ddi0602/2025-03/SVE-Instructions/ADD--vectors--unpredicated---Add-vectors--unpredicated-- > > And SVE2 added many more. GCC already takes advantage of this and drops > predicates entirely when it can to avoid the dependency on the predicate pipe. > > Those are actually different instructions not just aliases.
I see. So this clearly is a feature on instructions then, not modes. In fact it might be profitable to use unpredicated add to avoid computing the loop mask for a specific element width completely even when that would require more operations for a wide SVE implementation. For the patch at hand I would suggest to re-post without a new target hook, ignoring the -msve-vector-bits complication for now and simply key on GET_MODE_SIZE being POLY_INT, having a vectorizer local helper like tree get_fixed_size_vectype (tree old_vectype, unsigned nlanes-upper-bound) ? > BIC for instance > https://developer.arm.com/documentation/ddi0602/2025-03/SVE-Instructions/BIC--vectors--unpredicated---Bitwise-clear-vectors--unpredicated-- > > Regards, > Tamar > > > > > > Where the > > > operation would be beneficial for longer VL cores where the Adv. SIMD > > > vector > > > pipes are multiplexed on the SVE ones. Such as Neoverse-V1, but not > > > Neoverse- > > V2. > > > > > > Without the predicate being considered (or niters) SVE would have to > > > return false > > > for the hook. > > > > > > Which is why an attribute may be tricky. > > > > > > Regards, > > > Tamar > > > > > > > Richard. > > > > > > > > > > > > > > Thanks, > > > > > > Richard. > > > > > > > > > > > > > > > > It doesn't really fit related_vector_mode I guess. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Yeah I don't think it would work unless as Richard mentioned we > > > > > > > > > have > > > > > > > > > an argument to indicate which SIMD class you want to end up > > > > > > > > > with. > > > > > > > > > > > > > > > > > > > I also wonder if we can take it as a given that SVE and neon > > > > > > > > > > inter-operate efficiently for all implementations, without > > > > > > > > > > some > > > > > > > > > > kind of "rewriting" penalty. Like in the above example we > > > > > > > > > > set NEON v31 and use it as source for a SVE store via z31. > > > > > > > > > > > > > > > > > > > > > > > > > > > > It's an architectural requirement that the register files > > > > > > > > > overlap. > > > > > > > > > This is described in section B1.2 (Registers in AArch64 > > > > > > > > > execution state). > > > > > > > > > Any core that ever implements non-overlapping registers where > > > > > > > > > they > > > > > > > > > would need a rewiring penalty has bigger problems to worry > > > > > > > > > about, > > > > > > > > > as an example the architecture describes that writing to the > > > > > > > > > lower part > > of > > > > > > > > > an Adv. SIMD register will clear the top parts up to VL of the > > > > > > > > > SVE "view". > > > > > > > > > > > > > > > > > > Any such rewiring would mean that Adv. SIMD and Scalar FPR > > instructions > > > > > > > > > become useless due to the defined state of the larger views > > > > > > > > > on the > > > > register. > > > > > > > > > > > > > > > > I'm aware of the architectural requirement - it's just that I > > > > > > > > could > > > > > > > > think of the HW re-configuring itself for masked operations and > > > > > > > > thus > > > > > > > > switching back and forth might incur some penalty. If it is > > > > > > > > common > > > > > > > > practice to mix SVE and NEON this way it's of course unlikely > > > > > > > > such > > > > > > > > a uarch would be successful. But then, powering off mask & high > > > > > > > > vector > > > > > > > > part handling logic when facing NEON might be a possibility.
> > > > > > > > > > > > > > > Yeah the reason for this optimization has more to do with how the > > > > > > > vector > > > > > > > pipes are split between Adv. SIMD and SVE. An easy one is say > > > > > > > reductions, > > > > > > > the bigger VL the more expensive in-order reductions like addv > > > > > > > become. > > > > > > > But Adv. SIMD reductions have a fixed cost, and if we know we > > > > > > > only need > > to > > > > > > > reduce the bottom N-bits it'll always beat SVE reductions. > > > > > > > > > > > > > > Others like MUL just have a higher throughput in Adv. SIMD vs SVE > > > > > > > on e.g. > > VL > > > > > > 256 > > > > > > > bit cores. So it's not just the masking but vector length in > > > > > > > general. > > > > > > > > > > > > > > And the reason we don't pick Adv. SIMD for such loops is that SVE > > > > > > > allows > > > > partial > > > > > > masking, > > > > > > > so for e.g. MUL it's ok for us to multiply with an unknown valued > > > > > > > lane > > since > > > > > > predication > > > > > > > makes the usages of the result safe. > > > > > > > > > > > > > > Thanks, > > > > > > > Tamar > > > > > > > > > > > > > > > Richard. > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Tamar > > > > > > > > > > > > > > > > > > > Richard. > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > Richard
--
Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
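For illustration, the vectorizer-local helper suggested above might look roughly like the sketch below. This is only a sketch of the idea under discussion, not code from the patch or the thread: the name get_fixed_size_vectype follows the suggestion in the reply, but the exact checks (rounding the lane bound up to a power of two, querying mode_for_vector and targetm.modes_tieable_p, keying on a non-constant GET_MODE_SIZE) are assumptions about how it could be wired up, and the code is untested.

static tree
get_fixed_size_vectype (tree old_vectype, unsigned int nlanes)
{
  machine_mode old_mode = TYPE_MODE (old_vectype);

  /* Only interesting when the current vector mode is variable-length,
     i.e. its size is a POLY_INT rather than a compile-time constant.  */
  if (GET_MODE_SIZE (old_mode).is_constant ())
    return NULL_TREE;

  /* Fixed-size vector modes have power-of-two lane counts, so round
     the upper bound on the number of lanes up to a power of two.  */
  unsigned HOST_WIDE_INT nunits = HOST_WIDE_INT_1U << ceil_log2 (nlanes);

  /* Look for a fixed-size vector mode with the same element mode.  */
  machine_mode fixed_mode;
  if (!mode_for_vector (GET_MODE_INNER (old_mode), nunits).exists (&fixed_mode)
      || !VECTOR_MODE_P (fixed_mode)
      || !GET_MODE_SIZE (fixed_mode).is_constant ()
      /* Reusing the register as its lowpart is only valid if the two
	 modes tie.  */
      || !targetm.modes_tieable_p (fixed_mode, old_mode))
    return NULL_TREE;

  return build_vector_type_for_mode (TREE_TYPE (old_vectype), fixed_mode);
}

A caller would presumably only try this when the loop's known trip count fits in nlanes lanes, falling back to old_vectype when the helper returns NULL_TREE.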