> On 2 Oct 2024, at 14:34, Tamar Christina <tamar.christ...@arm.com> wrote:
>
> External email: Use caution opening links or attachments
>
>
>> -----Original Message-----
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Sent: Wednesday, October 2, 2024 1:09 PM
>> To: Richard Sandiford <richard.sandif...@arm.com>
>> Cc: Tamar Christina <tamar.christ...@arm.com>; Jennifer Schmitz
>> <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Kyrylo Tkachov
>> <ktkac...@exchange.nvidia.com>
>> Subject: Re: [PATCH] [PR113816] AArch64: Use SVE bit op reduction for vector reductions
>>
>>> On 2 Oct 2024, at 13:43, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>>
>>> External email: Use caution opening links or attachments
>>>
>>> Tamar Christina <tamar.christ...@arm.com> writes:
>>>> Hi Jennifer,
>>>>
>>>>> -----Original Message-----
>>>>> From: Richard Sandiford <richard.sandif...@arm.com>
>>>>> Sent: Tuesday, October 1, 2024 12:20 PM
>>>>> To: Jennifer Schmitz <jschm...@nvidia.com>
>>>>> Cc: gcc-patches@gcc.gnu.org; Kyrylo Tkachov <ktkac...@exchange.nvidia.com>
>>>>> Subject: Re: [PATCH] [PR113816] AArch64: Use SVE bit op reduction for vector reductions
>>>>>
>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>>> This patch implements the optabs reduc_and_scal_<mode>,
>>>>>> reduc_ior_scal_<mode>, and reduc_xor_scal_<mode> for Advanced SIMD
>>>>>> integers for TARGET_SVE in order to use the SVE instructions ANDV,
>>>>>> ORV, and EORV for fixed-width bitwise reductions.
>>>>>> For example, the test case
>>>>>>
>>>>>> int32_t foo (int32_t *a)
>>>>>> {
>>>>>>   int32_t b = -1;
>>>>>>   for (int i = 0; i < 4; ++i)
>>>>>>     b &= a[i];
>>>>>>   return b;
>>>>>> }
>>>>>>
>>>>>> was previously compiled to
>>>>>> (-O2 -ftree-vectorize --param aarch64-autovec-preference=asimd-only):
>>>>>> foo:
>>>>>>         ldp     w2, w1, [x0]
>>>>>>         ldp     w3, w0, [x0, 8]
>>>>>>         and     w1, w1, w3
>>>>>>         and     w0, w0, w2
>>>>>>         and     w0, w1, w0
>>>>>>         ret
>>>>>>
>>>>>> With the patch, it is compiled to:
>>>>>> foo:
>>>>>>         ldr     q31, [x0]
>>>>>>         ptrue   p7.b, all
>>>>>>         andv    s31, p7, z31.s
>>>>>>         fmov    w0, s31
>>>>>>         ret
>>>>>>
>>>>>> Test cases were added to check the produced assembly for use of SVE
>>>>>> instructions.
>>>>>
>>>>> I would imagine that in this particular case, the scalar version is
>>>>> better.  But I agree it's a useful feature for other cases.
>>>>
>>>> Yeah, I'm concerned because ANDV and other reductions are extremely expensive.
>>>> But assuming the reductions are done outside of a loop then it should be ok, though.
>>>>
>>>> The issue is that the reduction latency grows with VL, so e.g. compare the
>>>> latencies and throughput for Neoverse V1 and Neoverse V2.  So I think we
>>>> want to gate this on VL128.
>>>>
>>>> As an aside, is the sequence correct?  With an ORR reduction ptrue makes
>>>> sense, but for VL > 128 ptrue doesn't work, as the top bits would be zero.
>>>> So an ANDV over zero-valued lanes would result in zero.
>>>
>>> Argh!  Thanks for spotting that.  I'm kicking myself for missing it :(
>>>
>>>> You'd want to predicate the ANDV with the size of the vector being reduced.
>>>> The same is true for SMIN and SMAX.
>>>>
>>>> I do wonder whether we need to split the pattern into two, where w->w uses
>>>> the SVE instructions but w->r uses Adv SIMD.
>>>>
>>>> In the case of w->r, as in the example above,
>>>>
>>>>         ext     v1.16b, v0.16b, v0.16b, #8
>>>>         and     v0.8b, v0.8b, v1.8b
>>>>         fmov    x8, d0
>>>>         lsr     x9, x8, #32
>>>>         and     w0, w8, w9
>>>>
>>>> would beat the ADDV on pretty much every uarch.
>>>>
>>>> But I'll leave it up to the maintainers.
>>>
>>> Also a good point.  And since these are integer reductions, an r
>>> result is more probable than a w result.  w would typically only
>>> be used if the result is stored directly to memory.
>>>
>>> At which point, the question (which you might have been implying)
>>> is whether it's worth doing this at all, given the limited cases
>>> for which it's beneficial, and the complication that's needed to
>>> (a) detect those cases and (b) make them work.
>>
>> These are good points in the thread.  Maybe it makes sense to do this only
>> for V16QI reductions?
>> Maybe a variant of Tamar's w->r sequence wins out even there.
>
> I do agree that they're worth implementing, and also for 64-bit vectors
> (there you skip the first reduction and just fmov the value to a GPR, since
> you don't have the initial 128 -> 64 bit reduction step).
>
> But I think at the moment they're possibly not modelled as reductions in our
> cost model.  Like Richard mentioned, I don't think the low-iteration cases
> should vectorize; they should instead just unroll.
>
>>
>> Originally I had hoped that we'd tackle the straight-line case from PR113816,
>> but it seems that GCC didn't even try to create a reduction op for the code
>> there.  Maybe that's something to look into separately.
>
> Yeah, I think unrolled scalar is going to beat the ORV there, as you can have
> better throughput doing the reductions in pairs.
>
>>
>> Also, for the alternative test case that we tried to use as a motivation:
>>
>> char sior_loop (char *a)
>> {
>>   char b = 0;
>>   for (int i = 0; i < 16; ++i)
>>     b |= a[i];
>>   return b;
>> }
>>
>> GCC generates some terrible code: https://godbolt.org/z/a68jodKca
>> So it feels that we should do something to improve the bitwise reductions
>> for AArch64 regardless.
>> Maybe we need to agree on the optimal sequences for the various modes and
>> implement them.
>
> Yeah, I do think they're worth having!  Because the vectorizer's fallback
> method of doing these reductions without the optab is element-wise, but on
> the SIMD side.  I think the optimal sequence for all of them is a mix of
> SIMD + GPR.

I revised the patch and resubmitted it in
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/664981.html.
Best,
Jennifer

> Cheers,
> Tamar
>
>> Thanks,
>> Kyrill
>>
>>> Thanks,
>>> Richard
>>>
>>>> Thanks for the patch though!
>>>>
>>>> Cheers,
>>>> Tamar
>>>>
>>>>> However, a natural follow-on would be to use SMAXV and UMAXV for V2DI.
>>>>> Taking that into account:
>>>>>
>>>>>> [...]
>>>>>> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
>>>>>> index bfa28849adf..0d9e5cebef0 100644
>>>>>> --- a/gcc/config/aarch64/aarch64-sve.md
>>>>>> +++ b/gcc/config/aarch64/aarch64-sve.md
>>>>>> @@ -8927,6 +8927,28 @@
>>>>>>    "<su>addv\t%d0, %1, %2.<Vetype>"
>>>>>>  )
>>>>>>
>>>>>> +;; Unpredicated logical integer reductions for Advanced SIMD modes.
>>>>>> +(define_expand "reduc_<optab>_scal_<mode>"
>>>>>> +  [(set (match_operand:<VEL> 0 "register_operand")
>>>>>> +        (unspec:<VEL> [(match_dup 2)
>>>>>> +                       (match_operand:VQ_I 1 "register_operand")]
>>>>>> +                      SVE_INT_REDUCTION_LOGICAL))]
>>>>>> +  "TARGET_SVE"
>>>>>> +  {
>>>>>> +    operands[2] = aarch64_ptrue_reg (<VPRED>mode);
>>>>>> +  }
>>>>>> +)
>>>>>> +
>>>>>> +;; Predicated logical integer reductions for Advanced SIMD modes.
>>>>>> +(define_insn "*aarch64_pred_reduc_<optab>_<mode>"
>>>>>> +  [(set (match_operand:<VEL> 0 "register_operand" "=w")
>>>>>> +        (unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")
>>>>>> +                       (match_operand:VQ_I 2 "register_operand" "w")]
>>>>>> +                      SVE_INT_REDUCTION_LOGICAL))]
>>>>>> +  "TARGET_SVE"
>>>>>> +  "<sve_int_op>\t%<Vetype>0, %1, %Z2.<Vetype>"
>>>>>> +)
>>>>>> +
>>>>>
>>>>> ...I think we should avoid adding more patterns, and instead extend
>>>>> the existing:
>>>>>
>>>>> (define_expand "reduc_<optab>_scal_<mode>"
>>>>>   [(set (match_operand:<VEL> 0 "register_operand")
>>>>>         (unspec:<VEL> [(match_dup 2)
>>>>>                        (match_operand:SVE_FULL_I 1 "register_operand")]
>>>>>                       SVE_INT_REDUCTION))]
>>>>>   "TARGET_SVE"
>>>>>   {
>>>>>     operands[2] = aarch64_ptrue_reg (<VPRED>mode);
>>>>>   }
>>>>> )
>>>>>
>>>>> to all Advanced SIMD integer modes except V1DI.  This would involve
>>>>> adding a new mode iterator along the same lines as SVE_FULL_SDI_SIMD
>>>>> (SVE_FULL_I_SIMD?).
>>>>>
>>>>> Then we could change the name of:
>>>>>
>>>>> (define_expand "reduc_<optab>_scal_<mode>"
>>>>>   [(match_operand:<VEL> 0 "register_operand")
>>>>>    (unspec:VDQ_BHSI [(match_operand:VDQ_BHSI 1 "register_operand")]
>>>>>                     MAXMINV)]
>>>>>   "TARGET_SIMD"
>>>>>   {
>>>>>     rtx elt = aarch64_endian_lane_rtx (<MODE>mode, 0);
>>>>>     rtx scratch = gen_reg_rtx (<MODE>mode);
>>>>>     emit_insn (gen_aarch64_reduc_<optab>_internal<mode> (scratch,
>>>>>                                                          operands[1]));
>>>>>     emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
>>>>>     DONE;
>>>>>   }
>>>>> )
>>>>>
>>>>> to something like:
>>>>>
>>>>> (define_expand "@aarch64_advsimd_<optab><mode>"
>>>>>
>>>>> The expander above would then be something like:
>>>>>
>>>>> (define_expand "reduc_<optab>_scal_<mode>"
>>>>>   [(set (match_operand:<VEL> 0 "register_operand")
>>>>>         (unspec:<VEL> [(match_dup 2)
>>>>>                        (match_operand:SVE_FULL_I_SIMD 1 "register_operand")]
>>>>>                       SVE_INT_REDUCTION))]
>>>>>   "TARGET_SVE
>>>>>    || (maybe_code_for_aarch64_advsimd (<SVE_INT_REDUCTION>, <MODE>mode)
>>>>>        != CODE_FOR_nothing)"
>>>>>   {
>>>>>     if (rtx pat = maybe_gen_aarch64_advsimd (<SVE_INT_REDUCTION>,
>>>>>                                              <MODE>mode))
>>>>>       {
>>>>>         emit_insn (pat);
>>>>>         DONE;
>>>>>       }
>>>>>     operands[2] = aarch64_ptrue_reg (<VPRED>mode);
>>>>>   }
>>>>> )
>>>>>
>>>>> which acts as a kind of union pattern for SVE and Advanced SIMD.
>>>>>
>>>>> This would handle the SMAXV and UMAXV cases as well as the ANDV, ORV,
>>>>> and EORV cases.
>>>>>
>>>>> Not tested, so let me know if it doesn't work :)
>>>>>
>>>>> Thanks,
>>>>> Richard
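
For anyone following along, here is a rough scalar model of the predicate-width
issue raised above, assuming a 256-bit VL where the Advanced SIMD input only
occupies the low 128 bits of the Z register: an ANDV under an all-lanes ptrue
also folds in the zeroed upper lanes, whereas a predicate limited to the
128-bit part gives the intended result.  The lane values are made up purely
for illustration.

#include <stdio.h>
#include <stdint.h>

int
main (void)
{
  /* Eight 32-bit lanes model a 256-bit Z register; only the low four
     lanes hold the Advanced SIMD data, the upper four are zero.  */
  uint32_t z[8] = { 0xff, 0xff, 0xff, 0xff, 0, 0, 0, 0 };

  uint32_t and_all_lanes = ~0u;   /* Models ANDV under an all-lanes ptrue.  */
  for (int i = 0; i < 8; i++)
    and_all_lanes &= z[i];

  uint32_t and_low_128 = ~0u;     /* Models ANDV predicated on the low 128 bits.  */
  for (int i = 0; i < 4; i++)
    and_low_128 &= z[i];

  /* Prints 0 for the all-lanes version and 0xff for the 128-bit one.  */
  printf ("all lanes: %#x, low 128 bits: %#x\n",
          (unsigned) and_all_lanes, (unsigned) and_low_128);
  return 0;
}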
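
And as a sketch of the kind of mixed SIMD + GPR sequence discussed for the
w->r case, an OR reduction of 16 bytes could be written with Neon intrinsics
roughly as below.  The function name is made up and this is only meant to
show the shape of the sequence (one vector ORR to fold 128 bits to 64, then
scalar shifts and ORs in a GPR), not a proposed implementation.

#include <arm_neon.h>
#include <stdint.h>

uint8_t
ior_reduce_v16qi (const uint8_t *a)
{
  /* Fold 128 bits to 64 with a single vector ORR.  */
  uint8x16_t v = vld1q_u8 (a);
  uint8x8_t folded = vorr_u8 (vget_low_u8 (v), vget_high_u8 (v));

  /* Move the 64-bit result to a GPR and finish the reduction there.  */
  uint64_t x = vget_lane_u64 (vreinterpret_u64_u8 (folded), 0);
  x |= x >> 32;
  x |= x >> 16;
  x |= x >> 8;
  return (uint8_t) x;
}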