[PATCH] aarch64: Emit ADD X, Y, Y instead of SHL X, Y, #1 for SVE instructions
On Neoverse V2, SVE ADD instructions have a throughput of 4, while shift
instructions like SHL have a throughput of 2. We can lean on that to emit code
like:

	add	z31.b, z31.b, z31.b

instead of:

	lsl	z31.b, z31.b, #1

The implementation of this change for SVE vectors is similar to a prior patch
<https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659958.html> that adds
the above functionality for Neon vectors.

Here, the machine descriptor pattern is split up to separately accommodate left
and right shifts, so we can specifically emit an add for all left shifts by 1.

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR

gcc/ChangeLog:

	* config/aarch64/aarch64-sve.md (*post_ra_v3): Split pattern to
	accommodate left and right shifts separately.
	(*post_ra_v_ashl3): Matches left shifts with additional constraint to
	check for shifts by 1.
	(*post_ra_v_3): Matches right shifts.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/acle/asm/lsl_s16.c: Updated instances of
	lsl-1 with corresponding add.
	* gcc.target/aarch64/sve/acle/asm/lsl_s32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_s64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_s8.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u8.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_s16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_s32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_s8.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_u16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_u32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_u8.c: Likewise.
	* gcc.target/aarch64/sve/adr_1.c: Likewise.
	* gcc.target/aarch64/sve/adr_6.c: Likewise.
	* gcc.target/aarch64/sve/cond_mla_7.c: Likewise.
	* gcc.target/aarch64/sve/cond_mla_8.c: Likewise.
	* gcc.target/aarch64/sve/shift_2.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_u64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_u64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s16.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s32.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s8.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u16.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u32.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u8.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_u64.c: Likewise.
	* gcc.target/aarch64/sve/sve_shl_add.c: New test.

0001-aarch64-Emit-ADD-X-Y-Y-instead-of-SHL-X-Y-1-for-SVE-.patch
Description: 0001-aarch64-Emit-ADD-X-Y-Y-instead-of-SHL-X-Y-1-for-SVE-.patch
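[Editorial note: for illustration only, a minimal sketch of the kind of source affected. This is a hypothetical test function, not necessarily the new sve_shl_add.c; the expected instruction selection is inferred from the description above.]

#include <arm_sve.h>

/* A left shift by 1 on an SVE vector; with this patch the expectation is
   add z0.b, z0.b, z0.b rather than lsl z0.b, z0.b, #1.  */
svuint8_t
shift_by_one (svuint8_t x)
{
  return svlsl_n_u8_x (svptrue_b8 (), x, 1);
}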
Re: [PATCH] aarch64: Emit ADD X, Y, Y instead of SHL X, Y, #1 for SVE instructions
> On 12 Sep 2024, at 7:22 PM, Richard Sandiford > wrote: > > External email: Use caution opening links or attachments > > > Richard Biener writes: >> On Thu, Sep 12, 2024 at 2:35 PM Richard Sandiford >> wrote: >>> >>> Soumya AR writes: >>>> On Neoverse V2, SVE ADD instructions have a throughput of 4, while shift >>>> instructions like SHL have a throughput of 2. We can lean on that to emit >>>> code >>>> like: >>>> add z31.b, z31.b, z31.b >>>> instead of: >>>> lsl z31.b, z31.b, #1 >>>> >>>> The implementation of this change for SVE vectors is similar to a prior >>>> patch >>>> <https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659958.html> that >>>> adds >>>> the above functionality for Neon vectors. >>>> >>>> Here, the machine descriptor pattern is split up to separately accommodate >>>> left >>>> and right shifts, so we can specifically emit an add for all left shifts >>>> by 1. >>> >>> Thanks for doing this. >> >> I do wonder whether our scheduling infrastructure has the ability to "mutate" >> instructions in cases like here if either adds or shifts exceed their >> available resources >> but there is a resource readily available in an alternate instruction form? > > Yeah, that sounds like a useful feature in general. But in this particular > case, the shift resources are a subset of the addition resources, so there > should never be a specific advantage to using shifts. > > Thanks, > Richard > >> Richard. >> >>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no >>>> regression. >>>> OK for mainline? >>>> >>>> Signed-off-by: Soumya AR >>>> >>>> gcc/ChangeLog: >>>> >>>> * config/aarch64/aarch64-sve.md (*post_ra_v3): Split >>>> pattern to >>>> accomodate left and right shifts separately. >>>> (*post_ra_v_ashl3): Matches left shifts with additional >>>> constraint to >>>> check for shifts by 1. >>>> (*post_ra_v_3): Matches right shifts. >>>> >>>> gcc/testsuite/ChangeLog: >>>> >>>> * gcc.target/aarch64/sve/acle/asm/lsl_s16.c: Updated instances of >>>> lsl-1 with >>>> corresponding add >>>> * gcc.target/aarch64/sve/acle/asm/lsl_s32.c: Likewise. >>>> * gcc.target/aarch64/sve/acle/asm/lsl_s64.c: Likewise. >>>> * gcc.target/aarch64/sve/acle/asm/lsl_s8.c: Likewise. >>>> * gcc.target/aarch64/sve/acle/asm/lsl_u16.c: Likewise. >>>> * gcc.target/aarch64/sve/acle/asm/lsl_u32.c: Likewise. >>>> * gcc.target/aarch64/sve/acle/asm/lsl_u64.c: Likewise. >>>> * gcc.target/aarch64/sve/acle/asm/lsl_u8.c: Likewise. >>>> * gcc.target/aarch64/sve/acle/asm/lsl_wide_s16.c: Likewise. >>>> * gcc.target/aarch64/sve/acle/asm/lsl_wide_s32.c: Likewise. >>>> * gcc.target/aarch64/sve/acle/asm/lsl_wide_s8.c: Likewise. >>>> * gcc.target/aarch64/sve/acle/asm/lsl_wide_u16.c: Likewise. >>>> * gcc.target/aarch64/sve/acle/asm/lsl_wide_u32.c: Likewise. >>>> * gcc.target/aarch64/sve/acle/asm/lsl_wide_u8.c: Likewise. >>>> * gcc.target/aarch64/sve/adr_1.c: Likewise. >>>> * gcc.target/aarch64/sve/adr_6.c: Likewise. >>>> * gcc.target/aarch64/sve/cond_mla_7.c: Likewise. >>>> * gcc.target/aarch64/sve/cond_mla_8.c: Likewise. >>>> * gcc.target/aarch64/sve/shift_2.c: Likewise. >>>> * gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather Likewise. >>>> * gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_u64.c: Likewise. >>>> * gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_s64.c: Likewise. >>>> * gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_u64.c: Likewise. >>>> * gcc.target/aarch64/sve2/acle/asm/rshl_s16.c: Likewise. >>>> * gcc.target/aarch64/sve2/acle/asm/rshl_s32.c: Likewise. 
>>>> * gcc.target/aarch64/sve2/acle/asm/rshl_s64.c: Likewise. >>>> * gcc.target/aarch64/sve2/acle/asm/rshl_s8.c: Likewise. >>>> * gcc.target/aarch64/sve2/acle/asm/rshl_u16.c: Likewise. >>>> * gcc.target/aarch64/sve2/acle/asm/rshl_u32.c: Likewise. >>>> * gcc.target/aarch64/sve2/acle/asm/rshl_u64.c:
Re: [PATCH] aarch64: Expand CTZ to RBIT + CLZ for SVE [PR109498]
> On 1 Oct 2024, at 6:17 PM, Richard Sandiford > wrote: > > External email: Use caution opening links or attachments > > > Soumya AR writes: >> Currently, we vectorize CTZ for SVE by using the following operation: >> .CTZ (X) = (PREC - 1) - .CLZ (X & -X) >> >> Instead, this patch expands CTZ to RBIT + CLZ for SVE, as suggested in >> PR109498. >> >> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. >> OK for mainline? >> >> Signed-off-by: Soumya AR >> >> gcc/ChangeLog: >> PR target/109498 >> * config/aarch64/aarch64-sve.md (ctz2): Added pattern to expand >>CTZ to RBIT + CLZ for SVE. >> >> gcc/testsuite/ChangeLog: >> PR target/109498 >> * gcc.target/aarch64/sve/ctz.c: New test. > > Generally looks good, but a couple of comments: > >> --- >> gcc/config/aarch64/aarch64-sve.md | 16 +++ >> gcc/testsuite/gcc.target/aarch64/sve/ctz.c | 49 ++ >> 2 files changed, 65 insertions(+) >> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/ctz.c >> >> diff --git a/gcc/config/aarch64/aarch64-sve.md >> b/gcc/config/aarch64/aarch64-sve.md >> index bfa28849adf..10094f156b3 100644 >> --- a/gcc/config/aarch64/aarch64-sve.md >> +++ b/gcc/config/aarch64/aarch64-sve.md >> @@ -3088,6 +3088,22 @@ >> ;; - NOT >> ;; - >> >> +(define_expand "ctz2" >> + [(set (match_operand:SVE_I 0 "register_operand") >> + (unspec:SVE_I >> + [(match_dup 2) >> +(ctz:SVE_I >> + (match_operand:SVE_I 1 "register_operand"))] >> + UNSPEC_PRED_X))] >> + "TARGET_SVE" >> + { >> + operands[2] = aarch64_ptrue_reg (mode); > > There's no real need to use operands[...] here. It can just be > a local variable. > >> + emit_insn (gen_aarch64_pred_rbit (operands[0], >> operands[2],operands[1])); >> + emit_insn (gen_aarch64_pred_clz (operands[0], operands[2], >> operands[0])); > > Formatting nit: C++ lines should be 80 characters or fewer. > > More importantly, I think we should use a fresh register for the > temporary (RBIT) result, since that tends to permit more optimisation. Thanks for the feedback! Attaching an updated patch with the suggested changes. Regards, Soumya > Thanks, > Richard > > > >> + DONE; >> + } >> +) >> + >> ;; Unpredicated integer unary arithmetic. >> (define_expand "2" >> [(set (match_operand:SVE_I 0 "register_operand") >> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/ctz.c >> b/gcc/testsuite/gcc.target/aarch64/sve/ctz.c >> new file mode 100644 >> index 000..433a9174f48 >> --- /dev/null >> +++ b/gcc/testsuite/gcc.target/aarch64/sve/ctz.c >> @@ -0,0 +1,49 @@ >> +/* { dg-final { check-function-bodies "**" "" } } */ >> +/* { dg-options "-O3 --param aarch64-autovec-preference=sve-only" } */ >> + >> +#include >> + >> +#define FUNC(FUNCTION, NAME, DTYPE) \ >> +void \ >> +NAME (DTYPE *__restrict x, DTYPE *__restrict y, int n) { \ >> + for (int i = 0; i < n; i++)\ >> +x[i] = FUNCTION (y[i]); \ >> +}\ >> + >> + >> +/* >> +** ctz_uint8: >> +** ... >> +** rbitz[0-9]+\.b, p[0-7]/m, z[0-9]+\.b >> +** clz z[0-9]+\.b, p[0-7]/m, z[0-9]+\.b >> +** ... >> +*/ >> +FUNC (__builtin_ctzg, ctz_uint8, uint8_t) >> + >> +/* >> +** ctz_uint16: >> +** ... >> +** rbitz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h >> +** clz z[0-9]+\.h, p[0-7]/m, z[0-9]+\.h >> +** ... >> +*/ >> +FUNC (__builtin_ctzg, ctz_uint16, uint16_t) >> + >> +/* >> +** ctz_uint32: >> +** ... >> +** rbitz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s >> +** clz z[0-9]+\.s, p[0-7]/m, z[0-9]+\.s >> +** ... >> +*/ >> +FUNC (__builtin_ctz, ctz_uint32, uint32_t) >> + >> +/* >> +** ctz_uint64: >> +** ... 
>> +** rbitz[0-9]+\.d, p[0-7]/m, z[0-9]+\.d >> +** clz z[0-9]+\.d, p[0-7]/m, z[0-9]+\.d >> +** ... >> +*/ >> +FUNC (__builtin_ctzll, ctz_uint64, uint64_t) >> + >> -- >> 2.43.2 0001-aarch64-Expand-CTZ-to-RBIT-CLZ-for-SVE-PR109498.patch Description: 0001-aarch64-Expand-CTZ-to-RBIT-CLZ-for-SVE-PR109498.patch
Re: [PATCH] aarch64: Expand CTZ to RBIT + CLZ for SVE [PR109498]
Reworked the patch to substitute immediate register values in the test case
with regular expressions. Apologies for the oversight.

Thanks,
Soumya

> On 24 Sep 2024, at 8:53 AM, Soumya AR wrote:
>
> Currently, we vectorize CTZ for SVE by using the following operation:
> .CTZ (X) = (PREC - 1) - .CLZ (X & -X)
>
> Instead, this patch expands CTZ to RBIT + CLZ for SVE, as suggested in
> PR109498.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Soumya AR
>
> gcc/ChangeLog:
> 	PR target/109498
> 	* config/aarch64/aarch64-sve.md (ctz2): Added pattern to expand
> 	CTZ to RBIT + CLZ for SVE.
>
> gcc/testsuite/ChangeLog:
> 	PR target/109498
> 	* gcc.target/aarch64/sve/ctz.c: New test.

0001-aarch64-Expand-CTZ-to-RBIT-CLZ-for-SVE-PR109498.patch
Description: 0001-aarch64-Expand-CTZ-to-RBIT-CLZ-for-SVE-PR109498.patch
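[Editorial note: for reference, a loop of the shape exercised by the new test, sketched here; the ctz.c in the patch may differ in details.]

#include <stdint.h>

/* With SVE autovectorization, each element is now expected to be computed
   with rbit followed by clz instead of (PREC - 1) - clz (x & -x).  */
void
ctz_loop (uint32_t *restrict x, uint32_t *restrict y, int n)
{
  for (int i = 0; i < n; i++)
    x[i] = __builtin_ctz (y[i]);
}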
Re: SVE intrinsics: Fold constant operands for svlsl.
Hi, > On 17 Oct 2024, at 12:38 PM, Kyrylo Tkachov wrote: > > Hi Soumya > >> On 17 Oct 2024, at 06:10, Soumya AR wrote: >> >> Hi Richard, >> >> Thanks for the feedback. I’ve updated the patch with the suggested change. >> Ok for mainline? >> >> Best, >> Soumya >> >>> On 14 Oct 2024, at 6:40 PM, Richard Sandiford >>> wrote: >>> >>> External email: Use caution opening links or attachments >>> >>> >>> Soumya AR writes: >>>> This patch implements constant folding for svlsl. Test cases have been >>>> added to >>>> check for the following cases: >>>> >>>> Zero, merge, and don't care predication. >>>> Shift by 0. >>>> Shift by register width. >>>> Overflow shift on signed and unsigned integers. >>>> Shift on a negative integer. >>>> Maximum possible shift, eg. shift by 7 on an 8-bit integer. >>>> >>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no >>>> regression. >>>> OK for mainline? >>>> >>>> Signed-off-by: Soumya AR >>>> >>>> gcc/ChangeLog: >>>> >>>> * config/aarch64/aarch64-sve-builtins-base.cc (svlsl_impl::fold): >>>> Try constant folding. >>>> >>>> gcc/testsuite/ChangeLog: >>>> >>>> * gcc.target/aarch64/sve/const_fold_lsl_1.c: New test. >>>> >>>> From 0cf5223e51623dcdbc47a06cbd17d927c74094e2 Mon Sep 17 00:00:00 2001 >>>> From: Soumya AR >>>> Date: Tue, 24 Sep 2024 09:09:32 +0530 >>>> Subject: [PATCH] SVE intrinsics: Fold constant operands for svlsl. >>>> >>>> This patch implements constant folding for svlsl. Test cases have been >>>> added to >>>> check for the following cases: >>>> >>>> Zero, merge, and don't care predication. >>>> Shift by 0. >>>> Shift by register width. >>>> Overflow shift on signed and unsigned integers. >>>> Shift on a negative integer. >>>> Maximum possible shift, eg. shift by 7 on an 8-bit integer. >>>> >>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no >>>> regression. >>>> OK for mainline? >>>> >>>> Signed-off-by: Soumya AR >>>> >>>> gcc/ChangeLog: >>>> >>>> * config/aarch64/aarch64-sve-builtins-base.cc (svlsl_impl::fold): >>>> Try constant folding. >>>> >>>> gcc/testsuite/ChangeLog: >>>> >>>> * gcc.target/aarch64/sve/const_fold_lsl_1.c: New test. >>>> --- >>>> .../aarch64/aarch64-sve-builtins-base.cc | 15 +- >>>> .../gcc.target/aarch64/sve/const_fold_lsl_1.c | 133 ++ >>>> 2 files changed, 147 insertions(+), 1 deletion(-) >>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c >>>> >>>> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc >>>> b/gcc/config/aarch64/aarch64-sve-builtins-base.cc >>>> index afce52a7e8d..be5d6eae525 100644 >>>> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc >>>> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc >>>> @@ -1893,6 +1893,19 @@ public: >>>> } >>>> }; >>>> >>>> +class svlsl_impl : public rtx_code_function >>>> +{ >>>> +public: >>>> + CONSTEXPR svlsl_impl () >>>> +: rtx_code_function (ASHIFT, ASHIFT) {} >>>> + >>>> + gimple * >>>> + fold (gimple_folder &f) const override >>>> + { >>>> +return f.fold_const_binary (LSHIFT_EXPR); >>>> + } >>>> +}; >>>> + >>> >>> Sorry for the slow review. I think we should also make aarch64_const_binop >>> return 0 for LSHIFT_EXPR when the shift is out of range, to match the >>> behaviour of the underlying instruction. >>> >>> It looks good otherwise. 
>>> >>> Thanks, >>> Richard >>> >> > > In the test case: > +/* > +** s64_x_bit_width: > +** mov z[0-9]+\.b, #0 > +** ret > +*/ > +svint64_t s64_x_bit_width (svbool_t pg) { > +return svlsl_n_s64_x (pg, svdup_s64 (5), 64); > +} > + > +/* > +** s64_x_out_of_range: > +** mov z[0-9]+\.b, #0 > +** ret > +*/ > > You’ll need to adjust the scan for register zeroing according to the upcoming > changes from Tamar as per: > https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665669.html > > The patch LGTM (but Richard should approve as it’s SVE-specific) > Adjusted this patch according to Tamar’s changes. Ok for mainline? Thanks, Soumya > Thanks, > Kyrill 0001-SVE-intrinsics-Fold-constant-operands-for-svlsl.patch Description: 0001-SVE-intrinsics-Fold-constant-operands-for-svlsl.patch
Re: SVE intrinsics: Fold constant operands for svlsl.
Hi Richard, Thanks for the feedback. I’ve updated the patch with the suggested change. Ok for mainline? Best, Soumya > On 14 Oct 2024, at 6:40 PM, Richard Sandiford > wrote: > > External email: Use caution opening links or attachments > > > Soumya AR writes: >> This patch implements constant folding for svlsl. Test cases have been added >> to >> check for the following cases: >> >> Zero, merge, and don't care predication. >> Shift by 0. >> Shift by register width. >> Overflow shift on signed and unsigned integers. >> Shift on a negative integer. >> Maximum possible shift, eg. shift by 7 on an 8-bit integer. >> >> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. >> OK for mainline? >> >> Signed-off-by: Soumya AR >> >> gcc/ChangeLog: >> >> * config/aarch64/aarch64-sve-builtins-base.cc (svlsl_impl::fold): >> Try constant folding. >> >> gcc/testsuite/ChangeLog: >> >> * gcc.target/aarch64/sve/const_fold_lsl_1.c: New test. >> >> From 0cf5223e51623dcdbc47a06cbd17d927c74094e2 Mon Sep 17 00:00:00 2001 >> From: Soumya AR >> Date: Tue, 24 Sep 2024 09:09:32 +0530 >> Subject: [PATCH] SVE intrinsics: Fold constant operands for svlsl. >> >> This patch implements constant folding for svlsl. Test cases have been added >> to >> check for the following cases: >> >> Zero, merge, and don't care predication. >> Shift by 0. >> Shift by register width. >> Overflow shift on signed and unsigned integers. >> Shift on a negative integer. >> Maximum possible shift, eg. shift by 7 on an 8-bit integer. >> >> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. >> OK for mainline? >> >> Signed-off-by: Soumya AR >> >> gcc/ChangeLog: >> >> * config/aarch64/aarch64-sve-builtins-base.cc (svlsl_impl::fold): >> Try constant folding. >> >> gcc/testsuite/ChangeLog: >> >> * gcc.target/aarch64/sve/const_fold_lsl_1.c: New test. >> --- >> .../aarch64/aarch64-sve-builtins-base.cc | 15 +- >> .../gcc.target/aarch64/sve/const_fold_lsl_1.c | 133 ++ >> 2 files changed, 147 insertions(+), 1 deletion(-) >> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c >> >> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc >> b/gcc/config/aarch64/aarch64-sve-builtins-base.cc >> index afce52a7e8d..be5d6eae525 100644 >> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc >> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc >> @@ -1893,6 +1893,19 @@ public: >> } >> }; >> >> +class svlsl_impl : public rtx_code_function >> +{ >> +public: >> + CONSTEXPR svlsl_impl () >> +: rtx_code_function (ASHIFT, ASHIFT) {} >> + >> + gimple * >> + fold (gimple_folder &f) const override >> + { >> +return f.fold_const_binary (LSHIFT_EXPR); >> + } >> +}; >> + > > Sorry for the slow review. I think we should also make aarch64_const_binop > return 0 for LSHIFT_EXPR when the shift is out of range, to match the > behaviour of the underlying instruction. > > It looks good otherwise. 
> > Thanks, > Richard > >> class svmad_impl : public function_base >> { >> public: >> @@ -3199,7 +3212,7 @@ FUNCTION (svldnf1uh, svldxf1_extend_impl, >> (TYPE_SUFFIX_u16, UNSPEC_LDNF1)) >> FUNCTION (svldnf1uw, svldxf1_extend_impl, (TYPE_SUFFIX_u32, UNSPEC_LDNF1)) >> FUNCTION (svldnt1, svldnt1_impl,) >> FUNCTION (svlen, svlen_impl,) >> -FUNCTION (svlsl, rtx_code_function, (ASHIFT, ASHIFT)) >> +FUNCTION (svlsl, svlsl_impl,) >> FUNCTION (svlsl_wide, shift_wide, (ASHIFT, UNSPEC_ASHIFT_WIDE)) >> FUNCTION (svlsr, rtx_code_function, (LSHIFTRT, LSHIFTRT)) >> FUNCTION (svlsr_wide, shift_wide, (LSHIFTRT, UNSPEC_LSHIFTRT_WIDE)) >> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c >> b/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c >> new file mode 100644 >> index 000..4299dbd850e >> --- /dev/null >> +++ b/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c >> @@ -0,0 +1,133 @@ >> +/* { dg-final { check-function-bodies "**" "" } } */ >> +/* { dg-options "-O2" } */ >> + >> +#include "arm_sve.h" >> + >> +/* >> +** s64_x: >> +** mov z[0-9]+\.d, #20 >> +** ret >> +*/ >> +svint64_t s64_x (svbool_t pg) { >> +return svlsl_n_s64_x (
Re: [PATCH] SVE intrinsics: Fold constant operands for svlsl.
Pinging with updated subject, had missed the [PATCH] header before.

Regards,
Soumya

> On 24 Sep 2024, at 2:00 PM, Soumya AR wrote:
>
> This patch implements constant folding for svlsl. Test cases have been added
> to check for the following cases:
>
> Zero, merge, and don't care predication.
> Shift by 0.
> Shift by register width.
> Overflow shift on signed and unsigned integers.
> Shift on a negative integer.
> Maximum possible shift, eg. shift by 7 on an 8-bit integer.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Soumya AR
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64-sve-builtins-base.cc (svlsl_impl::fold):
> 	Try constant folding.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/sve/const_fold_lsl_1.c: New test.
>
> <0001-SVE-intrinsics-Fold-constant-operands-for-svlsl.patch>
Re: SVE intrinsics: Fold constant operands for svlsl.
> On 24 Oct 2024, at 2:55 PM, Richard Sandiford > wrote: > > External email: Use caution opening links or attachments > > > Kyrylo Tkachov writes: >>> On 24 Oct 2024, at 10:39, Soumya AR wrote: >>> >>> Hi Richard, >>> >>>> On 23 Oct 2024, at 5:58 PM, Richard Sandiford >>>> wrote: >>>> >>>> External email: Use caution opening links or attachments >>>> >>>> >>>> Soumya AR writes: >>>>> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc >>>>> b/gcc/config/aarch64/aarch64-sve-builtins.cc >>>>> index 41673745cfe..aa556859d2e 100644 >>>>> --- a/gcc/config/aarch64/aarch64-sve-builtins.cc >>>>> +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc >>>>> @@ -1143,11 +1143,14 @@ aarch64_const_binop (enum tree_code code, tree >>>>> arg1, tree arg2) >>>>> tree type = TREE_TYPE (arg1); >>>>> signop sign = TYPE_SIGN (type); >>>>> wi::overflow_type overflow = wi::OVF_NONE; >>>>> - >>>>> + unsigned int element_bytes = tree_to_uhwi (TYPE_SIZE_UNIT (type)); >>>>> /* Return 0 for division by 0, like SDIV and UDIV do. */ >>>>> if (code == TRUNC_DIV_EXPR && integer_zerop (arg2)) >>>>> return arg2; >>>>> - >>>>> + /* Return 0 if shift amount is out of range. */ >>>>> + if (code == LSHIFT_EXPR >>>>> + && tree_to_uhwi (arg2) >= (element_bytes * BITS_PER_UNIT)) >>>> >>>> tree_to_uhwi is dangerous because a general shift might be negative >>>> (even if these particular shift amounts are unsigned). We should >>>> probably also key off TYPE_PRECISION rather than TYPE_SIZE_UNIT. So: >>>> >>>> if (code == LSHIFT_EXPR >>>> && wi::geu_p (wi::to_wide (arg2), TYPE_PRECISION (type))) >>>> >>>> without the element_bytes variable. Also: the indentation looks a bit off; >>>> it should be tabs only followed by spaces only. >>> >>> Thanks for the feedback, posting an updated patch with the suggested >>> changes. >> >> Thanks Soumya, I’ve pushed this patch to trunk as commit 3e7549ece7c after >> adjusting >> the ChangeLog slightly to start the lines with tabs instead of spaces. > > Sorry Soumya, I forgot that you didn't have commit access yet. > It's time you did though. Could you follow the instructions > on https://gcc.gnu.org/gitwrite.html ? I'm happy to sponsor > (and I'm sure Kyrill would be too). Wow, that’s exciting! Kyrill has agreed to sponsor but thanks nonetheless! :) Best, Soumya > Thanks, > Richard
Re: SVE intrinsics: Fold constant operands for svlsl.
Hi Richard, > On 23 Oct 2024, at 5:58 PM, Richard Sandiford > wrote: > > External email: Use caution opening links or attachments > > > Soumya AR writes: >> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc >> b/gcc/config/aarch64/aarch64-sve-builtins.cc >> index 41673745cfe..aa556859d2e 100644 >> --- a/gcc/config/aarch64/aarch64-sve-builtins.cc >> +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc >> @@ -1143,11 +1143,14 @@ aarch64_const_binop (enum tree_code code, tree arg1, >> tree arg2) >> tree type = TREE_TYPE (arg1); >> signop sign = TYPE_SIGN (type); >> wi::overflow_type overflow = wi::OVF_NONE; >> - >> + unsigned int element_bytes = tree_to_uhwi (TYPE_SIZE_UNIT (type)); >> /* Return 0 for division by 0, like SDIV and UDIV do. */ >> if (code == TRUNC_DIV_EXPR && integer_zerop (arg2)) >> return arg2; >> - >> + /* Return 0 if shift amount is out of range. */ >> + if (code == LSHIFT_EXPR >> + && tree_to_uhwi (arg2) >= (element_bytes * BITS_PER_UNIT)) > > tree_to_uhwi is dangerous because a general shift might be negative > (even if these particular shift amounts are unsigned). We should > probably also key off TYPE_PRECISION rather than TYPE_SIZE_UNIT. So: > >if (code == LSHIFT_EXPR >&& wi::geu_p (wi::to_wide (arg2), TYPE_PRECISION (type))) > > without the element_bytes variable. Also: the indentation looks a bit off; > it should be tabs only followed by spaces only. Thanks for the feedback, posting an updated patch with the suggested changes. Thanks, Soumya > OK with those change, thanks. > > Richard > > >> + return build_int_cst (type, 0); >> if (!poly_int_binop (poly_res, code, arg1, arg2, sign, &overflow)) >> return NULL_TREE; >> return force_fit_type (type, poly_res, false, >> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c >> b/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c >> new file mode 100644 >> index 000..6109558001a >> --- /dev/null >> +++ b/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c >> @@ -0,0 +1,142 @@ >> +/* { dg-final { check-function-bodies "**" "" } } */ >> +/* { dg-options "-O2" } */ >> + >> +#include "arm_sve.h" >> + >> +/* >> +** s64_x: >> +** mov z[0-9]+\.d, #20 >> +** ret >> +*/ >> +svint64_t s64_x (svbool_t pg) { >> +return svlsl_n_s64_x (pg, svdup_s64 (5), 2); >> +} >> + >> +/* >> +** s64_x_vect: >> +** mov z[0-9]+\.d, #20 >> +** ret >> +*/ >> +svint64_t s64_x_vect (svbool_t pg) { >> +return svlsl_s64_x (pg, svdup_s64 (5), svdup_u64 (2)); >> +} >> + >> +/* >> +** s64_z: >> +** mov z[0-9]+\.d, p[0-7]/z, #20 >> +** ret >> +*/ >> +svint64_t s64_z (svbool_t pg) { >> +return svlsl_n_s64_z (pg, svdup_s64 (5), 2); >> +} >> + >> +/* >> +** s64_z_vect: >> +** mov z[0-9]+\.d, p[0-7]/z, #20 >> +** ret >> +*/ >> +svint64_t s64_z_vect (svbool_t pg) { >> +return svlsl_s64_z (pg, svdup_s64 (5), svdup_u64 (2)); >> +} >> + >> +/* >> +** s64_m_ptrue: >> +** mov z[0-9]+\.d, #20 >> +** ret >> +*/ >> +svint64_t s64_m_ptrue () { >> +return svlsl_n_s64_m (svptrue_b64 (), svdup_s64 (5), 2); >> +} >> + >> +/* >> +** s64_m_ptrue_vect: >> +** mov z[0-9]+\.d, #20 >> +** ret >> +*/ >> +svint64_t s64_m_ptrue_vect () { >> +return svlsl_s64_m (svptrue_b64 (), svdup_s64 (5), svdup_u64 (2)); >> +} >> + >> +/* >> +** s64_m_pg: >> +** mov z[0-9]+\.d, #5 >> +** lsl z[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, #2 >> +** ret >> +*/ >> +svint64_t s64_m_pg (svbool_t pg) { >> +return svlsl_n_s64_m (pg, svdup_s64 (5), 2); >> +} >> + >> +/* >> +** s64_m_pg_vect: >> +** mov z[0-9]+\.d, #5 >> +** lsl z[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, #2 >> +** ret >> +*/ >> +svint64_t 
s64_m_pg_vect (svbool_t pg) { >> +return svlsl_s64_m (pg, svdup_s64 (5), svdup_u64 (2)); >> +} >> + >> +/* >> +** s64_x_0: >> +** mov z[0-9]+\.d, #5 >> +** ret >> +*/ >> +svint64_t s64_x_0 (svbool_t pg) { >> +return svlsl_n_s64_x (pg, svdup_s64 (5), 0); >> +} >> + >> +/* >> +** s64_x_bit_width: >> +** movi? [vdz]([0-9]+)\.?(
[PATCH v2] aarch64: Optimise calls to ldexp with SVE FSCALE instruction [PR111733]
Changes since v1:

This revision makes use of the extended definition of aarch64_ptrue_reg to
generate predicate registers with the appropriate set bits.

Earlier, there was a suggestion to add support for half floats as well. I
extended the patch to include HFs but GCC still emits a libcall for ldexpf16.
For example, in the following case, the call does not lower to fscale:

_Float16 test_ldexpf16 (_Float16 x, int i) {
  return __builtin_ldexpf16 (x, i);
}

Any suggestions as to why this may be?

——

This patch uses the FSCALE instruction provided by SVE to implement the
standard ldexp family of functions.

Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the
following code:

float
test_ldexpf (float x, int i)
{
  return __builtin_ldexpf (x, i);
}

double
test_ldexp (double x, int i)
{
  return __builtin_ldexp(x, i);
}

GCC Output:

test_ldexpf:
	b	ldexpf

test_ldexp:
	b	ldexp

Since SVE has support for an FSCALE instruction, we can use this to process
scalar floats by moving them to a vector register and performing an fscale
call, similar to how LLVM tackles an ldexp builtin as well.

New Output:

test_ldexpf:
	fmov	s31, w0
	ptrue	p7.b, vl4
	fscale	z0.s, p7/m, z0.s, z31.s
	ret

test_ldexp:
	sxtw	x0, w0
	ptrue	p7.b, vl8
	fmov	d31, x0
	fscale	z0.d, p7/m, z0.d, z31.d
	ret

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR

gcc/ChangeLog:
	PR target/111733
	* config/aarch64/aarch64-sve.md (ldexp3): Added a new pattern to
	match ldexp calls with scalar floating modes and expand to the
	existing pattern for FSCALE.
	(@aarch64_pred_): Extended the pattern to accept SVE operands as
	well as scalar floating modes.
	* config/aarch64/iterators.md (SVE_FULL_F_SCALAR): Added an iterator
	to match all FP SVE modes as well as HF, SF, and DF.
	(VPRED): Extended the attribute to handle GPF_HF modes.

gcc/testsuite/ChangeLog:
	* gcc.target/aarch64/sve/fscale.c: New test.

0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch
Description: 0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch
Re: [PATCH] Match: Fold pow calls to ldexp when possible [PR57492]
> On 29 Oct 2024, at 7:16 PM, Richard Biener wrote: > > External email: Use caution opening links or attachments > > > On Mon, 28 Oct 2024, Soumya AR wrote: > >> This patch transforms the following POW calls to equivalent LDEXP calls, as >> discussed in PR57492: >> >> powi (2.0, i) -> ldexp (1.0, i) >> >> a * powi (2.0, i) -> ldexp (a, i) >> >> 2.0 * powi (2.0, i) -> ldexp (1.0, i + 1) >> >> pow (powof2, i) -> ldexp (1.0, i * log2 (powof2)) >> >> powof2 * pow (2, i) -> ldexp (1.0, i + log2 (powof2)) > > For the multiplication cases why not handle powof2 * ldexp (1., i) > to ldexp (1., i + log2 (powof2)) and a * ldexp (1., i) -> ldexp (a, i) > instead? exp2 * ldexp (1., i) is another candidate. > > So please split out the multiplication parts. > > + /* Simplify pow (powof2, i) to ldexp (1, i * log2 (powof2)). */ > > the below pattern handles POWI, not POW. > > + (simplify > + (POWI REAL_CST@0 @1) > + (with { HOST_WIDE_INT tmp = 0; > + tree integer_arg1 = NULL_TREE; } > + (if (integer_valued_real_p (@0) > + && real_isinteger (&TREE_REAL_CST (@0), &tmp) > + && integer_pow2p (integer_arg1 = build_int_cst (integer_type_node, > tmp))) > > && tmp > 0 > && pow2p_hwi (tmp) > > +(LDEXP { build_one_cst (type); } > + (mult @1 { build_int_cst (integer_type_node, > + tree_log2 (integer_arg1)); }) > > build_int_cst (integer_type_node, exact_log2 (tmp)) > > + /* Simplify powi (2.0, i) to ldexp (1, i). */ > + (simplify > + (POWI REAL_CST@0 @1) > + (if (real_equal (TREE_REAL_CST_PTR (@0), &dconst2)) > + (LDEXP { build_one_cst (type); } @1))) > + > > You'll have a duplicate pattern here, instead merge them. 2.0 > is power-of-two so I wonder why the pattern is needed. Thanks for the feedback! I've merged the extra case that was specifically checking for 2.0. Like you suggested, I also added two ldexp specific transforms: • powof2 * ldexp (x, i) -> ldexp (x, i + log2 (powof2)) • a * ldexp(1., i) -> ldexp (a, i) Regarding your suggestion to also fold exp2, a conversion like exp2 (x) -> ldexp (1., x) or exp2 (x) * ldexp (y, i) -> ldexp (y, i + x) is a bit tricky because we'd have to cast it to an integer before passing it to ldexp. real_to_integer only works for constants, which isn't helpful here as exp2 (CST) becomes a power of 2 anyway and matches with the above patterns. We'll have to explicitly convert it for non constants and I'm not sure if that is worth it for this patch. Let me know what you think. Best, Soumya > Richard. > >> >> This is especially helpful for SVE architectures as LDEXP calls can be >> implemented using the FSCALE instruction, as seen in the following patch: >> https://gcc.gnu.org/pipermail/gcc-patches/2024-September/664160.html >> >> SPEC2017 was run with this patch, while there are no noticeable improvements, >> there are no non-noise regressions either. >> >> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. >> OK for mainline? >> >> Signed-off-by: Soumya AR >> >> gcc/ChangeLog: >> PR target/57492 >> * match.pd: Added patterns to fold certain calls to pow to ldexp. >> >> gcc/testsuite/ChangeLog: >> PR target/57492 >> * gcc.dg/tree-ssa/pow-to-ldexp.c: New test. >> >> > > -- > Richard Biener > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) v2-0001-Match-Fold-pow-calls-to-ldexp-when-possible-PR57492.patch Description: v2-0001-Match-Fold-pow-calls-to-ldexp-when-possible-PR57492.patch
[PATCH] Match: Fold pow calls to ldexp when possible [PR57492]
This patch transforms the following POW calls to equivalent LDEXP calls, as
discussed in PR57492:

powi (2.0, i) -> ldexp (1.0, i)

a * powi (2.0, i) -> ldexp (a, i)

2.0 * powi (2.0, i) -> ldexp (1.0, i + 1)

pow (powof2, i) -> ldexp (1.0, i * log2 (powof2))

powof2 * pow (2, i) -> ldexp (1.0, i + log2 (powof2))

This is especially helpful for SVE architectures as LDEXP calls can be
implemented using the FSCALE instruction, as seen in the following patch:
https://gcc.gnu.org/pipermail/gcc-patches/2024-September/664160.html

SPEC2017 was run with this patch, while there are no noticeable improvements,
there are no non-noise regressions either.

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR

gcc/ChangeLog:
	PR target/57492
	* match.pd: Added patterns to fold certain calls to pow to ldexp.

gcc/testsuite/ChangeLog:
	PR target/57492
	* gcc.dg/tree-ssa/pow-to-ldexp.c: New test.

0001-Match-Fold-pow-calls-to-ldexp-when-possible-PR57492.patch
Description: 0001-Match-Fold-pow-calls-to-ldexp-when-possible-PR57492.patch
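[Editorial note: a concrete sketch of the source-level effect, using hypothetical functions rather than the patch's own pow-to-ldexp.c test.]

/* Each call below is expected to fold to a single ldexp call, per the
   transformations listed above.  */
double
f1 (int i)
{
  return __builtin_powi (2.0, i);        /* -> ldexp (1.0, i) */
}

double
f2 (double a, int i)
{
  return a * __builtin_powi (2.0, i);    /* -> ldexp (a, i) */
}

double
f3 (int i)
{
  return __builtin_powi (16.0, i);       /* -> ldexp (1.0, 4 * i), since log2 (16) == 4 */
}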
[PATCH] Match: Optimize log (x) CMP CST and exp (x) CMP CST operations
This patch implements transformations for the following optimizations.

logN(x) CMP CST -> x CMP expN(CST)
expN(x) CMP CST -> x CMP logN(CST)

For example:

int
foo (float x)
{
  return __builtin_logf (x) < 0.0f;
}

can just be:

int
foo (float x)
{
  return x < 1.0f;
}

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR

gcc/ChangeLog:
	* match.pd: Fold logN(x) CMP CST -> x CMP expN(CST)
	and expN(x) CMP CST -> x CMP logN(CST)

gcc/testsuite/ChangeLog:
	* gcc.dg/tree-ssa/log_exp.c: New test.

0001-Match-Optimize-log-x-CMP-CST-and-exp-x-CMP-CST-opera.patch
Description: 0001-Match-Optimize-log-x-CMP-CST-and-exp-x-CMP-CST-opera.patch
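[Editorial note: a further sketch along the same lines, with hypothetical functions, assuming the fast-math-style conditions under which these folds are valid; the version eventually committed restricts the fold to <= and >=, as discussed later in the thread.]

int
cmp_log (float x)
{
  return __builtin_logf (x) <= 0.0f;   /* -> x <= 1.0f, since expf (0.0f) == 1.0f */
}

int
cmp_exp (float x)
{
  return __builtin_expf (x) >= 1.0f;   /* -> x >= 0.0f, since logf (1.0f) == 0.0f */
}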
Re: [PATCH] Match: Fold pow calls to ldexp when possible [PR57492]
Hi Richard, > On 7 Nov 2024, at 6:10 PM, Richard Biener wrote: > > External email: Use caution opening links or attachments > > > On Tue, 5 Nov 2024, Soumya AR wrote: > >> >> >>> On 29 Oct 2024, at 7:16 PM, Richard Biener wrote: >>> >>> External email: Use caution opening links or attachments >>> >>> >>> On Mon, 28 Oct 2024, Soumya AR wrote: >>> >>>> This patch transforms the following POW calls to equivalent LDEXP calls, as >>>> discussed in PR57492: >>>> >>>> powi (2.0, i) -> ldexp (1.0, i) >>>> >>>> a * powi (2.0, i) -> ldexp (a, i) >>>> >>>> 2.0 * powi (2.0, i) -> ldexp (1.0, i + 1) >>>> >>>> pow (powof2, i) -> ldexp (1.0, i * log2 (powof2)) >>>> >>>> powof2 * pow (2, i) -> ldexp (1.0, i + log2 (powof2)) >>> >>> For the multiplication cases why not handle powof2 * ldexp (1., i) >>> to ldexp (1., i + log2 (powof2)) and a * ldexp (1., i) -> ldexp (a, i) >>> instead? exp2 * ldexp (1., i) is another candidate. >>> >>> So please split out the multiplication parts. >>> >>> + /* Simplify pow (powof2, i) to ldexp (1, i * log2 (powof2)). */ >>> >>> the below pattern handles POWI, not POW. >>> >>> + (simplify >>> + (POWI REAL_CST@0 @1) >>> + (with { HOST_WIDE_INT tmp = 0; >>> + tree integer_arg1 = NULL_TREE; } >>> + (if (integer_valued_real_p (@0) >>> + && real_isinteger (&TREE_REAL_CST (@0), &tmp) >>> + && integer_pow2p (integer_arg1 = build_int_cst (integer_type_node, >>> tmp))) >>> >>> && tmp > 0 >>> && pow2p_hwi (tmp) >>> >>> +(LDEXP { build_one_cst (type); } >>> + (mult @1 { build_int_cst (integer_type_node, >>> + tree_log2 (integer_arg1)); }) >>> >>> build_int_cst (integer_type_node, exact_log2 (tmp)) >>> >>> + /* Simplify powi (2.0, i) to ldexp (1, i). */ >>> + (simplify >>> + (POWI REAL_CST@0 @1) >>> + (if (real_equal (TREE_REAL_CST_PTR (@0), &dconst2)) >>> + (LDEXP { build_one_cst (type); } @1))) >>> + >>> >>> You'll have a duplicate pattern here, instead merge them. 2.0 >>> is power-of-two so I wonder why the pattern is needed. >> >> Thanks for the feedback! >> >> I've merged the extra case that was specifically checking for 2.0. >> >> Like you suggested, I also added two ldexp specific transforms: >> >>• powof2 * ldexp (x, i) -> ldexp (x, i + log2 (powof2)) >>• a * ldexp(1., i) -> ldexp (a, i) >> >> Regarding your suggestion to also fold exp2, a conversion like >> exp2 (x) -> ldexp (1., x) or exp2 (x) * ldexp (y, i) -> ldexp (y, i + x) is a >> bit tricky because we'd have to cast it to an integer before passing it to >> ldexp. >> >> real_to_integer only works for constants, which isn't helpful here as exp2 >> (CST) >> becomes a power of 2 anyway and matches with the above patterns. >> >> We'll have to explicitly convert it for non constants and I'm not sure if >> that >> is worth it for this patch. >> >> Let me know what you think. > > + (simplify > + (mult:c REAL_CST@0 (POWI REAL_CST@1 @2)) > + (with { HOST_WIDE_INT tmp = 0; > + tree integer_arg1 = NULL_TREE; } > + (if (integer_valued_real_p (@0) > + && real_equal (TREE_REAL_CST_PTR (@1), &dconst2) > + && real_isinteger (&TREE_REAL_CST (@0), &tmp) > + && integer_pow2p (integer_arg1 = build_int_cst (integer_type_node, > tmp))) > > as said you can use tmp > 0 && pow2p_hwi (tmp) instead of > integer_pow2p and build_int_cst. This applies to all patterns. Thanks for the feedback! Updated the patch with this modification. > + (if (integer_valued_real_p (@0) > + && real_isinteger (&TREE_REAL_CST (@0), &tmp) > > also is a redundant check, real_isinteger gives you the answer already. 
> > + (mult @1 {build_int_cst (integer_type_node, > +tree_log2 (integer_arg1)); }) > > and use exact_log2 (tmp) instead of tree_log2 (integer_arg1). Fixed. > > + /* Simplify a * powi (2.0, i) to ldexp (a, i). */ > + (simplify > + (mult:c @0 (POWI REAL_CST@1 @2)) > + (if (real_equal (TREE_REAL_CST_PTR (@1), &dconst2
Re: [PATCH v2] aarch64: Optimise calls to ldexp with SVE FSCALE instruction [PR111733]
Hi Richard, > On 7 Nov 2024, at 3:19 PM, Richard Sandiford > wrote: > > External email: Use caution opening links or attachments > > > Soumya AR writes: >> Changes since v1: >> >> This revision makes use of the extended definition of aarch64_ptrue_reg to >> generate predicate registers with the appropriate set bits. >> >> Earlier, there was a suggestion to add support for half floats as well. I >> extended the patch to include HFs but GCC still emits a libcall for ldexpf16. >> For example, in the following case, the call does not lower to fscale: >> >> _Float16 test_ldexpf16 (_Float16 x, int i) { >> return __builtin_ldexpf16 (x, i); >> } >> >> Any suggestions as to why this may be? > > You'd need to change: > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def > index 2d455938271..469835b1d62 100644 > --- a/gcc/internal-fn.def > +++ b/gcc/internal-fn.def > @@ -441,7 +441,7 @@ DEF_INTERNAL_OPTAB_FN (VEC_FMADDSUB, ECF_CONST, > vec_fmaddsub, ternary) > DEF_INTERNAL_OPTAB_FN (VEC_FMSUBADD, ECF_CONST, vec_fmsubadd, ternary) > > /* FP scales. */ > -DEF_INTERNAL_FLT_FN (LDEXP, ECF_CONST, ldexp, binary) > +DEF_INTERNAL_FLT_FLOATN_FN (LDEXP, ECF_CONST, ldexp, binary) > > /* Ternary math functions. */ > DEF_INTERNAL_FLT_FLOATN_FN (FMA, ECF_CONST, fma, ternary) Thanks for this! It works for FP16 now. > > A couple of comments below, but otherwise it looks good: > >> diff --git a/gcc/config/aarch64/iterators.md >> b/gcc/config/aarch64/iterators.md >> index 0bc98315bb6..7f708ea14f9 100644 >> --- a/gcc/config/aarch64/iterators.md >> +++ b/gcc/config/aarch64/iterators.md >> @@ -449,6 +449,9 @@ >> ;; All fully-packed SVE floating-point vector modes. >> (define_mode_iterator SVE_FULL_F [VNx8HF VNx4SF VNx2DF]) >> >> +;; Fully-packed SVE floating-point vector modes and 32-bit and 64-bit >> floats. >> +(define_mode_iterator SVE_FULL_F_SCALAR [VNx8HF VNx4SF VNx2DF HF SF DF]) > > The comment is out of date. How about: > > ;; Fully-packed SVE floating-point vector modes and their scalar equivalents. > (define_mode_iterator SVE_FULL_F_SCALAR [SVE_FULL_F GPF_HF]) Edited. > >> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/fscale.c >> b/gcc/testsuite/gcc.target/aarch64/sve/fscale.c >> new file mode 100644 >> index 000..251b4ef9188 >> --- /dev/null >> +++ b/gcc/testsuite/gcc.target/aarch64/sve/fscale.c >> @@ -0,0 +1,16 @@ >> +/* { dg-do compile } */ >> +/* { dg-additional-options "-Ofast" } */ >> + >> +float >> +test_ldexpf (float x, int i) >> +{ >> + return __builtin_ldexpf (x, i); >> +} >> +/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.s, p[0-7]/m, >> z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */ >> + >> +double >> +test_ldexp (double x, int i) >> +{ >> + return __builtin_ldexp (x, i); >> +} >> +/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.d, p[0-7]/m, >> z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > > It would be good to check the ptrues as well, to make sure that we only > enable one lane. > Makes sense, I’ve changed the test case to use check-function-bodies instead so we can check for ptrues as well. Best, Soumya > Thanks, > Richard 0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch Description: 0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch
Re: [PATCH] Match: Optimize log (x) CMP CST and exp (x) CMP CST operations
Thanks, committed: e232dc3bb5c3e8f8a3749239135b7b859a204fc7 Best, Soumya > On 7 Nov 2024, at 3:32 AM, Jeff Law wrote: > > External email: Use caution opening links or attachments > > > On 11/6/24 1:12 AM, Soumya AR wrote: >> >> >>> On 29 Oct 2024, at 6:59 PM, Richard Biener wrote: >>> >>> External email: Use caution opening links or attachments >>> >>> >>> On Mon, 28 Oct 2024, Soumya AR wrote: >>> >>>> This patch implements transformations for the following optimizations. >>>> >>>> logN(x) CMP CST -> x CMP expN(CST) >>>> expN(x) CMP CST -> x CMP logN(CST) >>>> >>>> For example: >>>> >>>> int >>>> foo (float x) >>>> { >>>> return __builtin_logf (x) < 0.0f; >>>> } >>>> >>>> can just be: >>>> >>>> int >>>> foo (float x) >>>> { >>>> return x < 1.0f; >>>> } >>>> >>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no >>>> regression. >>>> OK for mainline? >>> >>> + (for cmp (lt le gt ge eq ne) >>> +(for logs (LOG LOG2 LOG10) >>> +exps (EXP EXP2 EXP10) >>> +/* Simplify logN (x) CMP CST into x CMP expN (CST) */ >>> +(simplify >>> +(cmp:c (logs:s @0) @1) >>> + (cmp @0 (exps @1))) >>> + >>> +/* Simplify expN (x) CMP CST into x CMP logN (CST) */ >>> +(simplify >>> +(cmp:c (exps:s @0) @1) >>> + (cmp @0 (logs @1)) >>> >>> this doesn't restrict @1 to be constant. You should use >>> >>> (cmp:c (exps:s @0) REAL_CST@1) >> >> Fixed. >> >>> I think this transform is also very susceptible to rounding >>> issues - esp. using it for eq and ne looks very dangerous >>> to me. Unless you check a roundtrip through exp/log gets >>> you back exactly the original constant. >>> >>> I think the compare kinds "most safe" would be le and ge. >> >> This makes sense, I’ve updated the patch to only optimize >> for le and ge. >> >> Thanks, >> Soumya >> >>> You can look at fold-const-call.cc:do_mpfr_arg1, mpfr gets >>> you the information on whether the result is exact for example. >>> >>> Richard. >>> >>> >>>> Signed-off-by: Soumya AR >>>> >>>> gcc/ChangeLog: >>>> >>>> * match.pd: Fold logN(x) CMP CST -> x CMP expN(CST) >>>> and expN(x) CMP CST -> x CMP logN(CST) >>>> >>>> gcc/testsuite/ChangeLog: >>>> >>>> * gcc.dg/tree-ssa/log_exp.c: New test. > The latest version looks good to me. > > Jeff
Re: [PATCH v2] aarch64: Optimise calls to ldexp with SVE FSCALE instruction [PR111733]
> On 12 Nov 2024, at 4:27 PM, Richard Sandiford > wrote: > > External email: Use caution opening links or attachments > > > Soumya AR writes: >> diff --git a/gcc/config/aarch64/aarch64-sve.md >> b/gcc/config/aarch64/aarch64-sve.md >> index 06bd3e4bb2c..119a0e53853 100644 >> --- a/gcc/config/aarch64/aarch64-sve.md >> +++ b/gcc/config/aarch64/aarch64-sve.md >> @@ -5088,6 +5088,21 @@ >> ;; - FTSSEL >> ;; - >> >> +(define_expand "ldexp3" >> + [(set (match_operand:GPF_HF 0 "register_operand") >> + (unspec:GPF_HF >> + [(match_dup 3) >> + (const_int SVE_RELAXED_GP) > > Sorry for only noticing now, but: this should be SVE_STRICT_GP instead of > SVE_RELAXED_GP, since we don't want to allow other lanes to be made > active later. > >> + (match_operand:GPF_HF 1 "register_operand") >> + (match_operand: 2 "register_operand")] >> + UNSPEC_COND_FSCALE))] >> + "TARGET_SVE" >> + { >> + operands[3] = aarch64_ptrue_reg (mode, >> + GET_MODE_UNIT_SIZE (mode)); >> + } >> +) >> + >> ;; Unpredicated floating-point binary operations that take an integer as >> ;; their second operand. >> (define_insn "@aarch64_sve_" >> @@ -5103,17 +5118,17 @@ >> ;; Predicated floating-point binary operations that take an integer >> ;; as their second operand. >> (define_insn "@aarch64_pred_" >> - [(set (match_operand:SVE_FULL_F 0 "register_operand") >> - (unspec:SVE_FULL_F >> + [(set (match_operand:SVE_FULL_F_SCALAR 0 "register_operand") >> + (unspec:SVE_FULL_F_SCALAR >>[(match_operand: 1 "register_operand") >> (match_operand:SI 4 "aarch64_sve_gp_strictness") >> -(match_operand:SVE_FULL_F 2 "register_operand") >> +(match_operand:SVE_FULL_F_SCALAR 2 "register_operand") >> (match_operand: 3 "register_operand")] >>SVE_COND_FP_BINARY_INT))] >> "TARGET_SVE" >> {@ [ cons: =0 , 1 , 2 , 3 ; attrs: movprfx ] >> - [ w, Upl , 0 , w ; * ] \t%0., >> %1/m, %0., %3. >> - [ ?&w , Upl , w , w ; yes] movprfx\t%0, >> %2\;\t%0., %1/m, %0., %3. >> + [ w, Upl , 0 , w ; * ] \t%Z0., >> %1/m, %Z0., %Z3. >> + [ ?&w , Upl , w , w ; yes] movprfx\t%Z0, >> %Z2\;\t%Z0., %1/m, %Z0., %Z3. >> } >> ) >> >> diff --git a/gcc/config/aarch64/iterators.md >> b/gcc/config/aarch64/iterators.md >> index 8269b0cdcd9..4153c72954e 100644 >> --- a/gcc/config/aarch64/iterators.md >> +++ b/gcc/config/aarch64/iterators.md >> @@ -452,6 +452,9 @@ >> ;; All fully-packed SVE floating-point vector modes. >> (define_mode_iterator SVE_FULL_F [VNx8HF VNx4SF VNx2DF]) >> >> +;; Fully-packed SVE floating-point vector modes and their scalar >> equivalents. >> +(define_mode_iterator SVE_FULL_F_SCALAR [SVE_FULL_F GPF_HF]) >> + >> ;; Fully-packed SVE integer vector modes that have 8-bit or 16-bit elements. >> (define_mode_iterator SVE_FULL_BHI [VNx16QI VNx8HI]) >> >> @@ -2302,7 +2305,8 @@ >> (VNx8DI "VNx2BI") (VNx8DF "VNx2BI") >> (V8QI "VNx8BI") (V16QI "VNx16BI") >> (V4HI "VNx4BI") (V8HI "VNx8BI") (V2SI "VNx2BI") >> - (V4SI "VNx4BI") (V2DI "VNx2BI") (V1DI "VNx2BI")]) >> + (V4SI "VNx4BI") (V2DI "VNx2BI") (V1DI "VNx2BI") >> + (HF "VNx8BI") (SF "VNx4BI") (DF "VNx2BI")]) >> >> ;; ...and again in lower case. >> (define_mode_attr vpred [(VNx16QI "vnx16bi") (VNx8QI "vnx8bi") >> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def >> index c3d0efc0f2c..09b7844d094 100644 >> --- a/gcc/internal-fn.def >> +++ b/gcc/internal-fn.def >> @@ -441,7 +441,7 @@ DEF_INTERNAL_OPTAB_FN (VEC_FMADDSUB, ECF_CONST, >> vec_fmaddsub, ternary) >> DEF_INTERNAL_OPTAB_FN (VEC_FMSUBADD, ECF_CONST, vec_fmsubadd, ternary) >> >> /* FP scales. 
*/ >> -DEF_INTERNAL_FLT_FN (LDEXP, ECF_CONST, ldexp, binary) >> +DEF_INTERNAL_FLT_FLOATN_FN (LDEXP, ECF_CONST, ldexp, binary) >> >> /* Ter
Re: [PATCH] Match: Fold pow calls to ldexp when possible [PR57492]
> On 12 Nov 2024, at 6:19 PM, Richard Biener wrote: > > External email: Use caution opening links or attachments > > > On Mon, 11 Nov 2024, Soumya AR wrote: > >> Hi Richard, >> >>> On 7 Nov 2024, at 6:10 PM, Richard Biener wrote: >>> >>> External email: Use caution opening links or attachments >>> >>> >>> On Tue, 5 Nov 2024, Soumya AR wrote: >>> >>>> >>>> >>>>> On 29 Oct 2024, at 7:16 PM, Richard Biener wrote: >>>>> >>>>> External email: Use caution opening links or attachments >>>>> >>>>> >>>>> On Mon, 28 Oct 2024, Soumya AR wrote: >>>>> >>>>>> This patch transforms the following POW calls to equivalent LDEXP calls, >>>>>> as >>>>>> discussed in PR57492: >>>>>> >>>>>> powi (2.0, i) -> ldexp (1.0, i) >>>>>> >>>>>> a * powi (2.0, i) -> ldexp (a, i) >>>>>> >>>>>> 2.0 * powi (2.0, i) -> ldexp (1.0, i + 1) >>>>>> >>>>>> pow (powof2, i) -> ldexp (1.0, i * log2 (powof2)) >>>>>> >>>>>> powof2 * pow (2, i) -> ldexp (1.0, i + log2 (powof2)) >>>>> >>>>> For the multiplication cases why not handle powof2 * ldexp (1., i) >>>>> to ldexp (1., i + log2 (powof2)) and a * ldexp (1., i) -> ldexp (a, i) >>>>> instead? exp2 * ldexp (1., i) is another candidate. >>>>> >>>>> So please split out the multiplication parts. >>>>> >>>>> + /* Simplify pow (powof2, i) to ldexp (1, i * log2 (powof2)). */ >>>>> >>>>> the below pattern handles POWI, not POW. >>>>> >>>>> + (simplify >>>>> + (POWI REAL_CST@0 @1) >>>>> + (with { HOST_WIDE_INT tmp = 0; >>>>> + tree integer_arg1 = NULL_TREE; } >>>>> + (if (integer_valued_real_p (@0) >>>>> + && real_isinteger (&TREE_REAL_CST (@0), &tmp) >>>>> + && integer_pow2p (integer_arg1 = build_int_cst (integer_type_node, >>>>> tmp))) >>>>> >>>>> && tmp > 0 >>>>> && pow2p_hwi (tmp) >>>>> >>>>> +(LDEXP { build_one_cst (type); } >>>>> + (mult @1 { build_int_cst (integer_type_node, >>>>> + tree_log2 (integer_arg1)); }) >>>>> >>>>> build_int_cst (integer_type_node, exact_log2 (tmp)) >>>>> >>>>> + /* Simplify powi (2.0, i) to ldexp (1, i). */ >>>>> + (simplify >>>>> + (POWI REAL_CST@0 @1) >>>>> + (if (real_equal (TREE_REAL_CST_PTR (@0), &dconst2)) >>>>> + (LDEXP { build_one_cst (type); } @1))) >>>>> + >>>>> >>>>> You'll have a duplicate pattern here, instead merge them. 2.0 >>>>> is power-of-two so I wonder why the pattern is needed. >>>> >>>> Thanks for the feedback! >>>> >>>> I've merged the extra case that was specifically checking for 2.0. >>>> >>>> Like you suggested, I also added two ldexp specific transforms: >>>> >>>> • powof2 * ldexp (x, i) -> ldexp (x, i + log2 (powof2)) >>>> • a * ldexp(1., i) -> ldexp (a, i) >>>> >>>> Regarding your suggestion to also fold exp2, a conversion like >>>> exp2 (x) -> ldexp (1., x) or exp2 (x) * ldexp (y, i) -> ldexp (y, i + x) >>>> is a >>>> bit tricky because we'd have to cast it to an integer before passing it to >>>> ldexp. >>>> >>>> real_to_integer only works for constants, which isn't helpful here as exp2 >>>> (CST) >>>> becomes a power of 2 anyway and matches with the above patterns. >>>> >>>> We'll have to explicitly convert it for non constants and I'm not sure if >>>> that >>>> is worth it for this patch. >>>> >>>> Let me know what you think. >>> >>> + (simplify >>> + (mult:c REAL_CST@0 (POWI REAL_CST@1 @2)) >>> + (with { HOST_WIDE_INT tmp = 0; >>> + tree integer_arg1 = NULL_TREE; } >>> + (if (integer_valued_real_p (@0) >>> + && real_equal
Re: [PATCH] Match: Fold pow calls to ldexp when possible [PR57492]
> On 13 Nov 2024, at 2:49 PM, Richard Biener wrote: > > External email: Use caution opening links or attachments > > > On Wed, 13 Nov 2024, Soumya AR wrote: > >> >> >>> On 12 Nov 2024, at 6:19 PM, Richard Biener wrote: >>> >>> External email: Use caution opening links or attachments >>> >>> >>> On Mon, 11 Nov 2024, Soumya AR wrote: >>> >>>> Hi Richard, >>>> >>>>> On 7 Nov 2024, at 6:10 PM, Richard Biener wrote: >>>>> >>>>> External email: Use caution opening links or attachments >>>>> >>>>> >>>>> On Tue, 5 Nov 2024, Soumya AR wrote: >>>>> >>>>>> >>>>>> >>>>>>> On 29 Oct 2024, at 7:16 PM, Richard Biener wrote: >>>>>>> >>>>>>> External email: Use caution opening links or attachments >>>>>>> >>>>>>> >>>>>>> On Mon, 28 Oct 2024, Soumya AR wrote: >>>>>>> >>>>>>>> This patch transforms the following POW calls to equivalent LDEXP >>>>>>>> calls, as >>>>>>>> discussed in PR57492: >>>>>>>> >>>>>>>> powi (2.0, i) -> ldexp (1.0, i) >>>>>>>> >>>>>>>> a * powi (2.0, i) -> ldexp (a, i) >>>>>>>> >>>>>>>> 2.0 * powi (2.0, i) -> ldexp (1.0, i + 1) >>>>>>>> >>>>>>>> pow (powof2, i) -> ldexp (1.0, i * log2 (powof2)) >>>>>>>> >>>>>>>> powof2 * pow (2, i) -> ldexp (1.0, i + log2 (powof2)) >>>>>>> >>>>>>> For the multiplication cases why not handle powof2 * ldexp (1., i) >>>>>>> to ldexp (1., i + log2 (powof2)) and a * ldexp (1., i) -> ldexp (a, i) >>>>>>> instead? exp2 * ldexp (1., i) is another candidate. >>>>>>> >>>>>>> So please split out the multiplication parts. >>>>>>> >>>>>>> + /* Simplify pow (powof2, i) to ldexp (1, i * log2 (powof2)). */ >>>>>>> >>>>>>> the below pattern handles POWI, not POW. >>>>>>> >>>>>>> + (simplify >>>>>>> + (POWI REAL_CST@0 @1) >>>>>>> + (with { HOST_WIDE_INT tmp = 0; >>>>>>> + tree integer_arg1 = NULL_TREE; } >>>>>>> + (if (integer_valued_real_p (@0) >>>>>>> + && real_isinteger (&TREE_REAL_CST (@0), &tmp) >>>>>>> + && integer_pow2p (integer_arg1 = build_int_cst >>>>>>> (integer_type_node, >>>>>>> tmp))) >>>>>>> >>>>>>> && tmp > 0 >>>>>>> && pow2p_hwi (tmp) >>>>>>> >>>>>>> +(LDEXP { build_one_cst (type); } >>>>>>> + (mult @1 { build_int_cst (integer_type_node, >>>>>>> + tree_log2 (integer_arg1)); }) >>>>>>> >>>>>>> build_int_cst (integer_type_node, exact_log2 (tmp)) >>>>>>> >>>>>>> + /* Simplify powi (2.0, i) to ldexp (1, i). */ >>>>>>> + (simplify >>>>>>> + (POWI REAL_CST@0 @1) >>>>>>> + (if (real_equal (TREE_REAL_CST_PTR (@0), &dconst2)) >>>>>>> + (LDEXP { build_one_cst (type); } @1))) >>>>>>> + >>>>>>> >>>>>>> You'll have a duplicate pattern here, instead merge them. 2.0 >>>>>>> is power-of-two so I wonder why the pattern is needed. >>>>>> >>>>>> Thanks for the feedback! >>>>>> >>>>>> I've merged the extra case that was specifically checking for 2.0. >>>>>> >>>>>> Like you suggested, I also added two ldexp specific transforms: >>>>>> >>>>>> • powof2 * ldexp (x, i) -> ldexp (x, i + log2 (powof2)) >>>>>> • a * ldexp(1., i) -> ldexp (a, i) >>>>>> >>>>>> Regarding your suggestion to also fold exp2, a conversion like >>>>>> exp2 (x) -> ldexp (1., x) or exp2 (x)
[committed] MAINTAINERS: Added myself to write after approval and DCO.
Pushed to trunk: 91577f0c8d955bc670ee76d1a8851df336bf240c

Signed-off-by: Soumya AR

ChangeLog:
	* MAINTAINERS: Add myself to write after approval and DCO.

0001-MAINTAINERS-Add-myself-to-write-after-approval-and-D.patch
Description: 0001-MAINTAINERS-Add-myself-to-write-after-approval-and-D.patch
SVE intrinsics: Fold constant operands for svlsl.
This patch implements constant folding for svlsl. Test cases have been added
to check for the following cases:

Zero, merge, and don't care predication.
Shift by 0.
Shift by register width.
Overflow shift on signed and unsigned integers.
Shift on a negative integer.
Maximum possible shift, e.g. shift by 7 on an 8-bit integer.

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR

gcc/ChangeLog:
	* config/aarch64/aarch64-sve-builtins-base.cc (svlsl_impl::fold):
	Try constant folding.

gcc/testsuite/ChangeLog:
	* gcc.target/aarch64/sve/const_fold_lsl_1.c: New test.

0001-SVE-intrinsics-Fold-constant-operands-for-svlsl.patch
Description: 0001-SVE-intrinsics-Fold-constant-operands-for-svlsl.patch
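[Editorial note: one case of the folding described above, sketched in the spirit of the new const_fold_lsl_1.c rather than copied from it.]

#include <arm_sve.h>

/* Both operands are constant, so the shift folds away entirely:
   5 << 2 == 20, and the function should reduce to a single mov of #20.  */
svint64_t
fold_shift (svbool_t pg)
{
  return svlsl_n_s64_x (pg, svdup_s64 (5), 2);
}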
[PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE instruction
This patch uses the FSCALE instruction provided by SVE to implement the
standard ldexp family of functions.

Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the
following code:

float
test_ldexpf (float x, int i)
{
  return __builtin_ldexpf (x, i);
}

double
test_ldexp (double x, int i)
{
  return __builtin_ldexp(x, i);
}

GCC Output:

test_ldexpf:
	b	ldexpf

test_ldexp:
	b	ldexp

Since SVE has support for an FSCALE instruction, we can use this to process
scalar floats by moving them to a vector register and performing an fscale
call, similar to how LLVM tackles an ldexp builtin as well.

New Output:

test_ldexpf:
	fmov	s31, w0
	ptrue	p7.b, all
	fscale	z0.s, p7/m, z0.s, z31.s
	ret

test_ldexp:
	sxtw	x0, w0
	ptrue	p7.b, all
	fmov	d31, x0
	fscale	z0.d, p7/m, z0.d, z31.d
	ret

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR

gcc/ChangeLog:
	* config/aarch64/aarch64-sve.md (ldexp3): Added a new pattern to
	match ldexp calls with scalar floating modes and expand to the
	existing pattern for FSCALE.
	(@aarch64_pred_): Extended the pattern to accept SVE operands as
	well as scalar floating modes.
	* config/aarch64/iterators.md (SVE_FULL_F_SCALAR): Added an iterator
	to match all FP SVE modes as well as SF and DF.
	(VPRED): Extended the attribute to handle GPF modes.

gcc/testsuite/ChangeLog:
	* gcc.target/aarch64/sve/fscale.c: New test.

0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch
Description: 0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch
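[Editorial note: as a reminder of the semantics being mapped onto FSCALE, a small scalar sanity check; illustrative only, not part of the patch or its tests.]

#include <math.h>
#include <assert.h>

int
main (void)
{
  /* ldexp scales by an exact power of two, which is what FSCALE computes
     for each active lane.  */
  assert (ldexpf (3.0f, 4) == 48.0f);   /* 3 * 2^4 */
  assert (ldexp (1.5, -1) == 0.75);     /* 1.5 * 2^-1 */
  return 0;
}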
[PATCH] aarch64: Expand CTZ to RBIT + CLZ for SVE [PR109498]
Currently, we vectorize CTZ for SVE by using the following operation:

.CTZ (X) = (PREC - 1) - .CLZ (X & -X)

Instead, this patch expands CTZ to RBIT + CLZ for SVE, as suggested in PR109498.

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR

gcc/ChangeLog:

	PR target/109498
	* config/aarch64/aarch64-sve.md (ctz2): Added pattern to expand
	CTZ to RBIT + CLZ for SVE.

gcc/testsuite/ChangeLog:

	PR target/109498
	* gcc.target/aarch64/sve/ctz.c: New test.

0001-aarch64-Expand-CTZ-to-RBIT-CLZ-for-SVE-PR109498.patch
Description: 0001-aarch64-Expand-CTZ-to-RBIT-CLZ-for-SVE-PR109498.patch
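As a rough sketch of the motivating case (the loop and options are my own illustration, assuming SVE auto-vectorization, e.g. -O3 -march=armv8-a+sve):

#include <stdint.h>
#define N 1024

uint32_t in[N], out[N];

void
ctz_loop (void)
{
  /* Per-lane count-trailing-zeros; with this patch the vectorized body
     is expected to use rbit followed by clz instead of the
     (PREC - 1) - clz (x & -x) sequence.  */
  for (int i = 0; i < N; i++)
    out[i] = __builtin_ctz (in[i]);
}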
Re: SVE intrinsics: Fold constant operands for svlsl.
Ping. > On 24 Sep 2024, at 2:00 PM, Soumya AR wrote: > > This patch implements constant folding for svlsl. Test cases have been added > to > check for the following cases: > > Zero, merge, and don't care predication. > Shift by 0. > Shift by register width. > Overflow shift on signed and unsigned integers. > Shift on a negative integer. > Maximum possible shift, eg. shift by 7 on an 8-bit integer. > > The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. > OK for mainline? > > Signed-off-by: Soumya AR > > gcc/ChangeLog: > > * config/aarch64/aarch64-sve-builtins-base.cc (svlsl_impl::fold): > Try constant folding. > > gcc/testsuite/ChangeLog: > > * gcc.target/aarch64/sve/const_fold_lsl_1.c: New test. > > <0001-SVE-intrinsics-Fold-constant-operands-for-svlsl.patch>
Re: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE instruction
Hi Saurabh, > On 30 Sep 2024, at 10:36 PM, Saurabh Jha wrote: > > External email: Use caution opening links or attachments > > > Hi Soumya, > > Thank you for the patch. Two clarifications: > > In the instruction pattern's output string, why did you add the 'Z' > prefix before operands? (%0 -> %Z0). The ‘Z’ prefix is added to ensure that the register name is correctly printed. Normally, %n by default prints the register assigned to operand 'n' using PRINT_OPERAND target hook, as mentioned here: https://gcc.gnu.org/onlinedocs/gccint/Output-Template.html Using %n overrides that target hook for customizing the print name. In most cases, overriding is not necessary since the operands are SVE registers but in this case, since non-SVE registers are used at RTL level, we need to override the hook to ensure that something like: fscale v0.s, p7/m, v0.s, v31.s does not get printed. > Also, maybe you can make your test cases more precise by specifying > which functions generate which instructions. I don't have and SVE test > off the top of my head but have a look at > /gcc/testsuite/gcc.target/aarch64/simd/faminmax-codegen.c > for example. Thanks for the suggestion! I'll update the test case accordingly. Regards, Soumya > Regards, > Saurabh > > > > On 9/30/2024 5:26 PM, Soumya AR wrote: >> This patch uses the FSCALE instruction provided by SVE to implement the >> standard ldexp family of functions. >> >> Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the >> following code: >> >> float >> test_ldexpf (float x, int i) >> { >> return __builtin_ldexpf (x, i); >> } >> >> double >> test_ldexp (double x, int i) >> { >> return __builtin_ldexp(x, i); >> } >> >> GCC Output: >> >> test_ldexpf: >> b ldexpf >> >> test_ldexp: >> b ldexp >> >> Since SVE has support for an FSCALE instruction, we can use this to process >> scalar floats by moving them to a vector register and performing an fscale >> call, >> similar to how LLVM tackles an ldexp builtin as well. >> >> New Output: >> >> test_ldexpf: >> fmov s31, w0 >> ptrue p7.b, all >> fscale z0.s, p7/m, z0.s, z31.s >> ret >> >> test_ldexp: >> sxtw x0, w0 >> ptrue p7.b, all >> fmov d31, x0 >> fscale z0.d, p7/m, z0.d, z31.d >> ret >> >> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. >> OK for mainline? >> >> Signed-off-by: Soumya AR >> >> gcc/ChangeLog: >> >> * config/aarch64/aarch64-sve.md >> (ldexp3): Added a new pattern to match ldexp calls with scalar >> floating modes and expand to the existing pattern for FSCALE. >> (@aarch64_pred_): Extended the pattern to accept SVE >> operands as well as scalar floating modes. >> >> * config/aarch64/iterators.md: >> SVE_FULL_F_SCALAR: Added an iterator to match all FP SVE modes as well >> as SF and DF. >> VPRED: Extended the attribute to handle GPF modes. >> >> gcc/testsuite/ChangeLog: >> >> * gcc.target/aarch64/sve/fscale.c: New test. >> >
Re: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE instruction
> On 1 Oct 2024, at 1:18 PM, Tamar Christina wrote: > > External email: Use caution opening links or attachments > > > Hi Soumya, > > Nice patch! > >> -Original Message- >> From: Kyrylo Tkachov >> Sent: Tuesday, October 1, 2024 7:55 AM >> To: Soumya AR >> Cc: gcc-patches@gcc.gnu.org; Richard Sandiford >> Subject: Re: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE >> instruction >> >> Hi Soumya >> >>> On 30 Sep 2024, at 18:26, Soumya AR wrote: >>> >>> External email: Use caution opening links or attachments >>> >>> >>> This patch uses the FSCALE instruction provided by SVE to implement the >>> standard ldexp family of functions. >>> >>> Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the >>> following code: >>> >>> float >>> test_ldexpf (float x, int i) >>> { >>> return __builtin_ldexpf (x, i); >>> } >>> >>> double >>> test_ldexp (double x, int i) >>> { >>> return __builtin_ldexp(x, i); >>> } >>> >>> GCC Output: >>> >>> test_ldexpf: >>> b ldexpf >>> >>> test_ldexp: >>> b ldexp >>> >>> Since SVE has support for an FSCALE instruction, we can use this to process >>> scalar floats by moving them to a vector register and performing an fscale >>> call, >>> similar to how LLVM tackles an ldexp builtin as well. >>> >>> New Output: >>> >>> test_ldexpf: >>> fmov s31, w0 >>> ptrue p7.b, all >>> fscale z0.s, p7/m, z0.s, z31.s >>> ret >>> >>> test_ldexp: >>> sxtw x0, w0 >>> ptrue p7.b, all >>> fmov d31, x0 >>> fscale z0.d, p7/m, z0.d, z31.d >>> ret >>> >>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no >>> regression. >>> OK for mainline? >>> >>> Signed-off-by: Soumya AR >>> >>> gcc/ChangeLog: >>> >>> * config/aarch64/aarch64-sve.md >>> (ldexp3): Added a new pattern to match ldexp calls with scalar >>> floating modes and expand to the existing pattern for FSCALE. >>> (@aarch64_pred_): Extended the pattern to accept SVE >>> operands as well as scalar floating modes. >>> >>> * config/aarch64/iterators.md: >>> SVE_FULL_F_SCALAR: Added an iterator to match all FP SVE modes as well >>> as SF and DF. >>> VPRED: Extended the attribute to handle GPF modes. >>> >>> gcc/testsuite/ChangeLog: >>> >>> * gcc.target/aarch64/sve/fscale.c: New test. 
>> >> This patch fixes the bugzilla report at >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111733 >> So it should be referenced in the ChangeLog entries like so: >> >> PR target/111733 >> * config/aarch64/aarch64-sve.md >> >> That way the commit hooks will pick it up and updated the bug tracker >> accordingly >> >>> >>> <0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch> >> >> +(define_expand "ldexp3" >> + [(set (match_operand:GPF 0 "register_operand" "=w") >> + (unspec:GPF >> + [(match_operand:GPF 1 "register_operand" "w") >> +(match_operand: 2 "register_operand" "w")] >> + UNSPEC_COND_FSCALE))] >> + "TARGET_SVE" >> + { >> +rtx ptrue = aarch64_ptrue_reg (mode); >> +rtx strictness = gen_int_mode (SVE_RELAXED_GP, SImode); >> +emit_insn (gen_aarch64_pred_fscale (operands[0], ptrue, >> operands[1], operands[2], strictness)); >> +DONE; >> + } >> >> Lines should not exceed 80 columns, this should be wrapped around > > And Nit: perhaps slightly more idiomatic to the other patterns in SVE is this: > > (define_expand "ldexp3" > [(set (match_operand:GPF 0 "register_operand") >(unspec:GPF > [(match_dup 3) > (const_int SVE_RELAXED_GP) > (match_operand:GPF 1 "register_operand") > (match_operand: 2 "register_operand")] > UNSPEC_COND_FSCALE))] > "TARGET_SVE" > { >operands[3] = aarch64_ptrue_reg (mode); > } > ) > > It removes the dependency on the exact name of the pattern. > Also note the dropping of the constraints, expand patterns don't use > the constraints, only the predicates are checked. > > Cheers, > Tamar Thanks for this suggestion! This makes a lot of sense. Edited the patch with this change. Also referenced the PR as suggested by Kyrill earlier. Thanks, Soumya >> >> The patch looks good to me otherwise. >> Thanks, >> Kyrill 0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch Description: 0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch
Re: [PATCH] Match: Optimize log (x) CMP CST and exp (x) CMP CST operations
> On 29 Oct 2024, at 6:59 PM, Richard Biener wrote: > > External email: Use caution opening links or attachments > > > On Mon, 28 Oct 2024, Soumya AR wrote: > >> This patch implements transformations for the following optimizations. >> >> logN(x) CMP CST -> x CMP expN(CST) >> expN(x) CMP CST -> x CMP logN(CST) >> >> For example: >> >> int >> foo (float x) >> { >> return __builtin_logf (x) < 0.0f; >> } >> >> can just be: >> >> int >> foo (float x) >> { >> return x < 1.0f; >> } >> >> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. >> OK for mainline? > > + (for cmp (lt le gt ge eq ne) > +(for logs (LOG LOG2 LOG10) > +exps (EXP EXP2 EXP10) > +/* Simplify logN (x) CMP CST into x CMP expN (CST) */ > +(simplify > +(cmp:c (logs:s @0) @1) > + (cmp @0 (exps @1))) > + > +/* Simplify expN (x) CMP CST into x CMP logN (CST) */ > +(simplify > +(cmp:c (exps:s @0) @1) > + (cmp @0 (logs @1)) > > this doesn't restrict @1 to be constant. You should use > > (cmp:c (exps:s @0) REAL_CST@1) Fixed. > I think this transform is also very susceptible to rounding > issues - esp. using it for eq and ne looks very dangerous > to me. Unless you check a roundtrip through exp/log gets > you back exactly the original constant. > > I think the compare kinds "most safe" would be le and ge. This makes sense, I’ve updated the patch to only optimize for le and ge. Thanks, Soumya > You can look at fold-const-call.cc:do_mpfr_arg1, mpfr gets > you the information on whether the result is exact for example. > > Richard. > > >> Signed-off-by: Soumya AR >> >> gcc/ChangeLog: >> >> * match.pd: Fold logN(x) CMP CST -> x CMP expN(CST) >> and expN(x) CMP CST -> x CMP logN(CST) >> >> gcc/testsuite/ChangeLog: >> >> * gcc.dg/tree-ssa/log_exp.c: New test. >> >> > > -- > Richard Biener > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) v2-0001-Match-Optimize-log-x-CMP-CST-and-exp-x-CMP-CST.patch Description: v2-0001-Match-Optimize-log-x-CMP-CST-and-exp-x-CMP-CST.patch
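As a rough example of the retained cases (my own illustration; it assumes fast-math-style options such as -Ofast so the libm call can be rewritten, and that logf (4.0f) is constant folded):

/* expf (x) >= 4.0f is expected to become x >= logf (4.0f), evaluated at
   compile time; the eq/ne cases are deliberately not transformed.  */
int
ge_exp (float x)
{
  return __builtin_expf (x) >= 4.0f;
}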
[PATCH] aarch64: Use SVE ASRD instruction with Neon modes.
The ASRD instruction on SVE performs an arithmetic shift right by an immediate for divide. This patch enables the use of ASRD with Neon modes. For example: int in[N], out[N]; void foo (void) { for (int i = 0; i < N; i++) out[i] = in[i] / 4; } compiles to: ldr q31, [x1, x0] cmltv30.16b, v31.16b, #0 and z30.b, z30.b, 3 add v30.16b, v30.16b, v31.16b sshrv30.16b, v30.16b, 2 str q30, [x0, x2] add x0, x0, 16 cmp x0, 1024 but can just be: ldp q30, q31, [x0], 32 asrdz31.b, p7/m, z31.b, #2 asrdz30.b, p7/m, z30.b, #2 stp q30, q31, [x1], 32 cmp x0, x2 The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. OK for mainline? Signed-off-by: Soumya AR gcc/ChangeLog: * config/aarch64/aarch64-sve.md: Extended sdiv_pow23 and *sdiv_pow23 to support Neon modes. gcc/testsuite/ChangeLog: * gcc.target/aarch64/sve/sve-asrd.c: New test. --- gcc/config/aarch64/aarch64-sve.md | 25 - .../gcc.target/aarch64/sve/sve-asrd.c | 54 +++ 2 files changed, 67 insertions(+), 12 deletions(-) create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index affdb24a93d..96effe4abed 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -4972,34 +4972,35 @@ ;; Unpredicated ASRD. (define_expand "sdiv_pow23" - [(set (match_operand:SVE_I 0 "register_operand") - (unspec:SVE_I + [(set (match_operand:SVE_VDQ_I 0 "register_operand") + (unspec:SVE_VDQ_I [(match_dup 3) - (unspec:SVE_I -[(match_operand:SVE_I 1 "register_operand") + (unspec:SVE_VDQ_I +[(match_operand:SVE_VDQ_I 1 "register_operand") (match_operand 2 "aarch64_simd_rshift_imm")] UNSPEC_ASRD)] UNSPEC_PRED_X))] "TARGET_SVE" { -operands[3] = aarch64_ptrue_reg (mode); +operands[3] = aarch64_ptrue_reg (mode, + GET_MODE_UNIT_SIZE (mode)); } ) ;; Predicated ASRD. (define_insn "*sdiv_pow23" - [(set (match_operand:SVE_I 0 "register_operand") - (unspec:SVE_I + [(set (match_operand:SVE_VDQ_I 0 "register_operand") + (unspec:SVE_VDQ_I [(match_operand: 1 "register_operand") - (unspec:SVE_I -[(match_operand:SVE_I 2 "register_operand") - (match_operand:SVE_I 3 "aarch64_simd_rshift_imm")] + (unspec:SVE_VDQ_I +[(match_operand:SVE_VDQ_I 2 "register_operand") + (match_operand:SVE_VDQ_I 3 "aarch64_simd_rshift_imm")] UNSPEC_ASRD)] UNSPEC_PRED_X))] "TARGET_SVE" {@ [ cons: =0 , 1 , 2 ; attrs: movprfx ] - [ w, Upl , 0 ; * ] asrd\t%0., %1/m, %0., #%3 - [ ?&w , Upl , w ; yes] movprfx\t%0, %2\;asrd\t%0., %1/m, %0., #%3 + [ w, Upl , 0 ; * ] asrd\t%Z0., %1/m, %Z0., #%3 + [ ?&w , Upl , w ; yes] movprfx\t%Z0, %Z2\;asrd\t%Z0., %1/m, %Z0., #%3 } ) diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c new file mode 100644 index 000..00aa8b2380d --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c @@ -0,0 +1,54 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast --param aarch64-autovec-preference=asimd-only" } */ +/* { dg-final { check-function-bodies "**" "" "" } } */ + +#include +#define N 1024 + +#define FUNC(M) \ +M in_##M[N];\ +M out_##M[N]; \ +void asrd_##M() { \ + for (int i = 0; i < N; i++) \ +out_##M[i] = in_##M[i] / 4; \ +} + +/* +** asrd_int8_t: +** ... +** ptrue (p[0-7]).b, vl1 +** ... +** asrdz[0-9]+\.b, \1/m, z[0-9]+\.b, #2 +** ... +*/ +FUNC(int8_t) + +/* +** asrd_int16_t: +** ... +** ptrue (p[0-7]).b, vl2 +** ... +** asrdz[0-9]+\.h, \1/m, z[0-9]+\.h, #2 +** ... +*/ +FUNC(int16_t) + +/* +** asrd_int32_t: +** ... +** ptrue (p[0-7]).b, vl4 +** ... 
+** asrdz[0-9]+\.s, \1/m, z[0-9]+\.s, #2 +** ... +*/ +FUNC(int32_t) + +/* +** asrd_int64_t: +** ... +** ptrue (p[0-7]).b, vl8 +** ... +** asrdz[0-9]+\.d, \1/m, z[0-9]+\.d, #2 +** ... +*/ +FUNC(int64_t) -- 2.43.2
[PATCH] testsuite: Require C99 for pow-to-ldexp.c
pow-to-ldexp.c checks for calls to __builtin_ldexpf and __builtin_ldexpl, which are only emitted when the compiler knows the target has a C99 libm available. Modified the test to add a C99 runtime requirement. This fixes the failure on arm-eabi targets for this test case.

Committed as obvious: 90645dba41bac29cab4c5996ba320c97a0325eb2

Signed-off-by: Soumya AR

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/pow-to-ldexp.c: Require c99_runtime.
---
 gcc/testsuite/gcc.dg/tree-ssa/pow-to-ldexp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pow-to-ldexp.c b/gcc/testsuite/gcc.dg/tree-ssa/pow-to-ldexp.c
index 007949dbb53..5be3dc89955 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pow-to-ldexp.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pow-to-ldexp.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-additional-options "-Ofast -fdump-tree-optimized" } */
+/* { dg-require-effective-target c99_runtime } */
 
 /* { dg-final { scan-tree-dump-times "__builtin_ldexp\ " 7 "optimized" } } */
 /* { dg-final { scan-tree-dump-times "__builtin_ldexpf\ " 7 "optimized" } } */
-- 
2.43.2
[PATCH] aarch64: Use SVE SUBR instruction with Neon modes
The SVE SUBR instruction performs a reversed subtract from an immediate. This patch enables the emission of SUBR for Neon modes and avoids the need to materialise an explicit constant.

For example, the below test case:

typedef long long __attribute__ ((vector_size (16))) v2di;

v2di subr_v2di (v2di x)
{
  return 15 - x;
}

compiles to:

subr_v2di:
	mov	z31.d, #15
	sub	v0.2d, v31.2d, v0.2d
	ret

but can just be:

subr_v2di:
	subr	z0.d, z0.d, #15
	ret

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (sub3): Extended the pattern to
	emit SUBR for SVE targets if operand 1 is an immediate.
	* config/aarch64/predicates.md (aarch64_sve_arith_imm_or_reg_operand):
	New predicate that accepts aarch64_sve_arith_immediate in operand 1
	but only for TARGET_SVE.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/subr-sve.c: New test.

0001-aarch64-Use-SVE-SUBR-instruction-with-Neon-modes.patch
Description: 0001-aarch64-Use-SVE-SUBR-instruction-with-Neon-modes.patch
[PATCH] aarch64: Extend SVE2 bit-select instructions for Neon modes.
NBSL, BSL1N, and BSL2N are bit-select instructions on SVE2 with certain operands inverted. These can be extended to work with Neon modes.

Since these instructions are unpredicated, duplicate patterns were added with the predicate removed to generate these instructions for Neon modes.

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR

gcc/ChangeLog:

	* config/aarch64/aarch64-sve2.md
	(*aarch64_sve2_nbsl_unpred): New pattern to match unpredicated form.
	(*aarch64_sve2_bsl1n_unpred): Likewise.
	(*aarch64_sve2_bsl2n_unpred): Likewise.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/bitsel.c: New test.

0001-aarch64-Extend-SVE2-bit-select-instructions-for-Neon.patch
Description: 0001-aarch64-Extend-SVE2-bit-select-instructions-for-Neon.patch
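A small sketch of the NBSL source form using GNU C vector types (my own example, mirroring the OPNBSL macro in the new test; the option set, e.g. -O2 -march=armv8-a+sve2, is an assumption):

typedef unsigned int v4si __attribute__ ((vector_size (16)));

/* ~((x & z) | (y & ~z)) is the NBSL bit-select form; with this patch a
   128-bit Neon-sized operation like this can be emitted as a single nbsl
   on the corresponding z registers.  */
v4si
nbsl_example (v4si x, v4si y, v4si z)
{
  return ~((x & z) | (y & ~z));
}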
Re: [PATCH] aarch64: Extend SVE2 bit-select instructions for Neon modes.
> On 10 Dec 2024, at 7:09 PM, Richard Sandiford > wrote: > > External email: Use caution opening links or attachments > > > Soumya AR writes: >> @@ -1815,6 +1849,42 @@ >> } >> ) >> >> +(define_insn "*aarch64_sve2_bsl2n_unpred" >> + [(set (match_operand:VDQ_I 0 "register_operand") >> + (ior:VDQ_I >> + (and:VDQ_I >> + (match_operand:VDQ_I 1 "register_operand") >> + (match_operand:VDQ_I 2 "register_operand")) >> + (and:VDQ_I >> + (not:VDQ_I >> + (match_operand:VDQ_I 3 "register_operand")) >> + (not:VDQ_I >> + (match_dup BSL_DUP)] > > The second "and" should be indented by the same amount as the first "and": > > (define_insn "*aarch64_sve2_bsl2n_unpred" > [(set (match_operand:VDQ_I 0 "register_operand") >(ior:VDQ_I > (and:VDQ_I >(match_operand:VDQ_I 1 "register_operand") >(match_operand:VDQ_I 2 "register_operand")) > (and:VDQ_I >(not:VDQ_I (match_operand:VDQ_I 3 "register_operand")) >(not:VDQ_I (match_dup BSL_DUP)] > >> + "TARGET_SVE2" >> + {@ [ cons: =0 , 1 , 2 , 3 ; attrs: movprfx ] >> + [ w, , , w ; * ] >> bsl2n\t%Z0.d, %Z0.d, %Z3.d, %Z.d >> + [ ?&w , w , w , w ; yes] >> movprfx\t%Z0, %Z\;bsl2n\t%Z0.d, %Z0.d, %Z3.d, %Z.d >> + } >> +) >> + >> +(define_insn "*aarch64_sve2_bsl2n_unpred" >> + [(set (match_operand:VDQ_I 0 "register_operand") >> + (ior:VDQ_I >> + (and:VDQ_I >> + (match_operand:VDQ_I 1 "register_operand") >> + (match_operand:VDQ_I 2 "register_operand")) >> + (and:VDQ_I >> + (not:VDQ_I >> + (match_dup BSL_DUP)) >> + (not:VDQ_I >> + (match_operand:VDQ_I 3 "register_operand")] > > Similarly here. > > OK with those changes, thanks. > Ah my bad, fixed those. Committed: 65b7c8db9c61bcdfd07a3404047dd2d2beac4bbb Thank you. Best, Soumya > Richard > >> + "TARGET_SVE2" >> + {@ [ cons: =0 , 1 , 2 , 3 ; attrs: movprfx ] >> + [ w, , , w ; * ] >> bsl2n\t%Z0.d, %Z0.d, %Z3.d, %Z.d >> + [ ?&w , w , w , w ; yes] >> movprfx\t%Z0, %Z\;bsl2n\t%Z0.d, %Z0.d, %Z3.d, %Z.d >> + } >> +) >> + >> ;; - >> ;; [INT] Shift-and-accumulate operations >> ;; - >> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/bitsel.c >> b/gcc/testsuite/gcc.target/aarch64/sve/bitsel.c >> new file mode 100644 >> index 000..635bfefc17c >> --- /dev/null >> +++ b/gcc/testsuite/gcc.target/aarch64/sve/bitsel.c >> @@ -0,0 +1,35 @@ >> +/* { dg-options "-O2 -mcpu=neoverse-v2 --param >> aarch64-autovec-preference=asimd-only" } */ >> + >> +#include >> + >> +#define OPNBSL(x,y,z) (~(((x) & (z)) | ((y) & ~(z >> +#define OPBSL1N(x,y,z) ((~(x) & (z)) | ((y) & ~(z))) >> +#define OPBSL2N(x,y,z) (((x) & (z)) | (~(y) & ~(z))) >> + >> +#define N 1024 >> + >> +#define TYPE(N) int##N##_t >> + >> +#define TEST(SIZE, OP, SUFFIX) \ >> +void __attribute__ ((noinline, noclone))\ >> +f_##SIZE##_##SUFFIX \ >> + (TYPE(SIZE) *restrict a, TYPE(SIZE) *restrict b, \ >> + TYPE(SIZE) *restrict c, TYPE(SIZE) *restrict d) \ >> +{ \ >> + for (int i = 0; i < N; i++) \ >> +a[i] = OP (b[i], c[i], d[i]); \ >> +} >> + >> +#define TEST_ALL(SIZE) \ >> + TEST(SIZE, OPNBSL, nbsl) \ >> + TEST(SIZE, OPBSL1N, bsl1n)\ >> + TEST(SIZE, OPBSL2N, bsl2n) >> + >> +TEST_ALL(8); >> +TEST_ALL(16); >> +TEST_ALL(32); >> +TEST_ALL(64); >> + >> +/* { dg-final { scan-assembler-times {\tnbsl\tz[0-9]+\.d, z[0-9]+\.d, >> z[0-9]+\.d, z[0-9]+\.d\n} 4 } } */ >> +/* { dg-final { scan-assembler-times {\tbsl1n\tz[0-9]+\.d, z[0-9]+\.d, >> z[0-9]+\.d, z[0-9]+\.d\n} 4 } } */ >> +/* { dg-final { scan-assembler-times {\tbsl2n\tz[0-9]+\.d, z[0-9]+\.d, >> z[0-9]+\.d, z[0-9]+\.d\n} 4 } } */ >> \ No newline at end of file
Re: [PATCH] aarch64: Use SVE ASRD instruction with Neon modes.
> On 10 Dec 2024, at 7:03 PM, Richard Sandiford > wrote: > > External email: Use caution opening links or attachments > > > Soumya AR writes: >> Hi Richard, >> >> Thanks for reviewing this! >> >> I’ve made the suggested changes and added the aarch64_ptrue_reg overload. > > The updated patch is OK, thanks. Thanks, committed: e5569a20cf3791553ac324269001a7c7c0e56242 > >> Thank you for providing the implementation for this, I have added you >> as a co-author for the patch, hope that works :) > > Sure, that's ok with me FWIW. IMO it isn't really necessary though, > so it's ok without too :) > > Sometimes it just feels easier to write review comments in the form > of code snippets, since that's more direct & precise than describing > something in English. It is still fundamentally a code review though. > Yes I understand, but would prefer having it there rather than not :) Best, Soumya > Richard > >> >> Best, >> Soumya >> >> >>> On 5 Dec 2024, at 10:08 PM, Richard Sandiford >>> wrote: >>> >>> External email: Use caution opening links or attachments >>> >>> >>> Soumya AR writes: >>>> The ASRD instruction on SVE performs an arithmetic shift right by an >>>> immediate >>>> for divide. >>>> >>>> This patch enables the use of ASRD with Neon modes. >>>> >>>> For example: >>>> >>>> int in[N], out[N]; >>>> >>>> void >>>> foo (void) >>>> { >>>> for (int i = 0; i < N; i++) >>>> out[i] = in[i] / 4; >>>> } >>>> >>>> compiles to: >>>> >>>> ldr q31, [x1, x0] >>>> cmltv30.16b, v31.16b, #0 >>>> and z30.b, z30.b, 3 >>>> add v30.16b, v30.16b, v31.16b >>>> sshrv30.16b, v30.16b, 2 >>>> str q30, [x0, x2] >>>> add x0, x0, 16 >>>> cmp x0, 1024 >>>> >>>> but can just be: >>>> >>>> ldp q30, q31, [x0], 32 >>>> asrdz31.b, p7/m, z31.b, #2 >>>> asrdz30.b, p7/m, z30.b, #2 >>>> stp q30, q31, [x1], 32 >>>> cmp x0, x2 >>>> >>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no >>>> regression. >>>> OK for mainline? >>>> >>>> Signed-off-by: Soumya AR >>>> >>>> gcc/ChangeLog: >>>> >>>> * config/aarch64/aarch64-sve.md: Extended sdiv_pow23 and >>>> *sdiv_pow23 to support Neon modes. >>>> >>>> gcc/testsuite/ChangeLog: >>>> >>>> * gcc.target/aarch64/sve/sve-asrd.c: New test. >>>> --- >>>> gcc/config/aarch64/aarch64-sve.md | 25 - >>>> .../gcc.target/aarch64/sve/sve-asrd.c | 54 +++ >>>> 2 files changed, 67 insertions(+), 12 deletions(-) >>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c >>> >>> Rearranging to put the testcase first: >>> >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c >>>> b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c >>>> new file mode 100644 >>>> index 000..00aa8b2380d >>>> --- /dev/null >>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c >>>> @@ -0,0 +1,54 @@ >>>> +/* { dg-do compile } */ >>>> +/* { dg-options "-Ofast --param aarch64-autovec-preference=asimd-only" } >>>> */ >>>> +/* { dg-final { check-function-bodies "**" "" "" } } */ >>>> + >>>> +#include >>>> +#define N 1024 >>>> + >>>> +#define FUNC(M) \ >>>> +M in_##M[N];\ >>>> +M out_##M[N]; \ >>>> +void asrd_##M() { \ >>>> + for (int i = 0; i < N; i++) \ >>>> +out_##M[i] = in_##M[i] / 4; \ >>>> +} >>>> + >>>> +/* >>>> +** asrd_int8_t: >>>> +** ... >>>> +** ptrue (p[0-7]).b, vl1 >>>> +** ... >>>> +** asrdz[0-9]+\.b, \1/m, z[0-9]+\.b, #2 >>>> +** ... >>>> +*/
Re: [PATCH] aarch64: Use SVE ASRD instruction with Neon modes.
Hi Richard, Thanks for reviewing this! I’ve made the suggested changes and added the aarch64_ptrue_reg overload. Thank you for providing the implementation for this, I have added you as a co-author for the patch, hope that works :) Best, Soumya > On 5 Dec 2024, at 10:08 PM, Richard Sandiford > wrote: > > External email: Use caution opening links or attachments > > > Soumya AR writes: >> The ASRD instruction on SVE performs an arithmetic shift right by an >> immediate >> for divide. >> >> This patch enables the use of ASRD with Neon modes. >> >> For example: >> >> int in[N], out[N]; >> >> void >> foo (void) >> { >> for (int i = 0; i < N; i++) >>out[i] = in[i] / 4; >> } >> >> compiles to: >> >> ldr q31, [x1, x0] >> cmltv30.16b, v31.16b, #0 >> and z30.b, z30.b, 3 >> add v30.16b, v30.16b, v31.16b >> sshrv30.16b, v30.16b, 2 >> str q30, [x0, x2] >> add x0, x0, 16 >> cmp x0, 1024 >> >> but can just be: >> >> ldp q30, q31, [x0], 32 >> asrdz31.b, p7/m, z31.b, #2 >> asrdz30.b, p7/m, z30.b, #2 >> stp q30, q31, [x1], 32 >> cmp x0, x2 >> >> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. >> OK for mainline? >> >> Signed-off-by: Soumya AR >> >> gcc/ChangeLog: >> >> * config/aarch64/aarch64-sve.md: Extended sdiv_pow23 and >> *sdiv_pow23 to support Neon modes. >> >> gcc/testsuite/ChangeLog: >> >> * gcc.target/aarch64/sve/sve-asrd.c: New test. >> --- >> gcc/config/aarch64/aarch64-sve.md | 25 - >> .../gcc.target/aarch64/sve/sve-asrd.c | 54 +++ >> 2 files changed, 67 insertions(+), 12 deletions(-) >> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c > > Rearranging to put the testcase first: > >> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c >> b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c >> new file mode 100644 >> index 000..00aa8b2380d >> --- /dev/null >> +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c >> @@ -0,0 +1,54 @@ >> +/* { dg-do compile } */ >> +/* { dg-options "-Ofast --param aarch64-autovec-preference=asimd-only" } */ >> +/* { dg-final { check-function-bodies "**" "" "" } } */ >> + >> +#include >> +#define N 1024 >> + >> +#define FUNC(M) \ >> +M in_##M[N];\ >> +M out_##M[N]; \ >> +void asrd_##M() { \ >> + for (int i = 0; i < N; i++) \ >> +out_##M[i] = in_##M[i] / 4; \ >> +} >> + >> +/* >> +** asrd_int8_t: >> +** ... >> +** ptrue (p[0-7]).b, vl1 >> +** ... >> +** asrdz[0-9]+\.b, \1/m, z[0-9]+\.b, #2 >> +** ... >> +*/ >> +FUNC(int8_t) > > ptrue ..., vl1 will only do the shift/division on the first element. > It should instead be vl8 for 64-bit modes and vl16 for 128-bit modes. > >> + >> +/* >> +** asrd_int16_t: >> +** ... >> +** ptrue (p[0-7]).b, vl2 >> +** ... >> +** asrdz[0-9]+\.h, \1/m, z[0-9]+\.h, #2 >> +** ... >> +*/ >> +FUNC(int16_t) >> + >> +/* >> +** asrd_int32_t: >> +** ... >> +** ptrue (p[0-7]).b, vl4 >> +** ... >> +** asrdz[0-9]+\.s, \1/m, z[0-9]+\.s, #2 >> +** ... >> +*/ >> +FUNC(int32_t) >> + >> +/* >> +** asrd_int64_t: >> +** ... >> +** ptrue (p[0-7]).b, vl8 >> +** ... >> +** asrdz[0-9]+\.d, \1/m, z[0-9]+\.d, #2 >> +** ... >> +*/ >> +FUNC(int64_t) >> diff --git a/gcc/config/aarch64/aarch64-sve.md >> b/gcc/config/aarch64/aarch64-sve.md >> index affdb24a93d..96effe4abed 100644 >> --- a/gcc/config/aarch64/aarch64-sve.md >> +++ b/gcc/config/aarch64/aarch64-sve.md >> @@ -4972,34 +4972,35 @@ >> >> ;; Unpredicated ASRD. 
>> (define_expand "sdiv_pow23" >> - [(set (match_operand:SVE_I 0 "register_operand") >> - (unspec:SVE_I >> + [(set (match_operand:SVE_VDQ_I 0 "register_operand") >> + (unspec:SVE_VDQ_I >>[(match_dup 3) >> -(unspec:SVE_I >> - [(match_operand:SVE_I 1 "register_operand") >> +(unspec:SVE_VDQ_
Re: [PATCH] aarch64: Extend SVE2 bit-select instructions for Neon modes.
> On 5 Dec 2024, at 10:25 PM, Richard Sandiford > wrote: > > External email: Use caution opening links or attachments > > > Soumya AR writes: >> NBSL, BSL1N, and BSL2N are bit-select intructions on SVE2 with certain >> operands >> inverted. These can be extended to work with Neon modes. >> >> Since these instructions are unpredicated, duplicate patterns were added with >> the predicate removed to generate these instructions for Neon modes. >> >> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. >> OK for mainline? >> >> Signed-off-by: Soumya AR >> >> gcc/ChangeLog: >> >> * config/aarch64/aarch64-sve2.md >> (*aarch64_sve2_nbsl_unpred): New pattern to match unpredicated >> form. >> (*aarch64_sve2_bsl1n_unpred): Likewise. >> (*aarch64_sve2_bsl2n_unpred): Likewise. >> >> gcc/testsuite/ChangeLog: >> >> * gcc.target/aarch64/sve/bitsel.c: New test. >> >> From 4be7a0ba40150322e2f67214f54c55da51ad2fd2 Mon Sep 17 00:00:00 2001 >> From: Soumya AR >> Date: Mon, 25 Nov 2024 11:25:44 +0530 >> Subject: [PATCH] aarch64: Extend SVE2 bit-select instructions for Neon modes. >> >> NBSL, BSL1N, and BSL2N are bit-select intructions on SVE2 with certain >> operands >> inverted. These can be extended to work with Neon modes. >> >> Since these instructions are unpredicated, duplicate patterns were added with >> the predicate removed to generate these instructions for Neon modes. >> >> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. >> OK for mainline? >> >> Signed-off-by: Soumya AR >> >> gcc/ChangeLog: >> >> * config/aarch64/aarch64-sve2.md >> (*aarch64_sve2_nbsl_unpred): New pattern to match unpredicated >> form. >> (*aarch64_sve2_bsl1n_unpred): Likewise. >> (*aarch64_sve2_bsl2n_unpred): Likewise. >> >> gcc/testsuite/ChangeLog: >> >> * gcc.target/aarch64/sve/bitsel.c: New test. > > Looks good, but some comments about the formatting: > >> --- >> gcc/config/aarch64/aarch64-sve2.md| 71 +++ >> gcc/testsuite/gcc.target/aarch64/sve/bitsel.c | 35 + >> 2 files changed, 106 insertions(+) >> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/bitsel.c >> >> diff --git a/gcc/config/aarch64/aarch64-sve2.md >> b/gcc/config/aarch64/aarch64-sve2.md >> index e67421bad84..6d38c40e573 100644 >> --- a/gcc/config/aarch64/aarch64-sve2.md >> +++ b/gcc/config/aarch64/aarch64-sve2.md >> @@ -1696,6 +1696,23 @@ >> } >> ) >> >> +(define_insn "*aarch64_sve2_nbsl_unpred" >> + [(set (match_operand:VDQ_I 0 "register_operand") >> +(not:VDQ_I >> + (xor:VDQ_I >> +(and:VDQ_I >> + (xor:VDQ_I >> +(match_operand:VDQ_I 1 "register_operand") >> +(match_operand:VDQ_I 2 "register_operand")) >> + (match_operand:VDQ_I 3 "register_operand")) >> +(match_dup BSL_DUP] >> + "TARGET_SVE2" >> + {@ [ cons: =0 , 1 , 2 , 3 ; attrs: movprfx ] >> + [ w, , , w ; * ] nbsl\t%Z0.d, >> %Z0.d, %Z.d, %Z3.d >> + [ ?&w , w , w , w ; yes] >> movprfx\t%Z0, %Z\;nbsl\t%Z0.d, %Z0.d, %Z.d, %Z3.d >> + } >> +) >> + >> ;; Unpredicated bitwise select with inverted first operand. >> ;; (op3 ? ~bsl_mov : bsl_dup) == ((~(bsl_mov ^ bsl_dup) & op3) ^ bsl_dup) >> (define_expand "@aarch64_sve2_bsl1n" >> @@ -1741,6 +1758,23 @@ >> } >> ) >> >> +(define_insn "*aarch64_sve2_bsl1n_unpred" >> + [(set (match_operand:VDQ_I 0 "register_operand") >> + (xor:VDQ_I >> + (and:VDQ_I >> + (not:VDQ_I >> + (xor:VDQ_I >> +(match_operand:VDQ_I 1 "register_operand") >> +(match_operand:VDQ_I 2 "register_operand"))) > > Removing the unspecs from the SVE patterns leaves these operations > indented by an unusual amount. 
Generally: > > - if the first operand is on the same line as the operator (as with > the set above), the other operands are indented to be underneath > the first operand. > > - if an operator appears on a line by itself, the operands are generally > indented by two spaces more than the operator. &
[PATCH] match.pd: Fix indefinite recursion during exp-log transformations [PR118490]
This patch fixes the ICE caused when comparing log or exp of a constant with another constant. The transform is now restricted to cases where the resultant log/exp (CST) can be constant folded. Signed-off-by: Soumya AR gcc/ChangeLog: PR target/118490 * match.pd: Added ! to verify that log/exp (CST) can be constant folded. gcc/testsuite/ChangeLog: PR target/118490 * gcc.dg/pr118490.c: New test. --- gcc/match.pd | 4 ++-- gcc/testsuite/gcc.dg/pr | 0 gcc/testsuite/gcc.dg/pr118490.c | 7 +++ 3 files changed, 9 insertions(+), 2 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/pr create mode 100644 gcc/testsuite/gcc.dg/pr118490.c diff --git a/gcc/match.pd b/gcc/match.pd index b6cbb851897..fd1ddf627bf 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -8317,12 +8317,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) /* Simplify logN (x) CMP CST into x CMP expN (CST) */ (simplify (cmp:c (logs:s @0) REAL_CST@1) - (cmp @0 (exps @1))) + (cmp @0 (exps! @1))) /* Simplify expN (x) CMP CST into x CMP logN (CST) */ (simplify (cmp:c (exps:s @0) REAL_CST@1) - (cmp @0 (logs @1)) + (cmp @0 (logs! @1)) (for logs (LOG LOG2 LOG10 LOG10) exps (EXP EXP2 EXP10 POW10) diff --git a/gcc/testsuite/gcc.dg/pr b/gcc/testsuite/gcc.dg/pr new file mode 100644 index 000..e69de29bb2d diff --git a/gcc/testsuite/gcc.dg/pr118490.c b/gcc/testsuite/gcc.dg/pr118490.c new file mode 100644 index 000..4ae0dacefee --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr118490.c @@ -0,0 +1,7 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ffast-math -frounding-math -Wlogical-op" } */ + +double exp(double); +int foo(int v) { + return v && exp(1.) < 2.; +} -- 2.43.2
Re: [PATCH] aarch64: Use generic_armv8_a_prefetch_tune in generic_armv8_a.h
> On 18 Feb 2025, at 2:27 PM, Kyrylo Tkachov wrote: > > > >> On 18 Feb 2025, at 09:48, Kyrylo Tkachov wrote: >> >> >> >>> On 18 Feb 2025, at 09:41, Richard Sandiford >>> wrote: >>> >>> Kyrylo Tkachov writes: >>>> Hi Soumya >>>> >>>>> On 18 Feb 2025, at 09:12, Soumya AR wrote: >>>>> >>>>> generic_armv8_a.h defines generic_armv8_a_prefetch_tune but still uses >>>>> generic_prefetch_tune in generic_armv8_a_tunings. >>>>> >>>>> This patch updates the pointer to generic_armv8_a_prefetch_tune. >>>>> >>>>> This patch was bootstrapped and regtested on aarch64-linux-gnu, no >>>>> regression. >>>>> >>>>> Ok for GCC 15 now? >>>> >>>> Yes, this looks like a simple oversight. >>>> Ok to push to master. >>> >>> I suppose the alternative would be to remove generic_armv8_a_prefetch_tune, >>> since it's (deliberately) identical to generic_prefetch_tune. >> >> Looks like we have one prefetch_tune structure for each of the generic >> tunings (generic, generic_armv8_a, generic_armv9_a). >> For the sake of symmetry it feels a bit better to have them independently >> tunable. >> But as the effects are the same, it may be better to remove it in the >> interest of less code. >> > > I see Soumya has already pushed her patch. I’m okay with either approach tbh, > but if Richard prefers we can remove generic_armv8_a_prefetch_tune in a > separate commit. Yeah, missed Richard’s mail. Let me know which is preferable, thanks. Best, Soumya > Thanks, > Kyrill > > >> Thanks, >> Kyrill >> >>> >>>> Thanks, >>>> Kyrill >>>> >>>>> >>>>> Signed-off-by: Soumya AR >>>>> >>>>> gcc/ChangeLog: >>>>> >>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Updated prefetch >>>>> struct pointer. >>>>> >>>>> --- >>>>> gcc/config/aarch64/tuning_models/generic_armv8_a.h | 2 +- >>>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>>> >>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>> index 35de3f03296..01080cade46 100644 >>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>> @@ -184,7 +184,7 @@ static const struct tune_params >>>>> generic_armv8_a_tunings = >>>>> (AARCH64_EXTRA_TUNE_BASE >>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>> - &generic_prefetch_tune, >>>>> + &generic_armv8_a_prefetch_tune, >>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>> AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model. */ >>>>> }; >>>>> -- >>>>> 2.34.1
[PATCH] aarch64: Use generic_armv8_a_prefetch_tune in generic_armv8_a.h
generic_armv8_a.h defines generic_armv8_a_prefetch_tune but still uses generic_prefetch_tune in generic_armv8_a_tunings.

This patch updates the pointer to generic_armv8_a_prefetch_tune.

This patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.

Ok for GCC 15 now?

Signed-off-by: Soumya AR

gcc/ChangeLog:

	* config/aarch64/tuning_models/generic_armv8_a.h: Updated prefetch
	struct pointer.

---
 gcc/config/aarch64/tuning_models/generic_armv8_a.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
index 35de3f03296..01080cade46 100644
--- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
+++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
@@ -184,7 +184,7 @@ static const struct tune_params generic_armv8_a_tunings =
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags. */
-  &generic_prefetch_tune,
+  &generic_armv8_a_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,	/* ldp_policy_model. */
   AARCH64_LDP_STP_POLICY_ALWAYS	/* stp_policy_model. */
 };
-- 
2.34.1