[PATCH] aarch64: Emit ADD X, Y, Y instead of SHL X, Y, #1 for SVE instructions

2024-09-10 Thread Soumya AR
On Neoverse V2, SVE ADD instructions have a throughput of 4, while shift
instructions like SHL have a throughput of 2. We can lean on that to emit code
like:
 add  z31.b, z31.b, z31.b
instead of:
 lsl  z31.b, z31.b, #1

The implementation of this change for SVE vectors is similar to a prior patch
<https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659958.html> that adds
the above functionality for Neon vectors.

Here, the machine description pattern is split up to separately accommodate left
and right shifts, so we can specifically emit an add for all left shifts by 1.
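
For illustration, here is a minimal sketch of the kind of source this affects
(the function name and compile options are illustrative, not taken from the
patch):

#include <stdint.h>

void
shift_by_one (uint8_t *restrict x, uint8_t *restrict y, int n)
{
  /* When vectorised for SVE, the shift by 1 can now be emitted as
     add z.b, z.b, z.b instead of lsl z.b, z.b, #1.  */
  for (int i = 0; i < n; i++)
    x[i] = y[i] << 1;
}

(Compiled with SVE auto-vectorisation enabled, e.g. -O3 -march=armv8.2-a+sve;
the exact code generated depends on the target and tuning.)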

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR 

gcc/ChangeLog:

* config/aarch64/aarch64-sve.md (*post_ra_v3): Split pattern to
accommodate left and right shifts separately.
(*post_ra_v_ashl3): Matches left shifts with an additional constraint
to check for shifts by 1.
(*post_ra_v_3): Matches right shifts.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/acle/asm/lsl_s16.c: Updated instances of lsl
by 1 with the corresponding add.
* gcc.target/aarch64/sve/acle/asm/lsl_s32.c: Likewise. 
* gcc.target/aarch64/sve/acle/asm/lsl_s64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_s8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_u16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_u64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_u8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_s16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_s32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_s8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_u16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_u8.c: Likewise.
* gcc.target/aarch64/sve/adr_1.c: Likewise.
* gcc.target/aarch64/sve/adr_6.c: Likewise.
* gcc.target/aarch64/sve/cond_mla_7.c: Likewise.
* gcc.target/aarch64/sve/cond_mla_8.c: Likewise.
* gcc.target/aarch64/sve/shift_2.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_s64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_u64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_s64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_u64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_s16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_s32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_s64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_s8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_u16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_u32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_u64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_u8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_s64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_u64.c: Likewise.
* gcc.target/aarch64/sve/sve_shl_add.c: New test.



0001-aarch64-Emit-ADD-X-Y-Y-instead-of-SHL-X-Y-1-for-SVE-.patch
Description: 0001-aarch64-Emit-ADD-X-Y-Y-instead-of-SHL-X-Y-1-for-SVE-.patch


Re: [PATCH] aarch64: Emit ADD X, Y, Y instead of SHL X, Y, #1 for SVE instructions

2024-09-16 Thread Soumya AR


> On 12 Sep 2024, at 7:22 PM, Richard Sandiford  
> wrote:
> 
> Richard Biener  writes:
>> On Thu, Sep 12, 2024 at 2:35 PM Richard Sandiford
>>  wrote:
>>> 
>>> Soumya AR  writes:
>>>> On Neoverse V2, SVE ADD instructions have a throughput of 4, while shift
>>>> instructions like SHL have a throughput of 2. We can lean on that to emit 
>>>> code
>>>> like:
>>>> add  z31.b, z31.b, z31.b
>>>> instead of:
>>>> lsl  z31.b, z31.b, #1
>>>> 
>>>> The implementation of this change for SVE vectors is similar to a prior 
>>>> patch
>>>> <https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659958.html> that 
>>>> adds
>>>> the above functionality for Neon vectors.
>>>> 
>>>> Here, the machine descriptor pattern is split up to separately accommodate 
>>>> left
>>>> and right shifts, so we can specifically emit an add for all left shifts 
>>>> by 1.
>>> 
>>> Thanks for doing this.
>> 
>> I do wonder whether our scheduling infrastructure has the ability to "mutate"
>> instructions in cases like here if either adds or shifts exceed their
>> available resources
>> but there is a resource readily available in an alternate instruction form?
> 
> Yeah, that sounds like a useful feature in general.  But in this particular
> case, the shift resources are a subset of the addition resources, so there
> should never be a specific advantage to using shifts.
> 
> Thanks,
> Richard
> 
>> Richard.
>> 
>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>>> regression.
>>>> OK for mainline?
>>>> 
>>>> Signed-off-by: Soumya AR 
>>>> 
>>>> gcc/ChangeLog:
>>>> 
>>>>  * config/aarch64/aarch64-sve.md (*post_ra_v3): Split 
>>>> pattern to
>>>>  accomodate left and right shifts separately.
>>>>  (*post_ra_v_ashl3): Matches left shifts with additional 
>>>> constraint to
>>>>  check for shifts by 1.
>>>>  (*post_ra_v_3): Matches right shifts.
>>>> 
>>>> gcc/testsuite/ChangeLog:
>>>> 
>>>>  * gcc.target/aarch64/sve/acle/asm/lsl_s16.c: Updated instances of 
>>>> lsl-1 with
>>>>  corresponding add
>>>>  * gcc.target/aarch64/sve/acle/asm/lsl_s32.c: Likewise.
>>>>  * gcc.target/aarch64/sve/acle/asm/lsl_s64.c: Likewise.
>>>>  * gcc.target/aarch64/sve/acle/asm/lsl_s8.c: Likewise.
>>>>  * gcc.target/aarch64/sve/acle/asm/lsl_u16.c: Likewise.
>>>>  * gcc.target/aarch64/sve/acle/asm/lsl_u32.c: Likewise.
>>>>  * gcc.target/aarch64/sve/acle/asm/lsl_u64.c: Likewise.
>>>>  * gcc.target/aarch64/sve/acle/asm/lsl_u8.c: Likewise.
>>>>  * gcc.target/aarch64/sve/acle/asm/lsl_wide_s16.c: Likewise.
>>>>  * gcc.target/aarch64/sve/acle/asm/lsl_wide_s32.c: Likewise.
>>>>  * gcc.target/aarch64/sve/acle/asm/lsl_wide_s8.c: Likewise.
>>>>  * gcc.target/aarch64/sve/acle/asm/lsl_wide_u16.c: Likewise.
>>>>  * gcc.target/aarch64/sve/acle/asm/lsl_wide_u32.c: Likewise.
>>>>  * gcc.target/aarch64/sve/acle/asm/lsl_wide_u8.c: Likewise.
>>>>  * gcc.target/aarch64/sve/adr_1.c: Likewise.
>>>>  * gcc.target/aarch64/sve/adr_6.c: Likewise.
>>>>  * gcc.target/aarch64/sve/cond_mla_7.c: Likewise.
>>>>  * gcc.target/aarch64/sve/cond_mla_8.c: Likewise.
>>>>  * gcc.target/aarch64/sve/shift_2.c: Likewise.
>>>>  * gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather Likewise.
>>>>  * gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_u64.c: Likewise.
>>>>  * gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_s64.c: Likewise.
>>>>  * gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_u64.c: Likewise.
>>>>  * gcc.target/aarch64/sve2/acle/asm/rshl_s16.c: Likewise.
>>>>  * gcc.target/aarch64/sve2/acle/asm/rshl_s32.c: Likewise.
>>>>  * gcc.target/aarch64/sve2/acle/asm/rshl_s64.c: Likewise.
>>>>  * gcc.target/aarch64/sve2/acle/asm/rshl_s8.c: Likewise.
>>>>  * gcc.target/aarch64/sve2/acle/asm/rshl_u16.c: Likewise.
>>>>  * gcc.target/aarch64/sve2/acle/asm/rshl_u32.c: Likewise.
>>>>  * gcc.target/aarch64/sve2/acle/asm/rshl_u64.c: 

Re: [PATCH] aarch64: Expand CTZ to RBIT + CLZ for SVE [PR109498]

2024-10-04 Thread Soumya AR


> On 1 Oct 2024, at 6:17 PM, Richard Sandiford  
> wrote:
>
> Soumya AR  writes:
>> Currently, we vectorize CTZ for SVE by using the following operation:
>> .CTZ (X) = (PREC - 1) - .CLZ (X & -X)
>>
>> Instead, this patch expands CTZ to RBIT + CLZ for SVE, as suggested in 
>> PR109498.
>>
>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
>> OK for mainline?
>>
>> Signed-off-by: Soumya AR 
>>
>> gcc/ChangeLog:
>>  PR target/109498
>>  * config/aarch64/aarch64-sve.md (ctz2): Added pattern to expand
>>CTZ to RBIT + CLZ for SVE.
>>
>> gcc/testsuite/ChangeLog:
>>  PR target/109498
>>  * gcc.target/aarch64/sve/ctz.c: New test.
>
> Generally looks good, but a couple of comments:
>
>> ---
>> gcc/config/aarch64/aarch64-sve.md  | 16 +++
>> gcc/testsuite/gcc.target/aarch64/sve/ctz.c | 49 ++
>> 2 files changed, 65 insertions(+)
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/ctz.c
>>
>> diff --git a/gcc/config/aarch64/aarch64-sve.md 
>> b/gcc/config/aarch64/aarch64-sve.md
>> index bfa28849adf..10094f156b3 100644
>> --- a/gcc/config/aarch64/aarch64-sve.md
>> +++ b/gcc/config/aarch64/aarch64-sve.md
>> @@ -3088,6 +3088,22 @@
>> ;; - NOT
>> ;; -
>>
>> +(define_expand "ctz2"
>> +  [(set (match_operand:SVE_I 0 "register_operand")
>> + (unspec:SVE_I
>> +   [(match_dup 2)
>> +(ctz:SVE_I
>> +  (match_operand:SVE_I 1 "register_operand"))]
>> +   UNSPEC_PRED_X))]
>> +  "TARGET_SVE"
>> +  {
>> + operands[2] = aarch64_ptrue_reg (mode);
>
> There's no real need to use operands[...] here.  It can just be
> a local variable.
>
>> + emit_insn (gen_aarch64_pred_rbit (operands[0], 
>> operands[2],operands[1]));
>> + emit_insn (gen_aarch64_pred_clz (operands[0], operands[2], 
>> operands[0]));
>
> Formatting nit: C++ lines should be 80 characters or fewer.
>
> More importantly, I think we should use a fresh register for the
> temporary (RBIT) result, since that tends to permit more optimisation.

Thanks for the feedback! Attaching an updated patch with the suggested changes.

Regards,
Soumya

> Thanks,
> Richard
>
>
>
>> + DONE;
>> +  }
>> +)
>> +
>> ;; Unpredicated integer unary arithmetic.
>> (define_expand "2"
>>   [(set (match_operand:SVE_I 0 "register_operand")
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/ctz.c 
>> b/gcc/testsuite/gcc.target/aarch64/sve/ctz.c
>> new file mode 100644
>> index 000..433a9174f48
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/ctz.c
>> @@ -0,0 +1,49 @@
>> +/* { dg-final { check-function-bodies "**" "" } } */
>> +/* { dg-options "-O3 --param aarch64-autovec-preference=sve-only" } */
>> +
>> +#include <stdint.h>
>> +
>> +#define FUNC(FUNCTION, NAME, DTYPE) \
>> +void \
>> +NAME (DTYPE *__restrict x, DTYPE *__restrict y, int n) { \
>> +  for (int i = 0; i < n; i++)\
>> +x[i] = FUNCTION (y[i]); \
>> +}\
>> +
>> +
>> +/*
>> +** ctz_uint8:
>> +**   ...
>> +**   rbitz[0-9]+\.b, p[0-7]/m, z[0-9]+\.b
>> +**   clz z[0-9]+\.b, p[0-7]/m, z[0-9]+\.b
>> +**   ...
>> +*/
>> +FUNC (__builtin_ctzg, ctz_uint8, uint8_t)
>> +
>> +/*
>> +** ctz_uint16:
>> +**   ...
>> +**   rbitz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h
>> +**   clz z[0-9]+\.h, p[0-7]/m, z[0-9]+\.h
>> +**   ...
>> +*/
>> +FUNC (__builtin_ctzg, ctz_uint16, uint16_t)
>> +
>> +/*
>> +** ctz_uint32:
>> +**   ...
>> +**   rbitz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s
>> +**   clz z[0-9]+\.s, p[0-7]/m, z[0-9]+\.s
>> +**   ...
>> +*/
>> +FUNC (__builtin_ctz, ctz_uint32, uint32_t)
>> +
>> +/*
>> +** ctz_uint64:
>> +**   ...
>> +**   rbitz[0-9]+\.d, p[0-7]/m, z[0-9]+\.d
>> +**   clz z[0-9]+\.d, p[0-7]/m, z[0-9]+\.d
>> +**   ...
>> +*/
>> +FUNC (__builtin_ctzll, ctz_uint64, uint64_t)
>> +
>> --
>> 2.43.2






0001-aarch64-Expand-CTZ-to-RBIT-CLZ-for-SVE-PR109498.patch
Description: 0001-aarch64-Expand-CTZ-to-RBIT-CLZ-for-SVE-PR109498.patch


Re: [PATCH] aarch64: Expand CTZ to RBIT + CLZ for SVE [PR109498]

2024-09-29 Thread Soumya AR
Reworked the patch to substitute immediate register values in the test case with
regular expressions. Apologies for the oversight.

Thanks,
Soumya



> On 24 Sep 2024, at 8:53 AM, Soumya AR  wrote:
>
> Currently, we vectorize CTZ for SVE by using the following operation:
> .CTZ (X) = (PREC - 1) - .CLZ (X & -X)
>
> Instead, this patch expands CTZ to RBIT + CLZ for SVE, as suggested in 
> PR109498.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Soumya AR 
>
> gcc/ChangeLog:
> PR target/109498
> * config/aarch64/aarch64-sve.md (ctz2): Added pattern to expand
>CTZ to RBIT + CLZ for SVE.
>
> gcc/testsuite/ChangeLog:
> PR target/109498
> * gcc.target/aarch64/sve/ctz.c: New test.
>



0001-aarch64-Expand-CTZ-to-RBIT-CLZ-for-SVE-PR109498.patch
Description: 0001-aarch64-Expand-CTZ-to-RBIT-CLZ-for-SVE-PR109498.patch




Re: SVE intrinsics: Fold constant operands for svlsl.

2024-10-21 Thread Soumya AR
Hi,

> On 17 Oct 2024, at 12:38 PM, Kyrylo Tkachov  wrote:
>
> Hi Soumya
>
>> On 17 Oct 2024, at 06:10, Soumya AR  wrote:
>>
>> Hi Richard,
>>
>> Thanks for the feedback. I’ve updated the patch with the suggested change.
>> Ok for mainline?
>>
>> Best,
>> Soumya
>>
>>> On 14 Oct 2024, at 6:40 PM, Richard Sandiford  
>>> wrote:
>>>
>>> Soumya AR  writes:
>>>> This patch implements constant folding for svlsl. Test cases have been 
>>>> added to
>>>> check for the following cases:
>>>>
>>>> Zero, merge, and don't care predication.
>>>> Shift by 0.
>>>> Shift by register width.
>>>> Overflow shift on signed and unsigned integers.
>>>> Shift on a negative integer.
>>>> Maximum possible shift, eg. shift by 7 on an 8-bit integer.
>>>>
>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>>> regression.
>>>> OK for mainline?
>>>>
>>>> Signed-off-by: Soumya AR 
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>> * config/aarch64/aarch64-sve-builtins-base.cc (svlsl_impl::fold):
>>>> Try constant folding.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>> * gcc.target/aarch64/sve/const_fold_lsl_1.c: New test.
>>>>
>>>> From 0cf5223e51623dcdbc47a06cbd17d927c74094e2 Mon Sep 17 00:00:00 2001
>>>> From: Soumya AR 
>>>> Date: Tue, 24 Sep 2024 09:09:32 +0530
>>>> Subject: [PATCH] SVE intrinsics: Fold constant operands for svlsl.
>>>>
>>>> This patch implements constant folding for svlsl. Test cases have been 
>>>> added to
>>>> check for the following cases:
>>>>
>>>> Zero, merge, and don't care predication.
>>>> Shift by 0.
>>>> Shift by register width.
>>>> Overflow shift on signed and unsigned integers.
>>>> Shift on a negative integer.
>>>> Maximum possible shift, eg. shift by 7 on an 8-bit integer.
>>>>
>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>>> regression.
>>>> OK for mainline?
>>>>
>>>> Signed-off-by: Soumya AR 
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>> * config/aarch64/aarch64-sve-builtins-base.cc (svlsl_impl::fold):
>>>> Try constant folding.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>> * gcc.target/aarch64/sve/const_fold_lsl_1.c: New test.
>>>> ---
>>>> .../aarch64/aarch64-sve-builtins-base.cc  |  15 +-
>>>> .../gcc.target/aarch64/sve/const_fold_lsl_1.c | 133 ++
>>>> 2 files changed, 147 insertions(+), 1 deletion(-)
>>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c
>>>>
>>>> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
>>>> b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
>>>> index afce52a7e8d..be5d6eae525 100644
>>>> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
>>>> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
>>>> @@ -1893,6 +1893,19 @@ public:
>>>>  }
>>>> };
>>>>
>>>> +class svlsl_impl : public rtx_code_function
>>>> +{
>>>> +public:
>>>> +  CONSTEXPR svlsl_impl ()
>>>> +: rtx_code_function (ASHIFT, ASHIFT) {}
>>>> +
>>>> +  gimple *
>>>> +  fold (gimple_folder &f) const override
>>>> +  {
>>>> +return f.fold_const_binary (LSHIFT_EXPR);
>>>> +  }
>>>> +};
>>>> +
>>>
>>> Sorry for the slow review.  I think we should also make aarch64_const_binop
>>> return 0 for LSHIFT_EXPR when the shift is out of range, to match the
>>> behaviour of the underlying instruction.
>>>
>>> It looks good otherwise.
>>>
>>> Thanks,
>>> Richard
>>>
>>
>
> In the test case:
> +/*
> +** s64_x_bit_width:
> +** mov z[0-9]+\.b, #0
> +** ret
> +*/
> +svint64_t s64_x_bit_width (svbool_t pg) {
> +return svlsl_n_s64_x (pg, svdup_s64 (5), 64);
> +}
> +
> +/*
> +** s64_x_out_of_range:
> +** mov z[0-9]+\.b, #0
> +** ret
> +*/
>
> You’ll need to adjust the scan for register zeroing according to the upcoming 
> changes from Tamar as per:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665669.html
>
> The patch LGTM (but Richard should approve as it’s SVE-specific)
>

Adjusted this patch according to Tamar’s changes.
Ok for mainline?

Thanks,
Soumya



> Thanks,
> Kyrill



0001-SVE-intrinsics-Fold-constant-operands-for-svlsl.patch
Description: 0001-SVE-intrinsics-Fold-constant-operands-for-svlsl.patch


Re: SVE intrinsics: Fold constant operands for svlsl.

2024-10-16 Thread Soumya AR
Hi Richard,

Thanks for the feedback. I’ve updated the patch with the suggested change.
Ok for mainline?

Best,
Soumya

> On 14 Oct 2024, at 6:40 PM, Richard Sandiford  
> wrote:
>
> Soumya AR  writes:
>> This patch implements constant folding for svlsl. Test cases have been added 
>> to
>> check for the following cases:
>>
>> Zero, merge, and don't care predication.
>> Shift by 0.
>> Shift by register width.
>> Overflow shift on signed and unsigned integers.
>> Shift on a negative integer.
>> Maximum possible shift, eg. shift by 7 on an 8-bit integer.
>>
>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
>> OK for mainline?
>>
>> Signed-off-by: Soumya AR 
>>
>> gcc/ChangeLog:
>>
>>  * config/aarch64/aarch64-sve-builtins-base.cc (svlsl_impl::fold):
>>  Try constant folding.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.target/aarch64/sve/const_fold_lsl_1.c: New test.
>>
>> From 0cf5223e51623dcdbc47a06cbd17d927c74094e2 Mon Sep 17 00:00:00 2001
>> From: Soumya AR 
>> Date: Tue, 24 Sep 2024 09:09:32 +0530
>> Subject: [PATCH] SVE intrinsics: Fold constant operands for svlsl.
>>
>> This patch implements constant folding for svlsl. Test cases have been added 
>> to
>> check for the following cases:
>>
>> Zero, merge, and don't care predication.
>> Shift by 0.
>> Shift by register width.
>> Overflow shift on signed and unsigned integers.
>> Shift on a negative integer.
>> Maximum possible shift, eg. shift by 7 on an 8-bit integer.
>>
>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
>> OK for mainline?
>>
>> Signed-off-by: Soumya AR 
>>
>> gcc/ChangeLog:
>>
>>  * config/aarch64/aarch64-sve-builtins-base.cc (svlsl_impl::fold):
>>  Try constant folding.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.target/aarch64/sve/const_fold_lsl_1.c: New test.
>> ---
>> .../aarch64/aarch64-sve-builtins-base.cc  |  15 +-
>> .../gcc.target/aarch64/sve/const_fold_lsl_1.c | 133 ++
>> 2 files changed, 147 insertions(+), 1 deletion(-)
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c
>>
>> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
>> b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
>> index afce52a7e8d..be5d6eae525 100644
>> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
>> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
>> @@ -1893,6 +1893,19 @@ public:
>>   }
>> };
>>
>> +class svlsl_impl : public rtx_code_function
>> +{
>> +public:
>> +  CONSTEXPR svlsl_impl ()
>> +: rtx_code_function (ASHIFT, ASHIFT) {}
>> +
>> +  gimple *
>> +  fold (gimple_folder &f) const override
>> +  {
>> +return f.fold_const_binary (LSHIFT_EXPR);
>> +  }
>> +};
>> +
>
> Sorry for the slow review.  I think we should also make aarch64_const_binop
> return 0 for LSHIFT_EXPR when the shift is out of range, to match the
> behaviour of the underlying instruction.
>
> It looks good otherwise.
>
> Thanks,
> Richard
>
>> class svmad_impl : public function_base
>> {
>> public:
>> @@ -3199,7 +3212,7 @@ FUNCTION (svldnf1uh, svldxf1_extend_impl, 
>> (TYPE_SUFFIX_u16, UNSPEC_LDNF1))
>> FUNCTION (svldnf1uw, svldxf1_extend_impl, (TYPE_SUFFIX_u32, UNSPEC_LDNF1))
>> FUNCTION (svldnt1, svldnt1_impl,)
>> FUNCTION (svlen, svlen_impl,)
>> -FUNCTION (svlsl, rtx_code_function, (ASHIFT, ASHIFT))
>> +FUNCTION (svlsl, svlsl_impl,)
>> FUNCTION (svlsl_wide, shift_wide, (ASHIFT, UNSPEC_ASHIFT_WIDE))
>> FUNCTION (svlsr, rtx_code_function, (LSHIFTRT, LSHIFTRT))
>> FUNCTION (svlsr_wide, shift_wide, (LSHIFTRT, UNSPEC_LSHIFTRT_WIDE))
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c 
>> b/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c
>> new file mode 100644
>> index 000..4299dbd850e
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c
>> @@ -0,0 +1,133 @@
>> +/* { dg-final { check-function-bodies "**" "" } } */
>> +/* { dg-options "-O2" } */
>> +
>> +#include "arm_sve.h"
>> +
>> +/*
>> +** s64_x:
>> +**   mov z[0-9]+\.d, #20
>> +**   ret
>> +*/
>> +svint64_t s64_x (svbool_t pg) {
>> +return svlsl_n_s64_x (

Re: [PATCH] SVE intrinsics: Fold constant operands for svlsl.

2024-10-13 Thread Soumya AR
Pinging with updated subject, had missed the [PATCH] header before.

Regards,
Soumya

> On 24 Sep 2024, at 2:00 PM, Soumya AR  wrote:
> 
> This patch implements constant folding for svlsl. Test cases have been added 
> to
> check for the following cases:
> 
> Zero, merge, and don't care predication.
> Shift by 0.
> Shift by register width.
> Overflow shift on signed and unsigned integers.
> Shift on a negative integer.
> Maximum possible shift, eg. shift by 7 on an 8-bit integer.
> 
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
> 
> Signed-off-by: Soumya AR 
> 
> gcc/ChangeLog:
> 
> * config/aarch64/aarch64-sve-builtins-base.cc (svlsl_impl::fold):
> Try constant folding.
> 
> gcc/testsuite/ChangeLog:
> 
> * gcc.target/aarch64/sve/const_fold_lsl_1.c: New test.
> 
> <0001-SVE-intrinsics-Fold-constant-operands-for-svlsl.patch>



Re: SVE intrinsics: Fold constant operands for svlsl.

2024-10-29 Thread Soumya AR


> On 24 Oct 2024, at 2:55 PM, Richard Sandiford  
> wrote:
> 
> Kyrylo Tkachov  writes:
>>> On 24 Oct 2024, at 10:39, Soumya AR  wrote:
>>> 
>>> Hi Richard,
>>> 
>>>> On 23 Oct 2024, at 5:58 PM, Richard Sandiford  
>>>> wrote:
>>>> 
>>>> Soumya AR  writes:
>>>>> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
>>>>> b/gcc/config/aarch64/aarch64-sve-builtins.cc
>>>>> index 41673745cfe..aa556859d2e 100644
>>>>> --- a/gcc/config/aarch64/aarch64-sve-builtins.cc
>>>>> +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
>>>>> @@ -1143,11 +1143,14 @@ aarch64_const_binop (enum tree_code code, tree 
>>>>> arg1, tree arg2)
>>>>>  tree type = TREE_TYPE (arg1);
>>>>>  signop sign = TYPE_SIGN (type);
>>>>>  wi::overflow_type overflow = wi::OVF_NONE;
>>>>> -
>>>>> +  unsigned int element_bytes = tree_to_uhwi (TYPE_SIZE_UNIT (type));
>>>>>  /* Return 0 for division by 0, like SDIV and UDIV do.  */
>>>>>  if (code == TRUNC_DIV_EXPR && integer_zerop (arg2))
>>>>> return arg2;
>>>>> -
>>>>> +  /* Return 0 if shift amount is out of range. */
>>>>> +  if (code == LSHIFT_EXPR
>>>>> +   && tree_to_uhwi (arg2) >= (element_bytes * BITS_PER_UNIT))
>>>> 
>>>> tree_to_uhwi is dangerous because a general shift might be negative
>>>> (even if these particular shift amounts are unsigned).  We should
>>>> probably also key off TYPE_PRECISION rather than TYPE_SIZE_UNIT.  So:
>>>> 
>>>>   if (code == LSHIFT_EXPR
>>>>   && wi::geu_p (wi::to_wide (arg2), TYPE_PRECISION (type)))
>>>> 
>>>> without the element_bytes variable.  Also: the indentation looks a bit off;
>>>> it should be tabs only followed by spaces only.
>>> 
>>> Thanks for the feedback, posting an updated patch with the suggested 
>>> changes.
>> 
>> Thanks Soumya, I’ve pushed this patch to trunk as commit 3e7549ece7c after 
>> adjusting
>> the ChangeLog slightly to start the lines with tabs instead of spaces.
> 
> Sorry Soumya, I forgot that you didn't have commit access yet.
> It's time you did though.  Could you follow the instructions
> on https://gcc.gnu.org/gitwrite.html ?  I'm happy to sponsor
> (and I'm sure Kyrill would be too).

Wow, that’s exciting! Kyrill has agreed to sponsor but thanks
nonetheless! :)

Best,
Soumya

> Thanks,
> Richard



Re: SVE intrinsics: Fold constant operands for svlsl.

2024-10-24 Thread Soumya AR
Hi Richard,

> On 23 Oct 2024, at 5:58 PM, Richard Sandiford  
> wrote:
>
> Soumya AR  writes:
>> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
>> b/gcc/config/aarch64/aarch64-sve-builtins.cc
>> index 41673745cfe..aa556859d2e 100644
>> --- a/gcc/config/aarch64/aarch64-sve-builtins.cc
>> +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
>> @@ -1143,11 +1143,14 @@ aarch64_const_binop (enum tree_code code, tree arg1, 
>> tree arg2)
>>   tree type = TREE_TYPE (arg1);
>>   signop sign = TYPE_SIGN (type);
>>   wi::overflow_type overflow = wi::OVF_NONE;
>> -
>> +  unsigned int element_bytes = tree_to_uhwi (TYPE_SIZE_UNIT (type));
>>   /* Return 0 for division by 0, like SDIV and UDIV do.  */
>>   if (code == TRUNC_DIV_EXPR && integer_zerop (arg2))
>>  return arg2;
>> -
>> +  /* Return 0 if shift amount is out of range. */
>> +  if (code == LSHIFT_EXPR
>> +   && tree_to_uhwi (arg2) >= (element_bytes * BITS_PER_UNIT))
>
> tree_to_uhwi is dangerous because a general shift might be negative
> (even if these particular shift amounts are unsigned).  We should
> probably also key off TYPE_PRECISION rather than TYPE_SIZE_UNIT.  So:
>
>if (code == LSHIFT_EXPR
>&& wi::geu_p (wi::to_wide (arg2), TYPE_PRECISION (type)))
>
> without the element_bytes variable.  Also: the indentation looks a bit off;
> it should be tabs only followed by spaces only.

Thanks for the feedback, posting an updated patch with the suggested changes.

Thanks,
Soumya

> OK with those change, thanks.
>
> Richard
>
>
>> + return build_int_cst (type, 0);
>>   if (!poly_int_binop (poly_res, code, arg1, arg2, sign, &overflow))
>>  return NULL_TREE;
>>   return force_fit_type (type, poly_res, false,
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c 
>> b/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c
>> new file mode 100644
>> index 000..6109558001a
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c
>> @@ -0,0 +1,142 @@
>> +/* { dg-final { check-function-bodies "**" "" } } */
>> +/* { dg-options "-O2" } */
>> +
>> +#include "arm_sve.h"
>> +
>> +/*
>> +** s64_x:
>> +**   mov z[0-9]+\.d, #20
>> +**   ret
>> +*/
>> +svint64_t s64_x (svbool_t pg) {
>> +return svlsl_n_s64_x (pg, svdup_s64 (5), 2);
>> +}
>> +
>> +/*
>> +** s64_x_vect:
>> +**   mov z[0-9]+\.d, #20
>> +**   ret
>> +*/
>> +svint64_t s64_x_vect (svbool_t pg) {
>> +return svlsl_s64_x (pg, svdup_s64 (5), svdup_u64 (2));
>> +}
>> +
>> +/*
>> +** s64_z:
>> +**   mov z[0-9]+\.d, p[0-7]/z, #20
>> +**   ret
>> +*/
>> +svint64_t s64_z (svbool_t pg) {
>> +return svlsl_n_s64_z (pg, svdup_s64 (5), 2);
>> +}
>> +
>> +/*
>> +** s64_z_vect:
>> +**   mov z[0-9]+\.d, p[0-7]/z, #20
>> +**   ret
>> +*/
>> +svint64_t s64_z_vect (svbool_t pg) {
>> +return svlsl_s64_z (pg, svdup_s64 (5), svdup_u64 (2));
>> +}
>> +
>> +/*
>> +** s64_m_ptrue:
>> +**   mov z[0-9]+\.d, #20
>> +**   ret
>> +*/
>> +svint64_t s64_m_ptrue () {
>> +return svlsl_n_s64_m (svptrue_b64 (), svdup_s64 (5), 2);
>> +}
>> +
>> +/*
>> +** s64_m_ptrue_vect:
>> +**   mov z[0-9]+\.d, #20
>> +**   ret
>> +*/
>> +svint64_t s64_m_ptrue_vect () {
>> +return svlsl_s64_m (svptrue_b64 (), svdup_s64 (5), svdup_u64 (2));
>> +}
>> +
>> +/*
>> +** s64_m_pg:
>> +**   mov z[0-9]+\.d, #5
>> +**   lsl z[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, #2
>> +**   ret
>> +*/
>> +svint64_t s64_m_pg (svbool_t pg) {
>> +return svlsl_n_s64_m (pg, svdup_s64 (5), 2);
>> +}
>> +
>> +/*
>> +** s64_m_pg_vect:
>> +**   mov z[0-9]+\.d, #5
>> +**   lsl z[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, #2
>> +**   ret
>> +*/
>> +svint64_t s64_m_pg_vect (svbool_t pg) {
>> +return svlsl_s64_m (pg, svdup_s64 (5), svdup_u64 (2));
>> +}
>> +
>> +/*
>> +** s64_x_0:
>> +**   mov z[0-9]+\.d, #5
>> +**   ret
>> +*/
>> +svint64_t s64_x_0 (svbool_t pg) {
>> +return svlsl_n_s64_x (pg, svdup_s64 (5), 0);
>> +}
>> +
>> +/*
>> +** s64_x_bit_width:
>> +**   movi?   [vdz]([0-9]+)\.?(

[PATCH v2] aarch64: Optimise calls to ldexp with SVE FSCALE instruction [PR111733]

2024-11-03 Thread Soumya AR
Changes since v1:

This revision makes use of the extended definition of aarch64_ptrue_reg to
generate predicate registers with the appropriate set bits.

Earlier, there was a suggestion to add support for half floats as well. I
extended the patch to include HFs but GCC still emits a libcall for ldexpf16.
For example, in the following case, the call does not lower to fscale:

_Float16 test_ldexpf16 (_Float16 x, int i) {
return __builtin_ldexpf16 (x, i);
}

Any suggestions as to why this may be?

——

This patch uses the FSCALE instruction provided by SVE to implement the
standard ldexp family of functions.

Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the
following code:

float
test_ldexpf (float x, int i)
{
return __builtin_ldexpf (x, i);
}

double
test_ldexp (double x, int i)
{
return __builtin_ldexp(x, i);
}

GCC Output:

test_ldexpf:
b ldexpf

test_ldexp:
b ldexp

Since SVE provides an FSCALE instruction, we can use it to process scalar
floats by moving them into a vector register and performing an FSCALE, similar
to how LLVM handles the ldexp builtin.

New Output:

test_ldexpf:
fmovs31, w0
ptrue   p7.b, vl4
fscale  z0.s, p7/m, z0.s, z31.s
ret

test_ldexp:
sxtwx0, w0
ptrue   p7.b, vl8
fmovd31, x0
fscale  z0.d, p7/m, z0.d, z31.d
ret

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR 

gcc/ChangeLog:

PR target/111733
* config/aarch64/aarch64-sve.md
(ldexp3): Added a new pattern to match ldexp calls with scalar
floating modes and expand to the existing pattern for FSCALE.
(@aarch64_pred_): Extended the pattern to accept SVE
operands as well as scalar floating modes.

* config/aarch64/iterators.md:
(SVE_FULL_F_SCALAR): Added an iterator to match all FP SVE modes as well
as HF, SF, and DF.
(VPRED): Extended the attribute to handle GPF_HF modes.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/fscale.c: New test.




0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch
Description: 0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch


Re: [PATCH] Match: Fold pow calls to ldexp when possible [PR57492]

2024-11-05 Thread Soumya AR


> On 29 Oct 2024, at 7:16 PM, Richard Biener  wrote:
>
> On Mon, 28 Oct 2024, Soumya AR wrote:
>
>> This patch transforms the following POW calls to equivalent LDEXP calls, as
>> discussed in PR57492:
>>
>> powi (2.0, i) -> ldexp (1.0, i)
>>
>> a * powi (2.0, i) -> ldexp (a, i)
>>
>> 2.0 * powi (2.0, i) -> ldexp (1.0, i + 1)
>>
>> pow (powof2, i) -> ldexp (1.0, i * log2 (powof2))
>>
>> powof2 * pow (2, i) -> ldexp (1.0, i + log2 (powof2))
>
> For the multiplication cases why not handle powof2 * ldexp (1., i)
> to ldexp (1., i + log2 (powof2)) and a * ldexp (1., i) -> ldexp (a, i)
> instead?  exp2 * ldexp (1., i) is another candidate.
>
> So please split out the multiplication parts.
>
> + /* Simplify pow (powof2, i) to ldexp (1, i * log2 (powof2)). */
>
> the below pattern handles POWI, not POW.
>
> + (simplify
> +  (POWI REAL_CST@0 @1)
> +  (with { HOST_WIDE_INT tmp = 0;
> + tree integer_arg1 = NULL_TREE; }
> +  (if (integer_valued_real_p (@0)
> +   && real_isinteger (&TREE_REAL_CST (@0), &tmp)
> +   && integer_pow2p (integer_arg1 = build_int_cst (integer_type_node,
> tmp)))
>
>  && tmp > 0
>  && pow2p_hwi (tmp)
>
> +(LDEXP { build_one_cst (type); }
> +  (mult @1 { build_int_cst (integer_type_node,
> + tree_log2 (integer_arg1)); })
>
> build_int_cst (integer_type_node, exact_log2 (tmp))
>
> + /* Simplify powi (2.0, i) to ldexp (1, i). */
> + (simplify
> +  (POWI REAL_CST@0 @1)
> +  (if (real_equal (TREE_REAL_CST_PTR (@0), &dconst2))
> +   (LDEXP { build_one_cst (type); } @1)))
> +
>
> You'll have a duplicate pattern here, instead merge them.  2.0
> is power-of-two so I wonder why the pattern is needed.

Thanks for the feedback!

I've merged the extra case that was specifically checking for 2.0.

Like you suggested, I also added two ldexp specific transforms:

• powof2 * ldexp (x, i) -> ldexp (x, i + log2 (powof2))
• a * ldexp(1., i) -> ldexp (a, i)

Regarding your suggestion to also fold exp2, a conversion like
exp2 (x) -> ldexp (1., x) or exp2 (x) * ldexp (y, i) -> ldexp (y, i + x) is a
bit tricky because we'd have to cast it to an integer before passing it to
ldexp.

real_to_integer only works for constants, which isn't helpful here as exp2 (CST)
becomes a power of 2 anyway and matches with the above patterns.

We'll have to explicitly convert it for non constants and I'm not sure if that
is worth it for this patch.

Let me know what you think.

Best,
Soumya



> Richard.
>
>>
>> This is especially helpful for SVE architectures as LDEXP calls can be
>> implemented using the FSCALE instruction, as seen in the following patch:
>> https://gcc.gnu.org/pipermail/gcc-patches/2024-September/664160.html
>>
>> SPEC2017 was run with this patch, while there are no noticeable improvements,
>> there are no non-noise regressions either.
>>
>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
>> OK for mainline?
>>
>> Signed-off-by: Soumya AR 
>>
>> gcc/ChangeLog:
>>  PR target/57492
>>  * match.pd: Added patterns to fold certain calls to pow to ldexp.
>>
>> gcc/testsuite/ChangeLog:
>>  PR target/57492
>>  * gcc.dg/tree-ssa/pow-to-ldexp.c: New test.
>>
>>
>
> --
> Richard Biener 
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)




v2-0001-Match-Fold-pow-calls-to-ldexp-when-possible-PR57492.patch
Description: v2-0001-Match-Fold-pow-calls-to-ldexp-when-possible-PR57492.patch


[PATCH] Match: Fold pow calls to ldexp when possible [PR57492]

2024-10-27 Thread Soumya AR
This patch transforms the following POW calls to equivalent LDEXP calls, as
discussed in PR57492:

powi (2.0, i) -> ldexp (1.0, i)

a * powi (2.0, i) -> ldexp (a, i)

2.0 * powi (2.0, i) -> ldexp (1.0, i + 1)

pow (powof2, i) -> ldexp (1.0, i * log2 (powof2))

powof2 * pow (2, i) -> ldexp (1.0, i + log2 (powof2))
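
As a minimal illustration (the function name is illustrative, and the fold is
only expected to fire under options that permit these floating-point
transformations):

double
scale_by_pow2 (double a, int i)
{
  /* a * powi (2.0, i) is expected to fold to ldexp (a, i).  */
  return a * __builtin_powi (2.0, i);
}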

This is especially helpful for SVE architectures as LDEXP calls can be
implemented using the FSCALE instruction, as seen in the following patch:
https://gcc.gnu.org/pipermail/gcc-patches/2024-September/664160.html

SPEC2017 was run with this patch, while there are no noticeable improvements,
there are no non-noise regressions either.

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR 

gcc/ChangeLog:
PR target/57492
* match.pd: Added patterns to fold certain calls to pow to ldexp.

gcc/testsuite/ChangeLog:
PR target/57492
* gcc.dg/tree-ssa/pow-to-ldexp.c: New test.



0001-Match-Fold-pow-calls-to-ldexp-when-possible-PR57492.patch
Description: 0001-Match-Fold-pow-calls-to-ldexp-when-possible-PR57492.patch


[PATCH] Match: Optimize log (x) CMP CST and exp (x) CMP CST operations

2024-10-27 Thread Soumya AR
This patch implements transformations for the following optimizations.

logN(x) CMP CST -> x CMP expN(CST)
expN(x) CMP CST -> x CMP logN(CST)

For example:

int
foo (float x)
{
  return __builtin_logf (x) < 0.0f;
}

can just be:

int
foo (float x)
{
  return x < 1.0f;
} 
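
Similarly, for the exp direction (an illustrative sketch):

int
bar (float x)
{
  /* exp (x) >= 1.0f is equivalent to x >= log (1.0f), i.e. x >= 0.0f.  */
  return __builtin_expf (x) >= 1.0f;
}

which can simplify to a plain x >= 0.0f comparison.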

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR 

gcc/ChangeLog:

* match.pd: Fold logN(x) CMP CST -> x CMP expN(CST)
and expN(x) CMP CST -> x CMP logN(CST)

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/log_exp.c: New test.



0001-Match-Optimize-log-x-CMP-CST-and-exp-x-CMP-CST-opera.patch
Description: 0001-Match-Optimize-log-x-CMP-CST-and-exp-x-CMP-CST-opera.patch


Re: [PATCH] Match: Fold pow calls to ldexp when possible [PR57492]

2024-11-11 Thread Soumya AR
Hi Richard,

> On 7 Nov 2024, at 6:10 PM, Richard Biener  wrote:
> 
> On Tue, 5 Nov 2024, Soumya AR wrote:
> 
>> 
>> 
>>> On 29 Oct 2024, at 7:16 PM, Richard Biener  wrote:
>>> 
>>> On Mon, 28 Oct 2024, Soumya AR wrote:
>>> 
>>>> This patch transforms the following POW calls to equivalent LDEXP calls, as
>>>> discussed in PR57492:
>>>> 
>>>> powi (2.0, i) -> ldexp (1.0, i)
>>>> 
>>>> a * powi (2.0, i) -> ldexp (a, i)
>>>> 
>>>> 2.0 * powi (2.0, i) -> ldexp (1.0, i + 1)
>>>> 
>>>> pow (powof2, i) -> ldexp (1.0, i * log2 (powof2))
>>>> 
>>>> powof2 * pow (2, i) -> ldexp (1.0, i + log2 (powof2))
>>> 
>>> For the multiplication cases why not handle powof2 * ldexp (1., i)
>>> to ldexp (1., i + log2 (powof2)) and a * ldexp (1., i) -> ldexp (a, i)
>>> instead?  exp2 * ldexp (1., i) is another candidate.
>>> 
>>> So please split out the multiplication parts.
>>> 
>>> + /* Simplify pow (powof2, i) to ldexp (1, i * log2 (powof2)). */
>>> 
>>> the below pattern handles POWI, not POW.
>>> 
>>> + (simplify
>>> +  (POWI REAL_CST@0 @1)
>>> +  (with { HOST_WIDE_INT tmp = 0;
>>> + tree integer_arg1 = NULL_TREE; }
>>> +  (if (integer_valued_real_p (@0)
>>> +   && real_isinteger (&TREE_REAL_CST (@0), &tmp)
>>> +   && integer_pow2p (integer_arg1 = build_int_cst (integer_type_node,
>>> tmp)))
>>> 
>>> && tmp > 0
>>> && pow2p_hwi (tmp)
>>> 
>>> +(LDEXP { build_one_cst (type); }
>>> +  (mult @1 { build_int_cst (integer_type_node,
>>> + tree_log2 (integer_arg1)); })
>>> 
>>> build_int_cst (integer_type_node, exact_log2 (tmp))
>>> 
>>> + /* Simplify powi (2.0, i) to ldexp (1, i). */
>>> + (simplify
>>> +  (POWI REAL_CST@0 @1)
>>> +  (if (real_equal (TREE_REAL_CST_PTR (@0), &dconst2))
>>> +   (LDEXP { build_one_cst (type); } @1)))
>>> +
>>> 
>>> You'll have a duplicate pattern here, instead merge them.  2.0
>>> is power-of-two so I wonder why the pattern is needed.
>> 
>> Thanks for the feedback!
>> 
>> I've merged the extra case that was specifically checking for 2.0.
>> 
>> Like you suggested, I also added two ldexp specific transforms:
>> 
>>• powof2 * ldexp (x, i) -> ldexp (x, i + log2 (powof2))
>>• a * ldexp(1., i) -> ldexp (a, i)
>> 
>> Regarding your suggestion to also fold exp2, a conversion like
>> exp2 (x) -> ldexp (1., x) or exp2 (x) * ldexp (y, i) -> ldexp (y, i + x) is a
>> bit tricky because we'd have to cast it to an integer before passing it to
>> ldexp.
>> 
>> real_to_integer only works for constants, which isn't helpful here as exp2 
>> (CST)
>> becomes a power of 2 anyway and matches with the above patterns.
>> 
>> We'll have to explicitly convert it for non constants and I'm not sure if 
>> that
>> is worth it for this patch.
>> 
>> Let me know what you think.
> 
> + (simplify
> +  (mult:c REAL_CST@0 (POWI REAL_CST@1 @2))
> +  (with { HOST_WIDE_INT tmp = 0;
> + tree integer_arg1 = NULL_TREE; }
> +  (if (integer_valued_real_p (@0)
> +   && real_equal (TREE_REAL_CST_PTR (@1), &dconst2)
> +   && real_isinteger (&TREE_REAL_CST (@0), &tmp)
> +   && integer_pow2p (integer_arg1 = build_int_cst (integer_type_node,
> tmp)))
> 
> as said you can use tmp > 0 && pow2p_hwi (tmp) instead of
> integer_pow2p and build_int_cst.  This applies to all patterns.

Thanks for the feedback! Updated the patch with this modification.

> +  (if (integer_valued_real_p (@0)
> +   && real_isinteger (&TREE_REAL_CST (@0), &tmp)
> 
> also is a redundant check, real_isinteger gives you the answer already.
> 
> +  (mult @1 {build_int_cst (integer_type_node,
> +tree_log2 (integer_arg1)); })
> 
> and use exact_log2 (tmp) instead of tree_log2 (integer_arg1).

Fixed. 

> 
> + /* Simplify a * powi (2.0, i) to ldexp (a, i). */
> + (simplify
> +  (mult:c @0 (POWI REAL_CST@1 @2))
> +  (if (real_equal (TREE_REAL_CST_PTR (@1), &dconst2

Re: [PATCH v2] aarch64: Optimise calls to ldexp with SVE FSCALE instruction [PR111733]

2024-11-11 Thread Soumya AR

Hi Richard, 

> On 7 Nov 2024, at 3:19 PM, Richard Sandiford  
> wrote:
> 
> Soumya AR  writes:
>> Changes since v1:
>> 
>> This revision makes use of the extended definition of aarch64_ptrue_reg to
>> generate predicate registers with the appropriate set bits.
>> 
>> Earlier, there was a suggestion to add support for half floats as well. I
>> extended the patch to include HFs but GCC still emits a libcall for ldexpf16.
>> For example, in the following case, the call does not lower to fscale:
>> 
>> _Float16 test_ldexpf16 (_Float16 x, int i) {
>>  return __builtin_ldexpf16 (x, i);
>> }
>> 
>> Any suggestions as to why this may be?
> 
> You'd need to change:
> 
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 2d455938271..469835b1d62 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -441,7 +441,7 @@ DEF_INTERNAL_OPTAB_FN (VEC_FMADDSUB, ECF_CONST, 
> vec_fmaddsub, ternary)
> DEF_INTERNAL_OPTAB_FN (VEC_FMSUBADD, ECF_CONST, vec_fmsubadd, ternary)
> 
> /* FP scales.  */
> -DEF_INTERNAL_FLT_FN (LDEXP, ECF_CONST, ldexp, binary)
> +DEF_INTERNAL_FLT_FLOATN_FN (LDEXP, ECF_CONST, ldexp, binary)
> 
> /* Ternary math functions.  */
> DEF_INTERNAL_FLT_FLOATN_FN (FMA, ECF_CONST, fma, ternary)

Thanks for this! It works for FP16 now.

> 
> A couple of comments below, but otherwise it looks good:
> 
>> diff --git a/gcc/config/aarch64/iterators.md 
>> b/gcc/config/aarch64/iterators.md
>> index 0bc98315bb6..7f708ea14f9 100644
>> --- a/gcc/config/aarch64/iterators.md
>> +++ b/gcc/config/aarch64/iterators.md
>> @@ -449,6 +449,9 @@
>> ;; All fully-packed SVE floating-point vector modes.
>> (define_mode_iterator SVE_FULL_F [VNx8HF VNx4SF VNx2DF])
>> 
>> +;; Fully-packed SVE floating-point vector modes and 32-bit and 64-bit 
>> floats.
>> +(define_mode_iterator SVE_FULL_F_SCALAR [VNx8HF VNx4SF VNx2DF HF SF DF])
> 
> The comment is out of date.  How about:
> 
> ;; Fully-packed SVE floating-point vector modes and their scalar equivalents.
> (define_mode_iterator SVE_FULL_F_SCALAR [SVE_FULL_F GPF_HF])

Edited.

> 
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/fscale.c 
>> b/gcc/testsuite/gcc.target/aarch64/sve/fscale.c
>> new file mode 100644
>> index 000..251b4ef9188
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/fscale.c
>> @@ -0,0 +1,16 @@
>> +/* { dg-do compile } */
>> +/* { dg-additional-options "-Ofast" } */
>> +
>> +float
>> +test_ldexpf (float x, int i)
>> +{
>> +  return __builtin_ldexpf (x, i);
>> +}
>> +/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.s, p[0-7]/m, 
>> z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
>> +
>> +double
>> +test_ldexp (double x, int i)
>> +{
>> +  return __builtin_ldexp (x, i);
>> +}
>> +/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.d, p[0-7]/m, 
>> z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> 
> It would be good to check the ptrues as well, to make sure that we only
> enable one lane.
> 
Makes sense, I’ve changed the test case to use check-function-bodies instead so
we can check for ptrues as well.

Best,
Soumya

> Thanks,
> Richard




0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch
Description: 0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch


Re: [PATCH] Match: Optimize log (x) CMP CST and exp (x) CMP CST operations

2024-11-11 Thread Soumya AR
Thanks, committed: e232dc3bb5c3e8f8a3749239135b7b859a204fc7

Best,
Soumya

> On 7 Nov 2024, at 3:32 AM, Jeff Law  wrote:
> 
> On 11/6/24 1:12 AM, Soumya AR wrote:
>> 
>> 
>>> On 29 Oct 2024, at 6:59 PM, Richard Biener  wrote:
>>> 
>>> On Mon, 28 Oct 2024, Soumya AR wrote:
>>> 
>>>> This patch implements transformations for the following optimizations.
>>>> 
>>>> logN(x) CMP CST -> x CMP expN(CST)
>>>> expN(x) CMP CST -> x CMP logN(CST)
>>>> 
>>>> For example:
>>>> 
>>>> int
>>>> foo (float x)
>>>> {
>>>>   return __builtin_logf (x) < 0.0f;
>>>> }
>>>> 
>>>> can just be:
>>>> 
>>>> int
>>>> foo (float x)
>>>> {
>>>>   return x < 1.0f;
>>>> }
>>>> 
>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>>> regression.
>>>> OK for mainline?
>>> 
>>> +   (for cmp (lt le gt ge eq ne)
>>> +(for logs (LOG LOG2 LOG10)
>>> +exps (EXP EXP2 EXP10)
>>> +/* Simplify logN (x) CMP CST into x CMP expN (CST) */
>>> +(simplify
>>> +(cmp:c (logs:s @0) @1)
>>> + (cmp @0 (exps @1)))
>>> +
>>> +/* Simplify expN (x) CMP CST into x CMP logN (CST) */
>>> +(simplify
>>> +(cmp:c (exps:s @0) @1)
>>> + (cmp @0 (logs @1))
>>> 
>>> this doesn't restrict @1 to be constant.  You should use
>>> 
>>> (cmp:c (exps:s @0) REAL_CST@1)
>> 
>> Fixed.
>> 
>>> I think this transform is also very susceptible to rounding
>>> issues - esp. using it for eq and ne looks very dangerous
>>> to me.  Unless you check a roundtrip through exp/log gets
>>> you back exactly the original constant.
>>> 
>>> I think the compare kinds "most safe" would be le and ge.
>> 
>> This makes sense, I’ve updated the patch to only optimize
>> for le and ge.
>> 
>> Thanks,
>> Soumya
>> 
>>> You can look at fold-const-call.cc:do_mpfr_arg1, mpfr gets
>>> you the information on whether the result is exact for example.
>>> 
>>> Richard.
>>> 
>>> 
>>>> Signed-off-by: Soumya AR 
>>>> 
>>>> gcc/ChangeLog:
>>>> 
>>>>   * match.pd: Fold logN(x) CMP CST -> x CMP expN(CST)
>>>>   and expN(x) CMP CST -> x CMP logN(CST)
>>>> 
>>>> gcc/testsuite/ChangeLog:
>>>> 
>>>>   * gcc.dg/tree-ssa/log_exp.c: New test.
> The latest version looks good to me.
> 
> Jeff




Re: [PATCH v2] aarch64: Optimise calls to ldexp with SVE FSCALE instruction [PR111733]

2024-11-12 Thread Soumya AR


> On 12 Nov 2024, at 4:27 PM, Richard Sandiford  
> wrote:
> 
> Soumya AR  writes:
>> diff --git a/gcc/config/aarch64/aarch64-sve.md 
>> b/gcc/config/aarch64/aarch64-sve.md
>> index 06bd3e4bb2c..119a0e53853 100644
>> --- a/gcc/config/aarch64/aarch64-sve.md
>> +++ b/gcc/config/aarch64/aarch64-sve.md
>> @@ -5088,6 +5088,21 @@
>> ;; - FTSSEL
>> ;; -
>> 
>> +(define_expand "ldexp3"
>> + [(set (match_operand:GPF_HF 0 "register_operand")
>> +   (unspec:GPF_HF
>> +  [(match_dup 3)
>> +   (const_int SVE_RELAXED_GP)
> 
> Sorry for only noticing now, but: this should be SVE_STRICT_GP instead of
> SVE_RELAXED_GP, since we don't want to allow other lanes to be made
> active later.
> 
>> +   (match_operand:GPF_HF 1 "register_operand")
>> +   (match_operand: 2 "register_operand")]
>> +  UNSPEC_COND_FSCALE))]
>> + "TARGET_SVE"
>> + {
>> +   operands[3] = aarch64_ptrue_reg (mode,
>> + GET_MODE_UNIT_SIZE (mode));
>> + }
>> +)
>> +
>> ;; Unpredicated floating-point binary operations that take an integer as
>> ;; their second operand.
>> (define_insn "@aarch64_sve_"
>> @@ -5103,17 +5118,17 @@
>> ;; Predicated floating-point binary operations that take an integer
>> ;; as their second operand.
>> (define_insn "@aarch64_pred_"
>> -  [(set (match_operand:SVE_FULL_F 0 "register_operand")
>> - (unspec:SVE_FULL_F
>> +  [(set (match_operand:SVE_FULL_F_SCALAR 0 "register_operand")
>> + (unspec:SVE_FULL_F_SCALAR
>>[(match_operand: 1 "register_operand")
>> (match_operand:SI 4 "aarch64_sve_gp_strictness")
>> -(match_operand:SVE_FULL_F 2 "register_operand")
>> +(match_operand:SVE_FULL_F_SCALAR 2 "register_operand")
>> (match_operand: 3 "register_operand")]
>>SVE_COND_FP_BINARY_INT))]
>>   "TARGET_SVE"
>>   {@ [ cons: =0 , 1   , 2 , 3 ; attrs: movprfx ]
>> - [ w, Upl , 0 , w ; *  ] \t%0., 
>> %1/m, %0., %3.
>> - [ ?&w  , Upl , w , w ; yes] movprfx\t%0, 
>> %2\;\t%0., %1/m, %0., %3.
>> + [ w, Upl , 0 , w ; *  ] \t%Z0., 
>> %1/m, %Z0., %Z3.
>> + [ ?&w  , Upl , w , w ; yes] movprfx\t%Z0, 
>> %Z2\;\t%Z0., %1/m, %Z0., %Z3.
>>   }
>> )
>> 
>> diff --git a/gcc/config/aarch64/iterators.md 
>> b/gcc/config/aarch64/iterators.md
>> index 8269b0cdcd9..4153c72954e 100644
>> --- a/gcc/config/aarch64/iterators.md
>> +++ b/gcc/config/aarch64/iterators.md
>> @@ -452,6 +452,9 @@
>> ;; All fully-packed SVE floating-point vector modes.
>> (define_mode_iterator SVE_FULL_F [VNx8HF VNx4SF VNx2DF])
>> 
>> +;; Fully-packed SVE floating-point vector modes and their scalar 
>> equivalents.
>> +(define_mode_iterator SVE_FULL_F_SCALAR [SVE_FULL_F GPF_HF])
>> +
>> ;; Fully-packed SVE integer vector modes that have 8-bit or 16-bit elements.
>> (define_mode_iterator SVE_FULL_BHI [VNx16QI VNx8HI])
>> 
>> @@ -2302,7 +2305,8 @@
>>   (VNx8DI "VNx2BI") (VNx8DF "VNx2BI")
>>   (V8QI "VNx8BI") (V16QI "VNx16BI")
>>   (V4HI "VNx4BI") (V8HI "VNx8BI") (V2SI "VNx2BI")
>> -  (V4SI "VNx4BI") (V2DI "VNx2BI") (V1DI "VNx2BI")])
>> +  (V4SI "VNx4BI") (V2DI "VNx2BI") (V1DI "VNx2BI")
>> +  (HF "VNx8BI") (SF "VNx4BI") (DF "VNx2BI")])
>> 
>> ;; ...and again in lower case.
>> (define_mode_attr vpred [(VNx16QI "vnx16bi") (VNx8QI "vnx8bi")
>> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
>> index c3d0efc0f2c..09b7844d094 100644
>> --- a/gcc/internal-fn.def
>> +++ b/gcc/internal-fn.def
>> @@ -441,7 +441,7 @@ DEF_INTERNAL_OPTAB_FN (VEC_FMADDSUB, ECF_CONST, 
>> vec_fmaddsub, ternary)
>> DEF_INTERNAL_OPTAB_FN (VEC_FMSUBADD, ECF_CONST, vec_fmsubadd, ternary)
>> 
>> /* FP scales.  */
>> -DEF_INTERNAL_FLT_FN (LDEXP, ECF_CONST, ldexp, binary)
>> +DEF_INTERNAL_FLT_FLOATN_FN (LDEXP, ECF_CONST, ldexp, binary)
>> 
>> /* Ter

Re: [PATCH] Match: Fold pow calls to ldexp when possible [PR57492]

2024-11-13 Thread Soumya AR


> On 12 Nov 2024, at 6:19 PM, Richard Biener  wrote:
> 
> On Mon, 11 Nov 2024, Soumya AR wrote:
> 
>> Hi Richard,
>> 
>>> On 7 Nov 2024, at 6:10 PM, Richard Biener  wrote:
>>> 
>>> On Tue, 5 Nov 2024, Soumya AR wrote:
>>> 
>>>> 
>>>> 
>>>>> On 29 Oct 2024, at 7:16 PM, Richard Biener  wrote:
>>>>> 
>>>>> On Mon, 28 Oct 2024, Soumya AR wrote:
>>>>> 
>>>>>> This patch transforms the following POW calls to equivalent LDEXP calls, 
>>>>>> as
>>>>>> discussed in PR57492:
>>>>>> 
>>>>>> powi (2.0, i) -> ldexp (1.0, i)
>>>>>> 
>>>>>> a * powi (2.0, i) -> ldexp (a, i)
>>>>>> 
>>>>>> 2.0 * powi (2.0, i) -> ldexp (1.0, i + 1)
>>>>>> 
>>>>>> pow (powof2, i) -> ldexp (1.0, i * log2 (powof2))
>>>>>> 
>>>>>> powof2 * pow (2, i) -> ldexp (1.0, i + log2 (powof2))
>>>>> 
>>>>> For the multiplication cases why not handle powof2 * ldexp (1., i)
>>>>> to ldexp (1., i + log2 (powof2)) and a * ldexp (1., i) -> ldexp (a, i)
>>>>> instead?  exp2 * ldexp (1., i) is another candidate.
>>>>> 
>>>>> So please split out the multiplication parts.
>>>>> 
>>>>> + /* Simplify pow (powof2, i) to ldexp (1, i * log2 (powof2)). */
>>>>> 
>>>>> the below pattern handles POWI, not POW.
>>>>> 
>>>>> + (simplify
>>>>> +  (POWI REAL_CST@0 @1)
>>>>> +  (with { HOST_WIDE_INT tmp = 0;
>>>>> + tree integer_arg1 = NULL_TREE; }
>>>>> +  (if (integer_valued_real_p (@0)
>>>>> +   && real_isinteger (&TREE_REAL_CST (@0), &tmp)
>>>>> +   && integer_pow2p (integer_arg1 = build_int_cst (integer_type_node,
>>>>> tmp)))
>>>>> 
>>>>> && tmp > 0
>>>>> && pow2p_hwi (tmp)
>>>>> 
>>>>> +(LDEXP { build_one_cst (type); }
>>>>> +  (mult @1 { build_int_cst (integer_type_node,
>>>>> + tree_log2 (integer_arg1)); })
>>>>> 
>>>>> build_int_cst (integer_type_node, exact_log2 (tmp))
>>>>> 
>>>>> + /* Simplify powi (2.0, i) to ldexp (1, i). */
>>>>> + (simplify
>>>>> +  (POWI REAL_CST@0 @1)
>>>>> +  (if (real_equal (TREE_REAL_CST_PTR (@0), &dconst2))
>>>>> +   (LDEXP { build_one_cst (type); } @1)))
>>>>> +
>>>>> 
>>>>> You'll have a duplicate pattern here, instead merge them.  2.0
>>>>> is power-of-two so I wonder why the pattern is needed.
>>>> 
>>>> Thanks for the feedback!
>>>> 
>>>> I've merged the extra case that was specifically checking for 2.0.
>>>> 
>>>> Like you suggested, I also added two ldexp specific transforms:
>>>> 
>>>>   • powof2 * ldexp (x, i) -> ldexp (x, i + log2 (powof2))
>>>>   • a * ldexp(1., i) -> ldexp (a, i)
>>>> 
>>>> Regarding your suggestion to also fold exp2, a conversion like
>>>> exp2 (x) -> ldexp (1., x) or exp2 (x) * ldexp (y, i) -> ldexp (y, i + x) 
>>>> is a
>>>> bit tricky because we'd have to cast it to an integer before passing it to
>>>> ldexp.
>>>> 
>>>> real_to_integer only works for constants, which isn't helpful here as exp2 
>>>> (CST)
>>>> becomes a power of 2 anyway and matches with the above patterns.
>>>> 
>>>> We'll have to explicitly convert it for non constants and I'm not sure if 
>>>> that
>>>> is worth it for this patch.
>>>> 
>>>> Let me know what you think.
>>> 
>>> + (simplify
>>> +  (mult:c REAL_CST@0 (POWI REAL_CST@1 @2))
>>> +  (with { HOST_WIDE_INT tmp = 0;
>>> + tree integer_arg1 = NULL_TREE; }
>>> +  (if (integer_valued_real_p (@0)
>>> +   && real_equal 

Re: [PATCH] Match: Fold pow calls to ldexp when possible [PR57492]

2024-11-13 Thread Soumya AR


> On 13 Nov 2024, at 2:49 PM, Richard Biener  wrote:
> 
> On Wed, 13 Nov 2024, Soumya AR wrote:
> 
>> 
>> 
>>> On 12 Nov 2024, at 6:19 PM, Richard Biener  wrote:
>>> 
>>> On Mon, 11 Nov 2024, Soumya AR wrote:
>>> 
>>>> Hi Richard,
>>>> 
>>>>> On 7 Nov 2024, at 6:10 PM, Richard Biener  wrote:
>>>>> 
>>>>> On Tue, 5 Nov 2024, Soumya AR wrote:
>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 29 Oct 2024, at 7:16 PM, Richard Biener  wrote:
>>>>>>> 
>>>>>>> On Mon, 28 Oct 2024, Soumya AR wrote:
>>>>>>> 
>>>>>>>> This patch transforms the following POW calls to equivalent LDEXP 
>>>>>>>> calls, as
>>>>>>>> discussed in PR57492:
>>>>>>>> 
>>>>>>>> powi (2.0, i) -> ldexp (1.0, i)
>>>>>>>> 
>>>>>>>> a * powi (2.0, i) -> ldexp (a, i)
>>>>>>>> 
>>>>>>>> 2.0 * powi (2.0, i) -> ldexp (1.0, i + 1)
>>>>>>>> 
>>>>>>>> pow (powof2, i) -> ldexp (1.0, i * log2 (powof2))
>>>>>>>> 
>>>>>>>> powof2 * pow (2, i) -> ldexp (1.0, i + log2 (powof2))
>>>>>>> 
>>>>>>> For the multiplication cases why not handle powof2 * ldexp (1., i)
>>>>>>> to ldexp (1., i + log2 (powof2)) and a * ldexp (1., i) -> ldexp (a, i)
>>>>>>> instead?  exp2 * ldexp (1., i) is another candidate.
>>>>>>> 
>>>>>>> So please split out the multiplication parts.
>>>>>>> 
>>>>>>> + /* Simplify pow (powof2, i) to ldexp (1, i * log2 (powof2)). */
>>>>>>> 
>>>>>>> the below pattern handles POWI, not POW.
>>>>>>> 
>>>>>>> + (simplify
>>>>>>> +  (POWI REAL_CST@0 @1)
>>>>>>> +  (with { HOST_WIDE_INT tmp = 0;
>>>>>>> + tree integer_arg1 = NULL_TREE; }
>>>>>>> +  (if (integer_valued_real_p (@0)
>>>>>>> +   && real_isinteger (&TREE_REAL_CST (@0), &tmp)
>>>>>>> +   && integer_pow2p (integer_arg1 = build_int_cst 
>>>>>>> (integer_type_node,
>>>>>>> tmp)))
>>>>>>> 
>>>>>>> && tmp > 0
>>>>>>> && pow2p_hwi (tmp)
>>>>>>> 
>>>>>>> +(LDEXP { build_one_cst (type); }
>>>>>>> +  (mult @1 { build_int_cst (integer_type_node,
>>>>>>> + tree_log2 (integer_arg1)); })
>>>>>>> 
>>>>>>> build_int_cst (integer_type_node, exact_log2 (tmp))
>>>>>>> 
>>>>>>> + /* Simplify powi (2.0, i) to ldexp (1, i). */
>>>>>>> + (simplify
>>>>>>> +  (POWI REAL_CST@0 @1)
>>>>>>> +  (if (real_equal (TREE_REAL_CST_PTR (@0), &dconst2))
>>>>>>> +   (LDEXP { build_one_cst (type); } @1)))
>>>>>>> +
>>>>>>> 
>>>>>>> You'll have a duplicate pattern here, instead merge them.  2.0
>>>>>>> is power-of-two so I wonder why the pattern is needed.
>>>>>> 
>>>>>> Thanks for the feedback!
>>>>>> 
>>>>>> I've merged the extra case that was specifically checking for 2.0.
>>>>>> 
>>>>>> Like you suggested, I also added two ldexp specific transforms:
>>>>>> 
>>>>>>  • powof2 * ldexp (x, i) -> ldexp (x, i + log2 (powof2))
>>>>>>  • a * ldexp(1., i) -> ldexp (a, i)
>>>>>> 
>>>>>> Regarding your suggestion to also fold exp2, a conversion like
>>>>>> exp2 (x) -> ldexp (1., x) or exp2 (x) 

[committed] MAINTAINERS: Added myself to write after approval and DCO.

2024-10-30 Thread Soumya AR
Pushed to trunk: 91577f0c8d955bc670ee76d1a8851df336bf240c

Signed-off-by: Soumya AR 

ChangeLog:

* MAINTAINERS: Add myself to write after approval and DCO.



0001-MAINTAINERS-Add-myself-to-write-after-approval-and-D.patch
Description: 0001-MAINTAINERS-Add-myself-to-write-after-approval-and-D.patch


SVE intrinsics: Fold constant operands for svlsl.

2024-09-24 Thread Soumya AR
This patch implements constant folding for svlsl. Test cases have been added to
check for the following cases:

Zero, merge, and don't care predication.
Shift by 0.
Shift by register width.
Overflow shift on signed and unsigned integers.
Shift on a negative integer.
Maximum possible shift, e.g. shift by 7 on an 8-bit integer.
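
For illustration, here is a minimal sketch (assuming the usual ACLE spellings;
the function name fold_me is only illustrative) of a call whose operands are
all compile-time constants, so the whole shift can now be folded:

#include <arm_sve.h>

svint32_t
fold_me (void)
{
  /* A left shift of a constant vector of 1s by a constant 3 under an
     all-true "don't care" predicate folds to a constant vector of 8s.  */
  return svlsl_n_s32_x (svptrue_b32 (), svdup_s32 (1), 3);
}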

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR 

gcc/ChangeLog:

* config/aarch64/aarch64-sve-builtins-base.cc (svlsl_impl::fold):
Try constant folding.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/const_fold_lsl_1.c: New test.



0001-SVE-intrinsics-Fold-constant-operands-for-svlsl.patch
Description: 0001-SVE-intrinsics-Fold-constant-operands-for-svlsl.patch


[PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE instruction

2024-09-30 Thread Soumya AR
This patch uses the FSCALE instruction provided by SVE to implement the
standard ldexp family of functions.

Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the
following code:

float
test_ldexpf (float x, int i)
{
return __builtin_ldexpf (x, i);
}

double
test_ldexp (double x, int i)
{
return __builtin_ldexp(x, i);
}

GCC Output:

test_ldexpf:
b ldexpf

test_ldexp:
b ldexp

Since SVE has support for an FSCALE instruction, we can use this to process
scalar floats by moving them to a vector register and performing an fscale call,
similar to how LLVM tackles an ldexp builtin as well.

New Output:

test_ldexpf:
fmov s31, w0
ptrue p7.b, all
fscale z0.s, p7/m, z0.s, z31.s
ret

test_ldexp:
sxtw x0, w0
ptrue p7.b, all
fmov d31, x0
fscale z0.d, p7/m, z0.d, z31.d
ret

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR 

gcc/ChangeLog:

* config/aarch64/aarch64-sve.md
(ldexp3): Added a new pattern to match ldexp calls with scalar
floating modes and expand to the existing pattern for FSCALE.
(@aarch64_pred_): Extended the pattern to accept SVE
operands as well as scalar floating modes.

* config/aarch64/iterators.md:
SVE_FULL_F_SCALAR: Added an iterator to match all FP SVE modes as well
as SF and DF.
VPRED: Extended the attribute to handle GPF modes.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/fscale.c: New test.



0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch
Description: 0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch


[PATCH] aarch64: Expand CTZ to RBIT + CLZ for SVE [PR109498]

2024-09-23 Thread Soumya AR
Currently, we vectorize CTZ for SVE by using the following operation:
.CTZ (X) = (PREC - 1) - .CLZ (X & -X)

Instead, this patch expands CTZ to RBIT + CLZ for SVE, as suggested in PR109498.
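
For example, a loop like the following (a hypothetical kernel, not the new
test itself) is now vectorized using RBIT + CLZ instead of the
(PREC - 1) - CLZ (X & -X) sequence:

void
ctz_loop (unsigned int *restrict in, int *restrict out, int n)
{
  /* Each element becomes rbit + clz in the vectorized SVE loop.  */
  for (int i = 0; i < n; i++)
    out[i] = __builtin_ctz (in[i]);
}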

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR 

gcc/ChangeLog:
PR target/109498
* config/aarch64/aarch64-sve.md (ctz2): Added pattern to expand
CTZ to RBIT + CLZ for SVE.

gcc/testsuite/ChangeLog:
PR target/109498
* gcc.target/aarch64/sve/ctz.c: New test.



0001-aarch64-Expand-CTZ-to-RBIT-CLZ-for-SVE-PR109498.patch
Description: 0001-aarch64-Expand-CTZ-to-RBIT-CLZ-for-SVE-PR109498.patch


Re: SVE intrinsics: Fold constant operands for svlsl.

2024-10-03 Thread Soumya AR
Ping.

> On 24 Sep 2024, at 2:00 PM, Soumya AR  wrote:
> 
> This patch implements constant folding for svlsl. Test cases have been added 
> to
> check for the following cases:
> 
> Zero, merge, and don't care predication.
> Shift by 0.
> Shift by register width.
> Overflow shift on signed and unsigned integers.
> Shift on a negative integer.
> Maximum possible shift, e.g. shift by 7 on an 8-bit integer.
> 
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
> 
> Signed-off-by: Soumya AR 
> 
> gcc/ChangeLog:
> 
> * config/aarch64/aarch64-sve-builtins-base.cc (svlsl_impl::fold):
> Try constant folding.
> 
> gcc/testsuite/ChangeLog:
> 
> * gcc.target/aarch64/sve/const_fold_lsl_1.c: New test.
> 
> <0001-SVE-intrinsics-Fold-constant-operands-for-svlsl.patch>



Re: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE instruction

2024-10-02 Thread Soumya AR
Hi Saurabh,

> On 30 Sep 2024, at 10:36 PM, Saurabh Jha  wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> Hi Soumya,
> 
> Thank you for the patch. Two clarifications:
> 
> In the instruction pattern's output string, why did you add the 'Z'
> prefix before operands? (%0 -> %Z0).

The 'Z' prefix is added to ensure that the register name is printed as an SVE
register.

Normally, %n prints the register assigned to operand 'n' using the
PRINT_OPERAND target hook, as described here:

https://gcc.gnu.org/onlinedocs/gccint/Output-Template.html

Adding a code letter, as in %Zn, asks that hook to print the operand in a
target-specific way.

In most cases this is not needed, since the operands are already SVE
registers, but here the operands are non-SVE (Advanced SIMD) registers at the
RTL level, so we use 'Z' to print them with their 'z' register names and
ensure that something like:

fscale v0.s, p7/m, v0.s, v31.s

does not get printed.
 
> Also, maybe you can make your test cases more precise by specifying
> which functions generate which instructions. I don't have and SVE test
> off the top of my head but have a look at
> /gcc/testsuite/gcc.target/aarch64/simd/faminmax-codegen.c
> for example.

Thanks for the suggestion! I'll update the test case accordingly.

Regards,
Soumya

> Regards,
> Saurabh
> 
> 
> 
> On 9/30/2024 5:26 PM, Soumya AR wrote:
>> This patch uses the FSCALE instruction provided by SVE to implement the
>> standard ldexp family of functions.
>> 
>> Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the
>> following code:
>> 
>> float
>> test_ldexpf (float x, int i)
>> {
>>  return __builtin_ldexpf (x, i);
>> }
>> 
>> double
>> test_ldexp (double x, int i)
>> {
>>  return __builtin_ldexp(x, i);
>> }
>> 
>> GCC Output:
>> 
>> test_ldexpf:
>>  b ldexpf
>> 
>> test_ldexp:
>>  b ldexp
>> 
>> Since SVE has support for an FSCALE instruction, we can use this to process
>> scalar floats by moving them to a vector register and performing an fscale 
>> call,
>> similar to how LLVM tackles an ldexp builtin as well.
>> 
>> New Output:
>> 
>> test_ldexpf:
>>  fmov s31, w0
>>  ptrue p7.b, all
>>  fscale z0.s, p7/m, z0.s, z31.s
>>  ret
>> 
>> test_ldexp:
>>  sxtw x0, w0
>>  ptrue p7.b, all
>>  fmov d31, x0
>>  fscale z0.d, p7/m, z0.d, z31.d
>>  ret
>> 
>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
>> OK for mainline?
>> 
>> Signed-off-by: Soumya AR 
>> 
>> gcc/ChangeLog:
>> 
>> * config/aarch64/aarch64-sve.md
>> (ldexp3): Added a new pattern to match ldexp calls with scalar
>> floating modes and expand to the existing pattern for FSCALE.
>> (@aarch64_pred_): Extended the pattern to accept SVE
>> operands as well as scalar floating modes.
>> 
>> * config/aarch64/iterators.md:
>> SVE_FULL_F_SCALAR: Added an iterator to match all FP SVE modes as well
>> as SF and DF.
>> VPRED: Extended the attribute to handle GPF modes.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>> * gcc.target/aarch64/sve/fscale.c: New test.
>> 
> 



Re: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE instruction

2024-10-03 Thread Soumya AR


> On 1 Oct 2024, at 1:18 PM, Tamar Christina  wrote:
>
> External email: Use caution opening links or attachments
>
>
> Hi Soumya,
>
> Nice patch!
>
>> -Original Message-
>> From: Kyrylo Tkachov 
>> Sent: Tuesday, October 1, 2024 7:55 AM
>> To: Soumya AR 
>> Cc: gcc-patches@gcc.gnu.org; Richard Sandiford 
>> Subject: Re: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE 
>> instruction
>>
>> Hi Soumya
>>
>>> On 30 Sep 2024, at 18:26, Soumya AR  wrote:
>>>
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> This patch uses the FSCALE instruction provided by SVE to implement the
>>> standard ldexp family of functions.
>>>
>>> Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the
>>> following code:
>>>
>>> float
>>> test_ldexpf (float x, int i)
>>> {
>>>   return __builtin_ldexpf (x, i);
>>> }
>>>
>>> double
>>> test_ldexp (double x, int i)
>>> {
>>>   return __builtin_ldexp(x, i);
>>> }
>>>
>>> GCC Output:
>>>
>>> test_ldexpf:
>>>   b ldexpf
>>>
>>> test_ldexp:
>>>   b ldexp
>>>
>>> Since SVE has support for an FSCALE instruction, we can use this to process
>>> scalar floats by moving them to a vector register and performing an fscale 
>>> call,
>>> similar to how LLVM tackles an ldexp builtin as well.
>>>
>>> New Output:
>>>
>>> test_ldexpf:
>>>   fmov s31, w0
>>>   ptrue p7.b, all
>>>   fscale z0.s, p7/m, z0.s, z31.s
>>>   ret
>>>
>>> test_ldexp:
>>>   sxtw x0, w0
>>>   ptrue p7.b, all
>>>   fmov d31, x0
>>>   fscale z0.d, p7/m, z0.d, z31.d
>>>   ret
>>>
>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>> regression.
>>> OK for mainline?
>>>
>>> Signed-off-by: Soumya AR 
>>>
>>> gcc/ChangeLog:
>>>
>>> * config/aarch64/aarch64-sve.md
>>> (ldexp3): Added a new pattern to match ldexp calls with scalar
>>> floating modes and expand to the existing pattern for FSCALE.
>>> (@aarch64_pred_): Extended the pattern to accept SVE
>>> operands as well as scalar floating modes.
>>>
>>> * config/aarch64/iterators.md:
>>> SVE_FULL_F_SCALAR: Added an iterator to match all FP SVE modes as well
>>> as SF and DF.
>>> VPRED: Extended the attribute to handle GPF modes.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>> * gcc.target/aarch64/sve/fscale.c: New test.
>>
>> This patch fixes the bugzilla report at
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111733
>> So it should be referenced in the ChangeLog entries like so:
>>
>>  PR target/111733
>>  * config/aarch64/aarch64-sve.md 
>>
>> That way the commit hooks will pick it up and updated the bug tracker 
>> accordingly
>>
>>>
>>> <0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch>
>>
>> +(define_expand "ldexp3"
>> +  [(set (match_operand:GPF 0 "register_operand" "=w")
>> + (unspec:GPF
>> +   [(match_operand:GPF 1 "register_operand" "w")
>> +(match_operand: 2 "register_operand" "w")]
>> +   UNSPEC_COND_FSCALE))]
>> +  "TARGET_SVE"
>> +  {
>> +rtx ptrue = aarch64_ptrue_reg (mode);
>> +rtx strictness = gen_int_mode (SVE_RELAXED_GP, SImode);
>> +emit_insn (gen_aarch64_pred_fscale (operands[0], ptrue,
>> operands[1], operands[2], strictness));
>> +DONE;
>> +  }
>>
>> Lines should not exceed 80 columns, this should be wrapped around
>
> And Nit: perhaps slightly more idiomatic to the other patterns in SVE is this:
>
> (define_expand "ldexp3"
>  [(set (match_operand:GPF 0 "register_operand")
>(unspec:GPF
>  [(match_dup 3)
>   (const_int SVE_RELAXED_GP)
>   (match_operand:GPF 1 "register_operand")
>   (match_operand: 2 "register_operand")]
>  UNSPEC_COND_FSCALE))]
>  "TARGET_SVE"
>  {
>operands[3] = aarch64_ptrue_reg (mode);
>  }
> )
>
> It removes the dependency on the exact name of the pattern.
> Also note the dropping of the constraints, expand patterns don't use
> the constraints, only the predicates are checked.
>
> Cheers,
> Tamar

Thanks for this suggestion! This makes a lot of sense. Edited the patch with
this change.

Also referenced the PR as suggested by Kyrill earlier.

Thanks,
Soumya

>>
>> The patch looks good to me otherwise.
>> Thanks,
>> Kyrill




0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch
Description: 0001-aarch64-Optimise-calls-to-ldexp-with-SVE-FSCALE-inst.patch


Re: [PATCH] Match: Optimize log (x) CMP CST and exp (x) CMP CST operations

2024-11-06 Thread Soumya AR


> On 29 Oct 2024, at 6:59 PM, Richard Biener  wrote:
>
> External email: Use caution opening links or attachments
>
>
> On Mon, 28 Oct 2024, Soumya AR wrote:
>
>> This patch implements transformations for the following optimizations.
>>
>> logN(x) CMP CST -> x CMP expN(CST)
>> expN(x) CMP CST -> x CMP logN(CST)
>>
>> For example:
>>
>> int
>> foo (float x)
>> {
>>  return __builtin_logf (x) < 0.0f;
>> }
>>
>> can just be:
>>
>> int
>> foo (float x)
>> {
>>  return x < 1.0f;
>> }
>>
>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
>> OK for mainline?
>
> +   (for cmp (lt le gt ge eq ne)
> +(for logs (LOG LOG2 LOG10)
> +exps (EXP EXP2 EXP10)
> +/* Simplify logN (x) CMP CST into x CMP expN (CST) */
> +(simplify
> +(cmp:c (logs:s @0) @1)
> + (cmp @0 (exps @1)))
> +
> +/* Simplify expN (x) CMP CST into x CMP logN (CST) */
> +(simplify
> +(cmp:c (exps:s @0) @1)
> + (cmp @0 (logs @1))
>
> this doesn't restrict @1 to be constant.  You should use
>
> (cmp:c (exps:s @0) REAL_CST@1)

Fixed.

> I think this transform is also very susceptible to rounding
> issues - esp. using it for eq and ne looks very dangerous
> to me.  Unless you check a roundtrip through exp/log gets
> you back exactly the original constant.
>
> I think the compare kinds "most safe" would be le and ge.

This makes sense, I’ve updated the patch to only optimize
for le and ge.
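
For the record, a small sketch of the hazard with eq (the constant 1.0 is
just an example): rewriting log (x) == 1.0 as x == exp (1.0) uses a rounded
value of e, so the two comparisons need not agree for inputs near e:

int
eq_hazard (double x)
{
  /* No longer rewritten: exp (1.0) cannot be represented exactly.  */
  return __builtin_log (x) == 1.0;
}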

Thanks,
Soumya

> You can look at fold-const-call.cc:do_mpfr_arg1, mpfr gets
> you the information on whether the result is exact for example.
>
> Richard.
>
>
>> Signed-off-by: Soumya AR 
>>
>> gcc/ChangeLog:
>>
>>  * match.pd: Fold logN(x) CMP CST -> x CMP expN(CST)
>>  and expN(x) CMP CST -> x CMP logN(CST)
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.dg/tree-ssa/log_exp.c: New test.
>>
>>
>
> --
> Richard Biener 
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)






v2-0001-Match-Optimize-log-x-CMP-CST-and-exp-x-CMP-CST.patch
Description: v2-0001-Match-Optimize-log-x-CMP-CST-and-exp-x-CMP-CST.patch


[PATCH] aarch64: Use SVE ASRD instruction with Neon modes.

2024-11-24 Thread Soumya AR
The ASRD instruction on SVE performs an arithmetic shift right by an immediate
for divide.

This patch enables the use of ASRD with Neon modes.

For example:

int in[N], out[N];

void
foo (void)
{
  for (int i = 0; i < N; i++)
out[i] = in[i] / 4;
}

compiles to:

ldr q31, [x1, x0]
cmlt v30.16b, v31.16b, #0
and z30.b, z30.b, 3
add v30.16b, v30.16b, v31.16b
sshr v30.16b, v30.16b, 2
str q30, [x0, x2]
add x0, x0, 16
cmp x0, 1024

but can just be:

ldp q30, q31, [x0], 32
asrd z31.b, p7/m, z31.b, #2
asrd z30.b, p7/m, z30.b, #2
stp q30, q31, [x1], 32
cmp x0, x2
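
For reference, each element above is effectively computing the scalar
operation below (a sketch of the semantics, assuming GCC's arithmetic right
shift of negative values); ASRD #2 performs this in a single instruction:

static inline int8_t
sdiv_by_4 (int8_t x)
{
  /* Add 3 to negative values so the shift rounds towards zero, matching
     C's signed division by 4.  */
  return (int8_t) ((x + (x < 0 ? 3 : 0)) >> 2);
}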

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR 

gcc/ChangeLog:

* config/aarch64/aarch64-sve.md: Extended sdiv_pow23 and
*sdiv_pow23 to support Neon modes.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/sve-asrd.c: New test.
---
 gcc/config/aarch64/aarch64-sve.md | 25 -
 .../gcc.target/aarch64/sve/sve-asrd.c | 54 +++
 2 files changed, 67 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c

diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index affdb24a93d..96effe4abed 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -4972,34 +4972,35 @@
 
 ;; Unpredicated ASRD.
 (define_expand "sdiv_pow23"
-  [(set (match_operand:SVE_I 0 "register_operand")
-   (unspec:SVE_I
+  [(set (match_operand:SVE_VDQ_I 0 "register_operand")
+   (unspec:SVE_VDQ_I
  [(match_dup 3)
-  (unspec:SVE_I
-[(match_operand:SVE_I 1 "register_operand")
+  (unspec:SVE_VDQ_I
+[(match_operand:SVE_VDQ_I 1 "register_operand")
  (match_operand 2 "aarch64_simd_rshift_imm")]
 UNSPEC_ASRD)]
 UNSPEC_PRED_X))]
   "TARGET_SVE"
   {
-operands[3] = aarch64_ptrue_reg (mode);
+operands[3] = aarch64_ptrue_reg (mode,
+   GET_MODE_UNIT_SIZE (mode));
   }
 )
 
 ;; Predicated ASRD.
 (define_insn "*sdiv_pow23"
-  [(set (match_operand:SVE_I 0 "register_operand")
-   (unspec:SVE_I
+  [(set (match_operand:SVE_VDQ_I 0 "register_operand")
+   (unspec:SVE_VDQ_I
  [(match_operand: 1 "register_operand")
-  (unspec:SVE_I
-[(match_operand:SVE_I 2 "register_operand")
- (match_operand:SVE_I 3 "aarch64_simd_rshift_imm")]
+  (unspec:SVE_VDQ_I
+[(match_operand:SVE_VDQ_I 2 "register_operand")
+ (match_operand:SVE_VDQ_I 3 "aarch64_simd_rshift_imm")]
 UNSPEC_ASRD)]
  UNSPEC_PRED_X))]
   "TARGET_SVE"
   {@ [ cons: =0 , 1   , 2 ; attrs: movprfx ]
- [ w, Upl , 0 ; *  ] asrd\t%0., %1/m, %0., #%3
- [ ?&w  , Upl , w ; yes] movprfx\t%0, %2\;asrd\t%0., %1/m, %0., #%3
+ [ w, Upl , 0 ; *  ] asrd\t%Z0., %1/m, %Z0., #%3
+ [ ?&w  , Upl , w ; yes] movprfx\t%Z0, %Z2\;asrd\t%Z0., %1/m, %Z0., #%3
   }
 )
 
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c
new file mode 100644
index 000..00aa8b2380d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c
@@ -0,0 +1,54 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param aarch64-autovec-preference=asimd-only" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#include <stdint.h>
+#define N 1024
+
+#define FUNC(M) \
+M in_##M[N];\
+M out_##M[N];   \
+void asrd_##M() {   \
+  for (int i = 0; i < N; i++)   \
+out_##M[i] = in_##M[i] / 4; \
+}
+
+/*
+** asrd_int8_t:
+** ...
+** ptrue   (p[0-7]).b, vl1
+** ...
+** asrdz[0-9]+\.b, \1/m, z[0-9]+\.b, #2
+** ...
+*/
+FUNC(int8_t)
+
+/*
+** asrd_int16_t:
+** ...
+** ptrue   (p[0-7]).b, vl2
+** ...
+** asrdz[0-9]+\.h, \1/m, z[0-9]+\.h, #2
+** ...
+*/
+FUNC(int16_t)
+
+/*
+** asrd_int32_t:
+** ...
+** ptrue   (p[0-7]).b, vl4
+** ...
+** asrdz[0-9]+\.s, \1/m, z[0-9]+\.s, #2
+** ...
+*/
+FUNC(int32_t)
+
+/*
+** asrd_int64_t:
+** ...
+** ptrue   (p[0-7]).b, vl8
+** ...
+** asrdz[0-9]+\.d, \1/m, z[0-9]+\.d, #2
+** ...
+*/
+FUNC(int64_t)
-- 
2.43.2



[PATCH] testsuite: Require C99 for pow-to-ldexp.c

2024-11-20 Thread Soumya AR
pow-to-ldexp.c checks for calls to __builtin_ldexpf and __builtin_ldexpl, which
will only be generated when the compiler knows the target has a C99 libm
available.
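
For context, this is a sketch (not the exact test body) of the kind of call
the test expects to be rewritten, here into __builtin_ldexpf (x, i):

float
f (float x, int i)
{
  /* a * powi (2.0, i) is folded to ldexp (a, i) when a C99 libm exists.  */
  return x * __builtin_powif (2.0f, i);
}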

Modified the test to add a C99 runtime requirement.

This fixes the failure on arm-eabi targets for this test case.

Committed as obvious: 90645dba41bac29cab4c5996ba320c97a0325eb2

Signed-off-by: Soumya AR 

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/pow-to-ldexp.c: Require c99_runtime.
---
gcc/testsuite/gcc.dg/tree-ssa/pow-to-ldexp.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pow-to-ldexp.c b/gcc/testsuite/gcc.dg/tree-ssa/pow-to-ldexp.c
index 007949dbb53..5be3dc89955 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pow-to-ldexp.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pow-to-ldexp.c
@@ -1,5 +1,6 @@
/* { dg-do compile } */
/* { dg-additional-options "-Ofast -fdump-tree-optimized" } */
+/* { dg-require-effective-target c99_runtime } */
/* { dg-final { scan-tree-dump-times "__builtin_ldexp\ " 7 "optimized" } } */
/* { dg-final { scan-tree-dump-times "__builtin_ldexpf\ " 7 "optimized" } } */
-- 
2.43.2




[PATCH] aarch64: Use SVE SUBR instruction with Neon modes

2024-11-14 Thread Soumya AR
The SVE SUBR instruction performs a reversed subtract from an immediate.

This patch enables the emission of SUBR for Neon modes and avoids the need to
materialise an explicit constant.

For example, the below test case:

typedef long long __attribute__ ((vector_size (16))) v2di;

v2di subr_v2di (v2di x)
{
return 15 - x;
}

compiles to:

subr_v2di:
mov z31.d, #15
sub v0.2d, v31.2d, v0.2d
ret

but can just be:

subr_v2di:
subr z0.d, z0.d, #15
ret

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR 

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md:
(sub3): Extended the pattern to emit SUBR for SVE
targets if operand 1 is an immediate.
* config/aarch64/predicates.md (aarch64_sve_arith_imm_or_reg_operand):
New predicate that accepts aarch64_sve_arith_immediate in operand 1 but
only for TARGET_SVE.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/subr-sve.c: New test.



0001-aarch64-Use-SVE-SUBR-instruction-with-Neon-modes.patch
Description: 0001-aarch64-Use-SVE-SUBR-instruction-with-Neon-modes.patch


[PATCH] aarch64: Extend SVE2 bit-select instructions for Neon modes.

2024-11-25 Thread Soumya AR
NBSL, BSL1N, and BSL2N are bit-select instructions on SVE2 with certain operands
inverted. These can be extended to work with Neon modes.

Since these instructions are unpredicated, duplicate patterns were added with
the predicate removed to generate these instructions for Neon modes.
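
For reference, the operations computed are equivalent to the OPNBSL, OPBSL1N,
and OPBSL2N macros used in the new test: the third operand selects bits from
the first input where set and from the second where clear, and the variants
invert the result, the first input, or the second input respectively.

#define NBSL(x, y, z)  (~(((x) & (z)) | ((y) & ~(z))))
#define BSL1N(x, y, z) ((~(x) & (z)) | ((y) & ~(z)))
#define BSL2N(x, y, z) (((x) & (z)) | (~(y) & ~(z)))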

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR 

gcc/ChangeLog:

* config/aarch64/aarch64-sve2.md
(*aarch64_sve2_nbsl_unpred): New pattern to match unpredicated
form.
(*aarch64_sve2_bsl1n_unpred): Likewise.
(*aarch64_sve2_bsl2n_unpred): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/bitsel.c: New test.


0001-aarch64-Extend-SVE2-bit-select-instructions-for-Neon.patch
Description: 0001-aarch64-Extend-SVE2-bit-select-instructions-for-Neon.patch


Re: [PATCH] aarch64: Extend SVE2 bit-select instructions for Neon modes.

2024-12-10 Thread Soumya AR


> On 10 Dec 2024, at 7:09 PM, Richard Sandiford  
> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> Soumya AR  writes:
>> @@ -1815,6 +1849,42 @@
>>   }
>> )
>> 
>> +(define_insn "*aarch64_sve2_bsl2n_unpred"
>> +  [(set (match_operand:VDQ_I 0 "register_operand")
>> + (ior:VDQ_I
>> +   (and:VDQ_I
>> + (match_operand:VDQ_I 1 "register_operand")
>> + (match_operand:VDQ_I 2 "register_operand"))
>> +   (and:VDQ_I
>> + (not:VDQ_I
>> +   (match_operand:VDQ_I 3 "register_operand"))
>> + (not:VDQ_I
>> +   (match_dup BSL_DUP)]
> 
> The second "and" should be indented by the same amount as the first "and":
> 
> (define_insn "*aarch64_sve2_bsl2n_unpred"
>  [(set (match_operand:VDQ_I 0 "register_operand")
>(ior:VDQ_I
>  (and:VDQ_I
>(match_operand:VDQ_I 1 "register_operand")
>(match_operand:VDQ_I 2 "register_operand"))
>  (and:VDQ_I
>(not:VDQ_I (match_operand:VDQ_I 3 "register_operand"))
>(not:VDQ_I (match_dup BSL_DUP)]
> 
>> +  "TARGET_SVE2"
>> +  {@ [ cons: =0 , 1 , 2 , 3 ; attrs: movprfx ]
>> + [ w,  ,  , w ; *  ] 
>> bsl2n\t%Z0.d, %Z0.d, %Z3.d, %Z.d
>> + [ ?&w  , w , w , w ; yes] 
>> movprfx\t%Z0, %Z\;bsl2n\t%Z0.d, %Z0.d, %Z3.d, %Z.d
>> +  }
>> +)
>> +
>> +(define_insn "*aarch64_sve2_bsl2n_unpred"
>> +  [(set (match_operand:VDQ_I 0 "register_operand")
>> + (ior:VDQ_I
>> +   (and:VDQ_I
>> + (match_operand:VDQ_I 1 "register_operand")
>> + (match_operand:VDQ_I 2 "register_operand"))
>> +   (and:VDQ_I
>> + (not:VDQ_I
>> +   (match_dup BSL_DUP))
>> + (not:VDQ_I
>> +   (match_operand:VDQ_I 3 "register_operand")]
> 
> Similarly here.
> 
> OK with those changes, thanks.
> 
Ah my bad, fixed those.
Committed: 65b7c8db9c61bcdfd07a3404047dd2d2beac4bbb

Thank you.

Best,
Soumya
> Richard
> 
>> +  "TARGET_SVE2"
>> +  {@ [ cons: =0 , 1 , 2 , 3 ; attrs: movprfx ]
>> + [ w,  ,  , w ; *  ] 
>> bsl2n\t%Z0.d, %Z0.d, %Z3.d, %Z.d
>> + [ ?&w  , w , w , w ; yes] 
>> movprfx\t%Z0, %Z\;bsl2n\t%Z0.d, %Z0.d, %Z3.d, %Z.d
>> +  }
>> +)
>> +
>> ;; -
>> ;;  [INT] Shift-and-accumulate operations
>> ;; -
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/bitsel.c 
>> b/gcc/testsuite/gcc.target/aarch64/sve/bitsel.c
>> new file mode 100644
>> index 000..635bfefc17c
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/bitsel.c
>> @@ -0,0 +1,35 @@
>> +/* { dg-options "-O2 -mcpu=neoverse-v2 --param 
>> aarch64-autovec-preference=asimd-only" } */
>> +
>> +#include 
>> +
>> +#define OPNBSL(x,y,z) (~(((x) & (z)) | ((y) & ~(z))))
>> +#define OPBSL1N(x,y,z) ((~(x) & (z)) | ((y) & ~(z)))
>> +#define OPBSL2N(x,y,z) (((x) & (z)) | (~(y) & ~(z)))
>> +
>> +#define N 1024
>> +
>> +#define TYPE(N) int##N##_t
>> +
>> +#define TEST(SIZE, OP, SUFFIX)  \
>> +void __attribute__ ((noinline, noclone))\
>> +f_##SIZE##_##SUFFIX \
>> +  (TYPE(SIZE) *restrict a, TYPE(SIZE) *restrict b,  \
>> +   TYPE(SIZE) *restrict c, TYPE(SIZE) *restrict d)  \
>> +{   \
>> +  for (int i = 0; i < N; i++)   \
>> +a[i] = OP (b[i], c[i], d[i]);   \
>> +}
>> +
>> +#define TEST_ALL(SIZE)  \
>> +  TEST(SIZE, OPNBSL, nbsl)  \
>> +  TEST(SIZE, OPBSL1N, bsl1n)\
>> +  TEST(SIZE, OPBSL2N, bsl2n)
>> +
>> +TEST_ALL(8);
>> +TEST_ALL(16);
>> +TEST_ALL(32);
>> +TEST_ALL(64);
>> +
>> +/* { dg-final { scan-assembler-times {\tnbsl\tz[0-9]+\.d, z[0-9]+\.d, 
>> z[0-9]+\.d, z[0-9]+\.d\n} 4 } } */
>> +/* { dg-final { scan-assembler-times {\tbsl1n\tz[0-9]+\.d, z[0-9]+\.d, 
>> z[0-9]+\.d, z[0-9]+\.d\n} 4 } } */
>> +/* { dg-final { scan-assembler-times {\tbsl2n\tz[0-9]+\.d, z[0-9]+\.d, 
>> z[0-9]+\.d, z[0-9]+\.d\n} 4 } } */
>> \ No newline at end of file




Re: [PATCH] aarch64: Use SVE ASRD instruction with Neon modes.

2024-12-10 Thread Soumya AR


> On 10 Dec 2024, at 7:03 PM, Richard Sandiford  
> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> Soumya AR  writes:
>> Hi Richard,
>> 
>> Thanks for reviewing this!
>> 
>> I’ve made the suggested changes and added the aarch64_ptrue_reg overload.
> 
> The updated patch is OK, thanks.

Thanks, committed: e5569a20cf3791553ac324269001a7c7c0e56242
> 
>> Thank you for providing the implementation for this, I have added you
>> as a co-author for the patch, hope that works :)
> 
> Sure, that's ok with me FWIW.  IMO it isn't really necessary though,
> so it's ok without too :)
> 
> Sometimes it just feels easier to write review comments in the form
> of code snippets, since that's more direct & precise than describing
> something in English.  It is still fundamentally a code review though.
> 
Yes I understand, but would prefer having it there rather than not :)

Best,
Soumya
> Richard
> 
>> 
>> Best,
>> Soumya
>> 
>> 
>>> On 5 Dec 2024, at 10:08 PM, Richard Sandiford  
>>> wrote:
>>> 
>>> External email: Use caution opening links or attachments
>>> 
>>> 
>>> Soumya AR  writes:
>>>> The ASRD instruction on SVE performs an arithmetic shift right by an 
>>>> immediate
>>>> for divide.
>>>> 
>>>> This patch enables the use of ASRD with Neon modes.
>>>> 
>>>> For example:
>>>> 
>>>> int in[N], out[N];
>>>> 
>>>> void
>>>> foo (void)
>>>> {
>>>> for (int i = 0; i < N; i++)
>>>>   out[i] = in[i] / 4;
>>>> }
>>>> 
>>>> compiles to:
>>>> 
>>>> ldr q31, [x1, x0]
>>>> cmltv30.16b, v31.16b, #0
>>>> and z30.b, z30.b, 3
>>>> add v30.16b, v30.16b, v31.16b
>>>> sshrv30.16b, v30.16b, 2
>>>> str q30, [x0, x2]
>>>> add x0, x0, 16
>>>> cmp x0, 1024
>>>> 
>>>> but can just be:
>>>> 
>>>> ldp q30, q31, [x0], 32
>>>> asrdz31.b, p7/m, z31.b, #2
>>>> asrdz30.b, p7/m, z30.b, #2
>>>> stp q30, q31, [x1], 32
>>>> cmp x0, x2
>>>> 
>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>>> regression.
>>>> OK for mainline?
>>>> 
>>>> Signed-off-by: Soumya AR 
>>>> 
>>>> gcc/ChangeLog:
>>>> 
>>>> * config/aarch64/aarch64-sve.md: Extended sdiv_pow23 and
>>>> *sdiv_pow23 to support Neon modes.
>>>> 
>>>> gcc/testsuite/ChangeLog:
>>>> 
>>>> * gcc.target/aarch64/sve/sve-asrd.c: New test.
>>>> ---
>>>> gcc/config/aarch64/aarch64-sve.md | 25 -
>>>> .../gcc.target/aarch64/sve/sve-asrd.c | 54 +++
>>>> 2 files changed, 67 insertions(+), 12 deletions(-)
>>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c
>>> 
>>> Rearranging to put the testcase first:
>>> 
>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c 
>>>> b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c
>>>> new file mode 100644
>>>> index 000..00aa8b2380d
>>>> --- /dev/null
>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c
>>>> @@ -0,0 +1,54 @@
>>>> +/* { dg-do compile } */
>>>> +/* { dg-options "-Ofast --param aarch64-autovec-preference=asimd-only" } 
>>>> */
>>>> +/* { dg-final { check-function-bodies "**" "" "" } } */
>>>> +
>>>> +#include 
>>>> +#define N 1024
>>>> +
>>>> +#define FUNC(M) \
>>>> +M in_##M[N];\
>>>> +M out_##M[N];   \
>>>> +void asrd_##M() {   \
>>>> +  for (int i = 0; i < N; i++)   \
>>>> +out_##M[i] = in_##M[i] / 4; \
>>>> +}
>>>> +
>>>> +/*
>>>> +** asrd_int8_t:
>>>> +**   ...
>>>> +**   ptrue   (p[0-7]).b, vl1
>>>> +**   ...
>>>> +**   asrdz[0-9]+\.b, \1/m, z[0-9]+\.b, #2
>>>> +**   ...
>>>> +*/

Re: [PATCH] aarch64: Use SVE ASRD instruction with Neon modes.

2024-12-10 Thread Soumya AR
Hi Richard,

Thanks for reviewing this!

I’ve made the suggested changes and added the aarch64_ptrue_reg overload.
Thank you for providing the implementation for this, I have added you 
as a co-author for the patch, hope that works :)

Best,
Soumya


> On 5 Dec 2024, at 10:08 PM, Richard Sandiford  
> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> Soumya AR  writes:
>> The ASRD instruction on SVE performs an arithmetic shift right by an 
>> immediate
>> for divide.
>> 
>> This patch enables the use of ASRD with Neon modes.
>> 
>> For example:
>> 
>> int in[N], out[N];
>> 
>> void
>> foo (void)
>> {
>>  for (int i = 0; i < N; i++)
>>out[i] = in[i] / 4;
>> }
>> 
>> compiles to:
>> 
>>  ldr q31, [x1, x0]
>>  cmltv30.16b, v31.16b, #0
>>  cmlt v30.16b, v31.16b, #0
>>  add v30.16b, v30.16b, v31.16b
>>  sshrv30.16b, v30.16b, 2
>>  sshr v30.16b, v30.16b, 2
>>  add x0, x0, 16
>>  cmp x0, 1024
>> 
>> but can just be:
>> 
>>  ldp q30, q31, [x0], 32
>>  asrd z31.b, p7/m, z31.b, #2
>>  asrd z30.b, p7/m, z30.b, #2
>>  stp q30, q31, [x1], 32
>>  cmp x0, x2
>> 
>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
>> OK for mainline?
>> 
>> Signed-off-by: Soumya AR 
>> 
>> gcc/ChangeLog:
>> 
>>  * config/aarch64/aarch64-sve.md: Extended sdiv_pow23 and
>>  *sdiv_pow23 to support Neon modes.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>>  * gcc.target/aarch64/sve/sve-asrd.c: New test.
>> ---
>> gcc/config/aarch64/aarch64-sve.md | 25 -
>> .../gcc.target/aarch64/sve/sve-asrd.c | 54 +++
>> 2 files changed, 67 insertions(+), 12 deletions(-)
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c
> 
> Rearranging to put the testcase first:
> 
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c 
>> b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c
>> new file mode 100644
>> index 000..00aa8b2380d
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c
>> @@ -0,0 +1,54 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-Ofast --param aarch64-autovec-preference=asimd-only" } */
>> +/* { dg-final { check-function-bodies "**" "" "" } } */
>> +
>> +#include 
>> +#define N 1024
>> +
>> +#define FUNC(M) \
>> +M in_##M[N];\
>> +M out_##M[N];   \
>> +void asrd_##M() {   \
>> +  for (int i = 0; i < N; i++)   \
>> +out_##M[i] = in_##M[i] / 4; \
>> +}
>> +
>> +/*
>> +** asrd_int8_t:
>> +**   ...
>> +**   ptrue   (p[0-7]).b, vl1
>> +**   ...
>> +**   asrdz[0-9]+\.b, \1/m, z[0-9]+\.b, #2
>> +**   ...
>> +*/
>> +FUNC(int8_t)
> 
> ptrue ..., vl1 will only do the shift/division on the first element.
> It should instead be vl8 for 64-bit modes and vl16 for 128-bit modes.
> 
>> +
>> +/*
>> +** asrd_int16_t:
>> +**   ...
>> +**   ptrue   (p[0-7]).b, vl2
>> +**   ...
>> +**   asrdz[0-9]+\.h, \1/m, z[0-9]+\.h, #2
>> +**   ...
>> +*/
>> +FUNC(int16_t)
>> +
>> +/*
>> +** asrd_int32_t:
>> +**   ...
>> +**   ptrue   (p[0-7]).b, vl4
>> +**   ...
>> +**   asrdz[0-9]+\.s, \1/m, z[0-9]+\.s, #2
>> +**   ...
>> +*/
>> +FUNC(int32_t)
>> +
>> +/*
>> +** asrd_int64_t:
>> +**   ...
>> +**   ptrue   (p[0-7]).b, vl8
>> +**   ...
>> +**   asrdz[0-9]+\.d, \1/m, z[0-9]+\.d, #2
>> +**   ...
>> +*/
>> +FUNC(int64_t)
>> diff --git a/gcc/config/aarch64/aarch64-sve.md 
>> b/gcc/config/aarch64/aarch64-sve.md
>> index affdb24a93d..96effe4abed 100644
>> --- a/gcc/config/aarch64/aarch64-sve.md
>> +++ b/gcc/config/aarch64/aarch64-sve.md
>> @@ -4972,34 +4972,35 @@
>> 
>> ;; Unpredicated ASRD.
>> (define_expand "sdiv_pow23"
>> -  [(set (match_operand:SVE_I 0 "register_operand")
>> - (unspec:SVE_I
>> +  [(set (match_operand:SVE_VDQ_I 0 "register_operand")
>> + (unspec:SVE_VDQ_I
>>[(match_dup 3)
>> -(unspec:SVE_I
>> -  [(match_operand:SVE_I 1 "register_operand")
>> +(unspec:SVE_VDQ_

Re: [PATCH] aarch64: Extend SVE2 bit-select instructions for Neon modes.

2024-12-10 Thread Soumya AR

> On 5 Dec 2024, at 10:25 PM, Richard Sandiford  
> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> Soumya AR  writes:
>> NBSL, BSL1N, and BSL2N are bit-select instructions on SVE2 with certain
>> operands
>> inverted. These can be extended to work with Neon modes.
>> 
>> Since these instructions are unpredicated, duplicate patterns were added with
>> the predicate removed to generate these instructions for Neon modes.
>> 
>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
>> OK for mainline?
>> 
>> Signed-off-by: Soumya AR 
>> 
>> gcc/ChangeLog:
>> 
>>  * config/aarch64/aarch64-sve2.md
>>  (*aarch64_sve2_nbsl_unpred): New pattern to match unpredicated
>>  form.
>>  (*aarch64_sve2_bsl1n_unpred): Likewise.
>>  (*aarch64_sve2_bsl2n_unpred): Likewise.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>>  * gcc.target/aarch64/sve/bitsel.c: New test.
>> 
>> From 4be7a0ba40150322e2f67214f54c55da51ad2fd2 Mon Sep 17 00:00:00 2001
>> From: Soumya AR 
>> Date: Mon, 25 Nov 2024 11:25:44 +0530
>> Subject: [PATCH] aarch64: Extend SVE2 bit-select instructions for Neon modes.
>> 
>> NBSL, BSL1N, and BSL2N are bit-select instructions on SVE2 with certain
>> operands
>> inverted. These can be extended to work with Neon modes.
>> 
>> Since these instructions are unpredicated, duplicate patterns were added with
>> the predicate removed to generate these instructions for Neon modes.
>> 
>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
>> OK for mainline?
>> 
>> Signed-off-by: Soumya AR 
>> 
>> gcc/ChangeLog:
>> 
>>  * config/aarch64/aarch64-sve2.md
>>  (*aarch64_sve2_nbsl_unpred): New pattern to match unpredicated
>>  form.
>>  (*aarch64_sve2_bsl1n_unpred): Likewise.
>>  (*aarch64_sve2_bsl2n_unpred): Likewise.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>>  * gcc.target/aarch64/sve/bitsel.c: New test.
> 
> Looks good, but some comments about the formatting:
> 
>> ---
>> gcc/config/aarch64/aarch64-sve2.md| 71 +++
>> gcc/testsuite/gcc.target/aarch64/sve/bitsel.c | 35 +
>> 2 files changed, 106 insertions(+)
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/bitsel.c
>> 
>> diff --git a/gcc/config/aarch64/aarch64-sve2.md 
>> b/gcc/config/aarch64/aarch64-sve2.md
>> index e67421bad84..6d38c40e573 100644
>> --- a/gcc/config/aarch64/aarch64-sve2.md
>> +++ b/gcc/config/aarch64/aarch64-sve2.md
>> @@ -1696,6 +1696,23 @@
>>   }
>> )
>> 
>> +(define_insn "*aarch64_sve2_nbsl_unpred"
>> +  [(set (match_operand:VDQ_I 0 "register_operand")
>> +(not:VDQ_I
>> +  (xor:VDQ_I
>> +(and:VDQ_I
>> +  (xor:VDQ_I
>> +(match_operand:VDQ_I 1 "register_operand")
>> +(match_operand:VDQ_I 2 "register_operand"))
>> +  (match_operand:VDQ_I 3 "register_operand"))
>> +(match_dup BSL_DUP]
>> +  "TARGET_SVE2"
>> +  {@ [ cons: =0 , 1 , 2 , 3 ; attrs: movprfx ]
>> + [ w,  ,  , w ; *  ] nbsl\t%Z0.d, 
>> %Z0.d, %Z.d, %Z3.d
>> + [ ?&w  , w , w , w ; yes] 
>> movprfx\t%Z0, %Z\;nbsl\t%Z0.d, %Z0.d, %Z.d, %Z3.d
>> +  }
>> +)
>> +
>> ;; Unpredicated bitwise select with inverted first operand.
>> ;; (op3 ? ~bsl_mov : bsl_dup) == ((~(bsl_mov ^ bsl_dup) & op3) ^ bsl_dup)
>> (define_expand "@aarch64_sve2_bsl1n"
>> @@ -1741,6 +1758,23 @@
>>   }
>> )
>> 
>> +(define_insn "*aarch64_sve2_bsl1n_unpred"
>> +  [(set (match_operand:VDQ_I 0 "register_operand")
>> + (xor:VDQ_I
>> +   (and:VDQ_I
>> +   (not:VDQ_I
>> +  (xor:VDQ_I
>> +(match_operand:VDQ_I 1 "register_operand")
>> +(match_operand:VDQ_I 2 "register_operand")))
> 
> Removing the unspecs from the SVE patterns leaves these operations
> indented by an unusual amount.  Generally:
> 
> - if the first operand is on the same line as the operator (as with
>  the set above), the other operands are indented to be underneath
>  the first operand.
> 
> - if an operator appears on a line by itself, the operands are generally
>  indented by two spaces more than the operator.
&

[PATCH] match.pd: Fix indefinite recursion during exp-log transformations [PR118490]

2025-01-19 Thread Soumya AR
This patch fixes the ICE caused when comparing log or exp of a constant with
another constant.

The transform is now restricted to cases where the resultant log/exp (CST) can
be constant folded, using match.pd's '!' modifier so the simplification only
applies when the marked subexpression folds to a constant.

Signed-off-by: Soumya AR 

gcc/ChangeLog:

PR target/118490
* match.pd: Added ! to verify that log/exp (CST) can be constant folded.

gcc/testsuite/ChangeLog:

PR target/118490
* gcc.dg/pr118490.c: New test.
---
gcc/match.pd | 4 ++--
gcc/testsuite/gcc.dg/pr118490.c | 7 +++
2 files changed, 9 insertions(+), 2 deletions(-)
create mode 100644 gcc/testsuite/gcc.dg/pr118490.c

diff --git a/gcc/match.pd b/gcc/match.pd
index b6cbb851897..fd1ddf627bf 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -8317,12 +8317,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
/* Simplify logN (x) CMP CST into x CMP expN (CST) */
(simplify
(cmp:c (logs:s @0) REAL_CST@1)
- (cmp @0 (exps @1)))
+ (cmp @0 (exps! @1)))

/* Simplify expN (x) CMP CST into x CMP logN (CST) */
(simplify
(cmp:c (exps:s @0) REAL_CST@1)
- (cmp @0 (logs @1))
+ (cmp @0 (logs! @1))

(for logs (LOG LOG2 LOG10 LOG10)
exps (EXP EXP2 EXP10 POW10)
diff --git a/gcc/testsuite/gcc.dg/pr118490.c b/gcc/testsuite/gcc.dg/pr118490.c
new file mode 100644
index 000..4ae0dacefee
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr118490.c
@@ -0,0 +1,7 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ffast-math -frounding-math -Wlogical-op" } */
+
+double exp(double);
+int foo(int v) {
+ return v && exp(1.) < 2.;
+}
-- 
2.43.2


Re: [PATCH] aarch64: Use generic_armv8_a_prefetch_tune in generic_armv8_a.h

2025-02-18 Thread Soumya AR


> On 18 Feb 2025, at 2:27 PM, Kyrylo Tkachov  wrote:
> 
> 
> 
>> On 18 Feb 2025, at 09:48, Kyrylo Tkachov  wrote:
>> 
>> 
>> 
>>> On 18 Feb 2025, at 09:41, Richard Sandiford  
>>> wrote:
>>> 
>>> Kyrylo Tkachov  writes:
>>>> Hi Soumya
>>>> 
>>>>> On 18 Feb 2025, at 09:12, Soumya AR  wrote:
>>>>> 
>>>>> generic_armv8_a.h defines generic_armv8_a_prefetch_tune but still uses
>>>>> generic_prefetch_tune in generic_armv8_a_tunings.
>>>>> 
>>>>> This patch updates the pointer to generic_armv8_a_prefetch_tune.
>>>>> 
>>>>> This patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>>>> regression.
>>>>> 
>>>>> Ok for GCC 15 now?
>>>> 
>>>> Yes, this looks like a simple oversight.
>>>> Ok to push to master.
>>> 
>>> I suppose the alternative would be to remove generic_armv8_a_prefetch_tune,
>>> since it's (deliberately) identical to generic_prefetch_tune.
>> 
>> Looks like we have one prefetch_tune structure for each of the generic 
>> tunings (generic, generic_armv8_a, generic_armv9_a).
>> For the sake of symmetry it feels a bit better to have them independently 
>> tunable.
>> But as the effects are the same, it may be better to remove it in the 
>> interest of less code.
>> 
> 
> I see Soumya has already pushed her patch. I’m okay with either approach tbh, 
> but if Richard prefers we can remove generic_armv8_a_prefetch_tune in a 
> separate commit.

Yeah, missed Richard’s mail.

Let me know which is preferable, thanks.

Best,
Soumya

> Thanks,
> Kyrill
> 
> 
>> Thanks,
>> Kyrill
>> 
>>> 
>>>> Thanks,
>>>> Kyrill
>>>> 
>>>>> 
>>>>> Signed-off-by: Soumya AR 
>>>>> 
>>>>> gcc/ChangeLog:
>>>>> 
>>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Updated prefetch
>>>>> struct pointer.
>>>>> 
>>>>> ---
>>>>> gcc/config/aarch64/tuning_models/generic_armv8_a.h | 2 +-
>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>> 
>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>> index 35de3f03296..01080cade46 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>> @@ -184,7 +184,7 @@ static const struct tune_params 
>>>>> generic_armv8_a_tunings =
>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>>>> - &generic_prefetch_tune,
>>>>> + &generic_armv8_a_prefetch_tune,
>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>>>> AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model. */
>>>>> };
>>>>> -- 
>>>>> 2.34.1




[PATCH] aarch64: Use generic_armv8_a_prefetch_tune in generic_armv8_a.h

2025-02-18 Thread Soumya AR
generic_armv8_a.h defines generic_armv8_a_prefetch_tune but still uses
generic_prefetch_tune in generic_armv8_a_tunings.

This patch updates the pointer to generic_armv8_a_prefetch_tune.

This patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.

Ok for GCC 15 now?

Signed-off-by: Soumya AR 

gcc/ChangeLog:

* config/aarch64/tuning_models/generic_armv8_a.h: Updated prefetch
struct pointer.

---
gcc/config/aarch64/tuning_models/generic_armv8_a.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
index 35de3f03296..01080cade46 100644
--- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
+++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
@@ -184,7 +184,7 @@ static const struct tune_params generic_armv8_a_tunings =
(AARCH64_EXTRA_TUNE_BASE
| AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
- &generic_prefetch_tune,
+ &generic_armv8_a_prefetch_tune,
AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model. */
};
-- 
2.34.1