[gcc r15-3666] aarch64: Emit ADD X, Y, Y instead of SHL X, Y, #1 for SVE instructions.

2024-09-16 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:4af196b2ebd662c5183f1998b0184985e85479b2

commit r15-3666-g4af196b2ebd662c5183f1998b0184985e85479b2
Author: Soumya AR 
Date:   Tue Sep 10 14:18:44 2024 +0530

aarch64: Emit ADD X, Y, Y instead of SHL X, Y, #1 for SVE instructions.

On Neoverse V2, SVE ADD instructions have a throughput of 4, while shift
instructions like SHL have a throughput of 2. We can lean on that to emit
code like:
  add     z31.b, z31.b, z31.b
instead of:
  lsl     z31.b, z31.b, #1

The implementation of this change for SVE vectors is similar to a prior patch
that adds the above functionality for Neon vectors.

Here, the machine description pattern is split up to separately accommodate
left and right shifts, so we can specifically emit an add for all left shifts
by 1.
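As a rough illustration (a hypothetical example, not the committed
sve_shl_add.c test), a plain C shift by one that the vectorizer turns into
an SVE shift is the kind of code affected; with this patch the vectorized
shift by 1 is expected to come out as an ADD of a register with itself
rather than an LSL by #1. Compilation flags such as -O3 -mcpu=neoverse-v2
are assumed here:

#include <stdint.h>

/* dst[i] = src[i] << 1 vectorizes to a shift by 1 on each SVE lane,
   which this patch now emits as ADD Z, Z, Z instead of LSL Z, Z, #1.  */
void
shl_by_one (int8_t *restrict dst, const int8_t *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i] << 1;
}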

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR 

gcc/ChangeLog:

* config/aarch64/aarch64-sve.md (*post_ra_v3): Split pattern
to accommodate left and right shifts separately.
(*post_ra_v_ashl3): Matches left shifts with additional
constraint to check for shifts by 1.
(*post_ra_v_3): Matches right shifts.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/acle/asm/lsl_s16.c: Updated instances of lsl-1
with corresponding add.
* gcc.target/aarch64/sve/acle/asm/lsl_s32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_s64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_s8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_u16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_u64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_u8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_s16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_s32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_s8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_u16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/lsl_wide_u8.c: Likewise.
* gcc.target/aarch64/sve/adr_1.c: Likewise.
* gcc.target/aarch64/sve/adr_6.c: Likewise.
* gcc.target/aarch64/sve/cond_mla_7.c: Likewise.
* gcc.target/aarch64/sve/cond_mla_8.c: Likewise.
* gcc.target/aarch64/sve/shift_2.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_s64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_u64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_s64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_u64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_s16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_s32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_s64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_s8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_u16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_u32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_u64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/rshl_u8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_s64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_u64.c: Likewise.
* gcc.target/aarch64/sve/sve_shl_add.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-sve.md  | 18 +++--
 .../gcc.target/aarch64/sve/acle/asm/lsl_s16.c  |  4 +-
 .../gcc.target/aarch64/sve/acle/asm/lsl_s32.c  |  4 +-
 .../gcc.target/aarch64/sve/acle/asm/lsl_s64.c  |  4 +-
 .../gcc.target/aarch64/sve/acle/asm/lsl_s8.c   |  4 +-
 .../gcc.target/aarch64/sve/acle/asm/lsl_u16.c  |  4 +-
 .../gcc.target/aarch64/sve/acle/asm/lsl_u32.c  |  4 +-
 .../gcc.target/aarch64/sve/acle/asm/lsl_u64.c  |  4 +-
 .../gcc.target/aarch64/sve/acle/asm/lsl_u8.c   |  4 +-
 .../gcc.target/aarch64/sve/acle/asm/lsl_wide_s16.c |  4 +-
 .../gcc.target/aarch64/sve/acle/asm/lsl_wide_s32.c |  4 +-
 .../gcc.target/aarch64/sve/acle/asm/lsl_wide_s8.c  |  4 +-
 .../gcc.target/aarch64/sve/acle/asm/lsl_wide_u16.c |  4 +-
 .../gcc.target/aarch64/sve/acle/asm/lsl_wide_u32.c |  4 +-
 .../gcc.target/aarch64/sve/acle/asm/lsl_wide_u8.c  |  4 +-
 gcc/testsuite/gcc.target/aarch64/sve/adr_1.c   |  6 +--
 gcc/testsuite/gcc.target/aarch64/sve/adr_6.c   |  4 +-
 gcc/testsuite/gcc.target/aarch64/sve/cond_mla_7.c  |  8 ++--
 gcc/testsuite/gcc.target/aarch64/sve/cond_mla_8.c  |  8 ++--
 gcc/testsuite/gcc.target/

[gcc r15-3018] aarch64: Reduce FP reassociation width for Neoverse V2 and set AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FM

2024-08-19 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:cc572242688f0c6f8733c173038163efb09560fa

commit r15-3018-gcc572242688f0c6f8733c173038163efb09560fa
Author: Kyrylo Tkachov 
Date:   Fri Aug 2 06:48:47 2024 -0700

aarch64: Reduce FP reassociation width for Neoverse V2 and set AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA

The fp reassociation width for Neoverse V2 was set to 6 since its
introduction and I guess it was empirically tuned.  But since
AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA was added the tree reassociation
pass seems to be more deliberate in forming FMAs and when that flag is
used it seems to more properly evaluate the FMA vs non-FMA reassociation
widths.
According to the Neoverse V2 SWOG the core has a throughput of 4 for
most FP operations, so the value 6 is not accurate anyway.
Also, the SWOG does state that FMADD operations are pipelined and the
results can be forwarded from FP multiplies to the accumulation operands
of FMADD instructions, which seems to be what
AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA expresses.

This patch sets the fp_reassoc_width field to 4 and enables
AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA for -mcpu=neoverse-v2.
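As a sketch of the kind of code these widths govern (illustrative only,
assuming -Ofast and this tuning; dot8 is a made-up function, not part of
the patch):

/* A dependent chain of multiply-adds.  Under -Ofast the reassociation
   pass can rebalance it into parallel partial sums, bounded by the
   fp/fma reassoc widths above, and each term is an FMA candidate.  */
double
dot8 (const double *a, const double *b)
{
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]
	 + a[4] * b[4] + a[5] * b[5] + a[6] * b[6] + a[7] * b[7];
}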

On SPEC2017 fprate I see the following changes on a Grace system:
503.bwaves_r     0.16%
507.cactuBSSN_r -0.32%
508.namd_r       3.04%
510.parest_r     0.00%
511.povray_r     0.78%
519.lbm_r        0.35%
521.wrf_r        0.69%
526.blender_r   -0.53%
527.cam4_r       0.84%
538.imagick_r    0.00%
544.nab_r       -0.97%
549.fotonik3d_r -0.45%
554.roms_r       0.97%
Geomean          0.35%

with -Ofast -mcpu=grace -flto.

So slight overall improvement with a meaningful improvement in
508.namd_r.

I think other tunings in aarch64 should look into
AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA as well, but I'll leave the
benchmarking to someone else.

Signed-off-by: Kyrylo Tkachov 

gcc/ChangeLog:

* config/aarch64/tuning_models/neoversev2.h (fp_reassoc_width):
Set to 4.
(tune_flags): Add AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA.

Diff:
---
 gcc/config/aarch64/tuning_models/neoversev2.h | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
b/gcc/config/aarch64/tuning_models/neoversev2.h
index 1ebb96b296d..52aad7d4a43 100644
--- a/gcc/config/aarch64/tuning_models/neoversev2.h
+++ b/gcc/config/aarch64/tuning_models/neoversev2.h
@@ -231,7 +231,7 @@ static const struct tune_params neoversev2_tunings =
   "4", /* jump_align.  */
   "32:16", /* loop_align.  */
   3,   /* int_reassoc_width.  */
-  6,   /* fp_reassoc_width.  */
+  4,   /* fp_reassoc_width.  */
   4,   /* fma_reassoc_width.  */
   3,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
@@ -242,10 +242,11 @@ static const struct tune_params neoversev2_tunings =
| AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
| AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
-   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),   /* tune_flags.  */
+   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
+   | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),  /* tune_flags.  */
   &generic_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
   AARCH64_LDP_STP_POLICY_ALWAYS   /* stp_policy_model.  */
 };
 
-#endif /* GCC_AARCH64_H_NEOVERSEV2.  */
\ No newline at end of file
+#endif /* GCC_AARCH64_H_NEOVERSEV2.  */


[gcc r15-1647] [aarch64] Add support for -mcpu=grace

2024-06-26 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:7fada36c77829a197f63dde0d48ca33139105202

commit r15-1647-g7fada36c77829a197f63dde0d48ca33139105202
Author: Kyrylo Tkachov 
Date:   Wed Jun 26 09:42:11 2024 +0200

[aarch64] Add support for -mcpu=grace

This adds support for the NVIDIA Grace CPU to aarch64.
We reuse the tuning decisions for the Neoverse V2 core, but include a
number of architecture features that are not enabled by default in
-mcpu=neoverse-v2.

This allows Grace users to more simply target the CPU with -mcpu=grace
rather than remembering what extensions to tag on top of
-mcpu=neoverse-v2.
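As a hypothetical illustration (not part of the patch), the same CPU can
also be selected per function with the aarch64 target attribute, which
accepts the same cpu= names as -mcpu:

/* Assumed usage: build just this function as if -mcpu=grace were given.  */
__attribute__ ((target ("cpu=grace")))
void
copy_bytes (unsigned char *dst, const unsigned char *src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i];
}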

Bootstrapped and tested on aarch64-none-linux-gnu.
gcc/

* config/aarch64/aarch64-cores.def (grace): New entry.
* config/aarch64/aarch64-tune.md: Regenerate.
* doc/invoke.texi (AArch64 Options): Document the above.

Signed-off-by: Kyrylo Tkachov 

Diff:
---
 gcc/config/aarch64/aarch64-cores.def | 2 ++
 gcc/config/aarch64/aarch64-tune.md   | 2 +-
 gcc/doc/invoke.texi  | 4 ++--
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-cores.def 
b/gcc/config/aarch64/aarch64-cores.def
index 0e05e81761c..e58bc0f27de 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -194,6 +194,8 @@ AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, 
(I8MM, BF16, SVE2_BITPER
 AARCH64_CORE("cobalt-100",   cobalt100, cortexa57, V9A, (I8MM, BF16, 
SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x6d, 0xd49, -1)
 
 AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, 
SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
+AARCH64_CORE("grace", grace, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
SVE2_AES, SVE2_SHA3, SVE2_SM4, PROFILE), neoversev2, 0x41, 0xd4f, -1)
+
 AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
 
 /* Generic Architecture Processors.  */
diff --git a/gcc/config/aarch64/aarch64-tune.md 
b/gcc/config/aarch64/aarch64-tune.md
index 9b1f32a0330..719fd3dc62a 100644
--- a/gcc/config/aarch64/aarch64-tune.md
+++ b/gcc/config/aarch64/aarch64-tune.md
@@ -1,5 +1,5 @@
 ;; -*- buffer-read-only: t -*-
 ;; Generated automatically by gentune.sh from aarch64-cores.def
 (define_attr "tune"
-   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,demeter,generic,generic_armv8_a,generic_armv9_a"
+   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,demeter,generic,generic_armv8_a,generic_armv9_a"
(const (symbol_ref "((enum attr_tune) aarch64_tune)")))
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 729dbc1691e..30c4b002d1f 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -21437,8 +21437,8 @@ performance of the code.  Permissible values for this 
option are:
 @samp{ares}, @samp{exynos-m1}, @samp{emag}, @samp{falkor},
 @samp{oryon-1},
 @samp{neoverse-512tvb}, @samp{neoverse-e1}, @samp{neoverse-n1},
-@samp{neoverse-n2}, @samp{neoverse-v1}, @samp{neoverse-v2}, @samp{qdf24xx},
-@samp{saphira}, @samp{phecda}, @samp{xgene1}, @samp{vulcan},
+@samp{neoverse-n2}, @samp{neoverse-v1}, @samp{neoverse-v2}, @samp{grace},
+@samp{qdf24xx}, @samp{saphira}, @samp{phecda}, @samp{xgene1}, @samp{vulcan},
 @samp{octeontx}, @samp{octeontx81},  @samp{octeontx83},
 @samp{octeontx2}, @samp{octeontx2t98}, @samp{octeontx2t96}
 @samp{octeontx2t93}

[gcc r14-10351] aarch64: Add support for -mcpu=grace

2024-06-27 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:c2878a9a1719e067b1476377bd2a292350482e61

commit r14-10351-gc2878a9a1719e067b1476377bd2a292350482e61
Author: Kyrylo Tkachov 
Date:   Wed Jun 19 14:56:02 2024 +0530

aarch64: Add support for -mcpu=grace

This adds support for the NVIDIA Grace CPU to aarch64.
We reuse the tuning decisions for the Neoverse V2 core, but include a
number of architecture features that are not enabled by default in
-mcpu=neoverse-v2.

This allows Grace users to more simply target the CPU with -mcpu=grace
rather than remembering what extensions to tag on top of
-mcpu=neoverse-v2.

Bootstrapped and tested on aarch64-none-linux-gnu.
gcc/

* config/aarch64/aarch64-cores.def (grace): New entry.
* config/aarch64/aarch64-tune.md: Regenerate.
* doc/invoke.texi (AArch64 Options): Document the above.

Signed-off-by: Kyrylo Tkachov 

Diff:
---
 gcc/config/aarch64/aarch64-cores.def | 2 ++
 gcc/config/aarch64/aarch64-tune.md   | 2 +-
 gcc/doc/invoke.texi  | 4 ++--
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-cores.def 
b/gcc/config/aarch64/aarch64-cores.def
index f69fc212d56..f5536388f61 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -189,6 +189,8 @@ AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, 
(I8MM, BF16, SVE2_BITPER
 AARCH64_CORE("cobalt-100",   cobalt100, cortexa57, V9A, (I8MM, BF16, 
SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x6d, 0xd49, -1)
 
 AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, 
SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
+AARCH64_CORE("grace", grace, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
SVE2_AES, SVE2_SHA3, SVE2_SM4, PROFILE), neoversev2, 0x41, 0xd4f, -1)
+
 AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
 
 /* Generic Architecture Processors.  */
diff --git a/gcc/config/aarch64/aarch64-tune.md 
b/gcc/config/aarch64/aarch64-tune.md
index abd3c9e0822..80254836e0e 100644
--- a/gcc/config/aarch64/aarch64-tune.md
+++ b/gcc/config/aarch64/aarch64-tune.md
@@ -1,5 +1,5 @@
 ;; -*- buffer-read-only: t -*-
 ;; Generated automatically by gentune.sh from aarch64-cores.def
 (define_attr "tune"
-   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,demeter,generic,generic_armv8_a,generic_armv9_a"
+   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,demeter,generic,generic_armv8_a,generic_armv9_a"
(const (symbol_ref "((enum attr_tune) aarch64_tune)")))
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index a916d618960..67220051a5b 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -21324,8 +21324,8 @@ performance of the code.  Permissible values for this 
option are:
 @samp{cortex-a78}, @samp{cortex-a78ae}, @samp{cortex-a78c},
 @samp{ares}, @samp{exynos-m1}, @samp{emag}, @samp{falkor},
 @samp{neoverse-512tvb}, @samp{neoverse-e1}, @samp{neoverse-n1},
-@samp{neoverse-n2}, @samp{neoverse-v1}, @samp{neoverse-v2}, @samp{qdf24xx},
-@samp{saphira}, @samp{phecda}, @samp{xgene1}, @samp{vulcan},
+@samp{neoverse-n2}, @samp{neoverse-v1}, @samp{neoverse-v2}, @samp{grace},
+@samp{qdf24xx}, @samp{saphira}, @samp{phecda}, @samp{xgene1}, @samp{vulcan},
 @samp{octeontx}, @samp{octeontx81},  @samp{octeontx83},
 @samp{octeontx2}, @samp{octeontx2t98}, @samp{octe

[gcc r13-8871] Add support for -mcpu=grace

2024-06-27 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:952ea3260e40992d3bf5e1f17b4845a4e5c908b5

commit r13-8871-g952ea3260e40992d3bf5e1f17b4845a4e5c908b5
Author: Kyrylo Tkachov 
Date:   Wed Jun 19 14:56:02 2024 +0530

Add support for -mcpu=grace

This adds support for the NVIDIA Grace CPU to aarch64.
We reuse the tuning decisions for the Neoverse V2 core, but include a
number of architecture features that are not enabled by default in
-mcpu=neoverse-v2.

This allows Grace users to more simply target the CPU with -mcpu=grace
rather than remembering what extensions to tag on top of
-mcpu=neoverse-v2.

Bootstrapped and tested on aarch64-none-linux-gnu.
gcc/

* config/aarch64/aarch64-cores.def (grace): New entry.
* config/aarch64/aarch64-tune.md: Regenerate.
* doc/invoke.texi (AArch64 Options): Document the above.

Signed-off-by: Kyrylo Tkachov 

Diff:
---
 gcc/config/aarch64/aarch64-cores.def | 2 ++
 gcc/config/aarch64/aarch64-tune.md   | 2 +-
 gcc/doc/invoke.texi  | 4 ++--
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-cores.def 
b/gcc/config/aarch64/aarch64-cores.def
index fdda0697b88..bec08ca1910 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -182,6 +182,8 @@ AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, 
(I8MM, BF16, SVE2_BITPER
 AARCH64_CORE("cobalt-100",   cobalt100, cortexa57, V9A, (I8MM, BF16, 
SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x6d, 0xd49, -1)
 
 AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, 
SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
+AARCH64_CORE("grace", grace, cortexa57, V9A, (I8MM, BF16, CRYPTO, 
SVE2_BITPERM, SVE2_AES, SVE2_SHA3, SVE2_SM4, PROFILE), neoversev2, 0x41, 0xd4f, 
-1)
+
 AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
 
 #undef AARCH64_CORE
diff --git a/gcc/config/aarch64/aarch64-tune.md 
b/gcc/config/aarch64/aarch64-tune.md
index 9d46d38a292..6eae8522593 100644
--- a/gcc/config/aarch64/aarch64-tune.md
+++ b/gcc/config/aarch64/aarch64-tune.md
@@ -1,5 +1,5 @@
 ;; -*- buffer-read-only: t -*-
 ;; Generated automatically by gentune.sh from aarch64-cores.def
 (define_attr "tune"
-   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa710,cortexa715,cortexx2,cortexx3,neoversen2,cobalt100,neoversev2,demeter"
+   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa710,cortexa715,cortexx2,cortexx3,neoversen2,cobalt100,neoversev2,grace,demeter"
(const (symbol_ref "((enum attr_tune) aarch64_tune)")))
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 914c4bc8e6d..b17d0cf9341 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -20315,8 +20315,8 @@ performance of the code.  Permissible values for this 
option are:
 @samp{cortex-a78}, @samp{cortex-a78ae}, @samp{cortex-a78c},
 @samp{ares}, @samp{exynos-m1}, @samp{emag}, @samp{falkor},
 @samp{neoverse-512tvb}, @samp{neoverse-e1}, @samp{neoverse-n1},
-@samp{neoverse-n2}, @samp{neoverse-v1}, @samp{neoverse-v2}, @samp{qdf24xx},
-@samp{saphira}, @samp{phecda}, @samp{xgene1}, @samp{vulcan},
+@samp{neoverse-n2}, @samp{neoverse-v1}, @samp{neoverse-v2}, @samp{grace},
+@samp{qdf24xx}, @samp{saphira}, @samp{phecda}, @samp{xgene1}, @samp{vulcan},
 @samp{octeontx}, @samp{octeontx81},  @samp{octeontx83},
 @samp{octeontx2}, @samp{octeontx2t98}, @samp{octeontx2t96}
 @samp{octeontx2t93}, @samp{octeontx2f95}, @samp{octeontx2f95n},


[gcc r12-10584] Add support for -mcpu=grace

2024-06-27 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:25cb13649b1765a21f21907f2d7a0aa2135accb5

commit r12-10584-g25cb13649b1765a21f21907f2d7a0aa2135accb5
Author: Kyrylo Tkachov 
Date:   Wed Jun 19 14:56:02 2024 +0530

Add support for -mcpu=grace

This adds support for the NVIDIA Grace CPU to aarch64.
We reuse the tuning decisions for the Neoverse V2 core, but include a
number of architecture features that are not enabled by default in
-mcpu=neoverse-v2.

This allows Grace users to more simply target the CPU with -mcpu=grace
rather than remembering what extensions to tag on top of
-mcpu=neoverse-v2.

Bootstrapped and tested on aarch64-none-linux-gnu.
gcc/

* config/aarch64/aarch64-cores.def (grace): New entry.
* config/aarch64/aarch64-tune.md: Regenerate.
* doc/invoke.texi (AArch64 Options): Document the above.

Signed-off-by: Kyrylo Tkachov 

Diff:
---
 gcc/config/aarch64/aarch64-cores.def | 1 +
 gcc/config/aarch64/aarch64-tune.md   | 2 +-
 gcc/doc/invoke.texi  | 4 ++--
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-cores.def 
b/gcc/config/aarch64/aarch64-cores.def
index 956afa70714..6532bdaafb5 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -176,5 +176,6 @@ AARCH64_CORE("cobalt-100",   cobalt100, cortexa57, 9A, 
AARCH64_FL_FOR_ARCH9 | AA
 
 AARCH64_CORE("demeter", demeter, cortexa57, 9A, AARCH64_FL_FOR_ARCH9 | 
AARCH64_FL_I8MM | AARCH64_FL_BF16 | AARCH64_FL_SVE2_BITPERM | AARCH64_FL_RNG | 
AARCH64_FL_MEMTAG | AARCH64_FL_PROFILE, neoversev2, 0x41, 0xd4f, -1)
 AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, 9A, AARCH64_FL_FOR_ARCH9 | 
AARCH64_FL_I8MM | AARCH64_FL_BF16 | AARCH64_FL_SVE2_BITPERM | AARCH64_FL_RNG | 
AARCH64_FL_MEMTAG | AARCH64_FL_PROFILE, neoversev2, 0x41, 0xd4f, -1)
+AARCH64_CORE("grace", grace, cortexa57, 9A, AARCH64_FL_FOR_ARCH9 | 
AARCH64_FL_I8MM | AARCH64_FL_BF16 | AARCH64_FL_SVE2_BITPERM | AARCH64_FL_CRYPTO 
| AARCH64_FL_SHA3 | AARCH64_FL_SM4 | AARCH64_FL_SVE2_AES | AARCH64_FL_SVE2_SHA3 
| AARCH64_FL_SVE2_SM4 | AARCH64_FL_PROFILE, neoversev2, 0x41, 0xd4f, -1)
 
 #undef AARCH64_CORE
diff --git a/gcc/config/aarch64/aarch64-tune.md 
b/gcc/config/aarch64/aarch64-tune.md
index 2c1852c8fe6..0c139e3e729 100644
--- a/gcc/config/aarch64/aarch64-tune.md
+++ b/gcc/config/aarch64/aarch64-tune.md
@@ -1,5 +1,5 @@
 ;; -*- buffer-read-only: t -*-
 ;; Generated automatically by gentune.sh from aarch64-cores.def
 (define_attr "tune"
-   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa710,cortexx2,neoversen2,cobalt100,demeter,neoversev2"
+   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa710,cortexx2,neoversen2,cobalt100,demeter,neoversev2,grace"
(const (symbol_ref "((enum attr_tune) aarch64_tune)")))
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index c83f667260e..fbfa3241e7f 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -19203,8 +19203,8 @@ performance of the code.  Permissible values for this 
option are:
 @samp{cortex-a78}, @samp{cortex-a78ae}, @samp{cortex-a78c},
 @samp{ares}, @samp{exynos-m1}, @samp{emag}, @samp{falkor},
 @samp{neoverse-512tvb}, @samp{neoverse-e1}, @samp{neoverse-n1},
-@samp{neoverse-n2}, @samp{neoverse-v1}, @samp{neoverse-v2}, @samp{qdf24xx},
-@samp{saphira}, @samp{phecda}, @samp{xgene1}, @samp{vulcan},
+@samp{neoverse-n2}, @samp{neoverse-v1}, @samp{neoverse-v2}, @samp{grace},
+@samp{qdf24xx}, @samp{saphira}, @samp{phecda}, @samp{xgene1}, @samp{vulcan},
 @samp{octeontx}, @samp{octeontx81},  @samp{octeontx83},
 @samp{octeontx2}, @samp{octeontx2t98}, @samp{octeontx2t96}
 @samp{octeontx2t93}, @samp{octeont

[gcc r11-11540] Add support for -mcpu=grace

2024-06-27 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:bb943609534fcbd984d39a9a7efef12fa2667ac6

commit r11-11540-gbb943609534fcbd984d39a9a7efef12fa2667ac6
Author: Kyrylo Tkachov 
Date:   Wed Jun 19 14:56:02 2024 +0530

Add support for -mcpu=grace

This adds support for the NVIDIA Grace CPU to aarch64.
We reuse the tuning decisions for the Neoverse V2 core, but include a
number of architecture features that are not enabled by default in
-mcpu=neoverse-v2.

This allows Grace users to more simply target the CPU with -mcpu=grace
rather than remembering what extensions to tag on top of
-mcpu=neoverse-v2.

Bootstrapped and tested on aarch64-none-linux-gnu.
gcc/

* config/aarch64/aarch64-cores.def (grace): New entry.
* config/aarch64/aarch64-tune.md: Regenerate.
* doc/invoke.texi (AArch64 Options): Document the above.

Signed-off-by: Kyrylo Tkachov 

Diff:
---
 gcc/config/aarch64/aarch64-cores.def | 1 +
 gcc/config/aarch64/aarch64-tune.md   | 2 +-
 gcc/doc/invoke.texi  | 4 ++--
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-cores.def 
b/gcc/config/aarch64/aarch64-cores.def
index 5599cde700f..0243e3d4d1c 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -150,6 +150,7 @@ AARCH64_CORE("saphira", saphira,saphira,8_4A,  
AARCH64_FL_FOR_ARCH8_
 AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, 8_5A, 
AARCH64_FL_FOR_ARCH8_5 | AARCH64_FL_I8MM | AARCH64_FL_BF16 | AARCH64_FL_F16 | 
AARCH64_FL_SVE | AARCH64_FL_SVE2 | AARCH64_FL_SVE2_BITPERM | AARCH64_FL_RNG | 
AARCH64_FL_MEMTAG, neoversen2, 0x41, 0xd49, -1)
 AARCH64_CORE("cobalt-100",   cobalt100, cortexa57, 8_5A, 
AARCH64_FL_FOR_ARCH8_5 | AARCH64_FL_I8MM | AARCH64_FL_BF16 | AARCH64_FL_F16 | 
AARCH64_FL_SVE | AARCH64_FL_SVE2 | AARCH64_FL_SVE2_BITPERM | AARCH64_FL_RNG | 
AARCH64_FL_MEMTAG, neoversen2, 0x6d, 0xd49, -1)
 AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, 8_5A, 
AARCH64_FL_FOR_ARCH8_5 | AARCH64_FL_I8MM | AARCH64_FL_BF16 | AARCH64_FL_F16 | 
AARCH64_FL_SVE | AARCH64_FL_SVE2 | AARCH64_FL_SVE2_BITPERM | AARCH64_FL_RNG | 
AARCH64_FL_MEMTAG, neoverse512tvb, 0x41, 0xd4f, -1)
+AARCH64_CORE("grace", grace, cortexa57, 8_5A, AARCH64_FL_FOR_ARCH8_5 | 
AARCH64_FL_I8MM | AARCH64_FL_BF16 | AARCH64_FL_F16 | AARCH64_FL_CRYPTO | 
AARCH64_FL_SHA3 | AARCH64_FL_SM4 | AARCH64_FL_SVE | AARCH64_FL_SVE2 | 
AARCH64_FL_SVE2_BITPERM | AARCH64_FL_SVE2_AES | AARCH64_FL_SVE2_SM4 | 
AARCH64_FL_SVE2_SHA3, neoverse512tvb, 0x41, 0xd4f, -1)
 
 /* ARMv8-A big.LITTLE implementations.  */
 
diff --git a/gcc/config/aarch64/aarch64-tune.md 
b/gcc/config/aarch64/aarch64-tune.md
index 8953f1c0332..f233a7cce6c 100644
--- a/gcc/config/aarch64/aarch64-tune.md
+++ b/gcc/config/aarch64/aarch64-tune.md
@@ -1,5 +1,5 @@
 ;; -*- buffer-read-only: t -*-
 ;; Generated automatically by gentune.sh from aarch64-cores.def
 (define_attr "tune"
-   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,neoversen2,cobalt100,neoversev2,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82"
+   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,neoversen2,cobalt100,neoversev2,grace,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82"
(const (symbol_ref "((enum attr_tune) aarch64_tune)")))
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 1ae94fb3677..ef331d72beb 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -18233,8 +18233,8 @@ performance of the code.  Permissible values for this 
option are:
 @samp{cortex-a78}, @samp{cortex-a78ae}, @samp{cortex-a78c},
 @samp{ares}, @samp{exynos-m1}, @samp{emag}, @samp{falkor},
 @samp{neoverse-512tvb}, @samp{neoverse-e1}, @samp{neoverse-n1},
-@samp{neoverse-n2}, @samp{neoverse-v1},@samp{neoverse-v2}, @samp{qdf24xx},
-@samp{saphira}, @samp{phecda}, @sa

[gcc r13-8873] aarch64: Fix +nocrypto handling

2024-06-27 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:c93a9bba743ac236f6045ba7aafbc12a83726c48

commit r13-8873-gc93a9bba743ac236f6045ba7aafbc12a83726c48
Author: Andrew Carlotti 
Date:   Fri Nov 24 17:06:07 2023 +

aarch64: Fix +nocrypto handling

Additionally, replace all checks for the AARCH64_FL_CRYPTO bit with
checks for (AARCH64_FL_AES | AARCH64_FL_SHA2) instead.  The value of the
AARCH64_FL_CRYPTO bit within isa_flags is now ignored, but it is
retained because removing it would make processing the data in
option-extensions.def significantly more complex.

This bug should have been picked up by an existing test, but a missing
newline meant that the pattern incorrectly allowed "+crypto+nocrypto".

gcc/ChangeLog:

PR target/115618
* common/config/aarch64/aarch64-common.cc
(aarch64_get_extension_string_for_isa_flags): Fix generation of
the "+nocrypto" extension.
* config/aarch64/aarch64.h (AARCH64_ISA_CRYPTO): Remove.
(TARGET_CRYPTO): Remove.
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins):
Don't use TARGET_CRYPTO.

gcc/testsuite/ChangeLog:

PR target/115618
* gcc.target/aarch64/options_set_4.c: Add terminating newline.
* gcc.target/aarch64/options_set_27.c: New test.

(cherry picked from commit 8d30107455f2309854ced3d65fb07dc1f2c357c0)

Diff:
---
 gcc/common/config/aarch64/aarch64-common.cc   | 35 +--
 gcc/config/aarch64/aarch64-c.cc   |  2 +-
 gcc/config/aarch64/aarch64.h  | 10 +++
 gcc/testsuite/gcc.target/aarch64/options_set_27.c |  9 ++
 gcc/testsuite/gcc.target/aarch64/options_set_4.c  |  2 +-
 5 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/gcc/common/config/aarch64/aarch64-common.cc 
b/gcc/common/config/aarch64/aarch64-common.cc
index 20bc4e1291b..673407ca9a8 100644
--- a/gcc/common/config/aarch64/aarch64-common.cc
+++ b/gcc/common/config/aarch64/aarch64-common.cc
@@ -310,6 +310,7 @@ aarch64_get_extension_string_for_isa_flags
  But in order to make the output more readable, it seems better
  to add the strings in definition order.  */
   aarch64_feature_flags added = 0;
+  auto flags_crypto = AARCH64_FL_AES | AARCH64_FL_SHA2;
   for (unsigned int i = ARRAY_SIZE (all_extensions); i-- > 0; )
 {
   auto &opt = all_extensions[i];
@@ -319,7 +320,7 @@ aarch64_get_extension_string_for_isa_flags
 per-feature crypto flags.  */
   auto flags = opt.flag_canonical;
   if (flags == AARCH64_FL_CRYPTO)
-   flags = AARCH64_FL_AES | AARCH64_FL_SHA2;
+   flags = flags_crypto;
 
   if ((flags & isa_flags & (explicit_flags | ~current_flags)) == flags)
{
@@ -338,14 +339,32 @@ aarch64_get_extension_string_for_isa_flags
  not have an HWCAPs then it shouldn't be taken into account for feature
  detection because one way or another we can't tell if it's available
  or not.  */
+
   for (auto &opt : all_extensions)
-if (opt.native_detect_p
-   && (opt.flag_canonical & current_flags & ~isa_flags))
-  {
-   current_flags &= ~opt.flags_off;
-   outstr += "+no";
-   outstr += opt.name;
-  }
+{
+  auto flags = opt.flag_canonical;
+  /* As a special case, don't emit "+noaes" or "+nosha2" when we could emit
+"+nocrypto" instead, in order to support assemblers that predate the
+separate per-feature crypto flags.  Only allow "+nocrypto" when "sm4"
+is not already enabled (to avoid dependending on whether "+nocrypto"
+also disables "sm4").  */
+  if (flags & flags_crypto
+ && (flags_crypto & current_flags & ~isa_flags) == flags_crypto
+ && !(current_flags & AARCH64_FL_SM4))
+ continue;
+
+  if (flags == AARCH64_FL_CRYPTO)
+   /* If either crypto flag needs removing here, then both do.  */
+   flags = flags_crypto;
+
+  if (opt.native_detect_p
+ && (flags & current_flags & ~isa_flags))
+   {
+ current_flags &= ~opt.flags_off;
+ outstr += "+no";
+ outstr += opt.name;
+   }
+}
 
   return outstr;
 }
diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
index 578ec6f45b0..6c5331a7625 100644
--- a/gcc/config/aarch64/aarch64-c.cc
+++ b/gcc/config/aarch64/aarch64-c.cc
@@ -139,7 +139,7 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
   aarch64_def_or_undef (TARGET_ILP32, "_ILP32", pfile);
   aarch64_def_or_undef (TARGET_ILP32, "__ILP32__", pfile);
 
-  aarch64_def_or_undef (TARGET_CRYPTO, "__ARM_FEATURE_CRYPTO", pfile);
+  aarch64_def_or_undef (TARGET_AES && TARGET_SHA2, "__ARM_FEATURE_CRYPTO", 
pfile);
   aarch64_def_or_undef (TARGET_SIMD_RDMA, "__ARM_FEATURE_QRDMX", pfile);
   aarch64_def_or_undef (TARGET_SVE, "__ARM_FEATURE_SVE", pfile);
   cpp_undef (pfile, "__ARM_FEATURE_SVE_BITS");
diff --git a/gcc/config/aarch64/aarch

[gcc r15-1813] aarch64: PR target/115475 Implement missing __ARM_FEATURE_SVE_BF16 macro

2024-07-03 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:6492c7130d6ae9992298fc3d072e2589d1131376

commit r15-1813-g6492c7130d6ae9992298fc3d072e2589d1131376
Author: Kyrylo Tkachov 
Date:   Fri Jun 28 13:22:37 2024 +0530

aarch64: PR target/115475 Implement missing __ARM_FEATURE_SVE_BF16 macro

The ACLE requires __ARM_FEATURE_SVE_BF16 to be enabled when SVE and BF16
and the associated intrinsics are available.
GCC does support the required intrinsics for TARGET_SVE_BF16 so define
this macro too.

Bootstrapped and tested on aarch64-none-linux-gnu.
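A minimal sketch of the guard that user code can now rely on (assumed
usage, not the committed bf16_sve_feature.c test; dup_bf16 is a made-up
function and SVE plus BF16 must be enabled when compiling):

#if defined (__ARM_FEATURE_SVE_BF16)
#include <arm_sve.h>
/* Broadcast a scalar bfloat16 value across an SVE vector; only built
   when the compiler advertises SVE BF16 support.  */
svbfloat16_t
dup_bf16 (bfloat16_t x)
{
  return svdup_n_bf16 (x);
}
#endif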

gcc/

PR target/115475
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins):
Define __ARM_FEATURE_SVE_BF16 for TARGET_SVE_BF16.

gcc/testsuite/

PR target/115475
* gcc.target/aarch64/acle/bf16_sve_feature.c: New test.

Signed-off-by: Kyrylo Tkachov 

Diff:
---
 gcc/config/aarch64/aarch64-c.cc  |  3 +++
 gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c | 10 ++
 2 files changed, 13 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
index f5d70339e4e..2aff097dd33 100644
--- a/gcc/config/aarch64/aarch64-c.cc
+++ b/gcc/config/aarch64/aarch64-c.cc
@@ -254,6 +254,9 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
"__ARM_FEATURE_BF16_SCALAR_ARITHMETIC", pfile);
   aarch64_def_or_undef (TARGET_BF16_FP,
"__ARM_FEATURE_BF16", pfile);
+  aarch64_def_or_undef (TARGET_SVE_BF16,
+   "__ARM_FEATURE_SVE_BF16", pfile);
+
   aarch64_def_or_undef (TARGET_LS64,
"__ARM_FEATURE_LS64", pfile);
   aarch64_def_or_undef (AARCH64_ISA_RCPC, "__ARM_FEATURE_RCPC", pfile);
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c 
b/gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c
new file mode 100644
index 000..cb3ddac71a3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+
+#pragma GCC target "+sve+bf16"
+#ifndef __ARM_FEATURE_SVE_BF16
+#error "__ARM_FEATURE_SVE_BF16 is not defined but should be!"
+#endif
+
+void
+foo (void) {}
+


[gcc r15-1812] aarch64: PR target/115457 Implement missing __ARM_FEATURE_BF16 macro

2024-07-03 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:c10942134fa759843ac1ed1424b86fcb8e6368ba

commit r15-1812-gc10942134fa759843ac1ed1424b86fcb8e6368ba
Author: Kyrylo Tkachov 
Date:   Thu Jun 27 16:10:41 2024 +0530

aarch64: PR target/115457 Implement missing __ARM_FEATURE_BF16 macro

The ACLE asks the user to test for __ARM_FEATURE_BF16 before using the
<arm_bf16.h> header but GCC doesn't set this up.
LLVM does, so this is an inconsistency between the compilers.

This patch enables that macro for TARGET_BF16_FP.
Bootstrapped and tested on aarch64-none-linux-gnu.
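A minimal sketch of the ACLE-style guard this macro enables (assumed
usage, not the committed bf16_feature.c test; first_bf16 is a made-up
function):

#ifdef __ARM_FEATURE_BF16
#include <arm_bf16.h>
/* Only touch the bfloat16 scalar type when the feature macro says the
   header and type are available.  */
bfloat16_t
first_bf16 (const bfloat16_t *p)
{
  return p[0];
}
#endif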

gcc/

PR target/115457
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins):
Define __ARM_FEATURE_BF16 for TARGET_BF16_FP.

gcc/testsuite/

PR target/115457
* gcc.target/aarch64/acle/bf16_feature.c: New test.

Signed-off-by: Kyrylo Tkachov 

Diff:
---
 gcc/config/aarch64/aarch64-c.cc  |  2 ++
 gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c | 10 ++
 2 files changed, 12 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
index d042e5fbd8c..f5d70339e4e 100644
--- a/gcc/config/aarch64/aarch64-c.cc
+++ b/gcc/config/aarch64/aarch64-c.cc
@@ -252,6 +252,8 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
"__ARM_FEATURE_BF16_VECTOR_ARITHMETIC", pfile);
   aarch64_def_or_undef (TARGET_BF16_FP,
"__ARM_FEATURE_BF16_SCALAR_ARITHMETIC", pfile);
+  aarch64_def_or_undef (TARGET_BF16_FP,
+   "__ARM_FEATURE_BF16", pfile);
   aarch64_def_or_undef (TARGET_LS64,
"__ARM_FEATURE_LS64", pfile);
   aarch64_def_or_undef (AARCH64_ISA_RCPC, "__ARM_FEATURE_RCPC", pfile);
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c 
b/gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c
new file mode 100644
index 000..96584b4b988
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+
+#pragma GCC target "+bf16"
+#ifndef __ARM_FEATURE_BF16
+#error "__ARM_FEATURE_BF16 is not defined but should be!"
+#endif
+
+void
+foo (void) {}
+


[gcc r15-1817] [PATCH] match.pd: Fold x/sqrt(x) to sqrt(x)

2024-07-03 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:8dc5ad3ce8d4d2cd6cc2b7516d282395502fdf7d

commit r15-1817-g8dc5ad3ce8d4d2cd6cc2b7516d282395502fdf7d
Author: Jennifer Schmitz 
Date:   Wed Jul 3 14:40:42 2024 +0200

[PATCH] match.pd: Fold x/sqrt(x) to sqrt(x)

This patch adds a pattern in match.pd folding x/sqrt(x) to sqrt(x) for
-funsafe-math-optimizations (the identity holds for positive x but can change
the result at x = 0, hence the unsafe-math guard). Test cases were added for
double, float, and long double.

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
Ok for mainline?

Signed-off-by: Jennifer Schmitz 

gcc/

* match.pd: Fold x/sqrt(x) to sqrt(x).

gcc/testsuite/

* gcc.dg/tree-ssa/sqrt_div.c: New test.

Diff:
---
 gcc/match.pd |  4 
 gcc/testsuite/gcc.dg/tree-ssa/sqrt_div.c | 23 +++
 2 files changed, 27 insertions(+)

diff --git a/gcc/match.pd b/gcc/match.pd
index 7fff7b5f9fe..a2e205b3207 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -7770,6 +7770,10 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  when the operand has that value.)  */
 
 (if (flag_unsafe_math_optimizations)
+ /* Simplify x / sqrt(x) -> sqrt(x).  */
+ (simplify
+  (rdiv @0 (SQRT @0)) (SQRT @0))
+
  /* Simplify sqrt(x) * sqrt(x) -> x.  */
  (simplify
   (mult (SQRT_ALL@1 @0) @1)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sqrt_div.c 
b/gcc/testsuite/gcc.dg/tree-ssa/sqrt_div.c
new file mode 100644
index 000..2ae481b7982
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/sqrt_div.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ffast-math -fdump-tree-forwprop-details" } */
+/* { dg-require-effective-target c99_runtime } */
+
+#define T(n, type, fname)  \
+type f##n (type x) \
+{  \
+  type t1 = __builtin_##fname (x); \
+  type t2 = x / t1;\
+  return t2;   \
+}   
+
+T(1, double, sqrt)
+
+/* { dg-final { scan-tree-dump "gimple_simplified to t2_\[0-9\]+ = 
__builtin_sqrt .x_\[0-9\]*.D.." "forwprop1" } } */
+
+T(2, float, sqrtf )
+
+/* { dg-final { scan-tree-dump "gimple_simplified to t2_\[0-9\]+ = 
__builtin_sqrtf .x_\[0-9\]*.D.." "forwprop1" } } */
+
+T(3, long double, sqrtl)
+
+/* { dg-final { scan-tree-dump "gimple_simplified to t2_\[0-9\]+ = 
__builtin_sqrtl .x_\[0-9\]*.D.." "forwprop1" } } */


[gcc r15-1838] Aarch64: Add test for non-commutative SIMD intrinsic

2024-07-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:14c6793885c11c892ac90d5046979ab20de1b0b1

commit r15-1838-g14c6793885c11c892ac90d5046979ab20de1b0b1
Author: Alfie Richards 
Date:   Thu Jul 4 09:07:57 2024 +0200

Aarch64: Add test for non-commutative SIMD intrinsic

This adds a test for non-commutative SIMD NEON intrinsics.
Specifically addp is non-commutative and has a bug in the current 
big-endian implementation.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vector_intrinsics_asm.c: New test.

Diff:
---
 .../gcc.target/aarch64/vector_intrinsics_asm.c | 371 +
 1 file changed, 371 insertions(+)

diff --git a/gcc/testsuite/gcc.target/aarch64/vector_intrinsics_asm.c 
b/gcc/testsuite/gcc.target/aarch64/vector_intrinsics_asm.c
new file mode 100644
index 000..b7d5620abab
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vector_intrinsics_asm.c
@@ -0,0 +1,371 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { check-function-bodies "**" "" "" { xfail be } } } */
+
+#include "arm_neon.h"
+
+// SIGNED VADD INTRINSICS
+
+/*
+**test_vadd_s8:
+** addp v0\.8b, v0\.8b, v1\.8b
+** ret
+*/
+int8x8_t test_vadd_s8(int8x8_t v1, int8x8_t v2) {
+ int8x8_t v3 = vpadd_s8(v1, v2);
+ return v3;
+}
+
+/*
+**test_vadd_s16:
+**addp v0\.4h, v0\.4h, v1\.4h
+**ret
+*/
+int16x4_t test_vadd_s16(int16x4_t v1, int16x4_t v2) {
+ int16x4_t v3 = vpadd_s16(v1, v2);
+ return v3;
+}
+
+/*
+**test_vadd_s32:
+** addp v0\.2s, v0\.2s, v1\.2s
+** ret
+*/
+int32x2_t test_vadd_s32(int32x2_t v1, int32x2_t v2) {
+ int32x2_t v3 = vpadd_s32(v1, v2);
+ return v3;
+}
+
+/*
+**test_vaddq_s8:
+**...
+** addp v0\.16b, v0\.16b, v1\.16b
+** ret
+*/
+int8x16_t test_vaddq_s8(int8x16_t v1, int8x16_t v2) {
+ int8x16_t v3 = vpaddq_s8(v1, v2);
+ return v3;
+}
+
+/*
+**test_vaddq_s16:
+**...
+** addp v0\.8h, v0\.8h, v1\.8h
+** ret
+*/
+int16x8_t test_vaddq_s16(int16x8_t v1, int16x8_t v2) {
+ int16x8_t v3 = vpaddq_s16(v1, v2);
+ return v3;
+}
+
+/*
+**test_vaddq_s32:
+**...
+** addp v0\.4s, v0\.4s, v1\.4s
+** ret
+*/
+int32x4_t test_vaddq_s32(int32x4_t v1, int32x4_t v2) {
+ int32x4_t v3 = vpaddq_s32(v1, v2);
+ return v3;
+}
+
+/*
+**test_vaddq_s64:
+**...
+** addp v0\.2d, v0\.2d, v1\.2d
+** ret
+*/
+int64x2_t test_vaddq_s64(int64x2_t v1, int64x2_t v2) {
+ int64x2_t v3 = vpaddq_s64(v1, v2);
+ return v3;
+}
+
+/*
+**test_vaddd_s64:
+**...
+** addp (d[0-9]+), v0\.2d
+** fmov x0, \1
+** ret
+*/
+int64_t test_vaddd_s64(int64x2_t v1) {
+ int64_t v2 = vpaddd_s64(v1);
+ return v2;
+}
+
+/*
+**test_vaddl_s8:
+**...
+** saddlp  v0\.4h, v0\.8b
+** ret
+*/
+int16x4_t test_vaddl_s8(int8x8_t v1) {
+ int16x4_t v2 = vpaddl_s8(v1);
+ return v2;
+}
+
+/*
+**test_vaddlq_s8:
+**...
+** saddlp  v0\.8h, v0\.16b
+** ret
+*/
+int16x8_t test_vaddlq_s8(int8x16_t v1) {
+ int16x8_t v2 = vpaddlq_s8(v1);
+ return v2;
+}
+/*
+**test_vaddl_s16:
+**...
+** saddlp  v0\.2s, v0\.4h
+** ret
+*/
+int32x2_t test_vaddl_s16(int16x4_t v1) {
+ int32x2_t v2 = vpaddl_s16(v1);
+ return v2;
+}
+
+/*
+**test_vaddlq_s16:
+**...
+** saddlp  v0\.4s, v0\.8h
+** ret
+*/
+int32x4_t test_vaddlq_s16(int16x8_t v1) {
+ int32x4_t v2 = vpaddlq_s16(v1);
+ return v2;
+}
+
+/*
+**test_vaddl_s32:
+**...
+** saddlp  v0\.1d, v0\.2s
+** ret
+*/
+int64x1_t test_vaddl_s32(int32x2_t v1) {
+ int64x1_t v2 = vpaddl_s32(v1);
+ return v2;
+}
+
+/*
+**test_vaddlq_s32:
+**...
+** saddlp  v0\.2d, v0\.4s
+** ret
+*/
+int64x2_t test_vaddlq_s32(int32x4_t v1) {
+ int64x2_t v2 = vpaddlq_s32(v1);
+ return v2;
+}
+
+// UNSIGNED VADD INTRINSICS
+
+/*
+**test_vadd_u8:
+**...
+** addp v0\.8b, v0\.8b, v1\.8b
+** ret
+*/
+uint8x8_t test_vadd_u8(uint8x8_t v1, uint8x8_t v2) {
+ uint8x8_t v3 = vpadd_u8(v1, v2);
+ return v3;
+}
+
+/*
+**test_vadd_u16:
+**...
+** addp v0\.4h, v0\.4h, v1\.4h
+** ret
+*/
+uint16x4_t test_vadd_u16(uint16x4_t v1, uint16x4_t v2) {
+ uint16x4_t v3 = vpadd_u16(v1, v2);
+ return v3;
+}
+
+/*
+**test_vadd_u32:
+**...
+** addp v0\.2s, v0\.2s, v1\.2s
+** ret
+*/
+uint32x2_t test_vadd_u32(uint32x2_t v1, uint32x2_t v2) {
+ uint32x2_t v3 = vpadd_u32(v1, v2);
+ return v3;
+}
+
+/*
+**test_vaddq_u8:
+**...
+** addp v0\.16b, v0\.16b, v1\.16b
+** ret
+*/
+uint8x16_t test_vaddq_u8(uint8x16_t v1, uint8x16_t v2) {
+ uint8x16_t v3 = vpaddq_u8(v1, v2);
+ return v3;
+}
+
+/*
+**test_vaddq_u16:
+**...
+** addp v0\.8h, v0\.8h, v1\.8h
+** ret
+*/
+uint16x8_t test_vaddq_u16(uint16x8_t v1, uint16x8_t v2) {
+ uint16x8_t v3 = vpaddq_u16(v1, v2);
+ return v3;
+}
+
+/*
+**test_vaddq_u32:
+**...
+** addp v0\.4s, v0\.4s, v1\.4s
+** ret
+*/
+uint32x4_t test_vaddq_u32(uint32x4_t v1, uint32x4_t v2) {
+ uint32x4_t v3 = vpaddq_u32(v1, v2);
+ return v3;
+}
+
+/*
+**test_vaddq_u64:
+**...
+** addp v0\.2d, v0\.2d, v1\.2d
+** ret
+*/
+uint64x2_t test_vaddq_u64(uint64x2_t v1, uint64x2_t v2) {
+ uint6

[gcc r15-1839] Aarch64, bugfix: Fix NEON bigendian addp intrinsic [PR114890]

2024-07-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:11049cdf204bc96bc407e5dd44ed3b8a492f405a

commit r15-1839-g11049cdf204bc96bc407e5dd44ed3b8a492f405a
Author: Alfie Richards 
Date:   Thu Jul 4 09:09:19 2024 +0200

Aarch64, bugfix: Fix NEON bigendian addp intrinsic [PR114890]

This change removes code that switches the operands in bigendian mode 
erroneously.
This fixes the related test also.

gcc/ChangeLog:

PR target/114890
* config/aarch64/aarch64-simd.md: Remove bigendian operand swap.

gcc/testsuite/ChangeLog:

PR target/114890
* gcc.target/aarch64/vector_intrinsics_asm.c: Remove xfail.

Diff:
---
 gcc/config/aarch64/aarch64-simd.md   | 2 --
 gcc/testsuite/gcc.target/aarch64/vector_intrinsics_asm.c | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index fd0c5e612b5..fd10039f9a2 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -7379,8 +7379,6 @@
   nunits /= 2;
 rtx par_even = aarch64_gen_stepped_int_parallel (nunits, 0, 2);
 rtx par_odd = aarch64_gen_stepped_int_parallel (nunits, 1, 2);
-if (BYTES_BIG_ENDIAN)
-  std::swap (operands[1], operands[2]);
 emit_insn (gen_aarch64_addp_insn (operands[0], operands[1],
operands[2], par_even, par_odd));
 DONE;
diff --git a/gcc/testsuite/gcc.target/aarch64/vector_intrinsics_asm.c 
b/gcc/testsuite/gcc.target/aarch64/vector_intrinsics_asm.c
index b7d5620abab..e3dcd0830c8 100644
--- a/gcc/testsuite/gcc.target/aarch64/vector_intrinsics_asm.c
+++ b/gcc/testsuite/gcc.target/aarch64/vector_intrinsics_asm.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O2" } */
-/* { dg-final { check-function-bodies "**" "" "" { xfail be } } } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
 
 #include "arm_neon.h"


[gcc r14-10379] aarch64: PR target/115457 Implement missing __ARM_FEATURE_BF16 macro

2024-07-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:1a97c8ed42562ceabb00c9c516435541909c134b

commit r14-10379-g1a97c8ed42562ceabb00c9c516435541909c134b
Author: Kyrylo Tkachov 
Date:   Thu Jun 27 16:10:41 2024 +0530

aarch64: PR target/115457 Implement missing __ARM_FEATURE_BF16 macro

The ACLE asks the user to test for __ARM_FEATURE_BF16 before using the
<arm_bf16.h> header but GCC doesn't set this up.
LLVM does, so this is an inconsistency between the compilers.

This patch enables that macro for TARGET_BF16_FP.
Bootstrapped and tested on aarch64-none-linux-gnu.

gcc/

PR target/115457
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins):
Define __ARM_FEATURE_BF16 for TARGET_BF16_FP.

gcc/testsuite/

PR target/115457
* gcc.target/aarch64/acle/bf16_feature.c: New test.

Signed-off-by: Kyrylo Tkachov 
(cherry picked from commit c10942134fa759843ac1ed1424b86fcb8e6368ba)

Diff:
---
 gcc/config/aarch64/aarch64-c.cc  |  2 ++
 gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c | 10 ++
 2 files changed, 12 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
index d042e5fbd8c..f5d70339e4e 100644
--- a/gcc/config/aarch64/aarch64-c.cc
+++ b/gcc/config/aarch64/aarch64-c.cc
@@ -252,6 +252,8 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
"__ARM_FEATURE_BF16_VECTOR_ARITHMETIC", pfile);
   aarch64_def_or_undef (TARGET_BF16_FP,
"__ARM_FEATURE_BF16_SCALAR_ARITHMETIC", pfile);
+  aarch64_def_or_undef (TARGET_BF16_FP,
+   "__ARM_FEATURE_BF16", pfile);
   aarch64_def_or_undef (TARGET_LS64,
"__ARM_FEATURE_LS64", pfile);
   aarch64_def_or_undef (AARCH64_ISA_RCPC, "__ARM_FEATURE_RCPC", pfile);
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c 
b/gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c
new file mode 100644
index 000..96584b4b988
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+
+#pragma GCC target "+bf16"
+#ifndef __ARM_FEATURE_BF16
+#error "__ARM_FEATURE_BF16 is not defined but should be!"
+#endif
+
+void
+foo (void) {}
+


[gcc r14-10380] aarch64: PR target/115475 Implement missing __ARM_FEATURE_SVE_BF16 macro

2024-07-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:dc63b5dbe60da076f46cb3bcb10f0f84cfd7fb7d

commit r14-10380-gdc63b5dbe60da076f46cb3bcb10f0f84cfd7fb7d
Author: Kyrylo Tkachov 
Date:   Fri Jun 28 13:22:37 2024 +0530

aarch64: PR target/115475 Implement missing __ARM_FEATURE_SVE_BF16 macro

The ACLE requires __ARM_FEATURE_SVE_BF16 to be enabled when SVE and BF16
and the associated intrinsics are available.
GCC does support the required intrinsics for TARGET_SVE_BF16 so define
this macro too.

Bootstrapped and tested on aarch64-none-linux-gnu.

gcc/

PR target/115475
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins):
Define __ARM_FEATURE_SVE_BF16 for TARGET_SVE_BF16.

gcc/testsuite/

PR target/115475
* gcc.target/aarch64/acle/bf16_sve_feature.c: New test.

Signed-off-by: Kyrylo Tkachov 
(cherry picked from commit 6492c7130d6ae9992298fc3d072e2589d1131376)

Diff:
---
 gcc/config/aarch64/aarch64-c.cc  |  3 +++
 gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c | 10 ++
 2 files changed, 13 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
index f5d70339e4e..2aff097dd33 100644
--- a/gcc/config/aarch64/aarch64-c.cc
+++ b/gcc/config/aarch64/aarch64-c.cc
@@ -254,6 +254,9 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
"__ARM_FEATURE_BF16_SCALAR_ARITHMETIC", pfile);
   aarch64_def_or_undef (TARGET_BF16_FP,
"__ARM_FEATURE_BF16", pfile);
+  aarch64_def_or_undef (TARGET_SVE_BF16,
+   "__ARM_FEATURE_SVE_BF16", pfile);
+
   aarch64_def_or_undef (TARGET_LS64,
"__ARM_FEATURE_LS64", pfile);
   aarch64_def_or_undef (AARCH64_ISA_RCPC, "__ARM_FEATURE_RCPC", pfile);
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c 
b/gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c
new file mode 100644
index 000..cb3ddac71a3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+
+#pragma GCC target "+sve+bf16"
+#ifndef __ARM_FEATURE_SVE_BF16
+#error "__ARM_FEATURE_SVE_BF16 is not defined but should be!"
+#endif
+
+void
+foo (void) {}
+


[gcc r13-8890] aarch64: PR target/115457 Implement missing __ARM_FEATURE_BF16 macro

2024-07-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:7785289f8d1f6350a3f48232ce578009b0e23534

commit r13-8890-g7785289f8d1f6350a3f48232ce578009b0e23534
Author: Kyrylo Tkachov 
Date:   Thu Jun 27 16:10:41 2024 +0530

aarch64: PR target/115457 Implement missing __ARM_FEATURE_BF16 macro

The ACLE asks the user to test for __ARM_FEATURE_BF16 before using the
<arm_bf16.h> header but GCC doesn't set this up.
LLVM does, so this is an inconsistency between the compilers.

This patch enables that macro for TARGET_BF16_FP.
Bootstrapped and tested on aarch64-none-linux-gnu.

gcc/

PR target/115457
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins):
Define __ARM_FEATURE_BF16 for TARGET_BF16_FP.

gcc/testsuite/

PR target/115457
* gcc.target/aarch64/acle/bf16_feature.c: New test.

Signed-off-by: Kyrylo Tkachov 
(cherry picked from commit c10942134fa759843ac1ed1424b86fcb8e6368ba)

Diff:
---
 gcc/config/aarch64/aarch64-c.cc  |  2 ++
 gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c | 10 ++
 2 files changed, 12 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
index 6c5331a7625..51709d6044e 100644
--- a/gcc/config/aarch64/aarch64-c.cc
+++ b/gcc/config/aarch64/aarch64-c.cc
@@ -202,6 +202,8 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
"__ARM_FEATURE_BF16_VECTOR_ARITHMETIC", pfile);
   aarch64_def_or_undef (TARGET_BF16_FP,
"__ARM_FEATURE_BF16_SCALAR_ARITHMETIC", pfile);
+  aarch64_def_or_undef (TARGET_BF16_FP,
+   "__ARM_FEATURE_BF16", pfile);
   aarch64_def_or_undef (TARGET_LS64,
"__ARM_FEATURE_LS64", pfile);
   aarch64_def_or_undef (AARCH64_ISA_RCPC, "__ARM_FEATURE_RCPC", pfile);
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c 
b/gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c
new file mode 100644
index 000..96584b4b988
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+
+#pragma GCC target "+bf16"
+#ifndef __ARM_FEATURE_BF16
+#error "__ARM_FEATURE_BF16 is not defined but should be!"
+#endif
+
+void
+foo (void) {}
+


[gcc r13-8891] aarch64: PR target/115475 Implement missing __ARM_FEATURE_SVE_BF16 macro

2024-07-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:40d54856c1189ab6125d3eeb064df25082dd0e50

commit r13-8891-g40d54856c1189ab6125d3eeb064df25082dd0e50
Author: Kyrylo Tkachov 
Date:   Fri Jun 28 13:22:37 2024 +0530

aarch64: PR target/115475 Implement missing __ARM_FEATURE_SVE_BF16 macro

The ACLE requires __ARM_FEATURE_SVE_BF16 to be enabled when SVE and BF16
and the associated intrinsics are available.
GCC does support the required intrinsics for TARGET_SVE_BF16 so define
this macro too.

Bootstrapped and tested on aarch64-none-linux-gnu.
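
For illustration only (not part of the commit), a guarded use of the SVE BF16
intrinsics that this macro advertises; the choice of svbfdot_f32 and the
function name are assumptions for the sketch:

#pragma GCC target "+sve+bf16"
#include <arm_sve.h>

#ifdef __ARM_FEATURE_SVE_BF16
svfloat32_t
bf16_dot (svfloat32_t acc, svbfloat16_t a, svbfloat16_t b)
{
  /* BFDOT accumulates pairwise products of BF16 elements into float32 lanes.  */
  return svbfdot_f32 (acc, a, b);
}
#endif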

gcc/

PR target/115475
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins):
Define __ARM_FEATURE_SVE_BF16 for TARGET_SVE_BF16.

gcc/testsuite/

PR target/115475
* gcc.target/aarch64/acle/bf16_sve_feature.c: New test.

Signed-off-by: Kyrylo Tkachov 
(cherry picked from commit 6492c7130d6ae9992298fc3d072e2589d1131376)

Diff:
---
 gcc/config/aarch64/aarch64-c.cc  |  3 +++
 gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c | 10 ++
 2 files changed, 13 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
index 51709d6044e..6ddfcc7ce3e 100644
--- a/gcc/config/aarch64/aarch64-c.cc
+++ b/gcc/config/aarch64/aarch64-c.cc
@@ -204,6 +204,9 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
"__ARM_FEATURE_BF16_SCALAR_ARITHMETIC", pfile);
   aarch64_def_or_undef (TARGET_BF16_FP,
"__ARM_FEATURE_BF16", pfile);
+  aarch64_def_or_undef (TARGET_SVE_BF16,
+   "__ARM_FEATURE_SVE_BF16", pfile);
+
   aarch64_def_or_undef (TARGET_LS64,
"__ARM_FEATURE_LS64", pfile);
   aarch64_def_or_undef (AARCH64_ISA_RCPC, "__ARM_FEATURE_RCPC", pfile);
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c 
b/gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c
new file mode 100644
index 000..cb3ddac71a3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+
+#pragma GCC target "+sve+bf16"
+#ifndef __ARM_FEATURE_SVE_BF16
+#error "__ARM_FEATURE_SVE_BF16 is not defined but should be!"
+#endif
+
+void
+foo (void) {}
+


[gcc r12-10599] aarch64: PR target/115457 Implement missing __ARM_FEATURE_BF16 macro

2024-07-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:ebf561429ee4fbd125aa51ee985e32f1cfd4daed

commit r12-10599-gebf561429ee4fbd125aa51ee985e32f1cfd4daed
Author: Kyrylo Tkachov 
Date:   Thu Jun 27 16:10:41 2024 +0530

aarch64: PR target/115457 Implement missing __ARM_FEATURE_BF16 macro

The ACLE asks the user to test for __ARM_FEATURE_BF16 before using the
<arm_bf16.h> header, but GCC doesn't set this up.
LLVM does, so this is an inconsistency between the compilers.

This patch enables that macro for TARGET_BF16_FP.
Bootstrapped and tested on aarch64-none-linux-gnu.

gcc/

PR target/115457
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins):
Define __ARM_FEATURE_BF16 for TARGET_BF16_FP.

gcc/testsuite/

PR target/115457
* gcc.target/aarch64/acle/bf16_feature.c: New test.

Signed-off-by: Kyrylo Tkachov 
(cherry picked from commit c10942134fa759843ac1ed1424b86fcb8e6368ba)

Diff:
---
 gcc/config/aarch64/aarch64-c.cc  |  2 ++
 gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c | 10 ++
 2 files changed, 12 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
index a4c407724a7..b31b967c140 100644
--- a/gcc/config/aarch64/aarch64-c.cc
+++ b/gcc/config/aarch64/aarch64-c.cc
@@ -200,6 +200,8 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
"__ARM_FEATURE_BF16_VECTOR_ARITHMETIC", pfile);
   aarch64_def_or_undef (TARGET_BF16_FP,
"__ARM_FEATURE_BF16_SCALAR_ARITHMETIC", pfile);
+  aarch64_def_or_undef (TARGET_BF16_FP,
+   "__ARM_FEATURE_BF16", pfile);
   aarch64_def_or_undef (TARGET_LS64,
"__ARM_FEATURE_LS64", pfile);
   aarch64_def_or_undef (AARCH64_ISA_RCPC, "__ARM_FEATURE_RCPC", pfile);
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c 
b/gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c
new file mode 100644
index 000..96584b4b988
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+
+#pragma GCC target "+bf16"
+#ifndef __ARM_FEATURE_BF16
+#error "__ARM_FEATURE_BF16 is not defined but should be!"
+#endif
+
+void
+foo (void) {}
+


[gcc r12-10600] aarch64: PR target/115475 Implement missing __ARM_FEATURE_SVE_BF16 macro

2024-07-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:cdeb7ce83f71d1527626975e70d294ef55535d03

commit r12-10600-gcdeb7ce83f71d1527626975e70d294ef55535d03
Author: Kyrylo Tkachov 
Date:   Fri Jun 28 13:22:37 2024 +0530

aarch64: PR target/115475 Implement missing __ARM_FEATURE_SVE_BF16 macro

The ACLE requires __ARM_FEATURE_SVE_BF16 to be enabled when SVE and BF16
and the associated intrinsics are available.
GCC does support the required intrinsics for TARGET_SVE_BF16 so define
this macro too.

Bootstrapped and tested on aarch64-none-linux-gnu.

gcc/

PR target/115475
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins):
Define __ARM_FEATURE_SVE_BF16 for TARGET_SVE_BF16.

gcc/testsuite/

PR target/115475
* gcc.target/aarch64/acle/bf16_sve_feature.c: New test.

Signed-off-by: Kyrylo Tkachov 
(cherry picked from commit 6492c7130d6ae9992298fc3d072e2589d1131376)

Diff:
---
 gcc/config/aarch64/aarch64-c.cc  |  3 +++
 gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c | 10 ++
 2 files changed, 13 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
index b31b967c140..e024d410dc7 100644
--- a/gcc/config/aarch64/aarch64-c.cc
+++ b/gcc/config/aarch64/aarch64-c.cc
@@ -202,6 +202,9 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
"__ARM_FEATURE_BF16_SCALAR_ARITHMETIC", pfile);
   aarch64_def_or_undef (TARGET_BF16_FP,
"__ARM_FEATURE_BF16", pfile);
+  aarch64_def_or_undef (TARGET_SVE_BF16,
+   "__ARM_FEATURE_SVE_BF16", pfile);
+
   aarch64_def_or_undef (TARGET_LS64,
"__ARM_FEATURE_LS64", pfile);
   aarch64_def_or_undef (AARCH64_ISA_RCPC, "__ARM_FEATURE_RCPC", pfile);
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c 
b/gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c
new file mode 100644
index 000..cb3ddac71a3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+
+#pragma GCC target "+sve+bf16"
+#ifndef __ARM_FEATURE_SVE_BF16
+#error "__ARM_FEATURE_SVE_BF16 is not defined but should be!"
+#endif
+
+void
+foo (void) {}
+


[gcc r11-11564] aarch64: PR target/115457 Implement missing __ARM_FEATURE_BF16 macro

2024-07-09 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:d32cfe3352f3863325f8452e83400063b1e71e5b

commit r11-11564-gd32cfe3352f3863325f8452e83400063b1e71e5b
Author: Kyrylo Tkachov 
Date:   Thu Jun 27 16:10:41 2024 +0530

aarch64: PR target/115457 Implement missing __ARM_FEATURE_BF16 macro

The ACLE asks the user to test for __ARM_FEATURE_BF16 before using the
<arm_bf16.h> header, but GCC doesn't set this up.
LLVM does, so this is an inconsistency between the compilers.

This patch enables that macro for TARGET_BF16_FP.
Bootstrapped and tested on aarch64-none-linux-gnu.

gcc/

PR target/115457
* config/aarch64/aarch64-c.c (aarch64_update_cpp_builtins):
Define __ARM_FEATURE_BF16 for TARGET_BF16_FP.

gcc/testsuite/

PR target/115457
* gcc.target/aarch64/acle/bf16_feature.c: New test.

Signed-off-by: Kyrylo Tkachov 
(cherry picked from commit c10942134fa759843ac1ed1424b86fcb8e6368ba)

Diff:
---
 gcc/config/aarch64/aarch64-c.c   |  2 ++
 gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c | 10 ++
 2 files changed, 12 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-c.c b/gcc/config/aarch64/aarch64-c.c
index 05869463e4ba..f6d90affd374 100644
--- a/gcc/config/aarch64/aarch64-c.c
+++ b/gcc/config/aarch64/aarch64-c.c
@@ -200,6 +200,8 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
"__ARM_FEATURE_BF16_VECTOR_ARITHMETIC", pfile);
   aarch64_def_or_undef (TARGET_BF16_FP,
"__ARM_FEATURE_BF16_SCALAR_ARITHMETIC", pfile);
+  aarch64_def_or_undef (TARGET_BF16_FP,
+   "__ARM_FEATURE_BF16", pfile);
   aarch64_def_or_undef (AARCH64_ISA_RCPC, "__ARM_FEATURE_RCPC", pfile);
 
   /* Not for ACLE, but required to keep "float.h" correct if we switch
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c 
b/gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c
new file mode 100644
index ..96584b4b9887
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/acle/bf16_feature.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+
+#pragma GCC target "+bf16"
+#ifndef __ARM_FEATURE_BF16
+#error "__ARM_FEATURE_BF16 is not defined but should be!"
+#endif
+
+void
+foo (void) {}
+


[gcc r11-11565] aarch64: PR target/115475 Implement missing __ARM_FEATURE_SVE_BF16 macro

2024-07-09 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:ee69d6e1e3bed8c3799c29fad3299bfd2e14f64e

commit r11-11565-gee69d6e1e3bed8c3799c29fad3299bfd2e14f64e
Author: Kyrylo Tkachov 
Date:   Fri Jun 28 13:22:37 2024 +0530

aarch64: PR target/115475 Implement missing __ARM_FEATURE_SVE_BF16 macro

The ACLE requires __ARM_FEATURE_SVE_BF16 to be enabled when SVE and BF16
and the associated intrinsics are available.
GCC does support the required intrinsics for TARGET_SVE_BF16 so define
this macro too.

Bootstrapped and tested on aarch64-none-linux-gnu.

gcc/

PR target/115475
* config/aarch64/aarch64-c.c (aarch64_update_cpp_builtins):
Define __ARM_FEATURE_SVE_BF16 for TARGET_SVE_BF16.

gcc/testsuite/

PR target/115475
* gcc.target/aarch64/acle/bf16_sve_feature.c: New test.

Signed-off-by: Kyrylo Tkachov 
(cherry picked from commit 6492c7130d6ae9992298fc3d072e2589d1131376)

Diff:
---
 gcc/config/aarch64/aarch64-c.c   |  3 +++
 gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c | 10 ++
 2 files changed, 13 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-c.c b/gcc/config/aarch64/aarch64-c.c
index f6d90affd374..ba732e4d877c 100644
--- a/gcc/config/aarch64/aarch64-c.c
+++ b/gcc/config/aarch64/aarch64-c.c
@@ -202,6 +202,9 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
"__ARM_FEATURE_BF16_SCALAR_ARITHMETIC", pfile);
   aarch64_def_or_undef (TARGET_BF16_FP,
"__ARM_FEATURE_BF16", pfile);
+  aarch64_def_or_undef (TARGET_SVE_BF16,
+   "__ARM_FEATURE_SVE_BF16", pfile);
+
   aarch64_def_or_undef (AARCH64_ISA_RCPC, "__ARM_FEATURE_RCPC", pfile);
 
   /* Not for ACLE, but required to keep "float.h" correct if we switch
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c 
b/gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c
new file mode 100644
index ..cb3ddac71a32
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/acle/bf16_sve_feature.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+
+#pragma GCC target "+sve+bf16"
+#ifndef __ARM_FEATURE_SVE_BF16
+#error "__ARM_FEATURE_SVE_BF16 is not defined but should be!"
+#endif
+
+void
+foo (void) {}
+


[gcc r15-1935] testsuite: Tests the pattern folding x/sqrt(x) to sqrt(x) for Float16

2024-07-10 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:1ae5fc24e86ecc9e7b60346d9ca2e56f83517bda

commit r15-1935-g1ae5fc24e86ecc9e7b60346d9ca2e56f83517bda
Author: Jennifer Schmitz 
Date:   Wed Jul 10 12:54:01 2024 +0530

testsuite: Tests the pattern folding x/sqrt(x) to sqrt(x) for Float16

As a follow-up to adding a pattern that folds x/sqrt(x) to sqrt(x) in 
match.pd, this patch adds a test case for type Float16 for armv8.2-a+fp16.

The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
regression.

Signed-off-by: Jennifer Schmitz 

gcc/testsuite/

* gcc.target/aarch64/sqrt_div_float16.c: New test.

Diff:
---
 gcc/testsuite/gcc.target/aarch64/sqrt_div_float16.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/gcc/testsuite/gcc.target/aarch64/sqrt_div_float16.c 
b/gcc/testsuite/gcc.target/aarch64/sqrt_div_float16.c
new file mode 100644
index ..c4f297ef17ae
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sqrt_div_float16.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ffast-math -fdump-tree-forwprop-details" } */
+/* { dg-require-effective-target c99_runtime } */
+
+#pragma GCC target ("arch=armv8.2-a+fp16")
+
+_Float16 f (_Float16 x) 
+{
+  _Float16 t1 = __builtin_sqrt (x);
+  _Float16 t2 = x / t1;
+  return t2;
+}
+
+/* { dg-final { scan-tree-dump "gimple_simplified to t2_\[0-9\]+ = .SQRT 
.x_\[0-9\]*.D.." "forwprop1" } } */


[gcc r15-2128] [aarch64] Document rewriting of -march=native to -mcpu=native

2024-07-17 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:a2cb656c0d1f0b493219025208fa8ed5c7abd2cb

commit r15-2128-ga2cb656c0d1f0b493219025208fa8ed5c7abd2cb
Author: Kyrylo Tkachov 
Date:   Tue Jul 16 16:59:42 2024 +0530

[aarch64] Document rewriting of -march=native to -mcpu=native

Commit dd9e5f4db2debf1429feab7f785962ccef6e0dbd changed -march=native to
treat it as -mcpu=native if no other mcpu or mtune option was given.
It would make sense to document this, especially if we try to persuade
compilers like LLVM to take the same approach.
This patch documents that behaviour.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

gcc/ChangeLog:

* doc/invoke.texi (AArch64 Options): Document rewriting of
-march=native to -mcpu=native.

Diff:
---
 gcc/doc/invoke.texi | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 403ea9da1abd..f052128e2a5d 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -21498,7 +21498,10 @@ and the features that they enable by default:
 The value @samp{native} is available on native AArch64 GNU/Linux and
 causes the compiler to pick the architecture of the host system.  This
 option has no effect if the compiler is unable to recognize the
-architecture of the host system,
+architecture of the host system.  When @option{-march=native} is given and
+no other @option{-mcpu} or @option{-mtune} is given then GCC will pick
+the host CPU as the CPU to tune for as well as select the architecture features
+from.  That is, @option{-march=native} is treated as @option{-mcpu=native}.
 
 The permissible values for @var{feature} are listed in the sub-section
 on @ref{aarch64-feature-modifiers,,@option{-march} and @option{-mcpu}


[gcc r15-2253] aarch64: Fuse CMP+CSEL and CMP+CSET for -mcpu=neoverse-v2

2024-07-24 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:4c5eb66e701bc9f3bf1298269f52559b10d63a09

commit r15-2253-g4c5eb66e701bc9f3bf1298269f52559b10d63a09
Author: Jennifer Schmitz 
Date:   Mon Jul 22 23:24:45 2024 -0700

aarch64: Fuse CMP+CSEL and CMP+CSET for -mcpu=neoverse-v2

According to the Neoverse V2 Software Optimization Guide (section 4.14), the
instruction pairs CMP+CSEL and CMP+CSET can be fused, which had not been
implemented so far. This patch implements and tests the two fusion pairs.

The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
regression.
There was also no non-noise impact on SPEC CPU2017 benchmark.
OK for mainline?

Signed-off-by: Jennifer Schmitz 

gcc/

* config/aarch64/aarch64.cc (aarch_macro_fusion_pair_p): Implement
fusion logic.
* config/aarch64/aarch64-fusion-pairs.def (cmp+csel): New entry.
(cmp+cset): Likewise.
* config/aarch64/tuning_models/neoversev2.h: Enable logic in
field fusible_ops.

gcc/testsuite/

* gcc.target/aarch64/cmp_csel_fuse.c: New test.
* gcc.target/aarch64/cmp_cset_fuse.c: Likewise.

Diff:
---
 gcc/config/aarch64/aarch64-fusion-pairs.def  |  2 ++
 gcc/config/aarch64/aarch64.cc| 19 +
 gcc/config/aarch64/tuning_models/neoversev2.h|  5 +++-
 gcc/testsuite/gcc.target/aarch64/cmp_csel_fuse.c | 34 
 gcc/testsuite/gcc.target/aarch64/cmp_cset_fuse.c | 31 +
 5 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64-fusion-pairs.def 
b/gcc/config/aarch64/aarch64-fusion-pairs.def
index 9a43b0c80657..bf5e85ba8fe1 100644
--- a/gcc/config/aarch64/aarch64-fusion-pairs.def
+++ b/gcc/config/aarch64/aarch64-fusion-pairs.def
@@ -37,5 +37,7 @@ AARCH64_FUSION_PAIR ("aes+aesmc", AES_AESMC)
 AARCH64_FUSION_PAIR ("alu+branch", ALU_BRANCH)
 AARCH64_FUSION_PAIR ("alu+cbz", ALU_CBZ)
 AARCH64_FUSION_PAIR ("addsub_2reg_const1", ADDSUB_2REG_CONST1)
+AARCH64_FUSION_PAIR ("cmp+csel", CMP_CSEL)
+AARCH64_FUSION_PAIR ("cmp+cset", CMP_CSET)
 
 #undef AARCH64_FUSION_PAIR
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 9e51236ce9fa..db598ebf2c79 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -27348,6 +27348,25 @@ aarch_macro_fusion_pair_p (rtx_insn *prev, rtx_insn 
*curr)
   && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr)))
 return true;
 
+  /* FUSE CMP and CSEL.  */
+  if (aarch64_fusion_enabled_p (AARCH64_FUSE_CMP_CSEL)
+  && prev_set && curr_set
+  && GET_CODE (SET_SRC (prev_set)) == COMPARE
+  && GET_CODE (SET_SRC (curr_set)) == IF_THEN_ELSE
+  && REG_P (XEXP (SET_SRC (curr_set), 1))
+  && REG_P (XEXP (SET_SRC (curr_set), 2))
+  && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr)))
+return true;
+
+  /* Fuse CMP and CSET.  */
+  if (aarch64_fusion_enabled_p (AARCH64_FUSE_CMP_CSET)
+  && prev_set && curr_set
+  && GET_CODE (SET_SRC (prev_set)) == COMPARE
+  && GET_RTX_CLASS (GET_CODE (SET_SRC (curr_set))) == RTX_COMPARE
+  && REG_P (SET_DEST (curr_set))
+  && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr)))
+return true;
+
   /* Fuse flag-setting ALU instructions and conditional branch.  */
   if (aarch64_fusion_enabled_p (AARCH64_FUSE_ALU_BRANCH)
   && any_condjump_p (curr))
diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
b/gcc/config/aarch64/tuning_models/neoversev2.h
index f76e4ef358f7..ae99fab22d80 100644
--- a/gcc/config/aarch64/tuning_models/neoversev2.h
+++ b/gcc/config/aarch64/tuning_models/neoversev2.h
@@ -221,7 +221,10 @@ static const struct tune_params neoversev2_tunings =
 2 /* store_pred.  */
   }, /* memmov_cost.  */
   5, /* issue_rate  */
-  (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops  */
+  (AARCH64_FUSE_AES_AESMC
+   | AARCH64_FUSE_CMP_BRANCH
+   | AARCH64_FUSE_CMP_CSEL
+   | AARCH64_FUSE_CMP_CSET), /* fusible_ops  */
   "32:16", /* function_align.  */
   "4", /* jump_align.  */
   "32:16", /* loop_align.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/cmp_csel_fuse.c 
b/gcc/testsuite/gcc.target/aarch64/cmp_csel_fuse.c
new file mode 100644
index ..f5e511e46737
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/cmp_csel_fuse.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=neoverse-v2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+/*
+** f1:
+** ...
+** cmp w[0-9]+, w[0-9]+
+** cselw[0-9]+, w[0-9]+, w[0-9]+, le
+** ret
+*/
+int f1 (int a, int b, int c)
+{
+  int cmp = a > b;
+  int add1 = c + 3;
+  int add2 = c + 8;
+  return cmp ? add1 : add2;
+}
+
+/*
+** f2:
+** ...
+** cmp x[0-9]+, x[0-9]+
+** cselx[0-9]+, x[0-9]+, x[0-9]+, le
+** ret
+*/
+long long f2 (long long a, long long b, long long c)
+{
+  long long cmp = a > b;
+  long long add1 = c + 3;
+  long long add2 = c + 8;
+  return cmp ? add1 : add2;
+}

[gcc r15-2254] Revert "aarch64: Fuse CMP+CSEL and CMP+CSET for -mcpu=neoverse-v2"

2024-07-24 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:39562dd1e745c7aacc23b51b2849a7d346cbef14

commit r15-2254-g39562dd1e745c7aacc23b51b2849a7d346cbef14
Author: Kyrylo Tkachov 
Date:   Wed Jul 24 17:25:43 2024 +0530

Revert "aarch64: Fuse CMP+CSEL and CMP+CSET for -mcpu=neoverse-v2"

This reverts commit 4c5eb66e701bc9f3bf1298269f52559b10d63a09.

Diff:
---
 gcc/config/aarch64/aarch64-fusion-pairs.def  |  2 --
 gcc/config/aarch64/aarch64.cc| 19 -
 gcc/config/aarch64/tuning_models/neoversev2.h|  5 +---
 gcc/testsuite/gcc.target/aarch64/cmp_csel_fuse.c | 34 
 gcc/testsuite/gcc.target/aarch64/cmp_cset_fuse.c | 31 -
 5 files changed, 1 insertion(+), 90 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-fusion-pairs.def 
b/gcc/config/aarch64/aarch64-fusion-pairs.def
index bf5e85ba8fe1..9a43b0c80657 100644
--- a/gcc/config/aarch64/aarch64-fusion-pairs.def
+++ b/gcc/config/aarch64/aarch64-fusion-pairs.def
@@ -37,7 +37,5 @@ AARCH64_FUSION_PAIR ("aes+aesmc", AES_AESMC)
 AARCH64_FUSION_PAIR ("alu+branch", ALU_BRANCH)
 AARCH64_FUSION_PAIR ("alu+cbz", ALU_CBZ)
 AARCH64_FUSION_PAIR ("addsub_2reg_const1", ADDSUB_2REG_CONST1)
-AARCH64_FUSION_PAIR ("cmp+csel", CMP_CSEL)
-AARCH64_FUSION_PAIR ("cmp+cset", CMP_CSET)
 
 #undef AARCH64_FUSION_PAIR
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index db598ebf2c79..9e51236ce9fa 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -27348,25 +27348,6 @@ aarch_macro_fusion_pair_p (rtx_insn *prev, rtx_insn 
*curr)
   && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr)))
 return true;
 
-  /* FUSE CMP and CSEL.  */
-  if (aarch64_fusion_enabled_p (AARCH64_FUSE_CMP_CSEL)
-  && prev_set && curr_set
-  && GET_CODE (SET_SRC (prev_set)) == COMPARE
-  && GET_CODE (SET_SRC (curr_set)) == IF_THEN_ELSE
-  && REG_P (XEXP (SET_SRC (curr_set), 1))
-  && REG_P (XEXP (SET_SRC (curr_set), 2))
-  && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr)))
-return true;
-
-  /* Fuse CMP and CSET.  */
-  if (aarch64_fusion_enabled_p (AARCH64_FUSE_CMP_CSET)
-  && prev_set && curr_set
-  && GET_CODE (SET_SRC (prev_set)) == COMPARE
-  && GET_RTX_CLASS (GET_CODE (SET_SRC (curr_set))) == RTX_COMPARE
-  && REG_P (SET_DEST (curr_set))
-  && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr)))
-return true;
-
   /* Fuse flag-setting ALU instructions and conditional branch.  */
   if (aarch64_fusion_enabled_p (AARCH64_FUSE_ALU_BRANCH)
   && any_condjump_p (curr))
diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
b/gcc/config/aarch64/tuning_models/neoversev2.h
index ae99fab22d80..f76e4ef358f7 100644
--- a/gcc/config/aarch64/tuning_models/neoversev2.h
+++ b/gcc/config/aarch64/tuning_models/neoversev2.h
@@ -221,10 +221,7 @@ static const struct tune_params neoversev2_tunings =
 2 /* store_pred.  */
   }, /* memmov_cost.  */
   5, /* issue_rate  */
-  (AARCH64_FUSE_AES_AESMC
-   | AARCH64_FUSE_CMP_BRANCH
-   | AARCH64_FUSE_CMP_CSEL
-   | AARCH64_FUSE_CMP_CSET), /* fusible_ops  */
+  (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops  */
   "32:16", /* function_align.  */
   "4", /* jump_align.  */
   "32:16", /* loop_align.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/cmp_csel_fuse.c 
b/gcc/testsuite/gcc.target/aarch64/cmp_csel_fuse.c
deleted file mode 100644
index f5e511e46737..
--- a/gcc/testsuite/gcc.target/aarch64/cmp_csel_fuse.c
+++ /dev/null
@@ -1,34 +0,0 @@
-/* { dg-do compile } */
-/* { dg-options "-O2 -mcpu=neoverse-v2" } */
-/* { dg-final { check-function-bodies "**" "" } } */
-
-/*
-** f1:
-** ...
-** cmp w[0-9]+, w[0-9]+
-** cselw[0-9]+, w[0-9]+, w[0-9]+, le
-** ret
-*/
-int f1 (int a, int b, int c)
-{
-  int cmp = a > b;
-  int add1 = c + 3;
-  int add2 = c + 8;
-  return cmp ? add1 : add2;
-}
-
-/*
-** f2:
-** ...
-** cmp x[0-9]+, x[0-9]+
-** cselx[0-9]+, x[0-9]+, x[0-9]+, le
-** ret
-*/
-long long f2 (long long a, long long b, long long c)
-{
-  long long cmp = a > b;
-  long long add1 = c + 3;
-  long long add2 = c + 8;
-  return cmp ? add1 : add2;
-}
-
diff --git a/gcc/testsuite/gcc.target/aarch64/cmp_cset_fuse.c 
b/gcc/testsuite/gcc.target/aarch64/cmp_cset_fuse.c
deleted file mode 100644
index 04f1ce2773ba..
--- a/gcc/testsuite/gcc.target/aarch64/cmp_cset_fuse.c
+++ /dev/null
@@ -1,31 +0,0 @@
-/* { dg-do compile } */
-/* { dg-options "-O2 -mcpu=neoverse-v2" } */
-/* { dg-final { check-function-bodies "**" "" } } */
-
-/*
-** f1:
-** cmp w[0-9]+, w[0-9]+
-** csetw[0-9]+, gt
-** ...
-*/
-int g;
-int f1 (int a, int b)
-{
-  int cmp = a > b;
-  g = cmp + 1;
-  return cmp;
-}
-
-/*
-** f2:
-** cmp x[0-9]+, x[0-9]+
-** csetx[0-9]+, gt
-** ...
-*/
-long long h;
-long long f2 (long long a, long long b)
-{
-  long long cmp = a > b;
-  h = cmp + 1;
-  return cmp;
-}

[gcc r15-2297] SVE Intrinsics: Change return type of redirect_call to gcall.

2024-07-24 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:3adfcc5802237e1299d67e6d716481cd3db2234a

commit r15-2297-g3adfcc5802237e1299d67e6d716481cd3db2234a
Author: Jennifer Schmitz 
Date:   Tue Jul 23 03:54:50 2024 -0700

SVE Intrinsics: Change return type of redirect_call to gcall.

As suggested in the review of
https://gcc.gnu.org/pipermail/gcc-patches/2024-July/657474.html,
this patch changes the return type of gimple_folder::redirect_call from
gimple * to gcall *. The motivation for this is that so far, most callers of
the function had been casting the result of the function to gcall. These
call sites were updated.

The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
regression.
OK for mainline?

Signed-off-by: Jennifer Schmitz 

gcc/

* config/aarch64/aarch64-sve-builtins.cc
(gimple_folder::redirect_call): Update return type.
* config/aarch64/aarch64-sve-builtins.h: Likewise.
* config/aarch64/aarch64-sve-builtins-sve2.cc (svqshl_impl::fold):
Remove cast to gcall.
(svrshl_impl::fold): Likewise.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-sve2.cc | 6 +++---
 gcc/config/aarch64/aarch64-sve-builtins.cc  | 2 +-
 gcc/config/aarch64/aarch64-sve-builtins.h   | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
index 4f25cc680282..dc5915516825 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
@@ -349,7 +349,7 @@ public:
instance.base_name = "svlsr";
instance.base = functions::svlsr;
  }
-   gcall *call = as_a <gcall *> (f.redirect_call (instance));
+   gcall *call = f.redirect_call (instance);
gimple_call_set_arg (call, 2, amount);
return call;
  }
@@ -379,7 +379,7 @@ public:
function_instance instance ("svlsl", functions::svlsl,
shapes::binary_uint_opt_n, MODE_n,
f.type_suffix_ids, GROUP_none, f.pred);
-   gcall *call = as_a <gcall *> (f.redirect_call (instance));
+   gcall *call = f.redirect_call (instance);
gimple_call_set_arg (call, 2, amount);
return call;
  }
@@ -392,7 +392,7 @@ public:
function_instance instance ("svrshr", functions::svrshr,
shapes::shift_right_imm, MODE_n,
f.type_suffix_ids, GROUP_none, f.pred);
-   gcall *call = as_a <gcall *> (f.redirect_call (instance));
+   gcall *call = f.redirect_call (instance);
gimple_call_set_arg (call, 2, amount);
return call;
  }
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
b/gcc/config/aarch64/aarch64-sve-builtins.cc
index f3983a123e35..0a560eaedca1 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -3592,7 +3592,7 @@ gimple_folder::load_store_cookie (tree type)
 }
 
 /* Fold the call to a call to INSTANCE, with the same arguments.  */
-gimple *
+gcall *
 gimple_folder::redirect_call (const function_instance &instance)
 {
   registered_function *rfn
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.h 
b/gcc/config/aarch64/aarch64-sve-builtins.h
index 9cc07d5fa3de..9ab6f202c306 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.h
+++ b/gcc/config/aarch64/aarch64-sve-builtins.h
@@ -629,7 +629,7 @@ public:
   tree fold_contiguous_base (gimple_seq &, tree);
   tree load_store_cookie (tree);
 
-  gimple *redirect_call (const function_instance &);
+  gcall *redirect_call (const function_instance &);
   gimple *redirect_pred_x ();
 
   gimple *fold_to_cstu (poly_uint64);


[gcc r14-9782] [MAINTAINERS] Update my email address and step down as arm port maintainer

2024-04-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:f2ccfb2d0b2698e6b140e4d09e53b701a3193384

commit r14-9782-gf2ccfb2d0b2698e6b140e4d09e53b701a3193384
Author: Kyrylo Tkachov 
Date:   Thu Apr 4 09:12:28 2024 +0100

[MAINTAINERS] Update my email address and step down as arm port maintainer

* MAINTAINERS: Update my email details, remove myself as arm
maintainer.  Add myself to DCO section.

Diff:
---
 MAINTAINERS | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 8f64ee630b4..9a6c41afb12 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -52,7 +52,7 @@ docs, and the testsuite related to that.
 aarch64 port   Richard Earnshaw
 aarch64 port   Richard Sandiford   
 aarch64 port   Marcus Shawcroft
-aarch64 port   Kyrylo Tkachov  
+aarch64 port   Kyrylo Tkachov  
 alpha port Richard Henderson   
 amdgcn portJulian Brown
 amdgcn portAndrew Stubbs   
@@ -61,7 +61,6 @@ arc port  Claudiu Zissulescu  

 arm port   Nick Clifton
 arm port   Richard Earnshaw
 arm port   Ramana Radhakrishnan
-arm port   Kyrylo Tkachov  
 avr port   Denis Chertykov 
 bfin port  Jie Zhang   
 bpf port   Jose E. Marchesi
@@ -782,6 +781,7 @@ Nathaniel Shead 

 Nathan Sidwell 
 Edward Smith-Rowland   
 Fangrui Song   
+Kyrylo Tkachov 
 Petter Tomner  
 Martin Uecker  
 Jonathan Wakely


[gcc r15-2405] SVE intrinsics: Add strength reduction for division by constant.

2024-07-30 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:7cde140863edea536c676096cbc3d84a6d1424e4

commit r15-2405-g7cde140863edea536c676096cbc3d84a6d1424e4
Author: Jennifer Schmitz 
Date:   Tue Jul 16 01:59:50 2024 -0700

SVE intrinsics: Add strength reduction for division by constant.

This patch folds SVE division where all divisor elements are the same
power of 2 to svasrd (signed) or svlsr (unsigned).
Tests were added to check
1) whether the transform is applied (existing test harness was amended), and
2) correctness using runtime tests for all input types of svdiv; for signed
and unsigned integers, several corner cases were covered.
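
As a hedged illustration (not taken from the patch), the kind of source this
fold targets; the function names are assumptions, and the comments indicate
the expected lowering:

#pragma GCC target "+sve"
#include <arm_sve.h>

svint32_t
div_s32_by_16 (svbool_t pg, svint32_t x)
{
  /* All divisor elements are the power of two 16, so this can be folded to
     an arithmetic shift right by 4 (svasrd) instead of an SDIV.  */
  return svdiv_n_s32_x (pg, x, 16);
}

svuint32_t
div_u32_by_16 (svbool_t pg, svuint32_t x)
{
  /* Unsigned division by 16 becomes a logical shift right by 4 (svlsr).  */
  return svdiv_n_u32_x (pg, x, 16);
}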

The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
regression.
OK for mainline?

Signed-off-by: Jennifer Schmitz 

gcc/

* config/aarch64/aarch64-sve-builtins-base.cc (svdiv_impl::fold):
Implement strength reduction.

gcc/testsuite/

* gcc.target/aarch64/sve/div_const_run.c: New test.
* gcc.target/aarch64/sve/acle/asm/div_s32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/div_s64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/div_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/div_u64.c: Likewise.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc|  49 +++-
 .../gcc.target/aarch64/sve/acle/asm/div_s32.c  | 273 +++--
 .../gcc.target/aarch64/sve/acle/asm/div_s64.c  | 273 +++--
 .../gcc.target/aarch64/sve/acle/asm/div_u32.c  | 201 +--
 .../gcc.target/aarch64/sve/acle/asm/div_u64.c  | 201 +--
 .../gcc.target/aarch64/sve/div_const_run.c |  91 +++
 6 files changed, 1031 insertions(+), 57 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index a2268353ae31..d55bee0b72fa 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -746,6 +746,53 @@ public:
   }
 };
 
+class svdiv_impl : public rtx_code_function
+{
+public:
+  CONSTEXPR svdiv_impl ()
+: rtx_code_function (DIV, UDIV, UNSPEC_COND_FDIV) {}
+
+  gimple *
+  fold (gimple_folder &f) const override
+  {
+tree divisor = gimple_call_arg (f.call, 2);
+tree divisor_cst = uniform_integer_cst_p (divisor);
+
+if (!divisor_cst || !integer_pow2p (divisor_cst))
+  return NULL;
+
+tree new_divisor;
+gcall *call;
+
+if (f.type_suffix (0).unsigned_p && tree_to_uhwi (divisor_cst) != 1)
+  {
+   function_instance instance ("svlsr", functions::svlsr,
+   shapes::binary_uint_opt_n, MODE_n,
+   f.type_suffix_ids, GROUP_none, f.pred);
+   call = f.redirect_call (instance);
+   tree d = INTEGRAL_TYPE_P (TREE_TYPE (divisor)) ? divisor : divisor_cst;
+   new_divisor = wide_int_to_tree (TREE_TYPE (d), tree_log2 (d));
+  }
+else
+  {
+   if (tree_int_cst_sign_bit (divisor_cst)
+   || tree_to_shwi (divisor_cst) == 1)
+ return NULL;
+
+   function_instance instance ("svasrd", functions::svasrd,
+   shapes::shift_right_imm, MODE_n,
+   f.type_suffix_ids, GROUP_none, f.pred);
+   call = f.redirect_call (instance);
+   new_divisor = wide_int_to_tree (scalar_types[VECTOR_TYPE_svuint64_t],
+   tree_log2 (divisor_cst));
+  }
+
+gimple_call_set_arg (call, 2, new_divisor);
+return call;
+  }
+};
+
+
 class svdot_impl : public function_base
 {
 public:
@@ -3043,7 +3090,7 @@ FUNCTION (svcreate3, svcreate_impl, (3))
 FUNCTION (svcreate4, svcreate_impl, (4))
 FUNCTION (svcvt, svcvt_impl,)
 FUNCTION (svcvtnt, CODE_FOR_MODE0 (aarch64_sve_cvtnt),)
-FUNCTION (svdiv, rtx_code_function, (DIV, UDIV, UNSPEC_COND_FDIV))
+FUNCTION (svdiv, svdiv_impl,)
 FUNCTION (svdivr, rtx_code_function_rotated, (DIV, UDIV, UNSPEC_COND_FDIV))
 FUNCTION (svdot, svdot_impl,)
 FUNCTION (svdot_lane, svdotprod_lane_impl, (UNSPEC_SDOT, UNSPEC_UDOT,
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c
index c49ca1aa5243..d5a23bf07262 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c
@@ -2,6 +2,8 @@
 
 #include "test_sve_acle.h"
 
+#define MAXPOW 1<<30
+
 /*
 ** div_s32_m_tied1:
 ** sdivz0\.s, p0/m, z0\.s, z1\.s
@@ -53,10 +55,27 @@ TEST_UNIFORM_ZX (div_w0_s32_m_untied, svint32_t, int32_t,
 z0 = svdiv_n_s32_m (p0, z1, x0),
 z0 = svdiv_m (p0, z1, x0))
 
+/*
+** div_1_s32_m_tied1:
+** sel z0\.s, p0, z0\.s, z0\.s
+** ret
+*/
+TEST_UNIFORM_Z (div_1_s32_m_tied1, svint32_t,
+   z0 = svdiv_n_s32_m (p0, z0, 1),
+   z0 = svdiv_m (p0, z0, 1

[gcc r15-2720] tree-reassoc.cc: PR tree-optimization/116139 Don't assert when forming fully-pipelined FMAs on wide MULT targets

2024-08-05 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:44da85f4455ea11296667434172810ea76a62add

commit r15-2720-g44da85f4455ea11296667434172810ea76a62add
Author: Kyrylo Tkachov 
Date:   Fri Aug 2 06:21:16 2024 -0700

tree-reassoc.cc: PR tree-optimization/116139 Don't assert when forming 
fully-pipelined FMAs on wide MULT targets

The code in get_reassociation_width that forms FMAs aggressively when
they are fully pipelined expects the FMUL reassociation width in the
target to be less than that for FMAs.  This doesn't hold for all target
tunings.

This code shouldn't ICE, just avoid forming these FMAs here.
This patch does that.

Signed-off-by: Kyrylo Tkachov 

PR tree-optimization/116139

gcc/ChangeLog:

* tree-ssa-reassoc.cc (get_reassociation_width): Move width_mult
<= width comparison to if condition rather than assert.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/pr116139.c: New test.

Diff:
---
 gcc/testsuite/gcc.target/aarch64/pr116139.c | 35 +
 gcc/tree-ssa-reassoc.cc | 17 +++---
 2 files changed, 43 insertions(+), 9 deletions(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/pr116139.c 
b/gcc/testsuite/gcc.target/aarch64/pr116139.c
new file mode 100644
index ..78a21323030a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr116139.c
@@ -0,0 +1,35 @@
+/* PR tree-optimization/116139 */
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param fully-pipelined-fma=1 -mcpu=neoverse-n3" } */
+
+#define LOOP_COUNT 8
+typedef double data_e;
+
+data_e
+foo (data_e in)
+{
+  data_e a1, a2, a3, a4;
+  data_e tmp, result = 0;
+  a1 = in + 0.1;
+  a2 = in * 0.1;
+  a3 = in + 0.01;
+  a4 = in * 0.59;
+
+  data_e result2 = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+{
+  tmp = a1 + a2 * a2 + a3 * a3 + a4 * a4 ;
+  result += tmp - ic;
+  result2 = result2 / 2 - tmp;
+
+  a1 += 0.91;
+  a2 += 0.1;
+  a3 -= 0.01;
+  a4 -= 0.89;
+
+}
+
+  return result + result2;
+}
+
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index d74352268b5d..70c810c51984 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5509,16 +5509,15 @@ get_reassociation_width (vec *ops, int 
mult_num, tree lhs,
  , it is latency(MULT)*2 + latency(ADD)*2.  Assuming latency(MULT) >=
  latency(ADD), the first variant is preferred.
 
- Find out if we can get a smaller width considering FMA.  */
-  if (width > 1 && mult_num && param_fully_pipelined_fma)
+ Find out if we can get a smaller width considering FMA.
+ Assume FMUL and FMA use the same units that can also do FADD.
+ For other scenarios, such as when FMUL and FADD are using separated units,
+ the following code may not apply.  */
+
+  int width_mult = targetm.sched.reassociation_width (MULT_EXPR, mode);
+  if (width > 1 && mult_num && param_fully_pipelined_fma
+  && width_mult <= width)
 {
-  /* When param_fully_pipelined_fma is set, assume FMUL and FMA use the
-same units that can also do FADD.  For other scenarios, such as when
-FMUL and FADD are using separated units, the following code may not
-appy.  */
-  int width_mult = targetm.sched.reassociation_width (MULT_EXPR, mode);
-  gcc_checking_assert (width_mult <= width);
-
   /* Latency of MULT_EXPRs.  */
   int lat_mul
= get_mult_latency_consider_fma (ops_num, mult_num, width_mult);


[gcc r15-2842] aarch64: Check CONSTM1_RTX in definition of Dm constraint

2024-08-08 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:19e565ed13972410451091a789fe58638d03b795

commit r15-2842-g19e565ed13972410451091a789fe58638d03b795
Author: Kyrylo Tkachov 
Date:   Mon Aug 5 10:47:33 2024 -0700

aarch64: Check CONSTM1_RTX in definition of Dm constraint

The constraint Dm is intended to match vectors of minus 1, but actually
checks for CONST1_RTX.  This doesn't have a bad effect in practice, as its
only use is in the aarch64_wrffr pattern for the setffr instruction, which
is a VNx16BI operation where -1 and 1 are the same. That pattern
can only be currently generated through intrinsics anyway that create it
with a CONSTM1_RTX constant.

Fix the constraint definition so that it doesn't become a footgun if it's
used in some other pattern.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

gcc/ChangeLog:

* config/aarch64/constraints.md (Dm): Match CONSTM1_RTX rather
CONST1_RTX.

Diff:
---
 gcc/config/aarch64/constraints.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/constraints.md 
b/gcc/config/aarch64/constraints.md
index 0c81fb28f7e5..a2878f580d90 100644
--- a/gcc/config/aarch64/constraints.md
+++ b/gcc/config/aarch64/constraints.md
@@ -556,7 +556,7 @@
   "@internal
  A constraint that matches a vector of immediate minus one."
  (and (match_code "const,const_vector")
-  (match_test "op == CONST1_RTX (GET_MODE (op))")))
+  (match_test "op == CONSTM1_RTX (GET_MODE (op))")))
 
 (define_constraint "Dd"
   "@internal


[gcc r15-2859] Revert "lra: emit caller-save register spills before call insn [PR116028]"

2024-08-09 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:4734c1bfe837b3e70bc783dafc442de3bca43d88

commit r15-2859-g4734c1bfe837b3e70bc783dafc442de3bca43d88
Author: Kyrylo Tkachov 
Date:   Fri Aug 9 21:16:56 2024 +0200

Revert "lra: emit caller-save register spills before call insn [PR116028]"

This reverts commit 3c67a0fa1dd39a3378deb854a7fef0ff7fe38004.

Diff:
---
 gcc/lra-constraints.cc   | 28 
 gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c |  2 +-
 gcc/testsuite/gcc.dg/pr10474.c   |  2 +-
 3 files changed, 6 insertions(+), 26 deletions(-)

diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 28c1a877c003..92b343fa99a0 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -152,9 +152,6 @@ static machine_mode curr_operand_mode[MAX_RECOG_OPERANDS];
(e.g. constant) and whose subreg is given operand of the current
insn.  VOIDmode in all other cases.  */
 static machine_mode original_subreg_reg_mode[MAX_RECOG_OPERANDS];
-/* The nearest call insn for an insn on which split transformation
-   will be done. The call insn is in the same EBB as the insn.  */
-static rtx_insn *latest_call_insn;
 
 
 
@@ -6289,25 +6286,10 @@ split_reg (bool before_p, int original_regno, rtx_insn 
*insn,
 after_p ? restore : NULL,
 call_save_p
 ?  "Add reg<-save" : "Add reg<-split");
-  if (call_save_p && latest_call_insn != NULL)
-/* PR116028: If original_regno is a pseudo that has been assigned a
-   call-save hard register, then emit the spill insn before the call
-   insn 'latest_call_insn' instead of adjacent to 'insn'. If 'insn'
-   and 'latest_call_insn' belong to the same EBB but to two separate
-   BBs, and if 'insn' is present in the entry BB, then generating the
-   spill insn in the entry BB can prevent shrink wrap from happening.
-   This is because the spill insn references the stack pointer and
-   hence the prolog gets generated in the entry BB itself. It is
-   also more efficient to generate the spill before
-   'latest_call_insn' as the spill now occurs only in the path
-   containing the call.  */
-lra_process_new_insns (PREV_INSN (latest_call_insn), NULL, save,
-  "Add save<-reg");
-  else
-lra_process_new_insns (insn, before_p ? save : NULL,
-  before_p ? NULL : save,
-  call_save_p
-  ?  "Add save<-reg" : "Add split<-reg");
+  lra_process_new_insns (insn, before_p ? save : NULL,
+before_p ? NULL : save,
+call_save_p
+?  "Add save<-reg" : "Add split<-reg");
   if (nregs > 1 || original_regno < FIRST_PSEUDO_REGISTER)
 /* If we are trying to split multi-register.  We should check
conflicts on the next assignment sub-pass.  IRA can allocate on
@@ -6791,7 +6773,6 @@ inherit_in_ebb (rtx_insn *head, rtx_insn *tail)
   last_processed_bb = NULL;
   CLEAR_HARD_REG_SET (potential_reload_hard_regs);
   live_hard_regs = eliminable_regset | lra_no_alloc_regs;
-  latest_call_insn = NULL;
   /* We don't process new insns generated in the loop. */
   for (curr_insn = tail; curr_insn != PREV_INSN (head); curr_insn = prev_insn)
 {
@@ -7004,7 +6985,6 @@ inherit_in_ebb (rtx_insn *head, rtx_insn *tail)
  last_call_for_abi[callee_abi.id ()] = calls_num;
  full_and_partial_call_clobbers
|= callee_abi.full_and_partial_reg_clobbers ();
- latest_call_insn = curr_insn;
  if ((cheap = find_reg_note (curr_insn,
  REG_RETURNED, NULL_RTX)) != NULL_RTX
  && ((cheap = XEXP (cheap, 0)), true)
diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c 
b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
index 8c150972f952..a95637abbe54 100644
--- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
+++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
@@ -26,4 +26,4 @@ bar (long a)
 
 /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } 
*/
 /* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! 
aarch64*-*-* } } } } */
-/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" 
} } */
+/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" 
{ xfail powerpc*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/pr10474.c b/gcc/testsuite/gcc.dg/pr10474.c
index b5393d5b6e3e..a4af536ec284 100644
--- a/gcc/testsuite/gcc.dg/pr10474.c
+++ b/gcc/testsuite/gcc.dg/pr10474.c
@@ -13,4 +13,4 @@ void f(int *i)
 }
 
 /* XFAIL due to PR70681.  */ 
-/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue"  
{ xfail arm*-*-* } } } */
+/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue"  
{ xfail arm*-*-* powerpc*-*-* } } } */


[gcc r15-2883] aarch64: Emit ADD X, Y, Y instead of SHL X, Y, #1 for Advanced SIMD

2024-08-12 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:fcc766c82cf8e0473ba54f1660c8282a7ce3231c

commit r15-2883-gfcc766c82cf8e0473ba54f1660c8282a7ce3231c
Author: Kyrylo Tkachov 
Date:   Mon Aug 5 11:29:44 2024 -0700

aarch64: Emit ADD X, Y, Y instead of SHL X, Y, #1 for Advanced SIMD

On many cores, including Neoverse V2, the throughput of vector ADD
instructions is higher than that of vector shifts like SHL.  We can lean on that
to emit code like:
  add v0.4s, v0.4s, v0.4s
instead of:
  shl v0.4s, v0.4s, 1

LLVM already does this trick.
In RTL the code gets canonicalised from (plus x x) to (ashift x 1), so I
opted to instead do this at the final assembly printing stage, similar
to how we emit CMLT instead of SSHR elsewhere in the backend.

I'd like to also do this for SVE shifts, but those will have to be
separate patches.

Signed-off-by: Kyrylo Tkachov 

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md
(aarch64_simd_imm_shl): Rewrite to new
syntax.  Add =w,w,vs1 alternative.
* config/aarch64/constraints.md (vs1): New constraint.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/advsimd_shl_add.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-simd.md | 12 ++--
 gcc/config/aarch64/constraints.md  |  6 ++
 gcc/testsuite/gcc.target/aarch64/advsimd_shl_add.c | 64 ++
 3 files changed, 77 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index cc612ec2ca0e..475f19766c38 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1352,12 +1352,14 @@
 )
 
 (define_insn "aarch64_simd_imm_shl"
- [(set (match_operand:VDQ_I 0 "register_operand" "=w")
-   (ashift:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w")
-  (match_operand:VDQ_I  2 "aarch64_simd_lshift_imm" "Dl")))]
+ [(set (match_operand:VDQ_I 0 "register_operand")
+   (ashift:VDQ_I (match_operand:VDQ_I 1 "register_operand")
+  (match_operand:VDQ_I  2 "aarch64_simd_lshift_imm")))]
  "TARGET_SIMD"
-  "shl\t%0., %1., %2"
-  [(set_attr "type" "neon_shift_imm")]
+  {@ [ cons: =0, 1,  2   ; attrs: type   ]
+ [ w   , w,  vs1 ; neon_add   ] add\t%0.<Vtype>, %1.<Vtype>, %1.<Vtype>
+ [ w   , w,  Dl  ; neon_shift_imm ] shl\t%0.<Vtype>, %1.<Vtype>, %2
+  }
 )
 
 (define_insn "aarch64_simd_reg_sshl"
diff --git a/gcc/config/aarch64/constraints.md 
b/gcc/config/aarch64/constraints.md
index a2878f580d90..f491e4bd6a06 100644
--- a/gcc/config/aarch64/constraints.md
+++ b/gcc/config/aarch64/constraints.md
@@ -667,6 +667,12 @@
SMAX and SMIN operations."
  (match_operand 0 "aarch64_sve_vsm_immediate"))
 
+(define_constraint "vs1"
+  "@internal
+ A constraint that matches a vector of immediate one."
+ (and (match_code "const,const_vector")
+  (match_test "op == CONST1_RTX (GET_MODE (op))")))
+
 (define_constraint "vsA"
   "@internal
A constraint that matches an immediate operand valid for SVE FADD
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd_shl_add.c 
b/gcc/testsuite/gcc.target/aarch64/advsimd_shl_add.c
new file mode 100644
index ..a161f89a3acc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd_shl_add.c
@@ -0,0 +1,64 @@
+/* { dg-do compile } */
+/* { dg-additional-options "--save-temps -O1" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+typedef __INT64_TYPE__ __attribute__ ((vector_size (16))) v2di;
+typedef int __attribute__ ((vector_size (16))) v4si;
+typedef short __attribute__ ((vector_size (16))) v8hi;
+typedef char __attribute__ ((vector_size (16))) v16qi;
+typedef short __attribute__ ((vector_size (8))) v4hi;
+typedef char __attribute__ ((vector_size (8))) v8qi;
+
+#define FUNC(S) \
+S   \
+foo_##S (S a)   \
+{ return a << 1; }
+
+/*
+** foo_v2di:
+**  addv0.2d, v0.2d, v0.2d
+**  ret
+*/
+
+FUNC (v2di)
+
+/*
+** foo_v4si:
+**  addv0.4s, v0.4s, v0.4s
+**  ret
+*/
+
+FUNC (v4si)
+
+/*
+** foo_v8hi:
+**  addv0.8h, v0.8h, v0.8h
+**  ret
+*/
+
+FUNC (v8hi)
+
+/*
+** foo_v16qi:
+**  addv0.16b, v0.16b, v0.16b
+**  ret
+*/
+
+FUNC (v16qi)
+
+/*
+** foo_v4hi:
+**  addv0.4h, v0.4h, v0.4h
+**  ret
+*/
+
+FUNC (v4hi)
+
+/*
+** foo_v8qi:
+**  addv0.8b, v0.8b, v0.8b
+**  ret
+*/
+
+FUNC (v8qi)
+


[gcc r15-1403] [MAINTAINERS] Update my email address

2024-06-18 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:5f6b42969d598139640e60daf1d0b9bdfcaa9f73

commit r15-1403-g5f6b42969d598139640e60daf1d0b9bdfcaa9f73
Author: Kyrylo Tkachov 
Date:   Tue Jun 18 14:00:54 2024 +0200

[MAINTAINERS] Update my email address

Pushing to trunk.

* MAINTAINERS (aarch64 port): Update my email address.
(DCO section): Likewise.

Signed-off-by: Kyrylo Tkachov 

Diff:
---
 MAINTAINERS | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 6444e6ea2f1a..8b6fa16f79a9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -52,7 +52,7 @@ docs, and the testsuite related to that.
 aarch64 port   Richard Earnshaw
 aarch64 port   Richard Sandiford   
 aarch64 port   Marcus Shawcroft
-aarch64 port   Kyrylo Tkachov  
+aarch64 port   Kyrylo Tkachov  
 alpha port Richard Henderson   
 amdgcn portJulian Brown
 amdgcn portAndrew Stubbs   
@@ -784,7 +784,7 @@ Nathaniel Shead 

 Nathan Sidwell 
 Edward Smith-Rowland   
 Fangrui Song   
-Kyrylo Tkachov 
+Kyrylo Tkachov 
 Petter Tomner  
 Martin Uecker  
 Jonathan Wakely


[gcc r15-4269] PR 117048: simplify-rtx: Extend (x << C1) | (X >> C2) --> ROTATE transformation to vector operands

2024-10-11 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:70566e719f0710323251e8e9190b322f4de8faeb

commit r15-4269-g70566e719f0710323251e8e9190b322f4de8faeb
Author: Kyrylo Tkachov 
Date:   Wed Oct 9 09:39:55 2024 -0700

PR 117048: simplify-rtx: Extend (x << C1) | (X >> C2) --> ROTATE 
transformation to vector operands

In the testcase from patch [2/2] we want to match a vector rotate operation from
an IOR of left and right shifts by immediate.  simplify-rtx has code for just
that, but it looks like it's prepared to handle only scalar operands.
In practice most of the code works for vector modes as well except the shift
amounts are checked to be CONST_INT rather than vector constants that we 
have
here.  This is easily extended by using unwrap_const_vec_duplicate to 
extract
the repeating constant shift amount.  With this change combine now tries
matching the simpler and expected:
(set (reg:V2DI 119 [ _14 ])
(rotate:V2DI (xor:V2DI (reg:V2DI 114 [ vect__1.12_16 ])
(reg:V2DI 116 [ *m1_01_8(D) ]))
(const_vector:V2DI [
(const_int 32 [0x20]) repeated x2
])))
instead of the previous:
(set (reg:V2DI 119 [ _14 ])
(ior:V2DI (ashift:V2DI (xor:V2DI (reg:V2DI 114 [ vect__1.12_16 ])
(reg:V2DI 116 [ *m1_01_8(D) ]))
(const_vector:V2DI [
(const_int 32 [0x20]) repeated x2
]))
(lshiftrt:V2DI (xor:V2DI (reg:V2DI 114 [ vect__1.12_16 ])
(reg:V2DI 116 [ *m1_01_8(D) ]))
(const_vector:V2DI [
(const_int 32 [0x20]) repeated x2
]

To actually fix the PR the aarch64 backend needs some adjustment as well
which is done in patch [2/2], which adds the testcase as well.

Bootstrapped and tested on aarch64-none-linux-gnu.
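
For illustration (an assumption, not the testcase from patch [2/2]), GCC
vector-extension source of the shape that gives rise to this IOR-of-shifts RTL:

typedef unsigned long long v2di __attribute__ ((vector_size (16)));

v2di
rotate_xor_by_32 (v2di x, v2di y)
{
  v2di t = x ^ y;
  /* The shift amounts sum to the element precision (32 + 32 == 64), so this
     IOR of shifts is really a rotate by 32 and can now be simplified as such.  */
  return (t << 32) | (t >> 32);
}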

Signed-off-by: Kyrylo Tkachov 

PR target/117048
* simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
Handle vector constants in (x << C1) | (x >> C2) -> ROTATE
simplification.

Diff:
---
 gcc/simplify-rtx.cc | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index e8e60404ef62..dc0d192dd218 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -3477,12 +3477,16 @@ simplify_context::simplify_binary_operation_1 (rtx_code 
code,
}
 
   if (GET_CODE (opleft) == ASHIFT && GET_CODE (opright) == LSHIFTRT
-  && rtx_equal_p (XEXP (opleft, 0), XEXP (opright, 0))
-  && CONST_INT_P (XEXP (opleft, 1))
-  && CONST_INT_P (XEXP (opright, 1))
-  && (INTVAL (XEXP (opleft, 1)) + INTVAL (XEXP (opright, 1))
- == GET_MODE_UNIT_PRECISION (mode)))
-return gen_rtx_ROTATE (mode, XEXP (opright, 0), XEXP (opleft, 1));
+ && rtx_equal_p (XEXP (opleft, 0), XEXP (opright, 0)))
+   {
+ rtx leftcst = unwrap_const_vec_duplicate (XEXP (opleft, 1));
+ rtx rightcst = unwrap_const_vec_duplicate (XEXP (opright, 1));
+
+ if (CONST_INT_P (leftcst) && CONST_INT_P (rightcst)
+ && (INTVAL (leftcst) + INTVAL (rightcst)
+ == GET_MODE_UNIT_PRECISION (mode)))
+   return gen_rtx_ROTATE (mode, XEXP (opright, 0), XEXP (opleft, 1));
+   }
 
   /* Same, but for ashift that has been "simplified" to a wider mode
 by simplify_shift_const.  */


[gcc r15-4270] PR target/117048 aarch64: Use more canonical and optimization-friendly representation for XAR instruction

2024-10-11 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:1dcc6a1a67165a469d4cd9b6b39514c46cc656ad

commit r15-4270-g1dcc6a1a67165a469d4cd9b6b39514c46cc656ad
Author: Kyrylo Tkachov 
Date:   Wed Oct 9 09:40:33 2024 -0700

PR target/117048 aarch64: Use more canonical and optimization-friendly 
representation for XAR instruction

The pattern for the Advanced SIMD XAR instruction isn't very
optimization-friendly at the moment.
In the testcase from the PR, once simplify-rtx has done its work, it
generates the RTL:
(set (reg:V2DI 119 [ _14 ])
(rotate:V2DI (xor:V2DI (reg:V2DI 114 [ vect__1.12_16 ])
(reg:V2DI 116 [ *m1_01_8(D) ]))
(const_vector:V2DI [
(const_int 32 [0x20]) repeated x2
])))

which fails to match our XAR pattern because the pattern expects:
1) A ROTATERT instead of the ROTATE.  However, according to the RTL ops
documentation the preferred form of rotate-by-immediate is ROTATE, which
I take to mean it's the canonical form.
ROTATE (x, C) <-> ROTATERT (x, MODE_WIDTH - C) so it's better to match just
one canonical representation.
2) A CONST_INT shift amount whereas the midend asks for a repeated vector
constant.

These issues are fixed by introducing a dedicated expander for the
aarch64_xarqv2di name, needed by the arm_neon.h intrinsic, that translates
the intrinsic-level CONST_INT immediate (the right-rotate amount) into
a repeated vector constant subtracted from 64 to give the corresponding
left-rotate amount that is fed to the new representation for the XAR
define_insn that uses the ROTATE RTL code.  This is a similar approach
to how we handle the discrepancy between intrinsic-level and RTL-level
vector lane numbers for big-endian.
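
For reference, a minimal sketch (not from the commit) of the vxarq_u64
intrinsic this expander serves; the immediate 32 is the intrinsic-level
right-rotate amount, which the expander now converts into a left-rotate by
64 - 32 = 32:

#include <arm_neon.h>
#pragma GCC target "+sha3"

uint64x2_t
xor_rotate_right_32 (uint64x2_t a, uint64x2_t b)
{
  return vxarq_u64 (a, b, 32);
}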

With this patch and [1/2] the arithmetic parts of the testcase now simplify
to just one XAR instruction.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

gcc/
PR target/117048
* config/aarch64/aarch64-simd.md (aarch64_xarqv2di): Redefine into a
define_expand.
(*aarch64_xarqv2di_insn): Define.

gcc/testsuite/
PR target/117048
* g++.target/aarch64/pr117048.C: New test.

Diff:
---
 gcc/config/aarch64/aarch64-simd.md  | 33 
 gcc/testsuite/g++.target/aarch64/pr117048.C | 34 +
 2 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 11d405ed640f..bf272bc0b4eb 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -9046,18 +9046,43 @@
   [(set_attr "type" "crypto_sha3")]
 )
 
-(define_insn "aarch64_xarqv2di"
+(define_insn "*aarch64_xarqv2di_insn"
   [(set (match_operand:V2DI 0 "register_operand" "=w")
-   (rotatert:V2DI
+   (rotate:V2DI
 (xor:V2DI
  (match_operand:V2DI 1 "register_operand" "%w")
  (match_operand:V2DI 2 "register_operand" "w"))
-(match_operand:SI 3 "aarch64_simd_shift_imm_di" "Usd")))]
+(match_operand:V2DI 3 "aarch64_simd_lshift_imm" "Dl")))]
   "TARGET_SHA3"
-  "xar\\t%0.2d, %1.2d, %2.2d, %3"
+  {
+operands[3]
+  = GEN_INT (64 - INTVAL (unwrap_const_vec_duplicate (operands[3])));
+return "xar\\t%0.2d, %1.2d, %2.2d, %3";
+  }
   [(set_attr "type" "crypto_sha3")]
 )
 
+;; The semantics of the vxarq_u64 intrinsics treat the immediate argument as a
+;; right-rotate amount but the recommended representation of rotates by a
+;; constant in RTL is with the left ROTATE code.  Translate between the
+;; intrinsic-provided amount and the RTL operands in the expander here.
+;; The define_insn for XAR will translate back to instruction semantics in its
+;; output logic.
+(define_expand "aarch64_xarqv2di"
+  [(set (match_operand:V2DI 0 "register_operand")
+   (rotate:V2DI
+(xor:V2DI
+ (match_operand:V2DI 1 "register_operand")
+ (match_operand:V2DI 2 "register_operand"))
+(match_operand:SI 3 "aarch64_simd_shift_imm_di")))]
+  "TARGET_SHA3"
+  {
+operands[3]
+  = aarch64_simd_gen_const_vector_dup (V2DImode,
+  64 - INTVAL (operands[3]));
+  }
+)
+
 (define_insn "bcaxq4"
   [(set (match_operand:VQ_I 0 "register_operand" "=w")
(xor:VQ_I
diff --git a/gcc/testsuite/g++.target/aarch64/pr117048.C 
b/gcc/testsuite/g++.target/aarch64/pr117048.C
new file mode 100644
index ..ae46e5875e4c
--- /dev/null
+++ b/gcc/testsuite/g++.target/aarch64/pr117048.C
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#include <arm_neon.h>
+
+#pragma GCC target "+sha3"
+
+static inline uint64x2_t
+rotr64_vec(uint64x2_t x, const int b)
+{
+int64x2_t neg_b = vdupq_n_s64(-b);
+int64x2_t left_shift = vsubq_s64(vdupq_n_s64(64), vdupq_n_s64(b));
+
+uint64x2_t r

[gcc r15-4068] aarch64: Set Armv9-A generic L1 cache line size to 64 bytes

2024-10-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:f000cb8cbc58b23a91c84d47d69481904981a1d9

commit r15-4068-gf000cb8cbc58b23a91c84d47d69481904981a1d9
Author: Kyrylo Tkachov 
Date:   Fri Sep 20 05:11:39 2024 -0700

aarch64: Set Armv9-A generic L1 cache line size to 64 bytes

I'd like to use a value of 64 bytes for the L1 cache line size for Armv9-A
generic tuning.
As described in g:9a99559a478111f7fbeec29bd78344df7651c707 this value is 
used
to set the std::hardware_destructive_interference_size value which we want 
to
be not overly large when running concurrent applications on large core-count
systems.

The generic value for Armv8-A systems and the port baseline is 256 bytes
because that's what the A64FX CPU has, as set de-facto in
aarch64_override_options_internal.

But for Armv9-A CPUs as far as I know there isn't anything larger
than 64 bytes, so we should be able to use the smaller value here and reduce
the size of concurrent structs that use
std::hardware_destructive_interference_size to pad their fields.
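
As a rough illustration (not part of the patch), this is the kind of structure
affected: alignas with std::hardware_destructive_interference_size pads each
member onto its own cache line, so a 256-byte value forces 256-byte spacing
while 64 bytes keeps the struct much smaller on typical Armv9-A cores.

#include <atomic>
#include <new>

struct counters
{
  /* Keep each hot counter on its own destructive-interference boundary.  */
  alignas (std::hardware_destructive_interference_size) std::atomic<long> head;
  alignas (std::hardware_destructive_interference_size) std::atomic<long> tail;
};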

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

* config/aarch64/tuning_models/generic_armv9_a.h
(generic_armv9a_prefetch_tune): Define.
(generic_armv9_a_tunings): Use the above.

Diff:
---
 gcc/config/aarch64/tuning_models/generic_armv9_a.h | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
index 85ed40f6..76b3e4c9cf73 100644
--- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
+++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
@@ -207,6 +207,18 @@ static const struct cpu_vector_cost 
generic_armv9_a_vector_cost =
   &generic_armv9_a_vec_issue_info /* issue_info  */
 };
 
+/* Generic prefetch settings (which disable prefetch).  */
+static const cpu_prefetch_tune generic_armv9a_prefetch_tune =
+{
+  0,   /* num_slots  */
+  -1,  /* l1_cache_size  */
+  64,  /* l1_cache_line_size  */
+  -1,  /* l2_cache_size  */
+  true,/* prefetch_dynamic_strides */
+  -1,  /* minimum_stride */
+  -1   /* default_opt_level  */
+};
+
 static const struct tune_params generic_armv9_a_tunings =
 {
   &cortexa76_extra_costs,
@@ -239,7 +251,7 @@ static const struct tune_params generic_armv9_a_tunings =
   (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
| AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),/* tune_flags.  */
-  &generic_prefetch_tune,
+  &generic_armv9a_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
   AARCH64_LDP_STP_POLICY_ALWAYS   /* stp_policy_model.  */
 };


[gcc r15-4592] SVE intrinsics: Fold constant operands for svlsl.

2024-10-25 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:3e7549ece7c6b90b9e961778361ee2b65bf104a9

commit r15-4592-g3e7549ece7c6b90b9e961778361ee2b65bf104a9
Author: Soumya AR 
Date:   Thu Oct 17 09:30:35 2024 +0530

SVE intrinsics: Fold constant operands for svlsl.

This patch implements constant folding for svlsl. Test cases have been added
to check for the following cases:

Zero, merge, and don't care predication.
Shift by 0.
Shift by register width.
Overflow shift on signed and unsigned integers.
Shift on a negative integer.
Maximum possible shift, e.g. shift by 7 on an 8-bit integer.
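
For instance, the out-of-range case now folds to zero, matching the SVE LSL
semantics (a sketch, not one of the new tests verbatim):

#include <arm_sve.h>

svuint8_t
overflow_shift (svbool_t pg)
{
  /* The shift amount equals the element precision, so with this patch the
     call folds to an all-zero vector instead of emitting an LSL.  */
  return svlsl_n_u8_x (pg, svdup_u8 (1), 8);
}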

The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
regression.
OK for mainline?

Signed-off-by: Soumya AR 

gcc/ChangeLog:

* config/aarch64/aarch64-sve-builtins-base.cc (svlsl_impl::fold):
Try constant folding.
* config/aarch64/aarch64-sve-builtins.cc (aarch64_const_binop):
Return 0 if shift is out of range.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/const_fold_lsl_1.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc|  15 ++-
 gcc/config/aarch64/aarch64-sve-builtins.cc |   5 +-
 .../gcc.target/aarch64/sve/const_fold_lsl_1.c  | 142 +
 3 files changed, 160 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 327688756d1b..fe16d93adcd1 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -1926,6 +1926,19 @@ public:
   }
 };
 
+class svlsl_impl : public rtx_code_function
+{
+public:
+  CONSTEXPR svlsl_impl ()
+: rtx_code_function (ASHIFT, ASHIFT) {}
+
+  gimple *
+  fold (gimple_folder &f) const override
+  {
+return f.fold_const_binary (LSHIFT_EXPR);
+  }
+};
+
 class svmad_impl : public function_base
 {
 public:
@@ -3304,7 +3317,7 @@ FUNCTION (svldnf1uh, svldxf1_extend_impl, 
(TYPE_SUFFIX_u16, UNSPEC_LDNF1))
 FUNCTION (svldnf1uw, svldxf1_extend_impl, (TYPE_SUFFIX_u32, UNSPEC_LDNF1))
 FUNCTION (svldnt1, svldnt1_impl,)
 FUNCTION (svlen, svlen_impl,)
-FUNCTION (svlsl, rtx_code_function, (ASHIFT, ASHIFT))
+FUNCTION (svlsl, svlsl_impl,)
 FUNCTION (svlsl_wide, shift_wide, (ASHIFT, UNSPEC_ASHIFT_WIDE))
 FUNCTION (svlsr, rtx_code_function, (LSHIFTRT, LSHIFTRT))
 FUNCTION (svlsr_wide, shift_wide, (LSHIFTRT, UNSPEC_LSHIFTRT_WIDE))
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
b/gcc/config/aarch64/aarch64-sve-builtins.cc
index 41673745cfea..af6469fff716 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -1147,7 +1147,10 @@ aarch64_const_binop (enum tree_code code, tree arg1, 
tree arg2)
   /* Return 0 for division by 0, like SDIV and UDIV do.  */
   if (code == TRUNC_DIV_EXPR && integer_zerop (arg2))
return arg2;
-
+  /* Return 0 if shift amount is out of range. */
+  if (code == LSHIFT_EXPR
+ && wi::geu_p (wi::to_wide (arg2), TYPE_PRECISION (type)))
+   return build_int_cst (type, 0);
   if (!poly_int_binop (poly_res, code, arg1, arg2, sign, &overflow))
return NULL_TREE;
   return force_fit_type (type, poly_res, false,
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c 
b/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c
new file mode 100644
index ..6109558001a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c
@@ -0,0 +1,142 @@
+/* { dg-final { check-function-bodies "**" "" } } */
+/* { dg-options "-O2" } */
+
+#include "arm_sve.h"
+
+/*
+** s64_x:
+** mov z[0-9]+\.d, #20
+** ret
+*/
+svint64_t s64_x (svbool_t pg) {
+return svlsl_n_s64_x (pg, svdup_s64 (5), 2);  
+}
+
+/*
+** s64_x_vect:
+** mov z[0-9]+\.d, #20
+** ret
+*/
+svint64_t s64_x_vect (svbool_t pg) {
+return svlsl_s64_x (pg, svdup_s64 (5), svdup_u64 (2));  
+}
+
+/*
+** s64_z:
+** mov z[0-9]+\.d, p[0-7]/z, #20
+** ret
+*/
+svint64_t s64_z (svbool_t pg) {
+return svlsl_n_s64_z (pg, svdup_s64 (5), 2);  
+}
+
+/*
+** s64_z_vect:
+** mov z[0-9]+\.d, p[0-7]/z, #20
+** ret
+*/
+svint64_t s64_z_vect (svbool_t pg) {
+return svlsl_s64_z (pg, svdup_s64 (5), svdup_u64 (2));  
+}
+
+/*
+** s64_m_ptrue:
+** mov z[0-9]+\.d, #20
+** ret
+*/
+svint64_t s64_m_ptrue () {
+return svlsl_n_s64_m (svptrue_b64 (), svdup_s64 (5), 2);  
+}
+
+/*
+** s64_m_ptrue_vect:
+** mov z[0-9]+\.d, #20
+** ret
+*/
+svint64_t s64_m_ptrue_vect () {
+return svlsl_s64_m (svptrue_b64 (), svdup_s64 (5), svdup_u64 (2));  
+}
+
+/*
+** s64_m_pg:
+** mov z[0-9]+\.d, #5
+** lsl z[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, #2
+** ret
+*/
+svint64_t s64_m_pg (svbool_t pg) {
+return svlsl_n_s64_m (pg, svdup_s64 (5), 2);
+} 
+
+/*
+** s64_m_pg_vect:
+** mov z[0-9]+\.d, #

[gcc r15-4886] Revert "PR 117048: simplify-rtx: Simplify (X << C1) [+, ^] (X >> C2) into ROTATE"

2024-11-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:8762bb1b004c442b8dbb22a6d9eb0b7da4a3e59f

commit r15-4886-g8762bb1b004c442b8dbb22a6d9eb0b7da4a3e59f
Author: Kyrylo Tkachov 
Date:   Mon Nov 4 14:04:59 2024 +0100

Revert "PR 117048: simplify-rtx: Simplify (X << C1) [+,^] (X >> C2) into 
ROTATE"

This reverts commit de2bc6a7367aca2eecc925ebb64cfb86998d89f3.

Diff:
---
 gcc/simplify-rtx.cc | 204 +---
 1 file changed, 48 insertions(+), 156 deletions(-)

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 751c908113ef..ce8d3879270d 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -2820,104 +2820,6 @@ reverse_rotate_by_imm_p (machine_mode mode, unsigned 
int left, rtx op1)
   return false;
 }
 
-/* Analyse argument X to see if it represents an (ASHIFT X Y) operation
-   and return the expression to be shifted in SHIFT_OPND and the shift amount
-   in SHIFT_AMNT.  This is primarily used to group handling of ASHIFT (X, CST)
-   and (PLUS (X, X)) in one place.  If the expression is not equivalent to an
-   ASHIFT then return FALSE and set SHIFT_OPND and SHIFT_AMNT to NULL.  */
-
-static bool
-extract_ashift_operands_p (rtx x, rtx *shift_opnd, rtx *shift_amnt)
-{
-  if (GET_CODE (x) == ASHIFT)
-{
-  *shift_opnd = XEXP (x, 0);
-  *shift_amnt = XEXP (x, 1);
-  return true;
-}
-  if (GET_CODE (x) == PLUS && rtx_equal_p (XEXP (x, 0), XEXP (x, 1)))
-{
-  *shift_opnd = XEXP (x, 0);
-  *shift_amnt = CONST1_RTX (GET_MODE (x));
-  return true;
-}
-  *shift_opnd = NULL_RTX;
-  *shift_amnt = NULL_RTX;
-  return false;
-}
-
-/* OP0 and OP1 are combined under an operation of mode MODE that can
-   potentially result in a ROTATE expression.  Analyze the OP0 and OP1
-   and return the resulting ROTATE expression if so.  Return NULL otherwise.
-   This is used in detecting the patterns (X << C1) [+,|,^] (X >> C2) where
-   C1 + C2 == GET_MODE_UNIT_PRECISION (mode).
-   (X << C1) and (C >> C2) would be OP0 and OP1.  */
-
-static rtx
-simplify_rotate_op (rtx op0, rtx op1, machine_mode mode)
-{
-  /* Convert (ior (ashift A CX) (lshiftrt A CY)) where CX+CY equals the
- mode size to (rotate A CX).  */
-
-  rtx opleft = simplify_rtx (op0);
-  rtx opright = simplify_rtx (op1);
-  rtx ashift_opnd, ashift_amnt;
-  /* In some cases the ASHIFT is not a direct ASHIFT.  Look deeper and extract
- the relevant operands here.  */
-  bool ashift_op_p
-= extract_ashift_operands_p (op1, &ashift_opnd, &ashift_amnt);
-
-  if (ashift_op_p
- || GET_CODE (op1) == SUBREG)
-{
-  opleft = op1;
-  opright = op0;
-}
-  else
-{
-  opright = op1;
-  opleft = op0;
-  ashift_op_p
-   = extract_ashift_operands_p (opleft, &ashift_opnd, &ashift_amnt);
-}
-
-  if (ashift_op_p && GET_CODE (opright) == LSHIFTRT
-  && rtx_equal_p (ashift_opnd, XEXP (opright, 0)))
-{
-  rtx leftcst = unwrap_const_vec_duplicate (ashift_amnt);
-  rtx rightcst = unwrap_const_vec_duplicate (XEXP (opright, 1));
-
-  if (CONST_INT_P (leftcst) && CONST_INT_P (rightcst)
- && (INTVAL (leftcst) + INTVAL (rightcst)
- == GET_MODE_UNIT_PRECISION (mode)))
-   return gen_rtx_ROTATE (mode, XEXP (opright, 0), ashift_amnt);
-}
-
-  /* Same, but for ashift that has been "simplified" to a wider mode
- by simplify_shift_const.  */
-  scalar_int_mode int_mode, inner_mode;
-
-  if (GET_CODE (opleft) == SUBREG
-  && is_a  (mode, &int_mode)
-  && is_a  (GET_MODE (SUBREG_REG (opleft)),
-&inner_mode)
-  && GET_CODE (SUBREG_REG (opleft)) == ASHIFT
-  && GET_CODE (opright) == LSHIFTRT
-  && GET_CODE (XEXP (opright, 0)) == SUBREG
-  && known_eq (SUBREG_BYTE (opleft), SUBREG_BYTE (XEXP (opright, 0)))
-  && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
-  && rtx_equal_p (XEXP (SUBREG_REG (opleft), 0),
- SUBREG_REG (XEXP (opright, 0)))
-  && CONST_INT_P (XEXP (SUBREG_REG (opleft), 1))
-  && CONST_INT_P (XEXP (opright, 1))
-  && (INTVAL (XEXP (SUBREG_REG (opleft), 1))
-   + INTVAL (XEXP (opright, 1))
-== GET_MODE_PRECISION (int_mode)))
-   return gen_rtx_ROTATE (int_mode, XEXP (opright, 0),
-  XEXP (SUBREG_REG (opleft), 1));
-  return NULL_RTX;
-}
-
 /* Subroutine of simplify_binary_operation.  Simplify a binary operation
CODE with result mode MODE, operating on OP0 and OP1.  If OP0 and/or
OP1 are constant pool references, TRUEOP0 and TRUEOP1 represent the
@@ -2929,7 +2831,7 @@ simplify_context::simplify_binary_operation_1 (rtx_code 
code,
   rtx op0, rtx op1,
   rtx trueop0, rtx trueop1)
 {
-  rtx tem, reversed, elt0, elt1;
+  rtx tem, reversed, opleft, opright, elt0, elt1;
   HOST_WIDE_INT val;
   scalar_int_mode int_mode, inner_mode;
   poly_int64 offset;
@@ -3128,11 +3030,6 @@

[gcc r15-4889] PR 117048: simplify-rtx: Simplify (X << C1) [+, ^] (X >> C2) into ROTATE

2024-11-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:1c46a541c6957e8b0eee339d4cff46e951a5ad4e

commit r15-4889-g1c46a541c6957e8b0eee339d4cff46e951a5ad4e
Author: Kyrylo Tkachov 
Date:   Mon Nov 4 07:25:16 2024 -0800

PR 117048: simplify-rtx: Simplify (X << C1) [+,^] (X >> C2) into ROTATE

This is, in effect, a reapplication of 
de2bc6a7367aca2eecc925ebb64cfb86998d89f3
fixing the compile-time hog in var-tracking due to calling simplify_rtx
on the two arms of the rotation before detecting the ROTATE.
That is not necessary.

simplify-rtx can transform (X << C1) | (X >> C2) into ROTATE (X, C1) when
C1 + C2 == mode-width.  But the transformation is also valid for PLUS and 
XOR.
Indeed GIMPLE can also do the fold.  Let's teach RTL to do it too.

The motivating testcase for this is in AArch64 intrinsics:

uint64x2_t G2(uint64x2_t a, uint64x2_t b) {
uint64x2_t c = veorq_u64(a, b);
return veorq_u64(vaddq_u64(c, c), vshrq_n_u64(c, 63));
}

which I was hoping to fold to a single XAR (a ROTATE+XOR instruction) but
GCC was failing to detect the rotate operation for two reasons:
1) The combination of the two arms of the expression is done under XOR 
rather
than IOR that simplify-rtx currently supports.
2) The ASHIFT operation is actually a (PLUS X X) operation and thus is not
detected as the LHS of the two arms we require.

The patch fixes both issues.  The analysis of the two arms of the rotation
expression is factored out into a common helper simplify_rotate_op which is
then used in the PLUS, XOR, IOR cases in simplify_binary_operation_1.
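
For reference, a scalar sketch of the shapes that now get recognised, assuming
a 32-bit unsigned int (these are illustrations, not the new self-tests):

unsigned int
rot9_plus (unsigned int x)
{
  /* 9 + 23 == 32 and the two halves share no bits, so PLUS acts like IOR
     and the whole expression is a rotate left by 9.  */
  return (x << 9) + (x >> 23);
}

unsigned int
rot1_xor (unsigned int x)
{
  /* x + x is the (PLUS X X) form of a shift left by 1.  */
  return (x + x) ^ (x >> 31);
}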

The check-assembly testcase for this is added in the following patch because
it needs some extra AArch64 backend work, but I've added self-tests in this
patch to validate the transformation.

Bootstrapped and tested on aarch64-none-linux-gnu

Signed-off-by: Kyrylo Tkachov 

PR target/117048
* simplify-rtx.cc (extract_ashift_operands_p): Define.
(simplify_rotate_op): Likewise.
(simplify_context::simplify_binary_operation_1): Use the above in
the PLUS, IOR, XOR cases.
(test_vector_rotate): Define.
(test_vector_ops): Use the above.

Diff:
---
 gcc/simplify-rtx.cc | 204 +++-
 1 file changed, 156 insertions(+), 48 deletions(-)

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index ce8d3879270d..893c5f6e1ae0 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -2820,6 +2820,104 @@ reverse_rotate_by_imm_p (machine_mode mode, unsigned 
int left, rtx op1)
   return false;
 }
 
+/* Analyse argument X to see if it represents an (ASHIFT X Y) operation
+   and return the expression to be shifted in SHIFT_OPND and the shift amount
+   in SHIFT_AMNT.  This is primarily used to group handling of ASHIFT (X, CST)
+   and (PLUS (X, X)) in one place.  If the expression is not equivalent to an
+   ASHIFT then return FALSE and set SHIFT_OPND and SHIFT_AMNT to NULL.  */
+
+static bool
+extract_ashift_operands_p (rtx x, rtx *shift_opnd, rtx *shift_amnt)
+{
+  if (GET_CODE (x) == ASHIFT)
+{
+  *shift_opnd = XEXP (x, 0);
+  *shift_amnt = XEXP (x, 1);
+  return true;
+}
+  if (GET_CODE (x) == PLUS && rtx_equal_p (XEXP (x, 0), XEXP (x, 1)))
+{
+  *shift_opnd = XEXP (x, 0);
+  *shift_amnt = CONST1_RTX (GET_MODE (x));
+  return true;
+}
+  *shift_opnd = NULL_RTX;
+  *shift_amnt = NULL_RTX;
+  return false;
+}
+
+/* OP0 and OP1 are combined under an operation of mode MODE that can
+   potentially result in a ROTATE expression.  Analyze the OP0 and OP1
+   and return the resulting ROTATE expression if so.  Return NULL otherwise.
+   This is used in detecting the patterns (X << C1) [+,|,^] (X >> C2) where
+   C1 + C2 == GET_MODE_UNIT_PRECISION (mode).
+   (X << C1) and (C >> C2) would be OP0 and OP1.  */
+
+static rtx
+simplify_rotate_op (rtx op0, rtx op1, machine_mode mode)
+{
+  /* Convert (ior (ashift A CX) (lshiftrt A CY)) where CX+CY equals the
+ mode size to (rotate A CX).  */
+
+  rtx opleft = op0;
+  rtx opright = op1;
+  rtx ashift_opnd, ashift_amnt;
+  /* In some cases the ASHIFT is not a direct ASHIFT.  Look deeper and extract
+ the relevant operands here.  */
+  bool ashift_op_p
+= extract_ashift_operands_p (op1, &ashift_opnd, &ashift_amnt);
+
+  if (ashift_op_p
+ || GET_CODE (op1) == SUBREG)
+{
+  opleft = op1;
+  opright = op0;
+}
+  else
+{
+  opright = op1;
+  opleft = op0;
+  ashift_op_p
+   = extract_ashift_operands_p (opleft, &ashift_opnd, &ashift_amnt);
+}
+
+  if (ashift_op_p && GET_CODE (opright) == LSHIFTRT
+  && rtx_equal_p (ashift_opnd, XEXP (opright, 0)))
+{
+  rtx leftcst = unwrap_const_vec_duplicate (ashift_amnt);
+  rtx rightcst = unwrap_const_vec_duplicate (XEXP (opright, 1));
+
+  if (CONST_INT_P (leftcs

[gcc r15-4963] PR target/117449: Restrict vector rotate match and split to pre-reload

2024-11-05 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:161e246cf32f1298400aa3c1d86110490a3cd0ce

commit r15-4963-g161e246cf32f1298400aa3c1d86110490a3cd0ce
Author: Kyrylo Tkachov 
Date:   Tue Nov 5 05:10:22 2024 -0800

PR target/117449: Restrict vector rotate match and split to pre-reload

The vector rotate splitter has some logic to deal with post-reload splitting
but not all cases in aarch64_emit_opt_vec_rotate are post-reload-safe.
In particular the ROTATE+XOR expansion for TARGET_SHA3 can create RTL that
can later be simplified to a simple ROTATE post-reload, which would then
match the insn again and try to split it.
So do a clean split pre-reload and avoid going down this path post-reload
by restricting the insn_and_split to can_create_pseudo_p ().

Bootstrapped and tested on aarch64-none-linux.

Signed-off-by: Kyrylo Tkachov 
gcc/

PR target/117449
* config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm):
Match only when can_create_pseudo_p ().
* config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Assume
can_create_pseudo_p ().

gcc/testsuite/

PR target/117449
* gcc.c-torture/compile/pr117449.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-simd.md |  6 --
 gcc/config/aarch64/aarch64.cc  | 11 ++-
 gcc/testsuite/gcc.c-torture/compile/pr117449.c |  8 
 3 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index a91222b6e3b2..cfe95bd4c316 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1296,11 +1296,13 @@
 
 ;; After all the combinations and propagations of ROTATE have been
 ;; attempted split any remaining vector rotates into SHL + USRA sequences.
+;; Don't match this after reload as the various possible sequence for this
+;; require temporary registers.
 (define_insn_and_split "*aarch64_simd_rotate_imm"
   [(set (match_operand:VDQ_I 0 "register_operand" "=&w")
(rotate:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w")
  (match_operand:VDQ_I 2 "aarch64_simd_lshift_imm")))]
-  "TARGET_SIMD"
+  "TARGET_SIMD && can_create_pseudo_p ()"
   "#"
   "&& 1"
   [(set (match_dup 3)
@@ -1316,7 +1318,7 @@
 if (aarch64_emit_opt_vec_rotate (operands[0], operands[1], operands[2]))
   DONE;
 
-operands[3] = reload_completed ? operands[0] : gen_reg_rtx (mode);
+operands[3] = gen_reg_rtx (mode);
 rtx shft_amnt = unwrap_const_vec_duplicate (operands[2]);
 int bitwidth = GET_MODE_UNIT_SIZE (mode) * BITS_PER_UNIT;
 operands[4]
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 9347e06f0e9e..f2b53475adbe 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16030,6 +16030,8 @@ aarch64_emit_opt_vec_rotate (rtx dst, rtx reg, rtx 
amnt_vec)
   gcc_assert (CONST_INT_P (amnt));
   HOST_WIDE_INT rotamnt = UINTVAL (amnt);
   machine_mode mode = GET_MODE (reg);
+  /* Don't end up here after reload.  */
+  gcc_assert (can_create_pseudo_p ());
   /* Rotates by half the element width map down to REV* instructions and should
  always be preferred when possible.  */
   if (rotamnt == GET_MODE_UNIT_BITSIZE (mode) / 2
@@ -16037,11 +16039,10 @@ aarch64_emit_opt_vec_rotate (rtx dst, rtx reg, rtx 
amnt_vec)
 return true;
   /* 64 and 128-bit vector modes can use the XAR instruction
  when available.  */
-  else if (can_create_pseudo_p ()
-  && ((TARGET_SHA3 && mode == V2DImode)
-  || (TARGET_SVE2
-  && (known_eq (GET_MODE_SIZE (mode), 8)
-  || known_eq (GET_MODE_SIZE (mode), 16)
+  else if ((TARGET_SHA3 && mode == V2DImode)
+  || (TARGET_SVE2
+  && (known_eq (GET_MODE_SIZE (mode), 8)
+  || known_eq (GET_MODE_SIZE (mode), 16
 {
   rtx zeroes = aarch64_gen_shareable_zero (mode);
   rtx xar_op
diff --git a/gcc/testsuite/gcc.c-torture/compile/pr117449.c 
b/gcc/testsuite/gcc.c-torture/compile/pr117449.c
new file mode 100644
index ..8ae0071fca6b
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/compile/pr117449.c
@@ -0,0 +1,8 @@
+/* { dg-additional-options "-march=armv8.2-a+sha3" { target aarch64*-*-* } } */
+
+unsigned long *a;
+int i;
+void f() {
+  for (i = 0; i < 80; i++)
+a[i] = (a[i] >> 8 | a[i] << 64 - 8) ^ a[i];
+}


[gcc r15-4873] PR 117048: simplify-rtx: Simplify (X << C1) [+, ^] (X >> C2) into ROTATE

2024-11-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:de2bc6a7367aca2eecc925ebb64cfb86998d89f3

commit r15-4873-gde2bc6a7367aca2eecc925ebb64cfb86998d89f3
Author: Kyrylo Tkachov 
Date:   Tue Oct 15 06:32:31 2024 -0700

PR 117048: simplify-rtx: Simplify (X << C1) [+,^] (X >> C2) into ROTATE

simplify-rtx can transform (X << C1) | (X >> C2) into ROTATE (X, C1) when
C1 + C2 == mode-width.  But the transformation is also valid for PLUS and 
XOR.
Indeed GIMPLE can also do the fold.  Let's teach RTL to do it too.

The motivating testcase for this is in AArch64 intrinsics:

uint64x2_t G2(uint64x2_t a, uint64x2_t b) {
uint64x2_t c = veorq_u64(a, b);
return veorq_u64(vaddq_u64(c, c), vshrq_n_u64(c, 63));
}

which I was hoping to fold to a single XAR (a ROTATE+XOR instruction) but
GCC was failing to detect the rotate operation for two reasons:
1) The combination of the two arms of the expression is done under XOR 
rather
than IOR that simplify-rtx currently supports.
2) The ASHIFT operation is actually a (PLUS X X) operation and thus is not
detected as the LHS of the two arms we require.

The patch fixes both issues.  The analysis of the two arms of the rotation
expression is factored out into a common helper simplify_rotate which is
then used in the PLUS, XOR, IOR cases in simplify_binary_operation_1.

The check-assembly testcase for this is added in the following patch because
it needs some extra AArch64 backend work, but I've added self-tests in this
patch to validate the transformation.

Bootstrapped and tested on aarch64-none-linux-gnu

Signed-off-by: Kyrylo Tkachov 

PR target/117048
* simplify-rtx.cc (extract_ashift_operands_p): Define.
(simplify_rotate_op): Likewise.
(simplify_context::simplify_binary_operation_1): Use the above in
the PLUS, IOR, XOR cases.
(test_vector_rotate): Define.
(test_vector_ops): Use the above.

Diff:
---
 gcc/simplify-rtx.cc | 204 +++-
 1 file changed, 156 insertions(+), 48 deletions(-)

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 2c04ce960ee4..0ff72638d85f 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -2820,6 +2820,104 @@ reverse_rotate_by_imm_p (machine_mode mode, unsigned 
int left, rtx op1)
   return false;
 }
 
+/* Analyse argument X to see if it represents an (ASHIFT X Y) operation
+   and return the expression to be shifted in SHIFT_OPND and the shift amount
+   in SHIFT_AMNT.  This is primarily used to group handling of ASHIFT (X, CST)
+   and (PLUS (X, X)) in one place.  If the expression is not equivalent to an
+   ASHIFT then return FALSE and set SHIFT_OPND and SHIFT_AMNT to NULL.  */
+
+static bool
+extract_ashift_operands_p (rtx x, rtx *shift_opnd, rtx *shift_amnt)
+{
+  if (GET_CODE (x) == ASHIFT)
+{
+  *shift_opnd = XEXP (x, 0);
+  *shift_amnt = XEXP (x, 1);
+  return true;
+}
+  if (GET_CODE (x) == PLUS && rtx_equal_p (XEXP (x, 0), XEXP (x, 1)))
+{
+  *shift_opnd = XEXP (x, 0);
+  *shift_amnt = CONST1_RTX (GET_MODE (x));
+  return true;
+}
+  *shift_opnd = NULL_RTX;
+  *shift_amnt = NULL_RTX;
+  return false;
+}
+
+/* OP0 and OP1 are combined under an operation of mode MODE that can
+   potentially result in a ROTATE expression.  Analyze the OP0 and OP1
+   and return the resulting ROTATE expression if so.  Return NULL otherwise.
+   This is used in detecting the patterns (X << C1) [+,|,^] (X >> C2) where
+   C1 + C2 == GET_MODE_UNIT_PRECISION (mode).
+   (X << C1) and (C >> C2) would be OP0 and OP1.  */
+
+static rtx
+simplify_rotate_op (rtx op0, rtx op1, machine_mode mode)
+{
+  /* Convert (ior (ashift A CX) (lshiftrt A CY)) where CX+CY equals the
+ mode size to (rotate A CX).  */
+
+  rtx opleft = simplify_rtx (op0);
+  rtx opright = simplify_rtx (op1);
+  rtx ashift_opnd, ashift_amnt;
+  /* In some cases the ASHIFT is not a direct ASHIFT.  Look deeper and extract
+ the relevant operands here.  */
+  bool ashift_op_p
+= extract_ashift_operands_p (op1, &ashift_opnd, &ashift_amnt);
+
+  if (ashift_op_p
+ || GET_CODE (op1) == SUBREG)
+{
+  opleft = op1;
+  opright = op0;
+}
+  else
+{
+  opright = op1;
+  opleft = op0;
+  ashift_op_p
+   = extract_ashift_operands_p (opleft, &ashift_opnd, &ashift_amnt);
+}
+
+  if (ashift_op_p && GET_CODE (opright) == LSHIFTRT
+  && rtx_equal_p (ashift_opnd, XEXP (opright, 0)))
+{
+  rtx leftcst = unwrap_const_vec_duplicate (ashift_amnt);
+  rtx rightcst = unwrap_const_vec_duplicate (XEXP (opright, 1));
+
+  if (CONST_INT_P (leftcst) && CONST_INT_P (rightcst)
+ && (INTVAL (leftcst) + INTVAL (rightcst)
+ == GET_MODE_UNIT_PRECISION (mode)))
+   return gen_rtx_ROTATE (mode, XEXP (opright, 0), ashift_amnt);
+}
+
+  /* Same, but for a

[gcc r15-4874] aarch64: Use canonical RTL representation for SVE2 XAR and extend it to fixed-width modes

2024-11-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:1e5ff11142b2a37e7fd07a85248a0179bbb534be

commit r15-4874-g1e5ff11142b2a37e7fd07a85248a0179bbb534be
Author: Kyrylo Tkachov 
Date:   Tue Oct 22 03:27:47 2024 -0700

aarch64: Use canonical RTL representation for SVE2 XAR and extend it to 
fixed-width modes

The MD pattern for the XAR instruction in SVE2 is currently expressed with
non-canonical RTL by using a ROTATERT code with a constant rotate amount.
Fix it by using the left ROTATE code.  This necessitates splitting out the
expander separately to translate the immediate coming from the intrinsic
from a right-rotate to a left-rotate immediate.

Additionally, as the SVE2 XAR instruction is unpredicated and can handle all
element sizes from .b to .d, it is a good fit for implementing the XOR+ROTATE
operation for Advanced SIMD modes where the TARGET_SHA3 form cannot be used
(it only handles V2DImode operands).  Therefore let's extend the accepted
modes of the SVE2 pattern to include the Advanced SIMD integer modes.

This causes some tests for the svxar* intrinsics to fail because they now
simplify to a plain EOR when the rotate amount is the width of the element.
This simplification is desirable (EOR instructions have better or equal
throughput than XAR, and they are non-destructive of their input) so the
tests are adjusted.
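
A sketch of the simplification the adjusted tests now expect (assumes +sve2 is
enabled; the intrinsic spelling follows the ACLE):

#include <arm_sve.h>

svuint8_t
xar_by_element_width (svuint8_t a, svuint8_t b)
{
  /* Rotating by the full element width is a no-op, so this is just an EOR
     of the two inputs; the tests now scan for eor rather than xar here.  */
  return svxar_n_u8 (a, b, 8);
}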

For V2DImode XAR operations we should prefer the Advanced SIMD version when
it is available (TARGET_SHA3) because it is non-destructive, so restrict the
SVE2 pattern accordingly.  Tests are added to confirm this.

Bootstrapped and tested on aarch64-none-linux-gnu.
Ok for mainline?

Signed-off-by: Kyrylo Tkachov 

gcc/

* config/aarch64/iterators.md (SVE_ASIMD_FULL_I): New mode iterator.
* config/aarch64/aarch64-sve2.md (@aarch64_sve2_xar):
Use SVE_ASIMD_FULL_I modes.  Use ROTATE code for the rotate step.
Adjust output logic.
* config/aarch64/aarch64-sve-builtins-sve2.cc (svxar_impl): Define.
(svxar): Use the above.

gcc/testsuite/

* gcc.target/aarch64/xar_neon_modes.c: New test.
* gcc.target/aarch64/xar_v2di_nonsve.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/xar_s16.c: Scan for EOR rather 
than
XAR.
* gcc.target/aarch64/sve2/acle/asm/xar_s32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/xar_s64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/xar_s8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/xar_u16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/xar_u32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/xar_u64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/xar_u8.c: Likewise.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-sve2.cc| 18 +-
 gcc/config/aarch64/aarch64-sve2.md | 30 +++--
 gcc/config/aarch64/iterators.md|  3 ++
 .../gcc.target/aarch64/sve2/acle/asm/xar_s16.c | 18 ++
 .../gcc.target/aarch64/sve2/acle/asm/xar_s32.c | 18 ++
 .../gcc.target/aarch64/sve2/acle/asm/xar_s64.c | 18 ++
 .../gcc.target/aarch64/sve2/acle/asm/xar_s8.c  | 18 ++
 .../gcc.target/aarch64/sve2/acle/asm/xar_u16.c | 18 ++
 .../gcc.target/aarch64/sve2/acle/asm/xar_u32.c | 18 ++
 .../gcc.target/aarch64/sve2/acle/asm/xar_u64.c | 18 ++
 .../gcc.target/aarch64/sve2/acle/asm/xar_u8.c  | 18 ++
 gcc/testsuite/gcc.target/aarch64/xar_neon_modes.c  | 39 ++
 gcc/testsuite/gcc.target/aarch64/xar_v2di_nonsve.c | 16 +
 13 files changed, 191 insertions(+), 59 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
index 64f86035c30e..f0ab7400ef50 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
@@ -108,6 +108,22 @@ public:
   }
 };
 
+class svxar_impl : public function_base
+{
+public:
+  rtx
+  expand (function_expander &e) const override
+  {
+/* aarch64_sve2_xar represents this operation with a left-rotate RTX.
+   Convert the right-rotate amount from the intrinsic to fit this.  */
+machine_mode mode = e.vector_mode (0);
+HOST_WIDE_INT rot = GET_MODE_UNIT_BITSIZE (mode)
+   - INTVAL (e.args[2]);
+e.args[2] = aarch64_simd_gen_const_vector_dup (mode, rot);
+return e.use_exact_insn (code_for_aarch64_sve2_xar (mode));
+  }
+};
+
 class svcdot_impl : public function_base
 {
 public:
@@ -795,6 +811,6 @@ FUNCTION (svwhilege, while_comparison, (UNSPEC_WHILEGE, 
UNSPEC_WHILEHS))
 FUNCTION (svwhilegt, while_comparison, (UNSPEC_WHILEGT, UNSPEC_WHILEHI))
 FUNCTION (svwhilerw, svwhilerw_svwhilewr_impl, (UNSPEC_WHILERW))
 FUNCTION (svwhilewr, svwhilerw_svwh

[gcc r15-4875] PR 117048: aarch64: Add define_insn_and_split for vector ROTATE

2024-11-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:1411d39bc72515227de2e490eb8f629d8bf74c95

commit r15-4875-g1411d39bc72515227de2e490eb8f629d8bf74c95
Author: Kyrylo Tkachov 
Date:   Tue Oct 15 06:33:11 2024 -0700

PR 117048: aarch64: Add define_insn_and_split for vector ROTATE

The ultimate goal in this PR is to match the XAR pattern that is represented
as a (ROTATE (XOR X Y) VCST) from the ACLE intrinsics code in the testcase.
The first blocker for this was the missing recognition of ROTATE in
simplify-rtx, which is fixed in the previous patch.
The next problem is that once the ROTATE has been matched from the shifts
and orr/xor/plus, combine will try to match it in an insn before trying to
combine the XOR into it.  But as we don't have a backend pattern for a vector
ROTATE this recog fails and combine does not try the followup XOR+ROTATE
combination which would have succeeded.

This patch solves that by introducing a sort of "scaffolding" pattern for
vector ROTATE, which allows it to be combined into the XAR.
If it fails to be combined into anything the splitter will break it back
down into the SHL+USRA sequence that it would have emitted.
By having this splitter we can special-case some rotate amounts in the future
to emit more specialised instructions e.g. from the REV* family.
This can be done if the ROTATE is not combined into something else.
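
For reference, a sketch of the arithmetic behind the SHL+USRA fallback, written
with GCC vector extensions rather than intrinsics:

typedef unsigned int v4si __attribute__ ((vector_size (16)));

v4si
rotl9 (v4si x)
{
  /* The two halves have disjoint bits, so the addition is effectively an OR:
     SHL produces the low part and USRA shifts right and accumulates it.  */
  return (x << 9) + (x >> 23);
}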

This optimisation is done in the next patch in the series.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

gcc/

PR target/117048
* config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm):
New define_insn_and_split.

gcc/testsuite/

PR target/117048
* gcc.target/aarch64/simd/pr117048.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-simd.md   | 29 ++
 gcc/testsuite/gcc.target/aarch64/simd/pr117048.c | 73 
 2 files changed, 102 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index e456f693d2f3..08b121227eee 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1294,6 +1294,35 @@
   [(set_attr "type" "neon_shift_acc")]
 )
 
+;; After all the combinations and propagations of ROTATE have been
+;; attempted split any remaining vector rotates into SHL + USRA sequences.
+(define_insn_and_split "*aarch64_simd_rotate_imm"
+  [(set (match_operand:VDQ_I 0 "register_operand" "=&w")
+   (rotate:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w")
+ (match_operand:VDQ_I 2 "aarch64_simd_lshift_imm")))]
+  "TARGET_SIMD"
+  "#"
+  "&& 1"
+  [(set (match_dup 3)
+   (ashift:VDQ_I (match_dup 1)
+ (match_dup 2)))
+   (set (match_dup 0)
+   (plus:VDQ_I
+ (lshiftrt:VDQ_I
+   (match_dup 1)
+   (match_dup 4))
+ (match_dup 3)))]
+  {
+operands[3] = reload_completed ? operands[0] : gen_reg_rtx (mode);
+rtx shft_amnt = unwrap_const_vec_duplicate (operands[2]);
+int bitwidth = GET_MODE_UNIT_SIZE (mode) * BITS_PER_UNIT;
+operands[4]
+  = aarch64_simd_gen_const_vector_dup (mode,
+  bitwidth - INTVAL (shft_amnt));
+  }
+  [(set_attr "length" "8")]
+)
+
 (define_insn "aarch64_rsra_n_insn"
  [(set (match_operand:VSDQ_I_DI 0 "register_operand" "=w")
(plus:VSDQ_I_DI
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/pr117048.c 
b/gcc/testsuite/gcc.target/aarch64/simd/pr117048.c
new file mode 100644
index ..621c0f46fc4b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/pr117048.c
@@ -0,0 +1,73 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#include 
+
+#pragma GCC target "+sha3"
+
+/*
+** func_shl_eor:
+** xar v0\.2d, v([0-9]+)\.2d, v([0-9]+)\.2d, 63
+** ret 
+*/
+uint64x2_t
+func_shl_eor (uint64x2_t a, uint64x2_t b) {
+  uint64x2_t c = veorq_u64 (a, b);
+  return veorq_u64(vshlq_n_u64(c, 1), vshrq_n_u64(c, 63));
+}
+
+/*
+** func_add_eor:
+** xar v0\.2d, v([0-9]+)\.2d, v([0-9]+)\.2d, 63
+** ret 
+*/
+uint64x2_t
+func_add_eor (uint64x2_t a, uint64x2_t b) {
+  uint64x2_t c = veorq_u64 (a, b);
+  return veorq_u64(vaddq_u64(c, c), vshrq_n_u64(c, 63));
+}
+
+/*
+** func_shl_orr:
+** xar v0\.2d, v([0-9]+)\.2d, v([0-9]+)\.2d, 63
+** ret 
+*/
+uint64x2_t
+func_shl_orr (uint64x2_t a, uint64x2_t b) {
+  uint64x2_t c = veorq_u64 (a, b);
+  return vorrq_u64(vshlq_n_u64(c, 1), vshrq_n_u64(c, 63));
+}
+
+/*
+** func_add_orr:
+** xar v0\.2d, v([0-9]+)\.2d, v([0-9]+)\.2d, 63
+** ret 
+*/
+uint64x2_t
+func_add_orr (uint64x2_t a, uint64x2_t b) {
+  uint64x2_t c = veorq_u64 (a, b);
+  return vorrq_u64(vaddq_u64(c, c), vshrq_n_u64(c, 63));
+}
+
+/*
+** func_shl_add:
+** xar v0\.2d, v([0-9]+)\.2d, v([

[gcc r15-4876] aarch64: Optimize vector rotates as vector permutes where possible

2024-11-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:19757e1c28de07b45da03117e6ff7ae3e21e5a7a

commit r15-4876-g19757e1c28de07b45da03117e6ff7ae3e21e5a7a
Author: Kyrylo Tkachov 
Date:   Wed Oct 16 04:10:08 2024 -0700

aarch64: Optimize vector rotates as vector permutes where possible

Some vector rotate operations can be implemented in a single instruction
rather than using the fallback SHL+USRA sequence.
In particular, when the rotate amount is half the bitwidth of the element
we can use a REV64,REV32,REV16 instruction.
More generally, rotates by a byte amount can be implemented using vector
permutes.
This patch adds such a generic routine in expmed.cc called
expand_rotate_as_vec_perm that calculates the required permute indices
and uses the expand_vec_perm_const interface.
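
A sketch of the index calculation for the fixed-width, little-endian case (not
the actual expmed.cc routine, which also has to cope with VLA modes):

/* For lanes of ESZ bytes rotated left by ROT bits (ROT a multiple of 8),
   output byte J of each lane comes from input byte (J - ROT/8) mod ESZ.
   For 16-bit lanes and ROT == 8 this produces the byte-swap pattern that
   maps to REV16.  */
static void
rotate_perm_indices (unsigned esz, unsigned rot, unsigned nbytes,
                     unsigned char *sel)
{
  unsigned shift = rot / 8;
  for (unsigned i = 0; i < nbytes; i++)
    {
      unsigned lane = i / esz, byte = i % esz;
      sel[i] = lane * esz + (byte + esz - shift) % esz;
    }
}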

On aarch64 this ends up generating the single-instruction sequences above
where possible and can use LDR+TBL sequences too, which are a good choice.

With help from Richard, the routine should be VLA-safe.
However, the only use of expand_rotate_as_vec_perm introduced in this patch
is in aarch64-specific code that for now only handles fixed-width modes.

A runtime aarch64 test is added to ensure the permute indices are not messed
up.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

gcc/

* expmed.h (expand_rotate_as_vec_perm): Declare.
* expmed.cc (expand_rotate_as_vec_perm): Define.
* config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
Declare prototype.
* config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): 
Implement.
* config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm):
Call the above.

gcc/testsuite/

* gcc.target/aarch64/vec-rot-exec.c: New test.
* gcc.target/aarch64/simd/pr117048_2.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-protos.h|   1 +
 gcc/config/aarch64/aarch64-simd.md |   3 +
 gcc/config/aarch64/aarch64.cc  |  16 
 gcc/expmed.cc  |  44 +
 gcc/expmed.h   |   1 +
 gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c |  66 ++
 gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c| 101 +
 7 files changed, 232 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 05caad5e2fee..e8588e1cb177 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -851,6 +851,7 @@ bool aarch64_rnd_imm_p (rtx);
 bool aarch64_constant_address_p (rtx);
 bool aarch64_emit_approx_div (rtx, rtx, rtx);
 bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
+bool aarch64_emit_opt_vec_rotate (rtx, rtx, rtx);
 tree aarch64_vector_load_decl (tree);
 rtx aarch64_gen_callee_cookie (aarch64_isa_mode, arm_pcs);
 void aarch64_expand_call (rtx, rtx, rtx, bool);
diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 08b121227eee..a91222b6e3b2 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1313,6 +1313,9 @@
(match_dup 4))
  (match_dup 3)))]
   {
+if (aarch64_emit_opt_vec_rotate (operands[0], operands[1], operands[2]))
+  DONE;
+
 operands[3] = reload_completed ? operands[0] : gen_reg_rtx (mode);
 rtx shft_amnt = unwrap_const_vec_duplicate (operands[2]);
 int bitwidth = GET_MODE_UNIT_SIZE (mode) * BITS_PER_UNIT;
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 0fa7927d821a..7388f6b8fdf1 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16018,6 +16018,22 @@ aarch64_emit_approx_div (rtx quo, rtx num, rtx den)
   return true;
 }
 
+/* Emit an optimized sequence to perform a vector rotate
+   of REG by the vector constant amount AMNT and place the result
+   in DST.  Return true iff successful.  */
+
+bool
+aarch64_emit_opt_vec_rotate (rtx dst, rtx reg, rtx amnt)
+{
+  machine_mode mode = GET_MODE (reg);
+  /* Attempt to expand the rotate as a vector permute.
+ For some rotate amounts they can be single instructions and
+ even the general single-vector TBL permute has good throughput.  */
+  if (expand_rotate_as_vec_perm (mode, dst, reg, amnt))
+return true;
+  return false;
+}
+
 /* Return the number of instructions that can be issued per cycle.  */
 static int
 aarch64_sched_issue_rate (void)
diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index aa9f1abc8aba..2d5e5243ce8e 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -6286,6 +6286,50 @@ emit_store_flag_force (rtx target, enum rtx_code code, 
rtx op0, rtx op1,
   return target;
 }
 
+/* Expand a vector (left) rotate of MODE of X by an immediate AMT as a vector
+   permute operation.  Emit code to put the result in DST

[gcc r15-4877] aarch64: Emit XAR for vector rotates where possible

2024-11-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:14cb23e743e02e6923f7e46a14717e9f561f6723

commit r15-4877-g14cb23e743e02e6923f7e46a14717e9f561f6723
Author: Kyrylo Tkachov 
Date:   Tue Oct 22 07:52:36 2024 -0700

aarch64: Emit XAR for vector rotates where possible

We can make use of the integrated rotate step of the XAR instruction
to implement most vector integer rotates, as long as we zero out one
of the input registers for it.  This allows for a lower-latency sequence
than the fallback SHL+USRA, especially when we can hoist the zeroing
operation away from loops and hot parts.  This should be safe to do for
64-bit vectors as well even though the XAR instructions operate on 128-bit
values, as the bottom 64-bit result is later accessed through the right
subregs.
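
A scalar sketch of why zeroing one input turns XAR into a plain rotate (XAR
computes a rotate-right of the XOR of its two inputs):

#include <stdint.h>

uint64_t
xar_scalar (uint64_t a, uint64_t b, unsigned amt)
{
  /* Assumes 0 < amt < 64.  With b == 0 this is just a rotate of a.  */
  uint64_t x = a ^ b;
  return (x >> amt) | (x << (64 - amt));
}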

This strategy is used whenever we have XAR instructions; the logic
in aarch64_emit_opt_vec_rotate is adjusted to resort to
expand_rotate_as_vec_perm only when it's expected to generate a single REV*
instruction or when XAR instructions are not present.

With this patch we can generate for the input:
v4si
G1 (v4si r)
{
return (r >> 23) | (r << 9);
}

v8qi
G2 (v8qi r)
{
  return (r << 3) | (r >> 5);
}
the assembly for +sve2:
G1:
moviv31.4s, 0
xar z0.s, z0.s, z31.s, #23
ret

G2:
moviv31.4s, 0
xar z0.b, z0.b, z31.b, #5
ret

instead of the current:
G1:
shl v31.4s, v0.4s, 9
usrav31.4s, v0.4s, 23
mov v0.16b, v31.16b
ret
G2:
shl v31.8b, v0.8b, 3
usrav31.8b, v0.8b, 5
mov v0.8b, v31.8b
ret

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

gcc/

* config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Add
generation of XAR sequences when possible.

gcc/testsuite/

* gcc.target/aarch64/rotate_xar_1.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64.cc   | 34 +++--
 gcc/testsuite/gcc.target/aarch64/rotate_xar_1.c | 93 +
 2 files changed, 121 insertions(+), 6 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 7388f6b8fdf1..00f99d5004ca 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16019,17 +16019,39 @@ aarch64_emit_approx_div (rtx quo, rtx num, rtx den)
 }
 
 /* Emit an optimized sequence to perform a vector rotate
-   of REG by the vector constant amount AMNT and place the result
+   of REG by the vector constant amount AMNT_VEC and place the result
in DST.  Return true iff successful.  */
 
 bool
-aarch64_emit_opt_vec_rotate (rtx dst, rtx reg, rtx amnt)
+aarch64_emit_opt_vec_rotate (rtx dst, rtx reg, rtx amnt_vec)
 {
+  rtx amnt = unwrap_const_vec_duplicate (amnt_vec);
+  gcc_assert (CONST_INT_P (amnt));
+  HOST_WIDE_INT rotamnt = UINTVAL (amnt);
   machine_mode mode = GET_MODE (reg);
-  /* Attempt to expand the rotate as a vector permute.
- For some rotate amounts they can be single instructions and
- even the general single-vector TBL permute has good throughput.  */
-  if (expand_rotate_as_vec_perm (mode, dst, reg, amnt))
+  /* Rotates by half the element width map down to REV* instructions and should
+ always be preferred when possible.  */
+  if (rotamnt == GET_MODE_UNIT_BITSIZE (mode) / 2
+  && expand_rotate_as_vec_perm (mode, dst, reg, amnt))
+return true;
+  /* 64 and 128-bit vector modes can use the XAR instruction
+ when available.  */
+  else if (can_create_pseudo_p ()
+  && ((TARGET_SHA3 && mode == V2DImode)
+  || (TARGET_SVE2
+  && (known_eq (GET_MODE_SIZE (mode), 8)
+  || known_eq (GET_MODE_SIZE (mode), 16)
+{
+  rtx zeroes = aarch64_gen_shareable_zero (mode);
+  rtx xar_op
+   = gen_rtx_ROTATE (mode, gen_rtx_XOR (mode, reg, zeroes),
+   amnt_vec);
+  emit_set_insn (dst, xar_op);
+  return true;
+}
+  /* If none of the above, try to expand rotates by any byte amount as
+ permutes.  */
+  else if (expand_rotate_as_vec_perm (mode, dst, reg, amnt))
 return true;
   return false;
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/rotate_xar_1.c 
b/gcc/testsuite/gcc.target/aarch64/rotate_xar_1.c
new file mode 100644
index ..73007701cfb4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/rotate_xar_1.c
@@ -0,0 +1,93 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+typedef char __attribute__ ((vector_size (16))) v16qi;
+typedef unsigned short __attribute__ ((vector_size (16))) v8hi;
+typedef unsigned int __attribute__ ((vector_size (16))) v4si;
+typedef unsigned

[gcc r15-4878] simplify-rtx: Simplify ROTATE:HI (X:HI, 8) into BSWAP:HI (X)

2024-11-04 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:f1d16cd9236e0d59c04018e2dccc09dd736bf1df

commit r15-4878-gf1d16cd9236e0d59c04018e2dccc09dd736bf1df
Author: Kyrylo Tkachov 
Date:   Thu Oct 17 06:39:57 2024 -0700

simplify-rtx: Simplify ROTATE:HI (X:HI, 8) into BSWAP:HI (X)

With the recent patch to improve detection of vector rotates at the RTL level,
combine now tries matching a V8HImode rotate by 8 in the example in the
testcase.  We can teach AArch64 to emit a REV16 instruction for such a rotate
but really this operation corresponds to the RTL code BSWAP, for which we
already have the right patterns.  BSWAP is arguably a simpler representation
than ROTATE here because it has only one operand, so let's teach simplify-rtx
to generate it.
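
A scalar sketch of the equivalence being exploited:

unsigned short
rot8 (unsigned short x)
{
  /* For a 16-bit value a rotate by 8 in either direction swaps the two
     bytes, i.e. it behaves like __builtin_bswap16 (x).  */
  return (unsigned short) ((x >> 8) | (x << 8));
}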

With this patch the testcase now generates the simplest form:
.L2:
ldr q31, [x1, x0]
rev16   v31.16b, v31.16b
str q31, [x0, x2]
add x0, x0, 16
cmp x0, 2048
bne .L2

instead of the previous:
.L2:
ldr q31, [x1, x0]
shl v30.8h, v31.8h, 8
usrav30.8h, v31.8h, 8
str q30, [x0, x2]
add x0, x0, 16
cmp x0, 2048
bne .L2

IMO ideally the bswap detection would have been done during vectorisation
time and used the expanders for that, but teaching simplify-rtx to do this
transformation is fairly straightforward and, unlike at tree level, we have
the native RTL BSWAP code.  This change is not enough to generate the
equivalent sequence in SVE, but that is something that should be tackled
separately.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

gcc/

* simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
Simplify (rotate:HI x:HI, 8) -> (bswap:HI x:HI).

gcc/testsuite/

* gcc.target/aarch64/rot_to_bswap.c: New test.

Diff:
---
 gcc/simplify-rtx.cc |  8 
 gcc/testsuite/gcc.target/aarch64/rot_to_bswap.c | 23 +++
 2 files changed, 31 insertions(+)

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 0ff72638d85f..751c908113ef 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -4328,6 +4328,14 @@ simplify_context::simplify_binary_operation_1 (rtx_code 
code,
  mode, op0, new_amount_rtx);
}
 #endif
+  /* ROTATE/ROTATERT:HI (X:HI, 8) is BSWAP:HI (X).  Other combinations
+such as SImode with a count of 16 do not correspond to RTL BSWAP
+semantics.  */
+  tem = unwrap_const_vec_duplicate (trueop1);
+  if (GET_MODE_UNIT_BITSIZE (mode) == (2 * BITS_PER_UNIT)
+ && CONST_INT_P (tem) && INTVAL (tem) == BITS_PER_UNIT)
+   return simplify_gen_unary (BSWAP, mode, op0, mode);
+
   /* FALLTHRU */
 case ASHIFTRT:
   if (trueop1 == CONST0_RTX (mode))
diff --git a/gcc/testsuite/gcc.target/aarch64/rot_to_bswap.c 
b/gcc/testsuite/gcc.target/aarch64/rot_to_bswap.c
new file mode 100644
index ..f5b002da8853
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/rot_to_bswap.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param aarch64-autovec-preference=asimd-only" } */
+
+#pragma GCC target "+nosve"
+
+
+#define N 1024
+
+unsigned short in_s[N];
+unsigned short out_s[N];
+
+void
+foo16 (void)
+{
+  for (unsigned i = 0; i < N; i++)
+  {
+unsigned short x = in_s[i];
+out_s[i] = (x >> 8) | (x << 8);
+  }
+}
+
+/* { dg-final { scan-assembler {\trev16\tv([123])?[0-9]\.16b, 
v([123])?[0-9]\.16b} } } */
+


[gcc r15-4721] aarch64: Use implementation namespace for vxarq_u64 immediate argument

2024-10-28 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:7c0e4963d5de12b44414c82419d3d9e426f718b6

commit r15-4721-g7c0e4963d5de12b44414c82419d3d9e426f718b6
Author: Kyrylo Tkachov 
Date:   Mon Oct 28 15:19:07 2024 +0100

aarch64: Use implementation namespace for vxarq_u64 immediate argument

Looks like this immediate variable was missed out when I last fixed the
namespace issues in arm_neon.h.  Fixed in the obvious manner.
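
For context, the double-underscore spelling matters because user code may
legitimately define a macro with the old parameter name; a hypothetical example
(not from the testsuite):

#define imm6 1          /* valid user code that must not break the header */
#include <arm_neon.h>   /* vxarq_u64's parameter is now __imm6, so this still compiles */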

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

* config/aarch64/arm_neon.h (vxarq_u64): Rename imm6 to __imm6.

Diff:
---
 gcc/config/aarch64/arm_neon.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 730d9d3fa815..d3533f3ee6fe 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -26952,9 +26952,9 @@ vrax1q_u64 (uint64x2_t __a, uint64x2_t __b)
 
 __extension__ extern __inline uint64x2_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
-vxarq_u64 (uint64x2_t __a, uint64x2_t __b, const int imm6)
+vxarq_u64 (uint64x2_t __a, uint64x2_t __b, const int __imm6)
 {
-  return __builtin_aarch64_xarqv2di_uuus (__a, __b,imm6);
+  return __builtin_aarch64_xarqv2di_uuus (__a, __b, __imm6);
 }
 
 __extension__ extern __inline uint8x16_t


[gcc r15-3705] aarch64: Define l1_cache_line_size for -mcpu=neoverse-v2

2024-09-19 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:9a99559a478111f7fbeec29bd78344df7651c707

commit r15-3705-g9a99559a478111f7fbeec29bd78344df7651c707
Author: Kyrylo Tkachov 
Date:   Wed Sep 11 06:58:35 2024 -0700

aarch64: Define l1_cache_line_size for -mcpu=neoverse-v2

This is a small patch that sets the L1 cache line size for Neoverse V2.
Unlike the other cache-related constants in there this value is not used just
for SW prefetch generation (which we want to avoid for Neoverse V2 presently).
It's also used to set std::hardware_destructive_interference_size.
See the links and recent discussions in PR116662 for reference.
Some CPU tunings in aarch64 set this value to something useful, but for
generic tuning we use the conservative 256, which forces 256-byte alignment
in such atomic structures.  Using a smaller value can decrease the size of such
structs during layout and should not present an ABI problem as
std::hardware_destructive_interference_size is not intended to be used for
structs in an external interface, and GCC warns about such uses.
Another place where the L1 cache line size is used is in phiopt for
-fhoist-adjacent-loads where conditional accesses to adjacent struct members
can be speculatively loaded as long as they are within the same L1 cache line.
e.g.
struct S { int i; int j; };

int
bar (struct S *x, int y)
{
  int r;
  if (y)
r = x->i;
  else
r = x->j;
  return r;
}

The Neoverse V2 L1 cache line is 64 bytes according to the TRM, so set it to
that. The rest of the prefetch parameters inherit from the generic tuning so
we don't do anything extra for software prefetches.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

* config/aarch64/tuning_models/neoversev2.h 
(neoversev2_prefetch_tune):
Define.
(neoversev2_tunings): Use it.

Diff:
---
 gcc/config/aarch64/tuning_models/neoversev2.h | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
b/gcc/config/aarch64/tuning_models/neoversev2.h
index 52aad7d4a433..e7e37e6b3b6e 100644
--- a/gcc/config/aarch64/tuning_models/neoversev2.h
+++ b/gcc/config/aarch64/tuning_models/neoversev2.h
@@ -206,6 +206,19 @@ static const struct cpu_vector_cost neoversev2_vector_cost 
=
   &neoversev2_vec_issue_info /* issue_info  */
 };
 
+/* Prefetch settings.  Disable software prefetch generation but set L1 cache
+   line size.  */
+static const cpu_prefetch_tune neoversev2_prefetch_tune =
+{
+  0,   /* num_slots  */
+  -1,  /* l1_cache_size  */
+  64,  /* l1_cache_line_size  */
+  -1,  /* l2_cache_size  */
+  true,/* prefetch_dynamic_strides */
+  -1,  /* minimum_stride */
+  -1   /* default_opt_level  */
+};
+
 static const struct tune_params neoversev2_tunings =
 {
   &cortexa76_extra_costs,
@@ -244,7 +257,7 @@ static const struct tune_params neoversev2_tunings =
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
| AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
| AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),  /* tune_flags.  */
-  &generic_prefetch_tune,
+  &neoversev2_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
   AARCH64_LDP_STP_POLICY_ALWAYS   /* stp_policy_model.  */
 };


[gcc r15-6023] aarch64: Update cpuinfo strings for some arch features

2024-12-09 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:0b79d8b98ec086fccd4714c1ff66ff4382780183

commit r15-6023-g0b79d8b98ec086fccd4714c1ff66ff4382780183
Author: Kyrylo Tkachov 
Date:   Tue Dec 3 04:12:09 2024 -0800

aarch64: Update cpuinfo strings for some arch features

The entries for some recently-added arch features were missing the cpuinfo
string used in -march=native detection.  Presumably the Linux kernel had not
specified such a string at the time the GCC support was added.
But I see that current versions of Linux do have strings for these features
in the arch/arm64/kernel/cpuinfo.c file in the kernel tree.

This patch adds them.  This fixes the strings for the f32mm and f64mm features
which I think were using the wrong string.  The kernel exposes them with an
"sve" prefix.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

gcc/

* config/aarch64/aarch64-option-extensions.def (sve-b16b16,
f32mm, f64mm, sve2p1, sme-f64f64, sme-i16i64, sme-b16b16,
sme-f16f16, mops): Update FEATURE_STRING field.

Diff:
---
 gcc/config/aarch64/aarch64-option-extensions.def | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-option-extensions.def 
b/gcc/config/aarch64/aarch64-option-extensions.def
index 52c3e7b57668..eb0459d2962d 100644
--- a/gcc/config/aarch64/aarch64-option-extensions.def
+++ b/gcc/config/aarch64/aarch64-option-extensions.def
@@ -166,13 +166,13 @@ AARCH64_FMV_FEATURE("rpres", RPRES, ())
 AARCH64_OPT_FMV_EXTENSION("sve", SVE, (SIMD, F16), (), (), "sve")
 
 /* This specifically does not imply +sve.  */
-AARCH64_OPT_EXTENSION("sve-b16b16", SVE_B16B16, (), (), (), "")
+AARCH64_OPT_EXTENSION("sve-b16b16", SVE_B16B16, (), (), (), "sveb16b16")
 
-AARCH64_OPT_EXTENSION("f32mm", F32MM, (SVE), (), (), "f32mm")
+AARCH64_OPT_EXTENSION("f32mm", F32MM, (SVE), (), (), "svef32mm")
 
 AARCH64_FMV_FEATURE("f32mm", SVE_F32MM, (F32MM))
 
-AARCH64_OPT_EXTENSION("f64mm", F64MM, (SVE), (), (), "f64mm")
+AARCH64_OPT_EXTENSION("f64mm", F64MM, (SVE), (), (), "svef64mm")
 
 AARCH64_FMV_FEATURE("f64mm", SVE_F64MM, (F64MM))
 
@@ -195,7 +195,7 @@ AARCH64_OPT_EXTENSION("sve2-sm4", SVE2_SM4, (SVE2, SM4), 
(), (), "svesm4")
 
 AARCH64_FMV_FEATURE("sve2-sm4", SVE_SM4, (SVE2_SM4))
 
-AARCH64_OPT_EXTENSION("sve2p1", SVE2p1, (SVE2), (), (), "")
+AARCH64_OPT_EXTENSION("sve2p1", SVE2p1, (SVE2), (), (), "sve2p1")
 
 AARCH64_OPT_FMV_EXTENSION("sme", SME, (BF16, SVE2), (), (), "sme")
 
@@ -215,11 +215,11 @@ AARCH64_OPT_EXTENSION("pauth", PAUTH, (), (), (), "paca 
pacg")
 
 AARCH64_OPT_EXTENSION("ls64", LS64, (), (), (), "")
 
-AARCH64_OPT_EXTENSION("sme-f64f64", SME_F64F64, (SME), (), (), "")
+AARCH64_OPT_EXTENSION("sme-f64f64", SME_F64F64, (SME), (), (), "smef64f64")
 
 AARCH64_FMV_FEATURE("sme-f64f64", SME_F64, (SME_F64F64))
 
-AARCH64_OPT_EXTENSION("sme-i16i64", SME_I16I64, (SME), (), (), "")
+AARCH64_OPT_EXTENSION("sme-i16i64", SME_I16I64, (SME), (), (), "smei16i64")
 
 AARCH64_FMV_FEATURE("sme-i16i64", SME_I64, (SME_I16I64))
 
@@ -227,11 +227,11 @@ AARCH64_OPT_FMV_EXTENSION("sme2", SME2, (SME), (), (), 
"sme2")
 
 AARCH64_OPT_EXTENSION("sme2p1", SME2p1, (SME2), (), (), "sme2p1")
 
-AARCH64_OPT_EXTENSION("sme-b16b16", SME_B16B16, (SME2, SVE_B16B16), (), (), "")
+AARCH64_OPT_EXTENSION("sme-b16b16", SME_B16B16, (SME2, SVE_B16B16), (), (), 
"smeb16b16")
 
-AARCH64_OPT_EXTENSION("sme-f16f16", SME_F16F16, (SME2), (), (), "")
+AARCH64_OPT_EXTENSION("sme-f16f16", SME_F16F16, (SME2), (), (), "smef16f16")
 
-AARCH64_OPT_EXTENSION("mops", MOPS, (), (), (), "")
+AARCH64_OPT_EXTENSION("mops", MOPS, (), (), (), "mops")
 
 AARCH64_OPT_EXTENSION("cssc", CSSC, (), (), (), "cssc")


[gcc r15-8083] Aarch64: Add FMA and FMAF intrinsic and corresponding tests

2025-03-17 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:f4f7216c56fe2f67c72db5b7c4afa220725f3ed1

commit r15-8083-gf4f7216c56fe2f67c72db5b7c4afa220725f3ed1
Author: Ayan Shafqat 
Date:   Mon Mar 17 09:28:27 2025 +0100

Aarch64: Add FMA and FMAF intrinsic and corresponding tests

This patch introduces inline definitions for the __fma and __fmaf
functions in arm_acle.h for Aarch64 targets. These definitions rely on
__builtin_fma and __builtin_fmaf to ensure proper inlining and to meet
the ACLE requirements [1].

The patch has been tested locally using a crosstool-NG sysroot for
Aarch64, confirming that the generated code uses the expected fused
multiply-accumulate instructions (fmadd).

[1] 
https://arm-software.github.io/acle/main/acle.html#fused-multiply-accumulate-fma

gcc/ChangeLog:

* config/aarch64/arm_acle.h (__fma, __fmaf): New functions.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/acle/acle_fma.c: New test.

Diff:
---
 gcc/config/aarch64/arm_acle.h| 14 ++
 gcc/testsuite/gcc.target/aarch64/acle/acle_fma.c | 17 +
 2 files changed, 31 insertions(+)

diff --git a/gcc/config/aarch64/arm_acle.h b/gcc/config/aarch64/arm_acle.h
index 7976c117daf7..d9e2401ea9f6 100644
--- a/gcc/config/aarch64/arm_acle.h
+++ b/gcc/config/aarch64/arm_acle.h
@@ -129,6 +129,20 @@ __jcvt (double __a)
 
 #pragma GCC pop_options
 
+__extension__ extern __inline double
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+__fma (double __x, double __y, double __z)
+{
+  return __builtin_fma (__x, __y, __z);
+}
+
+__extension__ extern __inline float
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+__fmaf (float __x, float __y, float __z)
+{
+  return __builtin_fmaf (__x, __y, __z);
+}
+
 #pragma GCC push_options
 #pragma GCC target ("+nothing+frintts")
 __extension__ extern __inline float
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/acle_fma.c 
b/gcc/testsuite/gcc.target/aarch64/acle/acle_fma.c
new file mode 100644
index ..9363a75b593d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/acle/acle_fma.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#include "arm_acle.h"
+
+double test_acle_fma (double x, double y, double z)
+{
+  return __fma (x, y, z);
+}
+
+float test_acle_fmaf (float x, float y, float z)
+{
+  return __fmaf (x, y, z);
+}
+
+/* { dg-final { scan-assembler-times "fmadd\td\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "fmadd\ts\[0-9\]" 1 } } */
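
An illustrative contrast (not part of the commit; it assumes the arm_acle.h definitions above are in scope): because __builtin_fma has single-rounding fma semantics, the intrinsic keeps the fused instruction even when contraction of a*b + c is disabled.

#include <arm_acle.h>

/* Compile with e.g. -O2 -ffp-contract=off.  */
double fused (double a, double b, double c)
{
  return __fma (a, b, c);   /* still a single fmadd (one rounding)  */
}

double maybe_unfused (double a, double b, double c)
{
  return a * b + c;         /* fmul + fadd once contraction is disabled  */
}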


[gcc r15-8290] aarch64: Add +sve2p1 to -march=armv9.4-a flags

2025-03-19 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:d46be332818361d7a31065c6d46df7181505ab30

commit r15-8290-gd46be332818361d7a31065c6d46df7181505ab30
Author: Kyrylo Tkachov 
Date:   Mon Mar 17 08:24:18 2025 -0700

aarch64: Add +sve2p1 to -march=armv9.4-a flags

The ArmARM says:
"In an Armv9.4 implementation, if FEAT_SVE2 is implemented, FEAT_SVE2p1
is implemented."

We should enable +sve2p1 as part of -march=armv9.4-a, which this patch does.
This makes gcc consistent with gas.
Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

gcc/

* config/aarch64/aarch64-arches.def (...): Add SVE2p1.
* doc/invoke.texi (AArch64 Options): Document +sve2p1 in
-march=armv9.4-a.

Diff:
---
 gcc/config/aarch64/aarch64-arches.def | 2 +-
 gcc/doc/invoke.texi   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-arches.def 
b/gcc/config/aarch64/aarch64-arches.def
index 34a792d69510..bf56fe9b4449 100644
--- a/gcc/config/aarch64/aarch64-arches.def
+++ b/gcc/config/aarch64/aarch64-arches.def
@@ -45,7 +45,7 @@ AARCH64_ARCH("armv9-a",   generic_armv9_a,   V9A  , 
9,  (V8_5A, SVE2))
 AARCH64_ARCH("armv9.1-a", generic_armv9_a,   V9_1A, 9,  (V8_6A, V9A))
 AARCH64_ARCH("armv9.2-a", generic_armv9_a,   V9_2A, 9,  (V8_7A, V9_1A))
 AARCH64_ARCH("armv9.3-a", generic_armv9_a,   V9_3A, 9,  (V8_8A, V9_2A))
-AARCH64_ARCH("armv9.4-a", generic_armv9_a,   V9_4A, 9,  (V8_9A, V9_3A))
+AARCH64_ARCH("armv9.4-a", generic_armv9_a,   V9_4A, 9,  (V8_9A, V9_3A, 
SVE2p1))
 AARCH64_ARCH("armv9.5-a", generic_armv9_a,   V9_5A, 9,  (V9_4A, CPA, 
FAMINMAX, LUT))
 
 #undef AARCH64_ARCH
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 7bef9bbf1c00..1819bcdcdfb9 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -21708,7 +21708,7 @@ and the features that they enable by default:
 @item @samp{armv9.1-a} @tab Armv9.1-A @tab @samp{armv9-a}, @samp{+bf16}, 
@samp{+i8mm}
 @item @samp{armv9.2-a} @tab Armv9.2-A @tab @samp{armv9.1-a}, @samp{+wfxt}, 
@samp{+xs}
 @item @samp{armv9.3-a} @tab Armv9.3-A @tab @samp{armv9.2-a}, @samp{+mops}
-@item @samp{armv9.4-a} @tab Armv9.4-A @tab @samp{armv9.3-a}
+@item @samp{armv9.4-a} @tab Armv9.4-A @tab @samp{armv9.3-a}, @samp{+sve2p1}
 @item @samp{armv9.5-a} @tab Armv9.4-A @tab @samp{armv9.4-a}, @samp{cpa}, 
@samp{+faminmax}, @samp{+lut}
 @item @samp{armv8-r} @tab Armv8-R @tab @samp{armv8-r}
 @end multitable
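
An illustrative way to observe the new default from user code (not part of the commit; it assumes GCC defines the ACLE feature macro __ARM_FEATURE_SVE2p1 for this extension):

/* Expected to compile cleanly once -march=armv9.4-a implies +sve2p1:
     gcc -march=armv9.4-a -c check_sve2p1.c  */
#ifndef __ARM_FEATURE_SVE2p1
#error "SVE2p1 not enabled by -march=armv9.4-a"
#endif
int dummy;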


[gcc r15-8570] aarch64: Add support for -mcpu=olympus

2025-03-27 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:1aa49fe0b18deb98e324cee18538d26b46829611

commit r15-8570-g1aa49fe0b18deb98e324cee18538d26b46829611
Author: Dhruv Chawla 
Date:   Wed Mar 19 09:34:09 2025 -0700

aarch64: Add support for -mcpu=olympus

This adds support for the NVIDIA Olympus core to the AArch64 backend. The
initial patch does not add any special tuning decisions, and those may come
later.

Bootstrapped and tested on aarch64-none-linux-gnu.

gcc/ChangeLog:

* config/aarch64/aarch64-cores.def (olympus): New entry.
* config/aarch64/aarch64-tune.md: Regenerate.
* doc/invoke.texi (AArch64 Options): Document the above.

Signed-off-by: Dhruv Chawla 

Diff:
---
 gcc/config/aarch64/aarch64-cores.def | 3 +++
 gcc/config/aarch64/aarch64-tune.md   | 2 +-
 gcc/doc/invoke.texi  | 2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-cores.def 
b/gcc/config/aarch64/aarch64-cores.def
index 5ac81332b67c..0e22d72976ef 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -207,6 +207,9 @@ AARCH64_CORE("neoverse-v3ae", neoversev3ae, cortexa57, 
V9_2A, (SVE2_BITPERM, RNG
 
 AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
 
+/* NVIDIA ('N') cores. */
+AARCH64_CORE("olympus", olympus, cortexa57, V9_2A, (SVE2_BITPERM, RNG, LS64, 
MEMTAG, PROFILE, FAMINMAX, FP8DOT2, LUT, SVE2_AES, SVE2_SHA3, SVE2_SM4), 
neoversev3, 0x4e, 0x10, -1)
+
 /* Generic Architecture Processors.  */
 AARCH64_CORE("generic",  generic, cortexa53, V8A,  (), generic, 0x0, 0x0, -1)
 AARCH64_CORE("generic-armv8-a",  generic_armv8_a, cortexa53, V8A, (), 
generic_armv8_a, 0x0, 0x0, -1)
diff --git a/gcc/config/aarch64/aarch64-tune.md 
b/gcc/config/aarch64/aarch64-tune.md
index 54c65cbf68df..56a914f12b9c 100644
--- a/gcc/config/aarch64/aarch64-tune.md
+++ b/gcc/config/aarch64/aarch64-tune.md
@@ -1,5 +1,5 @@
 ;; -*- buffer-read-only: t -*-
 ;; Generated automatically by gentune.sh from aarch64-cores.def
 (define_attr "tune"
-   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,fujitsu_monaka,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexr82ae,cortexa510,cortexa520,cortexa520ae,cortexa710,cortexa715,cortexa720,cortexa720ae,cortexa725,cortexx2,cortexx3,cortexx4,cortexx925,neoversen2,cobalt100,neoversen3,neoversev2,grace,neoversev3,neoversev3ae,demeter,generic,generic_armv8_a,generic_armv9_a"
+   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,fujitsu_monaka,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexr82ae,cortexa510,cortexa520,cortexa520ae,cortexa710,cortexa715,cortexa720,cortexa720ae,cortexa725,cortexx2,cortexx3,cortexx4,cortexx925,neoversen2,cobalt100,neoversen3,neoversev2,grace,neoversev3,neoversev3ae,demeter,olympus,generic,generic_armv8_a,generic_armv9_a"
(const (symbol_ref "((enum attr_tune) aarch64_tune)")))
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index a2d327d7c997..515d91ac2e3a 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -21745,7 +21745,7 @@ performance of the code.  Permissible values for this 
option are:
 @samp{oryon-1},
 @samp{neoverse-512tvb}, @samp{neoverse-e1}, @samp{neoverse-n1},
 @samp{neoverse-n2}, @samp{neoverse-v1}, @samp{neoverse-v2}, @samp{grace},
-@samp{neoverse-v3}, @samp{neoverse-v3ae}, @samp{neoverse-n3},
+@samp{neoverse-v3}, @samp{neoverse-v3ae}, @samp{neoverse-n3}, @samp{olympus},
 @samp{cortex-a725}, @samp{cortex-x925},
 @samp{qdf24xx}, @samp{saphira}, @samp{phecda}, @samp{xgene1}, @samp{vulcan},
 @samp{octeontx}, @samp{octeontx81},  @samp{octeontx83},


[gcc r16-70] Document locality partitioning params in invoke.texi

2025-04-22 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:b7fb18dcf79476aa30ed2ad6cc2eaeab1f266107

commit r16-70-gb7fb18dcf79476aa30ed2ad6cc2eaeab1f266107
Author: Kyrylo Tkachov 
Date:   Thu Apr 17 10:50:44 2025 -0700

Document locality partitioning params in invoke.texi

Filip Kastl pointed out that contrib/check-params-in-docs.py complains
about params not documented in invoke.texi, so this patch adds the short
explanation from params.opt for these to the invoke.texi section.
Thanks for the reminder.

Signed-off-by: Kyrylo Tkachov 

gcc/

* doc/invoke.texi (lto-partition-locality-frequency-cutoff,
lto-partition-locality-size-cutoff, lto-max-locality-partition):
Document.

Diff:
---
 gcc/doc/invoke.texi | 13 +
 1 file changed, 13 insertions(+)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 020442aa032e..1a43b3b839d7 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -16963,6 +16963,19 @@ Size of max partition for WHOPR (in estimated 
instructions).
 to provide an upper bound for individual size of partition.
 Meant to be used only with balanced partitioning.
 
+@item lto-partition-locality-frequency-cutoff
+The denominator n of fraction 1/n of the execution frequency of callee to be
+cloned for a particular caller. Special value of 0 dictates to always clone
+without a cut-off.
+
+@item lto-partition-locality-size-cutoff
+Size cut-off for callee including inlined calls to be cloned for a particular
+caller.
+
+@item lto-max-locality-partition
+Maximal size of a locality partition for LTO (in estimated instructions).
+Value of 0 results in default value being used.
+
 @item lto-max-streaming-parallelism
 Maximal number of parallel processes used for LTO streaming.
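
The params above only matter when locality partitioning itself is enabled. An illustrative compile test exercising them (values chosen arbitrarily, not taken from the commit):

/* { dg-do compile } */
/* { dg-options "-O2 -flto -fipa-reorder-for-locality --param lto-partition-locality-frequency-cutoff=4 --param lto-partition-locality-size-cutoff=1000 --param lto-max-locality-partition=10000" } */
int f (void) { return 0; }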


[gcc r15-9566] Document locality partitioning params in invoke.texi

2025-04-22 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:7faf49825ca47b07bca7b966db66f9f50121076f

commit r15-9566-g7faf49825ca47b07bca7b966db66f9f50121076f
Author: Kyrylo Tkachov 
Date:   Thu Apr 17 10:50:44 2025 -0700

Document locality partitioning params in invoke.texi

Filip Kastl pointed out that contrib/check-params-in-docs.py complains
about params not documented in invoke.texi, so this patch adds the short
explanation from params.opt for these to the invoke.texi section.
Thanks for the reminder.

Signed-off-by: Kyrylo Tkachov 

gcc/

* doc/invoke.texi (lto-partition-locality-frequency-cutoff,
lto-partition-locality-size-cutoff, lto-max-locality-partition):
Document.

(cherry picked from commit b7fb18dcf79476aa30ed2ad6cc2eaeab1f266107)

Diff:
---
 gcc/doc/invoke.texi | 13 +
 1 file changed, 13 insertions(+)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 14a78fd236f6..c2e1bf8031b8 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -16961,6 +16961,19 @@ Size of max partition for WHOPR (in estimated 
instructions).
 to provide an upper bound for individual size of partition.
 Meant to be used only with balanced partitioning.
 
+@item lto-partition-locality-frequency-cutoff
+The denominator n of fraction 1/n of the execution frequency of callee to be
+cloned for a particular caller. Special value of 0 dictates to always clone
+without a cut-off.
+
+@item lto-partition-locality-size-cutoff
+Size cut-off for callee including inlined calls to be cloned for a particular
+caller.
+
+@item lto-max-locality-partition
+Maximal size of a locality partition for LTO (in estimated instructions).
+Value of 0 results in default value being used.
+
 @item lto-max-streaming-parallelism
 Maximal number of parallel processes used for LTO streaming.


[gcc r15-9571] aarch64: Update FP8 dependencies for -mcpu=olympus

2025-04-22 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:f1b9d0380a4b5896b95f088799661d903ede80b5

commit r15-9571-gf1b9d0380a4b5896b95f088799661d903ede80b5
Author: Kyrylo Tkachov 
Date:   Tue Apr 22 06:17:34 2025 -0700

aarch64: Update FP8 dependencies for -mcpu=olympus

We had not noticed that after g:299a8e2dc667e795991bc439d2cad5ea5bd379e2 the
FP8FMA and FP8DOT4 features aren't implied by FP8DOT2.  The intent is for
-mcpu=olympus to support all of them.
Fix the definition to include the relevant sub-features explicitly.

Signed-off-by: Kyrylo Tkachov 

gcc/

* config/aarch64/aarch64-cores.def (olympus): Add fp8fma, fp8dot4
explicitly.

(cherry picked from commit 5d5e8e87a42af8c0d962fa16dc9835fb71778250)

Diff:
---
 gcc/config/aarch64/aarch64-cores.def | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64-cores.def 
b/gcc/config/aarch64/aarch64-cores.def
index 7f204fd0ac92..12096300d012 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -224,7 +224,7 @@ AARCH64_CORE("neoverse-v3ae", neoversev3ae, cortexa57, 
V9_2A, (SVE2_BITPERM, RNG
 AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
 
 /* NVIDIA ('N') cores. */
-AARCH64_CORE("olympus", olympus, cortexa57, V9_2A, (SVE2_BITPERM, RNG, LS64, 
MEMTAG, PROFILE, FAMINMAX, FP8DOT2, LUT, SVE2_AES, SVE2_SHA3, SVE2_SM4), 
neoversev3, 0x4e, 0x10, -1)
+AARCH64_CORE("olympus", olympus, cortexa57, V9_2A, (SVE2_BITPERM, RNG, LS64, 
MEMTAG, PROFILE, FAMINMAX, FP8FMA, FP8DOT2, FP8DOT4, LUT, SVE2_AES, SVE2_SHA3, 
SVE2_SM4), neoversev3, 0x4e, 0x10, -1)
 
 /* Generic Architecture Processors.  */
 AARCH64_CORE("generic",  generic, cortexa53, V8A,  (), generic, 0x0, 0x0, -1)


[gcc r16-82] aarch64: Update FP8 dependencies for -mcpu=olympus

2025-04-22 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:5d5e8e87a42af8c0d962fa16dc9835fb71778250

commit r16-82-g5d5e8e87a42af8c0d962fa16dc9835fb71778250
Author: Kyrylo Tkachov 
Date:   Tue Apr 22 06:17:34 2025 -0700

aarch64: Update FP8 dependencies for -mcpu=olympus

We had not noticed that after g:299a8e2dc667e795991bc439d2cad5ea5bd379e2 the
FP8FMA and FP8DOT4 features aren't implied by FP8DOT2.  The intent is for
-mcpu=olympus to support all of them.
Fix the definition to include the relevant sub-features explicitly.

Signed-off-by: Kyrylo Tkachov 

gcc/

* config/aarch64/aarch64-cores.def (olympus): Add fp8fma, fp8dot4
explicitly.

Diff:
---
 gcc/config/aarch64/aarch64-cores.def | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64-cores.def 
b/gcc/config/aarch64/aarch64-cores.def
index 7f204fd0ac92..12096300d012 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -224,7 +224,7 @@ AARCH64_CORE("neoverse-v3ae", neoversev3ae, cortexa57, 
V9_2A, (SVE2_BITPERM, RNG
 AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
 
 /* NVIDIA ('N') cores. */
-AARCH64_CORE("olympus", olympus, cortexa57, V9_2A, (SVE2_BITPERM, RNG, LS64, 
MEMTAG, PROFILE, FAMINMAX, FP8DOT2, LUT, SVE2_AES, SVE2_SHA3, SVE2_SM4), 
neoversev3, 0x4e, 0x10, -1)
+AARCH64_CORE("olympus", olympus, cortexa57, V9_2A, (SVE2_BITPERM, RNG, LS64, 
MEMTAG, PROFILE, FAMINMAX, FP8FMA, FP8DOT2, FP8DOT4, LUT, SVE2_AES, SVE2_SHA3, 
SVE2_SM4), neoversev3, 0x4e, 0x10, -1)
 
 /* Generic Architecture Processors.  */
 AARCH64_CORE("generic",  generic, cortexa53, V8A,  (), generic, 0x0, 0x0, -1)


[gcc r15-9580] opts.cc: Use opts rather than opts_set for validating -fipa-reorder-for-locality

2025-04-24 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:f873125cbd513c6c8ec9f223e52cd5ad68fa7bbd

commit r15-9580-gf873125cbd513c6c8ec9f223e52cd5ad68fa7bbd
Author: Kyrylo Tkachov 
Date:   Thu Apr 24 05:33:54 2025 -0700

opts.cc: Use opts rather than opts_set for validating 
-fipa-reorder-for-locality

This ensures -fno-ipa-reorder-for-locality doesn't complain with an explicit
-flto-partition=.

Signed-off-by: Kyrylo Tkachov 

* opts.cc (validate_ipa_reorder_locality_lto_partition): Check opts
instead of opts_set for x_flag_ipa_reorder_for_locality.
(finish_options): Update call site.

(cherry picked from commit fbf8443961f484ed7fb7e953206af1ee60558a24)

Diff:
---
 gcc/opts.cc | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/gcc/opts.cc b/gcc/opts.cc
index 5480b9dff2ce..ffcbdfef0bd9 100644
--- a/gcc/opts.cc
+++ b/gcc/opts.cc
@@ -1037,18 +1037,19 @@ report_conflicting_sanitizer_options (struct 
gcc_options *opts, location_t loc,
 }
 }
 
-/* Validate from OPTS_SET that when -fipa-reorder-for-locality is
+/* Validate from OPTS and OPTS_SET that when -fipa-reorder-for-locality is
enabled no explicit -flto-partition is also passed as the locality cloning
pass uses its own partitioning scheme.  */
 
 static void
-validate_ipa_reorder_locality_lto_partition (struct gcc_options *opts_set)
+validate_ipa_reorder_locality_lto_partition (struct gcc_options *opts,
+struct gcc_options *opts_set)
 {
   static bool validated_p = false;
 
   if (opts_set->x_flag_lto_partition)
 {
-  if (opts_set->x_flag_ipa_reorder_for_locality && !validated_p)
+  if (opts->x_flag_ipa_reorder_for_locality && !validated_p)
error ("%<-fipa-reorder-for-locality%> is incompatible with"
   " an explicit %qs option", "-flto-partition");
 }
@@ -1267,7 +1268,7 @@ finish_options (struct gcc_options *opts, struct 
gcc_options *opts_set,
   if (opts->x_flag_reorder_blocks_and_partition)
 SET_OPTION_IF_UNSET (opts, opts_set, flag_reorder_functions, 1);
 
-  validate_ipa_reorder_locality_lto_partition (opts_set);
+  validate_ipa_reorder_locality_lto_partition (opts, opts_set);
 
   /* The -gsplit-dwarf option requires -ggnu-pubnames.  */
   if (opts->x_dwarf_split_debug_info)
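
An illustrative reproducer for the behaviour being fixed (not taken from the commit): before this change the combination below was rejected even though locality reordering is explicitly disabled.

/* { dg-do compile } */
/* { dg-options "-O2 -flto -fno-ipa-reorder-for-locality -flto-partition=balanced" } */
int main (void) { return 0; }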


[gcc r15-9579] opts.cc Simplify handling of explicit -flto-partition= and -fipa-reorder-for-locality

2025-04-24 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:340a9f871f02a23a4480a7b5f4eacadf689089e5

commit r15-9579-g340a9f871f02a23a4480a7b5f4eacadf689089e5
Author: Kyrylo Tkachov 
Date:   Thu Apr 24 00:34:09 2025 -0700

opts.cc Simplify handling of explicit -flto-partition= and 
-fipa-reorder-for-locality

The handling of an explicit -flto-partition= and -fipa-reorder-for-locality
should be simpler.  No need to have a new default option.  We can use 
opts_set
to check if -flto-partition is explicitly set and use that information in 
the
error handling.
Remove -flto-partition=default and update accordingly.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

gcc/

* common.opt (LTO_PARTITION_DEFAULT): Delete.
(flto-partition=): Change default back to balanced.
* flag-types.h (lto_partition_model): Remove LTO_PARTITION_DEFAULT.
* opts.cc (validate_ipa_reorder_locality_lto_partition):
Check opts_set->x_flag_lto_partition instead of 
LTO_PARTITION_DEFAULT.
(finish_options): Remove handling of LTO_PARTITION_DEFAULT.

gcc/testsuite/

* gcc.dg/completion-2.c: Remove check for default.

(cherry picked from commit 040f94d1f63c3607a2f3faf5c329c3b2b6bf7d1e)

Diff:
---
 gcc/common.opt  |  5 +
 gcc/flag-types.h|  3 +--
 gcc/opts.cc | 11 ---
 gcc/testsuite/gcc.dg/completion-2.c |  1 -
 4 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/gcc/common.opt b/gcc/common.opt
index 88d987e6ab14..e3fa0dacec4c 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2278,9 +2278,6 @@ Number of cache entries in incremental LTO after which to 
prune old entries.
 Enum
 Name(lto_partition_model) Type(enum lto_partition_model) UnknownError(unknown 
LTO partitioning model %qs)
 
-EnumValue
-Enum(lto_partition_model) String(default) Value(LTO_PARTITION_DEFAULT)
-
 EnumValue
 Enum(lto_partition_model) String(none) Value(LTO_PARTITION_NONE)
 
@@ -2300,7 +2297,7 @@ EnumValue
 Enum(lto_partition_model) String(cache) Value(LTO_PARTITION_CACHE)
 
 flto-partition=
-Common Joined RejectNegative Enum(lto_partition_model) Var(flag_lto_partition) 
Init(LTO_PARTITION_DEFAULT)
+Common Joined RejectNegative Enum(lto_partition_model) Var(flag_lto_partition) 
Init(LTO_PARTITION_BALANCED)
 Specify the algorithm to partition symbols and vars at linktime.
 
 ; The initial value of -1 comes from Z_DEFAULT_COMPRESSION in zlib.h.
diff --git a/gcc/flag-types.h b/gcc/flag-types.h
index db573768c23d..9a3cc4a2e165 100644
--- a/gcc/flag-types.h
+++ b/gcc/flag-types.h
@@ -404,8 +404,7 @@ enum lto_partition_model {
   LTO_PARTITION_BALANCED = 2,
   LTO_PARTITION_1TO1 = 3,
   LTO_PARTITION_MAX = 4,
-  LTO_PARTITION_CACHE = 5,
-  LTO_PARTITION_DEFAULT= 6
+  LTO_PARTITION_CACHE = 5
 };
 
 /* flag_lto_locality_cloning initialization values.  */
diff --git a/gcc/opts.cc b/gcc/opts.cc
index 5e7b77dab2fd..5480b9dff2ce 100644
--- a/gcc/opts.cc
+++ b/gcc/opts.cc
@@ -1037,17 +1037,16 @@ report_conflicting_sanitizer_options (struct 
gcc_options *opts, location_t loc,
 }
 }
 
-/* Validate from OPTS and OPTS_SET that when -fipa-reorder-for-locality is
+/* Validate from OPTS_SET that when -fipa-reorder-for-locality is
enabled no explicit -flto-partition is also passed as the locality cloning
pass uses its own partitioning scheme.  */
 
 static void
-validate_ipa_reorder_locality_lto_partition (struct gcc_options *opts,
-struct gcc_options *opts_set)
+validate_ipa_reorder_locality_lto_partition (struct gcc_options *opts_set)
 {
   static bool validated_p = false;
 
-  if (opts->x_flag_lto_partition != LTO_PARTITION_DEFAULT)
+  if (opts_set->x_flag_lto_partition)
 {
   if (opts_set->x_flag_ipa_reorder_for_locality && !validated_p)
error ("%<-fipa-reorder-for-locality%> is incompatible with"
@@ -1268,9 +1267,7 @@ finish_options (struct gcc_options *opts, struct 
gcc_options *opts_set,
   if (opts->x_flag_reorder_blocks_and_partition)
 SET_OPTION_IF_UNSET (opts, opts_set, flag_reorder_functions, 1);
 
-  validate_ipa_reorder_locality_lto_partition (opts, opts_set);
-  if (opts_set->x_flag_lto_partition != LTO_PARTITION_DEFAULT)
-opts_set->x_flag_lto_partition = opts->x_flag_lto_partition = 
LTO_PARTITION_BALANCED;
+  validate_ipa_reorder_locality_lto_partition (opts_set);
 
   /* The -gsplit-dwarf option requires -ggnu-pubnames.  */
   if (opts->x_dwarf_split_debug_info)
diff --git a/gcc/testsuite/gcc.dg/completion-2.c 
b/gcc/testsuite/gcc.dg/completion-2.c
index 46c511c8c2f4..99e653122016 100644
--- a/gcc/testsuite/gcc.dg/completion-2.c
+++ b/gcc/testsuite/gcc.dg/completion-2.c
@@ -5,7 +5,6 @@
 -flto-partition=1to1
 -flto-partition=balanced
 -flto-partition=cache
--flto-partition=default
 -flto-partition=max
 -flto-partition=none
 -flto-partition=one
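
The key point is that GCC tracks, for every option, both its effective value and whether the user set it explicitly, so no sentinel "default" enum value is needed. A self-contained sketch of that idiom (simplified names, not GCC code):

#include <stdbool.h>
#include <stdio.h>

/* opts holds the effective values; opts_set records which of them were
   given explicitly on the command line.  */
struct options { int lto_partition; bool ipa_reorder_for_locality; };

static void
validate (const struct options *opts, const struct options *opts_set)
{
  if (opts_set->lto_partition && opts->ipa_reorder_for_locality)
    fprintf (stderr, "error: -fipa-reorder-for-locality is incompatible "
                     "with an explicit -flto-partition option\n");
}

int
main (void)
{
  struct options opts     = { 2, true };   /* balanced, locality enabled  */
  struct options opts_set = { 1, true };   /* both passed explicitly  */
  validate (&opts, &opts_set);             /* diagnoses the conflict  */
  return 0;
}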


[gcc r16-108] opts.cc Simplify handling of explicit -flto-partition= and -fipa-reorder-for-locality

2025-04-24 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:040f94d1f63c3607a2f3faf5c329c3b2b6bf7d1e

commit r16-108-g040f94d1f63c3607a2f3faf5c329c3b2b6bf7d1e
Author: Kyrylo Tkachov 
Date:   Thu Apr 24 00:34:09 2025 -0700

opts.cc Simplify handling of explicit -flto-partition= and 
-fipa-reorder-for-locality

The handling of an explicit -flto-partition= and -fipa-reorder-for-locality
should be simpler.  No need to have a new default option.  We can use 
opts_set
to check if -flto-partition is explicitly set and use that information in 
the
error handling.
Remove -flto-partition=default and update accordingly.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

gcc/

* common.opt (LTO_PARTITION_DEFAULT): Delete.
(flto-partition=): Change default back to balanced.
* flag-types.h (lto_partition_model): Remove LTO_PARTITION_DEFAULT.
* opts.cc (validate_ipa_reorder_locality_lto_partition):
Check opts_set->x_flag_lto_partition instead of 
LTO_PARTITION_DEFAULT.
(finish_options): Remove handling of LTO_PARTITION_DEFAULT.

gcc/testsuite/

* gcc.dg/completion-2.c: Remove check for default.

Diff:
---
 gcc/common.opt  |  5 +
 gcc/flag-types.h|  3 +--
 gcc/opts.cc | 11 ---
 gcc/testsuite/gcc.dg/completion-2.c |  1 -
 4 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/gcc/common.opt b/gcc/common.opt
index 88d987e6ab14..e3fa0dacec4c 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2278,9 +2278,6 @@ Number of cache entries in incremental LTO after which to 
prune old entries.
 Enum
 Name(lto_partition_model) Type(enum lto_partition_model) UnknownError(unknown 
LTO partitioning model %qs)
 
-EnumValue
-Enum(lto_partition_model) String(default) Value(LTO_PARTITION_DEFAULT)
-
 EnumValue
 Enum(lto_partition_model) String(none) Value(LTO_PARTITION_NONE)
 
@@ -2300,7 +2297,7 @@ EnumValue
 Enum(lto_partition_model) String(cache) Value(LTO_PARTITION_CACHE)
 
 flto-partition=
-Common Joined RejectNegative Enum(lto_partition_model) Var(flag_lto_partition) 
Init(LTO_PARTITION_DEFAULT)
+Common Joined RejectNegative Enum(lto_partition_model) Var(flag_lto_partition) 
Init(LTO_PARTITION_BALANCED)
 Specify the algorithm to partition symbols and vars at linktime.
 
 ; The initial value of -1 comes from Z_DEFAULT_COMPRESSION in zlib.h.
diff --git a/gcc/flag-types.h b/gcc/flag-types.h
index db573768c23d..9a3cc4a2e165 100644
--- a/gcc/flag-types.h
+++ b/gcc/flag-types.h
@@ -404,8 +404,7 @@ enum lto_partition_model {
   LTO_PARTITION_BALANCED = 2,
   LTO_PARTITION_1TO1 = 3,
   LTO_PARTITION_MAX = 4,
-  LTO_PARTITION_CACHE = 5,
-  LTO_PARTITION_DEFAULT= 6
+  LTO_PARTITION_CACHE = 5
 };
 
 /* flag_lto_locality_cloning initialization values.  */
diff --git a/gcc/opts.cc b/gcc/opts.cc
index 5e7b77dab2fd..5480b9dff2ce 100644
--- a/gcc/opts.cc
+++ b/gcc/opts.cc
@@ -1037,17 +1037,16 @@ report_conflicting_sanitizer_options (struct 
gcc_options *opts, location_t loc,
 }
 }
 
-/* Validate from OPTS and OPTS_SET that when -fipa-reorder-for-locality is
+/* Validate from OPTS_SET that when -fipa-reorder-for-locality is
enabled no explicit -flto-partition is also passed as the locality cloning
pass uses its own partitioning scheme.  */
 
 static void
-validate_ipa_reorder_locality_lto_partition (struct gcc_options *opts,
-struct gcc_options *opts_set)
+validate_ipa_reorder_locality_lto_partition (struct gcc_options *opts_set)
 {
   static bool validated_p = false;
 
-  if (opts->x_flag_lto_partition != LTO_PARTITION_DEFAULT)
+  if (opts_set->x_flag_lto_partition)
 {
   if (opts_set->x_flag_ipa_reorder_for_locality && !validated_p)
error ("%<-fipa-reorder-for-locality%> is incompatible with"
@@ -1268,9 +1267,7 @@ finish_options (struct gcc_options *opts, struct 
gcc_options *opts_set,
   if (opts->x_flag_reorder_blocks_and_partition)
 SET_OPTION_IF_UNSET (opts, opts_set, flag_reorder_functions, 1);
 
-  validate_ipa_reorder_locality_lto_partition (opts, opts_set);
-  if (opts_set->x_flag_lto_partition != LTO_PARTITION_DEFAULT)
-opts_set->x_flag_lto_partition = opts->x_flag_lto_partition = 
LTO_PARTITION_BALANCED;
+  validate_ipa_reorder_locality_lto_partition (opts_set);
 
   /* The -gsplit-dwarf option requires -ggnu-pubnames.  */
   if (opts->x_dwarf_split_debug_info)
diff --git a/gcc/testsuite/gcc.dg/completion-2.c 
b/gcc/testsuite/gcc.dg/completion-2.c
index 46c511c8c2f4..99e653122016 100644
--- a/gcc/testsuite/gcc.dg/completion-2.c
+++ b/gcc/testsuite/gcc.dg/completion-2.c
@@ -5,7 +5,6 @@
 -flto-partition=1to1
 -flto-partition=balanced
 -flto-partition=cache
--flto-partition=default
 -flto-partition=max
 -flto-partition=none
 -flto-partition=one


[gcc r16-110] opts.cc: Use opts rather than opts_set for validating -fipa-reorder-for-locality

2025-04-24 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:fbf8443961f484ed7fb7e953206af1ee60558a24

commit r16-110-gfbf8443961f484ed7fb7e953206af1ee60558a24
Author: Kyrylo Tkachov 
Date:   Thu Apr 24 05:33:54 2025 -0700

opts.cc: Use opts rather than opts_set for validating 
-fipa-reorder-for-locality

This ensures -fno-ipa-reorder-for-locality doesn't complain with an explicit
-flto-partition=.

Signed-off-by: Kyrylo Tkachov 

* opts.cc (validate_ipa_reorder_locality_lto_partition): Check opts
instead of opts_set for x_flag_ipa_reorder_for_locality.
(finish_options): Update call site.

Diff:
---
 gcc/opts.cc | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/gcc/opts.cc b/gcc/opts.cc
index 5480b9dff2ce..ffcbdfef0bd9 100644
--- a/gcc/opts.cc
+++ b/gcc/opts.cc
@@ -1037,18 +1037,19 @@ report_conflicting_sanitizer_options (struct 
gcc_options *opts, location_t loc,
 }
 }
 
-/* Validate from OPTS_SET that when -fipa-reorder-for-locality is
+/* Validate from OPTS and OPTS_SET that when -fipa-reorder-for-locality is
enabled no explicit -flto-partition is also passed as the locality cloning
pass uses its own partitioning scheme.  */
 
 static void
-validate_ipa_reorder_locality_lto_partition (struct gcc_options *opts_set)
+validate_ipa_reorder_locality_lto_partition (struct gcc_options *opts,
+struct gcc_options *opts_set)
 {
   static bool validated_p = false;
 
   if (opts_set->x_flag_lto_partition)
 {
-  if (opts_set->x_flag_ipa_reorder_for_locality && !validated_p)
+  if (opts->x_flag_ipa_reorder_for_locality && !validated_p)
error ("%<-fipa-reorder-for-locality%> is incompatible with"
   " an explicit %qs option", "-flto-partition");
 }
@@ -1267,7 +1268,7 @@ finish_options (struct gcc_options *opts, struct 
gcc_options *opts_set,
   if (opts->x_flag_reorder_blocks_and_partition)
 SET_OPTION_IF_UNSET (opts, opts_set, flag_reorder_functions, 1);
 
-  validate_ipa_reorder_locality_lto_partition (opts_set);
+  validate_ipa_reorder_locality_lto_partition (opts, opts_set);
 
   /* The -gsplit-dwarf option requires -ggnu-pubnames.  */
   if (opts->x_dwarf_split_debug_info)


[gcc r16-327] Aarch64: Use BUILTIN_VHSDF_HSDF for vector and scalar sqrt builtins

2025-05-01 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:5c917a585d765b0878afd9435e3b3eece9f820f9

commit r16-327-g5c917a585d765b0878afd9435e3b3eece9f820f9
Author: Ayan Shafqat 
Date:   Thu May 1 06:14:44 2025 -0700

Aarch64: Use BUILTIN_VHSDF_HSDF for vector and scalar sqrt builtins

This patch changes the `sqrt` builtin definition from `BUILTIN_VHSDF_DF`
to `BUILTIN_VHSDF_HSDF` in `aarch64-simd-builtins.def`, ensuring the
builtin covers half, single, and double precision variants. The redundant
`VAR1 (UNOP, sqrt, 2, FP, hf)` lines are removed, as they are no longer
needed now that `BUILTIN_VHSDF_HSDF` handles those cases.

gcc/ChangeLog:

* config/aarch64/aarch64-simd-builtins.def: Change
BUILTIN_VHSDF_DF to BUILTIN_VHSDF_HSDF.

Signed-off-by: Ayan Shafqat 
Signed-off-by: Andrew Pinski 

Diff:
---
 gcc/config/aarch64/aarch64-simd-builtins.def | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def 
b/gcc/config/aarch64/aarch64-simd-builtins.def
index 6cc45b18a723..685bf0dc4086 100644
--- a/gcc/config/aarch64/aarch64-simd-builtins.def
+++ b/gcc/config/aarch64/aarch64-simd-builtins.def
@@ -57,7 +57,7 @@
   VAR1 (BINOPP, pmull, 0, DEFAULT, v8qi)
   VAR1 (BINOPP, pmull_hi, 0, DEFAULT, v16qi)
   BUILTIN_VHSDF_HSDF (BINOP, fmulx, 0, FP)
-  BUILTIN_VHSDF_DF (UNOP, sqrt, 2, FP)
+  BUILTIN_VHSDF_HSDF (UNOP, sqrt, 2, FP)
   BUILTIN_VDQ_I (BINOP, addp, 0, DEFAULT)
   BUILTIN_VDQ_I (BINOPU, addp, 0, DEFAULT)
   BUILTIN_VDQ_BHSI (UNOP, clrsb, 2, DEFAULT)
@@ -848,9 +848,6 @@
   BUILTIN_VHSDF_HSDF (BINOP_USS, facgt, 0, FP)
   BUILTIN_VHSDF_HSDF (BINOP_USS, facge, 0, FP)
 
-  /* Implemented by sqrt2.  */
-  VAR1 (UNOP, sqrt, 2, FP, hf)
-
   /* Implemented by hf2.  */
   VAR1 (UNOP, floatdi, 2, FP, hf)
   VAR1 (UNOP, floatsi, 2, FP, hf)


[gcc r16-328] Aarch64: Add __sqrt and __sqrtf intrinsics and corresponding tests

2025-05-01 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:05df554536a8d33f4c438cfc7b006b3b2083246a

commit r16-328-g05df554536a8d33f4c438cfc7b006b3b2083246a
Author: Ayan Shafqat 
Date:   Thu May 1 06:17:30 2025 -0700

Aarch64: Add __sqrt and __sqrtf intrinsics and corresponding tests

This patch introduces two new inline functions, __sqrt and __sqrtf, in
arm_acle.h for Aarch64 targets. These functions wrap the new builtins
__builtin_aarch64_sqrtdf and __builtin_aarch64_sqrtsf, respectively,
providing direct access to hardware instructions without relying on the
standard math library or optimization levels.

This patch also introduces acle_sqrt.c in the AArch64 testsuite,
verifying that the new __sqrt and __sqrtf intrinsics emit the expected
fsqrt instructions for double and float arguments.

Coverage for new intrinsics ensures that __sqrt and __sqrtf are
correctly expanded to hardware instructions and do not fall back to
library calls, regardless of optimization levels.

gcc/ChangeLog:

* config/aarch64/arm_acle.h (__sqrt, __sqrtf): New function.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/acle/acle_sqrt.c: New test.

Signed-off-by: Ayan Shafqat 

Diff:
---
 gcc/config/aarch64/arm_acle.h | 14 ++
 gcc/testsuite/gcc.target/aarch64/acle/acle_sqrt.c | 19 +++
 2 files changed, 33 insertions(+)

diff --git a/gcc/config/aarch64/arm_acle.h b/gcc/config/aarch64/arm_acle.h
index d9e2401ea9f6..507b6e72bcb1 100644
--- a/gcc/config/aarch64/arm_acle.h
+++ b/gcc/config/aarch64/arm_acle.h
@@ -118,6 +118,20 @@ __revl (unsigned long __value)
 return __rev (__value);
 }
 
+__extension__ extern __inline double
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+__sqrt (double __x)
+{
+  return __builtin_aarch64_sqrtdf (__x);
+}
+
+__extension__ extern __inline float
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+__sqrtf (float __x)
+{
+  return __builtin_aarch64_sqrtsf (__x);
+}
+
 #pragma GCC push_options
 #pragma GCC target ("+nothing+jscvt")
 __extension__ extern __inline int32_t
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/acle_sqrt.c 
b/gcc/testsuite/gcc.target/aarch64/acle/acle_sqrt.c
new file mode 100644
index ..482351fa7e66
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/acle/acle_sqrt.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#include "arm_acle.h"
+
+double
+test_acle_sqrt (double x)
+{
+  return __sqrt (x);
+}
+
+float
+test_acle_sqrtf (float x)
+{
+  return __sqrtf (x);
+}
+
+/* { dg-final { scan-assembler-times "fsqrt\td\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "fsqrt\ts\[0-9\]" 1 } } */
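
A short contrast with the libm route (illustrative only, not part of the commit; default -O2 without -ffast-math is assumed):

#include <math.h>
#include <arm_acle.h>

double via_libm (double x)
{
  return sqrt (x);     /* may keep a call to sqrt() for errno handling  */
}

double via_intrinsic (double x)
{
  return __sqrt (x);   /* expands directly to a single fsqrt  */
}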


[gcc r15-9487] Locality cloning pass: -fipa-reorder-for-locality

2025-04-15 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:6d9fdf4bf57353f9260a2e0c8774854fb50f5128

commit r15-9487-g6d9fdf4bf57353f9260a2e0c8774854fb50f5128
Author: Kyrylo Tkachov 
Date:   Thu Feb 27 09:24:10 2025 -0800

Locality cloning pass: -fipa-reorder-for-locality

Implement partitioning and cloning in the callgraph to help locality.
A new -fipa-reorder-for-locality flag is used to enable this.
The majority of the logic is in the new IPA pass in ipa-locality-cloning.cc
The optimization has two components:
* Partitioning the callgraph so as to group callers and callees that 
frequently
call each other in the same partition
* Cloning functions that straddle multiple callchains and allowing each 
clone
to be local to the partition of its callchain.

The pass creates a partitioning plan and does the prerequisite cloning.
The partitioning is then implemented during the existing LTO partitioning 
pass.

To guide these locality heuristics we use PGO data.
In the absence of PGO data we use a static heuristic that uses the 
accumulated
estimated edge frequencies of the callees for each function to guide the
reordering.
We are investigating some more elaborate static heuristics, in particular 
using
the demangled C++ names to group template instantiations together.
This is promising but we are working out some kinks in the implementation
currently and want to send that out as a follow-up once we're more confident
in it.

A new bootstrap-lto-locality bootstrap config is added that allows us to 
test
this on GCC itself with either static or PGO heuristics.
GCC bootstraps with both (normal LTO bootstrap and profiledbootstrap).

As this new pass enables a new partitioning scheme it is incompatible with
explicit -flto-partition= options so an error is introduced when the user
uses both flags explicitly.

With this optimization we are seeing good performance gains on some large
internal workloads that stress the parts of the processor that is sensitive
to code locality, but we'd appreciate wider performance evaluation.

Bootstrapped and tested on aarch64-none-linux-gnu.
Ok for mainline?
Thanks,
Kyrill

Signed-off-by: Prachi Godbole 
Co-authored-by: Kyrylo Tkachov 

config/ChangeLog:

* bootstrap-lto-locality.mk: New file.

gcc/ChangeLog:

* Makefile.in (OBJS): Add ipa-locality-cloning.o.
* cgraph.h (set_new_clone_decl_and_node_flags): Declare prototype.
* cgraphclones.cc (set_new_clone_decl_and_node_flags): Remove static
qualifier.
* common.opt (fipa-reorder-for-locality): New flag.
(LTO_PARTITION_DEFAULT): Declare.
(flto-partition): Change default to LTO_PARTITION_DEFAULT.
* doc/invoke.texi: Document -fipa-reorder-for-locality.
* flag-types.h (enum lto_locality_cloning_model): Declare.
(lto_partitioning_model): Add LTO_PARTITION_DEFAULT.
* lto-cgraph.cc (lto_set_symtab_encoder_in_partition): Add dumping 
of
node and index.
* opts.cc (validate_ipa_reorder_locality_lto_partition): Define.
(finish_options): Handle LTO_PARTITION_DEFAULT.
* params.opt (lto_locality_cloning_model): New enum.
(lto-partition-locality-cloning): New param.
(lto-partition-locality-frequency-cutoff): Likewise.
(lto-partition-locality-size-cutoff): Likewise.
(lto-max-locality-partition): Likewise.
* passes.def: Register pass_ipa_locality_cloning.
* timevar.def (TV_IPA_LC): New timevar.
* tree-pass.h (make_pass_ipa_locality_cloning): Declare.
* ipa-locality-cloning.cc: New file.
* ipa-locality-cloning.h: New file.

gcc/lto/ChangeLog:

* lto-partition.cc (add_node_references_to_partition): Define.
(create_partition): Likewise.
(lto_locality_map): Likewise.
(lto_promote_cross_file_statics): Add extra dumping.
* lto-partition.h (lto_locality_map): Declare prototype.
* lto.cc (do_whole_program_analysis): Handle
flag_ipa_reorder_for_locality.

Diff:
---
 config/bootstrap-lto-locality.mk |   20 +
 gcc/Makefile.in  |2 +
 gcc/cgraph.h |1 +
 gcc/cgraphclones.cc  |2 +-
 gcc/common.opt   |9 +-
 gcc/doc/invoke.texi  |   32 +-
 gcc/flag-types.h |   10 +-
 gcc/ipa-locality-cloning.cc  | 1137 ++
 gcc/ipa-locality-cloning.h   |   35 ++
 gcc/lto-cgraph.cc|2 +
 gcc/lto/lto-partition.cc |  126 +
 gcc/lto/lto-partition.h  |1 +
 gcc/lto/lto.cc
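
A minimal sketch of how the new flag is meant to be driven (options assumed, not taken from the patch; without profile data the pass falls back to the static estimated-frequency heuristic described above):

/* { dg-do compile } */
/* { dg-options "-O2 -flto -fipa-reorder-for-locality" } */
extern void hot_callee (int);
void caller_a (int i) { hot_callee (i); }
void caller_b (int i) { hot_callee (i + 1); }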

[gcc r15-9498] Regenerate common.opt.urls

2025-04-15 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:5621b3b5c9ebd98f1f18787a6fceb015d19d33a5

commit r15-9498-g5621b3b5c9ebd98f1f18787a6fceb015d19d33a5
Author: Kyrylo Tkachov 
Date:   Tue Apr 15 09:22:05 2025 -0700

Regenerate common.opt.urls

Signed-off-by: Kyrylo Tkachov 

* common.opt.urls: Regenerate.

Diff:
---
 gcc/common.opt.urls | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/common.opt.urls b/gcc/common.opt.urls
index a4b14f5241fc..8bd75b1153b0 100644
--- a/gcc/common.opt.urls
+++ b/gcc/common.opt.urls
@@ -868,6 +868,9 @@ UrlSuffix(gcc/Optimize-Options.html#index-fipa-bit-cp)
 fipa-modref
 UrlSuffix(gcc/Optimize-Options.html#index-fipa-modref)
 
+fipa-reorder-for-locality
+UrlSuffix(gcc/Optimize-Options.html#index-fipa-reorder-for-locality)
+
 fipa-profile
 UrlSuffix(gcc/Optimize-Options.html#index-fipa-profile)


[gcc r15-7833] PR rtl-optimization/119046: aarch64: Fix PARALLEL mode for vec_perm DUP expansion

2025-03-05 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:ff505948631713d8c62523005059b10e25343617

commit r15-7833-gff505948631713d8c62523005059b10e25343617
Author: Kyrylo Tkachov 
Date:   Wed Mar 5 03:03:52 2025 -0800

PR rtl-optimization/119046: aarch64: Fix PARALLEL mode for vec_perm DUP 
expansion

The PARALLEL created in aarch64_evpc_dup is used to hold the lane number.
It is not appropriate for it to have a vector mode.
Other such uses use VOIDmode.
Do this here as well.
This avoids the risk of generic code treating the PARALLEL as trapping when 
it
has floating-point mode.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

PR rtl-optimization/119046
* config/aarch64/aarch64.cc (aarch64_evpc_dup): Use VOIDmode for
PARALLEL.

Diff:
---
 gcc/config/aarch64/aarch64.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index af3871ce8a1f..9196b8d906c8 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -26301,7 +26301,7 @@ aarch64_evpc_dup (struct expand_vec_perm_d *d)
   in0 = d->op0;
   lane = GEN_INT (elt); /* The pattern corrects for big-endian.  */
 
-  rtx parallel = gen_rtx_PARALLEL (vmode, gen_rtvec (1, lane));
+  rtx parallel = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, lane));
   rtx select = gen_rtx_VEC_SELECT (GET_MODE_INNER (vmode), in0, parallel);
   emit_set_insn (out, gen_rtx_VEC_DUPLICATE (vmode, select));
   return true;


[gcc r15-7832] PR rtl-optimization/119046: Don't mark PARALLEL RTXes with floating-point mode as trapping

2025-03-05 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:db76482175c4e76db273d7fb3a00ae0f932529a6

commit r15-7832-gdb76482175c4e76db273d7fb3a00ae0f932529a6
Author: Kyrylo Tkachov 
Date:   Thu Feb 27 09:00:25 2025 -0800

PR rtl-optimization/119046: Don't mark PARALLEL RTXes with floating-point 
mode as trapping

In this testcase late-combine was failing to merge:
dup v31.4s, v31.s[3]
fmla v30.4s, v31.4s, v29.4s
into the lane-wise fmla form.
This is because late-combine checks may_trap_p under the hood on the dup 
insn.
This ended up returning true for the insn:
(set (reg:V4SF 152 [ _32 ])
     (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ])
             (parallel:V4SF [
                 (const_int 3 [0x3])]))))

Although may_trap_p correctly reasoned that vec_duplicate and vec_select of
floating-point modes can't trap, it assumed that the V4SF parallel can trap.
The correct behaviour is to recurse into the vector inside the PARALLEL and
check each sub-expression.  This patch adjusts may_trap_p_1 to do just that.
With this check the above insn is not deemed to be trapping and is 
propagated
into the FMLA giving:
fmla vD.4s, vA.4s, vB.s[3]

Bootstrapped and tested on aarch64-none-linux-gnu.
Apparently this also fixes a regression in
gcc.target/aarch64/vmul_element_cost.c that I observed.

Signed-off-by: Kyrylo Tkachov 

gcc/

PR rtl-optimization/119046
* rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as 
trapping.

gcc/testsuite/

PR rtl-optimization/119046
* gcc.target/aarch64/pr119046.c: New test.

Diff:
---
 gcc/rtlanal.cc  |  1 +
 gcc/testsuite/gcc.target/aarch64/pr119046.c | 16 
 2 files changed, 17 insertions(+)

diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc
index 8caffafdaa44..7ad67afb9fe8 100644
--- a/gcc/rtlanal.cc
+++ b/gcc/rtlanal.cc
@@ -3252,6 +3252,7 @@ may_trap_p_1 (const_rtx x, unsigned flags)
return true;
   break;
 
+case PARALLEL:
 case NEG:
 case ABS:
 case SUBREG:
diff --git a/gcc/testsuite/gcc.target/aarch64/pr119046.c 
b/gcc/testsuite/gcc.target/aarch64/pr119046.c
new file mode 100644
index ..aa5fa7c848c1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr119046.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2" } */
+
+#include 
+
+float32x4_t madd_helper_1(float32x4_t a, float32x4_t b, float32x4_t d)
+{
+  float32x4_t t = a;
+  t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d);
+  t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d);
+  return t;
+}
+
+/* { dg-final { scan-assembler-not {\tdup\tv[0-9]+\.4s, v[0-9]+.s\[1\]\n} } } 
*/
+/* { dg-final { scan-assembler-times {\tfmla\tv[0-9]+\.4s, v[0-9]+\.4s, 
v[0-9]+\.s\[1\]\n} 2 } } */
+


[gcc r15-9062] PR middle-end/119442: expr.cc: Fix vec_duplicate into vector boolean modes

2025-03-31 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:70391e3958db791edea4e877636592de47a785e7

commit r15-9062-g70391e3958db791edea4e877636592de47a785e7
Author: Kyrylo Tkachov 
Date:   Mon Mar 24 01:53:06 2025 -0700

PR middle-end/119442: expr.cc: Fix vec_duplicate into vector boolean modes

In this testcase GCC tries to expand a VNx4BI vector:
  vector(4) <signed-boolean:4> _40;
  _39 = (<signed-boolean:4>) _24;
  _40 = {_39, _39, _39, _39};

This ends up in a scalarised sequence of bitfield insert operations.
This is despite the fact that AArch64 provides a vec_duplicate pattern
specifically for vec_duplicate into VNx4BI.

The store_constructor code is overly conservative when trying vec_duplicate
as it sees a requested VNx4BImode and an element mode of QImode, which I 
guess
is the storage mode of BImode objects.

The vec_duplicate expander in aarch64-sve.md explicitly allows QImode 
element
modes so it should be safe to use it.  This patch extends that mode check
to allow such expanders.

The testcase is heavily auto-reduced from a real application and is nonsensical
in itself, but it does demonstrate the current problematic codegen.

With this, the testcase goes from:
pfalse  p15.b
str p15, [sp, #6, mul vl]
mov w0, 0
ldr w2, [sp, 12]
bfi w2, w0, 0, 4
uxtw x2, w2
bfi w2, w0, 4, 4
uxtw x2, w2
bfi w2, w0, 8, 4
uxtw x2, w2
bfi w2, w0, 12, 4
str w2, [sp, 12]
ldr p15, [sp, #6, mul vl]

into:
whilelo p15.s, wzr, wzr

The whilelo could be optimised away into a pfalse of course, but the 
important
part is that the bfis are gones.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov 

gcc/

PR middle-end/119442
* expr.cc (store_constructor): Also allow element modes explicitly
accepted by target vec_duplicate pattern.

gcc/testsuite/

PR middle-end/119442
* gcc.target/aarch64/vls_sve_vec_dup_1.c: New test.

Diff:
---
 gcc/expr.cc  | 11 ---
 gcc/testsuite/gcc.target/aarch64/vls_sve_vec_dup_1.c | 15 +++
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/gcc/expr.cc b/gcc/expr.cc
index 9f4382d7986b..2147eedad7be 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -7920,11 +7920,16 @@ store_constructor (tree exp, rtx target, int cleared, 
poly_int64 size,
gcc_assert (eltmode != BLKmode);
 
/* Try using vec_duplicate_optab for uniform vectors.  */
+   icode = optab_handler (vec_duplicate_optab, mode);
if (!TREE_SIDE_EFFECTS (exp)
&& VECTOR_MODE_P (mode)
-   && eltmode == GET_MODE_INNER (mode)
-   && ((icode = optab_handler (vec_duplicate_optab, mode))
-   != CODE_FOR_nothing)
+   && icode != CODE_FOR_nothing
+   /* If the vec_duplicate target pattern does not specify an element
+  mode check that eltmode is the normal inner mode of the
+  requested vector mode.  But if the target allows eltmode
+  explicitly go ahead and use it.  */
+   && (eltmode == GET_MODE_INNER (mode)
+   || insn_data[icode].operand[1].mode == eltmode)
&& (elt = uniform_vector_p (exp))
&& !VECTOR_TYPE_P (TREE_TYPE (elt)))
  {
diff --git a/gcc/testsuite/gcc.target/aarch64/vls_sve_vec_dup_1.c 
b/gcc/testsuite/gcc.target/aarch64/vls_sve_vec_dup_1.c
new file mode 100644
index ..ada0d4fc0a43
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vls_sve_vec_dup_1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=armv8.2-a+sve -msve-vector-bits=128" } */
+
+float fasten_main_etot_0;
+void fasten_main() {
+  for (int l = 0; l < 2;) {
+int phphb_nz;
+for (; l < 32; l++) {
+  float dslv_e = l && phphb_nz;
+  fasten_main_etot_0 += dslv_e;
+}
+  }
+}
+
+/* { dg-final { scan-assembler-not {bfi\tw\[0-9\]+} } } */


[gcc r16-928] aarch64: Enable newly implemented features for FUJITSU-MONAKA

2025-05-28 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:33ee574a7444b238005d89fdfdf2f21f50b1fc6e

commit r16-928-g33ee574a7444b238005d89fdfdf2f21f50b1fc6e
Author: Yuta Mukai 
Date:   Fri May 23 04:51:11 2025 +

aarch64: Enable newly implemented features for FUJITSU-MONAKA

This patch enables newly implemented features in GCC (FAMINMAX, FP8FMA,
FP8DOT2, FP8DOT4, LUT) for FUJITSU-MONAKA
processor (-mcpu=fujitsu-monaka).

2025-05-23  Yuta Mukai  

gcc/ChangeLog:

* config/aarch64/aarch64-cores.def (fujitsu-monaka): Update ISA
features.

Diff:
---
 gcc/config/aarch64/aarch64-cores.def | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64-cores.def 
b/gcc/config/aarch64/aarch64-cores.def
index 12096300d012..24b7cd362aaf 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -132,7 +132,7 @@ AARCH64_CORE("octeontx2f95mm", octeontx2f95mm, cortexa57, 
V8_2A,  (CRYPTO, PROFI
 
 /* Fujitsu ('F') cores. */
 AARCH64_CORE("a64fx", a64fx, a64fx, V8_2A,  (F16, SVE), a64fx, 0x46, 0x001, -1)
-AARCH64_CORE("fujitsu-monaka", fujitsu_monaka, cortexa57, V9_3A, (F16, FP8, 
LS64, RNG, CRYPTO, SVE2_AES, SVE2_BITPERM, SVE2_SHA3, SVE2_SM4), 
fujitsu_monaka, 0x46, 0x003, -1)
+AARCH64_CORE("fujitsu-monaka", fujitsu_monaka, cortexa57, V9_3A, (F16, 
FAMINMAX, FP8FMA, FP8DOT2, FP8DOT4, LS64, LUT, RNG, CRYPTO, SVE2_AES, 
SVE2_BITPERM, SVE2_SHA3, SVE2_SM4), fujitsu_monaka, 0x46, 0x003, -1)
 
 /* HiSilicon ('H') cores. */
 AARCH64_CORE("tsv110",  tsv110, tsv110, V8_2A,  (CRYPTO, F16), tsv110,   0x48, 
0xd01, -1)


[gcc r15-9742] aarch64: Enable newly implemented features for FUJITSU-MONAKA

2025-05-29 Thread Kyrylo Tkachov via Gcc-cvs
https://gcc.gnu.org/g:d79b3dc85d26051665b3e7412d5e1bd35915b882

commit r15-9742-gd79b3dc85d26051665b3e7412d5e1bd35915b882
Author: Yuta Mukai 
Date:   Fri May 23 04:51:11 2025 +

aarch64: Enable newly implemented features for FUJITSU-MONAKA

This patch enables newly implemented features in GCC (FAMINMAX, FP8FMA,
FP8DOT2, FP8DOT4, LUT) for FUJITSU-MONAKA
processor (-mcpu=fujitsu-monaka).

2025-05-23  Yuta Mukai  

gcc/ChangeLog:

* config/aarch64/aarch64-cores.def (fujitsu-monaka): Update ISA
features.

(cherry picked from commit 33ee574a7444b238005d89fdfdf2f21f50b1fc6e)

Diff:
---
 gcc/config/aarch64/aarch64-cores.def | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64-cores.def 
b/gcc/config/aarch64/aarch64-cores.def
index 12096300d012..24b7cd362aaf 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -132,7 +132,7 @@ AARCH64_CORE("octeontx2f95mm", octeontx2f95mm, cortexa57, 
V8_2A,  (CRYPTO, PROFI
 
 /* Fujitsu ('F') cores. */
 AARCH64_CORE("a64fx", a64fx, a64fx, V8_2A,  (F16, SVE), a64fx, 0x46, 0x001, -1)
-AARCH64_CORE("fujitsu-monaka", fujitsu_monaka, cortexa57, V9_3A, (F16, FP8, 
LS64, RNG, CRYPTO, SVE2_AES, SVE2_BITPERM, SVE2_SHA3, SVE2_SM4), 
fujitsu_monaka, 0x46, 0x003, -1)
+AARCH64_CORE("fujitsu-monaka", fujitsu_monaka, cortexa57, V9_3A, (F16, 
FAMINMAX, FP8FMA, FP8DOT2, FP8DOT4, LS64, LUT, RNG, CRYPTO, SVE2_AES, 
SVE2_BITPERM, SVE2_SHA3, SVE2_SM4), fujitsu_monaka, 0x46, 0x003, -1)
 
 /* HiSilicon ('H') cores. */
 AARCH64_CORE("tsv110",  tsv110, tsv110, V8_2A,  (CRYPTO, F16), tsv110,   0x48, 
0xd01, -1)