[Patch] AMD GCN: Set HSA_XNACK for USM and 'xnack+' / 'xnack-'

2024-10-29 Thread Tobias Burnus

While users can set HSA_XNACK themselves, it is much more convenient if
the compiler sets it for them (at least as long as it remains overridable).

Some systems don't have XNACK, but for those that do, the newer object
code versions support three modes: unspecified (GCC: '-mxnack=any',
supporting both XNACK on and off), on, and off; the latter two only work
if the compiled-for mode matches the actual mode in which the GPU is
running.  Therefore, setting HSA_XNACK in this case makes sense.

XNACK (when available) also needs to be enabled in order to have working
unified-shared memory access; hence, setting it in that case also makes
sense.  Therefore, this patch sets HSA_XNACK to 0 or 1.

This somewhat matches what is done in OG13 and in Andrew's patch at
https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655951.html
albeit the code is somewhat different.
[For some reason, this code is not in OG14 ?!?]

While doing so, I also updated the documentation and moved the code from
the existing stack-size constructor into the existing 'init' constructor
to reduce the number of constructors used.

OK for mainline?

Tobias
AMD GCN: Set HSA_XNACK for USM and 'xnack+' / 'xnack-'

Code compiled with explicit 'xnack-' (-mxnack=off) or 'xnack+' (-mxnack=on)
only runs when the hardware mode matches.  Additionally, on GPUs that support
XNACK, the GPU has to run in this mode for unified-shared memory access to
be possible.

This commit adds code to the constructor generated by mkoffload.cc to set
the HSA_XNACK environment variable (unless already set by the user): to 0
for 'xnack-', and to 1 for 'xnack+' or when the code contains
'omp requires self_maps' or '... unified_shared_memory'.

There is a compile-time warning when combining 'xnack-' with USM. At runtime,
when XNACK is supported but not enabled, the GPU is excluded (host fallback).

The new setenv call has been added to the 'init' constructor, and the
existing setenv code for the stack size has been moved there to avoid the
pointless overhead of multiple constructors.
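
For illustration, a minimal sketch of the gating logic such a generated
constructor could use (hypothetical code; the actual output of
mkoffload.cc differs):

  /* Hypothetical sketch only - not the code mkoffload.cc emits.  */
  #include <stdio.h>
  #include <stdlib.h>

  static __attribute__((constructor)) void
  init (void)
  {
    /* Respect a user-provided value; treat an empty string as unset.  */
    const char *val = getenv ("HSA_XNACK");
    if (!val || val[0] == '\0')
      setenv ("HSA_XNACK", "1", 1);  /* "1" for 'xnack+', "0" for 'xnack-'.  */
    else if (val[0] != '1')
      fprintf (stderr, "warning: HSA_XNACK=%s does not match the "
	       "compiled-for XNACK mode\n", val);
  }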

Note: In GCC, gfx90{0,6,8} default to 'xnack-'; hence, the constructor will
set HSA_XNACK=0 by default. 

Note 2: There is probably a preexisting endianness issue in handling
omp_requires if the host compiler and the target compiler (in the GCC sense)
have a different endianness, which would be fixable.  [If the target and the
offload target had a different endianness, it would be mostly unfixable,
especially with USM.]

Co-Authored-By: Andrew Stubbs 

gcc/ChangeLog:

	* config/gcn/mkoffload.cc (process_asm): Take omp_requires argument;
	extend conditions for requiring an '#include' in the generated C file.
	Remove code generation for the GCN_STACK_SIZE constructor.
	(process_obj): Inside the generated 'init' constructor, set
	GCN_STACK_SIZE and HSA_XNACK, if applicable.
	(main): Update process_asm call.

libgomp/ChangeLog:

	* libgomp.texi (nvptx): Mention 'self_maps' besides
	'unified_shared_memory' clause.
	(AMD GCN): Likewise; state that GCC might automatically
	set HSA_XNACK.
	* testsuite/libgomp.c-c++-common/requires-4.c: Add dg-prune-output
	for the -mxnack=off with USM warning.
	* testsuite/libgomp.c-c++-common/requires-4a.c: Likewise.
	* testsuite/libgomp.c-c++-common/requires-5.c: Likewise.
	* testsuite/libgomp.c-c++-common/requires-7.c: Likewise.
	* testsuite/libgomp.c-c++-common/target-implicit-map-4.c: Likewise.
	* testsuite/libgomp.c-c++-common/target-link-3.c: Likewise.
	* testsuite/libgomp.c-c++-common/target-link-4.c: Likewise.
	* testsuite/libgomp.fortran/self_maps.f90: Likewise.

 gcc/config/gcn/mkoffload.cc| 67 --
 libgomp/libgomp.texi   | 24 +---
 .../testsuite/libgomp.c-c++-common/requires-4.c|  5 ++
 .../testsuite/libgomp.c-c++-common/requires-4a.c   |  5 ++
 .../testsuite/libgomp.c-c++-common/requires-5.c|  5 ++
 .../testsuite/libgomp.c-c++-common/requires-7.c|  5 ++
 .../libgomp.c-c++-common/target-implicit-map-4.c   |  5 ++
 .../testsuite/libgomp.c-c++-common/target-link-3.c |  5 ++
 .../testsuite/libgomp.c-c++-common/target-link-4.c |  5 ++
 libgomp/testsuite/libgomp.fortran/self_maps.f90|  5 ++
 10 files changed, 105 insertions(+), 26 deletions(-)

diff --git a/gcc/config/gcn/mkoffload.cc b/gcc/config/gcn/mkoffload.cc
index 17a33421134..27066ba1be3 100644
--- a/gcc/config/gcn/mkoffload.cc
+++ b/gcc/config/gcn/mkoffload.cc
@@ -140,6 +140,7 @@ uint32_t elf_arch = EF_AMDGPU_MACH_AMDGCN_GFX900;  // Default GPU architecture.
 uint32_t elf_flags = EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4;
 
 static int gcn_stack_size = 0;  /* Zero means use default.  */
+bool xnack_supported = false;
 
 /* Delete tempfiles.  */
 
@@ -442,7 +443,7 @@ copy_early_debug_info (const char *infile, const char *outfile)
encoded as structured data.  */
 
 static void
-process_asm (FILE *in, FILE *out, FILE *cfile)
+process_asm (FILE *in, FILE *out, FILE *cfile, uint32_t omp_requires)
 {
   int fn_count = 0, var_count

[Patch] AMD GCN: mkoffload.cc - set HSA_XNACK for USM and 'xnack+' / 'xnack-' (was [Patch] AMD GCN: Set HSA_XNACK for USM and 'xnack+' / 'xnack-')

2024-10-29 Thread Tobias Burnus

Reposted for two reasons:

First, I realized that the subject should contain the
word 'mkoffload.cc' to be clearer.

But the main reason is that I kept changing my mind about whether,
for gfx90{0,6,8}, I wanted to both set HSA_XNACK=0 and warn with USM,
only one of the two, or neither. (In GCC, those default to 'xnack-' as
they have XNACK but do not have well-working USM support.)

In the end, I settled on: yes, both set the env var and warn.
But I forgot to remove all traces of the now-unused variable.

Appended the corrected patch …

Tobias

Tobias Burnus wrote:


While users can set HSA_XNACK themselves, it is much more convenient if
the compiler sets it for them (at least as long as it remains overridable).

Some systems don't have XNACK, but for those that do, the newer object
code versions support three modes: unspecified (GCC: '-mxnack=any',
supporting both XNACK on and off), on, and off; the latter two only work
if the compiled-for mode matches the actual mode in which the GPU is
running.  Therefore, setting HSA_XNACK in this case makes sense.

XNACK (when available) also needs to be enabled in order to have working
unified-shared memory access; hence, setting it in that case also makes
sense.

Therefore, this patch sets HSA_XNACK to 0 or 1.

This somewhat matches what is done in OG13 and in Andrew's patch at
https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655951.html
albeit the code is somewhat different.
[For some reason, this code is not in OG14 ?!?]

While doing so, I also updated the documentation and moved the code from
the existing stack-size constructor into the existing 'init' constructor
to reduce the number of constructors used.

OK for mainline?

Tobias

AMD GCN: mkoffload.cc - set HSA_XNACK for USM and 'xnack+' / 'xnack-'

Code compiled with explicit 'xnack-' (-mxnack=off) or 'xnack+' (-mxnack=on)
only runs when the hardware mode matches.  Additionally, on GPUs that support
XNACK, the GPU has to run in this mode for unified-shared memory access to
be possible.

This commit adds code to the constructor generated by mkoffload.cc to set
the HSA_XNACK environment variable (unless already set by the user): to 0
for 'xnack-', and to 1 for 'xnack+' or when the code contains
'omp requires self_maps' or '... unified_shared_memory'.

There is a compile-time warning when combining 'xnack-' with USM. At runtime,
when XNACK is supported but not enabled, the GPU is excluded (host fallback).

The new setenv call has been added to the 'init' constructor, and the
existing setenv code for the stack size has been moved there to avoid the
pointless overhead of multiple constructors.

Note: In GCC, gfx90{0,6,8} default to 'xnack-'; hence, the constructor will
set HSA_XNACK=0 by default. 

Note 2: There is probably a preexisting endianness issue in handling
omp_requires if the host compiler and the target compiler (in the GCC sense)
have a different endianness, which would be fixable.  [If the target and the
offload target had a different endianness, it would be mostly unfixable,
especially with USM.]

Co-Authored-By: Andrew Stubbs 

gcc/ChangeLog:

	* config/gcn/mkoffload.cc (process_asm): Take omp_requires argument;
	extend conditions for requiring an '#include' in the generated C file.
	Remove code generation for the GCN_STACK_SIZE constructor.
	(process_obj): Inside the generated 'init' constructor, set
	GCN_STACK_SIZE and HSA_XNACK, if applicable.
	(main): Update process_asm call.

libgomp/ChangeLog:

	* libgomp.texi (nvptx): Mention 'self_maps' besides
	'unified_shared_memory' clause.
	(AMD GCN): Likewise; state that GCC might automatically
	set HSA_XNACK.
	* testsuite/libgomp.c-c++-common/requires-4.c: Add dg-prune-output
	for the -mxnack=off with USM warning.
	* testsuite/libgomp.c-c++-common/requires-4a.c: Likewise.
	* testsuite/libgomp.c-c++-common/requires-5.c: Likewise.
	* testsuite/libgomp.c-c++-common/requires-7.c: Likewise.
	* testsuite/libgomp.c-c++-common/target-implicit-map-4.c: Likewise.
	* testsuite/libgomp.c-c++-common/target-link-3.c: Likewise.
	* testsuite/libgomp.c-c++-common/target-link-4.c: Likewise.
	* testsuite/libgomp.fortran/self_maps.f90: Likewise.

 gcc/config/gcn/mkoffload.cc| 61 --
 libgomp/libgomp.texi   | 24 +
 .../testsuite/libgomp.c-c++-common/requires-4.c|  5 ++
 .../testsuite/libgomp.c-c++-common/requires-4a.c   |  5 ++
 .../testsuite/libgomp.c-c++-common/requires-5.c|  5 ++
 .../testsuite/libgomp.c-c++-common/requires-7.c|  5 ++
 .../libgomp.c-c++-common/target-implicit-map-4.c   |  5 ++
 .../testsuite/libgomp.c-c++-common/target-link-3.c |  5 ++
 .../testsuite/libgomp.c-c++-common/target-link-4.c |  5 ++
 libgomp/testsuite/libgomp.fortran/self_maps.f90|  5 ++
 10 files changed, 100 insertions(+), 25 deletions(-)

diff --git a/gcc/config/gcn/mkoffload.cc b/gcc/config/gcn/mkoffload.cc
index 17a33421134..b50c28881da 100644
--- a/gcc/config/gcn/mkoffload.cc
+++ b/gcc/config/gcn/mkoffload.cc
@@ -442,7 +442

Re: [PATCH v2 5/8] amdgcn, openmp: Auto-detect USM mode and set HSA_XNACK

2024-10-29 Thread Tobias Burnus

Hi Andrew,

Am 28.06.24 um 12:24 schrieb Andrew Stubbs:

--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -70,6 +70,11 @@ static bool ext_gcn_constants_init = 0;
  
  enum gcn_isa gcn_isa = ISA_GCN3;	/* Default to GCN3.  */
  
+/* Record whether the host compiler added "omp unified memory" attributes to
+   any functions.  We can then pass this on to mkoffload to ensure xnack is
+   compatible there too.  */
+static bool unified_shared_memory_enabled = false;


…

Why is this needed instead of relying on omp_requires?


+  if (fndecl && lookup_attribute ("omp unified memory",
+   case PROCESSOR_GFX1036:
+   case PROCESSOR_GFX1100:
+   case PROCESSOR_GFX1103:
+ error ("GPU architecture does not support Unified Shared Memory");
+ break;


This seems to be wrong in two ways:

First, while not really usable, it feels wrong to print an error (and 
not a warning) if USM is not supported. Running the code with host 
fallback seems to be fine, albeit there is admittedly the question 
whether it makes sense to generate the GCN code in this case.


Secondly, gfx1036 has the property 
HSA_AMD_SYSTEM_INFO_SVM_ACCESSIBLE_BY_DEFAULT, i.e. this APU supports 
USM, albeit not XNACK.


* * *


+   default:
+ if (flag_xnack == HSACO_ATTR_OFF)
+   error ("Unified Shared Memory is enabled, but XNACK is disabled");


Likewise – I understand that USM won't work in this case, but the 
question is whether that should be a warning or an error as it does work 
(by using host fallback in this case).


* * *


+  /* Emit a constructor function to set the HSA_XNACK environment variable.
+ This must be done before the ROCr runtime library is loaded.
+ We never override a user value (except an empty string), but we do emit a
+ useful diagnostic in the wrong mode (the ROCr message is not good).  */
+  if (TEST_XNACK_OFF (elf_flags) && unified_shared_memory_enabled)
+fatal_error (input_location,
+"conflicting settings; XNACK is forced off but Unified "
+"Shared Memory is on");


Is this reachable? I thought the code in gcn-lto1 already prints a 
diagnostic?


Otherwise, the same applies here: error vs. warning.


+  if (!TEST_XNACK_ANY (elf_flags) || unified_shared_memory_enabled)


I think you need to exclude XNACK UNSUPPORTED on the RHS, albeit I might
have missed some condition which ensures that unified_shared_memory_enabled
is not set in that case.



+fprintf (cfile,
+"static __attribute__((constructor))\n"
+"void configure_xnack (void)\n"
+"{\n"


Constructors are somewhat expensive. Why don't you combine all three
constructor features into a single constructor?



+"  const char *val = getenv (\"HSA_XNACK\");\n"
+"  if (!val || val[0] == '\\0')\n"
+"setenv (\"HSA_XNACK\", \"%d\", true);\n"
+"  else if (%s)\n"
+"{\n"
+"  fprintf (stderr, \"error: HSA_XNACK=%%s is incompatible; "
+   "please unset\\n\", val);\n"
+"  exit (1);\n"
+"}\n"


This looks wrong – not having support for USM is not an error but should 
fall back to a host execution.


But, admittedly, 'xnack-' with HSA_XNACK=1 (or 'xnack+' with HSA_XNACK=0)
is incompatible, i.e. we might want to keep the warning in that case.



--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -2124,6 +2124,10 @@ create_omp_child_function (omp_context *ctx, bool 
task_copy)
DECL_ATTRIBUTES (decl)
  = tree_cons (get_identifier (target_attr),
   NULL_TREE, DECL_ATTRIBUTES (decl));
+  if (omp_requires_mask & OMP_REQUIRES_UNIFIED_SHARED_MEMORY)
+   DECL_ATTRIBUTES (decl)
+ = tree_cons (get_identifier ("omp unified memory"),
+  NULL_TREE, DECL_ATTRIBUTES (decl));


As mentioned, it is unclear to me why 'omp_requires' is not enough. In 
any case, it also needs to handle the (admittedly newer) self_maps flag.


Finally, I think it would be user friendly to mention this feature in 
libgomp.texi.


Tobias


Re: [PATCH 7/7] RISC-V: Disable by pieces for vector setmem length > UNITS_PER_WORD

2024-10-29 Thread Craig Blackmore



On 20/10/2024 17:36, Jeff Law wrote:



On 10/19/24 7:09 AM, Jeff Law wrote:



On 10/18/24 7:13 AM, Craig Blackmore wrote:

For fast unaligned access targets, by pieces uses up to UNITS_PER_WORD
size pieces resulting in more store instructions than needed. For
example gcc.target/riscv/rvv/base/setmem-1.c:f1 built with
`-O3 -march=rv64gcv -mtune=thead-c906`:
```
f1:
 vsetivli    zero,8,e8,mf2,ta,ma
 vmv.v.x v1,a1
 vsetivli    zero,0,e32,mf2,ta,ma
 sb  a1,14(a0)
 vmv.x.s a4,v1
 vsetivli    zero,8,e16,m1,ta,ma
 vmv.x.s a5,v1
 vse8.v  v1,0(a0)
 sw  a4,8(a0)
 sh  a5,12(a0)
 ret
```

The slow unaligned access version built with `-O3 -march=rv64gcv` used
15 sb instructions:
```
f1:
 sb  a1,0(a0)
 sb  a1,1(a0)
 sb  a1,2(a0)
 sb  a1,3(a0)
 sb  a1,4(a0)
 sb  a1,5(a0)
 sb  a1,6(a0)
 sb  a1,7(a0)
 sb  a1,8(a0)
 sb  a1,9(a0)
 sb  a1,10(a0)
 sb  a1,11(a0)
 sb  a1,12(a0)
 sb  a1,13(a0)
 sb  a1,14(a0)
 ret
```

After this patch, the following is generated in both cases:
```
f1:
 vsetivli    zero,15,e8,m1,ta,ma
 vmv.v.x v1,a1
 vse8.v  v1,0(a0)
 ret
```

gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_use_by_pieces_infrastructure_p):
New function.
(TARGET_USE_BY_PIECES_INFRASTRUCTURE_P): Define.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr113469.c: Expect mf2 setmem.
* gcc.target/riscv/rvv/base/setmem-2.c: Update f1 to expect
straight-line vector memset.
* gcc.target/riscv/rvv/base/setmem-3.c: Likewise.
This looks independent of 6/7, so I went ahead and pushed it. There's 
a slight chance we'll have a test goof if there's an unexpected 
dependency.
I ended up reverting.  It appears there is a dependency on patch #6.  
So we'll figure out patch #6, then revisit patch #7.


Yes there is a dependency. `check_vectorise_memory_operation` returns
false for `(length < (TARGET_MIN_VLEN / 8))`. Patch 6 changes
`expand_vec_setmem` to use `use_vector_stringop_p`, which follows the
same logic as for memcpy and allows smaller lengths, such as the
`MIN_VECTOR_BYTES - 1` length memset in setmem-1.c, to be vectorized.


Craig




Jeff



Re: [PATCH] testcase: Add testcase for tree-optimization/117341

2024-10-29 Thread Jeff Law




On 10/28/24 11:08 PM, Andrew Pinski wrote:

Even though PR 117341 was a duplicate of PR 116768, another
testcase, this time in C++, does not hurt to have.
The testcase is self-contained and does not directly use libstdc++
except for operator new (it does not even call delete).

Tested on x86_64-linux-gnu with it working.

PR tree-optimization/117341

gcc/testsuite/ChangeLog:

* g++.dg/torture/pr117341-1.C: New test.

OK
jeff



Re: [PATCH] Match: Fold pow calls to ldexp when possible [PR57492]

2024-10-29 Thread Richard Biener
On Mon, 28 Oct 2024, Soumya AR wrote:

> This patch transforms the following POW calls to equivalent LDEXP calls, as
> discussed in PR57492:
> 
> powi (2.0, i) -> ldexp (1.0, i)
> 
> a * powi (2.0, i) -> ldexp (a, i)
> 
> 2.0 * powi (2.0, i) -> ldexp (1.0, i + 1)
> 
> pow (powof2, i) -> ldexp (1.0, i * log2 (powof2))
> 
> powof2 * pow (2, i) -> ldexp (1.0, i + log2 (powof2))

For the multiplication cases, why not instead handle powof2 * ldexp (1., i)
-> ldexp (1., i + log2 (powof2)) and a * ldexp (1., i) -> ldexp (a, i)?
exp2 * ldexp (1., i) is another candidate.

So please split out the multiplication parts.
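
For reference, the identity behind these folds (my summary, not from the
patch): when c is a power of two, c * 2^i = 2^(log2 c) * 2^i
= 2^(i + log2 c), i.e. c * ldexp (1., i) == ldexp (1., i + log2 (c)); and
more generally a * 2^i == ldexp (a, i), exactly in IEEE arithmetic barring
overflow/underflow at the extremes of the exponent range.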

+ /* Simplify pow (powof2, i) to ldexp (1, i * log2 (powof2)). */

the below pattern handles POWI, not POW.

+ (simplify
+  (POWI REAL_CST@0 @1)
+  (with { HOST_WIDE_INT tmp = 0;
+ tree integer_arg1 = NULL_TREE; }
+  (if (integer_valued_real_p (@0)
+   && real_isinteger (&TREE_REAL_CST (@0), &tmp)
+   && integer_pow2p (integer_arg1 = build_int_cst (integer_type_node, tmp)))

  && tmp > 0
  && pow2p_hwi (tmp)

+(LDEXP { build_one_cst (type); }
+  (mult @1 { build_int_cst (integer_type_node,
+ tree_log2 (integer_arg1)); })

build_int_cst (integer_type_node, exact_log2 (tmp))

+ /* Simplify powi (2.0, i) to ldexp (1, i). */
+ (simplify
+  (POWI REAL_CST@0 @1)
+  (if (real_equal (TREE_REAL_CST_PTR (@0), &dconst2))
+   (LDEXP { build_one_cst (type); } @1)))
+

You'll have a duplicate pattern here; instead, merge them.  2.0
is a power of two, so I wonder why this separate pattern is needed.

Richard.

> 
> This is especially helpful for SVE architectures as LDEXP calls can be
> implemented using the FSCALE instruction, as seen in the following patch:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-September/664160.html
> 
> SPEC2017 was run with this patch, while there are no noticeable improvements,
> there are no non-noise regressions either.
> 
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
> 
> Signed-off-by: Soumya AR 
> 
> gcc/ChangeLog:
>   PR target/57492
>   * match.pd: Added patterns to fold certain calls to pow to ldexp.
> 
> gcc/testsuite/ChangeLog:
>   PR target/57492
>   * gcc.dg/tree-ssa/pow-to-ldexp.c: New test.
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH 6/7] RISC-V: Make vectorized memset handle more cases

2024-10-29 Thread Craig Blackmore



On 19/10/2024 14:05, Jeff Law wrote:



On 10/18/24 7:12 AM, Craig Blackmore wrote:

`expand_vec_setmem` only generated vectorized memset if it fitted into a
single vector store.  Extend it to generate a loop for longer and
unknown lengths.

The test cases now use -O1 so that they are not sensitive to scheduling.

gcc/ChangeLog:

* config/riscv/riscv-string.cc
(use_vector_stringop_p): Add comment.
(expand_vec_setmem): Use use_vector_stringop_p instead of
check_vectorise_memory_operation.  Add loop generation.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/setmem-1.c: Use -O1.  Expect a loop
instead of a libcall.  Add test for unknown length.
* gcc.target/riscv/rvv/base/setmem-2.c: Likewise.
* gcc.target/riscv/rvv/base/setmem-3.c: Likewise and expect smaller
lmul.
So why handle memset differently than the other mem* routines where we 
limit ourselves to what we can handle without needing loops?


My suspicion is that once we're moving enough data that we can't do it
with a single big-LMUL store, calling out to the library variant
probably isn't a big deal for memset.  Do you have data which suggests
otherwise?
I don't have data for this yet. My thinking was that the glibc and
newlib memset implementations are scalar and also do byte stores to
reach alignment, which is unnecessary on fast unaligned access targets.


This patch may still be useful in the meantime if I removed the loop 
generation parts as it would still allow us to generate vector setmem 
for smaller lengths than currently allowed.


Craig



Jeff




Re: [PATCH 6/6] simplify-rtx: Simplify ROTATE:HI (X:HI, 8) into BSWAP:HI (X)

2024-10-29 Thread Jeff Law




On 10/29/24 4:15 AM, Kyrylo Tkachov wrote:





Thanks, I’ll extend the comment when I commit the series. Would you be able to 
help with the review of the first one in the series by any chance?
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/05.html

It's in my queue :-)

jeff



Re: [PATCH 2/5] Vect: Introduce MASK_LEN_STRIDED_LOAD{STORE} to loop vectorizer

2024-10-29 Thread Richard Biener
On Wed, Oct 23, 2024 at 12:47 PM  wrote:
>
> From: Pan Li 
>
> This patch would like to allow generation of MASK_LEN_STRIDED_LOAD{STORE} IR
> for invariant stride memory access.  For example as below
>
> void foo (int * __restrict a, int * __restrict b, int stride, int n)
> {
> for (int i = 0; i < n; i++)
>   a[i*stride] = b[i*stride] + 100;
> }
>
> Before this patch:
>   66   │   _73 = .SELECT_VL (ivtmp_71, POLY_INT_CST [4, 4]);
>   67   │   _52 = _54 * _73;
>   68   │   vect__5.16_61 = .MASK_LEN_GATHER_LOAD (vectp_b.14_59, _58, 4, { 0, 
> ... }, { -1, ... }, _73, 0);
>   69   │   vect__7.17_63 = vect__5.16_61 + { 100, ... };
>   70   │   .MASK_LEN_SCATTER_STORE (vectp_a.18_67, _58, 4, vect__7.17_63, { 
> -1, ... }, _73, 0);
>   71   │   vectp_b.14_60 = vectp_b.14_59 + _52;
>   72   │   vectp_a.18_68 = vectp_a.18_67 + _52;
>   73   │   ivtmp_72 = ivtmp_71 - _73;
>
> After this patch:
>   60   │   _70 = .SELECT_VL (ivtmp_68, POLY_INT_CST [4, 4]);
>   61   │   _52 = _54 * _70;
>   62   │   vect__5.16_58 = .MASK_LEN_STRIDED_LOAD (vectp_b.14_56, _55, { 0, 
> ... }, { -1, ... }, _70, 0);
>   63   │   vect__7.17_60 = vect__5.16_58 + { 100, ... };
>   64   │   .MASK_LEN_STRIDED_STORE (vectp_a.18_64, _55, vect__7.17_60, { -1, 
> ... }, _70, 0);
>   65   │   vectp_b.14_57 = vectp_b.14_56 + _52;
>   66   │   vectp_a.18_65 = vectp_a.18_64 + _52;
>   67   │   ivtmp_69 = ivtmp_68 - _70;
>
> The below test suites are passed for this patch:
> * The x86 bootstrap test.
> * The x86 fully regression test.
> * The riscv fully regression test.
>
> gcc/ChangeLog:
>
> * tree-vect-stmts.cc (vect_get_strided_load_store_ops): Handle
> MASK_LEN_STRIDED_LOAD{STORE} after supported check.
> (vectorizable_store): Generate MASK_LEN_STRIDED_LOAD when the offset
> of gater is not vector type.
> (vectorizable_load): Ditto but for store.
>
> Signed-off-by: Pan Li 
> Co-Authored-By: Juzhe-Zhong 
> ---
>  gcc/tree-vect-stmts.cc | 45 +-
>  1 file changed, 36 insertions(+), 9 deletions(-)
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index e7f14c3144c..78d66a4ef9d 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -2950,6 +2950,15 @@ vect_get_strided_load_store_ops (stmt_vec_info 
> stmt_info,
>*dataref_bump = cse_and_gimplify_to_preheader (loop_vinfo, bump);
>  }
>
> +  internal_fn ifn
> += DR_IS_READ (dr) ? IFN_MASK_LEN_STRIDED_LOAD : 
> IFN_MASK_LEN_STRIDED_STORE;
> +  if (direct_internal_fn_supported_p (ifn, vectype, OPTIMIZE_FOR_SPEED))
> +{
> +  *vec_offset = cse_and_gimplify_to_preheader (loop_vinfo,
> +  unshare_expr (DR_STEP 
> (dr)));
> +  return;
> +}

I'll note that to get here the target has to claim support for general
gather/scatter; I guess that's OK for now, and for RISC-V specifically.

OK.

Thanks,
Richard.

> +
>/* The offset given in GS_INFO can have pointer type, so use the element
>   type of the vector instead.  */
>tree offset_type = TREE_TYPE (gs_info->offset_vectype);
> @@ -9194,10 +9203,20 @@ vectorizable_store (vec_info *vinfo,
>
>   gcall *call;
>   if (final_len && final_mask)
> -   call = gimple_build_call_internal
> -(IFN_MASK_LEN_SCATTER_STORE, 7, dataref_ptr,
> - vec_offset, scale, vec_oprnd, final_mask,
> - final_len, bias);
> +   {
> + if (VECTOR_TYPE_P (TREE_TYPE (vec_offset)))
> +   call = gimple_build_call_internal (
> + IFN_MASK_LEN_SCATTER_STORE, 7, dataref_ptr,
> + vec_offset, scale, vec_oprnd, final_mask, final_len,
> + bias);
> + else
> +   /* Non-vector offset indicates that we prefer to take
> +  MASK_LEN_STRIDED_STORE instead of the
> +  IFN_MASK_SCATTER_STORE with direct stride arg.  */
> +   call = gimple_build_call_internal (
> + IFN_MASK_LEN_STRIDED_STORE, 6, dataref_ptr,
> + vec_offset, vec_oprnd, final_mask, final_len, bias);
> +   }
>   else if (final_mask)
> call = gimple_build_call_internal
>  (IFN_MASK_SCATTER_STORE, 5, dataref_ptr,
> @@ -11194,11 +11213,19 @@ vectorizable_load (vec_info *vinfo,
>
>   gcall *call;
>   if (final_len && final_mask)
> -   call
> - = gimple_build_call_internal (IFN_MASK_LEN_GATHER_LOAD, 
> 7,
> -   dataref_ptr, vec_offset,
> -   scale, zero, final_mask,
> -

Re: [PATCH v2 5/8] aarch64: Add masked-load else operands.

2024-10-29 Thread Richard Sandiford
"Robin Dapp"  writes:
>>> For the lack of a better idea I used a function call property to specify
>>> whether a builtin needs an else operand or not.  Somebody with better
>>> knowledge of the aarch64 target can surely improve that.
>>
>> Yeah, those flags are really for source-level/gimple-level attributes.
>> Would it work to pass a new parameter to use_contiguous_load instead?
>
> I tried this first (before adding the call property) and immediate fallout
> from it was the direct expansion of sve intrinsics failing.  I didn't touch
> those.  Should we amend them with a zero else value or is there another
> way?

Could you give an example of what you mean?  In the patch, it seemed
like whether a class's call_properties returned CP_HAS_ELSE or not was
a static property of the class.  So rather than doing:

  unsigned int
  call_properties (const function_instance &) const override
  {
return ... | CP_HAS_ELSE;
  }

...
/* Add the else operand.  */
e.args.quick_push (CONST0_RTX (e.vector_mode (1)));
return e.use_contiguous_load_insn (icode);

I thought we could instead make the interface:

rtx
function_expander::use_contiguous_load_insn (insn_code icode, bool has_else)

with has_else being declared default-false.  Then use_contiguous_load_insn
could use:

  if (has_else)
add_input_operand (icode, const0_rtx);

(add_input_operand should take care of broadcasting the zero to the
right vector mode.)

The caller would then just be:

return e.use_contiguous_load_insn (icode, true);

without any changes to e.args.

Is that what you tried?

Thanks,
Richard


Re: [PATCH][AARCH64][PR115258]Fix excess moves

2024-10-29 Thread Richard Sandiford
Kugan Vivekanandarajah  writes:
> Hi,
>
> The fix for PR115258 causes a performance regression in some of the TSVC
> kernels by adding additional mov instructions.
> This patch fixes this.
>
> i.e., When operands are equal, it is likely that all of them get the same 
> register similar to:
> (insn 19 15 20 3 (set (reg:V2x16QI 62 v30 [117])
> (unspec:V2x16QI [
> (reg:V16QI 62 v30 [orig:102 vect__1.7 ] [102])
> (reg:V16QI 62 v30 [orig:102 vect__1.7 ] [102])
> ] UNSPEC_CONCAT)) "tsvc.c":11:12 4871 {aarch64_combinev16qi}
>  (nil))
>
> In this case, aarch64_split_combinev16qi would split it with one insn. Hence,
> when the operands are equal, split after reload.
>
> Bootstrapped and regression tested on aarch64-linux-gnu. Is this ok for trunk?

Thanks for the patch.  I'm not sure this is the right fix though.
I'm planning to have a look at the PR once stage 1 closes.

Richard

>
> Thanks,
> Kugan
>
> From ace50a5eb5d459901325ff17ada83791cef0a354 Mon Sep 17 00:00:00 2001
> From: Kugan 
> Date: Wed, 23 Oct 2024 05:03:02 +0530
> Subject: [PATCH] [PATCH][AARCH64][PR115258]Fix excess moves
>
> When operands are equal, it is likely that all of them get the same register
> similar to:
> (insn 19 15 20 3 (set (reg:V2x16QI 62 v30 [117])
> (unspec:V2x16QI [
> (reg:V16QI 62 v30 [orig:102 vect__1.7 ] [102])
> (reg:V16QI 62 v30 [orig:102 vect__1.7 ] [102])
> ] UNSPEC_CONCAT)) "tsvc.c":11:12 4871 {aarch64_combinev16qi}
>  (nil))
>
> In this case, aarch64_split_combinev16qi would split it with one insn. Hence,
> when the operands are equal, prefer splitting after reload.
>
>   PR target/115258
>
> gcc/ChangeLog:
>
>   PR target/115258
>   * config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Restrict
>   the split before reload if operands are equal.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/pr115258-2.c: New test.
>
> Signed-off-by: Kugan Vivekanandarajah 
> ---
>  gcc/config/aarch64/aarch64-simd.md|  2 +-
>  gcc/testsuite/gcc.target/aarch64/pr115258-2.c | 18 ++
>  2 files changed, 19 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr115258-2.c
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 2a44aa3fcc3..e56100b3766 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -8525,7 +8525,7 @@
>   UNSPEC_CONCAT))]
>"TARGET_SIMD"
>"#"
> -  "&& 1"
> +  "&& reload_completed || !rtx_equal_p (operands[1], operands[2])"
>[(const_int 0)]
>  {
>aarch64_split_combinev16qi (operands);
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr115258-2.c 
> b/gcc/testsuite/gcc.target/aarch64/pr115258-2.c
> new file mode 100644
> index 000..f28190cef32
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr115258-2.c
> @@ -0,0 +1,18 @@
> +
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mcpu=neoverse-v2" } */
> +
> +extern __attribute__((aligned(64))) float a[32000], b[32000];
> +int dummy(float[32000], float[32000], float);
> +
> +void s1112() {
> +
> +  for (int nl = 0; nl < 10 * 3; nl++) {
> +for (int i = 32000 - 1; i >= 0; i--) {
> +  a[i] = b[i] + (float)1.;
> +}
> +dummy(a, b, 0.);
> +  }
> +}
> +
> +/* { dg-final { scan-assembler-times "mov\\tv\[0-9\]+\.16b, v\[0-9\]+\.16b" 
> 2 } } */


Re: [PATCH] config: add -Werror=lto-type-mismatch, odr to bootstrap-lto*

2024-10-29 Thread Richard Biener
On Mon, Oct 28, 2024 at 12:22 PM Sam James  wrote:
>
> Sam James  writes:
>
> > Sam James  writes:
> >
> >> Add -Werror=lto-type-mismatch,odr to bootstrap-lto* configurations to
> >> help stop LTO breakage/correctness issues sneaking in.
> >>
> >> We discussed -Werror=strict-aliasing but it runs early and doesn't
> >> give better diagnostics with LTO so left it out.
> >>
> >> config/ChangeLog:
> >>  PR rust/108087
> >>  PR ada/115917
> >>  PR modula2/114529
> >>  PR modula2/116181
> >>  PR other/116182
> >>
> >>  * bootstrap-lto-lean.mk: Pass -Werror=lto-type-mismatch,odr.
> >>  * bootstrap-lto-noplugin.mk: Ditto.
> >>  * bootstrap-lto.mk: Ditto.
> >> ---
> >> OK once PR117038 is fixed? (It snuck in yesterday).
> >
> > ... which is now fixed.
> >
>
> Ping.

I think we want to arrange for this to not be active when release checking
is enabled - specific target configurations we do not test may otherwise
result in failed builds for releases.  Tying it to --enable-werror somehow
would be OK I guess.

Richard.



> >>
> >> Bootstrapped all languages on x86_64-pc-linux-gnu.
> >>
> >>  config/bootstrap-lto-lean.mk |  8 +---
> >>  config/bootstrap-lto-noplugin.mk | 10 +-
> >>  config/bootstrap-lto.mk  | 10 +-
> >>  3 files changed, 15 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/config/bootstrap-lto-lean.mk b/config/bootstrap-lto-lean.mk
> >> index 42cb3394c70b..f176390ba21a 100644
> >> --- a/config/bootstrap-lto-lean.mk
> >> +++ b/config/bootstrap-lto-lean.mk
> >> @@ -1,10 +1,12 @@
> >>  # This option enables LTO for stage4 and LTO for generators in stage3 
> >> with profiledbootstrap.
> >>  # Otherwise, LTO is used in only stage3.
> >>
> >> -STAGE3_CFLAGS += -flto=jobserver
> >> +
> >> +STAGE2_CFLAGS += -flto=jobserver -Werror=lto-type-mismatch -Werror=odr
> >> +STAGE3_CFLAGS += -flto=jobserver -Werror=lto-type-mismatch -Werror=odr
> >>  override STAGEtrain_CFLAGS := $(filter-out 
> >> -flto=jobserver,$(STAGEtrain_CFLAGS))
> >> -STAGEtrain_GENERATOR_CFLAGS += -flto=jobserver
> >> -STAGEfeedback_CFLAGS += -flto=jobserver
> >> +STAGEtrain_GENERATOR_CFLAGS += -flto=jobserver -Werror=lto-type-mismatch 
> >> -Werror=odr
> >> +STAGEfeedback_CFLAGS += -flto=jobserver -Werror=lto-type-mismatch 
> >> -Werror=odr
> >>
> >>  # assumes the host supports the linker plugin
> >>  LTO_AR = $$r/$(HOST_SUBDIR)/prev-gcc/gcc-ar$(exeext) 
> >> -B$$r/$(HOST_SUBDIR)/prev-gcc/
> >> diff --git a/config/bootstrap-lto-noplugin.mk 
> >> b/config/bootstrap-lto-noplugin.mk
> >> index 0f50708e49d1..660ca60dbd3d 100644
> >> --- a/config/bootstrap-lto-noplugin.mk
> >> +++ b/config/bootstrap-lto-noplugin.mk
> >> @@ -1,9 +1,9 @@
> >>  # This option enables LTO for stage2 and stage3 on
> >>  # hosts without linker plugin support.
> >>
> >> -STAGE2_CFLAGS += -flto=jobserver -frandom-seed=1 -ffat-lto-objects
> >> -STAGE3_CFLAGS += -flto=jobserver -frandom-seed=1 -ffat-lto-objects
> >> -STAGEprofile_CFLAGS += -flto=jobserver -frandom-seed=1
> >> -STAGEtrain_CFLAGS += -flto=jobserver -frandom-seed=1
> >> -STAGEfeedback_CFLAGS += -flto=jobserver -frandom-seed=1
> >> +STAGE2_CFLAGS += -flto=jobserver -frandom-seed=1 -ffat-lto-objects 
> >> -Werror=lto-type-mismatch -Werror=odr
> >> +STAGE3_CFLAGS += -flto=jobserver -frandom-seed=1 -ffat-lto-objects 
> >> -Werror=lto-type-mismatch -Werror=odr
> >> +STAGEprofile_CFLAGS += -flto=jobserver -frandom-seed=1 
> >> -Werror=lto-type-mismatch -Werror=odr
> >> +STAGEtrain_CFLAGS += -flto=jobserver -frandom-seed=1 
> >> -Werror=lto-type-mismatch -Werror=odr
> >> +STAGEfeedback_CFLAGS += -flto=jobserver -frandom-seed=1 
> >> -Werror=lto-type-mismatch -Werror=odr
> >>  do-compare = /bin/true
> >> diff --git a/config/bootstrap-lto.mk b/config/bootstrap-lto.mk
> >> index 1ddb1d870bab..9f76c03f8a68 100644
> >> --- a/config/bootstrap-lto.mk
> >> +++ b/config/bootstrap-lto.mk
> >> @@ -1,10 +1,10 @@
> >>  # This option enables LTO for stage2 and stage3 in slim mode
> >>
> >> -STAGE2_CFLAGS += -flto=jobserver -frandom-seed=1
> >> -STAGE3_CFLAGS += -flto=jobserver -frandom-seed=1
> >> -STAGEprofile_CFLAGS += -flto=jobserver -frandom-seed=1
> >> -STAGEtrain_CFLAGS += -flto=jobserver -frandom-seed=1
> >> -STAGEfeedback_CFLAGS += -flto=jobserver -frandom-seed=1
> >> +STAGE2_CFLAGS += -flto=jobserver -frandom-seed=1 
> >> -Werror=lto-type-mismatch -Werror=odr
> >> +STAGE3_CFLAGS += -flto=jobserver -frandom-seed=1 
> >> -Werror=lto-type-mismatch -Werror=odr
> >> +STAGEprofile_CFLAGS += -flto=jobserver -frandom-seed=1 
> >> -Werror=lto-type-mismatch -Werror=odr
> >> +STAGEtrain_CFLAGS += -flto=jobserver -frandom-seed=1 
> >> -Werror=lto-type-mismatch -Werror=odr
> >> +STAGEfeedback_CFLAGS += -flto=jobserver -frandom-seed=1 
> >> -Werror=lto-type-mismatch -Werror=odr
> >>
> >>  # assumes the host supports the linker plugin
> >>  LTO_AR = $$r/$(HOST_SUBDIR)/prev-gcc/gcc-ar$(exeext) 
> >> -B$$r/$(HOST_SUBDIR)/prev-gcc/
> >>
> >> base-commit: 9df0772d50d8f8a75389d

Re: [Patch] AMD GCN: Set HSA_XNACK for USM and 'xnack+' / 'xnack-'

2024-10-29 Thread Andrew Stubbs

On 29/10/2024 12:10, Tobias Burnus wrote:

Hi Andrew,

Am 29.10.24 um 13:07 schrieb Andrew Stubbs:

On 29/10/2024 11:44, Tobias Burnus wrote:

This somewhat matches what is done in OG13 and in Andrew's patch at
https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655951.html
albeit the code is somewhat different.
[For some reasons, this code is not in OG14 ?!?]

...
This conflicts with my patch that already does (some of) this that is 
submitted as part of the USM series that is still awaiting review.


https://patchwork.sourceware.org/project/gcc/ 
patch/20240628102449.562467-6-...@baylibre.com/


Well, if you go to the link above, it shows the same patch …


So, can we get the USM stuff reviewed and cleared before we deliberately 
introduce conflicts?


BTW, it's not on OG14 because it was delayed while I was reworking it to 
repost. I should have pushed the series there at that time, but I 
forgot. :-(  I should fix that.


Andrew


Re: [PATCH] Match: Optimize log (x) CMP CST and exp (x) CMP CST operations

2024-10-29 Thread Richard Biener
On Mon, 28 Oct 2024, Soumya AR wrote:

> This patch implements transformations for the following optimizations.
> 
> logN(x) CMP CST -> x CMP expN(CST)
> expN(x) CMP CST -> x CMP logN(CST)
> 
> For example:
> 
> int
> foo (float x)
> {
>   return __builtin_logf (x) < 0.0f;
> }
> 
> can just be:
> 
> int
> foo (float x)
> {
>   return x < 1.0f;
> } 
> 
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?

+   (for cmp (lt le gt ge eq ne)
+(for logs (LOG LOG2 LOG10)
+exps (EXP EXP2 EXP10)
+/* Simplify logN (x) CMP CST into x CMP expN (CST) */
+(simplify
+(cmp:c (logs:s @0) @1)
+ (cmp @0 (exps @1)))
+
+/* Simplify expN (x) CMP CST into x CMP logN (CST) */
+(simplify
+(cmp:c (exps:s @0) @1)
+ (cmp @0 (logs @1))

this doesn't restrict @1 to be constant.  You should use

 (cmp:c (exps:s @0) REAL_CST@1)

I think this transform is also very susceptible to rounding
issues - esp. using it for eq and ne looks very dangerous
to me.  Unless you check a roundtrip through exp/log gets
you back exactly the original constant.

I think the compare kinds "most safe" would be le and ge.

You can look at fold-const-call.cc:do_mpfr_arg1, mpfr gets
you the information on whether the result is exact for example.
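
A minimal illustration of the rounding hazard (my example, not from the
patch):

  #include <math.h>
  #include <stdio.h>

  int main (void)
  {
    float cst = 0.1f;
    float x = expf (cst);  /* rounded to float precision */
    /* The transform would turn 'logf (x) == cst' into 'x == expf (cst)',
       but the round trip through expf/logf is not exact, so the two
       compares can disagree:  */
    printf ("%d\n", logf (x) == cst);  /* not guaranteed to print 1 */
    return 0;
  }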

Richard.


> Signed-off-by: Soumya AR 
> 
> gcc/ChangeLog:
> 
>   * match.pd: Fold logN(x) CMP CST -> x CMP expN(CST)
>   and expN(x) CMP CST -> x CMP logN(CST)
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/tree-ssa/log_exp.c: New test.
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


[to-be-committed][RISC-V] Aggressively hoist VXRM assignments

2024-10-29 Thread Jeff Law
So a while back I was looking at pixel_avg for RISC-V where we try to 
use vaaddu for the halfword-ceiling-average step.  The problem with 
vaaddu is that you must set VXRM to a suitable rounding mode as it has 
an undetermined state at function entry or after a function call.


It turns out some designs will fully flush their pipelines on a write to 
VXRM which you can imagine is incredibly expensive.


VXRM assignments are handled by an LCM based algorithm to find "optimal" 
placement points based on what insns in the stream need VXRM assignments 
and the particular mode they need.


Unfortunately in pixel_avg an LCM algorithm only allows hoisting out of 
the innermost loop, but not the outer loop.  The core issue is that LCM 
does not allow any speculation and there are paths which would bypass 
the inner loop (which don't actually trigger at runtime IIRC).


The expectation is that VXRM assignments should be exceedingly rare and 
needing more than one mode even rarer.  So hoisting more aggressively 
seems like a reasonable thing to do, but we don't want to burn too much 
time trying to do something fancy.


So what this patch does is scan the IL once collecting any VXRM needs. 
If the current function has precisely one VXRM mode needed, then we 
pretend (for the sake of LCM) that the first instruction in the function 
also has that need.


By doing so the VXRM assignment is essentially anticipated everywhere in
the function.  The standard LCM algorithm is run and has enough 
information to hoist the VXRM assignment more aggressively, most often 
to the prologue.


This helps the BPI in a measurable way (IIRC it was 2-3%).  It probably 
helps some of the SiFive designs, but I've been told they still benefit 
from the longer sequence of shifts & adds, hoisting just isn't enough 
for those designs.  The Ventana design basically doesn't care where the 
VXRM assignment is.  Point is we may want to have a tuning knob for the 
patterns which need VXRM (vaadd[u], vasub[u]) at some point in the near 
future.


Bootstrapped and regression tested on riscv64 and regression tested on 
riscv32-elf and riscv64-elf.  We've been using this internally for a 
while a while on spec as well.   Obviously I'll wait for the pre-commit 
tester to do its thing.


Jeff

* riscv.cc (singleton_vxrm_need): New function.
(riscv_mode_needed): See if there is a singleton need and if so,
claim it happens on the first insn in the chain.

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 0cf7ee4904d..c9b48d6fac0 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -11772,6 +11772,65 @@ riscv_frm_mode_needed (rtx_insn *cur_insn, int code)
   return mode;
 }
 
+/* If the current function needs a single VXRM mode, return it.  Else
+   return VXRM_MODE_NONE.
+
+   This is called on the first insn in the chain and scans the full function
+   once to collect VXRM mode settings.  If a single mode is needed, it will
+   often be better to set it once at the start of the function rather than
+   at an anticipation point.  */
+static int
+singleton_vxrm_need (void)
+{
+  /* Only needed for vector code.  */
+  if (!TARGET_VECTOR)
+return VXRM_MODE_NONE;
+
+  /* If ENTRY has more than one successor, then don't optimize, just to
+ keep things simple.  */
+  if (EDGE_COUNT (ENTRY_BLOCK_PTR_FOR_FN (cfun)->succs) > 1)
+return VXRM_MODE_NONE;
+
+  /* Walk the IL noting if VXRM is needed and if there's more than one
+ mode needed.  */
+  bool found = false;
+  int saved_vxrm_mode;
+  for (rtx_insn *insn = get_insns (); insn; insn = NEXT_INSN (insn))
+{
+  if (!INSN_P (insn) || DEBUG_INSN_P (insn))
+   continue;
+
+  int code = recog_memoized (insn);
+  if (code < 0)
+   continue;
+
+  int vxrm_mode = get_attr_vxrm_mode (insn);
+  if (vxrm_mode == VXRM_MODE_NONE)
+   continue;
+
+  /* If this is the first VXRM need, note it.  */
+  if (!found)
+   {
+ saved_vxrm_mode = vxrm_mode;
+ found = true;
+ continue;
+   }
+
+  /* Not the first VXRM need.  If this is different than
+the saved need, then we're not going to be able to
+optimize and we can stop scanning now.  */
+  if (saved_vxrm_mode != vxrm_mode)
+   return VXRM_MODE_NONE;
+
+  /* Same mode as we've seen, keep scanning.  */
+}
+
+  /* If we got here we scanned the whole function.  If we found
+ some VXRM state, then we can optimize.  If we didn't find
+ VXRM state, then there's nothing to optimize.  */
+  return found ? saved_vxrm_mode : VXRM_MODE_NONE;
+}
+
 /* Return mode that entity must be switched into
prior to the execution of insn.  */
 
@@ -11783,6 +11842,16 @@ riscv_mode_needed (int entity, rtx_insn *insn, 
HARD_REG_SET)
   switch (entity)
 {
 case RISCV_VXRM:
+  /* If CUR_INSN is the first insn in the function, then determine if we
+want to signal a need in ENTRY->suc

Re: [PATCH 16/22] aarch64: libgcc: add GCS marking to asm

2024-10-29 Thread Yury Khrustalev
On Thu, Oct 24, 2024 at 05:31:58PM +0100, Richard Sandiford wrote:
> Yury Khrustalev  writes:
> > From: Szabolcs Nagy 
> >
> > libgcc/ChangeLog:
> >
> > * config/aarch64/aarch64-asm.h (FEATURE_1_GCS): Define.
> > (GCS_FLAG): Define if GCS is enabled.
> > (GNU_PROPERTY): Add GCS_FLAG.
> 
> This might be a daft question, but don't we also want to use the
> new build attributes, where supported?  Or is that handled separately?
> 
> Same question for the other libraries.

The new build attributes will be handled separately as BA support can
be seen as orthogonal to GCS.

Thanks,
Yury



Re: Ping: [PATCH] Always set SECTION_RELRO for or .data.rel.ro{,.local} [PR116887]

2024-10-29 Thread Jeff Law




On 10/29/24 7:10 AM, Xi Ruoyao wrote:

On Fri, 2024-10-11 at 02:54 +0800, Xi Ruoyao wrote:

At least two ports (hppa and loongarch) need to set SECTION_RELRO for
.data.rel.ro{,.local} in section_type_flags (PR52999 and PR116887), and
I cannot see a reason not to just set it in the generic code.
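
Roughly, the generic change amounts to (a sketch of the idea, not the
exact hunk):

  /* In varasm.cc:default_section_type_flags, force SECTION_RELRO for
     the well-known relro section names on all targets.  */
  if (strcmp (name, ".data.rel.ro") == 0
      || strcmp (name, ".data.rel.ro.local") == 0)
    flags |= SECTION_RELRO;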

With this applied we can also remove the hppa-specific
pa_section_type_flags in a future patch.

gcc/ChangeLog:

PR target/116887
* varasm.cc (default_section_type_flags): Always set
SECTION_RELRO if name is .data.rel.ro{,.local}.

gcc/testsuite/ChangeLog:

PR target/116887
* gcc.dg/pr116887.c: New test.


Ping.

Sorry, I missed this the first time around.  Thanks for pinging.

OK for the trunk.  Though we do need to keep an eye out for regressions. 
 We've had some surprises in the past with this kind of change.


jeff



[PATCH] c-family: Handle RAW_DATA_CST in complete_array_type [PR117313]

2024-10-29 Thread Jakub Jelinek
Hi!

The following testcase ICEs because
add_flexible_array_elts_to_size -> complete_array_type
is done only after braced_lists_to_strings, which optimizes a
RAW_DATA_CST surrounded by INTEGER_CSTs into a larger RAW_DATA_CST
covering even the boundaries, while I thought it was done before
that.
So, RAW_DATA_CST now can be the last constructor_elt in a CONSTRUCTOR
and so we need the function to take it into account (handle it as
RAW_DATA_CST standing for RAW_DATA_LENGTH consecutive elements).

The function wants to support CONSTRUCTORs both without indexes and with
them (for non-RAW_DATA_CST elts it was just adding 1 for the current
index).  So, if the RAW_DATA_CST elt has ce->index, we need to add
RAW_DATA_LENGTH (ce->value) - 1, while if it doesn't (and it isn't the
cnt == 0 case where curindex is 0), add that plus 1, i.e.
RAW_DATA_LENGTH (ce->value).
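(Concretely: '{ [2] = <RAW_DATA_CST of length 8> }' covers indexes 2..9,
so curindex 2 must advance by 7; '{ 0, <RAW_DATA_CST of length 8> }'
covers indexes 1..8, so curindex 0 must advance by 8.)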

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2024-10-29  Jakub Jelinek  

PR c/117313
gcc/c-family/
* c-common.cc (complete_array_type): For RAW_DATA_CST elements
advance curindex by RAW_DATA_LENGTH or one less than that if
ce->index is non-NULL.  Handle even the first element if
it is RAW_DATA_CST.  Formatting fix.
gcc/testsuite/
* c-c++-common/init-6.c: New test.

--- gcc/c-family/c-common.cc.jj 2024-10-27 16:39:55.090871381 +0100
+++ gcc/c-family/c-common.cc2024-10-28 12:30:01.215814079 +0100
@@ -7044,7 +7044,8 @@ complete_array_type (tree *ptype, tree i
{
  int eltsize
= int_size_in_bytes (TREE_TYPE (TREE_TYPE (initial_value)));
- maxindex = size_int (TREE_STRING_LENGTH (initial_value)/eltsize - 1);
+ maxindex = size_int (TREE_STRING_LENGTH (initial_value) / eltsize
+  - 1);
}
   else if (TREE_CODE (initial_value) == CONSTRUCTOR)
{
@@ -7059,23 +7060,25 @@ complete_array_type (tree *ptype, tree i
  else
{
  tree curindex;
- unsigned HOST_WIDE_INT cnt;
+ unsigned HOST_WIDE_INT cnt = 1;
  constructor_elt *ce;
  bool fold_p = false;
 
  if ((*v)[0].index)
maxindex = (*v)[0].index, fold_p = true;
+ if (TREE_CODE ((*v)[0].value) == RAW_DATA_CST)
+   cnt = 0;
 
  curindex = maxindex;
 
- for (cnt = 1; vec_safe_iterate (v, cnt, &ce); cnt++)
+ for (; vec_safe_iterate (v, cnt, &ce); cnt++)
{
  bool curfold_p = false;
  if (ce->index)
curindex = ce->index, curfold_p = true;
- else
+ if (!ce->index || TREE_CODE (ce->value) == RAW_DATA_CST)
{
- if (fold_p)
+ if (fold_p || curfold_p)
{
  /* Since we treat size types now as ordinary
 unsigned types, we need an explicit overflow
@@ -7083,9 +7086,17 @@ complete_array_type (tree *ptype, tree i
  tree orig = curindex;
  curindex = fold_convert (sizetype, curindex);
  overflow_p |= tree_int_cst_lt (curindex, orig);
+ curfold_p = false;
}
- curindex = size_binop (PLUS_EXPR, curindex,
-size_one_node);
+ if (TREE_CODE (ce->value) == RAW_DATA_CST)
+   curindex
+ = size_binop (PLUS_EXPR, curindex,
+   size_int (RAW_DATA_LENGTH (ce->value)
+ - ((ce->index || !cnt)
+? 1 : 0)));
+ else
+   curindex = size_binop (PLUS_EXPR, curindex,
+  size_one_node);
}
  if (tree_int_cst_lt (maxindex, curindex))
maxindex = curindex, fold_p = curfold_p;
--- gcc/testsuite/c-c++-common/init-6.c.jj  2024-10-28 12:35:59.526803017 
+0100
+++ gcc/testsuite/c-c++-common/init-6.c 2024-10-28 12:35:50.394930729 +0100
@@ -0,0 +1,29 @@
+/* PR c/117313 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+struct S { unsigned a; const unsigned char b[]; };
+struct S s = {
+  1,
+  { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x7e, 0x81, 0xa5,
+0x81, 0xbd, 0x99, 0x81, 0x7e, 0x7e, 0xff, 0xdb, 0xff, 0xc3, 0xe7,
+0xff, 0x7e, 0x6c, 0xfe, 0xfe, 0xfe, 0x7c, 0x38, 0x10, 0x00, 0x10,
+0x38, 0x00, 0x00, 0x38, 0x6c, 0x6c, 0x38, 0x00, 0x00, 0x00, 0x00,
+0x00, 0x00, 0x00, 0x18, 0x18, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+0x18, 0x00, 0x00, 0x00, 0x00, 0x0c, 0x0c, 0x0c, 0xec, 0x6c, 0x3c,
+  }
+};
+struct S t = {
+  2,
+  { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x7e, 

[PATCH 1/5] Match: Simplify branch form 4 of unsigned SAT_ADD into branchless

2024-10-29 Thread pan2 . li
From: Pan Li 

There are various forms of the unsigned SAT_ADD.  Some of them are
complicated while others are cheap.  This patch simplifies
the complicated form into the cheap one.  For example as below:

From the form 4 (branch):
  SAT_U_ADD = (X + Y) < x ? -1 : (X + Y).

To (branchless):
  SAT_U_ADD = (X + Y) | - ((X + Y) < X).

  #define T uint8_t

  T sat_add_u_1 (T x, T y)
  {
return (T)(x + y) < x ? -1 : (x + y);
  }

Before this patch:
   1   │ uint8_t sat_add_u_1 (uint8_t x, uint8_t y)
   2   │ {
   3   │   uint8_t D.2809;
   4   │
   5   │   _1 = x + y;
   6   │   if (x <= _1) goto ; else goto ;
   7   │   :
   8   │   D.2809 = x + y;
   9   │   goto ;
  10   │   :
  11   │   D.2809 = 255;
  12   │   :
  13   │   return D.2809;
  14   │ }

After this patch:
   1   │ uint8_t sat_add_u_1 (uint8_t x, uint8_t y)
   2   │ {
   3   │   uint8_t D.2809;
   4   │
   5   │   _1 = x + y;
   6   │   _2 = x + y;
   7   │   _3 = x > _2;
   8   │   _4 = (unsigned char) _3;
   9   │   _5 = -_4;
  10   │   D.2809 = _1 | _5;
  11   │   return D.2809;
  12   │ }
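
In plain C, the branchless form corresponds to (illustrative sketch only):

  #include <stdint.h>

  uint8_t sat_add_u (uint8_t x, uint8_t y)
  {
    uint8_t sum = x + y;
    /* (sum < x) detects wraparound; negating maps 1 to an all-ones mask,
       so the result saturates to 0xff on overflow.  */
    return sum | -(uint8_t) (sum < x);
  }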

The below test suites are passed for this patch.
* The rv64gcv fully regression test.
* The x86 bootstrap test.
* The x86 fully regression test.

gcc/ChangeLog:

* match.pd: Remove unsigned branch form 4 for SAT_ADD, and
add simplify to branchless instead.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/sat_u_add-simplify-2-u16.c: New test.
* gcc.dg/tree-ssa/sat_u_add-simplify-2-u32.c: New test.
* gcc.dg/tree-ssa/sat_u_add-simplify-2-u64.c: New test.
* gcc.dg/tree-ssa/sat_u_add-simplify-2-u8.c: New test.

Signed-off-by: Pan Li 
---
 gcc/match.pd  | 11 +++
 .../gcc.dg/tree-ssa/sat_u_add-simplify-2-u16.c| 15 +++
 .../gcc.dg/tree-ssa/sat_u_add-simplify-2-u32.c| 15 +++
 .../gcc.dg/tree-ssa/sat_u_add-simplify-2-u64.c| 15 +++
 .../gcc.dg/tree-ssa/sat_u_add-simplify-2-u8.c | 15 +++
 5 files changed, 67 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u16.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u32.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u64.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u8.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 809c717bc86..4d1143b6ec3 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3154,10 +3154,13 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   && types_match (type, @0, @1))
   (bit_ior @2 (negate (convert (lt @2 @0))
 
-/* Unsigned saturation add, case 4 (branch with lt):
-   SAT_U_ADD = (X + Y) < x ? -1 : (X + Y).  */
-(match (unsigned_integer_sat_add @0 @1)
- (cond^ (lt (usadd_left_part_1@2 @0 @1) @0) integer_minus_onep @2))
+/* Simplify SAT_U_ADD to the cheap form
+   From: SAT_U_ADD = (X + Y) < x ? -1 : (X + Y).
+   To:   SAT_U_ADD = (X + Y) | - ((X + Y) < X).  */
+(simplify (cond (lt (plus:c@2 @0 @1) @0) integer_minus_onep @2)
+ (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
+  && types_match (type, @0, @1))
+  (bit_ior @2 (negate (convert (lt @2 @0))
 
 /* Unsigned saturation add, case 5 (branch with eq .ADD_OVERFLOW):
SAT_U_ADD = REALPART_EXPR <.ADD_OVERFLOW> == 0 ? .ADD_OVERFLOW : -1.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u16.c 
b/gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u16.c
new file mode 100644
index 000..6e327f58d46
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u16.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-gimple-details" } */
+
+#include 
+
+#define T uint16_t
+
+T sat_add_u_1 (T x, T y)
+{
+  return (T)(x + y) < x ? -1 : (x + y);
+}
+
+/* { dg-final { scan-tree-dump-not " if " "gimple" } } */
+/* { dg-final { scan-tree-dump-not " else " "gimple" } } */
+/* { dg-final { scan-tree-dump-not " goto " "gimple" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u32.c 
b/gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u32.c
new file mode 100644
index 000..17b09396b69
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u32.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-gimple-details" } */
+
+#include 
+
+#define T uint32_t
+
+T sat_add_u_1 (T x, T y)
+{
+  return (T)(x + y) < x ? -1 : (x + y);
+}
+
+/* { dg-final { scan-tree-dump-not " if " "gimple" } } */
+/* { dg-final { scan-tree-dump-not " else " "gimple" } } */
+/* { dg-final { scan-tree-dump-not " goto " "gimple" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u64.c 
b/gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u64.c
new file mode 100644
index 000..8db9b427a65
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u64.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-gim

[PATCH 4/5] Match: Remove usadd_left_part_1 as it has only one reference [NFC]

2024-10-29 Thread pan2 . li
From: Pan Li 

Previously, we extracted the matching pattern usadd_left_part_1 to avoid
duplication.  After simplifying some usadd patterns into the cheap form,
there is only one reference to this matching left.  Thus, remove this
matching pattern and unfold it at the reference site.

The below test suites are passed for this patch:
1. The rv64gcv fully regression tests.
2. The x86 bootstrap tests.
3. The x86 fully regression tests.

gcc/ChangeLog:

* match.pd: Remove matching usadd_left_part_1 and unfold it at
its reference place

Signed-off-by: Pan Li 
---
 gcc/match.pd | 14 +++---
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/gcc/match.pd b/gcc/match.pd
index 7105aedb40c..a804d9c58fc 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3086,14 +3086,6 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
|| POINTER_TYPE_P (itype))
   && wi::eq_p (wi::to_wide (int_cst), wi::max_value (itype))
 
-/* Unsigned Saturation Add */
-/* SAT_ADD = usadd_left_part_1 | usadd_right_part_1, aka:
-   SAT_ADD = (X + Y) | -((X + Y) < X)  */
-(match (usadd_left_part_1 @0 @1)
- (plus:c @0 @1)
- (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
-  && types_match (type, @0, @1
-
 /* SAT_ADD = usadd_left_part_2 | usadd_right_part_2, aka:
SAT_ADD = REALPART_EXPR <.ADD_OVERFLOW> | (IMAGPART_EXPR <.ADD_OVERFLOW> != 
0) */
 (match (usadd_left_part_2 @0 @1)
@@ -3101,7 +3093,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
   && types_match (type, @0, @1
 
-/* SAT_ADD = usadd_left_part_1 | usadd_right_part_1, aka:
+/* SAT_ADD = (X + Y) | usadd_right_part_1, aka:
SAT_ADD = (X + Y) | -((type)(X + Y) < X)  */
 (match (usadd_right_part_1 @0 @1)
  (negate (convert (lt (plus:c @0 @1) @0)))
@@ -3129,7 +3121,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
   && types_match (type, @0, @1
 
-/* We cannot merge or overload usadd_left_part_1 and usadd_left_part_2
+/* We cannot merge or overload (X + Y) and usadd_left_part_2
because the sub part of left_part_2 cannot work with right_part_1.
For example, left_part_2 pattern focus one .ADD_OVERFLOW but the
right_part_1 has nothing to do with .ADD_OVERFLOW.  */
@@ -3138,7 +3130,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
SAT_U_ADD = (X + Y) | - ((X + Y) < X) or
SAT_U_ADD = (X + Y) | - (X > (X + Y)).  */
 (match (unsigned_integer_sat_add @0 @1)
- (bit_ior:c (usadd_left_part_1 @0 @1) (usadd_right_part_1 @0 @1)))
+ (bit_ior:c (plus:c @0 @1) (usadd_right_part_1 @0 @1)))
 
 /* Unsigned saturation add, case 2 (branchless with .ADD_OVERFLOW):
SAT_ADD = REALPART_EXPR <.ADD_OVERFLOW> | -IMAGPART_EXPR <.ADD_OVERFLOW> or
-- 
2.43.0



[PATCH 5/5] Match: Update the comments of unsigned integer SAT_ADD [NFC]

2024-10-29 Thread pan2 . li
From: Pan Li 

Several comments for the unsigned integer SAT_ADD matching are out of
date.  This patch refines them.

The below test suites are passed for this patch:
1. The rv64gcv fully regression tests.
2. The x86 bootstrap tests.
3. The x86 fully regression tests.

gcc/ChangeLog:

* match.pd: Update the comments of unsigned integer SAT_ADD.

Signed-off-by: Pan Li 
---
 gcc/match.pd | 37 ++---
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/gcc/match.pd b/gcc/match.pd
index a804d9c58fc..8dd7f9af62e 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3086,36 +3086,39 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
|| POINTER_TYPE_P (itype))
   && wi::eq_p (wi::to_wide (int_cst), wi::max_value (itype))
 
-/* SAT_ADD = usadd_left_part_2 | usadd_right_part_2, aka:
-   SAT_ADD = REALPART_EXPR <.ADD_OVERFLOW> | (IMAGPART_EXPR <.ADD_OVERFLOW> != 
0) */
+/* SAT_U_ADD = usadd_left_part_2 | usadd_right_part_2, aka:
+   SUM = ADD_OVERFLOW (X, Y)
+   SAT_U_ADD = REALPART_EXPR (SUM) | (IMAGPART_EXPR (SUM) != 0) */
 (match (usadd_left_part_2 @0 @1)
  (realpart (IFN_ADD_OVERFLOW:c @0 @1))
  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
   && types_match (type, @0, @1
 
-/* SAT_ADD = (X + Y) | usadd_right_part_1, aka:
-   SAT_ADD = (X + Y) | -((type)(X + Y) < X)  */
+/* SAT_U_ADD = (X + Y) | usadd_right_part_1, aka:
+   SAT_U_ADD = (X + Y) | -((type)(X + Y) < X)  */
 (match (usadd_right_part_1 @0 @1)
  (negate (convert (lt (plus:c @0 @1) @0)))
  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
   && types_match (type, @0, @1
 
-/* SAT_ADD = usadd_left_part_1 | usadd_right_part_1, aka:
-   SAT_ADD = (X + Y) | -(X > (X + Y))  */
+/* SAT_U_ADD = usadd_left_part_1 | usadd_right_part_1, aka:
+   SAT_U_ADD = (X + Y) | -(X > (X + Y))  */
 (match (usadd_right_part_1 @0 @1)
  (negate (convert (gt @0 (plus:c @0 @1
  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
   && types_match (type, @0, @1
 
-/* SAT_ADD = usadd_left_part_2 | usadd_right_part_2, aka:
-   SAT_ADD = REALPART_EXPR <.ADD_OVERFLOW> | (IMAGPART_EXPR <.ADD_OVERFLOW> != 
0) */
+/* SAT_U_ADD = usadd_left_part_2 | usadd_right_part_2, aka:
+   SUM = ADD_OVERFLOW (X, Y)
+   SAT_U_ADD = REALPART_EXPR (SUM) | (IMAGPART_EXPR (SUM) != 0) */
 (match (usadd_right_part_2 @0 @1)
  (negate (convert (ne (imagpart (IFN_ADD_OVERFLOW:c @0 @1)) integer_zerop)))
  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
   && types_match (type, @0, @1
 
-/* SAT_ADD = usadd_left_part_2 | usadd_right_part_2, aka:
-   SAT_ADD = REALPART_EXPR <.ADD_OVERFLOW> | -IMAGPART_EXPR <.ADD_OVERFLOW> */
+/* SAT_U_ADD = usadd_left_part_2 | usadd_right_part_2, aka:
+   SUM = ADD_OVERFLOW (X, Y)
+   SAT_U_ADD = REALPART_EXPR (SUM) | -IMAGPART_EXPR (SUM) */
 (match (usadd_right_part_2 @0 @1)
  (negate (imagpart (IFN_ADD_OVERFLOW:c @0 @1)))
  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
@@ -3133,8 +3136,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (bit_ior:c (plus:c @0 @1) (usadd_right_part_1 @0 @1)))
 
 /* Unsigned saturation add, case 2 (branchless with .ADD_OVERFLOW):
-   SAT_ADD = REALPART_EXPR <.ADD_OVERFLOW> | -IMAGPART_EXPR <.ADD_OVERFLOW> or
-   SAT_ADD = REALPART_EXPR <.ADD_OVERFLOW> | (IMAGPART_EXPR <.ADD_OVERFLOW> != 
0) */
+   SUM = ADD_OVERFLOW (X, Y)
+   SAT_U_ADD = REALPART_EXPR (SUM) | -IMAGPART_EXPR (SUM) or
+   SAT_U_ADD = REALPART_EXPR (SUM) | (IMAGPART_EXPR (SUM) != 0) */
 (match (unsigned_integer_sat_add @0 @1)
  (bit_ior:c (usadd_left_part_2 @0 @1) (usadd_right_part_2 @0 @1)))
 
@@ -3171,13 +3175,15 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (bit_ior @2 (negate (convert (lt @2 @0))
 
 /* Unsigned saturation add, case 5 (branch with eq .ADD_OVERFLOW):
-   SAT_U_ADD = REALPART_EXPR <.ADD_OVERFLOW> == 0 ? .ADD_OVERFLOW : -1.  */
+   SUM = ADD_OVERFLOW (X, Y)
+   SAT_U_ADD = IMAGPART_EXPR (SUM) == 0 ? REALPART_EXPR (SUM) : -1.  */
 (match (unsigned_integer_sat_add @0 @1)
  (cond^ (eq (imagpart (IFN_ADD_OVERFLOW:c @0 @1)) integer_zerop)
   (usadd_left_part_2 @0 @1) integer_minus_onep))
 
 /* Unsigned saturation add, case 6 (branch with ne .ADD_OVERFLOW):
-   SAT_U_ADD = REALPART_EXPR <.ADD_OVERFLOW> != 0 ? -1 : .ADD_OVERFLOW.  */
+   SUM = ADD_OVERFLOW (X, Y)
+   SAT_U_ADD = IMAGPART_EXPR (SUM) != 0 ? -1 : REALPART_EXPR (SUM).  */
 (match (unsigned_integer_sat_add @0 @1)
  (cond^ (ne (imagpart (IFN_ADD_OVERFLOW:c @0 @1)) integer_zerop)
   integer_minus_onep (usadd_left_part_2 @0 @1)))
@@ -3199,7 +3205,8 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(if (wi::eq_p (max, sum))
 
 /* Unsigned saturation add, case 10 (one op is imm):
-   SAT_U_ADD = __builtin_add_overflow (X, 3, &ret) == 0 ? ret : -1.  */
+   SUM = ADD_OVERFLOW (X, IMM)
+   SAT_U_ADD = IMAGPART_EXPR (SUM) == 0 ? REALPART_EXPR (SUM) : -1.  */
 (match (unsigned_integer_sat_add @0 @1)
  (cond^ (ne (imagpart (IFN_ADD_OVERFLOW@2 @0 INTEGER_CST@1)) integer_zerop)
   integer_minus_onep (realpart @2))
-

Re: [PATCH 1/5] Internal-fn: Introduce new IFN MASK_LEN_STRIDED_LOAD{STORE}

2024-10-29 Thread Richard Biener
On Wed, Oct 23, 2024 at 12:47 PM  wrote:
>
> From: Pan Li 
>
> This patch introduces new IFNs for strided load and store.
>
> LOAD:  v = MASK_LEN_STRIDED_LOAD (ptr, stride, mask, len, bias)
> STORE: MASK_LEN_STRIDED_STORE (ptr, stride, v, mask, len, bias)
>
> The IFNs target code similar to the example below:
>
> void foo (int * a, int * b, int stride, int n)
> {
>   for (int i = 0; i < n; i++)
> a[i * stride] = b[i * stride];
> }
>
> The below test suites are passed for this patch.
> * The rv64gcv fully regression test.
> * The x86 bootstrap test.
> * The x86 fully regression test.
>
> gcc/ChangeLog:
>
> * internal-fn.cc (strided_load_direct): Add new define direct
> for strided load.
> (strided_store_direct): Ditto but for store.
> (expand_strided_load_optab_fn): Add new func to expand the IFN
> MASK_LEN_STRIDED_LOAD in middle-end.
> (expand_strided_store_optab_fn): Ditto but for store.
> (direct_strided_load_optab_supported_p): Add define for stride
> load optab supported.
> (direct_strided_store_optab_supported_p): Ditto but for store.
> (internal_fn_len_index): Add strided load/store len index.
> (internal_fn_mask_index): Ditto but for mask.
> (internal_fn_stored_value_index): Add strided store value index.
> * internal-fn.def (MASK_LEN_STRIDED_LOAD): Add new IFN for
> strided load.
> (MASK_LEN_STRIDED_STORE): Ditto but for store.
> * optabs.def (OPTAB_D): Add strided load/store optab.

Please mention the full optab names.

There is documentation missing for doc/md.texi for the new optabs.

Otherwise looks OK.  I'll note that non-masked or non-len-only-masked
variants are missing but this is OK I guess.

Richard.
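
To make the requested documentation concrete, a scalar reference model of
the new load IFN could look as follows (my sketch only; in particular,
zeroing the inactive lanes is an assumption here, not something the IFN
specifies):

  #include <stddef.h>

  /* v = MASK_LEN_STRIDED_LOAD (ptr, stride, mask, len, bias), modelled
     element by element: only the first len + bias elements are touched,
     and only where the mask is active.  */
  void
  strided_load_ref (int *v, const int *ptr, ptrdiff_t stride,
                    const bool *mask, int len, int bias)
  {
    for (int i = 0; i < len + bias; i++)
      if (mask[i])
        v[i] = ptr[i * stride];
      else
        v[i] = 0;  /* assumed else-value */
  }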

>
> Signed-off-by: Pan Li 
> Co-Authored-By: Juzhe-Zhong 
> ---
>  gcc/internal-fn.cc  | 71 +
>  gcc/internal-fn.def |  6 
>  gcc/optabs.def  |  2 ++
>  3 files changed, 79 insertions(+)
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index d89a04fe412..bfbbba8e2dd 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -159,6 +159,7 @@ init_internal_fns ()
>  #define load_lanes_direct { -1, -1, false }
>  #define mask_load_lanes_direct { -1, -1, false }
>  #define gather_load_direct { 3, 1, false }
> +#define strided_load_direct { -1, -1, false }
>  #define len_load_direct { -1, -1, false }
>  #define mask_len_load_direct { -1, 4, false }
>  #define mask_store_direct { 3, 2, false }
> @@ -168,6 +169,7 @@ init_internal_fns ()
>  #define vec_cond_mask_len_direct { 1, 1, false }
>  #define vec_cond_direct { 2, 0, false }
>  #define scatter_store_direct { 3, 1, false }
> +#define strided_store_direct { 1, 1, false }
>  #define len_store_direct { 3, 3, false }
>  #define mask_len_store_direct { 4, 5, false }
>  #define vec_set_direct { 3, 3, false }
> @@ -3712,6 +3714,64 @@ expand_gather_load_optab_fn (internal_fn, gcall *stmt, 
> direct_optab optab)
>assign_call_lhs (lhs, lhs_rtx, &ops[0]);
>  }
>
> +/* Expand MASK_LEN_STRIDED_LOAD call CALL by optab OPTAB.  */
> +
> +static void
> +expand_strided_load_optab_fn (ATTRIBUTE_UNUSED internal_fn, gcall *stmt,
> + direct_optab optab)
> +{
> +  tree lhs = gimple_call_lhs (stmt);
> +  tree base = gimple_call_arg (stmt, 0);
> +  tree stride = gimple_call_arg (stmt, 1);
> +
> +  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +  rtx base_rtx = expand_normal (base);
> +  rtx stride_rtx = expand_normal (stride);
> +
> +  unsigned i = 0;
> +  class expand_operand ops[6];
> +  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
> +
> +  create_output_operand (&ops[i++], lhs_rtx, mode);
> +  create_address_operand (&ops[i++], base_rtx);
> +  create_address_operand (&ops[i++], stride_rtx);
> +
> +  i = add_mask_and_len_args (ops, i, stmt);
> +  expand_insn (direct_optab_handler (optab, mode), i, ops);
> +
> +  if (!rtx_equal_p (lhs_rtx, ops[0].value))
> +emit_move_insn (lhs_rtx, ops[0].value);
> +}
> +
> +/* Expand MASK_LEN_STRIDED_STORE call CALL by optab OPTAB.  */
> +
> +static void
> +expand_strided_store_optab_fn (ATTRIBUTE_UNUSED internal_fn, gcall *stmt,
> +  direct_optab optab)
> +{
> +  internal_fn fn = gimple_call_internal_fn (stmt);
> +  int rhs_index = internal_fn_stored_value_index (fn);
> +
> +  tree base = gimple_call_arg (stmt, 0);
> +  tree stride = gimple_call_arg (stmt, 1);
> +  tree rhs = gimple_call_arg (stmt, rhs_index);
> +
> +  rtx base_rtx = expand_normal (base);
> +  rtx stride_rtx = expand_normal (stride);
> +  rtx rhs_rtx = expand_normal (rhs);
> +
> +  unsigned i = 0;
> +  class expand_operand ops[6];
> +  machine_mode mode = TYPE_MODE (TREE_TYPE (rhs));
> +
> +  create_address_operand (&ops[i++], base_rtx);
> +  create_address_operand (&ops[i++], stride_rtx);
> +  create_input_operand (&ops[i++], rhs_rtx, mode);
> +
> +  i = add_mask_and_

Re: [PATCH] Add 'cobol' to Makefile.def, take 2

2024-10-29 Thread Richard Biener
On Sat, Oct 26, 2024 at 10:37 PM James K. Lowden
 wrote:
>
> On Sat, 26 Oct 2024 11:22:20 +0800
> Xi Ruoyao  wrote:
>
> > The changelog is not formatted correctly.  gcc/ has its own
> > changelog. And gcc/cobol should have its own changelog too, like all
> > other frontends.
>
> Thank you for pointing that out.  I now have
>
> [snip]
> Subject: [PATCH]  Add 'cobol' to 10 files
>
> ChangeLog
> * Makefile.def: Add libgcobol module and cobol language.
> * configure: Regenerated
> * configure.ac: Add libgcobol module and cobol language.
>
> gcc/ChangeLog
> * gcc/common.opt: Add libgcobol module and cobol language.

gcc/ should be stripped from * gcc/common.opt, so just * common.opt

>
> gcc/cobol/ChangeLog
> * gcc/cobol/ChangeLog: Add gcc/cobol/ChangeLog
> * gcc/cobol/LICENSE: Add gcc/cobol/LICENSE
> * gcc/cobol/Make-lang.in: Add gcc/cobol/Make-lang.in
> * gcc/cobol/config-lang.in: Add gcc/cobol/config-lang.in
> * gcc/cobol/lang.opt: Add gcc/cobol/lang.opt
> * gcc/cobol/lang.opt.urls: Add gcc/cobol/lang.opt.urls

Likewise for gcc/cobol.

It's probably best to have a first commit that just generates the directories with the
empty ChangeLog and amends the contrib/gcc-changelog/git_commit.py
script's default_changelog_locations.

I'm not sure about the exact order of the dance, Jakub possibly remembers.
We'll mainly have to remember when pushing any of the series.

Richard.

> [snip]
>
> > Please also use "git gcc-verify".
>
> Where is this documented?  I found
>
> https://gcc.gnu.org/pipermail/gcc-cvs/2020-May/288244.html
>
> and
>
> contrib/gcc-git-customization.sh
>
> There's quite a bit of stuff in there.  I generally don't use anyone else's 
> handy-dandy configuration, especially for git, which is baroque enough as it 
> is.  But I can run one command,
>
> $ grep gcc-verify gcc-git-customization.sh
> git config alias.gcc-verify '!f() { "`git rev-parse 
> --show-toplevel`/contrib/gcc-changelog/git_check_commit.py" $@; } ; f'
>
> if that's what we need.

Probably helpful (I'm also cherry-picking from the git customization).

Richard.

>
> --jkl


Re: [PATCH] ifcombine: For short circuit case, allow 2 defining statements [PR85605]

2024-10-29 Thread Richard Biener
On Tue, Oct 29, 2024 at 4:29 AM Andrew Pinski  wrote:
>
> r0-126134-g5d2a9da9a7f7c1 added support for short-circuiting and combining the
> ifs using either AND or OR.  But it only allowed the inner condition basic
> block to contain just the conditional.  This change allows up to 2 defining
> statements, as long as they are nop conversions for either the lhs or rhs
> of the conditional.
>
> This should allow to use ccmp on aarch64 and x86_64 (APX) slightly more than 
> before.
>
> Bootstrapped and tested on x86_64-linux-gnu.
>
> PR tree-optimization/85605
>
> gcc/ChangeLog:
>
> * tree-ssa-ifcombine.cc (can_combine_bbs_with_short_circuit): New 
> function.
> (ifcombine_ifandif): Use can_combine_bbs_with_short_circuit instead 
> of checking
> if iterator is one before the last statement.
>
> gcc/testsuite/ChangeLog:
>
> * g++.dg/tree-ssa/ifcombine-ccmp-1.C: New test.
> * gcc.dg/tree-ssa/ssa-ifcombine-ccmp-7.c: New test.
> * gcc.dg/tree-ssa/ssa-ifcombine-ccmp-8.c: New test.
>
> Signed-off-by: Andrew Pinski 
> ---
>  .../g++.dg/tree-ssa/ifcombine-ccmp-1.C| 27 +
>  .../gcc.dg/tree-ssa/ssa-ifcombine-ccmp-7.c| 18 +
>  .../gcc.dg/tree-ssa/ssa-ifcombine-ccmp-8.c| 19 +
>  gcc/tree-ssa-ifcombine.cc | 39 ++-
>  4 files changed, 101 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/tree-ssa/ifcombine-ccmp-1.C
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-7.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-8.c
>
> diff --git a/gcc/testsuite/g++.dg/tree-ssa/ifcombine-ccmp-1.C 
> b/gcc/testsuite/g++.dg/tree-ssa/ifcombine-ccmp-1.C
> new file mode 100644
> index 000..282cec8c628
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/tree-ssa/ifcombine-ccmp-1.C
> @@ -0,0 +1,27 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -g -fdump-tree-optimized --param 
> logical-op-non-short-circuit=1" } */
> +
> +/* PR tree-optimization/85605 */
> +#include <stdint.h>
> +
> +template
> +inline bool cmp(T a, T2 b) {
> +  return a<0 ? true : T2(a) < b;
> +}
> +
> +template
> +inline bool cmp2(T a, T2 b) {
> +  return (a<0) | (T2(a) < b);
> +}
> +
> +bool f(int a, int b) {
> +return cmp(int64_t(a), unsigned(b));
> +}
> +
> +bool f2(int a, int b) {
> +return cmp2(int64_t(a), unsigned(b));
> +}
> +
> +
> +/* Both of these functions should be optimized to the same, and have an | in 
> them. */
> +/* { dg-final { scan-tree-dump-times " \\\| " 2 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-7.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-7.c
> new file mode 100644
> index 000..1bdbb9358b4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-7.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -g -fdump-tree-optimized --param 
> logical-op-non-short-circuit=1" } */
> +
> +/* PR tree-optimization/85605 */
> +/* Like ssa-ifcombine-ccmp-1.c but with conversion from unsigned to signed 
> in the
> +   inner bb which should be able to move too. */
> +
> +int t (int a, unsigned b)
> +{
> +  if (a > 0)
> +  {
> +signed t = b;
> +if (t > 0)
> +  return 0;
> +  }
> +  return 1;
> +}
> +/* { dg-final { scan-tree-dump "\&" "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-8.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-8.c
> new file mode 100644
> index 000..8d74b4932c5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-8.c
> @@ -0,0 +1,19 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -g -fdump-tree-optimized --param 
> logical-op-non-short-circuit=1" } */
> +
> +/* PR tree-optimization/85605 */
> +/* Like ssa-ifcombine-ccmp-2.c but with conversion from unsigned to signed 
> in the
> +   inner bb which should be able to move too. */
> +
> +int t (int a, unsigned b)
> +{
> +  if (a > 0)
> +goto L1;
> +  signed t = b;
> +  if (t > 0)
> +goto L1;
> +  return 0;
> +L1:
> +  return 1;
> +}
> +/* { dg-final { scan-tree-dump "\|" "optimized" } } */
> diff --git a/gcc/tree-ssa-ifcombine.cc b/gcc/tree-ssa-ifcombine.cc
> index 39702929fc0..3acecda31cc 100644
> --- a/gcc/tree-ssa-ifcombine.cc
> +++ b/gcc/tree-ssa-ifcombine.cc
> @@ -400,6 +400,38 @@ update_profile_after_ifcombine (basic_block 
> inner_cond_bb,
>outer2->probability = profile_probability::never ();
>  }
>
> +/* Returns true if inner_cond_bb contains just the condition or 1/2 
> statements
> +   that define lhs or rhs with a nop conversion. */
> +
> +static bool
> +can_combine_bbs_with_short_circuit (basic_block inner_cond_bb, tree lhs, 
> tree rhs)
> +{
> +  gimple_stmt_iterator gsi;
> +  gsi = gsi_start_nondebug_after_labels_bb (inner_cond_bb);
> +  /* If only the condition, this should be allowed. */
> +  if (gsi_one_before_end_p (gsi))
> +return true;
> +  /* Can have up to 2 statements defin

Re: [RFC PATCH 5/5] vect: Also cost gconds for scalar

2024-10-29 Thread Richard Biener
On Tue, 29 Oct 2024, Richard Biener wrote:

> On Mon, 28 Oct 2024, Alex Coplan wrote:
> 
> > Currently we only cost gconds for the vector loop while we omit costing
> > them when analyzing the scalar loop; this unfairly penalizes the vector
> > loop in the case of loops with early exits.
> > 
> > This (together with the previous patches) enables us to vectorize
> > std::find with 64-bit element sizes.
> 
> OK.

Ah, wait - but we're now costing the scalar IV gcond but we're not
costing the vector (scalar) IV gcond.  So you want to exempt the
main IV gcond here.

Richard.

> Thanks,
> Richard.
> 
> > gcc/ChangeLog:
> > 
> > * tree-vect-loop.cc (vect_compute_single_scalar_iteration_cost):
> > Don't skip over gconds.
> > ---
> >  gcc/tree-vect-loop.cc | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > 
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH v2 3/3] Simplify switch bit test clustering algorithm

2024-10-29 Thread Richard Biener
On Mon, Oct 28, 2024 at 9:58 PM Andi Kleen  wrote:
>
> From: Andi Kleen 
>
> The current switch bit test clustering enumerates all possible case
> cluster combinations to find the ones that fit the bit test constraints
> best.  This causes performance problems with very large switches.
>
> For bit test clustering, which happens naturally in word-sized chunks,
> I don't think such an expensive algorithm is really needed.
>
> This patch implements a simple greedy algorithm that walks
> the sorted list, examines word-sized windows, and tries
> to cluster them.
>
> Surprisingly, the new algorithm gives consistently better clusters
> for the examples I tried.
>
> For example from the gcc bootstrap:
>
> old: 0-15 16-31 96-175
> new: 0-31 96-175
>
> I'm not fully sure why that is, probably some bug in the old
> algorithm?  This even shows up in the test suite, where if-to-switch-6
> can now generate a switch, as well as a case in switch-1.c.
>
> I don't have a proof that the new algorithm is always as good or better,
> but so far at least I don't see any counter examples.
>
> It also fixes the excessive compile time in PR117091;
> however, this was already fixed by an earlier patch
> that doesn't run clustering when no targets have multiple
> values.

OK if you add a comment (as part of the function comment for example)
explaining the idea of the algorithm.

Thanks,
Richard.
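
A standalone sketch of the greedy window walk described above (simplified:
real clusters carry case ranges and a target-count limit, both ignored
here; WORD_BITS stands in for BITS_PER_WORD):

  #include <cstdio>
  #include <cstdint>

  const int64_t WORD_BITS = 64;

  /* Greedily group sorted case values into clusters whose whole value
     range fits into one machine word.  */
  void greedy_cluster (const int64_t *vals, int n)
  {
    for (int i = 0; i < n; )
      {
        int end = i + 1;
        while (end < n && vals[end] - vals[i] < WORD_BITS)
          end++;
        printf ("cluster: %lld-%lld\n",
                (long long) vals[i], (long long) vals[end - 1]);
        i = end;
      }
  }

  int main ()
  {
    int64_t vals[] = { 0, 15, 16, 31, 100, 130 };
    greedy_cluster (vals, 6);  /* prints 0-31 and 100-130 */
    return 0;
  }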

> gcc/ChangeLog:
>
> PR middle-end/117091
> * tree-switch-conversion.cc (bit_test_cluster::find_bit_tests):
> Change clustering algorithm to simple greedy.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/tree-ssa/if-to-switch-6.c: Allow condition chain.
> * gcc.dg/tree-ssa/switch-1.c: Allow more bit tests.
> ---
>  .../gcc.dg/tree-ssa/if-to-switch-6.c  |  2 +-
>  gcc/testsuite/gcc.dg/tree-ssa/switch-1.c  |  2 +-
>  gcc/tree-switch-conversion.cc | 76 ++-
>  3 files changed, 42 insertions(+), 38 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/if-to-switch-6.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/if-to-switch-6.c
> index b1640673eae1..657af770e438 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/if-to-switch-6.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/if-to-switch-6.c
> @@ -39,4 +39,4 @@ int main(int argc, char **argv)
>return 0;
>  }
>
> -/* { dg-final { scan-tree-dump-not "Condition chain" "iftoswitch" } } */
> +/* { dg-final { scan-tree-dump "Condition chain" "iftoswitch" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/switch-1.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/switch-1.c
> index 6f70c9de0c19..f1654aba6d99 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/switch-1.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/switch-1.c
> @@ -107,4 +107,4 @@ int foo5 (int x)
>}
>  }
>
> -/* { dg-final { scan-tree-dump ";; GIMPLE switch case clusters: BT:10-62 
> 600-700 JT:1000-1021 11" "switchlower1" } } */
> +/* { dg-final { scan-tree-dump ";; GIMPLE switch case clusters: BT:10-62 
> 600-700 BT:1000-1021 11" "switchlower1" } } */
> diff --git a/gcc/tree-switch-conversion.cc b/gcc/tree-switch-conversion.cc
> index 3436c2a8b98c..b7736a9853d9 100644
> --- a/gcc/tree-switch-conversion.cc
> +++ b/gcc/tree-switch-conversion.cc
> @@ -1782,55 +1782,59 @@ bit_test_cluster::find_bit_tests (vec 
> &clusters, int max_c)
>  return clusters.copy ();
>
>unsigned l = clusters.length ();
> -  auto_vec min;
> -  min.reserve (l + 1);
> +  vec output;
>
> -  min.quick_push (min_cluster_item (0, 0, 0));
> +  output.create (l);
>
> -  for (unsigned i = 1; i <= l; i++)
> +  unsigned end;
> +  for (unsigned i = 0; i < l; i += end)
>  {
> -  /* Set minimal # of clusters with i-th item to infinite.  */
> -  min.quick_push (min_cluster_item (INT_MAX, INT_MAX, INT_MAX));
> +  HOST_WIDE_INT values = 0;
> +  hash_set targets;
> +  cluster *start_cluster = clusters[i];
>
> -  for (unsigned j = 0; j < i; j++)
> +  end = 0;
> +  while (i + end < l)
> {
> - if (min[j].m_count + 1 < min[i].m_count
> - && can_be_handled (clusters, j, i - 1))
> -   min[i] = min_cluster_item (min[j].m_count + 1, j, INT_MAX);
> + cluster *end_cluster = clusters[i + end];
> +
> + /* Does value range fit into the BITS_PER_WORD window?  */
> + HOST_WIDE_INT w = cluster::get_range (start_cluster->get_low (),
> +   end_cluster->get_high ());
> + if (w == 0 || w > BITS_PER_WORD)
> +   break;
> +
> + /* Compute # of values tested for new case.  */
> + HOST_WIDE_INT r = 1;
> + if (!end_cluster->is_single_value_p ())
> +   r = cluster::get_range (end_cluster->get_low (),
> +   end_cluster->get_high ());
> + if (r == 0)
> +   break;
> +
> + /* Check for max # of targets.  */
> + if (targets.elements() == m_max_case_bit_tests
> + && !targets.contains (end_cluster->m_case_bb)

Re: [PATCH v2 1/2] Match: support new case of unsigned scalar SAT_SUB

2024-10-29 Thread Richard Biener
On Mon, Oct 28, 2024 at 4:44 PM Akram Ahmad  wrote:
>
> This patch adds a new case for unsigned scalar saturating subtraction
> using a branch with a greater-than-or-equal condition. For example,
>
> X >= (X - Y) ? (X - Y) : 0
>
> is transformed into SAT_SUB (X, Y) when X and Y are unsigned scalars,
> which therefore correctly matches more cases of IFN SAT_SUB. New tests
> are added to verify this behaviour on targets which use the standard
> names for IFN SAT_SUB.
>
> This passes the aarch64 regression tests with no additional failures.

OK.

Thanks,
Richard.
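
The transformation is sound for unsigned types because X - Y wraps modulo
2^N exactly when Y > X, and the wrapped difference is then larger than X,
so the ge test selects the 0 arm.  A brute-force check (my driver, not
part of the patch):

  #include <stdint.h>
  #include <assert.h>

  uint8_t sat_u_sub_branch (uint8_t x, uint8_t y)
  {
    return x >= (uint8_t)(x - y) ? (uint8_t)(x - y) : 0;
  }

  int main ()
  {
    for (int x = 0; x < 256; x++)
      for (int y = 0; y < 256; y++)
        assert (sat_u_sub_branch (x, y) == (x >= y ? x - y : 0));
    return 0;
  }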

> gcc/ChangeLog:
>
> * match.pd: Add new match for SAT_SUB.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/tree-ssa/sat-u-sub-match-1-u16.c: New test.
> * gcc.dg/tree-ssa/sat-u-sub-match-1-u32.c: New test.
> * gcc.dg/tree-ssa/sat-u-sub-match-1-u64.c: New test.
> * gcc.dg/tree-ssa/sat-u-sub-match-1-u8.c: New test.
> ---
>  gcc/match.pd   |  8 
>  .../gcc.dg/tree-ssa/sat-u-sub-match-1-u16.c| 14 ++
>  .../gcc.dg/tree-ssa/sat-u-sub-match-1-u32.c| 14 ++
>  .../gcc.dg/tree-ssa/sat-u-sub-match-1-u64.c| 14 ++
>  .../gcc.dg/tree-ssa/sat-u-sub-match-1-u8.c | 14 ++
>  5 files changed, 64 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u16.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u32.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u64.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u8.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index ee53c25cef9..4fc5efa6247 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3360,6 +3360,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>}
>(if (wi::eq_p (sum, wi::uhwi (0, precision)))
>
> +/* Unsigned saturation sub, case 11 (branch with ge):
> +  SAT_U_SUB = X >= (X - Y) ? (X - Y) : 0.  */
> +(match (unsigned_integer_sat_sub @0 @1)
> + (cond^ (ge @0 (minus @0 @1))
> +  (convert? (minus (convert1? @0) (convert1? @1))) integer_zerop)
> + (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> +  && TYPE_UNSIGNED (TREE_TYPE (@0)) && types_match (@0, @1
> +
>  /* Signed saturation sub, case 1:
> T minus = (T)((UT)X - (UT)Y);
> SAT_S_SUB = (X ^ Y) & (X ^ minus) < 0 ? (-(T)(X < 0) ^ MAX) : minus;
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u16.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u16.c
> new file mode 100644
> index 000..164719980c3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u16.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +#include <stdint.h>
> +
> +#define T uint16_t
> +
> +T sat_u_sub_1 (T a, T b)
> +{
> +  T sum = a - b;
> +  return sum > a ? 0 : sum;
> +}
> +
> +/* { dg-final { scan-tree-dump " .SAT_SUB " "optimized" } } */
> \ No newline at end of file
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u32.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u32.c
> new file mode 100644
> index 000..40a28c6092b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u32.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +#include <stdint.h>
> +
> +#define T uint32_t
> +
> +T sat_u_sub_1 (T a, T b)
> +{
> +  T sum = a - b;
> +  return sum > a ? 0 : sum;
> +}
> +
> +/* { dg-final { scan-tree-dump " .SAT_SUB " "optimized" } } */
> \ No newline at end of file
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u64.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u64.c
> new file mode 100644
> index 000..5649858ef2a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u64.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +#include <stdint.h>
> +
> +#define T uint64_t
> +
> +T sat_u_sub_1 (T a, T b)
> +{
> +  T sum = a - b;
> +  return sum > a ? 0 : sum;
> +}
> +
> +/* { dg-final { scan-tree-dump " .SAT_SUB " "optimized" } } */
> \ No newline at end of file
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u8.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u8.c
> new file mode 100644
> index 000..785e48b92ee
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/sat-u-sub-match-1-u8.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +#include <stdint.h>
> +
> +#define T uint8_t
> +
> +T sat_u_sub_1 (T a, T b)
> +{
> +  T sum = a - b;
> +  return sum > a ? 0 : sum;
> +}
> +
> +/* { dg-final { scan-tree-dump " .SAT_SUB " "optimized" } } */
> \ No newline at end of file
> --
> 2.34.1
>


Re: [PATCH v2 2/3] Only do switch bit test clustering when multiple labels point to same bb

2024-10-29 Thread Richard Biener
On Mon, Oct 28, 2024 at 9:58 PM Andi Kleen  wrote:
>
> From: Andi Kleen 
>
> The bit cluster code generation strategy is only beneficial when
> multiple case labels point to the same code. Do a quick check if
> that is the case before trying to cluster.
>
> This fixes the switch part of PR117091, where all case labels are unique;
> however, it doesn't address the performance problems for non-unique
> cases.

OK.

Thanks,
Richard.

> gcc/ChangeLog:
>
> PR middle-end/117091
> * gimple-if-to-switch.cc (if_chain::is_beneficial): Update
> find_bit_test call.
> * tree-switch-conversion.cc (bit_test_cluster::find_bit_tests):
> Get max_c argument and bail out early if all case labels are
> unique.
> (switch_decision_tree::compute_cases_per_edge): Record number of
> targets per label and return.
> (switch_decision_tree::analyze_switch_statement): ... pass to
> find_bit_tests.
> * tree-switch-conversion.h: Update prototypes.
> ---
>  gcc/gimple-if-to-switch.cc|  2 +-
>  gcc/tree-switch-conversion.cc | 23 ---
>  gcc/tree-switch-conversion.h  |  5 +++--
>  3 files changed, 20 insertions(+), 10 deletions(-)
>
> diff --git a/gcc/gimple-if-to-switch.cc b/gcc/gimple-if-to-switch.cc
> index 96ce1c380a59..4151d1bb520e 100644
> --- a/gcc/gimple-if-to-switch.cc
> +++ b/gcc/gimple-if-to-switch.cc
> @@ -254,7 +254,7 @@ if_chain::is_beneficial ()
>else
>  output.release ();
>
> -  output = bit_test_cluster::find_bit_tests (filtered_clusters);
> +  output = bit_test_cluster::find_bit_tests (filtered_clusters, 2);
>r = output.length () < filtered_clusters.length ();
>if (r)
>  dump_clusters (&output, "BT can be built");
> diff --git a/gcc/tree-switch-conversion.cc b/gcc/tree-switch-conversion.cc
> index 00426d46..3436c2a8b98c 100644
> --- a/gcc/tree-switch-conversion.cc
> +++ b/gcc/tree-switch-conversion.cc
> @@ -1772,12 +1772,13 @@ jump_table_cluster::is_beneficial (const vec *> &,
>  }
>
>  /* Find bit tests of given CLUSTERS, where all members of the vector
> -   are of type simple_cluster.  New clusters are returned.  */
> +   are of type simple_cluster.   MAX_C is the approx max number of cases per
> +   label.  New clusters are returned.  */
>
>  vec
> -bit_test_cluster::find_bit_tests (vec &clusters)
> +bit_test_cluster::find_bit_tests (vec &clusters, int max_c)
>  {
> -  if (!is_enabled ())
> +  if (!is_enabled () || max_c == 1)
>  return clusters.copy ();
>
>unsigned l = clusters.length ();
> @@ -2206,18 +2207,26 @@ bit_test_cluster::hoist_edge_and_branch_if_true 
> (gimple_stmt_iterator *gsip,
>  }
>
>  /* Compute the number of case labels that correspond to each outgoing edge of
> -   switch statement.  Record this information in the aux field of the edge.  
> */
> +   switch statement.  Record this information in the aux field of the edge.
> +   Return the approx max number of cases per edge.  */
>
> -void
> +int
>  switch_decision_tree::compute_cases_per_edge ()
>  {
> +  int max_c = 0;
>reset_out_edges_aux (m_switch);
>int ncases = gimple_switch_num_labels (m_switch);
>for (int i = ncases - 1; i >= 1; --i)
>  {
>edge case_edge = gimple_switch_edge (cfun, m_switch, i);
>case_edge->aux = (void *) ((intptr_t) (case_edge->aux) + 1);
> +  /* For a range case add one extra. That's enough for the bit
> +cluster heuristic.  */
> +  if ((intptr_t)case_edge->aux > max_c)
> +   max_c = (intptr_t)case_edge->aux +
> +   !!CASE_HIGH (gimple_switch_label (m_switch, i));
>  }
> +  return max_c;
>  }
>
>  /* Analyze switch statement and return true when the statement is expanded
> @@ -2235,7 +2244,7 @@ switch_decision_tree::analyze_switch_statement ()
>m_case_bbs.reserve (l);
>m_case_bbs.quick_push (default_bb);
>
> -  compute_cases_per_edge ();
> +  int max_c = compute_cases_per_edge ();
>
>for (unsigned i = 1; i < l; i++)
>  {
> @@ -2256,7 +2265,7 @@ switch_decision_tree::analyze_switch_statement ()
>reset_out_edges_aux (m_switch);
>
>/* Find bit-test clusters.  */
> -  vec output = bit_test_cluster::find_bit_tests (clusters);
> +  vec output = bit_test_cluster::find_bit_tests (clusters, max_c);
>
>/* Find jump table clusters.  */
>vec output2;
> diff --git a/gcc/tree-switch-conversion.h b/gcc/tree-switch-conversion.h
> index 6468995eb316..e6a85fa60258 100644
> --- a/gcc/tree-switch-conversion.h
> +++ b/gcc/tree-switch-conversion.h
> @@ -399,7 +399,7 @@ public:
>
>/* Find bit tests of given CLUSTERS, where all members of the vector
>   are of type simple_cluster.  New clusters are returned.  */
> -  static vec find_bit_tests (vec &clusters);
> +  static vec find_bit_tests (vec &clusters, int max_c);
>
>/* Return true when RANGE of case values with UNIQ labels
>   can build a bit test.  */
> @@ -576,8 +576,9 @@ public:
>bool try_switch_expansion (vec &clusters);
>/* Comput

Re: [PATCH 1/5] Match: Simplify branch form 4 of unsigned SAT_ADD into branchless

2024-10-29 Thread Richard Biener
On Tue, Oct 29, 2024 at 9:27 AM  wrote:
>
> From: Pan Li 
>
> There are several forms of the unsigned SAT_ADD.  Some of them are
> complicated while others are cheap.  This patch simplifies the
> complicated forms into the cheap ones.  For example as below:
>
> From the form 4 (branch):
>   SAT_U_ADD = (X + Y) < x ? -1 : (X + Y).
>
> To (branchless):
>   SAT_U_ADD = (X + Y) | - ((X + Y) < X).
>
>   #define T uint8_t
>
>   T sat_add_u_1 (T x, T y)
>   {
> return (T)(x + y) < x ? -1 : (x + y);
>   }
>
> Before this patch:
>1   │ uint8_t sat_add_u_1 (uint8_t x, uint8_t y)
>2   │ {
>3   │   uint8_t D.2809;
>4   │
>5   │   _1 = x + y;
>6   │   if (x <= _1) goto ; else goto ;
>7   │   :
>8   │   D.2809 = x + y;
>9   │   goto ;
>   10   │   :
>   11   │   D.2809 = 255;
>   12   │   :
>   13   │   return D.2809;
>   14   │ }
>
> After this patch:
>1   │ uint8_t sat_add_u_1 (uint8_t x, uint8_t y)
>2   │ {
>3   │   uint8_t D.2809;
>4   │
>5   │   _1 = x + y;
>6   │   _2 = x + y;
>7   │   _3 = x > _2;
>8   │   _4 = (unsigned char) _3;
>9   │   _5 = -_4;
>   10   │   D.2809 = _1 | _5;
>   11   │   return D.2809;
>   12   │ }
>
> The below test suites are passed for this patch.
> * The rv64gcv fully regression test.
> * The x86 bootstrap test.
> * The x86 fully regression test.
>
> gcc/ChangeLog:
>
> * match.pd: Remove unsigned branch form 4 for SAT_ADD, and
> add simplify to branchless instead.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/tree-ssa/sat_u_add-simplify-2-u16.c: New test.
> * gcc.dg/tree-ssa/sat_u_add-simplify-2-u32.c: New test.
> * gcc.dg/tree-ssa/sat_u_add-simplify-2-u64.c: New test.
> * gcc.dg/tree-ssa/sat_u_add-simplify-2-u8.c: New test.

You are testing GENERIC folding, so gcc.dg/ is a better location, not tree-ssa/
I wonder if the simplification is already applied by the frontend and thus
.original shows the simplified form or only .gimple?

Did you check the simplification applies when writing as

 if (x + y < x)
   return -1;
else
   return x + y;

?  If it doesn't then removing the match is likely premature.  I think phiopt
should be able to perform the matching here (it does the COND_EXPR
building).

Thanks,
Richard.
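
For what it's worth, the if/else spelling asked about above and the
branchless target form agree on all inputs; a brute-force check (my
sketch, not part of the patch):

  #include <stdint.h>
  #include <assert.h>

  static uint8_t sat_if_else (uint8_t x, uint8_t y)
  {
    if ((uint8_t)(x + y) < x)
      return -1;            /* saturate to 255 */
    else
      return x + y;
  }

  static uint8_t sat_branchless (uint8_t x, uint8_t y)
  {
    uint8_t sum = x + y;
    return sum | -(uint8_t)(sum < x);
  }

  int main ()
  {
    for (int x = 0; x < 256; x++)
      for (int y = 0; y < 256; y++)
        assert (sat_if_else (x, y) == sat_branchless (x, y));
    return 0;
  }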

> Signed-off-by: Pan Li 
> ---
>  gcc/match.pd  | 11 +++
>  .../gcc.dg/tree-ssa/sat_u_add-simplify-2-u16.c| 15 +++
>  .../gcc.dg/tree-ssa/sat_u_add-simplify-2-u32.c| 15 +++
>  .../gcc.dg/tree-ssa/sat_u_add-simplify-2-u64.c| 15 +++
>  .../gcc.dg/tree-ssa/sat_u_add-simplify-2-u8.c | 15 +++
>  5 files changed, 67 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u16.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u32.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u64.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u8.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 809c717bc86..4d1143b6ec3 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3154,10 +3154,13 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>&& types_match (type, @0, @1))
>(bit_ior @2 (negate (convert (lt @2 @0))
>
> -/* Unsigned saturation add, case 4 (branch with lt):
> -   SAT_U_ADD = (X + Y) < x ? -1 : (X + Y).  */
> -(match (unsigned_integer_sat_add @0 @1)
> - (cond^ (lt (usadd_left_part_1@2 @0 @1) @0) integer_minus_onep @2))
> +/* Simplify SAT_U_ADD to the cheap form
> +   From: SAT_U_ADD = (X + Y) < x ? -1 : (X + Y).
> +   To:   SAT_U_ADD = (X + Y) | - ((X + Y) < X).  */
> +(simplify (cond (lt (plus:c@2 @0 @1) @0) integer_minus_onep @2)
> + (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> +  && types_match (type, @0, @1))
> +  (bit_ior @2 (negate (convert (lt @2 @0))
>
>  /* Unsigned saturation add, case 5 (branch with eq .ADD_OVERFLOW):
> SAT_U_ADD = REALPART_EXPR <.ADD_OVERFLOW> == 0 ? .ADD_OVERFLOW : -1.  */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u16.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u16.c
> new file mode 100644
> index 000..6e327f58d46
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u16.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-gimple-details" } */
> +
> +#include <stdint.h>
> +
> +#define T uint16_t
> +
> +T sat_add_u_1 (T x, T y)
> +{
> +  return (T)(x + y) < x ? -1 : (x + y);
> +}
> +
> +/* { dg-final { scan-tree-dump-not " if " "gimple" } } */
> +/* { dg-final { scan-tree-dump-not " else " "gimple" } } */
> +/* { dg-final { scan-tree-dump-not " goto " "gimple" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u32.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/sat_u_add-simplify-2-u32.c
> new file mode 100644
> index 000..17b09396b69
> --- /dev/null
> +++ b/gcc/testsu

Re: [PATCH v2 2/8] ifn: Add else-operand handling.

2024-10-29 Thread Robin Dapp
>> +/* Integer constants representing which else value is supported for masked 
>> load
>> +   functions.  */
>> +#define MASK_LOAD_ELSE_ZERO -1
>> +#define MASK_LOAD_ELSE_M1 -2
>> +#define MASK_LOAD_ELSE_UNDEFINED -3
>> +
>> +#define MASK_LOAD_GATHER_ELSE_IDX 6
>
> Why this define?

I initially wanted to use internal_fn_else_index to query the optab's else
operand.  IFN and optab else indices match for maskload and mask_load_lanes
but not for mask_gather_load because the latter implicitly has the sign/zero
extension operand.  I documented this and replaced MASK_LOAD_GATHER_ELSE_IDX
with internal_fn_else_index (...) + 1 now.

In addition, I figured we don't need to query the else operand in pattern recog
but can use ZERO there as well.  It should be sufficient to "overwrite" it in
vectorizable_load.  That way vect_get_else_val_from_tree becomes unnecessary.

Most of the other reviewer comments are incorporated now.  Just still fighting
with an SLP ICE on aarch64 where a swapped
 oprnd_info->ops[j] seems to be NULL.

-- 
Regards
 Robin



Ping: [PATCH] Always set SECTION_RELRO for or .data.rel.ro{,.local} [PR116887]

2024-10-29 Thread Xi Ruoyao
On Fri, 2024-10-11 at 02:54 +0800, Xi Ruoyao wrote:
> At least two ports (hppa and loongarch) need to set SECTION_RELRO for
> .data.rel.ro{,.local} in section_type_flags (PR52999 and PR116887), and
> I cannot see a reason not to just set it in the generic code.
> 
> With this applied we can also remove the hppa-specific
> pa_section_type_flags in a future patch.
> 
> gcc/ChangeLog:
> 
>   PR target/116887
>   * varasm.cc (default_section_type_flags): Always set
>   SECTION_RELRO if name is .data.rel.ro{,.local}.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR target/116887
>   * gcc.dg/pr116887.c: New test.

Ping.

> ---
> 
> Bootstrapped & regtested on x86_64-linux-gnu.  Ok for trunk?
> 
>  gcc/testsuite/gcc.dg/pr116887.c | 23 +++
>  gcc/varasm.cc   | 10 --
>  2 files changed, 27 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/pr116887.c
> 
> diff --git a/gcc/testsuite/gcc.dg/pr116887.c
> b/gcc/testsuite/gcc.dg/pr116887.c
> new file mode 100644
> index 000..b7255e09a18
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr116887.c
> @@ -0,0 +1,23 @@
> +/* { dg-do compile } */
> +/* { dg-options "-fpic" } */
> +
> +struct link_map
> +{
> +  struct link_map *l_next;
> +};
> +struct rtld_global
> +{
> +  struct link_map *_ns_loaded;
> +  char buf[4096];
> +  struct link_map _dl_rtld_map;
> +};
> +extern struct rtld_global _rtld_global;
> +static int _dlfo_main __attribute__ ((section (".data.rel.ro"),
> used));
> +void
> +_dlfo_process_initial (int ns)
> +{
> +  for (struct link_map *l = _rtld_global._ns_loaded; l != ((void
> *)0);
> +   l = l->l_next)
> +    if (l == &_rtld_global._dl_rtld_map)
> +  asm ("");
> +}
> diff --git a/gcc/varasm.cc b/gcc/varasm.cc
> index 4426e7ce6c6..aa450092ce5 100644
> --- a/gcc/varasm.cc
> +++ b/gcc/varasm.cc
> @@ -6863,6 +6863,9 @@ default_section_type_flags (tree decl, const
> char *name, int reloc)
>  
>    if (decl && TREE_CODE (decl) == FUNCTION_DECL)
>  flags = SECTION_CODE;
> +  else if (strcmp (name, ".data.rel.ro") == 0
> +    || strcmp (name, ".data.rel.ro.local") == 0)
> +    flags = SECTION_WRITE | SECTION_RELRO;
>    else if (decl)
>  {
>    enum section_category category
> @@ -6876,12 +6879,7 @@ default_section_type_flags (tree decl, const
> char *name, int reloc)
>   flags = SECTION_WRITE;
>  }
>    else
> -    {
> -  flags = SECTION_WRITE;
> -  if (strcmp (name, ".data.rel.ro") == 0
> -   || strcmp (name, ".data.rel.ro.local") == 0)
> - flags |= SECTION_RELRO;
> -    }
> +    flags = SECTION_WRITE;
>  
>    if (decl && DECL_P (decl) && DECL_COMDAT_GROUP (decl))
>  flags |= SECTION_LINKONCE;

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


[PATCH] tree-optimization/117343 - decide_masked_load_lanes and stale graph

2024-10-29 Thread Richard Biener
It turns out decide_masked_load_lanes accesses a stale SLP graph,
so the following re-builds it instead.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

PR tree-optimization/117343
* tree-vect-slp.cc (vect_optimize_slp_pass::build_vertices):
Support re-building the SLP graph.
(vect_optimize_slp_pass::run): Re-build the SLP graph before
decide_masked_load_lanes.
---
 gcc/tree-vect-slp.cc | 4 
 1 file changed, 4 insertions(+)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 2e98a943e06..a7f064bb0ed 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -5632,6 +5632,8 @@ vect_optimize_slp_pass::build_vertices ()
   hash_set visited;
   unsigned i;
   slp_instance instance;
+  m_vertices.truncate (0);
+  m_leafs.truncate (0);
   FOR_EACH_VEC_ELT (m_vinfo->slp_instances, i, instance)
 build_vertices (visited, SLP_INSTANCE_TREE (instance));
 }
@@ -7244,6 +7246,8 @@ vect_optimize_slp_pass::run ()
 }
   else
 remove_redundant_permutations ();
+  free_graph (m_slpg);
+  build_graph ();
   decide_masked_load_lanes ();
   free_graph (m_slpg);
 }
-- 
2.43.0


Re: [PATCH v4 1/2] Match: Simplify (x != 0 ? x + ~0 : 0) to (x - x != 0).

2024-10-29 Thread Richard Biener
On Sat, Oct 26, 2024 at 12:20 AM Andrew Pinski  wrote:
>
> On Thu, Oct 24, 2024 at 6:22 PM Li Xu  wrote:
> >
> > From: xuli 
> >
> > When the imm operand op1=1 in the unsigned scalar sat_sub form2 below,
> > we can simplify (x != 0 ? x + ~0 : 0) to (x - (x != 0)), thereby eliminating
> > a branch instruction.  This simplification also applies to signed integers.
> >
> > Form2:
> > T __attribute__((noinline)) \
> > sat_u_sub_imm##IMM##_##T##_fmt_2 (T x)  \
> > {   \
> >   return x >= (T)IMM ? x - (T)IMM : 0;  \
> > }
> >
> > Take below form 2 as example:
> > DEF_SAT_U_SUB_IMM_FMT_2(uint8_t, 1)
> >
> > Before this patch:
> > __attribute__((noinline))
> > uint8_t sat_u_sub_imm1_uint8_t_fmt_2 (uint8_t x)
> > {
> >   uint8_t _1;
> >   uint8_t _3;
> >
> >[local count: 1073741824]:
> >   if (x_2(D) != 0)
> > goto ; [50.00%]
> >   else
> > goto ; [50.00%]
> >
> >[local count: 536870912]:
> >   _3 = x_2(D) + 255;
> >
> >[local count: 1073741824]:
> >   # _1 = PHI 
> >   return _1;
> >
> > }
> >
> > Assembly code:
> > sat_u_sub_imm1_uint8_t_fmt_2:
> > beq a0,zero,.L2
> > addiw   a0,a0,-1
> > andia0,a0,0xff
> > .L2:
> > ret
> >
> > After this patch:
> > __attribute__((noinline))
> > uint8_t sat_u_sub_imm1_uint8_t_fmt_2 (uint8_t x)
> > {
> >   _Bool _1;
> >   unsigned char _2;
> >   uint8_t _4;
> >
> >[local count: 1073741824]:
> >   _1 = x_3(D) != 0;
> >   _2 = (unsigned char) _1;
> >   _4 = x_3(D) - _2;
> >   return _4;
> >
> > }
> >
> > Assembly code:
> > sat_u_sub_imm1_uint8_t_fmt_2:
> > sneza5,a0
> > subwa0,a0,a5
> > andia0,a0,0xff
> > ret
> >
> > The below test suites are passed for this patch:
> > 1. The rv64gcv fully regression tests.
> > 2. The x86 bootstrap tests.
> > 3. The x86 fully regression tests.
> >
> > Signed-off-by: Li Xu 
> > gcc/ChangeLog:
> >
> > * match.pd: Simplify (x != 0 ? x + ~0 : 0) to (x - x != 0).
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.dg/tree-ssa/phi-opt-44.c: New test.
> > ---
> >  gcc/match.pd   | 10 +
> >  gcc/testsuite/gcc.dg/tree-ssa/phi-opt-44.c | 26 ++
> >  gcc/testsuite/gcc.dg/tree-ssa/phi-opt-45.c | 26 ++
> >  3 files changed, 62 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-44.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-45.c
> >
> > diff --git a/gcc/match.pd b/gcc/match.pd
> > index 0455dfa6993..f48fd7d52ba 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -3383,6 +3383,16 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >}
> >(if (wi::eq_p (sum, wi::uhwi (0, precision)))
> >
> > +/* The boundary condition for case 10: IMM = 1:
> > +   SAT_U_SUB = X >= IMM ? (X - IMM) : 0.
> > +   simplify (X != 0 ? X + ~0 : 0) to (X - X != 0).  */
> > +(simplify
> > + (cond (ne@1 @0 integer_zerop)
> > +   (nop_convert? (plus (nop_convert? @0) integer_all_onesp))
> > +   integer_zerop)
> > + (if (INTEGRAL_TYPE_P (type))
> > +   (minus @0 (convert @1
>
> This looks good to me, though I can't approve it.

OK.

Thanks,
Richard.
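
The identity is easy to spot-check: when x == 0 both sides are 0, and when
x != 0 the subtrahend (x != 0) is exactly 1, matching x + ~0 = x - 1.  A
tiny exhaustive driver (mine, not from the patch):

  #include <stdint.h>
  #include <assert.h>

  int main ()
  {
    for (int v = 0; v < 256; v++)
      {
        uint8_t x = v;
        uint8_t branch = x != 0 ? (uint8_t)(x - 1) : 0;
        uint8_t branchless = x - (x != 0);
        assert (branch == branchless);
      }
    return 0;
  }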

> Thanks,
> Andrew
>
> > +
> >  /* Signed saturation sub, case 1:
> > T minus = (T)((UT)X - (UT)Y);
> > SAT_S_SUB = (X ^ Y) & (X ^ minus) < 0 ? (-(T)(X < 0) ^ MAX) : minus;
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-44.c 
> > b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-44.c
> > new file mode 100644
> > index 000..962bf0954f6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-44.c
> > @@ -0,0 +1,26 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fdump-tree-phiopt1" } */
> > +
> > +#include <stdint.h>
> > +
> > +uint8_t f1 (uint8_t x)
> > +{
> > +  return x >= (uint8_t)1 ? x - (uint8_t)1 : 0;
> > +}
> > +
> > +uint16_t f2 (uint16_t x)
> > +{
> > +  return x >= (uint16_t)1 ? x - (uint16_t)1 : 0;
> > +}
> > +
> > +uint32_t f3 (uint32_t x)
> > +{
> > +  return x >= (uint32_t)1 ? x - (uint32_t)1 : 0;
> > +}
> > +
> > +uint64_t f4 (uint64_t x)
> > +{
> > +  return x >= (uint64_t)1 ? x - (uint64_t)1 : 0;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "goto" "phiopt1" } } */
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-45.c 
> > b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-45.c
> > new file mode 100644
> > index 000..62a2ab63184
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-45.c
> > @@ -0,0 +1,26 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fdump-tree-phiopt1" } */
> > +
> > +#include <stdint.h>
> > +
> > +int8_t f1 (int8_t x)
> > +{
> > +  return x != 0 ? x - (int8_t)1 : 0;
> > +}
> > +
> > +int16_t f2 (int16_t x)
> > +{
> > +  return x != 0 ? x - (int16_t)1 : 0;
> > +}
> > +
> > +int32_t f3 (int32_t x)
> > +{
> > +  return x != 0 ? x - (int32_t)1 : 0;
> > +}
> > +
> > +int64_t f4 (int64_t x)
> > +{
> > +  return x != 0 ? x - (int64_t)1 : 0;
> > +}
> > +
> > +/* { dg-final { sc

Re: [PATCH 1/2] RISC-V:Add intrinsic support for the CMOs extensions

2024-10-29 Thread Kito Cheng
On Tue, Oct 29, 2024 at 18:13,  wrote:

> From: yulong 
>
> gcc/ChangeLog:
>
> * config.gcc: Add riscv_cmo.h.
> * config/riscv/riscv_cmo.h: New file.
>
> ---
>  gcc/config.gcc   |  2 +-
>  gcc/config/riscv/riscv_cmo.h | 93 
>  2 files changed, 94 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/config/riscv/riscv_cmo.h
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index fd848228722..e2ed3b309cc 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -558,7 +558,7 @@ riscv*)
> extra_objs="${extra_objs} riscv-vector-builtins.o
> riscv-vector-builtins-shapes.o riscv-vector-builtins-bases.o"
> extra_objs="${extra_objs} thead.o riscv-target-attr.o"
> d_target_objs="riscv-d.o"
> -   extra_headers="riscv_vector.h riscv_crypto.h riscv_bitmanip.h
> riscv_th_vector.h"
> +   extra_headers="riscv_vector.h riscv_crypto.h riscv_bitmanip.h
> riscv_th_vector.h riscv_cmo.h"
> target_gtfiles="$target_gtfiles
> \$(srcdir)/config/riscv/riscv-vector-builtins.cc"
> target_gtfiles="$target_gtfiles
> \$(srcdir)/config/riscv/riscv-vector-builtins.h"
> ;;
> diff --git a/gcc/config/riscv/riscv_cmo.h b/gcc/config/riscv/riscv_cmo.h
> new file mode 100644
> index 000..95bf60da082
> --- /dev/null
> +++ b/gcc/config/riscv/riscv_cmo.h
> @@ -0,0 +1,93 @@
> +/* RISC-V CMO Extension intrinsics include file.
> +   Copyright (C) 2024 Free Software Foundation, Inc.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published
> +   by the Free Software Foundation; either version 3, or (at your
> +   option) any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but WITHOUT
> +   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
> +   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
> +   License for more details.
> +
> +   Under Section 7 of GPL version 3, you are granted additional
> +   permissions described in the GCC Runtime Library Exception, version
> +   3.1, as published by the Free Software Foundation.
> +
> +   You should have received a copy of the GNU General Public License and
> +   a copy of the GCC Runtime Library Exception along with this program;
> +   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
> +   .  */
> +
> +#ifndef __RISCV_CMO_H
> +#define __RISCV_CMO_H
> +
> +#include <stdint.h>


It doesn't seem to use anything from stdint.h?



> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif


No need for this since all functions are supposed to be inlined, so there
is no mangling issue.


> +
> +#if defined (__riscv_zicbom)
> +
> +extern __inline void
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +__riscv_cmo_clean (void *addr)
> +{
> +__builtin_riscv_zicbom_cbo_clean(addr);
> +}
> +
> +extern __inline void
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +__riscv_cmo_flush (void *addr)
> +{
> +__builtin_riscv_zicbom_cbo_flush(addr);
> +}
> +
> +extern __inline void
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +__riscv_cmo_inval (void *addr)
> +{
> +__builtin_riscv_zicbom_cbo_inval(addr);
> +}
> +
> +#endif // __riscv_zicbom
> +
> +#if defined (__riscv_zicbop)
> +
> +# define rnum 1
> +
> +extern __inline void
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +__riscv_cmo_prefetch (void *addr, const int vs1, const int vs2)
> +{
> +__builtin_prefetch(addr,vs1,vs2);
> +}
> +
> +extern __inline int
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +__riscv_cmo_prefetchi ()
> +{
> +return __builtin_riscv_zicbop_cbo_prefetchi(rnum);
> +}
> +
> +#endif // __riscv_zicbop
> +
> +#if defined (__riscv_zicboz)
> +
> +extern __inline void
> +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> +__riscv_cmo_zero (void *addr)
> +{
> +__builtin_riscv_zicboz_cbo_zero(addr);
> +}
> +
> +#endif // __riscv_zicboz
> +
> +#if defined (__cplusplus)
> +}
> +#endif // __cplusplus
> +#endif // __RISCV_CMO_H
> \ No newline at end of file
> --
> 2.34.1
>
>
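
For context, a user of the proposed header would look roughly like this
(hypothetical usage sketch; the 64-byte line size and the presence of the
Zicbom extension are assumptions):

  #include <riscv_cmo.h>

  #define CACHE_LINE 64  /* assumed cache-line size */

  /* Write back (clean) every cache line covering buf[0..len).  */
  void clean_buffer (char *buf, unsigned long len)
  {
  #if defined (__riscv_zicbom)
    for (unsigned long off = 0; off < len; off += CACHE_LINE)
      __riscv_cmo_clean (buf + off);
  #endif
  }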


[PATCH] libstdc++: Avoid unnecessary copies in ranges::min/max [PR112349]

2024-10-29 Thread Patrick Palka
Tested on x86_64-pc-linux-gnu; does this look OK for trunk and perhaps
14?

-- >8 --

Use a local reference for the (now possibly lifetime extended) result of
*__first to avoid making unnecessary copies of it.
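
The effect of the one-character change is easy to reproduce outside the
library; the sketch below (mine, mirroring the new tests) counts copies for
a descending input, where the old 'auto __tmp = *__first;' would add one
copy per element:

  #include <cassert>

  struct A {
    static inline int copies = 0;
    int m;
    A (int v) : m (v) {}
    A (const A& o) : m (o.m) { ++copies; }
    A& operator= (const A& o) { m = o.m; ++copies; return *this; }
  };

  int main ()
  {
    A arr[3] = {3, 2, 1};
    A result = arr[0];        // 1 copy for the running result
    for (int i = 1; i < 3; i++)
      {
        auto&& tmp = arr[i];  // binds a reference: no copy, unlike 'auto'
        if (tmp.m < result.m)
          result = tmp;       // copy only when a new minimum is found
      }
    assert (A::copies == 3);  // 'auto tmp' would have given 5
    return 0;
  }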

PR libstdc++/112349

libstdc++-v3/ChangeLog:

* include/bits/ranges_algo.h (__min_fn::operator()): Turn local
object __tmp into a reference.
* include/bits/ranges_util.h (__max_fn::operator()): Likewise.
* testsuite/25_algorithms/max/constrained.cc (test04): New test.
* testsuite/25_algorithms/min/constrained.cc (test04): New test.
---
 libstdc++-v3/include/bits/ranges_algo.h   |  4 +--
 libstdc++-v3/include/bits/ranges_util.h   |  4 +--
 .../25_algorithms/max/constrained.cc  | 25 +++
 .../25_algorithms/min/constrained.cc  | 25 +++
 4 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/libstdc++-v3/include/bits/ranges_algo.h 
b/libstdc++-v3/include/bits/ranges_algo.h
index bae36637b3e..e1aba256241 100644
--- a/libstdc++-v3/include/bits/ranges_algo.h
+++ b/libstdc++-v3/include/bits/ranges_algo.h
@@ -2945,11 +2945,11 @@ namespace ranges
auto __result = *__first;
while (++__first != __last)
  {
-   auto __tmp = *__first;
+   auto&& __tmp = *__first;
if (std::__invoke(__comp,
  std::__invoke(__proj, __result),
  std::__invoke(__proj, __tmp)))
- __result = std::move(__tmp);
+ __result = std::forward(__tmp);
  }
return __result;
   }
diff --git a/libstdc++-v3/include/bits/ranges_util.h 
b/libstdc++-v3/include/bits/ranges_util.h
index 3f191e6d446..4a5349ae92a 100644
--- a/libstdc++-v3/include/bits/ranges_util.h
+++ b/libstdc++-v3/include/bits/ranges_util.h
@@ -754,11 +754,11 @@ namespace ranges
auto __result = *__first;
while (++__first != __last)
  {
-   auto __tmp = *__first;
+   auto&& __tmp = *__first;
if (std::__invoke(__comp,
  std::__invoke(__proj, __tmp),
  std::__invoke(__proj, __result)))
- __result = std::move(__tmp);
+ __result = std::forward(__tmp);
  }
return __result;
   }
diff --git a/libstdc++-v3/testsuite/25_algorithms/max/constrained.cc 
b/libstdc++-v3/testsuite/25_algorithms/max/constrained.cc
index e7269e1b734..717900656bd 100644
--- a/libstdc++-v3/testsuite/25_algorithms/max/constrained.cc
+++ b/libstdc++-v3/testsuite/25_algorithms/max/constrained.cc
@@ -73,10 +73,35 @@ test03()
   VERIFY( ranges::max({2,3,1,4}, ranges::greater{}, std::negate<>{}) == 4 );
 }
 
+void
+test04()
+{
+  // PR libstdc++/112349 - ranges::max/min make unnecessary copies
+  static int copies, moves;
+  struct A {
+A(int m) : m(m) { }
+A(const A& other) : m(other.m) { ++copies; }
+A(A&& other) : m(other.m) { ++moves; }
+A& operator=(const A& other) { m = other.m; ++copies; return *this; }
+A& operator=(A&& other) { m = other.m; ++moves; return *this; }
+int m;
+  };
+  A r[5] = {5, 4, 3, 2, 1};
+  ranges::max(r, std::less{}, &A::m);
+  VERIFY( copies == 1 );
+  VERIFY( moves == 0 );
+  copies = moves = 0;
+  A s[5] = {1, 2, 3, 4, 5};
+  ranges::max(s, std::less{}, &A::m);
+  VERIFY( copies == 5 );
+  VERIFY( moves == 0 );
+}
+
 int
 main()
 {
   test01();
   test02();
   test03();
+  test04();
 }
diff --git a/libstdc++-v3/testsuite/25_algorithms/min/constrained.cc 
b/libstdc++-v3/testsuite/25_algorithms/min/constrained.cc
index 7198df69adf..d338a86f186 100644
--- a/libstdc++-v3/testsuite/25_algorithms/min/constrained.cc
+++ b/libstdc++-v3/testsuite/25_algorithms/min/constrained.cc
@@ -73,10 +73,35 @@ test03()
   VERIFY( ranges::min({2,3,1,4}, ranges::greater{}, std::negate<>{}) == 1 );
 }
 
+void
+test04()
+{
+  // PR libstdc++/112349 - ranges::max/min make unnecessary copies
+  static int copies, moves;
+  struct A {
+A(int m) : m(m) { }
+A(const A& other) : m(other.m) { ++copies; }
+A(A&& other) : m(other.m) { ++moves; }
+A& operator=(const A& other) { m = other.m; ++copies; return *this; }
+A& operator=(A&& other) { m = other.m; ++moves; return *this; }
+int m;
+  };
+  A r[5] = {5, 4, 3, 2, 1};
+  ranges::min(r, std::less{}, &A::m);
+  VERIFY( copies == 5 );
+  VERIFY( moves == 0 );
+  copies = moves = 0;
+  A s[5] = {1, 2, 3, 4, 5};
+  ranges::min(s, std::less{}, &A::m);
+  VERIFY( copies == 1 );
+  VERIFY( moves == 0 );
+}
+
 int
 main()
 {
   test01();
   test02();
   test03();
+  test04();
 }
-- 
2.47.0.148.g6a11438f43



Re: SVE intrinsics: Fold constant operands for svlsl.

2024-10-29 Thread Soumya AR


> On 24 Oct 2024, at 2:55 PM, Richard Sandiford  
> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> Kyrylo Tkachov  writes:
>>> On 24 Oct 2024, at 10:39, Soumya AR  wrote:
>>> 
>>> Hi Richard,
>>> 
 On 23 Oct 2024, at 5:58 PM, Richard Sandiford  
 wrote:
 
 External email: Use caution opening links or attachments
 
 
 Soumya AR  writes:
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins.cc
> index 41673745cfe..aa556859d2e 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
> @@ -1143,11 +1143,14 @@ aarch64_const_binop (enum tree_code code, tree 
> arg1, tree arg2)
>  tree type = TREE_TYPE (arg1);
>  signop sign = TYPE_SIGN (type);
>  wi::overflow_type overflow = wi::OVF_NONE;
> -
> +  unsigned int element_bytes = tree_to_uhwi (TYPE_SIZE_UNIT (type));
>  /* Return 0 for division by 0, like SDIV and UDIV do.  */
>  if (code == TRUNC_DIV_EXPR && integer_zerop (arg2))
> return arg2;
> -
> +  /* Return 0 if shift amount is out of range. */
> +  if (code == LSHIFT_EXPR
> +   && tree_to_uhwi (arg2) >= (element_bytes * BITS_PER_UNIT))
 
 tree_to_uhwi is dangerous because a general shift might be negative
 (even if these particular shift amounts are unsigned).  We should
 probably also key off TYPE_PRECISION rather than TYPE_SIZE_UNIT.  So:
 
   if (code == LSHIFT_EXPR
   && wi::geu_p (wi::to_wide (arg2), TYPE_PRECISION (type)))
 
 without the element_bytes variable.  Also: the indentation looks a bit off;
 it should be tabs only followed by spaces only.
>>> 
>>> Thanks for the feedback, posting an updated patch with the suggested 
>>> changes.
>> 
>> Thanks Soumya, I’ve pushed this patch to trunk as commit 3e7549ece7c after 
>> adjusting
>> the ChangeLog slightly to start the lines with tabs instead of spaces.
> 
> Sorry Soumya, I forgot that you didn't have commit access yet.
> It's time you did though.  Could you follow the instructions
> on https://gcc.gnu.org/gitwrite.html ?  I'm happy to sponsor
> (and I'm sure Kyrill would be too).

Wow, that’s exciting! Kyrill has agreed to sponsor but thanks
nonetheless! :)

Best,
Soumya

> Thanks,
> Richard
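
For context, the committed check folds left shifts whose amount is out of
range down to zero.  A sketch of how the suggested form slots into
aarch64_const_binop (the zero return is inferred from the comment in the
quoted hunk, not copied from the commit):

  /* Return 0 if the shift amount is out of range, comparing as a
     wide_int so that negative shift amounts are handled safely.  */
  if (code == LSHIFT_EXPR
      && wi::geu_p (wi::to_wide (arg2), TYPE_PRECISION (type)))
    return build_zero_cst (type);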



Re: [Patch] AMD GCN: Set HSA_XNACK for USM and 'xnack+' / 'xnack-'

2024-10-29 Thread Andrew Stubbs

On 29/10/2024 11:44, Tobias Burnus wrote:

While users can set HSA_XNACK themselves, it is much more convenient if
the compiler sets it for them (at least if it is overridable).

Some systems don't have XNACK, but for those that have it, the newish
object code versions support three modes: unset (GCC: '-mxnack=any',
supporting both XNACK and not), set and unset; the last two only work if the
compiled-for mode matches the actual mode in which the GPU is running.
Therefore, setting HSA_XNACK in this case makes sense.

XNACK (when available) also needs to be enabled in order to have working
unified-shared memory access, hence, setting it in that case also makes
sense.

Therefore, this patch sets HSA_XNACK to 0 or 1.

This somewhat matches what is done in OG13 and in Andrew's patch at
https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655951.html
albeit the code is somewhat different.
[For some reasons, this code is not in OG14 ?!?]

While doing so, I also updated the documentation and moved the code from
the existing stack-size constructor in the existing 'init' constructor to
reduce the number of used constructors.

OK for mainline?


This conflicts with my patch that already does (some of) this, which was
submitted as part of the USM series that is still awaiting review.


https://patchwork.sourceware.org/project/gcc/patch/20240628102449.562467-6-...@baylibre.com/

Andrew


Re: [Patch] AMD GCN: Set HSA_XNACK for USM and 'xnack+' / 'xnack-'

2024-10-29 Thread Tobias Burnus

Hi Andrew,

Am 29.10.24 um 13:07 schrieb Andrew Stubbs:

On 29/10/2024 11:44, Tobias Burnus wrote:

This somewhat matches what is done in OG13 and in Andrew's patch at
https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655951.html
albeit the code is somewhat different.
[For some reasons, this code is not in OG14 ?!?]

...
This conflicts with my patch that already does (some of) this, which was
submitted as part of the USM series that is still awaiting review.


https://patchwork.sourceware.org/project/gcc/patch/20240628102449.562467-6-...@baylibre.com/ 



Well, if you go to the link above, it shows the same patch …

Tobias
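
For illustration, the constructor generated by mkoffload amounts to something
like the following sketch (simplified; the real generated code also handles
GCN_STACK_SIZE and chooses "0" or "1" from the xnack mode and the USM
requirement):

  #include <stdlib.h>

  __attribute__ ((constructor))
  static void
  init (void)
  {
    /* Third argument 0: a user-provided HSA_XNACK is not overwritten.  */
    setenv ("HSA_XNACK", "1", 0);
  }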



Re: Frontend access to target features (was Re: [PATCH] libgccjit: Add ability to get CPU features)

2024-10-29 Thread Antoni Boucher

David: Arthur reviewed the gccrs patch and would be OK with it.

Could you please take a look and review it?

Le 2024-10-17 à 11 h 38, Antoni Boucher a écrit :

Hi.
Thanks for the review, David!

I talked to Arthur and he's OK with having a file to include in both 
gccrs and libgccjit.


I sent the patch to gccrs to move the code in a new file that we can 
include in both frontends: https://github.com/Rust-GCC/gccrs/pull/3195


I also renamed gcc_jit_target_info_supports_128bit_int to 
gcc_jit_target_info_supports_target_dependent_type because a subsequent 
patch will allow to check if other types are supported like _Float16 and 
_Float128.


Here's the patch for libgccjit updated to include this file.

Thanks.

Le 2024-06-26 à 17 h 55, David Malcolm a écrit :

On Sun, 2024-03-10 at 12:05 +0100, Iain Buclaw wrote:

Excerpts from David Malcolm's message of März 5, 2024 4:09 pm:

On Thu, 2023-11-09 at 19:33 -0500, Antoni Boucher wrote:

Hi.
See answers below.

On Thu, 2023-11-09 at 18:04 -0500, David Malcolm wrote:

On Thu, 2023-11-09 at 17:27 -0500, Antoni Boucher wrote:

Hi.
This patch adds support for getting the CPU features in
libgccjit
(bug
112466)

There's a TODO in the test:
I'm not sure how to test that gcc_jit_target_info_arch
returns
the
correct value since it is dependant on the CPU.
Any idea on how to improve this?

Also, I created a CStringHash to be able to have a
std::unordered_set. Is there any built-in way
of
doing
this?


Thanks for the patch.

Some high-level questions:

Is this specifically about detecting capabilities of the host
that
libgccjit is currently running on? or how the target was
configured
when libgccjit was built?


I'm less sure about this part. I'll need to do more tests.



One of the benefits of libgccjit is that, in theory, we support
all
of
the targets that GCC already supports.  Does this patch change
that,
or
is this more about giving client code the ability to determine
capabilities of the specific host being compiled for?


This should not change that. If it does, this is a bug.



I'm nervous about having per-target jit code.  Presumably
there's a
reason that we can't reuse existing target logic here - can you
please
describe what the problem is.  I see that the ChangeLog has:


 * config/i386/i386-jit.cc: New file.


where i386-jit.cc has almost 200 lines of nontrivial code.
Where
did
this come from?  Did you base it on existing code in our source
tree,
making modifications to fit the new internal API, or did you
write
it
from scratch?  In either case, how onerous would this be for
other
targets?


This was mostly copied from the same code done for the Rust and D
frontends.
See this commit and the following:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=b1c06fd9723453dd2b2ec306684cb806dc2b4fbb

The equivalent to i386-jit.cc is there:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=22e3557e2d52f129f2bbfdc98688b945dba28dc9


[CCing Iain and Arthur re those patches; for reference, the patch
being
discussed is attached to :
https://gcc.gnu.org/pipermail/jit/2024q1/001792.html ]

One of my concerns about this patch is that we seem to be gaining
code
that's per-(frontend x config) which seems to be copied and pasted
with
a search and replace, which could lead to an M*N explosion.



That's certainly the case with the configure/make rules. Itself I
think
is copied originally from the {cpu_type}-protos.h machinery.

It might be worth pointing out that the c-family of front-ends don't
have separate headers because their per-target macros are defined in
{cpu_type}.h directly - for better or worse.


Is there any real difference between the per-config code for the
different frontends, or should there be a general "enumerate all
features of the target" hook that's independent of the frontend?
(but
perhaps calls into it).



As far as I understand, the configure parts should all be identical
between tm_p, tm_d, tm_rust, ..., so would benefit from being
templated
to aid any other front-ends adding in their own per target hooks.


Am I right in thinking that (rustc with default LLVM backend) has
some
set of feature strings that both (rustc with rustc_codegen_gcc) and
gccrs are trying to emulate?  If so, is it presumably a goal that
libgccjit gives identical results to gccrs?  If so, would it be
crazy
for libgccjit to consume e.g. config/i386/i386-rust.cc ?


I don't know whether libgccjit can just pull in directly the
implementation of the rust target hooks here.


Sorry for the delay in responding.

I don't want to be in the business of maintaining a copy of the per-
target code for "jit", and I think it makes sense for libgccjit to
return identical information compared to gccrs.

So I think it would be ideal for jit to share code with rust for this,
rather than do a one-time copy-and-paste followed by a ongoing "keep
things updated" treadmill.

Presumably there would be Makefile.in issues given that e.g. Makefile
has i386-rust.o listed in:

# Target specific, Rust

Re: [RFC PATCH 3/5] vect: Fix dominators when adding a guard to skip the vector loop

2024-10-29 Thread Richard Biener
On Mon, 28 Oct 2024, Alex Coplan wrote:

> From: Tamar Christina 
> 
> The alignment peeling changes exposed a latent missing dominator update
> with early break vectorization, specifically when inserting the vector
> skip edge, since the new edge bypasses the prolog skip block and thus
> has the potential to subvert its dominance.  This patch fixes that.

OK.

Thanks,
Richard.

> gcc/ChangeLog:
> 
>   * tree-vect-loop-manip.cc (vect_do_peeling): Update immediate
>   dominators of nodes that were dominated by the prolog skip block
>   after inserting vector skip edge.
> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.dg/vect/vect-early-break_6.cc: New test.
> 
> Co-Authored-By: Alex Coplan 
> ---
>  .../g++.dg/vect/vect-early-break_6.cc | 25 +++
>  gcc/tree-vect-loop-manip.cc   | 24 ++
>  2 files changed, 49 insertions(+)
>  create mode 100644 gcc/testsuite/g++.dg/vect/vect-early-break_6.cc
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [RFC PATCH 4/5] vect: Ensure we add vector skip guard even when versioning for aliasing

2024-10-29 Thread Richard Biener
On Mon, 28 Oct 2024, Alex Coplan wrote:

> This fixes a latent wrong code issue whereby vect_do_peeling determined
> the wrong condition for inserting the vector skip guard.  Specifically
> in the case where the loop niters are unknown at compile time we used to
> check:
> 
>   !LOOP_REQUIRES_VERSIONING (loop_vinfo)
> 
> but LOOP_REQUIRES_VERSIONING is true for loops which we have versioned
> for aliasing, and that has nothing to do with prolog peeling.  I think
> this condition should instead be checking specifically if we aren't
> versioning for alignment.
> 
> As it stands, when we version for alignment, we don't peel, so the
> vector skip guard is indeed redundant in that case.
> 
> With the testcase added (reduced from the Fortran frontend) we would
> version for aliasing, omit the vector skip guard, and then at runtime we
> would peel sufficient iterations for alignment that there wasn't a full
> vector iteration left when we entered the vector body, thus overflowing
> the output buffer.

OK.

Thanks,
Richard.

> gcc/ChangeLog:
> 
>   * tree-vect-loop-manip.cc (vect_do_peeling): Adjust skip_vector
>   condition to only omit the edge if we're versioning for
>   alignment.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/vect-early-break_130.c: New test.
> ---
>  .../gcc.dg/vect/vect-early-break_130.c| 91 +++
>  gcc/tree-vect-loop-manip.cc   |  2 +-
>  2 files changed, 92 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-early-break_130.c
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [RFC PATCH 1/5] vect: Force alignment peeling to vectorize more early break loops

2024-10-29 Thread Richard Biener
On Mon, 28 Oct 2024, Alex Coplan wrote:

> This allows us to vectorize more loops with early exits by forcing
> peeling for alignment to make sure that we're guaranteed to be able to
> safely read an entire vector iteration without crossing a page boundary.
> 
> To make this work for VLA architectures we have to allow compile-time
> non-constant target alignments.  We also have to override the result of
> the target's preferred_vector_alignment hook if it isn't a power-of-two
> multiple of the TYPE_SIZE of the chosen vector type.
> 
> There is currently an implicit assumption that the TYPE_SIZE of the
> vector type is itself a power of two.  For non-VLA types this
> could be checked directly in the vectorizer.  For VLA types I
> had discussed offline with Richard S about adding a target hook to allow
> the vectorizer to query the backend to confirm that a given VLA type
> is known to have a power-of-two size at runtime.

GCC assumes all vectors have power-of-two size, so I don't think we
need to check anything but we'd instead have to make sure the
target constrains the hardware when this assumption doesn't hold
in silicon.

>  I thought we
> might be able to do this check in vector_alignment_reachable_p.  Any
> thoughts on that, richi?

For the purpose of alignment peeling yeah, I guess this would be
a possible place to check this.  The hook is currently used for
the case where the element has a lower alignment than its
size and thus vector alignment cannot be reached by peeling.

Btw, I thought we can already apply peeling for alignment for
VLA vectors ...

> gcc/ChangeLog:
> 
>   * tree-vect-data-refs.cc (vect_analyze_early_break_dependences):
>   Set need_peeling_for_alignment flag on read DRs instead of
>   failing vectorization.  Punt on gathers.
>   (dr_misalignment): Handle non-constant target alignments.
>   (vect_compute_data_ref_alignment): If need_peeling_for_alignment
>   flag is set on the DR, then override the target alignment chosen
>   by the preferred_vector_alignment hook to choose a safe
>   alignment.
>   (vect_supportable_dr_alignment): Override
>   support_vector_misalignment hook if need_peeling_for_alignment
>   is set on the DR: in this case we must return
>   dr_unaligned_unsupported in order to force peeling.
>   * tree-vect-loop-manip.cc (vect_do_peeling): Allow prolog
>   peeling by a compile-time non-constant amount.
>   * tree-vectorizer.h (dr_vec_info): Add new flag
>   need_peeling_for_alignment.
> ---
>  gcc/tree-vect-data-refs.cc  | 77 ++---
>  gcc/tree-vect-loop-manip.cc |  6 ---
>  gcc/tree-vectorizer.h   |  5 +++
>  3 files changed, 68 insertions(+), 20 deletions(-)

Eh, where's the inline copy ...

@@ -739,15 +739,22 @@ vect_analyze_early_break_dependences (loop_vec_info 
loop_vinfo)
  if (DR_IS_READ (dr_ref)
  && !ref_within_array_bound (stmt, DR_REF (dr_ref)))
{
+ if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))
+   {
+ const char *msg

you want to add STMT_VINFO_STRIDED_P as well.

  /* Vector size in bytes.  */
+  poly_uint64 safe_align
+   = exact_div (tree_to_poly_uint64 (TYPE_SIZE (vectype)), BITS_PER_UNIT);

safe_align = TYPE_SIZE_UNIT (vectype);

+  /* Multiply by the unroll factor to get the number of bytes read
+per vector iteration.  */
+  if (loop_vinfo)
+   {
+ auto num_copies = vect_get_num_copies (loop_vinfo, vectype);
+ gcc_checking_assert (pow2p_hwi (num_copies));
+ safe_align *= num_copies;

the unroll factor is the vectorization factor - I think the above goes
wrong for grouped accesses like an early break condition

 if (a[2*i] == a[2*i+1])

or so.  Thus, multiply by LOOP_VINFO_VECT_FACTOR (loop_vinfo).
Note this number doesn't need to be a power of two (and num_copies
above neither)

The rest of the patch looks good to me.

Richard.
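
For concreteness, the grouped-access early-break shape referred to above,
where one scalar iteration touches two adjacent elements and the bytes read
per vector iteration therefore have to be derived from the vectorization
factor rather than the number of vector copies:

  /* Early break on a grouped access: each iteration reads a[2*i]
     and a[2*i+1].  */
  int
  find_pair (const int *a, int n)
  {
    for (int i = 0; i < n; i++)
      if (a[2 * i] == a[2 * i + 1])
	return i;
    return -1;
  }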



RE: [PATCH 1/5] Internal-fn: Introduce new IFN MASK_LEN_STRIDED_LOAD{STORE}

2024-10-29 Thread Li, Pan2
Thanks Richard for comments.

> Please mention the full optab names.

Sure, let me adjust this before commit manually.

> There is documentation missing for doc/md.texi for the new optabs.

Ack, will take another patch for doc.

> Otherwise looks OK.  I'll note that non-masked or non-len-only-masked
> variants are missing but this is OK I guess.

Yes, we can add non-masked/non-len variants when we need in future.

Pan

-Original Message-
From: Richard Biener  
Sent: Tuesday, October 29, 2024 6:44 PM
To: Li, Pan2 
Cc: gcc-patches@gcc.gnu.org; tamar.christ...@arm.com; juzhe.zh...@rivai.ai; 
kito.ch...@gmail.com; jeffreya...@gmail.com; rdapp@gmail.com
Subject: Re: [PATCH 1/5] Internal-fn: Introduce new IFN 
MASK_LEN_STRIDED_LOAD{STORE}

On Wed, Oct 23, 2024 at 12:47 PM  wrote:
>
> From: Pan Li 
>
> This patch would like to introduce new IFN for strided load and store.
>
> LOAD:  v = MASK_LEN_STRIDED_LOAD (ptr, stride, mask, len, bias)
> STORE: MASK_LEN_STRIDED_STORE (ptr, stride, v, mask, len, bias)
>
> The IFN targets code similar to the example below
>
> void foo (int * a, int * b, int stride, int n)
> {
>   for (int i = 0; i < n; i++)
> a[i * stride] = b[i * stride];
> }
>
> The below test suites are passed for this patch.
> * The rv64gcv fully regression test.
> * The x86 bootstrap test.
> * The x86 fully regression test.
>
> gcc/ChangeLog:
>
> * internal-fn.cc (strided_load_direct): Add new define direct
> for strided load.
> (strided_store_direct): Ditto but for store.
> (expand_strided_load_optab_fn): Add new func to expand the IFN
> MASK_LEN_STRIDED_LOAD in middle-end.
> (expand_strided_store_optab_fn): Ditto but for store.
> (direct_strided_load_optab_supported_p): Add define for stride
> load optab supported.
> (direct_strided_store_optab_supported_p): Ditto but for store.
> (internal_fn_len_index): Add strided load/store len index.
> (internal_fn_mask_index): Ditto but for mask.
> (internal_fn_stored_value_index): Add strided store value index.
> * internal-fn.def (MASK_LEN_STRIDED_LOAD): Add new IFN for
> strided load.
> (MASK_LEN_STRIDED_STORE): Ditto but for store.
> * optabs.def (OPTAB_D): Add strided load/store optab.

Please mention the full optab names.

There is documentation missing for doc/md.texi for the new optabs.

Otherwise looks OK.  I'll note that non-masked or non-len-only-masked
variants are missing but this is OK I guess.

Richard.

>
> Signed-off-by: Pan Li 
> Co-Authored-By: Juzhe-Zhong 
> ---
>  gcc/internal-fn.cc  | 71 +
>  gcc/internal-fn.def |  6 
>  gcc/optabs.def  |  2 ++
>  3 files changed, 79 insertions(+)
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index d89a04fe412..bfbbba8e2dd 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -159,6 +159,7 @@ init_internal_fns ()
>  #define load_lanes_direct { -1, -1, false }
>  #define mask_load_lanes_direct { -1, -1, false }
>  #define gather_load_direct { 3, 1, false }
> +#define strided_load_direct { -1, -1, false }
>  #define len_load_direct { -1, -1, false }
>  #define mask_len_load_direct { -1, 4, false }
>  #define mask_store_direct { 3, 2, false }
> @@ -168,6 +169,7 @@ init_internal_fns ()
>  #define vec_cond_mask_len_direct { 1, 1, false }
>  #define vec_cond_direct { 2, 0, false }
>  #define scatter_store_direct { 3, 1, false }
> +#define strided_store_direct { 1, 1, false }
>  #define len_store_direct { 3, 3, false }
>  #define mask_len_store_direct { 4, 5, false }
>  #define vec_set_direct { 3, 3, false }
> @@ -3712,6 +3714,64 @@ expand_gather_load_optab_fn (internal_fn, gcall *stmt, 
> direct_optab optab)
>assign_call_lhs (lhs, lhs_rtx, &ops[0]);
>  }
>
> +/* Expand MASK_LEN_STRIDED_LOAD call CALL by optab OPTAB.  */
> +
> +static void
> +expand_strided_load_optab_fn (ATTRIBUTE_UNUSED internal_fn, gcall *stmt,
> + direct_optab optab)
> +{
> +  tree lhs = gimple_call_lhs (stmt);
> +  tree base = gimple_call_arg (stmt, 0);
> +  tree stride = gimple_call_arg (stmt, 1);
> +
> +  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +  rtx base_rtx = expand_normal (base);
> +  rtx stride_rtx = expand_normal (stride);
> +
> +  unsigned i = 0;
> +  class expand_operand ops[6];
> +  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
> +
> +  create_output_operand (&ops[i++], lhs_rtx, mode);
> +  create_address_operand (&ops[i++], base_rtx);
> +  create_address_operand (&ops[i++], stride_rtx);
> +
> +  i = add_mask_and_len_args (ops, i, stmt);
> +  expand_insn (direct_optab_handler (optab, mode), i, ops);
> +
> +  if (!rtx_equal_p (lhs_rtx, ops[0].value))
> +emit_move_insn (lhs_rtx, ops[0].value);
> +}
> +
> +/* Expand MASK_LEN_STRIDED_STORE call CALL by optab OPTAB.  */
> +
> +static void
> +expand_strided_store_optab_fn (ATTRIBUTE_UNUSED internal_fn, gcall

Re: [RFC PATCH 2/5] vect: Don't guard scalar epilogue for inverted loops

2024-10-29 Thread Richard Biener
On Mon, 28 Oct 2024, Alex Coplan wrote:

> For loops with LOOP_VINFO_EARLY_BREAKS_VECT_PEELED we should always
> enter the scalar epilogue, so avoid emitting a guard on entry to the
> epilogue.

OK.  I guess this can go in independently?

Richard.

> gcc/ChangeLog:
> 
>   * tree-vect-loop-manip.cc (vect_do_peeling): Avoid emitting an
>   epilogue guard for inverted early-exit loops.
> ---
>  gcc/tree-vect-loop-manip.cc | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: testsuite: Use noinline in gcc.dg/simulate-thread/simulate-thread.h

2024-10-29 Thread Joseph Myers
On Thu, 24 Oct 2024, Joseph Myers wrote:

> Among the changes of test results with a -std=gnu23 default were two
> tests changing from PASS to UNSUPPORTED:
> 
> UNSUPPORTED: gcc.dg/simulate-thread/speculative-store.c   -O2 -g  thread 
> simulation test
> UNSUPPORTED: gcc.dg/simulate-thread/speculative-store.c   -O3 -g  thread 
> simulation test
> 
> It appears that functions defined with () becoming prototyped affects
> inlining, and changing the code to use (void) allows UNSUPPORTED
> results to be reproduced with -std=gnu17.  Add __attribute__
> ((noinline)) on one more function to avoid the UNSUPPORTED results;
> some of the tests in this directory already have such an attribute on
> some functions.
> 
> Tested for x86_64-pc-linux-gnu.  OK to commit?
> 
>   * gcc.dg/simulate-thread/simulate-thread.h
>   (simulate_thread_wrapper_final_verify): Mark noinline.

Linaro CI reported regressions for this on arm-linux-gnueabihf:

  | FAIL: gcc.dg/simulate-thread/atomic-load-longlong.c -O2 -g  thread 
simulation test
  | FAIL: gcc.dg/simulate-thread/atomic-load-longlong.c -O3 -g  thread 
simulation test

Examining the logs shows they are "Testcase exceeded maximum instruction 
count threshold", and that there is already such a failure the Linaro CI 
is considering expected for the -O0 test and for several other 
simulate-thread tests.

Thus, I propose that we do not consider the failures reported by Linaro CI 
as something to block this testsuite patch (which still needs review, as 
does my -std=gnu17 patch for gcc.dg/pr114115.c) but rather as a reasonable 
consequence of adding a noinline attribute to a test that already seems 
marginal on this target system for whether it can complete within the 
resource limits.

-- 
Joseph S. Myers
josmy...@redhat.com



Re: [PATCH] Add 'cobol' to Makefile.def, take 2

2024-10-29 Thread James K. Lowden
On Tue, 29 Oct 2024 11:56:18 +0100
Richard Biener  wrote:

> gcc/ should be stripped from * gcc/common.opt, so just * common.opt
...
> Likewise for gcc/cobol.

I see. The names in these cases are relative to gcc, not to the whole
project.  The runtime library, libgcobol, like the other libraries, is
adjacent to gcc.  

I will modify and repost.  

BTW, we're going to patch cobol/config-lang.in twice.  For this first
patch, it doesn't mention cobol, so as not to interfere with the rest
of gcc.  The next patch, which implements the FE, includes 

gtfiles="\$(srcdir)/cobol/cobol1.cc"

I don't know why this is needed.  I only know all the other FEs seem
to define gtfiles, and that if it's not included then the build fails
because the linker is missing symbols.  

Jakub Jelinek said: 

> > We'll mainly have to remember when pushing any of the series.
> 
> The m2 dances were git_commit.py tweaks:
> https://gcc.gnu.org/r13-4588
> followed by asking one of us gccadmins on IRC to install that change
> on the server, followed by adding just new directory/ies with almost
> empty ChangeLog files:
> https://gcc.gnu.org/r13-4592
> followed by the rest of the changes.

This 1st patch includes a stub cobol/ChangeLog, if that helps.  

--jkl


Re: [PATCH] c: detect variably-modified types [PR117145,PR117245,PR100420]

2024-10-29 Thread Joseph Myers
On Sat, 26 Oct 2024, Martin Uecker wrote:

> +tree
> +c_build_pointer_type (tree to_type)
> +{
> +  addr_space_t as = to_type == error_mark_node? ADDR_SPACE_GENERIC
> +   : TYPE_ADDR_SPACE (to_type);

This is badly formatted, missing space before '?'.

> +/* Build an array type.  This sets typeless storage as required
> +   by C23 and C_TYPE_VARIABLY_MODIFIED and C_TYPE_VARIABLE_SIZE
> +   based on the element type and domain.  */

As required by C2Y, not C23.

> +  else if (TREE_CODE (type) == REFERENCE_TYPE
> +|| TREE_CODE (type) == OFFSET_TYPE)
> +{
> +  gcc_assert (0);
> +}

Should be gcc_unreachable (), and no braces around a single statement.
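
Putting the two fixes above together, the corrected fragments would read
(a sketch, with the surrounding code elided):

  addr_space_t as = to_type == error_mark_node ? ADDR_SPACE_GENERIC
					       : TYPE_ADDR_SPACE (to_type);
  ...
  else if (TREE_CODE (type) == REFERENCE_TYPE
	   || TREE_CODE (type) == OFFSET_TYPE)
    gcc_unreachable ();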

> diff --git a/gcc/testsuite/gcc.dg/Wvla-parameter-2.c 
> b/gcc/testsuite/gcc.dg/Wvla-parameter-2.c
> index daa71d897c9..ebd61522563 100644
> --- a/gcc/testsuite/gcc.dg/Wvla-parameter-2.c
> +++ b/gcc/testsuite/gcc.dg/Wvla-parameter-2.c
> @@ -37,14 +37,10 @@ void f (int[n1][2][n3][4][n5][6][n7][8][n9]);
>  /* Due to a limitation and because [*] is represented the same as [0]
> only the most significant array bound is rendered as [*]; the others
> are rendered as [0].  */

This "Due to a limitation" comment should be updated / removed to reflect 
the changes to representation of [*].

OK with those fixes.

-- 
Joseph S. Myers
josmy...@redhat.com



Re: [PATCH 1/4] sched1: hookize pressure scheduling spilling aggressiveness

2024-10-29 Thread Wilco Dijkstra
Hi Vineet,

> I agree the NARROW/WIDE stuff is obfuscating things in technicalities.

Is there evidence this change would make things significantly worse for
some targets? I did a few runs on Neoverse V2 with various options and
it looks beneficial both for integer and FP. On the example and options
you mentioned I got 20.1% speedup! This is a very wide core, so wide
vs narrow is not the right terminology as the gains are on both.

> How about  TARGET_SCHED_PRESSURE_SPILL_AGGRESSIVE with true (default) being 
> existing behavior
> and false being new semantics.
> Its a bit verbose but I think clear enough.

Another option may be to treat it like a new type of pressure algorithm, eg.
--param sched-pressure-algorithm=3 or update one of the existing ones.
There are already too many features in the scheduler - it would be better
to reduce the many variants and focus on doing really well on modern cores.

Cheers,
Wilco

[PUSHED] aarch64: Remove unnecessary casts to rtx_code [PR117349]

2024-10-29 Thread Andrew Pinski
In aarch64_gen_ccmp_first/aarch64_gen_ccmp_next, the casts
were no longer needed after r14-3412-gbf64392d66f291 which
changed the type of the arguments to rtx_code.

In aarch64_rtx_costs, they were no longer needed since
r12-4828-g1d5c43db79b7ea which changed the type of code
to rtx_code.

Pushed as obvious after a build/test for aarch64-linux-gnu.

gcc/ChangeLog:

PR target/117349
* config/aarch64/aarch64.cc (aarch64_rtx_costs): Remove
unnecessary casts to rtx_code.
(aarch64_gen_ccmp_first): Likewise.
(aarch64_gen_ccmp_next): Likewise.

Signed-off-by: Andrew Pinski 
---
 gcc/config/aarch64/aarch64.cc | 51 +++
 1 file changed, 21 insertions(+), 30 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index a6cc00e74ab..b2dd23ccb26 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -14286,7 +14286,7 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int outer 
ATTRIBUTE_UNUSED,
   /* BFM.  */
  if (speed)
*cost += extra_cost->alu.bfi;
- *cost += rtx_cost (op1, VOIDmode, (enum rtx_code) code, 1, speed);
+ *cost += rtx_cost (op1, VOIDmode, code, 1, speed);
 }
 
  return true;
@@ -14666,8 +14666,7 @@ cost_minus:
  *cost += extra_cost->alu.extend_arith;
 
op1 = aarch64_strip_extend (op1, true);
-   *cost += rtx_cost (op1, VOIDmode,
-  (enum rtx_code) GET_CODE (op1), 0, speed);
+   *cost += rtx_cost (op1, VOIDmode, GET_CODE (op1), 0, speed);
return true;
  }
 
@@ -14678,9 +14677,7 @@ cost_minus:
 || aarch64_shift_p (GET_CODE (new_op1)))
&& code != COMPARE)
  {
-   *cost += aarch64_rtx_mult_cost (new_op1, MULT,
-   (enum rtx_code) code,
-   speed);
+   *cost += aarch64_rtx_mult_cost (new_op1, MULT, code, speed);
return true;
  }
 
@@ -14781,8 +14778,7 @@ cost_plus:
  *cost += extra_cost->alu.extend_arith;
 
op0 = aarch64_strip_extend (op0, true);
-   *cost += rtx_cost (op0, VOIDmode,
-  (enum rtx_code) GET_CODE (op0), 0, speed);
+   *cost += rtx_cost (op0, VOIDmode, GET_CODE (op0), 0, speed);
return true;
  }
 
@@ -14896,8 +14892,7 @@ cost_plus:
  && aarch64_mask_and_shift_for_ubfiz_p (int_mode, op1,
 XEXP (op0, 1)))
{
- *cost += rtx_cost (XEXP (op0, 0), int_mode,
-(enum rtx_code) code, 0, speed);
+ *cost += rtx_cost (XEXP (op0, 0), int_mode, code, 0, speed);
  if (speed)
*cost += extra_cost->alu.bfx;
 
@@ -14907,8 +14902,7 @@ cost_plus:
{
/* We possibly get the immediate for free, this is not
   modelled.  */
- *cost += rtx_cost (op0, int_mode,
-(enum rtx_code) code, 0, speed);
+ *cost += rtx_cost (op0, int_mode, code, 0, speed);
  if (speed)
*cost += extra_cost->alu.logical;
 
@@ -14943,10 +14937,8 @@ cost_plus:
}
 
  /* In both cases we want to cost both operands.  */
- *cost += rtx_cost (new_op0, int_mode, (enum rtx_code) code,
-0, speed);
- *cost += rtx_cost (op1, int_mode, (enum rtx_code) code,
-1, speed);
+ *cost += rtx_cost (new_op0, int_mode, code, 0, speed);
+ *cost += rtx_cost (op1, int_mode, code, 1, speed);
 
  return true;
}
@@ -14967,7 +14959,7 @@ cost_plus:
   /* MVN-shifted-reg.  */
   if (op0 != x)
 {
- *cost += rtx_cost (op0, mode, (enum rtx_code) code, 0, speed);
+ *cost += rtx_cost (op0, mode, code, 0, speed);
 
   if (speed)
 *cost += extra_cost->alu.log_shift;
@@ -14983,7 +14975,7 @@ cost_plus:
   rtx newop1 = XEXP (op0, 1);
   rtx op0_stripped = aarch64_strip_shift (newop0);
 
- *cost += rtx_cost (newop1, mode, (enum rtx_code) code, 1, speed);
+ *cost += rtx_cost (newop1, mode, code, 1, speed);
  *cost += rtx_cost (op0_stripped, mode, XOR, 0, speed);
 
   if (speed)
@@ -15149,7 +15141,7 @@ cost_plus:
  && known_eq (INTVAL (XEXP (op1, 1)),
   GET_MODE_BITSIZE (mode) - 1))
{
- *cost += rtx_cost (op0, mode, (rtx_code) code, 0, speed);
+ *cost += rtx_cost (op0, mode, code, 0, speed);
  /* We already demanded XEXP (op1, 0) to be REG_P, so
 

[PATCH] Fortran: fix several front-end memleaks

2024-10-29 Thread Harald Anlauf
Dear all,

while looking at the recent testcase gfortran.dg/pr115070.f90 with f951
running under valgrind, I noticed minor front-end memleaks of gfc_expr's
that are probably fallout from a code refactoring, which are fixed by
the attached.

Regtested on x86_64-pc-linux-gnu.  OK for mainline?

Thanks,
Harald

From 87aaaf3b8614730d0f7ccfe29ee36f4921cf48d2 Mon Sep 17 00:00:00 2001
From: Harald Anlauf 
Date: Tue, 29 Oct 2024 21:52:27 +0100
Subject: [PATCH] Fortran: fix several front-end memleaks

gcc/fortran/ChangeLog:

	* trans-expr.cc (gfc_trans_class_init_assign): Free intermediate
	gfc_expr's.
	* trans.cc (get_final_proc_ref): Likewise.
	(get_elem_size): Likewise.
	(gfc_add_finalizer_call): Likewise.
---
 gcc/fortran/trans-expr.cc | 2 ++
 gcc/fortran/trans.cc  | 5 +
 2 files changed, 7 insertions(+)

diff --git a/gcc/fortran/trans-expr.cc b/gcc/fortran/trans-expr.cc
index ff8cde93ef4..ddbb5ecf068 100644
--- a/gcc/fortran/trans-expr.cc
+++ b/gcc/fortran/trans-expr.cc
@@ -1890,6 +1890,8 @@ gfc_trans_class_init_assign (gfc_code *code)
 }

   gfc_add_expr_to_block (&block, tmp);
+  gfc_free_expr (lhs);
+  gfc_free_expr (rhs);

   return gfc_finish_block (&block);
 }
diff --git a/gcc/fortran/trans.cc b/gcc/fortran/trans.cc
index 58b93e233a1..1a0ba637058 100644
--- a/gcc/fortran/trans.cc
+++ b/gcc/fortran/trans.cc
@@ -1128,6 +1128,9 @@ get_final_proc_ref (gfc_se *se, gfc_expr *expr, tree class_container)

   if (POINTER_TYPE_P (TREE_TYPE (se->expr)))
 se->expr = build_fold_indirect_ref_loc (input_location, se->expr);
+
+  if (expr->ts.type != BT_DERIVED && !using_class_container)
+gfc_free_expr (final_wrapper);
 }


@@ -1155,6 +1158,7 @@ get_elem_size (gfc_se *se, gfc_expr *expr, tree class_container)

   gfc_conv_expr (se, class_size);
   gcc_assert (se->post.head == NULL_TREE);
+  gfc_free_expr (class_size);
 }
 }

@@ -1467,6 +1471,7 @@ gfc_add_finalizer_call (stmtblock_t *block, gfc_expr *expr2,

   gfc_add_expr_to_block (block, tmp);
   gfc_add_block_to_block (block, &final_se.post);
+  gfc_free_expr (expr);

   return true;
 }
--
2.35.3



Re: [PATCH v2 3/3] Simplify switch bit test clustering algorithmg

2024-10-29 Thread Andi Kleen
> > However this exposes PR117352 which is a negative interaction of the
> > more aggressive bit test conversion.  I don't think it's a show stopper,
> > this can be sorted out later.
> 
> I think it is a show stopper for GCC 15 because it is a pretty big
> performance regression with targets that have ccmp (which now includes
> x86_64).

Okay I reverted it.

It exposed a weakness in the new algorithm: it doesn't take range
comparisons into account. And yes, the cost check probably needs
to be adjusted to understand ccmp.

-Andi


Re: Frontend access to target features (was Re: [PATCH] libgccjit: Add ability to get CPU features)

2024-10-29 Thread Antoni Boucher

Thanks, David!

Did you review the updated patch that depends on this gccrs patch?
Is it also OK to merge when the PR in gccrs is merged?

Le 2024-10-29 à 17 h 04, David Malcolm a écrit :

On Tue, 2024-10-29 at 07:59 -0400, Antoni Boucher wrote:

David: Arthur reviewed the gccrs patch and would be OK with it.

Could you please take a look and review it?


https://github.com/Rust-GCC/gccrs/pull/3195
looks good to me; thanks!

Dave



Le 2024-10-17 à 11 h 38, Antoni Boucher a écrit :

Hi.
Thanks for the review, David!

I talked to Arthur and he's OK with having a file to include in
both
gccrs and libgccjit.

I sent the patch to gccrs to move the code in a new file that we
can
include in both frontends:
https://github.com/Rust-GCC/gccrs/pull/3195

I also renamed gcc_jit_target_info_supports_128bit_int to
gcc_jit_target_info_supports_target_dependent_type because a
subsequent
patch will allow to check if other types are supported like
_Float16 and
_Float128.

Here's the patch for libgccjit updated to include this file.

Thanks.

Le 2024-06-26 à 17 h 55, David Malcolm a écrit :

On Sun, 2024-03-10 at 12:05 +0100, Iain Buclaw wrote:

Excerpts from David Malcolm's message of März 5, 2024 4:09 pm:

On Thu, 2023-11-09 at 19:33 -0500, Antoni Boucher wrote:

Hi.
See answers below.

On Thu, 2023-11-09 at 18:04 -0500, David Malcolm wrote:

On Thu, 2023-11-09 at 17:27 -0500, Antoni Boucher wrote:

Hi.
This patch adds support for getting the CPU features in
libgccjit
(bug
112466)

There's a TODO in the test:
I'm not sure how to test that gcc_jit_target_info_arch
returns
the
correct value since it is dependant on the CPU.
Any idea on how to improve this?

Also, I created a CStringHash to be able to have a
std::unordered_set. Is there any built-in
way
of
doing
this?


Thanks for the patch.

Some high-level questions:

Is this specifically about detecting capabilities of the
host
that
libgccjit is currently running on? or how the target was
configured
when libgccjit was built?


I'm less sure about this part. I'll need to do more tests.



One of the benefits of libgccjit is that, in theory, we
support
all
of
the targets that GCC already supports.  Does this patch
change
that,
or
is this more about giving client code the ability to
determine
capabilities of the specific host being compiled for?


This should not change that. If it does, this is a bug.



I'm nervous about having per-target jit code.  Presumably
there's a
reason that we can't reuse existing target logic here -
can you
please
describe what the problem is.  I see that the ChangeLog
has:


  * config/i386/i386-jit.cc: New file.


where i386-jit.cc has almost 200 lines of nontrivial
code.
Where
did
this come from?  Did you base it on existing code in our
source
tree,
making modifications to fit the new internal API, or did
you
write
it
from scratch?  In either case, how onerous would this be
for
other
targets?


This was mostly copied from the same code done for the Rust
and D
frontends.
See this commit and the following:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=b1c06fd9723453dd2b2ec306684cb806dc2b4fbb
The equivalent to i386-jit.cc is there:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=22e3557e2d52f129f2bbfdc98688b945dba28dc9


[CCing Iain and Arthur re those patches; for reference, the
patch
being
discussed is attached to :
https://gcc.gnu.org/pipermail/jit/2024q1/001792.html ]

One of my concerns about this patch is that we seem to be
gaining
code
that's per-(frontend x config) which seems to be copied and
pasted
with
a search and replace, which could lead to an M*N explosion.



That's certainly the case with the configure/make rules. Itself
I
think
is copied originally from the {cpu_type}-protos.h machinery.

It might be worth pointing out that the c-family of front-ends
don't
have separate headers because their per-target macros are
defined in
{cpu_type}.h directly - for better or worse.


Is there any real difference between the per-config code for
the
different frontends, or should there be a general "enumerate
all
features of the target" hook that's independent of the
frontend?
(but
perhaps calls into it).



As far as I understand, the configure parts should all be
identical
between tm_p, tm_d, tm_rust, ..., so would benefit from being
templated
to aid any other front-ends adding in their own per target
hooks.


Am I right in thinking that (rustc with default LLVM backend)
has
some
set of feature strings that both (rustc with
rustc_codegen_gcc) and
gccrs are trying to emulate?  If so, is it presumably a goal
that
libgccjit gives identical results to gccrs?  If so, would it
be
crazy
for libgccjit to consume e.g. config/i386/i386-rust.cc ?


I don't know whether libgccjit can just pull in directly the
implementation of the rust target hooks here.


Sorry for the delay in responding.

I don't want to be in the business of maintaining a copy of the
per-
target code for "jit", and I think it makes sense for libgccjit
to
return identical in

Re: Frontend access to target features (was Re: [PATCH] libgccjit: Add ability to get CPU features)

2024-10-29 Thread David Malcolm
On Tue, 2024-10-29 at 07:59 -0400, Antoni Boucher wrote:
> David: Arthur reviewed the gccrs patch and would be OK with it.
> 
> Could you please take a look and review it?

https://github.com/Rust-GCC/gccrs/pull/3195
looks good to me; thanks!

Dave

> 
> Le 2024-10-17 à 11 h 38, Antoni Boucher a écrit :
> > Hi.
> > Thanks for the review, David!
> > 
> > I talked to Arthur and he's OK with having a file to include in
> > both 
> > gccrs and libgccjit.
> > 
> > I sent the patch to gccrs to move the code in a new file that we
> > can 
> > include in both frontends:
> > https://github.com/Rust-GCC/gccrs/pull/3195
> > 
> > I also renamed gcc_jit_target_info_supports_128bit_int to 
> > gcc_jit_target_info_supports_target_dependent_type because a
> > subsequent 
> > patch will allow to check if other types are supported like
> > _Float16 and 
> > _Float128.
> > 
> > Here's the patch for libgccjit updated to include this file.
> > 
> > Thanks.
> > 
> > Le 2024-06-26 à 17 h 55, David Malcolm a écrit :
> > > On Sun, 2024-03-10 at 12:05 +0100, Iain Buclaw wrote:
> > > > Excerpts from David Malcolm's message of März 5, 2024 4:09 pm:
> > > > > On Thu, 2023-11-09 at 19:33 -0500, Antoni Boucher wrote:
> > > > > > Hi.
> > > > > > See answers below.
> > > > > > 
> > > > > > On Thu, 2023-11-09 at 18:04 -0500, David Malcolm wrote:
> > > > > > > On Thu, 2023-11-09 at 17:27 -0500, Antoni Boucher wrote:
> > > > > > > > Hi.
> > > > > > > > This patch adds support for getting the CPU features in
> > > > > > > > libgccjit
> > > > > > > > (bug
> > > > > > > > 112466)
> > > > > > > > 
> > > > > > > > There's a TODO in the test:
> > > > > > > > I'm not sure how to test that gcc_jit_target_info_arch
> > > > > > > > returns
> > > > > > > > the
> > > > > > > > correct value since it is dependant on the CPU.
> > > > > > > > Any idea on how to improve this?
> > > > > > > > 
> > > > > > > > Also, I created a CStringHash to be able to have a
> > > > > > > > std::unordered_set. Is there any built-in
> > > > > > > > way
> > > > > > > > of
> > > > > > > > doing
> > > > > > > > this?
> > > > > > > 
> > > > > > > Thanks for the patch.
> > > > > > > 
> > > > > > > Some high-level questions:
> > > > > > > 
> > > > > > > Is this specifically about detecting capabilities of the
> > > > > > > host
> > > > > > > that
> > > > > > > libgccjit is currently running on? or how the target was
> > > > > > > configured
> > > > > > > when libgccjit was built?
> > > > > > 
> > > > > > I'm less sure about this part. I'll need to do more tests.
> > > > > > 
> > > > > > > 
> > > > > > > One of the benefits of libgccjit is that, in theory, we
> > > > > > > support
> > > > > > > all
> > > > > > > of
> > > > > > > the targets that GCC already supports.  Does this patch
> > > > > > > change
> > > > > > > that,
> > > > > > > or
> > > > > > > is this more about giving client code the ability to
> > > > > > > determine
> > > > > > > capabilities of the specific host being compiled for?
> > > > > > 
> > > > > > This should not change that. If it does, this is a bug.
> > > > > > 
> > > > > > > 
> > > > > > > I'm nervous about having per-target jit code.  Presumably
> > > > > > > there's a
> > > > > > > reason that we can't reuse existing target logic here -
> > > > > > > can you
> > > > > > > please
> > > > > > > describe what the problem is.  I see that the ChangeLog
> > > > > > > has:
> > > > > > > 
> > > > > > > >  * config/i386/i386-jit.cc: New file.
> > > > > > > 
> > > > > > > where i386-jit.cc has almost 200 lines of nontrivial
> > > > > > > code.
> > > > > > > Where
> > > > > > > did
> > > > > > > this come from?  Did you base it on existing code in our
> > > > > > > source
> > > > > > > tree,
> > > > > > > making modifications to fit the new internal API, or did
> > > > > > > you
> > > > > > > write
> > > > > > > it
> > > > > > > from scratch?  In either case, how onerous would this be
> > > > > > > for
> > > > > > > other
> > > > > > > targets?
> > > > > > 
> > > > > > This was mostly copied from the same code done for the Rust
> > > > > > and D
> > > > > > frontends.
> > > > > > See this commit and the following:
> > > > > > https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=b1c06fd9723453dd2b2ec306684cb806dc2b4fbb
> > > > > > The equivalent to i386-jit.cc is there:
> > > > > > https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=22e3557e2d52f129f2bbfdc98688b945dba28dc9
> > > > > 
> > > > > [CCing Iain and Arthur re those patches; for reference, the
> > > > > patch
> > > > > being
> > > > > discussed is attached to :
> > > > > https://gcc.gnu.org/pipermail/jit/2024q1/001792.html ]
> > > > > 
> > > > > One of my concerns about this patch is that we seem to be
> > > > > gaining
> > > > > code
> > > > > that's per-(frontend x config) which seems to be copied and
> > > > > pasted
> > > > > with
> > > > > a search and replace, which could lead to an M*N explosion.
> > > > > 
> > > > 
> > > > That's certainly the case 

[PATCH] typos

2024-10-29 Thread Samuel Thibault
Changelog:

* gcc/config/i386/t-freebsd64: Fix typo.
* gcc/config/i386/t-gnu64: Fix typo.
* gcc/config/i386/t-linux64: Fix typo.

diff --git a/gcc/config/i386/t-freebsd64 b/gcc/config/i386/t-freebsd64
index 5e2cd3d2b6c..bd3a41c9516 100644
--- a/gcc/config/i386/t-freebsd64
+++ b/gcc/config/i386/t-freebsd64
@@ -18,7 +18,7 @@
 
 # The 32-bit libraries are found in /usr/lib32
 
-# To support i386 and x86-64, the directory structrue
+# To support i386 and x86-64, the directory structure
 # should be:
 #
 #  /lib has x86-64 libraries.
diff --git a/gcc/config/i386/t-gnu64 b/gcc/config/i386/t-gnu64
index 54b742a984e..d5f5540e1d4 100644
--- a/gcc/config/i386/t-gnu64
+++ b/gcc/config/i386/t-gnu64
@@ -23,7 +23,7 @@
 # it doesn't tell anything about the 32bit libraries on those systems.  Set
 # MULTILIB_OSDIRNAMES according to what is found on the target.
 
-# To support i386, x86-64 and x32 libraries, the directory structrue
+# To support i386, x86-64 and x32 libraries, the directory structure
 # should be:
 #
 #  /lib has i386 libraries.
diff --git a/gcc/config/i386/t-linux64 b/gcc/config/i386/t-linux64
index f9edc289e57..9a9d2199b14 100644
--- a/gcc/config/i386/t-linux64
+++ b/gcc/config/i386/t-linux64
@@ -23,7 +23,7 @@
 # it doesn't tell anything about the 32bit libraries on those systems.  Set
 # MULTILIB_OSDIRNAMES according to what is found on the target.
 
-# To support i386, x86-64 and x32 libraries, the directory structrue
+# To support i386, x86-64 and x32 libraries, the directory structure
 # should be:
 #
 #  /lib has i386 libraries.


Re: [PATCH v2 7/8] i386: Add else operand to masked loads.

2024-10-29 Thread Hongtao Liu
On Fri, Oct 18, 2024 at 10:23 PM Robin Dapp  wrote:
>
> This patch adds a zero else operand to masked loads, in particular the
> masked gather load builtins that are used for gather vectorization.
>
> gcc/ChangeLog:
>
> * config/i386/i386-expand.cc (ix86_expand_special_args_builtin):
> Add else-operand handling.
> (ix86_expand_builtin): Ditto.
> * config/i386/predicates.md (vcvtne2ps2bf_parallel): New
> predicate.
> (maskload_else_operand): Ditto.
> * config/i386/sse.md: Use predicate.
> ---
>  gcc/config/i386/i386-expand.cc |  26 +--
>  gcc/config/i386/predicates.md  |   4 ++
>  gcc/config/i386/sse.md | 124 -
>  3 files changed, 101 insertions(+), 53 deletions(-)
>
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index 63f5e348d64..f6a2c2d65b8 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -12994,10 +12994,11 @@ ix86_expand_special_args_builtin (const struct 
> builtin_description *d,
>  {
>tree arg;
>rtx pat, op;
> -  unsigned int i, nargs, arg_adjust, memory;
> +  unsigned int i, nargs, arg_adjust, memory = -1;
>unsigned int constant = 100;
>bool aligned_mem = false;
> -  rtx xops[4];
> +  rtx xops[4] = {};
> +  bool add_els = false;
>enum insn_code icode = d->icode;
>const struct insn_data_d *insn_p = &insn_data[icode];
>machine_mode tmode = insn_p->operand[0].mode;
> @@ -13124,6 +13125,9 @@ ix86_expand_special_args_builtin (const struct 
> builtin_description *d,
>  case V4DI_FTYPE_PCV4DI_V4DI:
>  case V4SI_FTYPE_PCV4SI_V4SI:
>  case V2DI_FTYPE_PCV2DI_V2DI:
> +  /* Two actual args but an additional else operand.  */
> +  add_els = true;
> +  /* Fallthru.  */
>  case VOID_FTYPE_INT_INT64:
>nargs = 2;
>klass = load;
> @@ -13396,6 +13400,12 @@ ix86_expand_special_args_builtin (const struct 
> builtin_description *d,
>xops[i]= op;
>  }
>
> +  if (add_els)
> +{
> +  xops[i] = CONST0_RTX (GET_MODE (xops[0]));
> +  nargs++;
> +}
> +
>switch (nargs)
>  {
>  case 0:
> @@ -13652,7 +13662,7 @@ ix86_expand_builtin (tree exp, rtx target, rtx 
> subtarget,
>enum insn_code icode, icode2;
>tree fndecl = TREE_OPERAND (CALL_EXPR_FN (exp), 0);
>tree arg0, arg1, arg2, arg3, arg4;
> -  rtx op0, op1, op2, op3, op4, pat, pat2, insn;
> +  rtx op0, op1, op2, op3, op4, opels, pat, pat2, insn;
>machine_mode mode0, mode1, mode2, mode3, mode4;
>unsigned int fcode = DECL_MD_FUNCTION_CODE (fndecl);
>HOST_WIDE_INT bisa, bisa2;
> @@ -15559,12 +15569,15 @@ rdseed_step:
>   op3 = copy_to_reg (op3);
>   op3 = lowpart_subreg (mode3, op3, GET_MODE (op3));
> }
> +
>if (!insn_data[icode].operand[5].predicate (op4, mode4))
> {
> -  error ("the last argument must be scale 1, 2, 4, 8");
> -  return const0_rtx;
> + error ("the last argument must be scale 1, 2, 4, 8");
> + return const0_rtx;
> }
>
> +  opels = CONST0_RTX (GET_MODE (subtarget));
> +
>/* Optimize.  If mask is known to have all high bits set,
>  replace op0 with pc_rtx to signal that the instruction
>  overwrites the whole destination and doesn't use its
> @@ -15633,7 +15646,8 @@ rdseed_step:
> }
> }
>
> -  pat = GEN_FCN (icode) (subtarget, op0, op1, op2, op3, op4);
> +  pat = GEN_FCN (icode) (subtarget, op0, op1, op2, op3, op4, opels);
> +
>if (! pat)
> return const0_rtx;
>emit_insn (pat);
> diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> index 053312bbe27..7c7d8f61f11 100644
> --- a/gcc/config/i386/predicates.md
> +++ b/gcc/config/i386/predicates.md
> @@ -2346,3 +2346,7 @@ (define_predicate "apx_evex_add_memory_operand"
>
>return true;
>  })
> +
> +(define_predicate "maskload_else_operand"
> +  (and (match_code "const_int,const_vector")
> +   (match_test "op == CONST0_RTX (GET_MODE (op))")))
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index a45b50ad732..83955eee5a0 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -1575,7 +1575,8 @@ (define_expand "_load_mask"
>  }
>else if (MEM_P (operands[1]))
>  operands[1] = gen_rtx_UNSPEC (mode,
> -gen_rtvec(1, operands[1]),
> +gen_rtvec(2, operands[1],
> +  CONST0_RTX (mode)),
>  UNSPEC_MASKLOAD);
>  })
>
> @@ -1583,7 +1584,8 @@ (define_insn "*_load_mask"
>[(set (match_operand:V48_AVX512VL 0 "register_operand" "=v")
> (vec_merge:V48_AVX512VL
>   (unspec:V48_AVX512VL
> -   [(match_operand:V48_AVX512VL 1 "memory_operand" "m")]
> +   [(match_operand:V48_AVX512VL 1 "memory_operand" "m")
> +(match_operand:V48_A

Re: [PATCH 1/4] sched1: hookize pressure scheduling spilling aggressiveness

2024-10-29 Thread Jeff Law




On 10/29/24 1:14 PM, Vineet Gupta wrote:

On 10/29/24 11:51, Wilco Dijkstra wrote:

Hi Vineet,

I agree the NARROW/WIDE stuff is obfuscating things in technicalities.

Is there evidence this change would make things significantly worse for
some targets?


Honestly I don't think this needs to be behind any toggle or made optional at 
all. The old algorithm was overly eager in spilling. But per the last
discussion with Richard [1], at least back in 2012, for some in-order arm32 core 
the old behavior was better. And that's also where the wide vs. narrow discussions
came up and why it really mattered, as far as I understood.
I'd agree across the board.   In an ideal world we'd make 
TARGET_SCHED_PRESSURE_SPILL_AGGRESSIVE default to false since I suspect 
that's going to work best most of the time.  But the safer thing to do 
is make it true, preserving current behavior and let uarchs opt-in to 
the new behavior.





There are already too many features in the scheduler - it would be better
to reduce the many variants and focus on doing really well on modern cores.


A purist might argue it is not really a new algorithm, just a little tweak to 
the model algorithm. But I agree there's a zoo of sched toggles out there so this
might be ok. Anyhow I'm fine either way, whatever is more palatable for the 
community / maintainers. @Jeff what say you.

You're kind of making Wilco's point ;-)




And while we are at it - I'd prefer the new behavior to be default for all 
targets unless they explicitly opt out (and it would be insightful
understanding those not made up cases). But  given how late we are in the 
release cycle, I think it would be kosher to keep the old behavior for now
and maybe after gcc-15 is released we can flip the switch and deal with any 
potential fallout.

I can make an argument for either default.

The safest thing to do is keep behavior as-is and opt-in to the new 
behavior.   I don't think asking you to verify this doesn't regress the 
various aarch64 cores, s390 and ppc is reasonable.  Hell, even just 
verifying across risc-v cores would be nontrivial.


Jeff




[PATCH v1] Doc: Add doc for standard name mask_len_strided_load{store}m

2024-10-29 Thread pan2 . li
From: Pan Li 

This patch would like to add doc for the below 2 standard names.

1. strided load: v = mask_len_strided_load (ptr, stride, mask, len, bias)
2. strided store: mask_len_strided_store (ptr, stride, v, mask, len, bias)

gcc/ChangeLog:

* doc/md.texi: Add doc for mask_len_strided_load{store}.

Signed-off-by: Pan Li 
Co-Authored-By: Juzhe-Zhong 
---
 gcc/doc/md.texi | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 6d9c8643739..83036383fe1 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5135,6 +5135,20 @@ Bit @var{i} of the mask is set if element @var{i} of the 
result should
 be loaded from memory and clear if element @var{i} of the result should be 
undefined.
 Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
 
+@cindex @code{mask_len_strided_load@var{m}} instruction pattern
+@item @samp{mask_len_strided_load@var{m}}
+Load several separate memory locations into a destination vector of mode 
@var{m}.
+Operand 0 is a destination vector of mode @var{m}.
+Operand 1 is a scalar base address and operand 2 is a scalar stride of Pmode.
+Operand 3 is the mask operand, operand 4 is the length operand and operand 5 
is the bias operand.
+The instruction can be seen as a special case of 
@code{mask_len_gather_load@var{m}@var{n}}
+with an offset vector that is a @code{vec_series} with operand 1 as base and 
operand 2 as step.
+For each element index @var{i} the load address is operand 1 + @var{i} * operand 2.
+Similar to mask_len_load, the instruction loads at most (operand 4 + operand 
5) elements from memory.
+Element @var{i} of the mask (operand 3) is set if element @var{i} of the 
result should
+be loaded from memory and clear if element @var{i} of the result should be 
zero.
+Mask elements @var{i} with @var{i} > (operand 4 + operand 5) are ignored.
+
 @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
 @item @samp{scatter_store@var{m}@var{n}}
 Store a vector of mode @var{m} into several distinct memory locations.
@@ -5172,6 +5186,19 @@ at most (operand 6 + operand 7) elements of (operand 4) 
to memory.
 Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be 
stored.
 Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
 
+@cindex @code{mask_len_strided_store@var{m}} instruction pattern
+@item @samp{mask_len_strided_store@var{m}}
+Store a vector of mode @var{m} into several distinct memory locations.
+Operand 0 is a scalar base address and operand 1 is a scalar stride of Pmode.
+Operand 2 is the vector of values that should be stored, which is of mode 
@var{m}.
+Operand 3 is the mask operand, operand 4 is the length operand and operand 5 
is the bias operand.
+The instruction can be seen as a special case of 
@code{mask_len_scatter_store@var{m}@var{n}}
+with an offset vector that is a @code{vec_series} with operand 0 as base and 
operand 1 as step.
+For each element index @var{i} the store address is operand 0 + @var{i} * operand 1.
+Similar to mask_len_store, the instruction stores at most (operand 4 + operand 
5) elements of
+(operand 2) to memory.  Element @var{i} of the mask (operand 3) is set if element 
@var{i} of (operand 2)
+should be stored.  Mask elements @var{i} with @var{i} > (operand 4 + operand 
5) are ignored.
+
 @cindex @code{vec_set@var{m}} instruction pattern
 @item @samp{vec_set@var{m}}
 Set given field in the vector value.  Operand 0 is the vector to modify,
-- 
2.43.0
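
For readers new to the len/bias convention, a scalar reference model of the
documented load semantics (a sketch with hypothetical names; inactive and
out-of-range lanes zeroed as for mask_len_load):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  std::vector<int32_t>
  strided_load_ref (const int32_t *base, std::ptrdiff_t stride_elems,
		    const std::vector<bool> &mask, long len, long bias)
  {
    std::vector<int32_t> v (len + bias, 0);	// inactive lanes read as zero
    for (long i = 0; i < len + bias; i++)
      if (mask[i])
	v[i] = base[i * stride_elems];		// address = base + i * stride
    return v;
  }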



Fix PR rtl-optimization/117327

2024-10-29 Thread Eric Botcazou
This fixes a wrong-code generation issue on the SPARC for a function containing 
a call to __builtin_unreachable, caused by the delay slot scheduling pass, and 
more specifically by the find_end_label function, which has these lines:

  /* Otherwise, see if there is a label at the end of the function.  If there
 is, it must be that RETURN insns aren't needed, so that is our return
 label and we don't have to do anything else.  */

The comment was correct 20 years ago but no longer is nowadays in the presence 
of RTL epilogues and calls to __builtin_unreachable, so the patch just removes 
the associated two lines of code:

  else if (LABEL_P (insn))
    *plabel = as_a <rtx_code_label *> (insn);

and otherwise contains just adjustments to the commentary.

Bootstrapped/regtested on SPARC64/Solaris and applied on all active branches.


2024-10-29  Eric Botcazou  

PR rtl-optimization/117327
* reorg.cc (find_end_label): Do not return a dangling label at the
end of the function and adjust commentary.


2024-10-29  Eric Botcazou  

* gcc.c-torture/execute/20241029-1.c: New test.

-- 
Eric Botcazou/* PR rtl-optimization/117327 */
/* Testcase by Brad Moody  */

__attribute__((noinline))
void foo(int *self, int *x)
{
__builtin_puts ("foo\n");

if (x) {
while (1) {
++*self;
if (*self == 6) break;
if (*self == 7) __builtin_unreachable();
}
}
}

int main (void)
{
int y = 0;
foo (&y, 0);
return 0;
}
diff --git a/gcc/reorg.cc b/gcc/reorg.cc
index 51321ce7b80..68bf30801cf 100644
--- a/gcc/reorg.cc
+++ b/gcc/reorg.cc
@@ -336,13 +336,14 @@ insn_sets_resource_p (rtx insn, struct resources *res,
   return resource_conflicts_p (&insn_sets, res);
 }
 
-/* Find a label at the end of the function or before a RETURN.  If there
-   is none, try to make one.  If that fails, returns 0.
+/* Find a label before a RETURN.  If there is none, try to make one; if this
+   fails, return 0.  KIND is either ret_rtx or simple_return_rtx, indicating
+   which type of RETURN we're looking for.
 
-   The property of such a label is that it is placed just before the
-   epilogue or a bare RETURN insn, so that another bare RETURN can be
-   turned into a jump to the label unconditionally.  In particular, the
-   label cannot be placed before a RETURN insn with a filled delay slot.
+   The property of the label is that it is placed just before a bare RETURN
+   insn, so that another bare RETURN can be turned into a jump to the label
+   unconditionally.  In particular, the label cannot be placed before a
+   RETURN insn with a filled delay slot.
 
??? There may be a problem with the current implementation.  Suppose
we start with a bare RETURN insn and call find_end_label.  It may set
@@ -353,9 +354,7 @@ insn_sets_resource_p (rtx insn, struct resources *res,
Note that this is probably mitigated by the following observation:
once function_return_label is made, it is very likely the target of
a jump, so filling the delay slot of the RETURN will be much more
-   difficult.
-   KIND is either simple_return_rtx or ret_rtx, indicating which type of
-   return we're looking for.  */
+   difficult.  */
 
 static rtx_code_label *
 find_end_label (rtx kind)
@@ -375,10 +374,7 @@ find_end_label (rtx kind)
   if (*plabel)
 return *plabel;
 
-  /* Otherwise, see if there is a label at the end of the function.  If there
- is, it must be that RETURN insns aren't needed, so that is our return
- label and we don't have to do anything else.  */
-
+  /* Otherwise, scan the insns backward from the end of the function.  */
   insn = get_last_insn ();
   while (NOTE_P (insn)
 	 || (NONJUMP_INSN_P (insn)
@@ -386,9 +382,8 @@ find_end_label (rtx kind)
 		 || GET_CODE (PATTERN (insn)) == CLOBBER)))
 insn = PREV_INSN (insn);
 
-  /* When a target threads its epilogue we might already have a
- suitable return insn.  If so put a label before it for the
- function_return_label.  */
+  /* First, see if there is a RETURN at the end of the function.  If so,
+ put the label before it.  */
   if (BARRIER_P (insn)
   && JUMP_P (PREV_INSN (insn))
   && PATTERN (PREV_INSN (insn)) == kind)
@@ -397,8 +392,8 @@ find_end_label (rtx kind)
   rtx_code_label *label = gen_label_rtx ();
   LABEL_NUSES (label) = 0;
 
-  /* Put the label before any USE insns that may precede the RETURN
-	 insn.  */
+  /* Put the label before any USE insns that may precede the
+	 RETURN insn.  */
   while (GET_CODE (temp) == USE)
 	temp = PREV_INSN (temp);
 
@@ -406,15 +401,12 @@ find_end_label (rtx kind)
   *plabel = label;
 }
 
-  else if (LABEL_P (insn))
-    *plabel = as_a <rtx_code_label *> (insn);
+  /* If the basic block reordering pass has moved the return insn to some
+ other place, try to locate it again and put the label there.  */
   else
 {
  rtx_code_label *label = gen_label_rtx ();

Re: [PATCH 1/4] sched1: hookize pressure scheduling spilling aggressiveness

2024-10-29 Thread Jeff Law

On 10/29/24 10:57 AM, Vineet Gupta wrote:


Certainly open to more ideas on the naming, which I think will impact
the documentation & comments as well.

And to be 100% clear, no concerns with the behavior of the patch, it's
really just the naming convention, documentation/comments.

Thoughts?


I agree the NARROW/WIDE stuff is obfuscating things in technicalities.
How about TARGET_SCHED_PRESSURE_SPILL_AGGRESSIVE, with true (the default)
being the existing behavior and false being the new semantics?
It's a bit verbose, but I think clear enough.

Sure.  I can live with that.
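
If that name is adopted, a port would presumably opt in to the new
behavior along these lines (a hypothetical sketch of mine against the
proposed hook name; the actual hook shape in the series may differ):

  /* In the target's backend .cc file: return false to request the
     less aggressive spilling behavior introduced by this series.  */
  static bool
  example_sched_pressure_spill_aggressive (void)
  {
    return false;
  }

  #undef TARGET_SCHED_PRESSURE_SPILL_AGGRESSIVE
  #define TARGET_SCHED_PRESSURE_SPILL_AGGRESSIVE \
    example_sched_pressure_spill_aggressive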


ps.  I've got to get out of my bubble more often.  Picked up a bug at
the RVI summit...  Clearly my immune system isn't firing on all cylinders.


Oh gosh. Get well soon!
Doing better, mostly because I'm trying to minimize how much talking I 
do today.  That in turn keeps my throat from getting more irritated.


Jeff


Re: [PATCH v2 3/3] Simplify switch bit test clustering algorithm

2024-10-29 Thread Andrew Pinski
On Tue, Oct 29, 2024 at 3:04 PM Andi Kleen  wrote:
>
> On Tue, Oct 29, 2024 at 01:50:57PM +0100, Richard Biener wrote:
> > On Mon, Oct 28, 2024 at 9:58 PM Andi Kleen  wrote:
> > >
> > > From: Andi Kleen 
> > >
> > > The current switch bit test clustering enumerates all possible case
> > > cluster combinations to find ones that fit the bit test constraints
> > > best.  This causes performance problems with very large switches.
> > >
> > > For bit test clustering which happens naturally in word sized chunks
> > > I don't think such an expensive algorithm is really needed.
> > >
> > > This patch implements a simple greedy algorithm that walks
> > > the sorted list and examines word sized windows and tries
> > > to cluster them.
> > >
> > > Surprisingly the new algorithm gives consistently better clusters
> > > for the examples I tried.
> > >
> > > For example from the gcc bootstrap:
> > >
> > > old: 0-15 16-31 96-175
> > > new: 0-31 96-175
> > >
> > > I'm not fully sure why that is, probably some bug in the old
> > > algorithm? This even shows up in the test suite, where if-to-switch-6
> > > can now generate a switch, as well as a case in switch-1.c.
> > >
> > > I don't have a proof that the new algorithm is always as good or better,
> > > but so far at least I don't see any counter examples.
> > >
> > > It also fixes the excessive compile time in PR117091,
> > > however this was already fixed by an earlier patch
> > > that doesn't run clustering when no targets have multiple
> > > values.
> >
> > OK if you add a comment (as part of the function comment for example)
> > explaining the idea of the algorithm.
>
>
> I added the comment.
>
> I will commit it with this change. I also had to add a few more
> -fno-bit-tests to make the Linaro tester happy.

Those should have been xfailed instead of adding -fno-bit-tests.

>
> However this exposes PR117352 which is a negative interaction of the
> more aggressive bit test conversion.  I don't think it's a show stopper,
> this can be sorted out later.

I think it is a show stopper for GCC 15 because it is a pretty big
performance regression with targets that have ccmp (which now includes
x86_64).

Thanks,
Andrew

>
> -Andi


Re: [PATCH] Fortran: fix several front-end memleaks

2024-10-29 Thread Jerry D

On 10/29/24 2:00 PM, Harald Anlauf wrote:

Dear all,

while looking at the recent testcase gfortran.dg/pr115070.f90 with f951
running under valgrind, I noticed minor front-end memleaks of gfc_expr's
that are probably fallout from a code refactoring, which are fixed by
the attached.

Regtested on x86_64-pc-linux-gnu.  OK for mainline?

Thanks,
Harald



Yes OK for mainline.

Thanks,

Jerry


[PATCH] gimple: Remove special handling of COND_EXPR for COMPARISON_CLASS_P [PR116949, PR114785]

2024-10-29 Thread Andrew Pinski
After r13-707-g68e0063397ba82, a COND_EXPR in a gimple assign could no longer
contain a comparison.
The vectorizer was building gimple assigns with a comparison until
r15-4695-gd17e672ce82e69 (which added an assert to make sure it no longer
builds them).

So let's remove the special handling of COND_EXPR in a few places and add an
assert to gimple_build_assign_1 to make sure we don't build a gimple assign
with a comparison any more.

Bootstrapped and tested on x86_64-linux-gnu.
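
For out-of-tree code that still builds such assignments, the required
shape is now two statements (a sketch of mine, assuming SSA names
a, b, c, d, x and a gimple_seq seq are in scope):

  /* Build  x = a < b ? c : d :  the comparison must be a separate
     assignment; only its boolean result feeds the COND_EXPR.  */
  tree t = make_ssa_name (boolean_type_node);
  gimple *cmp = gimple_build_assign (t, LT_EXPR, a, b);
  gimple *sel = gimple_build_assign (x, COND_EXPR, t, c, d);
  gimple_seq_add_stmt (&seq, cmp);
  gimple_seq_add_stmt (&seq, sel);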

gcc/ChangeLog:

PR middle-end/114785
PR middle-end/116949
* gimple-match-exports.cc (maybe_push_res_to_seq): Remove special
handling of COMPARISON_CLASS_P in COND_EXPR/VEC_COND_EXPR.
(gimple_extract): Likewise.
* gimple-walk.cc (walk_stmt_load_store_addr_ops): Likewise.
* gimple.cc (gimple_build_assign_1): Assert that op1 is not a
comparison for COND_EXPR.

Signed-off-by: Andrew Pinski 
---
 gcc/gimple-match-exports.cc | 12 +---
 gcc/gimple-walk.cc  | 11 ---
 gcc/gimple.cc   |  3 +++
 3 files changed, 4 insertions(+), 22 deletions(-)

diff --git a/gcc/gimple-match-exports.cc b/gcc/gimple-match-exports.cc
index 77d225825cf..bc8038c19f0 100644
--- a/gcc/gimple-match-exports.cc
+++ b/gcc/gimple-match-exports.cc
@@ -489,12 +489,6 @@ maybe_push_res_to_seq (gimple_match_op *res_op, gimple_seq 
*seq, tree res)
&& SSA_NAME_OCCURS_IN_ABNORMAL_PHI (ops[i]))
   return NULL_TREE;
 
-  if (num_ops > 0 && COMPARISON_CLASS_P (ops[0]))
-for (unsigned int i = 0; i < 2; ++i)
-  if (TREE_CODE (TREE_OPERAND (ops[0], i)) == SSA_NAME
- && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (TREE_OPERAND (ops[0], i)))
-   return NULL_TREE;
-
   if (res_op->code.is_tree_code ())
 {
   auto code = tree_code (res_op->code);
@@ -786,11 +780,7 @@ gimple_extract (gimple *stmt, gimple_match_op *res_op,
}
  case GIMPLE_TERNARY_RHS:
{
- tree rhs1 = gimple_assign_rhs1 (stmt);
- if (code == COND_EXPR && COMPARISON_CLASS_P (rhs1))
-   rhs1 = valueize_condition (rhs1);
- else
-   rhs1 = valueize_op (rhs1);
+ tree rhs1 = valueize_op (gimple_assign_rhs1 (stmt));
  tree rhs2 = valueize_op (gimple_assign_rhs2 (stmt));
  tree rhs3 = valueize_op (gimple_assign_rhs3 (stmt));
  res_op->set_op (code, type, rhs1, rhs2, rhs3);
diff --git a/gcc/gimple-walk.cc b/gcc/gimple-walk.cc
index 9f768ca20fd..00520319aa9 100644
--- a/gcc/gimple-walk.cc
+++ b/gcc/gimple-walk.cc
@@ -835,17 +835,6 @@ walk_stmt_load_store_addr_ops (gimple *stmt, void *data,
;
  else if (TREE_CODE (op) == ADDR_EXPR)
ret |= visit_addr (stmt, TREE_OPERAND (op, 0), op, data);
- /* COND_EXPR and VCOND_EXPR rhs1 argument is a comparison
-tree with two operands.  */
- else if (i == 1 && COMPARISON_CLASS_P (op))
-   {
- if (TREE_CODE (TREE_OPERAND (op, 0)) == ADDR_EXPR)
-   ret |= visit_addr (stmt, TREE_OPERAND (TREE_OPERAND (op, 0),
-  0), op, data);
- if (TREE_CODE (TREE_OPERAND (op, 1)) == ADDR_EXPR)
-   ret |= visit_addr (stmt, TREE_OPERAND (TREE_OPERAND (op, 1),
-  0), op, data);
-   }
}
 }
   else if (gcall *call_stmt = dyn_cast  (stmt))
diff --git a/gcc/gimple.cc b/gcc/gimple.cc
index eeb1badff5f..f7b313be40e 100644
--- a/gcc/gimple.cc
+++ b/gcc/gimple.cc
@@ -475,6 +475,9 @@ gimple_build_assign_1 (tree lhs, enum tree_code subcode, 
tree op1,
 gimple_build_with_ops_stat (GIMPLE_ASSIGN, (unsigned)subcode, num_ops
PASS_MEM_STAT));
   gimple_assign_set_lhs (p, lhs);
+  /* For COND_EXPR, op1 should not be a comparison. */
+  if (op1 && subcode == COND_EXPR)
+gcc_assert (!COMPARISON_CLASS_P  (op1));
   gimple_assign_set_rhs1 (p, op1);
   if (op2)
 {
-- 
2.43.0



Re: [PATCH v2] [PR83782] ifunc: back-propagate ifunc_resolver to aliases

2024-10-29 Thread Alexandre Oliva
On Nov  8, 2023, Alexandre Oliva  wrote:

> Ping?

Ping?
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/635731.html

The test still fails with gcc-14 and trunk on ia32 with -fPIE.  I've
just retested it on trunk on i686-linux-gnu with -fPIE, and on
x86_64-linux-gnu.
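
For context, the pattern at issue is roughly the following (a
hypothetical reduction of mine, not the actual mvc10.c multi-versioning
testcase):

  /* 'foo' is an ifunc; 'bar' is an alias to it.  Under -fPIE on ia32,
     calls to 'bar' must still go through the PLT, which requires the
     alias's cgraph node to inherit the ifunc_resolver property.  */
  static int impl (void) { return 42; }
  static void *resolve_foo (void) { return (void *) impl; }
  int foo (void) __attribute__ ((ifunc ("resolve_foo")));
  extern int bar (void) __attribute__ ((alias ("foo")));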

> gcc.target/i386/mvc10.c fails with -fPIE on ia32 because we omit the
> @PLT mark when calling an alias to an indirect function.  Such aliases
> aren't marked as ifunc_resolvers in the cgraph, so the test that would
> have forced the PLT call fails.

> I've arranged for ifunc_resolver to be back-propagated to aliases, and
> relaxed the test that required the ifunc attribute to be attached to
> directly the decl, rather than taken from an aliased decl, when the
> ifunc_resolver bit is set.

> Regstrapped on x86_64-linux-gnu, also tested with gcc-13 on i686- and
> x86_64-.  Ok to install?

> (in the initial patchset for PR83782 and mvc10, I also needed
> https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598873.html but I'm
> not getting that fail any more with gcc-13, apparently because a
> different patch was put in to address that part)


> for  gcc/ChangeLog

>   PR target/83782
>   * cgraph.h (symtab_node::set_ifunc_resolver): New, overloaded.
>   Back-propagate flag to aliases.
>   * cgraph.cc (cgraph_node::create): Use set_ifunc_resolver.
>   (cgraph_node::create_alias): Likewise.
>   * lto-cgraph.cc (input_node): Likewise.
>   * multiple_target.cc (create_dispatcher_calls): Propagate to
>   aliases when redirecting them.
>   * symtab.cc (symtab_node::verify_base): Accept ifunc_resolver
>   set in an alias to another ifunc_resolver nodes.
>   (symtab_node::resolve_alias): Propagate ifunc_resolver from
>   resolved target to alias.
>   * varasm.cc (do_assemble_alias): Checking for the attribute.

-- 
Alexandre Oliva, happy hacker   https://FSFLA.org/blogs/lxo/
   Free Software Activist   GNU Toolchain Engineer
More tolerance and less prejudice are key for inclusion and diversity
Excluding neuro-others for not behaving ""normal"" is *not* inclusive


[PATCH] [testsuite] fix pr70321.c PIC expectations

2024-10-29 Thread Alexandre Oliva


When we select a non-bx get_pc_thunk, we get an extra mov to set up
the PIC register before the abort call.  Expect that mov or a
get_pc_thunk.bx call.

Regstrapped on x86_64-linux-gnu; also tested on i686-linux-gnu with
-fPIE.  Ok to install?


for  gcc/testsuite/ChangeLog

* gcc.target/i386/pr70321.c: Cope with non-bx get_pc_thunk.
---
 gcc/testsuite/gcc.target/i386/pr70321.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/i386/pr70321.c 
b/gcc/testsuite/gcc.target/i386/pr70321.c
index 58f5f5661c7a2..287b7da1b9501 100644
--- a/gcc/testsuite/gcc.target/i386/pr70321.c
+++ b/gcc/testsuite/gcc.target/i386/pr70321.c
@@ -9,4 +9,8 @@ void foo (long long ixi)
 
 /* { dg-final { scan-assembler-times "mov" 1 { target nonpic } } } */
 /* get_pc_thunk adds an extra mov insn.  */
-/* { dg-final { scan-assembler-times "mov" 2 { target { ! nonpic } } } } */
+/* Choosing a non-bx get_pc_thunk requires another mov before the abort call.
+   So we require a match of either that mov or the get_pc_thunk.bx call, in
+   addition to the other 2 movs.  (Hopefully there won't be more calls for a
+   false positive.)  */
+/* { dg-final { scan-assembler-times "mov|call\[^\n\r]*get_pc_thunk\.bx" 3 { 
target { ! nonpic } } } } */

-- 
Alexandre Oliva, happy hacker   https://FSFLA.org/blogs/lxo/
   Free Software Activist   GNU Toolchain Engineer
More tolerance and less prejudice are key for inclusion and diversity
Excluding neuro-others for not behaving ""normal"" is *not* inclusive


Re: [PATCH] aarch64: Use canonicalize_comparison in ccmp expansion [PR117346]

2024-10-29 Thread Richard Sandiford
Andrew Pinski  writes:
> While testing the patch for PR 85605 on aarch64, it was noticed that
> the imm_choice_comparison.c test failed. This was because canonicalize_comparison
> was not being called in the ccmp case. This can be noticed without the patch
> for PR 85605, as evidenced by the new testcase.
>
> Bootstrapped and tested on aarch64-linux-gnu.
>
>   PR target/117346
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_gen_ccmp_first): Call
>   canonicalize_comparison before figuring out the cmp_mode/cc_mode.
>   (aarch64_gen_ccmp_next): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/imm_choice_comparison-1.c: New test.

OK, thanks.

Richard

> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/aarch64.cc |  6 +++
>  .../aarch64/imm_choice_comparison-1.c | 42 +++
>  2 files changed, 48 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/imm_choice_comparison-1.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index a6cc00e74ab..cbb7ef13315 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -27353,6 +27353,9 @@ aarch64_gen_ccmp_first (rtx_insn **prep_seq, rtx_insn 
> **gen_seq,
>if (op_mode == VOIDmode)
>  op_mode = GET_MODE (op1);
>  
> +  if (CONST_SCALAR_INT_P (op1))
> +canonicalize_comparison (op_mode, &code, &op1);
> +
>switch (op_mode)
>  {
>  case E_QImode:
> @@ -27429,6 +27432,9 @@ aarch64_gen_ccmp_next (rtx_insn **prep_seq, rtx_insn 
> **gen_seq, rtx prev,
>if (op_mode == VOIDmode)
>  op_mode = GET_MODE (op1);
>  
> +  if (CONST_SCALAR_INT_P (op1))
> +canonicalize_comparison (op_mode, &cmp_code, &op1);
> +
>switch (op_mode)
>  {
>  case E_QImode:
> diff --git a/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison-1.c 
> b/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison-1.c
> new file mode 100644
> index 000..2afebe1a349
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison-1.c
> @@ -0,0 +1,42 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +/* PR target/117346 */
> +/* Make sure going through ccmp uses an immediate similar to the non-ccmp case. */
> +/* This is similar to imm_choice_comparison.c's check except that it forces
> +   the use of ccmp by reordering the comparison and putting the cast before. */
> +
> +/*
> +** check:
> +**   ...
> +**   mov w[0-9]+, -16777217
> +**   ...
> +*/
> +
> +int
> +check (int x, int y)
> +{
> +  unsigned xu = x;
> +  if (xu > 0xfefe && x > y)
> +return 100;
> +
> +  return x;
> +}
> +
> +/*
> +** check1:
> +**   ...
> +**   mov w[0-9]+, -16777217
> +**   ...
> +*/
> +
> +int
> +check1 (int x, int y)
> +{
> +  unsigned xu = x;
> +  if (x > y && xu > 0xfefe)
> +return 100;
> +
> +  return x;
> +}


[PATCH] [testsuite] disable PIE on ia32 on more tests

2024-10-29 Thread Alexandre Oliva


Multiple tests fail on ia32 with -fPIE enabled by default because of
different call sequences required by the call-saved PIC register
(no-callee-saved-*.c), uses of the constant pool instead of computing
constants (pr100865-*.c), and unexpected matches of esp in get_pc_thunk
(sse2-stv-1.c).  Disable PIE on them, to match the expectations.

Regstrapped on x86_64-linux-gnu; also tested on i686-linux-gnu with
-fPIE.  Ok to install?


for  gcc/testsuite/ChangeLog

* gcc.target/i386/no-callee-saved-13.c: Disable PIE on ia32.
* gcc.target/i386/no-callee-saved-14.c: Likewise.
* gcc.target/i386/no-callee-saved-15.c: Likewise.
* gcc.target/i386/no-callee-saved-17.c: Likewise.
* gcc.target/i386/pr100865-1.c: Likewise.
* gcc.target/i386/pr100865-7a.c: Likewise.
* gcc.target/i386/pr100865-7c.c: Likewise.
* gcc.target/i386/sse2-stv-1.c: Likewise.
---
 gcc/testsuite/gcc.target/i386/no-callee-saved-13.c |1 +
 gcc/testsuite/gcc.target/i386/no-callee-saved-14.c |1 +
 gcc/testsuite/gcc.target/i386/no-callee-saved-15.c |1 +
 gcc/testsuite/gcc.target/i386/no-callee-saved-17.c |1 +
 gcc/testsuite/gcc.target/i386/pr100865-1.c |1 +
 gcc/testsuite/gcc.target/i386/pr100865-7a.c|1 +
 gcc/testsuite/gcc.target/i386/pr100865-7c.c|1 +
 gcc/testsuite/gcc.target/i386/sse2-stv-1.c |1 +
 8 files changed, 8 insertions(+)

diff --git a/gcc/testsuite/gcc.target/i386/no-callee-saved-13.c 
b/gcc/testsuite/gcc.target/i386/no-callee-saved-13.c
index 6757e72d8487c..0b59da36786a1 100644
--- a/gcc/testsuite/gcc.target/i386/no-callee-saved-13.c
+++ b/gcc/testsuite/gcc.target/i386/no-callee-saved-13.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mtune-ctrl=^prologue_using_move,^epilogue_using_move" } 
*/
+/* { dg-additional-options "-fno-PIE" { target ia32 } } */
 
 extern void foo (void);
 
diff --git a/gcc/testsuite/gcc.target/i386/no-callee-saved-14.c 
b/gcc/testsuite/gcc.target/i386/no-callee-saved-14.c
index 2239e286e6a62..2127b12f120bd 100644
--- a/gcc/testsuite/gcc.target/i386/no-callee-saved-14.c
+++ b/gcc/testsuite/gcc.target/i386/no-callee-saved-14.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mtune-ctrl=^prologue_using_move,^epilogue_using_move" } 
*/
+/* { dg-additional-options "-fno-PIE" { target ia32 } } */
 
 extern void bar (void) __attribute__ ((no_callee_saved_registers));
 
diff --git a/gcc/testsuite/gcc.target/i386/no-callee-saved-15.c 
b/gcc/testsuite/gcc.target/i386/no-callee-saved-15.c
index 10135fec9c147..65f2a9532ffd3 100644
--- a/gcc/testsuite/gcc.target/i386/no-callee-saved-15.c
+++ b/gcc/testsuite/gcc.target/i386/no-callee-saved-15.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mtune-ctrl=^prologue_using_move,^epilogue_using_move" } 
*/
+/* { dg-additional-options "-fno-PIE" { target ia32 } } */
 
 typedef void (*fn_t) (void) __attribute__ ((no_callee_saved_registers));
 extern fn_t bar;
diff --git a/gcc/testsuite/gcc.target/i386/no-callee-saved-17.c 
b/gcc/testsuite/gcc.target/i386/no-callee-saved-17.c
index 1fd5daadf0800..1ecf4552f3d09 100644
--- a/gcc/testsuite/gcc.target/i386/no-callee-saved-17.c
+++ b/gcc/testsuite/gcc.target/i386/no-callee-saved-17.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mtune-ctrl=^prologue_using_move,^epilogue_using_move" } 
*/
+/* { dg-additional-options "-fno-PIE" { target ia32 } } */
 
 extern void foo (void) __attribute__ ((no_caller_saved_registers));
 
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-1.c 
b/gcc/testsuite/gcc.target/i386/pr100865-1.c
index 75cd463cbfc2e..fc0a5b33950f1 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-1.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-1.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -march=x86-64" } */
+/* { dg-additional-options "-fno-PIE" { target ia32 } } */
 
 extern char *dst;
 
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-7a.c 
b/gcc/testsuite/gcc.target/i386/pr100865-7a.c
index 7de7d4a3ce3ad..9fb5dc5256522 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-7a.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-7a.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O3 -march=skylake" } */
+/* { dg-additional-options "-fno-PIE" { target ia32 } } */
 
 extern long long int array[64];
 
diff --git a/gcc/testsuite/gcc.target/i386/pr100865-7c.c 
b/gcc/testsuite/gcc.target/i386/pr100865-7c.c
index edbfd5b09ed69..695831e59af51 100644
--- a/gcc/testsuite/gcc.target/i386/pr100865-7c.c
+++ b/gcc/testsuite/gcc.target/i386/pr100865-7c.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O3 -march=skylake -mno-avx2" } */
+/* { dg-additional-options "-fno-PIE" { target ia32 } } */
 
 extern long long int array[64];
 
diff --git a/gcc/testsuite/gcc.target/i386/sse2-stv-1.c 
b/gcc/testsuite/gcc.target/i386/sse2-stv-1.c
index 72b57b5923c0d..c6eacc4f92cf9 100644
--- a/gcc/testsuite/gcc.target/i386/sse2-st

[pushed: r15-4760] diagnostics: support multiple output formats simultaneously [PR116613]

2024-10-29 Thread David Malcolm
This patch generalizes diagnostic_context so that rather than having
a single output format, it has a vector of zero or more.

It adds two new options:
 -fdiagnostics-add-output=DIAGNOSTICS-OUTPUT-SPEC
 -fdiagnostics-set-output=DIAGNOSTICS-OUTPUT-SPEC
which both take a new configuration syntax of the form SCHEME ("text" or
"sarif"), optionally followed by ":" and one or more KEY=VALUE pairs,
in this form:

  SCHEME
  SCHEME:KEY=VALUE
  SCHEME:KEY=VALUE,KEY=VALUE
  ...etc

where each SCHEME supports some set of keys.  For example, it's now
possible to use:

  -fdiagnostics-add-output=sarif:version=2.1,file=foo.2.1.sarif \
  -fdiagnostics-add-output=sarif:version=2.2-prerelease,file=foo.2.2.sarif

to add a pair of outputs, each writing to a different file, using
versions 2.1 and 2.2 of the SARIF standard respectively, whilst also
emitting the classic text form of the diagnostics to stderr.

I hope the new syntax gives us room to potentially add new kinds of
output sink in the future (e.g. RPC notifications), and to add new
key/value pairs as needed by the different sinks.

Implementation-wise, the diagnostic_context's m_printer which previously
was used directly by the single output format now becomes a "reference
printer", created by the client (such as the frontend), with defaults
modified by command-line options.  Each of the multiple output sinks has
its own pretty_printer instance, created by cloning the context's
reference printer.

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
Pushed to trunk as r15-4760-g0b73e9382ab51c.

gcc/ChangeLog:
PR other/116613
* Makefile.in (OBJS-libcommon-target): Add opts-diagnostic.o.
* common.opt (fdiagnostics-add-output=): New.
(fdiagnostics-set-output=): New.
(diagnostics_output_format): Drop sarif-file-2.2-prerelease from
enum.
* common.opt.urls: Regenerate.
* diagnostic-buffer.h (diagnostic_buffer::~diagnostic_buffer): New.
(diagnostic_buffer::ensure_per_format_buffer): Rename to...
(diagnostic_buffer::ensure_per_format_buffers): ...this.
(diagnostic_buffer::m_per_format_buffer): Replace with...
(diagnostic_buffer::m_per_format_buffers): ...this, updating type.
* diagnostic-format-json.cc (json_output_format::update_printer):
New.
(json_output_format::follows_reference_printer_p): New.
(diagnostic_output_format_init_json): Drop redundant call to
set_path_format, as this is not a text output format.
* diagnostic-format-sarif.cc: Include "diagnostic-format-text.h".
(sarif_builder::set_printer): New.
(sarif_builder::sarif_builder): Add "printer" param and use it for
m_printer.
(sarif_builder::make_location_object::escape_nonascii_renderer::render):
Rather than using dc.m_printer, create a
diagnostic_text_output_format instance and use its printer.
(sarif_output_format::follows_reference_printer_p): New.
(sarif_output_format::update_printer): New.
(sarif_output_format::sarif_output_format): Pass in correct
printer to m_builder's ctor.
(diagnostic_output_format_init_sarif): Drop redundant call to
set_path_format, as this is not a text output format.  Replace
calls to pp_show_color and set_token_printer with call to
update_printer.  Drop redundant call to set_show_highlight_colors,
as this printer does not show colors.
(diagnostic_output_format_init_sarif_file): Split out file opening
into...
(diagnostic_output_format_open_sarif_file): ...this new function.
(make_sarif_sink): New.
(selftest::test_make_location_object): Provide a pp for the
builder.
* diagnostic-format-sarif.h
(diagnostic_output_format_open_sarif_file): New decl.
(make_sarif_sink): New decl.
* diagnostic-format-text.cc (diagnostic_text_output_format::dump):
Dump sm_follows_reference_printer.
(diagnostic_text_output_format::on_report_verbatim): New.
(diagnostic_text_output_format::follows_reference_printer_p): New.
(diagnostic_text_output_format::update_printer): New.
* diagnostic-format-text.h
(diagnostic_text_output_format::diagnostic_text_output_format):
Add optional "follows_reference_printer" param.
(diagnostic_text_output_format::on_report_verbatim): New decl.
(diagnostic_text_output_format::after_diagnostic): Drop "final".
(diagnostic_text_output_format::follows_reference_printer_p): New
decl.
(class diagnostic_text_output_format): Convert private members to
protected.
(diagnostic_text_output_format::m_follows_reference_printer): New
field.
* diagnostic-format.h
(diagnostic_output_format::on_report_verbatim): New vfunc.
(diagnostic_output_format::follows_reference_printer_p): New vfunc.
(diagnostic_output_format::update_printer): New vfunc.

Re: [PATCH v2 3/3] Simplify switch bit test clustering algorithm

2024-10-29 Thread Andi Kleen
On Tue, Oct 29, 2024 at 01:50:57PM +0100, Richard Biener wrote:
> On Mon, Oct 28, 2024 at 9:58 PM Andi Kleen  wrote:
> >
> > From: Andi Kleen 
> >
> > The current switch bit test clustering enumerates all possible case
> > cluster combinations to find ones that fit the bit test constraints
> > best.  This causes performance problems with very large switches.
> >
> > For bit test clustering which happens naturally in word sized chunks
> > I don't think such an expensive algorithm is really needed.
> >
> > This patch implements a simple greedy algorithm that walks
> > the sorted list and examines word sized windows and tries
> > to cluster them.
> >
> > Surprisingly the new algorithm gives consistently better clusters
> > for the examples I tried.
> >
> > For example from the gcc bootstrap:
> >
> > old: 0-15 16-31 96-175
> > new: 0-31 96-175
> >
> > I'm not fully sure why that is, probably some bug in the old
> > algorithm? This even shows up in the test suite, where if-to-switch-6
> > can now generate a switch, as well as a case in switch-1.c.
> >
> > I don't have a proof that the new algorithm is always as good or better,
> > but so far at least I don't see any counter examples.
> >
> > It also fixes the excessive compile time in PR117091,
> > however this was already fixed by an earlier patch
> > that doesn't run clustering when no targets have multiple
> > values.
> 
> OK if you add a comment (as part of the function comment for example)
> explaining the idea of the algorithm.


I added the comment.

I will commit it with this change. I also had to add a few more
-fno-bit-tests to make the Linaro tester happy.

However this exposes PR117352 which is a negative interaction of the 
more aggressive bit test conversion.  I don't think it's a show stopper,
this can be sorted out later.

-Andi
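
For readers following along, the greedy scheme described above can be
sketched as follows (my simplification, not the committed
tree-switch-conversion.cc code; assumes a 64-bit word and a sorted
array of case values):

  #include <stddef.h>

  /* Walk the sorted case values; extend the current cluster while the
     next value still fits in one 64-bit window starting at the
     cluster's lowest value, otherwise start a new cluster.  */
  size_t
  greedy_cluster (const long *values, size_t n, long *lo, long *hi)
  {
    size_t k = 0;
    for (size_t i = 0; i < n; )
      {
        size_t j = i;
        while (j + 1 < n && values[j + 1] - values[i] < 64)
          j++;
        lo[k] = values[i];
        hi[k] = values[j];
        k++;
        i = j + 1;
      }
    return k;  /* number of clusters */
  }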


[PATCH] RISC-V: fix const interleaved stepped vector with a scalar pattern

2024-10-29 Thread Vineet Gupta
When bisecting for ICE in PR/117353, commit 771256bcb9dd ("RISC-V: Emit costs 
for
bool and stepped const vectors") uncovered yet another latent issue (first 
noted [1])

  [1] https://github.com/patrick-rivos/gcc-postcommit-ci/issues/1625

This patch fixes some of the fortran regressions from that report.

Fixes 71a5ac6703d1 ("RISC-V: Support interleave vector with different step 
sequence")

rv64imafdcv_zvl256b_zba_zbb_zbs_zicond/lp64d/medlow
| # of unexpected case / # of unique unexpected case
|        |          gcc |   g++ | gfortran |
| before |  392 /  108 | 7 / 3 |  91 / 24 |
| after  |  392 /  108 | 7 / 3 |  67 / 12 |
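
To see why IOR is the right combining operation, consider foo's
constant { 1, 1, 2, 1, 3, 1, 4, 1 } from the new test (a worked example
of mine; assumes little endian, viewing the 32-bit elements pairwise as
64-bit chunks):

  elt64[i] = (i + 1) | ((uint64_t) 1 << 32),   i = 0..3

i.e. the even-lane step sequence 1,2,3,4 in the low halves and the
broadcast odd-lane scalar 1 in the high halves.  The two halves occupy
disjoint bit ranges, so merging the shifted scalar into the step vector
needs IOR; the AND used previously zeroes bits instead of merging them,
which is what produced the wrong interleaved constants.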

gcc/ChangeLog:

* config/riscv/riscv-v.cc (expand_const_vector): Use IOR op.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/slp-interleave-5.c: New test.

Signed-off-by: Vineet Gupta 
---
 gcc/config/riscv/riscv-v.cc   |  6 ++--
 .../riscv/rvv/autovec/slp-interleave-5.c  | 35 +++
 2 files changed, 38 insertions(+), 3 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/slp-interleave-5.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 209b7ee88f18..5e728f04cf51 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -1501,9 +1501,9 @@ expand_const_vector (rtx target, rtx src)
gen_int_mode (builder.inner_bits_size (), new_smode),
NULL_RTX, false, OPTAB_DIRECT);
  rtx tmp2 = gen_reg_rtx (new_mode);
- rtx and_ops[] = {tmp2, tmp1, scalar};
- emit_vlmax_insn (code_for_pred_scalar (AND, new_mode),
-  BINARY_OP, and_ops);
+ rtx ior_ops[] = {tmp2, tmp1, scalar};
+ emit_vlmax_insn (code_for_pred_scalar (IOR, new_mode),
+  BINARY_OP, ior_ops);
  emit_move_insn (result, gen_lowpart (mode, tmp2));
}
  else
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/slp-interleave-5.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/slp-interleave-5.c
new file mode 100644
index ..32cfe8a8688c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/slp-interleave-5.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv_zvl1024b -mabi=lp64d -O3 
-fdump-tree-optimized-details" } */
+
+struct S { int a, b; } s[8];
+
+void
+foo ()
+{
+  int i;
+  for (i = 0; i < 8; i++)
+{
+  s[i].b = 1;
+  s[i].a = i+1;
+}
+}
+
+/* { dg-final { scan-tree-dump-times "\{ 1, 1, 2, 1, 3, 1, 4, 1 \}" 1 
"optimized" } } */
+/* { dg-final { scan-assembler {vid\.v} } } */
+/* { dg-final { scan-assembler {vadd\.v} } } */
+/* { dg-final { scan-assembler {vor\.v} } } */
+
+void
+foo2 ()
+{
+  int i;
+  for (i = 0; i < 8; i++)
+{
+  s[i].b = 0;
+  s[i].a = i+1;
+}
+}
+
+/* { dg-final { scan-tree-dump-times "\{ 1, 0, 2, 0, 3, 0, 4, 0 \}" 1 
"optimized" } } */
+/* { dg-final { scan-assembler {vid\.v} } } */
+/* { dg-final { scan-assembler {vadd\.v} } } */
-- 
2.43.0



Re: [PATCH] RISC-V: fix const interleaved stepped vector with a scalar pattern

2024-10-29 Thread 钟居哲
lgtm




juzhe.zh...@rivai.ai
 
From: Vineet Gupta
Date: 2024-10-30 08:11
To: gcc-patches; Jeff Law; Robin Dapp; juzhe . zhong @ rivai . ai
CC: gnu-toolchain; Vineet Gupta
Subject: [PATCH] RISC-V: fix const interleaved stepped vector with a scalar 
pattern
 


[PATCH] libgo: Use stub syscall on GNU/Hurd

2024-10-29 Thread Samuel Thibault
GNU/Hurd does not actually have syscall(), it just has a stub that
always returns ENOSYS, and defines __stub_syscall.  It does however
expose a declaration for it:

  extern long int syscall (long int __sysno, ...) __THROW;

that conflicts with the stub that libgo produces

  int
  syscall(int number __attribute__ ((unused)), ...)

So better match reality by not calling syscall() at all, but not
redefining it either.

ChangeLog:

* libgo/go/syscall/syscall_funcs.go: Do not build on GNU/Hurd.
* libgo/go/syscall/syscall_funcs_stubs.go: Build on GNU/Hurd.
* libgo/runtime/go-nosys.c: Do not produce syscall() stub on
GNU/Hurd.

Signed-off-by: Samuel Thibault 

diff --git a/libgo/go/syscall/syscall_funcs.go 
b/libgo/go/syscall/syscall_funcs.go
index a906fa5a42e..b62278dc27b 100644
--- a/libgo/go/syscall/syscall_funcs.go
+++ b/libgo/go/syscall/syscall_funcs.go
@@ -2,7 +2,7 @@
 // Use of this source code is governed by a BSD-style
 // license that can be found in the LICENSE file.
 
-//go:build darwin || dragonfly || freebsd || hurd || linux || netbsd || 
openbsd || solaris
+//go:build darwin || dragonfly || freebsd || linux || netbsd || openbsd || 
solaris
 // +build darwin dragonfly freebsd hurd linux netbsd openbsd solaris
 
 package syscall
diff --git a/libgo/go/syscall/syscall_funcs_stubs.go 
b/libgo/go/syscall/syscall_funcs_stubs.go
index 11f12bd9ae3..35bc71a5556 100644
--- a/libgo/go/syscall/syscall_funcs_stubs.go
+++ b/libgo/go/syscall/syscall_funcs_stubs.go
@@ -2,7 +2,7 @@
 // Use of this source code is governed by a BSD-style
 // license that can be found in the LICENSE file.
 
-//go:build aix || rtems
+//go:build aix || hurd || rtems
 // +build aix rtems
 
 // These are stubs.
diff --git a/libgo/runtime/go-nosys.c b/libgo/runtime/go-nosys.c
index 30222df7815..cd3e7664ca0 100644
--- a/libgo/runtime/go-nosys.c
+++ b/libgo/runtime/go-nosys.c
@@ -504,7 +504,7 @@ strerror_r (int errnum, char *buf, size_t buflen)
 
 #endif /* ! HAVE_STRERROR_R */
 
-#ifndef HAVE_SYSCALL
+#if !defined(HAVE_SYSCALL) && !defined(__GNU__) /* GNU/Hurd already has a stub 
*/
 int
 syscall(int number __attribute__ ((unused)), ...)
 {


Re: [PATCH v2 9/9] aarch64: Handle alignment when it is bigger than BIGGEST_ALIGNMENT

2024-10-29 Thread Richard Sandiford
Evgeny Karpov  writes:
>> Wednesday, October 23, 2024
>> Richard Sandiford  wrote:
>> 
>>> Or, even if that does work, it isn't clear to me why patching
>>> ASM_OUTPUT_ALIGNED_LOCAL is a complete solution to the problem.
>>
>> This patch reproduces the same code as it was done without declaring 
>> ASM_OUTPUT_ALIGNED_LOCAL.
>> ASM_OUTPUT_ALIGNED_LOCAL is needed to get the alignment value and handle it 
>> when it is bigger than BIGGEST_ALIGNMENT.
>> In all other cases, the code is the same.
>> 
>> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/varasm.cc;h=c2540055421641caed08113d92dbeff7ffc09f49;hb=refs/heads/master#l2137
>> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/varasm.cc;h=c2540055421641caed08113d92dbeff7ffc09f49;hb=refs/heads/master#l2233
>
> Does this information provide more clarity on ASM_OUTPUT_ALIGNED_LOCAL usage?
> If not, this patch will be dropped as a low priority, and FFmpeg, which 
> requires this change, will be patched 
> to avoid using alignment higher than 16 bytes on AArch64.

Hmm, I see.  I think this is surprising enough that it would be worth
a comment.  How about:

  /* Since the assembly directive only specifies a size, and not an
 alignment, we need to follow the default ASM_OUTPUT_LOCAL behavior
 and round the size up to at least a multiple of BIGGEST_ALIGNMENT bits,
 so that each uninitialized object starts on such a boundary.
 However, we also want to allow the alignment (and thus minimum size)
 to exceed BIGGEST_ALIGNMENT.  */

But how does using a larger size force the linker to assign a larger
alignment than BIGGEST_ALIGNMENT?  Is there a second limit in play?

Or does this patch not guarantee that the ffmpeg variable gets the
alignment it wants?  Is it just about suppressing the error?

If it's just about suppressing the error without guaranteeing the
requested alignment, then, yeah, I think patching ffmpeg would
be better.  If the patch does guarantee the alignment, then the
patch seems ok, but I think the comment should explain how, and
explain why BIGGEST_ALIGNMENT isn't larger.
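
For reference, the kind of declaration at issue is roughly this (my
reduction, not the actual FFmpeg source):

  /* A static BSS object requesting 64-byte alignment, above AArch64's
     BIGGEST_ALIGNMENT of 16 bytes.  */
  static float scratch[1024] __attribute__ ((aligned (64)));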

Thanks,
Richard


[PATCH 2/2] Support vector float_extend from __bf16 to float.

2024-10-29 Thread liuhongt
It's supported by vector permutation with zero vector.
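
As a worked illustration (mine, not from the patch; little-endian lane
layout), for V4BF -> V4SF the source is permuted against a zero vector
so each 32-bit lane carries the bf16 payload in its high half:

  op0 (zeros, as V8BF): {  0,  0,  0,  0, 0,  0, 0,  0 }
  op1 (src,   as V8BF): { b0, b1, b2, b3, x,  x, x,  x }
  sel                 : {  0,  8,  1,  9, 2, 10, 3, 11 }
  result     (as V8BF): {  0, b0,  0, b1, 0, b2, 0, b3 }

Since bfloat16 is bitwise the upper half of an IEEE binary32, each
resulting 32-bit lane is exactly float_extend of the corresponding
bf16 element.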

gcc/ChangeLog:

* config/i386/i386-expand.cc
(ix86_expand_vector_bf2sf_with_vec_perm): New function.
* config/i386/i386-protos.h
(ix86_expand_vector_bf2sf_with_vec_perm): New Declare.
* config/i386/mmx.md (extendv2bfv2sf2): New expander.
* config/i386/sse.md (extend2):
Ditto.
(VF1_AVX512BW): New mode iterator.
(sf_cvt_bf16): Add V4SF.
(sf_cvt_bf16_lower): New mode attr.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx512bw-extendbf2sf.c: New test.
* gcc.target/i386/sse2-extendbf2sf.c: New test.
---
 gcc/config/i386/i386-expand.cc| 39 
 gcc/config/i386/i386-protos.h |  2 +
 gcc/config/i386/mmx.md| 18 
 gcc/config/i386/sse.md| 20 +++-
 .../gcc.target/i386/avx512bw-extendbf2sf.c| 46 +++
 .../gcc.target/i386/sse2-extendbf2sf.c| 20 
 6 files changed, 144 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512bw-extendbf2sf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/sse2-extendbf2sf.c

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 7138432659e..df9676b80d4 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -26854,5 +26854,44 @@ ix86_expand_vector_sf2bf_with_vec_perm (rtx dest, rtx 
src)
   emit_move_insn (dest, lowpart_subreg (GET_MODE (dest), target, vperm_mode));
 }
 
+/* Implement extendv8bfv8sf2 with vector permutation.  */
+void
+ix86_expand_vector_bf2sf_with_vec_perm (rtx dest, rtx src)
+{
+  machine_mode vperm_mode, src_mode = GET_MODE (src);
+  switch (src_mode)
+{
+case V16BFmode:
+  vperm_mode = V32BFmode;
+  break;
+case V8BFmode:
+  vperm_mode = V16BFmode;
+  break;
+case V4BFmode:
+  vperm_mode = V8BFmode;
+  break;
+default:
+  gcc_unreachable ();
+}
+
+  int nelt = GET_MODE_NUNITS (vperm_mode);
+  vec_perm_builder sel (nelt, nelt, 1);
+  sel.quick_grow (nelt);
+  for (int i = 0, k = 0, j = nelt; i != nelt; i++)
+sel[i] = i & 1 ? j++ : k++;
+
+  vec_perm_indices indices (sel, 2, nelt);
+
+  rtx target = gen_reg_rtx (vperm_mode);
+  rtx op1 = lowpart_subreg (vperm_mode,
+   force_reg (src_mode, src),
+   src_mode);
+  rtx op0 = CONST0_RTX (vperm_mode);
+  bool ok = targetm.vectorize.vec_perm_const (vperm_mode, vperm_mode,
+ target, op0, op1, indices);
+  gcc_assert (ok);
+  emit_move_insn (dest, lowpart_subreg (GET_MODE (dest), target, vperm_mode));
+}
+
 
 #include "gt-i386-expand.h"
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 55ffdb9dcf1..c26ae5e4f1d 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -259,6 +259,8 @@ extern bool ix86_ternlog_operand_p (rtx op);
 extern rtx ix86_expand_ternlog (machine_mode mode, rtx op0, rtx op1, rtx op2,
int idx, rtx target);
 extern void ix86_expand_vector_sf2bf_with_vec_perm (rtx, rtx);
+extern void ix86_expand_vector_bf2sf_with_vec_perm (rtx, rtx);
+
 
 #ifdef TREE_CODE
 extern void init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree, int);
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index 5c776ec0aba..021ac90ae2a 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -3012,6 +3012,24 @@ (define_expand "truncv2sfv2bf2"
   DONE;
 })
 
+(define_expand "extendv2bfv2sf2"
+  [(set (match_operand:V2SF 0 "register_operand")
+   (float_extend:V2SF
+ (match_operand:V2BF 1 "nonimmediate_operand")))]
+  "TARGET_SSE2 && TARGET_MMX_WITH_SSE"
+{
+  rtx op0 = gen_reg_rtx (V4SFmode);
+  rtx op1 = gen_reg_rtx (V4BFmode);
+
+  emit_move_insn (op1, lowpart_subreg (V4BFmode,
+  force_reg (V2BFmode, operands[1]),
+  V2BFmode));
+  emit_insn (gen_extendv4bfv4sf2 (op0, op1));
+
+  emit_move_insn (operands[0], lowpart_subreg (V2SFmode, op0, V4SFmode));
+  DONE;
+})
+
 ;
 ;;
 ;; Parallel integral arithmetic
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 7f7910383ae..3d57a90fad7 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -530,6 +530,9 @@ (define_mode_iterator VF2_AVX512VL
 (define_mode_iterator VF1_AVX512VL
   [(V16SF "TARGET_EVEX512") (V8SF "TARGET_AVX512VL") (V4SF "TARGET_AVX512VL")])
 
+(define_mode_iterator VF1_AVX512BW
+  [(V16SF "TARGET_EVEX512 && TARGET_AVX512BW") (V8SF "TARGET_AVX2") V4SF])
+
 (define_mode_iterator VF1_AVX10_2
   [(V16SF "TARGET_AVX10_2_512") V8SF V4SF])
 
@@ -30925,7 +30928,11 @@ (define_mode_attr bf16_cvt_2sf
   [(V32BF  "V16SF") (V16BF  "V8SF") (V8BF  "V4SF")])
 ;; Converting from SF to BF
 

[PATCH 1/2] [x86] Support vector float_truncate for SF to BF.

2024-10-29 Thread liuhongt
Generate native instruction whenever possible, otherwise use vector
permutation with odd indices.
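
As a worked illustration of the odd-index permutation (mine, not from
the patch; little-endian): viewing a V4SF source as V8BF, float lane i
occupies the 16-bit elements 2i (low) and 2i+1 (high), and bf16 is the
high half:

  src (as V8BF): { lo0, hi0, lo1, hi1, lo2, hi2, lo3, hi3 }
  sel          : {   1,   3,   5,   7, ... }
  result       : { hi0, hi1, hi2, hi3, ... }

Note that selecting the high halves simply drops the low 16 mantissa
bits (truncation) rather than rounding to nearest even the way the
native vcvtneps2bf16 instruction does, hence the native instruction is
preferred whenever possible.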

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready push to trunk.

gcc/ChangeLog:

* config/i386/i386-expand.cc
(ix86_expand_vector_sf2bf_with_vec_perm): New function.
* config/i386/i386-protos.h
(ix86_expand_vector_sf2bf_with_vec_perm): New declare.
* config/i386/mmx.md (truncv2sfv2bf2): New expander.
* config/i386/sse.md (truncv4sfv4bf2): Ditto.
(truncv8sfv8bf2): Ditto.
(truncv16sfv16bf2): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx512bf16-truncsfbf.c: New test.
* gcc.target/i386/avx512bw-truncsfbf.c: New test.
* gcc.target/i386/ssse3-truncsfbf.c: New test.
---
 gcc/config/i386/i386-expand.cc| 38 +++
 gcc/config/i386/i386-protos.h |  1 +
 gcc/config/i386/mmx.md| 18 
 gcc/config/i386/sse.md| 44 ++
 .../gcc.target/i386/avx512bf16-truncsfbf.c|  5 ++
 .../gcc.target/i386/avx512bw-truncsfbf.c  | 46 +++
 .../gcc.target/i386/ssse3-truncsfbf.c | 20 
 7 files changed, 172 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512bf16-truncsfbf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512bw-truncsfbf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/ssse3-truncsfbf.c

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 63f5e348d64..7138432659e 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -26817,4 +26817,42 @@ ix86_expand_trunc_with_avx2_noavx512f (rtx output, rtx 
input, machine_mode cvt_m
   emit_move_insn (output, gen_lowpart (out_mode, d.target));
 }
 
+/* Implement truncv8sfv8bf2 with vector permutation.  */
+void
+ix86_expand_vector_sf2bf_with_vec_perm (rtx dest, rtx src)
+{
+  machine_mode vperm_mode, src_mode = GET_MODE (src);
+  switch (src_mode)
+{
+case V16SFmode:
+  vperm_mode = V32BFmode;
+  break;
+case V8SFmode:
+  vperm_mode = V16BFmode;
+  break;
+case V4SFmode:
+  vperm_mode = V8BFmode;
+  break;
+default:
+  gcc_unreachable ();
+}
+
+  int nelt = GET_MODE_NUNITS (vperm_mode);
+  vec_perm_builder sel (nelt, nelt, 1);
+  sel.quick_grow (nelt);
+  for (int i = 0; i != nelt; i++)
+sel[i] = (2 * i + 1) % nelt;
+  vec_perm_indices indices (sel, 1, nelt);
+
+  rtx target = gen_reg_rtx (vperm_mode);
+  rtx op0 = lowpart_subreg (vperm_mode,
+   force_reg (src_mode, src),
+   src_mode);
+  bool ok = targetm.vectorize.vec_perm_const (vperm_mode, vperm_mode,
+ target, op0, op0, indices);
+  gcc_assert (ok);
+  emit_move_insn (dest, lowpart_subreg (GET_MODE (dest), target, vperm_mode));
+}
+
+
 #include "gt-i386-expand.h"
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index c1f9147769c..55ffdb9dcf1 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -258,6 +258,7 @@ extern int ix86_ternlog_idx (rtx op, rtx *args);
 extern bool ix86_ternlog_operand_p (rtx op);
 extern rtx ix86_expand_ternlog (machine_mode mode, rtx op0, rtx op1, rtx op2,
int idx, rtx target);
+extern void ix86_expand_vector_sf2bf_with_vec_perm (rtx, rtx);
 
 #ifdef TREE_CODE
 extern void init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree, int);
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index 506f4cab6a8..5c776ec0aba 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -2994,6 +2994,24 @@ (define_expand "truncv2sfv2hf2"
   DONE;
 })
 
+(define_expand "truncv2sfv2bf2"
+  [(set (match_operand:V2BF 0 "register_operand")
+   (float_truncate:V2BF
+ (match_operand:V2SF 1 "nonimmediate_operand")))]
+  "TARGET_SSSE3 && TARGET_MMX_WITH_SSE"
+{
+  rtx op1 = gen_reg_rtx (V4SFmode);
+  rtx op0 = gen_reg_rtx (V4BFmode);
+
+  emit_move_insn (op1, lowpart_subreg (V4SFmode,
+  force_reg (V2SFmode, operands[1]),
+  V2SFmode));
+  emit_insn (gen_truncv4sfv4bf2 (op0, op1));
+
+  emit_move_insn (operands[0], lowpart_subreg (V2BFmode, op0, V4BFmode));
+  DONE;
+})
+
 ;
 ;;
 ;; Parallel integral arithmetic
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 6c28b74ac3f..7f7910383ae 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -30952,6 +30952,24 @@ (define_insn "avx512f_cvtne2ps2bf16_"
   "TARGET_AVX512BF16"
   "vcvtne2ps2bf16\t{%2, %1, %0|%0, %1, %2}")
 
+(define_expand "truncv4sfv4bf2"
+  [(set (match_operand:V4BF 0 "register_operand")
+ (float_truncate:V4BF
+   (match_operand:V4SF 1 "nonimmediate_operand")))]
+  "TARGET_SSSE3"
+{
+ 

[PATCH] c++, v2: Attempt to implement C++26 P3034R1 - Module Declarations Shouldn't be Macros [PR114461]

2024-10-29 Thread Jakub Jelinek
On Fri, Oct 25, 2024 at 12:52:41PM -0400, Jason Merrill wrote:
> This does seem like a hole in the wording.  I think the clear intent is that
> the name/partition must neither be macro-expanded nor come from macro
> expansion.

I'll defer filing the DR and figuring out the right wording for the standard
to you/WG21.

> > The patch below implements what is above described as the first variant
> > of the first issue resolution, i.e. disables expansion of as many tokens
> > as could be in the valid module name and module partition syntax, but
> > as soon as it e.g. sees two adjacent identifiers, the second one can be
> > macro expanded.  So, effectively:
> > #define SEMI ;
> > export module SEMI
> > used to be valid and isn't anymore,
> > #define FOO bar
> > export module FOO;
> > isn't valid,
> > #define COLON :
> > export module COLON private;
> > isn't valid,
> > #define BAR baz
> > export module foo.bar:baz.qux.BAR;
> > isn't valid,
> 
> Agreed.

Ok.

> > but
> > #define BAZ .qux
> > export module foo BAZ;
> > or
> > #define QUX [[]]
> > export module foo QUX;
> > or
> > #define FREDDY :garply
> > export module foo FREDDY;
> > or
> > #define GARPLY private
> > module : GARPLY;
> > etc. is.
> 
> I think QUX is valid, but the others are intended to be invalid.

I've changed the patch so that the BAZ and FREDDY cases above are also
diagnosed as invalid, i.e. cases where a module name or module partition
name is followed by an identifier defined as an object-like or
function-like macro and where, after macro expansion, the first
non-padding/comment token is a dot or colon (so either a macro whose
expansion starts with . or :, or one which expands to nothing with . or :
following that macro, either directly or from some other macro expansion).
E.g. one could have
#define A [[vendor::attr(1 ? 2
#define B 3)]]
export module foo.bar A : B;
which I think ought to be valid and expanded as
export module foo.bar [[vendor::attr(1 ? 2 : 3)]];
or
#define A
#define B baz
export module foo.bar A : B;
which would preprocess into
export module foo.bar:baz;
and I think that is against the intent of the paper where the quick
scanners couldn't figure this out without actually performing preprocessing.
Unfortunately, trying to enter macro context inside of
cpp_maybe_module_directive doesn't seem to work well, so I've instead
added a flag and let cpp_get_token_1 diagnose it later (but still during
preprocessing, because with e.g. -save-temps the distinction would
otherwise be lost, so it seems undesirable to diagnose it during parsing;
besides, whether a token comes from a macro is currently only testable
in the FE when not using -fno-track-macro-expansion, and as the above
testcase shows, the : or . might actually not come from a macro, yet
there could be a macro expanding to nothing in between).

I've kept the GARPLY case above as valid, I don't see the paper changing
anything in that regard (sure, the : has to not come from macro so that
the scanners can figure it out) and I think it isn't needed.

> > --- gcc/testsuite/g++.dg/modules/dir-only-4.C.jj2020-12-22 
> > 23:50:17.057972516 +0100
> > +++ gcc/testsuite/g++.dg/modules/dir-only-4.C   2024-08-08 
> > 09:33:09.454522024 +0200
> > @@ -1,8 +1,8 @@
> >   // { dg-additional-options {-fmodules-ts -fpreprocessed 
> > -fdirectives-only} }
> > -// { dg-module-cmi !foo }
> > +// { dg-module-cmi !baz }
> >   module;
> >   #define foo baz
> > -export module foo;
> > +export module baz;
> >   class import {};
> > --- gcc/testsuite/g++.dg/modules/atom-preamble-2_a.C.jj 2020-12-22 
> > 23:50:17.055972539 +0100
> > +++ gcc/testsuite/g++.dg/modules/atom-preamble-2_a.C2024-08-08 
> > 09:35:56.093364042 +0200
> > @@ -1,6 +1,6 @@
> >   // { dg-additional-options "-fmodules-ts" }
> >   #define malcolm kevin
> > -export module malcolm;
> > +export module kevin;
> >   // { dg-module-cmi kevin }
> >   export class X;
> > --- gcc/testsuite/g++.dg/modules/atom-preamble-4.C.jj   2020-12-22 
> > 23:50:17.055972539 +0100
> > +++ gcc/testsuite/g++.dg/modules/atom-preamble-4.C  2024-08-08 
> > 09:35:32.463670046 +0200
> > @@ -1,5 +1,5 @@
> >   // { dg-additional-options "-fmodules-ts" }
> > -#define NAME(X) X;
> > +#define NAME(X) ;
> > -export module NAME(bob)
> > +export module bob NAME(bob)
> 
> It looks like these tests were specifically testing patterns that this paper
> makes ill-formed, and so we should check for errors instead of changing them
> to be well-formed.

Ok, changed dir-only-4.C and atom-preamble-4.C back to previous content
+ dg-error and added new tests with the changes in and not expecting
error.
I had to keep the atom-preamble-2_a.C change as is, because otherwise
all of atom-preamble-2*.C tests FAIL.

Here is a so far lightly tested patch, ok for trunk if it passes full
bootstrap/regtest or do you want some further changes?

2024-10-29  Jakub Jelinek  

PR c++/114461
libcpp/
* include/cpplib.h: Implement C++26 P3034R1

Re: [PATCH v2 6/8] gcn: Add else operand to masked loads.

2024-10-29 Thread Andrew Stubbs

On 29/10/2024 09:39, Andrew Stubbs wrote:

On 28/10/2024 20:03, Robin Dapp wrote:

I'm not sure how this is different to just deleting the
zero-initializer, which is what I already tested and found some random
behaviour?


The difference is in the else-operand predicate.  So unless there are
more bugs we should only have added VCOND_EXPRs for the cases where
they are absolutely necessary and not unconditionally as currently done
in the gcn backend.


Except that the init_regs pass just puts it straight back in, I'm
testing this additional patch:


--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -4000,7 +4000,7 @@ (define_expand "maskload<mode>di"
  rtx v = gen_rtx_CONST_INT (VOIDmode, MEM_VOLATILE_P (operands[1]));


  emit_insn (gen_gather<mode>_expr_exec (operands[0], addr, as, v,
-                                        operands[0], exec));
+                                        gcn_gen_undef (<MODE>mode), exec));

  DONE;
    })


The tests with this patch (stacked on top of the original) also came 
back clean, and the init_regs pass didn't try to fix it up with the 
testcase I tried (gcc.dg/vect/vect-mask-load-1.c).


Just to be clear, gcn_gen_undef simply creates an unspec with the given 
mode that the insn predicates recognise as a valid input to an undefined 
vec_merge. The machine instruction simply leaves the previous contents 
of the masked lanes of the destination hardware register as it was.


The instruction requires that the input "else" value is in the same 
register as the destination, but at expand time you can obviously put 
any pseudo-registers you like in the vec_merge, and there's no need for 
a scratch register or any other dummy input if we use the unspec.


In principle, we can pre-initialize the destination register with any 
arbitrary value passed to maskload, but I think the "else" parameter 
does not actually work that way, as defined by the patch?


Andrew


Re: [PATCH 1/2] RISC-V:Add intrinsic support for the CMOs extensions

2024-10-29 Thread Jeff Law



On 10/29/24 4:12 AM, shiyul...@iscas.ac.cn wrote:

From: yulong 

gcc/ChangeLog:

 * config.gcc: Add riscv_cmo.h.
 * config/riscv/riscv_cmo.h: New file.
I think Kito pointed out a minor problem, and the linter also pointed
out a whitespace problem.  I've fixed both locally and done a sanity
check build/test.  I'll push this to the trunk momentarily.


Attached patch is what I'm actually committing as a single diff just for 
the archivers.


Jeff

diff --git a/gcc/config.gcc b/gcc/config.gcc
index fd848228722..e2ed3b309cc 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -558,7 +558,7 @@ riscv*)
extra_objs="${extra_objs} riscv-vector-builtins.o 
riscv-vector-builtins-shapes.o riscv-vector-builtins-bases.o"
extra_objs="${extra_objs} thead.o riscv-target-attr.o"
d_target_objs="riscv-d.o"
-   extra_headers="riscv_vector.h riscv_crypto.h riscv_bitmanip.h 
riscv_th_vector.h"
+   extra_headers="riscv_vector.h riscv_crypto.h riscv_bitmanip.h 
riscv_th_vector.h riscv_cmo.h"
target_gtfiles="$target_gtfiles 
\$(srcdir)/config/riscv/riscv-vector-builtins.cc"
target_gtfiles="$target_gtfiles 
\$(srcdir)/config/riscv/riscv-vector-builtins.h"
;;
diff --git a/gcc/config/riscv/riscv_cmo.h b/gcc/config/riscv/riscv_cmo.h
new file mode 100644
index 000..3514fd3f0fe
--- /dev/null
+++ b/gcc/config/riscv/riscv_cmo.h
@@ -0,0 +1,84 @@
+/* RISC-V CMO Extension intrinsics include file.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+#ifndef __RISCV_CMO_H
+#define __RISCV_CMO_H
+
+#if defined (__riscv_zicbom)
+
+extern __inline void
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+__riscv_cmo_clean (void *addr)
+{
+__builtin_riscv_zicbom_cbo_clean (addr);
+}
+
+extern __inline void
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+__riscv_cmo_flush (void *addr)
+{
+__builtin_riscv_zicbom_cbo_flush (addr);
+}
+
+extern __inline void
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+__riscv_cmo_inval (void *addr)
+{
+__builtin_riscv_zicbom_cbo_inval (addr);
+}
+
+#endif // __riscv_zicbom
+
+#if defined (__riscv_zicbop)
+
+# define rnum 1
+
+extern __inline void
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+__riscv_cmo_prefetch (void *addr, const int vs1, const int vs2)
+{
+__builtin_prefetch (addr,vs1,vs2);
+}
+
+extern __inline int
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+__riscv_cmo_prefetchi ()
+{
+return __builtin_riscv_zicbop_cbo_prefetchi (rnum);
+}
+
+#endif // __riscv_zicbop
+
+#if defined (__riscv_zicboz)
+
+extern __inline void
+__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
+__riscv_cmo_zero (void *addr)
+{
+__builtin_riscv_zicboz_cbo_zero (addr);
+}
+
+#endif // __riscv_zicboz
+
+#endif // __RISCV_CMO_H
diff --git a/gcc/testsuite/gcc.target/riscv/cmo-32.c 
b/gcc/testsuite/gcc.target/riscv/cmo-32.c
new file mode 100644
index 000..8e733cc05fc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/cmo-32.c
@@ -0,0 +1,58 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target rv32 } */
+/* { dg-options "-march=rv32gc_zicbom_zicbop_zicboz -mabi=ilp32" } */
+
+#include "riscv_cmo.h"
+
+void foo1 (void *addr)
+{
+__riscv_cmo_clean(0);
+__riscv_cmo_clean(addr);
+__riscv_cmo_clean((void*)0x111);
+}
+
+void foo2 (void *addr)
+{
+__riscv_cmo_flush(0);
+__riscv_cmo_flush(addr);
+__riscv_cmo_flush((void*)0x111);
+}
+
+void foo3 (void *addr)
+{
+__riscv_cmo_inval(0);
+__riscv_cmo_inval(addr);
+__riscv_cmo_inval((void*)0x111);
+}
+
+void foo4 (void *addr)
+{
+__riscv_cmo_prefetch(addr,0,0);
+__riscv_cmo_prefetch(addr,0,1);
+__riscv_cmo_prefetch(addr,0,2);
+__riscv_cmo_prefetch(addr,0,3);
+__riscv_cmo_prefetch(addr,1,0);
+__riscv_cmo_prefetch(addr,1,1);
+__riscv_cmo_prefetch(addr,1,2);
+__riscv_cmo_prefetch(addr,1,3);
+}
+
+int foo5 (int num)
+{
+return __riscv_cmo
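
For readers wanting to try the new intrinsics, a minimal usage sketch
(assuming a toolchain configured with -march=rv64gc_zicbom_zicboz; the
function names are the ones defined in the riscv_cmo.h above):

#include <riscv_cmo.h>

// Zero a cache block, then write it back to memory.
void publish_block (void *addr)
{
  __riscv_cmo_zero (addr);   // cbo.zero: zero the cache block at addr
  __riscv_cmo_clean (addr);  // cbo.clean: write the block back, keep the copy
}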

Re: [PATCH 1/2] RISC-V:Add intrinsic support for the CMOs extensions

2024-10-29 Thread Craig Topper
Jeff, should this question in the spec be resolved before merging this?
https://github.com/riscv-non-isa/riscv-c-api-doc/pull/93/files#r1817437534

On Tue, Oct 29, 2024 at 8:55 AM Jeff Law  wrote:

>
>
> On 10/29/24 9:50 AM, Craig Topper wrote:
> > The '# define rnum 1' may break user code that contains a variable
> > called rnum.
> Yikes!  Thanks for noting.  I'll take care of it.
>
> jeff
>
>


Re: [PATCH v2 2/2] Match: make SAT_ADD case 7 commutative

2024-10-29 Thread Richard Biener
On Mon, Oct 28, 2024 at 4:45 PM Akram Ahmad  wrote:
>
> Case 7 of unsigned scalar saturating addition defines
> SAT_ADD = X <= (X + Y) ? (X + Y) : -1. This is the same as
> SAT_ADD = Y <= (X + Y) ? (X + Y) : -1 due to usadd_left_part_1
> being commutative.
>
> The pattern for case 7 currently does not accept the alternative
> where Y is used in the condition. Therefore, this commit adds the
> commutative property to this case which causes more valid cases of
> unsigned saturating arithmetic to be recognised.
>
> Before:
>  
>  _1 = BIT_FIELD_REF ;
>  sum_5 = _1 + a_4(D);
>  if (a_4(D) <= sum_5)
>goto ; [INV]
>  else
>goto ; [INV]
>
>   :
>
>   :
>  _2 = PHI <255(3), sum_5(2)>
>  return _2;
>
> After:
>[local count: 1073741824]:
>   _1 = BIT_FIELD_REF ;
>   _2 = .SAT_ADD (_1, a_4(D)); [tail call]
>   return _2;
>
> This passes the aarch64-none-linux-gnu regression tests with no new
> failures. The tests written in this patch will fail on targets which
> do not implement the standard names for IFN SAT_ADD.
>
> gcc/ChangeLog:
>
> * match.pd: Modify existing case for SAT_ADD.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/tree-ssa/sat-u-add-match-1-u16.c: New test.
> * gcc.dg/tree-ssa/sat-u-add-match-1-u32.c: New test.
> * gcc.dg/tree-ssa/sat-u-add-match-1-u64.c: New test.
> * gcc.dg/tree-ssa/sat-u-add-match-1-u8.c: New test.
> ---
>  gcc/match.pd  |  4 ++--
>  .../gcc.dg/tree-ssa/sat-u-add-match-1-u16.c   | 21 +++
>  .../gcc.dg/tree-ssa/sat-u-add-match-1-u32.c   | 21 +++
>  .../gcc.dg/tree-ssa/sat-u-add-match-1-u64.c   | 21 +++
>  .../gcc.dg/tree-ssa/sat-u-add-match-1-u8.c| 21 +++
>  5 files changed, 86 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat-u-add-match-1-u16.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat-u-add-match-1-u32.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat-u-add-match-1-u64.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sat-u-add-match-1-u8.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 4fc5efa6247..98c50ab097f 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3085,7 +3085,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  /* SAT_ADD = usadd_left_part_1 | usadd_right_part_1, aka:
> SAT_ADD = (X + Y) | -((X + Y) < X)  */
>  (match (usadd_left_part_1 @0 @1)
> - (plus:c @0 @1)
> + (plus @0 @1)
>   (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
>&& types_match (type, @0, @1
>
> @@ -3166,7 +3166,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  /* Unsigned saturation add, case 7 (branch with le):
> SAT_ADD = x <= (X + Y) ? (X + Y) : -1.  */
>  (match (unsigned_integer_sat_add @0 @1)
> - (cond^ (le @0 (usadd_left_part_1@2 @0 @1)) @2 integer_minus_onep))
> + (cond^ (le @0 (usadd_left_part_1:C@2 @0 @1)) @2 integer_minus_onep))
>
>  /* Unsigned saturation add, case 8 (branch with gt):
> SAT_ADD = x > (X + Y) ? -1 : (X + Y).  */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sat-u-add-match-1-u16.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/sat-u-add-match-1-u16.c
> new file mode 100644
> index 000..0202c70cc83
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/sat-u-add-match-1-u16.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +#include <stdint.h>
> +
> +#define T uint16_t
> +#define UMAX (T) -1
> +
> +T sat_u_add_1 (T a, T b)
> +{
> +  T sum = a + b;
> +  return sum < a ? UMAX : sum;
> +}
> +
> +T sat_u_add_2 (T a, T b)
> +{
> +  T sum = a + b;
> +  return sum < b ? UMAX : sum;
> +}
> +
> +/* { dg-final { scan-tree-dump-times " .SAT_ADD " 2 "optimized" } } */

The testcases will FAIL unless the target has support for .SAT_ADD - you want to
add proper effective target tests here.

The match.pd part looks OK to me.

Richard.

> \ No newline at end of file
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sat-u-add-match-1-u32.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/sat-u-add-match-1-u32.c
> new file mode 100644
> index 000..34c80ba3854
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/sat-u-add-match-1-u32.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +#include <stdint.h>
> +
> +#define T uint32_t
> +#define UMAX (T) -1
> +
> +T sat_u_add_1 (T a, T b)
> +{
> +  T sum = a + b;
> +  return sum < a ? UMAX : sum;
> +}
> +
> +T sat_u_add_2 (T a, T b)
> +{
> +  T sum = a + b;
> +  return sum < b ? UMAX : sum;
> +}
> +
> +/* { dg-final { scan-tree-dump-times " .SAT_ADD " 2 "optimized" } } */
> \ No newline at end of file
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sat-u-add-match-1-u64.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/sat-u-add-match-1-u64.c
> new file mode 100644
> index 000..0718cb566d3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/sat-u-add-match-1-u64.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-op
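
For reference, the equivalence that the case-7 change relies on can be
checked exhaustively for a narrow type: with unsigned wraparound, x + y
overflows iff sum < x iff sum < y, so both spellings of the condition are
interchangeable.  A standalone sketch, not part of the patch:

#include <cassert>
#include <cstdint>

int main ()
{
  for (unsigned x = 0; x < 256; x++)
    for (unsigned y = 0; y < 256; y++)
      {
        uint8_t sum = (uint8_t) (x + y);
        // (x <= x + y) == (y <= x + y) under 8-bit wraparound
        assert ((sum >= (uint8_t) x) == (sum >= (uint8_t) y));
      }
}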

Re: [PATCH] Add 'cobol' to Makefile.def, take 2

2024-10-29 Thread Jakub Jelinek
On Tue, Oct 29, 2024 at 11:56:18AM +0100, Richard Biener wrote:
> It's probably best to have a first commit just generate the directories with 
> the
> empty ChangeLog and amend the contrib/gcc-changelog/git_commit.py
> script's default_changelog_locations.
> 
> I'm not sure about the exact order of the dance, Jakub possibly remembers.
> We'll mainly have to remember when pushing any of the series.

The m2 dances were git_commit.py tweaks:
https://gcc.gnu.org/r13-4588
followed by asking one of us gccadmins on IRC to install that change
on the server, followed by adding just new directory/ies with almost
empty ChangeLog files:
https://gcc.gnu.org/r13-4592
followed by the rest of the changes.

Jakub



Re: [PATCH] c-family: Handle RAW_DATA_CST in complete_array_type [PR117313]

2024-10-29 Thread Joseph Myers
On Tue, 29 Oct 2024, Jakub Jelinek wrote:

> Hi!
> 
> The following testcase ICEs, because
> add_flexible_array_elts_to_size -> complete_array_type
> is done only after braced_lists_to_strings which optimizes
> RAW_DATA_CST surrounded by INTEGER_CST into a larger RAW_DATA_CST
> covering even the boundaries, while I thought it was done before
> that.
> So, RAW_DATA_CST now can be the last constructor_elt in a CONSTRUCTOR
> and so we need the function to take it into account (handle it as
> RAW_DATA_CST standing for RAW_DATA_LENGTH consecutive elements).
> 
> The function wants to support both CONSTRUCTORs without indexes and with
> them (for non-RAW_DATA_CST elts it was just adding 1 for the current
> index).  So, if the RAW_DATA_CST elt has ce->index, we need to add
> RAW_DATA_LENGTH (ce->value) - 1, while if it doesn't (and it isn't cnt == 0
> case where curindex is 0), add that plus 1, i.e. RAW_DATA_LENGTH (ce->value).
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

OK.

-- 
Joseph S. Myers
josmy...@redhat.com



Re: [PATCH] c: Add C2Y N3370 - Case range expressions support [PR117021]

2024-10-29 Thread Joseph Myers
On Tue, 29 Oct 2024, Jakub Jelinek wrote:

> Hi!
> 
> The following patch adds the C2Y N3370 paper support.
> We had the case ranges as a GNU extension for decades, so this patch
> simply:
> 1) adds different diagnostics when it is used in C (depending on flag_isoc2y
>and pedantic and warn_c23_c2y_compat)
> 2) emits a pedwarn in C if in a range conversion changes the value of
>the low or high bounds and in that case doesn't emit -Woverflow and
>similar warnings anymore if the pedwarn has been diagnosed
> 3) changes the handling of empty ranges both in C and C++; previously
>we just warned but let the values be still looked up in the splay
>tree/entered into it (and let only gimplification throw away those
>empty cases), so e.g. case -6 ... -8: break; case -6: break;
>complained about duplicate case label.  But that actually isn't
>duplicate case label, case -6 ... -8: stands for nothing at all
>and that is how it is treated later on (thrown away)
> 
> Older version has been bootstrapped/regtested on x86_64-linux and i686-linux
> successfully (without the last hunk in c_add_case_label and some of testcase
> additions), ok for trunk if even this updated patch passes
> bootstrap/regtest?

OK.

-- 
Joseph S. Myers
josmy...@redhat.com
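
For readers unfamiliar with the extension, a minimal illustration of the
empty-range semantics described in 3) above (hypothetical code; case ranges
are a GNU extension accepted by both gcc and g++):

int classify (int x)
{
  switch (x)
    {
    case -6 ... -8:   // low > high: empty range, stands for no values at all
      return 0;
    case -6:          // not a duplicate label, the range above matches nothing
      return 1;
    default:
      return 2;
    }
}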



Re: [PATCH] ifcombine: For short circuit case, allow 2 defining statements [PR85605]

2024-10-29 Thread Andrew Pinski
On Tue, Oct 29, 2024 at 5:59 AM Richard Biener
 wrote:
>
> On Tue, Oct 29, 2024 at 4:29 AM Andrew Pinski  
> wrote:
> >
> > r0-126134-g5d2a9da9a7f7c1 added support for short-circuiting and combining
> > the ifs into using either AND or OR. But it only allowed the inner condition
> > basic block to contain just the conditional. This changes it to allow up to
> > 2 defining statements, as long as they are just nop conversions for either
> > the lhs or rhs of the conditional.
> >
> > This should allow ccmp to be used on aarch64 and x86_64 (APX) slightly
> > more than before.
> >
> > Bootstrapped and tested on x86_64-linux-gnu.
> >
> > PR tree-optimization/85605
> >
> > gcc/ChangeLog:
> >
> > * tree-ssa-ifcombine.cc (can_combine_bbs_with_short_circuit): New 
> > function.
> > (ifcombine_ifandif): Use can_combine_bbs_with_short_circuit instead 
> > of checking
> > if iterator is one before the last statement.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * g++.dg/tree-ssa/ifcombine-ccmp-1.C: New test.
> > * gcc.dg/tree-ssa/ssa-ifcombine-ccmp-7.c: New test.
> > * gcc.dg/tree-ssa/ssa-ifcombine-ccmp-8.c: New test.
> >
> > Signed-off-by: Andrew Pinski 
> > ---
> >  .../g++.dg/tree-ssa/ifcombine-ccmp-1.C| 27 +
> >  .../gcc.dg/tree-ssa/ssa-ifcombine-ccmp-7.c| 18 +
> >  .../gcc.dg/tree-ssa/ssa-ifcombine-ccmp-8.c| 19 +
> >  gcc/tree-ssa-ifcombine.cc | 39 ++-
> >  4 files changed, 101 insertions(+), 2 deletions(-)
> >  create mode 100644 gcc/testsuite/g++.dg/tree-ssa/ifcombine-ccmp-1.C
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-7.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-8.c
> >
> > diff --git a/gcc/testsuite/g++.dg/tree-ssa/ifcombine-ccmp-1.C 
> > b/gcc/testsuite/g++.dg/tree-ssa/ifcombine-ccmp-1.C
> > new file mode 100644
> > index 000..282cec8c628
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.dg/tree-ssa/ifcombine-ccmp-1.C
> > @@ -0,0 +1,27 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -g -fdump-tree-optimized --param 
> > logical-op-non-short-circuit=1" } */
> > +
> > +/* PR tree-optimization/85605 */
> > +#include <cstdint>
> > +
> > +template <typename T, typename T2>
> > +inline bool cmp(T a, T2 b) {
> > +  return a<0 ? true : T2(a) < b;
> > +}
> > +
> > +template <typename T, typename T2>
> > +inline bool cmp2(T a, T2 b) {
> > +  return (a<0) | (T2(a) < b);
> > +}
> > +
> > +bool f(int a, int b) {
> > +return cmp(int64_t(a), unsigned(b));
> > +}
> > +
> > +bool f2(int a, int b) {
> > +return cmp2(int64_t(a), unsigned(b));
> > +}
> > +
> > +
> > +/* Both of these functions should be optimized to the same, and have an | 
> > in them. */
> > +/* { dg-final { scan-tree-dump-times " \\\| " 2 "optimized" } } */
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-7.c 
> > b/gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-7.c
> > new file mode 100644
> > index 000..1bdbb9358b4
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-7.c
> > @@ -0,0 +1,18 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -g -fdump-tree-optimized --param 
> > logical-op-non-short-circuit=1" } */
> > +
> > +/* PR tree-optimization/85605 */
> > +/* Like ssa-ifcombine-ccmp-1.c but with conversion from unsigned to signed 
> > in the
> > +   inner bb which should be able to move too. */
> > +
> > +int t (int a, unsigned b)
> > +{
> > +  if (a > 0)
> > +  {
> > +signed t = b;
> > +if (t > 0)
> > +  return 0;
> > +  }
> > +  return 1;
> > +}
> > +/* { dg-final { scan-tree-dump "\&" "optimized" } } */
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-8.c 
> > b/gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-8.c
> > new file mode 100644
> > index 000..8d74b4932c5
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-ifcombine-ccmp-8.c
> > @@ -0,0 +1,19 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -g -fdump-tree-optimized --param 
> > logical-op-non-short-circuit=1" } */
> > +
> > +/* PR tree-optimization/85605 */
> > +/* Like ssa-ifcombine-ccmp-2.c but with conversion from unsigned to signed 
> > in the
> > +   inner bb which should be able to move too. */
> > +
> > +int t (int a, unsigned b)
> > +{
> > +  if (a > 0)
> > +goto L1;
> > +  signed t = b;
> > +  if (t > 0)
> > +goto L1;
> > +  return 0;
> > +L1:
> > +  return 1;
> > +}
> > +/* { dg-final { scan-tree-dump "\|" "optimized" } } */
> > diff --git a/gcc/tree-ssa-ifcombine.cc b/gcc/tree-ssa-ifcombine.cc
> > index 39702929fc0..3acecda31cc 100644
> > --- a/gcc/tree-ssa-ifcombine.cc
> > +++ b/gcc/tree-ssa-ifcombine.cc
> > @@ -400,6 +400,38 @@ update_profile_after_ifcombine (basic_block 
> > inner_cond_bb,
> >outer2->probability = profile_probability::never ();
> >  }
> >
> > +/* Returns true if inner_cond_bb contains just the condition or 1/2 
> > statements
> > +   that define lhs or rhs with a nop conversion. */
> > +
> > 

Re: [PATCH 1/2] RISC-V:Add intrinsic support for the CMOs extensions

2024-10-29 Thread Jeff Law




On 10/29/24 10:20 AM, Craig Topper wrote:
Jeff, should this question in the spec be resolved before merging this? 
https://github.com/riscv-non-isa/riscv-c-api-doc/pull/93/files#r1817437534
Actually, I think I mis-read what that comment applied to...  Let me 
look at it again.


jeff



[pushed: r15-4739] jit: fix leak of pending_assemble_externals_set [PR117275]

2024-10-29 Thread David Malcolm
My recent r15-4580-g779c0390e3b57d fix for resetting state in
varasm.cc introduced some noise to "make selftest-valgrind" and,
presumably, a memory leak in libgccjit:

==2462086== 160 (56 direct, 104 indirect) bytes in 1 blocks are definitely lost 
in loss record 248 of 352
==2462086==at 0x5270E7D: operator new(unsigned long) 
(vg_replace_malloc.c:342)
==2462086==by 0x1D1EB89: init_varasm_once() (varasm.cc:6806)
==2462086==by 0x181C845: backend_init() (toplev.cc:1826)
==2462086==by 0x181D41A: do_compile() (toplev.cc:2193)
==2462086==by 0x181D99C: toplev::main(int, char**) (toplev.cc:2371)
==2462086==by 0x378391D: main (main.cc:39)

Fixed thusly.

Successfully bootstrapped & regtested on x86_64-pc-linux-gnu.
Tested lightly on powerpc64le-unknown-linux-gnu.
Pushed to trunk as r15-4739-g7f41203f08b994.

gcc/ChangeLog:
PR jit/117275
* varasm.cc (process_pending_assemble_externals): Reset
pending_assemble_externals_set to nullptr after deleting it.
(varasm_cc_finalize): Delete pending_assemble_externals_set.

Signed-off-by: David Malcolm 
---
 gcc/varasm.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/varasm.cc b/gcc/varasm.cc
index ce1077b6d4bd..deefbac5b7b2 100644
--- a/gcc/varasm.cc
+++ b/gcc/varasm.cc
@@ -2575,6 +2575,7 @@ process_pending_assemble_externals (void)
   pending_assemble_externals_processed = true;
   pending_libcall_symbols = NULL_RTX;
   delete pending_assemble_externals_set;
+  pending_assemble_externals_set = nullptr;
 #endif
 }
 
@@ -8893,6 +8894,7 @@ varasm_cc_finalize ()
 
 #ifdef ASM_OUTPUT_EXTERNAL
   pending_assemble_externals_processed = false;
+  delete pending_assemble_externals_set;
   pending_assemble_externals_set = nullptr;
 #endif
 
-- 
2.26.3
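
The idiom behind the fix generalizes: any global owning pointer that a
*_cc_finalize routine may run over twice (libgccjit can compile repeatedly
within one process) must be both deleted and nulled.  A minimal sketch with
stand-in names, not the real varasm.cc declarations:

#include <set>
#include <string>

static std::set<std::string> *externals_set;

static void finalize_state ()
{
  delete externals_set;     // 'delete' on a null pointer is a no-op, so safe
  externals_set = nullptr;  // avoids double-delete and stale reuse next run
}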



Re: [PATCH] Remove code in vectorizer pattern recog relying on vec_cond{u,eq,}

2024-10-29 Thread Richard Biener
On Sat, 26 Oct 2024, Richard Biener wrote:

> With the intent to rely on vec_cond_mask and vec_cmp patterns
> comparisons do not need rewriting into COND_EXPRs that eventually
> combine to vec_cond{u,eq,}.
> 
> Bootstrap and regtest running on x86_64-unknown-linux-gnu.

So with this I effectively removed all invocations of adjust_bool_stmts
since there's no path left in check_bool_pattern that would ever
end up populating the set of stmts to rewrite.

What is surprising is that there's no test coverage (or no
regressions) for the optimizations attempted by adjust_bool_stmts
though those were written with sequences of COND_EXPRs in mind.

I'll post a patch removing the code next.

Richard.

>   * tree-vect-patterns.cc (check_bool_pattern): For comparisons
>   we do nothing if we can expand them or we can't replace them
>   with a ? -1 : 0 condition - but the latter would require
>   expanding the comparison which we proved we can't.  So do
>   nothing, aka not think vec_cond{u,eq,} will save us.
> ---
>  gcc/tree-vect-patterns.cc | 37 +
>  1 file changed, 1 insertion(+), 36 deletions(-)
> 
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 46aa3129bb3..a6d246f570c 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -5610,42 +5610,7 @@ check_bool_pattern (tree var, vec_info *vinfo, 
> hash_set<gimple *> &stmts)
>break;
>  
>  default:
> -  if (TREE_CODE_CLASS (rhs_code) == tcc_comparison)
> - {
> -   tree vecitype, comp_vectype;
> -
> -   /* If the comparison can throw, then is_gimple_condexpr will be
> -  false and we can't make a COND_EXPR/VEC_COND_EXPR out of it.  */
> -   if (stmt_could_throw_p (cfun, def_stmt))
> - return false;
> -
> -   comp_vectype = get_vectype_for_scalar_type (vinfo, TREE_TYPE (rhs1));
> -   if (comp_vectype == NULL_TREE)
> - return false;
> -
> -   tree mask_type = get_mask_type_for_scalar_type (vinfo,
> -   TREE_TYPE (rhs1));
> -   if (mask_type
> -   && expand_vec_cmp_expr_p (comp_vectype, mask_type, rhs_code))
> - return false;
> -
> -   if (TREE_CODE (TREE_TYPE (rhs1)) != INTEGER_TYPE)
> - {
> -   scalar_mode mode = SCALAR_TYPE_MODE (TREE_TYPE (rhs1));
> -   tree itype
> - = build_nonstandard_integer_type (GET_MODE_BITSIZE (mode), 1);
> -   vecitype = get_vectype_for_scalar_type (vinfo, itype);
> -   if (vecitype == NULL_TREE)
> - return false;
> - }
> -   else
> - vecitype = comp_vectype;
> -   if (! expand_vec_cond_expr_p (vecitype, comp_vectype, rhs_code))
> - return false;
> - }
> -  else
> - return false;
> -  break;
> +  return false;
>  }
>  
>bool res = stmts.add (def_stmt);
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH 1/4] sched1: hookize pressure scheduling spilling aggressiveness

2024-10-29 Thread Vineet Gupta
On 10/29/24 08:05, Jeff Law wrote:
> On 10/20/24 1:40 PM, Vineet Gupta wrote:
>> Pressure sensitive scheduling seems to prefer "wide" schedules with more
>> parallelism tending to more spills. This works better for in-order
>> cores [1][2].
> I'm not really sure I'd characterize it that way, but I can also see how 
> you got to the wide vs narrow characterization.
>
> In my mind what we're really doing is refining a bit of the pressure 
> reduction heuristic to more accurately reflect the result of issuing 
> insns that reduce pressure.  So it's more like a level of how 
> aggressively we try to reduce pressure.

Right, this characterization wasn't my first choice either but from earlier 
discussions on the list it seems the existing behavior was "as-designed"
and more of a feature than anti-feature. So I wanted this patch to come out as 
neutral vs. being pejorative (as in it makes spilling less aggressive)...

>> This patch allows for an opt-in target hook
>> TARGET_SCHED_PRESSURE_PREFER_NARROW:
>>
>>   - The default hook (returns false) preserves existing behavior of wider
>> schedules, more parallelism and potentially more spills.
>>
>>   - targets implementing the hook as true get the reverse effect.
> I think we need a better name/description here.  If SCHED_PRESSURE_MODEL 
> was a target hook then I'd probably be looking to make it an integer 
> indicating how aggressively to try and reduce pressure.   So perhaps 
> TARGET_SCHED_PRESSURE_AGGRESSIVE?  I don't particularly like it, but I 
> like it better than the PREFER_NARROW idea.  Everything else I've come 
> up with has been scarily verbose.
>
> TARGET_SCHED_PRESSURE_INCLUDE_PRESSURE_REDUCING_INSNS is on the absurd 
> side.  Removing the second PRESSURE makes me think of vector reductions, 
> so I didn't want to do that :-)
>
> Certainly open to more ideas on the naming, which I think will impact 
> the documentation & comments as well.
>
> And to be 100% clear, no concerns with the behavior of the patch, it's 
> really just the naming convention, documentation/comments.
>
> Thoughts?

I agree the NARROW/WIDE stuff is obfuscating things in technicalities.
How about TARGET_SCHED_PRESSURE_SPILL_AGGRESSIVE, with true (default) being 
the existing behavior and false being the new semantics?
It's a bit verbose but I think clear enough.


> ps.  I've got to get out of my bubble more often.  Picked up a bug at 
> the RVI summit...  Clearly my immune system isn't firing on all cylinders.

Oh gosh. Get well soon!

Thx,
-Vineet


Re: [PATCH] Add 'cobol' to Makefile.def, take 2

2024-10-29 Thread Jakub Jelinek
On Mon, Oct 28, 2024 at 03:12:03PM -0400, James K. Lowden wrote:
> Jakub Jelinek said: 
> 
> > > We'll mainly have to remember when pushing any of the series.
> > 
> > The m2 dances were git_commit.py tweaks:
> > https://gcc.gnu.org/r13-4588
> > followed by asking one of us gccadmins on IRC to install that change
> > on the server, followed by adding just new directory/ies with almost
> > empty ChangeLog files:
> > https://gcc.gnu.org/r13-4592
> > followed by the rest of the changes.
> 
> This 1st patch includes a stub cobol/ChangeLog, if that helps.  

It needs to go into a separate commit (i.e. register the new ChangeLog
locations in one commit, let gccadmins update the ~gccadmin copy of it,
then commit a patch which only adds ChangeLog files and nothing else,
and only after that commit the rest, because normally ChangeLog changes
are rejected if they are part of other changes).

Jakub



Re: [PATCH] c++: Implement P2662R3, Pack Indexing [PR113798]

2024-10-29 Thread Marek Polacek
On Wed, Oct 23, 2024 at 10:20:39AM -0400, Patrick Palka wrote:
> On Tue, 22 Oct 2024, Marek Polacek wrote:
> 
> > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
> > 
> > -- >8 --
> > This patch implements C++26 Pack Indexing, as described in
> > <https://wg21.link/p2662r3>.
> > 
> > The issue discussing how to mangle pack indexes has not been resolved
> > yet <https://github.com/itanium-cxx-abi/cxx-abi/issues/175> and I've
> > made no attempt to address it so far.
> > 
> > Rather than introducing a new template code for a pack indexing, I'm
> > adding a new operand to EXPR_PACK_EXPANSION to store the index; for
> > TYPE_PACK_EXPANSION, I'm stashing the index into TYPE_VALUES_RAW.  This
> 
> What are the pros and cons of reusing TYPE/EXPR_PACK_EXPANSION instead
> of creating two new tree codes for these operators (one of whose
> operands would itself be a bare TYPE/EXPR_PACK_EXPANSION)?
> 
> I feel a little iffy at first glance about reusing these tree codes
> since it muddles what "kind" of tree they are: currently they represent
> a _vector_ or types/exprs (which is reflected by their tcc_exceptional
> class), and with this approach they can now also represent a single
> type/expr (despite their tcc_exceptional class), depending on whether
> PACK_EXPANSION_INDEX is set.
> 
> At the same time, the pattern of a generic *_PACK_EXPANSION can be
> anything whereas for these index operators we know it's always a single
> bare pack, so we also don't need the full expressivity of
> *_PACK_EXPANSION to represent these operators either.
> 
> Maybe it's the case that reusing these tree codes significantly
> simplifies parts of the implementation?

That was pretty much the reason.  But now I think it's cleaner to introduce
new codes, and that the clarity is worth the pain of adding new codes.
 
> > @@ -13814,6 +13833,10 @@ tsubst_pack_expansion (tree t, tree args, 
> > tsubst_flags_t complain,
> >  {
> >tree args = ARGUMENT_PACK_ARGS (TREE_VALUE (packs));
> >  
> > +  /* C++26 Pack Indexing.  */
> > +  if (index)
> > +   return pack_index_element (index, args, complain);
> 
> I'd expect every pack index operator to hit this code path since its
> pattern should always be a bare pack...
> 
> > +
> >/* If the argument pack is a single pack expansion, pull it out.  */
> >if (TREE_VEC_LENGTH (args) == 1
> >   && pack_expansion_args_count (args))
> > @@ -13946,6 +13969,10 @@ tsubst_pack_expansion (tree t, tree args, 
> > tsubst_flags_t complain,
> >&& PACK_EXPANSION_P (TREE_VEC_ELT (result, 0)))
> >  return TREE_VEC_ELT (result, 0);
> >  
> > +  /* C++26 Pack Indexing.  */
> > +  if (index)
> > +return pack_index_element (index, result, complain);
> 
> ... so this code path should be necessary?

This is no longer in the v2 patch.

Marek
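
For readers following the thread, a minimal sketch of the feature's surface
syntax per P2662R3 (independent of the internal representation debated
above; requires a compiler implementing C++26 pack indexing):

template <typename... Ts>
using first_t = Ts...[0];            // pack indexing in a type position

template <auto... Vs>
constexpr auto second_v = Vs...[1];  // pack indexing in an expression

static_assert (second_v<10, 20, 30> == 20);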



Re: [PATCH] c++: Implement P2662R3, Pack Indexing [PR113798]

2024-10-29 Thread Marek Polacek
On Thu, Oct 24, 2024 at 04:29:02PM -0400, Patrick Palka wrote:
> On Wed, 23 Oct 2024, Jason Merrill wrote:
> 
> > On 10/23/24 10:20 AM, Patrick Palka wrote:
> > > On Tue, 22 Oct 2024, Marek Polacek wrote:
> > > 
> > > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
> > > > 
> > > > -- >8 --
> > > > This patch implements C++26 Pack Indexing, as described in
> > > > <https://wg21.link/p2662r3>.
> > > > 
> > > > The issue discussing how to mangle pack indexes has not been resolved
> > > > yet <https://github.com/itanium-cxx-abi/cxx-abi/issues/175> and I've
> > > > made no attempt to address it so far.
> > > > 
> > > > Rather than introducing a new template code for a pack indexing, I'm
> > > > adding a new operand to EXPR_PACK_EXPANSION to store the index; for
> > > > TYPE_PACK_EXPANSION, I'm stashing the index into TYPE_VALUES_RAW.  This
> > > 
> > > What are the pros and cons of reusing TYPE/EXPR_PACK_EXPANSION instead
> > > of creating two new tree codes for these operators (one of whose
> > > operands would itself be a bare TYPE/EXPR_PACK_EXPANSION)?
> > > 
> > > I feel a little iffy at first glance about reusing these tree codes
> > > since it muddles what "kind" of tree they are: currently they represent
> > > a _vector_ of types/exprs (which is reflected by their tcc_exceptional
> > > class), and with this approach they can now also represent a single
> > > type/expr (despite their tcc_exceptional class), depending on whether
> > > PACK_EXPANSION_INDEX is set.
> 
> Oops, I just noticed that TYPE/EXPR_PACK_EXPANSION are tcc_type/expr
> rather than tcc_exceptional, somewhat surprisingly.  I must've been
> thinking about ARGUMENT_PACK_SELECT which is tcc_exceptional.  But I
> guess conceptually they still represent a vector of types/exprs rather
> than a single type/expr currently.
> 
> > 
> > Yeah, I made a similar comment.
> 
> FWIW there's an interesting example mentioned in
> https://github.com/itanium-cxx-abi/cxx-abi/issues/175:
> 
> template <typename... T> struct tuple {
>   template <int I> T...[I] get();
> };
>
> tuple<int, char, bool> t;

I've added that test in the v2 patch.
 
> How should we represent the partial instantiation of T...[I] with
> T={int, char, bool}?  IIUC with the current approach we could use
> PACK_EXPANSION_EXTRA_ARGS to defer expansion until the index is known.
> 
> With new tree codes, e.g. PACK_INDEX_EXPR/TYPE, if we represent the
> pack/pattern itself as a *_PACK_EXPANSION operand we could just
> expand that immediately, yielding a TREE_VEC operand.  (And this
> expansion shouldn't allocate due to the T... optimization in
> tsubst_pack_expansion.)
 
What happens now is that we see that the index is value-dep and so
tsubst_pack_index just returns the original pack index.  I'm not
sure I've done that correctly.

Thanks,
Marek



[PATCH v2] c++: Implement P2662R3, Pack Indexing [PR113798]

2024-10-29 Thread Marek Polacek
On Tue, Oct 22, 2024 at 07:42:57PM -0400, Jason Merrill wrote:
> On 10/22/24 3:22 PM, Marek Polacek wrote:
> > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
> > 
> > -- >8 --
> > This patch implements C++26 Pack Indexing, as described in
> > <https://wg21.link/p2662r3>.
> 
> Great!
> 
> > The issue discussing how to mangle pack indexes has not been resolved
> > yet <https://github.com/itanium-cxx-abi/cxx-abi/issues/175> and I've
> > made no attempt to address it so far.
> > 
> > Rather than introducing a new template code for a pack indexing, I'm
> > adding a new operand to EXPR_PACK_EXPANSION to store the index; for
> > TYPE_PACK_EXPANSION, I'm stashing the index into TYPE_VALUES_RAW.  This
> > feature is akin to __type_pack_element, so they can share the element
> > extraction part.

In v2, I'm using two new codes, as discussed elsewhere.

> > A pack indexing in a decltype proved to be a bit tricky; eventually,
> > I've added PACK_EXPANSION_PARENTHESIZED_P -- while parsing, we can't
> > really tell what it's going to expand to.
> 
> As I comment below, I think we should have enough information while parsing;
> what it expands to doesn't matter.

Yup; I must not have realized that pack-index-expression is a product of
id-expression.
 
> > With this feature, it's valid to write something like
> > 
> >using U = tmpl;
> > 
> > where we first expand the template argument into
> > 
> >Ts...[Is#0], Ts...[Is#1], ...
> > 
> > and then substitute each individual pack index.
> > 
> > I have no test for the module.cc code, that is just guesswork.
> 
> Looks straightforward enough.

It was.  I made sure with an assert that the new code is exercised.
 
> > @@ -2605,6 +2605,8 @@ write_type (tree type)
> >  case TYPE_PACK_EXPANSION:
> >write_string ("Dp");
> >write_type (PACK_EXPANSION_PATTERN (type));
> > + /* TODO: Mangle PACK_EXPANSION_INDEX
> > +  */
> 
> Could we warn about this so it doesn't get forgotten?  And similarly in
> write_expression?

There is now a new sorry.
 
> > @@ -3952,7 +3953,11 @@ find_parameter_packs_r (tree *tp, int 
> > *walk_subtrees, void* data)
> > break;
> >   case VAR_DECL:
> > -  if (DECL_PACK_P (t))
> > +  /* We can have
> > +  T...[0] a;
> > +  (T...[0])(a); // #1
> > +where the 'a' in #1 is not a bare parameter pack.  */
> > +  if (DECL_PACK_P (t) && !PACK_EXPANSION_INDEX (TREE_TYPE (t)))
> 
> Seems like the INDEX check should move into DECL_PACK_P?
> 
> Why doesn't this apply to PARM_DECL above?

I think this is now moot.
 
> > @@ -13946,6 +13969,10 @@ tsubst_pack_expansion (tree t, tree args, 
> > tsubst_flags_t complain,
> > && PACK_EXPANSION_P (TREE_VEC_ELT (result, 0)))
> >   return TREE_VEC_ELT (result, 0);
> > +  /* C++26 Pack Indexing.  */
> > +  if (index)
> > +return pack_index_element (index, result, complain);
> 
> Could we only compute the desired element rather than computing all of them
> and selecting the desired one?

I don't think so.  Especially now that the PACK_EXPANSION_P is just one
operand of a PACK_INDEX_*, and tsubst_pack_expansion is agnostic about
whether the expansion is part of a pack index.
 
> > @@ -16897,17 +16924,23 @@ tsubst (tree t, tree args, tsubst_flags_t 
> > complain, tree in_decl)
> > ctx = tsubst_pack_expansion (ctx, args,
> >  complain | tf_qualifying_scope,
> >  in_decl);
> > -   if (ctx == error_mark_node
> > -   || TREE_VEC_LENGTH (ctx) > 1)
> > +   if (ctx == error_mark_node)
> >   return error_mark_node;
> > -   if (TREE_VEC_LENGTH (ctx) == 0)
> > +   /* If there was a pack-index-specifier, we won't get a TREE_VEC,
> > +  just the single element.  */
> > +   if (TREE_CODE (ctx) == TREE_VEC)
> >   {
> > -   if (complain & tf_error)
> > - error ("%qD is instantiated for an empty pack",
> > -TYPENAME_TYPE_FULLNAME (t));
> > -   return error_mark_node;
> > +   if (TREE_VEC_LENGTH (ctx) > 1)
> > + return error_mark_node;
> 
> This is preexisting, but it seems like we're missing a call to error() in
> this case.

Added.
 
> > @@ -17041,13 +17074,20 @@ tsubst (tree t, tree args, tsubst_flags_t 
> > complain, tree in_decl)
> > else
> >   {
> > bool id = DECLTYPE_TYPE_ID_EXPR_OR_MEMBER_ACCESS_P (t);
> > -   if (id && TREE_CODE (DECLTYPE_TYPE_EXPR (t)) == BIT_NOT_EXPR
> > -   && EXPR_P (type))
> > +   tree op = DECLTYPE_TYPE_EXPR (t);
> > +   if (id && TREE_CODE (op) == BIT_NOT_EXPR && EXPR_P (type))
> >   /* In a template ~id could be either a complement expression
> >  or an unqualified-id naming a destructor; if instantiating
> >  it produces an expression, it's not an id-expression or
> > 

[PATCH] Remove dead part of bool pattern recognition

2024-10-29 Thread Richard Biener
Given we no longer want vcond[u]{,_eq} and VEC_COND_EXPR or COND_EXPR
with embedded GENERIC comparisons the whole check_bool_pattern
and adjust_bool_stmts machinery is dead.  It is effectively dead
after r15-4713-g0942bb85fc5573 and the following patch removes it.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

* tree-vect-patterns.cc (check_bool_pattern): Remove.
(adjust_bool_pattern_cast): Likewise.
(adjust_bool_pattern): Likewise.
(sort_after_uid): Likewise.
(adjust_bool_stmts): Likewise.
(vect_recog_bool_pattern): Remove calls to check_bool_pattern
and fold as if it returns false.
---
 gcc/tree-vect-patterns.cc | 385 --
 1 file changed, 34 insertions(+), 351 deletions(-)

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 302101fa6a0..945e7d2dc45 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -5360,300 +5360,6 @@ vect_recog_mod_var_pattern (vec_info *vinfo,
 }
 
 
-/* Helper function of vect_recog_bool_pattern.  Called recursively, return
-   true if bool VAR can and should be optimized that way.  Assume it shouldn't
-   in case it's a result of a comparison which can be directly vectorized into
-   a vector comparison.  Fills in STMTS with all stmts visited during the
-   walk.  */
-
-static bool
-check_bool_pattern (tree var, vec_info *vinfo, hash_set<gimple *> &stmts)
-{
-  tree rhs1;
-  enum tree_code rhs_code;
-
-  stmt_vec_info def_stmt_info = vect_get_internal_def (vinfo, var);
-  if (!def_stmt_info)
-return false;
-
-  gassign *def_stmt = dyn_cast <gassign *> (def_stmt_info->stmt);
-  if (!def_stmt)
-return false;
-
-  if (stmts.contains (def_stmt))
-return true;
-
-  rhs1 = gimple_assign_rhs1 (def_stmt);
-  rhs_code = gimple_assign_rhs_code (def_stmt);
-  switch (rhs_code)
-{
-case SSA_NAME:
-  if (! check_bool_pattern (rhs1, vinfo, stmts))
-   return false;
-  break;
-
-CASE_CONVERT:
-  if (!VECT_SCALAR_BOOLEAN_TYPE_P (TREE_TYPE (rhs1)))
-   return false;
-  if (! check_bool_pattern (rhs1, vinfo, stmts))
-   return false;
-  break;
-
-case BIT_NOT_EXPR:
-  if (! check_bool_pattern (rhs1, vinfo, stmts))
-   return false;
-  break;
-
-case BIT_AND_EXPR:
-case BIT_IOR_EXPR:
-case BIT_XOR_EXPR:
-  if (! check_bool_pattern (rhs1, vinfo, stmts)
- || ! check_bool_pattern (gimple_assign_rhs2 (def_stmt), vinfo, stmts))
-   return false;
-  break;
-
-default:
-  return false;
-}
-
-  bool res = stmts.add (def_stmt);
-  /* We can't end up recursing when just visiting SSA defs but not PHIs.  */
-  gcc_assert (!res);
-
-  return true;
-}
-
-
-/* Helper function of adjust_bool_pattern.  Add a cast to TYPE to a previous
-   stmt (SSA_NAME_DEF_STMT of VAR) adding a cast to STMT_INFOs
-   pattern sequence.  */
-
-static tree
-adjust_bool_pattern_cast (vec_info *vinfo,
- tree type, tree var, stmt_vec_info stmt_info)
-{
-  gimple *cast_stmt = gimple_build_assign (vect_recog_temp_ssa_var (type, 
NULL),
-  NOP_EXPR, var);
-  append_pattern_def_seq (vinfo, stmt_info, cast_stmt,
- get_vectype_for_scalar_type (vinfo, type));
-  return gimple_assign_lhs (cast_stmt);
-}
-
-/* Helper function of vect_recog_bool_pattern.  Do the actual transformations.
-   VAR is an SSA_NAME that should be transformed from bool to a wider integer
-   type, OUT_TYPE is the desired final integer type of the whole pattern.
-   STMT_INFO is the info of the pattern root and is where pattern stmts should
-   be associated with.  DEFS is a map of pattern defs.  */
-
-static void
-adjust_bool_pattern (vec_info *vinfo, tree var, tree out_type,
-stmt_vec_info stmt_info, hash_map <tree, tree> &defs)
-{
-  gimple *stmt = SSA_NAME_DEF_STMT (var);
-  enum tree_code rhs_code, def_rhs_code;
-  tree itype, cond_expr, rhs1, rhs2, irhs1, irhs2;
-  location_t loc;
-  gimple *pattern_stmt, *def_stmt;
-  tree trueval = NULL_TREE;
-
-  rhs1 = gimple_assign_rhs1 (stmt);
-  rhs2 = gimple_assign_rhs2 (stmt);
-  rhs_code = gimple_assign_rhs_code (stmt);
-  loc = gimple_location (stmt);
-  switch (rhs_code)
-{
-case SSA_NAME:
-CASE_CONVERT:
-  irhs1 = *defs.get (rhs1);
-  itype = TREE_TYPE (irhs1);
-  pattern_stmt
-   = gimple_build_assign (vect_recog_temp_ssa_var (itype, NULL),
-  SSA_NAME, irhs1);
-  break;
-
-case BIT_NOT_EXPR:
-  irhs1 = *defs.get (rhs1);
-  itype = TREE_TYPE (irhs1);
-  pattern_stmt
-   = gimple_build_assign (vect_recog_temp_ssa_var (itype, NULL),
-  BIT_XOR_EXPR, irhs1, build_int_cst (itype, 1));
-  break;
-
-case BIT_AND_EXPR:
-  /* Try to optimize x = y & (a < b ? 1 : 0); into
-x = (a < b ? y : 0);
-
-E.g. for:
-  bool a_b, b_b, c_b;
-  TYPE d_T;
-
-

Re: [PATCH 1/4] sched1: hookize pressure scheduling spilling aggressiveness

2024-10-29 Thread Vineet Gupta
On 10/29/24 11:51, Wilco Dijkstra wrote:
> Hi Vineet,
>> I agree the NARROW/WIDE stuff is obfuscating things in technicalities.
> Is there evidence this change would make things significantly worse for
> some targets? 

Honestly I don't think this needs to be behind any toggle or made optional at 
all. The old algorithm was overly eager in spilling. But per last
discussion with Richard [1] at least back in 2012 for some in-order arm32 core 
this was better. And also that's where the wide vs. narrow discussions
came up and that it really mattered, as far as I understood.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659847.html

> I did a few runs on Neoverse V2 with various options and
> it looks beneficial both for integer and FP. On the example and options
> you mentioned I got 20.1% speedup! This is a very wide core, so wide
> vs narrow is not the right terminology as the gains are on both.

Would you be able to test this on some in-order core, maybe a 32-bit one?
But agreed, let's just drop the wide vs. narrow terminology and rephrase this 
in terms of spill aggressiveness.

>> How about  TARGET_SCHED_PRESSURE_SPILL_AGGRESSIVE with true (default) being 
>> existing behavior
>> and false being new semantics.
>> Its a bit verbose but I think clear enough.
> Another option may be to treat it like a new type of pressure algorithm, eg.
> --param sched-pressure-algorithm=3 or update one of the existing ones.
> There are already too many features in the scheduler - it would be better
> to reduce the many variants and focus on doing really well on modern cores.

A purist might argue it is not really a new algorithm, just a little tweak 
to the model algorithm. But I agree there's a zoo of sched toggles out there 
so this might be ok. Anyhow I'm fine either way, whatever is more palatable 
for the community / maintainers. @Jeff, what say you?

And while we are at it - I'd prefer the new behavior to be the default for 
all targets unless they explicitly opt out (and it would be insightful to 
understand those cases, assuming they are real and not made up). But given 
how late we are in the release cycle, I think it would be kosher to keep the 
old behavior for now, and maybe after gcc-15 is released we can flip the 
switch and deal with any potential fallout.

Thanks a bunch for giving this a look.

-Vineet


[PATCH] aarch64: Use canonicalize_comparison in ccmp expansion [PR117346]

2024-10-29 Thread Andrew Pinski
While testing the patch for PR 85605 on aarch64, it was noticed that the
imm_choice_comparison.c test failed. This was because canonicalize_comparison
was not being called in the ccmp case. This can be noticed without the patch
for PR 85605, as evidenced by the new testcase.

Bootstrapped and tested on aarch64-linux-gnu.

PR target/117346

gcc/ChangeLog:

* config/aarch64/aarch64.cc (aarch64_gen_ccmp_first): Call
canonicalize_comparison before figuring out the cmp_mode/cc_mode.
(aarch64_gen_ccmp_next): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/imm_choice_comparison-1.c: New test.

Signed-off-by: Andrew Pinski 
---
 gcc/config/aarch64/aarch64.cc |  6 +++
 .../aarch64/imm_choice_comparison-1.c | 42 +++
 2 files changed, 48 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/imm_choice_comparison-1.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index a6cc00e74ab..cbb7ef13315 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -27353,6 +27353,9 @@ aarch64_gen_ccmp_first (rtx_insn **prep_seq, rtx_insn 
**gen_seq,
   if (op_mode == VOIDmode)
 op_mode = GET_MODE (op1);
 
+  if (CONST_SCALAR_INT_P (op1))
+canonicalize_comparison (op_mode, &code, &op1);
+
   switch (op_mode)
 {
 case E_QImode:
@@ -27429,6 +27432,9 @@ aarch64_gen_ccmp_next (rtx_insn **prep_seq, rtx_insn 
**gen_seq, rtx prev,
   if (op_mode == VOIDmode)
 op_mode = GET_MODE (op1);
 
+  if (CONST_SCALAR_INT_P (op1))
+canonicalize_comparison (op_mode, &cmp_code, &op1);
+
   switch (op_mode)
 {
 case E_QImode:
diff --git a/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison-1.c 
b/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison-1.c
new file mode 100644
index 000..2afebe1a349
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison-1.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+/* PR target/117346 */
+/* Make sure going through ccmp uses immediates similar to the non-ccmp case. */
+/* This is similar to imm_choice_comparison.c's check except that it forces
+   the use of ccmp by reordering the comparison and putting the cast before. */
+
+/*
+** check:
+** ...
+** mov w[0-9]+, -16777217
+** ...
+*/
+
+int
+check (int x, int y)
+{
+  unsigned xu = x;
+  if (xu > 0xfefe && x > y)
+return 100;
+
+  return x;
+}
+
+/*
+** check1:
+** ...
+** mov w[0-9]+, -16777217
+** ...
+*/
+
+int
+check1 (int x, int y)
+{
+  unsigned xu = x;
+  if (x > y && xu > 0xfefe)
+return 100;
+
+  return x;
+}
-- 
2.43.0
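
For context, canonicalize_comparison may flip the comparison code and adjust
the constant by one -- for unsigned x and a constant C below UINT_MAX,
(x > C) is the same as (x >= C + 1) -- and picks whichever immediate is
cheaper to materialize.  A trivial instance with illustrative constants:

bool above (unsigned x)
{
  return x > 0xfffffffeu;   // equivalent to x >= 0xffffffffu
}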



Re: [PATCH 1/2] RISC-V:Add intrinsic support for the CMOs extensions

2024-10-29 Thread Jeff Law




On 10/29/24 10:20 AM, Craig Topper wrote:
Jeff, should this question in the spec be resolved before merging this? 
https://github.com/riscv-non-isa/riscv-c-api-doc/pull/93/files#r1817437534
It looks like a wrapper around prefetch.i, which to the best of my 
knowledge doesn't use/need the prefixing.  At least that's my take.


jeff




Re: [PATCH 1/2] RISC-V:Add intrinsic support for the CMOs extensions

2024-10-29 Thread Craig Topper
The '# define rnum 1' may break user code that contains a variable called
rnum.

On Tue, Oct 29, 2024 at 8:46 AM Jeff Law  wrote:

>
>
> On 10/29/24 4:12 AM, shiyul...@iscas.ac.cn wrote:
> > From: yulong 
> >
> > gcc/ChangeLog:
> >
> >  * config.gcc: Add riscv_cmo.h.
> >  * config/riscv/riscv_cmo.h: New file.
> I think Kito pointed out a minor problem and the linter also pointed
> out a whitespace problem.  I've fixed both locally and done a sanity
> check build/test.  I'll push this to the trunk momentarily.
>
> Attached patch is what I'm actually committing as a single diff just for
> the archivers.
>
> Jeff


Re: [RFC PATCH 5/5] vect: Also cost gconds for scalar

2024-10-29 Thread Richard Biener
On Mon, 28 Oct 2024, Alex Coplan wrote:

> Currently we only cost gconds for the vector loop while we omit costing
> them when analyzing the scalar loop; this unfairly penalizes the vector
> loop in the case of loops with early exits.
> 
> This (together with the previous patches) enables us to vectorize
> std::find with 64-bit element sizes.

OK.

Thanks,
Richard.

> gcc/ChangeLog:
> 
>   * tree-vect-loop.cc (vect_compute_single_scalar_iteration_cost):
>   Don't skip over gconds.
> ---
>  gcc/tree-vect-loop.cc | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
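
The shape of loop this series targets, sketched for concreteness (the core
of std::find over 64-bit elements; an illustration, not a testcase from the
patch):

#include <cstdint>

const uint64_t *find64 (const uint64_t *first, const uint64_t *last,
                        uint64_t key)
{
  for (const uint64_t *p = first; p != last; ++p)
    if (*p == key)   // this early-exit gcond is now costed for the scalar loop too
      return p;
  return last;
}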


Re: [PATCH 1/4] sched1: hookize pressure scheduling spilling aggressiveness

2024-10-29 Thread Jeff Law




On 10/20/24 1:40 PM, Vineet Gupta wrote:

Pressure sensitive scheduling seems to prefer "wide" schedules with more
parallelism tending to more spills. This works better for in-order
cores [1][2].
I'm not really sure I'd characterize it that way, but I can also see how 
you got to the wide vs narrow characterization.


In my mind what we're really doing is refining a bit of the pressure 
reduction heuristic to more accurately reflect the result of issuing 
insns that reduce pressure.  So it's more like a level of how 
aggressively we try to reduce pressure.







This patch allows for an opt-in target hook
TARGET_SCHED_PRESSURE_PREFER_NARROW:

  - The default hook (returns false) preserves existing behavior of wider
schedules, more parallelism and potentially more spills.

  - targets implementing the hook as true get the reverse effect.
I think we need a better name/description here.  If SCHED_PRESSURE_MODEL 
was a target hook then I'd probably be looking to make it an integer 
indicating how aggressively to try and reduce pressure.   So perhaps 
TARGET_SCHED_PRESSURE_AGGRESSIVE?  I don't particularly like it, but I 
like it better than the PREFER_NARROW idea.  Everything else I've come 
up with has been scarily verbose.


TARGET_SCHED_PRESSURE_INCLUDE_PRESSURE_REDUCING_INSNS is on the absurd 
side.  Removing the second PRESSURE makes me think of vector reductions, 
so I didn't want to do that :-)


Certainly open to more ideas on the naming, which I think will impact 
the documentation & comments as well.


And to be 100% clear, no concerns with the behavior of the patch, it's 
really just the naming convention, documentation/comments.


Thoughts?

Jeff

ps.  I've got to get out of my bubble more often.  Picked up a bug at 
the RVI summit...  Clearly my immune system isn't firing on all cylinders.


[PATCH 1/2] Remove dead code in vectorizer pattern recog

2024-10-29 Thread Richard Biener
The following removes the code path in vect_recog_mask_conversion_pattern
dealing with comparisons in COND_EXPRs.  That can no longer happen.

* tree-vect-patterns.cc (vect_recog_mask_conversion_pattern):
Remove COMPARISON_CLASS_P rhs1 of COND_EXPR case and assert
it doesn't happen.
---
 gcc/tree-vect-patterns.cc | 99 +--
 1 file changed, 2 insertions(+), 97 deletions(-)

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index a6d246f570c..46f439fb8a3 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -6240,8 +6240,6 @@ vect_recog_mask_conversion_pattern (vec_info *vinfo,
   tree lhs = NULL_TREE, rhs1, rhs2, tmp, rhs1_type, rhs2_type;
   tree vectype1, vectype2;
   stmt_vec_info pattern_stmt_info;
-  tree rhs1_op0 = NULL_TREE, rhs1_op1 = NULL_TREE;
-  tree rhs1_op0_type = NULL_TREE, rhs1_op1_type = NULL_TREE;
 
   /* Check for MASK_LOAD and MASK_STORE as well as COND_OP calls requiring mask
  conversion.  */
@@ -6331,60 +6329,13 @@ vect_recog_mask_conversion_pattern (vec_info *vinfo,
 {
   vectype1 = get_vectype_for_scalar_type (vinfo, TREE_TYPE (lhs));
 
+  gcc_assert (! COMPARISON_CLASS_P (rhs1));
   if (TREE_CODE (rhs1) == SSA_NAME)
{
  rhs1_type = integer_type_for_mask (rhs1, vinfo);
  if (!rhs1_type)
return NULL;
}
-  else if (COMPARISON_CLASS_P (rhs1))
-   {
- /* Check whether we're comparing scalar booleans and (if so)
-whether a better mask type exists than the mask associated
-with boolean-sized elements.  This avoids unnecessary packs
-and unpacks if the booleans are set from comparisons of
-wider types.  E.g. in:
-
-  int x1, x2, x3, x4, y1, y1;
-  ...
-  bool b1 = (x1 == x2);
-  bool b2 = (x3 == x4);
-  ... = b1 == b2 ? y1 : y2;
-
-it is better for b1 and b2 to use the mask type associated
-with int elements rather bool (byte) elements.  */
- rhs1_op0 = TREE_OPERAND (rhs1, 0);
- rhs1_op1 = TREE_OPERAND (rhs1, 1);
- if (!rhs1_op0 || !rhs1_op1)
-   return NULL;
- rhs1_op0_type = integer_type_for_mask (rhs1_op0, vinfo);
- rhs1_op1_type = integer_type_for_mask (rhs1_op1, vinfo);
-
- if (!rhs1_op0_type)
-   rhs1_type = TREE_TYPE (rhs1_op0);
- else if (!rhs1_op1_type)
-   rhs1_type = TREE_TYPE (rhs1_op1);
- else if (TYPE_PRECISION (rhs1_op0_type)
-  != TYPE_PRECISION (rhs1_op1_type))
-   {
- int tmp0 = (int) TYPE_PRECISION (rhs1_op0_type)
-- (int) TYPE_PRECISION (TREE_TYPE (lhs));
- int tmp1 = (int) TYPE_PRECISION (rhs1_op1_type)
-- (int) TYPE_PRECISION (TREE_TYPE (lhs));
- if ((tmp0 > 0 && tmp1 > 0) || (tmp0 < 0 && tmp1 < 0))
-   {
- if (abs (tmp0) > abs (tmp1))
-   rhs1_type = rhs1_op1_type;
- else
-   rhs1_type = rhs1_op0_type;
-   }
- else
-   rhs1_type = build_nonstandard_integer_type
- (TYPE_PRECISION (TREE_TYPE (lhs)), 1);
-   }
- else
-   rhs1_type = rhs1_op0_type;
-   }
   else
return NULL;
 
@@ -6400,55 +6351,9 @@ vect_recog_mask_conversion_pattern (vec_info *vinfo,
 its vector type) and behave as though the comparison was an SSA
 name from the outset.  */
   if (known_eq (TYPE_VECTOR_SUBPARTS (vectype1),
-   TYPE_VECTOR_SUBPARTS (vectype2))
- && !rhs1_op0_type
- && !rhs1_op1_type)
+   TYPE_VECTOR_SUBPARTS (vectype2)))
return NULL;
 
-  /* If rhs1 is invariant and we can promote it leave the COND_EXPR
- in place, we can handle it in vectorizable_condition.  This avoids
-unnecessary promotion stmts and increased vectorization factor.  */
-  if (COMPARISON_CLASS_P (rhs1)
- && INTEGRAL_TYPE_P (rhs1_type)
- && known_le (TYPE_VECTOR_SUBPARTS (vectype1),
-  TYPE_VECTOR_SUBPARTS (vectype2)))
-   {
- enum vect_def_type dt;
- if (vect_is_simple_use (TREE_OPERAND (rhs1, 0), vinfo, &dt)
- && dt == vect_external_def
- && vect_is_simple_use (TREE_OPERAND (rhs1, 1), vinfo, &dt)
- && (dt == vect_external_def
- || dt == vect_constant_def))
-   {
- tree wide_scalar_type = build_nonstandard_integer_type
-   (vector_element_bits (vectype1), TYPE_UNSIGNED (rhs1_type));
- tree vectype3 = get_vectype_for_scalar_type (vinfo,
-  wide_scalar_type);
- if (expand_vec_cond_expr_p (vectype1, vectype3, TREE_CODE (rhs1)))
-   return

[pushed] c++: printing AGGR_INIT_EXPR args

2024-10-29 Thread Jason Merrill
Tested x86_64-pc-linux-gnu, applying to trunk.

-- 8< --

PR30854 was about wrongly dumping the dummy object argument to a
constructor; r126582 in 4.3 fixed that by skipping the first argument.  But
not all functions called by AGGR_INIT_EXPR are constructors, as observed in
PR116634; we shouldn't skip for non-member functions.  And let's combine the
printing code for CALL_EXPR and AGGR_INIT_EXPR.

This doesn't make us accept the ill-formed 116634 testcase again with a
pedwarn, just fixes the diagnostic issue.

PR c++/30854
PR c++/116634

gcc/cp/ChangeLog:

* error.cc (dump_aggr_init_expr_args): Remove.
(dump_call_expr_args): Handle AGGR_INIT_EXPR.
(dump_expr): Combine AGGR_INIT_EXPR and CALL_EXPR cases.

gcc/testsuite/ChangeLog:

* g++.dg/coroutines/coro-bad-alloc-02-no-op-new-nt.C: Adjust
diagnostic.
* g++.dg/diagnostic/aggr-init1.C: New test.
---
 gcc/cp/error.cc   | 79 +--
 .../coro-bad-alloc-02-no-op-new-nt.C  |  2 +-
 gcc/testsuite/g++.dg/diagnostic/aggr-init1.C  | 36 +
 3 files changed, 55 insertions(+), 62 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/diagnostic/aggr-init1.C

diff --git a/gcc/cp/error.cc b/gcc/cp/error.cc
index 8381f950488..4a60fac9694 100644
--- a/gcc/cp/error.cc
+++ b/gcc/cp/error.cc
@@ -84,7 +84,6 @@ static void dump_type_prefix (cxx_pretty_printer *, tree, 
int);
 static void dump_type_suffix (cxx_pretty_printer *, tree, int);
 static void dump_function_name (cxx_pretty_printer *, tree, int);
 static void dump_call_expr_args (cxx_pretty_printer *, tree, int, bool);
-static void dump_aggr_init_expr_args (cxx_pretty_printer *, tree, int, bool);
 static void dump_expr_list (cxx_pretty_printer *, tree, int);
 static void dump_global_iord (cxx_pretty_printer *, tree);
 static void dump_parameters (cxx_pretty_printer *, tree, int);
@@ -2253,46 +2252,15 @@ dump_template_parms (cxx_pretty_printer *pp, tree info,
 static void
 dump_call_expr_args (cxx_pretty_printer *pp, tree t, int flags, bool skipfirst)
 {
-  tree arg;
-  call_expr_arg_iterator iter;
+  const int len = call_expr_nargs (t);
 
   pp_cxx_left_paren (pp);
-  FOR_EACH_CALL_EXPR_ARG (arg, iter, t)
+  for (int i = skipfirst; i < len; ++i)
 {
-  if (skipfirst)
-   skipfirst = false;
-  else
-   {
- dump_expr (pp, arg, flags | TFF_EXPR_IN_PARENS);
- if (more_call_expr_args_p (&iter))
-   pp_separate_with_comma (pp);
-   }
-}
-  pp_cxx_right_paren (pp);
-}
-
-/* Print out the arguments of AGGR_INIT_EXPR T as a parenthesized list
-   using flags FLAGS.  Skip over the first argument if SKIPFIRST is
-   true.  */
-
-static void
-dump_aggr_init_expr_args (cxx_pretty_printer *pp, tree t, int flags,
-  bool skipfirst)
-{
-  tree arg;
-  aggr_init_expr_arg_iterator iter;
-
-  pp_cxx_left_paren (pp);
-  FOR_EACH_AGGR_INIT_EXPR_ARG (arg, iter, t)
-{
-  if (skipfirst)
-   skipfirst = false;
-  else
-   {
- dump_expr (pp, arg, flags | TFF_EXPR_IN_PARENS);
- if (more_aggr_init_expr_args_p (&iter))
-   pp_separate_with_comma (pp);
-   }
+  tree arg = get_nth_callarg (t, i);
+  dump_expr (pp, arg, flags | TFF_EXPR_IN_PARENS);
+  if (i + 1 < len)
+   pp_separate_with_comma (pp);
 }
   pp_cxx_right_paren (pp);
 }
@@ -2451,28 +2419,9 @@ dump_expr (cxx_pretty_printer *pp, tree t, int flags)
   break;
 
 case AGGR_INIT_EXPR:
-  {
-   tree fn = NULL_TREE;
-
-   if (TREE_CODE (AGGR_INIT_EXPR_FN (t)) == ADDR_EXPR)
- fn = TREE_OPERAND (AGGR_INIT_EXPR_FN (t), 0);
-
-   if (fn && TREE_CODE (fn) == FUNCTION_DECL)
- {
-   if (DECL_CONSTRUCTOR_P (fn))
- dump_type (pp, DECL_CONTEXT (fn), flags);
-   else
- dump_decl (pp, fn, 0);
- }
-   else
- dump_expr (pp, AGGR_INIT_EXPR_FN (t), 0);
-  }
-  dump_aggr_init_expr_args (pp, t, flags, true);
-  break;
-
 case CALL_EXPR:
   {
-   tree fn = CALL_EXPR_FN (t);
+   tree fn = cp_get_callee (t);
bool skipfirst = false;
 
/* Deal with internal functions.  */
@@ -2494,8 +2443,10 @@ dump_expr (cxx_pretty_printer *pp, tree t, int flags)
&& NEXT_CODE (fn) == METHOD_TYPE
&& call_expr_nargs (t))
  {
-   tree ob = CALL_EXPR_ARG (t, 0);
-   if (TREE_CODE (ob) == ADDR_EXPR)
+   tree ob = get_nth_callarg (t, 0);
+   if (is_dummy_object (ob))
+ /* Don't print dummy object.  */;
+   else if (TREE_CODE (ob) == ADDR_EXPR)
  {
dump_expr (pp, TREE_OPERAND (ob, 0),
flags | TFF_EXPR_IN_PARENS);
@@ -2514,7 +2465,13 @@ dump_expr (cxx_pretty_printer *pp, tree t, int flags)
pp_string (cxx_pp, M_(""));
break;
  }
-   dump_expr (pp, fn, flags | TFF_EXPR_IN_
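
A hypothetical reduction in the spirit of the fix (not the committed
aggr-init1.C testcase): an AGGR_INIT_EXPR whose callee is a free function
must have all of its arguments printed; only member calls carry a dummy
object worth skipping.

struct A { A () {} A (const A &) {} };

A make (int x) { (void) x; return A (); }  // free function returning A by value

A a = make (42);   // a diagnostic should print 'make(42)', not 'make()'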
