docs: Document PFA support in GCC-15 changes

2025-04-23 Thread Tamar Christina
Hi All,

This documents the PFA support in GCC-15.

Ok for master?

Thanks,
Tamar

---
diff --git a/htdocs/gcc-15/changes.html b/htdocs/gcc-15/changes.html
index 
f03e29c8581f2749a968e592eae2e40ce3ca8521..7fb70b993c56ff43c09aeb7bfaa4479385679dec
 100644
--- a/htdocs/gcc-15/changes.html
+++ b/htdocs/gcc-15/changes.html
@@ -55,6 +55,11 @@ a work-in-progress.
 it also disables vectorization of epilogue loops but otherwise is equal
 to the cheap cost model.
   
+  The vectorizer now supports vectoring of loops with early exits where
+the number of elements for the input pointers are unknown through peeling
+for alignment.  This is supported for only for loops with fixed vector
+lengths.
+  
   -ftime-report now only reports monotonic run time instead of
 system and user time. This reduces the overhead of the option 
significantly,
 making it possible to use in standard build systems.


-- 
diff --git a/htdocs/gcc-15/changes.html b/htdocs/gcc-15/changes.html
index f03e29c8581f2749a968e592eae2e40ce3ca8521..7fb70b993c56ff43c09aeb7bfaa4479385679dec 100644
--- a/htdocs/gcc-15/changes.html
+++ b/htdocs/gcc-15/changes.html
@@ -55,6 +55,11 @@ a work-in-progress.
 it also disables vectorization of epilogue loops but otherwise is equal
 to the cheap cost model.
   
+  The vectorizer now supports vectoring of loops with early exits where
+the number of elements for the input pointers are unknown through peeling
+for alignment.  This is supported for only for loops with fixed vector
+lengths.
+  
   -ftime-report now only reports monotonic run time instead of
 system and user time. This reduces the overhead of the option significantly,
 making it possible to use in standard build systems.



Re: docs: Document PFA support in GCC-15 changes

2025-04-23 Thread Richard Biener
On Wed, 23 Apr 2025, Tamar Christina wrote:

> Hi All,
> 
> This documents the PFA support in GCC-15.
> 
> Ok for master?

OK.

> Thanks,
> Tamar
> 
> ---
> diff --git a/htdocs/gcc-15/changes.html b/htdocs/gcc-15/changes.html
> index 
> f03e29c8581f2749a968e592eae2e40ce3ca8521..7fb70b993c56ff43c09aeb7bfaa4479385679dec
>  100644
> --- a/htdocs/gcc-15/changes.html
> +++ b/htdocs/gcc-15/changes.html
> @@ -55,6 +55,11 @@ a work-in-progress.
>  it also disables vectorization of epilogue loops but otherwise is equal
>  to the cheap cost model.
>
> +  The vectorizer now supports vectoring of loops with early exits where
> +the number of elements for the input pointers are unknown through peeling
> +for alignment.  This is supported for only for loops with fixed vector
> +lengths.
> +  
>-ftime-report now only reports monotonic run time instead 
> of
>  system and user time. This reduces the overhead of the option 
> significantly,
>  making it possible to use in standard build systems.
> 
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


[committed] OpenMP: Add libgomp.fortran/target-enter-data-8.f90

2025-04-23 Thread Tobias Burnus

Looking through old patches, I came across this testcase.

It was originally part of the patch

[Patch] Fortran/OpenMP: Fix DT struct-component with 'alloc' and array descr
https://gcc.gnu.org/pipermail/gcc-patches/2022-November/604887.html

under the name testsuite/libgomp.fortran/target-enter-data-3.f90
or testsuite/libgomp.fortran/target-enter-data-3a.f90

And I think the code change landed as part of:
r15-9488-g99cd28c4733c2f Fortran/OpenMP: Support automatic mapping 
allocatable components (deep mapping) The history is a bit murky as the 
original patch was submitted 2022, landed in OG12 and was forward ported 
to OG14 with patches merged. In OG14, the code change is gone 
(presumably merged into a code change) and only the test case remained 
in commit: 06a430ade09 Fortran/OpenMP: Fix DT struct-component with 
'alloc' and array descr This commit is essentially that test case; 
however, I checked whether some #if 0 could be removed - and indeed all 
but one could be removed, i.e. more code actually works - whether fixed 
on the generic gfortran side or on the OpenMP, I don't know but I also 
don't care as long as it works :-) Committed as r16-92-gc9a8f2f9d39a31 
Tobias
commit c9a8f2f9d39a317ed67fb47157a995ea03c182d4
Author: Tobias Burnus 
Date:   Wed Apr 23 09:03:00 2025 +0200

OpenMP: Add libgomp.fortran/target-enter-data-8.f90

Add another testcase for Fortran deep mapping of allocatable components.

libgomp/ChangeLog:

* testsuite/libgomp.fortran/target-enter-data-8.f90: New test.

diff --git a/libgomp/testsuite/libgomp.fortran/target-enter-data-8.f90 b/libgomp/testsuite/libgomp.fortran/target-enter-data-8.f90
new file mode 100644
index 000..c6d671c1306
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/target-enter-data-8.f90
@@ -0,0 +1,532 @@
+! { dg-additional-options "-cpp" }
+
+! FIXME: Some tests do not work yet. Those are for now in '#if 0'
+
+! Check that 'map(alloc:' properly works with
+! - deferred-length character strings
+! - arrays with array descriptors
+! For those, the array descriptor / string length must be mapped with 'to:'
+
+program main
+implicit none
+
+type t
+  integer :: ic(2:5), ic2
+  character(len=11) :: ccstr(3:4), ccstr2
+  character(len=11,kind=4) :: cc4str(3:7), cc4str2
+  integer, pointer :: pc(:), pc2
+  character(len=:), pointer :: pcstr(:), pcstr2
+  character(len=:,kind=4), pointer :: pc4str(:), pc4str2
+end type t
+
+type(t) :: dt
+
+integer :: ii(5), ii2
+character(len=11) :: clstr(-1:1), clstr2
+character(len=11,kind=4) :: cl4str(0:3), cl4str2
+integer, pointer :: ip(:), ip2
+integer, allocatable :: ia(:), ia2
+character(len=:), pointer :: pstr(:), pstr2
+character(len=:), allocatable :: astr(:), astr2
+character(len=:,kind=4), pointer :: p4str(:), p4str2
+character(len=:,kind=4), allocatable :: a4str(:), a4str2
+
+
+allocate(dt%pc(5), dt%pc2)
+allocate(character(len=2) :: dt%pcstr(2))
+allocate(character(len=4) :: dt%pcstr2)
+
+allocate(character(len=3,kind=4) :: dt%pc4str(2:3))
+allocate(character(len=5,kind=4) :: dt%pc4str2)
+
+allocate(ip(5), ip2, ia(8), ia2)
+allocate(character(len=2) :: pstr(-2:0))
+allocate(character(len=4) :: pstr2)
+allocate(character(len=6) :: astr(3:5))
+allocate(character(len=8) :: astr2)
+
+allocate(character(len=3,kind=4) :: p4str(2:4))
+allocate(character(len=5,kind=4) :: p4str2)
+allocate(character(len=7,kind=4) :: a4str(-2:3))
+allocate(character(len=9,kind=4) :: a4str2)
+
+
+! integer :: ic(2:5), ic2
+
+!$omp target enter data map(alloc: dt%ic)
+!$omp target map(alloc: dt%ic)
+  if (size(dt%ic) /= 4) error stop
+  if (lbound(dt%ic, 1) /= 2) error stop
+  if (ubound(dt%ic, 1) /= 5) error stop
+  dt%ic = [22, 33, 44, 55]
+!$omp end target
+!$omp target exit data map(from: dt%ic)
+if (size(dt%ic) /= 4) error stop
+if (lbound(dt%ic, 1) /= 2) error stop
+if (ubound(dt%ic, 1) /= 5) error stop
+if (any (dt%ic /= [22, 33, 44, 55])) error stop
+
+!$omp target enter data map(alloc: dt%ic2)
+!$omp target map(alloc: dt%ic2)
+  dt%ic2 = 42
+!$omp end target
+!$omp target exit data map(from: dt%ic2)
+if (dt%ic2 /= 42) error stop
+
+
+! character(len=11) :: ccstr(3:4), ccstr2
+
+!$omp target enter data map(alloc: dt%ccstr)
+!$omp target map(alloc: dt%ccstr)
+  if (len(dt%ccstr) /= 11) error stop
+  if (size(dt%ccstr) /= 2) error stop
+  if (lbound(dt%ccstr, 1) /= 3) error stop
+  if (ubound(dt%ccstr, 1) /= 4) error stop
+  dt%ccstr = ["12345678901", "abcdefghijk"]
+!$omp end target
+!$omp target exit data map(from: dt%ccstr)
+if (len(dt%ccstr) /= 11) error stop
+if (size(dt%ccstr) /= 2) error stop
+if (lbound(dt%ccstr, 1) /= 3) error stop
+if (ubound(dt%ccstr, 1) /= 4) error stop
+if (any (dt%ccstr /= ["12345678901", "abcdefghijk"])) error stop
+
+!$omp target enter data map(alloc: dt%ccstr2)
+!$omp target map(alloc: dt%ccstr2)
+  if (len(dt%ccstr2) /= 11) error stop
+  dt%ccstr2 = "ABCDEFGHIJK"
+!$omp end target
+!$omp target exit data map(from: dt%ccstr2)
+if (len(dt%ccstr2) /= 11) error 

Re: [PATCH v2 1/3] RISC-V: Combine vec_duplicate + vadd.vv to vadd.vx on GR2VR cost

2025-04-23 Thread Robin Dapp
The only thing I think we want for the patch (as Pan also raised last time) is 
the param to set those .vx costs to zero in order to ensure the tests test the 
right thing (--param=vx_preferred/gr2vr_cost or something).


I see, shall we start a new series for this? AFAIK, we may need some more 
alignment
for something like --param=xx that exposing to end-user.

According to patchwork the tests you add pass but shouldn't they actually fail 
with a GR2VR cost of 2?  I must be missing something.


For now the cost of GR2VR is 2, take test vx_vadd-1-i64.c for example, the 
vec_dup + vadd.vv
has higher cost than vadd.vx, thus perform the late-combine as below.


Ah, I see, thanks.  So vec_dup costs 1 + 2 and vadd.vv costs 1 totalling 4 
while vadd.vx costs 1 + 2, making it cheaper?


IMHO vec_dup should just cost 2 (=GR2VR) rather than 3.  All it does is 
broadcast (no additional operation), while vadd.vx performs the broadcast (cost 
2) as well as an operation (cost 1).  So vec_dup + vadd.vv should cost 3, the 
same as vadd.vx.  In late combine when comparing costs we scale the them by 
"frequency".  The vadd.vx inside the loop should have higher frequency making 
it more costly by default.


With such a change the tests wouldn't pass by default (AFAICT) and we would 
need a --param=xx.  I wouldn't worry about exposing those details to the user 
for now as we're so early in the cycle and can easily iterate on it.  I would
suggest just adding something in order to make the tests work as expected and 
change things later (if needed).


--
Regards
Robin



Re: [PATCH] testsuite: AMDGCN test for vect-early-break_38.c as well to consistent architecture [PR119286]

2025-04-23 Thread Richard Biener
On Wed, 23 Apr 2025, Tamar Christina wrote:

> Hi All,
> 
> I had missed this one during the AMDGCN test failures.
> 
> Like vect-early-break_18.c this test is also scalaring the
> loads and thus leading to unexpected vectorization for this
> testcase.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Cross checked the failing case on amdgcn-amdhsa
> and all pass now.
> 
> Ok for master? and GCC 15?

OK for trunk and 15 if you manage before RC2.

> Thanks,
> Tamar
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/vect-early-break_38.c: Force -march=gfx908 for amdgcn.
> 
> ---
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_38.c 
> b/gcc/testsuite/gcc.dg/vect/vect-early-break_38.c
> index 
> 36fc6a6eb60fae70f8f05a3d9435f5adce025847..010e7ea7e327f4bb0e33560e24dd3e6c5462d659
>  100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_38.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_38.c
> @@ -2,6 +2,7 @@
>  /* { dg-do compile } */
>  /* { dg-require-effective-target vect_early_break } */
>  /* { dg-require-effective-target vect_int } */
> +/* { dg-additional-options "-march=gfx908" { target amdgcn*-*-* } } */
>  
>  #ifndef N
>  #define N 803
> 
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


[PATCH]middle-end: Add new "max" vector cost model

2025-04-23 Thread Tamar Christina
Hi All,

This patch proposes a new vector cost model called "max".  The cost model is an
intersection between two of our existing cost models.  Like `unlimited` it
disables the costing vs scalar and assumes all vectorization to be profitable.

But unlike unlimited it does not fully disable the vector cost model.  That
means that we still perform comparisons between vector modes.

As an example, the following:

void
foo (char *restrict a, int *restrict b, int *restrict c,
 int *restrict d, int stride)
{
if (stride <= 1)
return;

for (int i = 0; i < 3; i++)
{
int res = c[i];
int t = b[i * stride];
if (a[i] != 0)
res = t * d[i];
c[i] = res;
}
}

compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
vectorize as it assumes scalar would be faster, and with
-fvect-cost-model=unlimited it picks a vector type that's so big that the large
sequence generated is working on mostly inactive lanes:

...
and p3.b, p3/z, p4.b, p4.b
whilelo p0.s, wzr, w7
ld1wz23.s, p3/z, [x3, #3, mul vl]
ld1wz28.s, p0/z, [x5, z31.s, sxtw 2]
add x0, x5, x0
punpklo p6.h, p6.b
ld1wz27.s, p4/z, [x0, z31.s, sxtw 2]
and p6.b, p6/z, p0.b, p0.b
punpklo p4.h, p7.b
ld1wz24.s, p6/z, [x3, #2, mul vl]
and p4.b, p4/z, p2.b, p2.b
uqdecw  w6
ld1wz26.s, p4/z, [x3]
whilelo p1.s, wzr, w6
mul z27.s, p5/m, z27.s, z23.s
ld1wz29.s, p1/z, [x4, z31.s, sxtw 2]
punpkhi p7.h, p7.b
mul z24.s, p5/m, z24.s, z28.s
and p7.b, p7/z, p1.b, p1.b
mul z26.s, p5/m, z26.s, z30.s
ld1wz25.s, p7/z, [x3, #1, mul vl]
st1wz27.s, p3, [x2, #3, mul vl]
mul z25.s, p5/m, z25.s, z29.s
st1wz24.s, p6, [x2, #2, mul vl]
st1wz25.s, p7, [x2, #1, mul vl]
st1wz26.s, p4, [x2]
...

With -fvect-cost-model=max you get more reasonable code:

foo:
cmp w4, 1
ble .L1
ptrue   p7.s, vl3
index   z0.s, #0, w4
ld1bz29.s, p7/z, [x0]
ld1wz30.s, p7/z, [x1, z0.s, sxtw 2]
ptrue   p6.b, all
cmpne   p7.b, p7/z, z29.b, #0
ld1wz31.s, p7/z, [x3]
mul z31.s, p6/m, z31.s, z30.s
st1wz31.s, p7, [x2]
.L1:
ret

This model has been useful internally for performance exploration and cost-model
validation.  It allows us to force realistic vectorization overriding the cost
model to be able to tell whether it's correct wrt to profitability.

Bootstrapped Regtested on aarch64-none-linux-gnu,
arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
-m32, -m64 and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* common.opt (vect-cost-model, simd-cost-model): Add max cost model.
* doc/invoke.texi: Document it.
* flag-types.h (enum vect_cost_model): Add VECT_COST_MODEL_MAX.
* tree-vect-data-refs.cc (vect_peeling_hash_insert,
vect_peeling_hash_choose_best_peeling,
vect_enhance_data_refs_alignment): Use it.
* tree-vect-loop.cc (vect_analyze_loop_costing,
vect_estimate_min_profitable_iters): Likewise.

---
diff --git a/gcc/common.opt b/gcc/common.opt
index 
88d987e6ab14d9f8df7aa686efffc43418dbb42d..bd5e2e951f9388b12206d9addc736e336cd0e4ee
 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -3442,11 +3442,11 @@ Enable basic block vectorization (SLP) on trees.
 
 fvect-cost-model=
 Common Joined RejectNegative Enum(vect_cost_model) Var(flag_vect_cost_model) 
Init(VECT_COST_MODEL_DEFAULT) Optimization
--fvect-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the cost 
model for vectorization.
+-fvect-cost-model=[unlimited|max|dynamic|cheap|very-cheap] Specifies the 
cost model for vectorization.
 
 fsimd-cost-model=
 Common Joined RejectNegative Enum(vect_cost_model) Var(flag_simd_cost_model) 
Init(VECT_COST_MODEL_UNLIMITED) Optimization
--fsimd-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the 
vectorization cost model for code marked with a simd directive.
+-fsimd-cost-model=[unlimited|max|dynamic|cheap|very-cheap] Specifies the 
vectorization cost model for code marked with a simd directive.
 
 Enum
 Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown 
vectorizer cost model %qs)
@@ -3454,6 +3454,9 @@ Name(vect_cost_model) Type(enum vect_cost_model) 
UnknownError(unknown vectorizer
 EnumValue
 Enum(vect_cost_model) String(unlimited) Value(VECT_COST_MODEL_UNLIMITED)
 
+EnumValue
+Enum(vect_cost_model) String(max) Value(VECT_COST_MODEL_MAX)
+
 EnumValue
 Enum(vect_cost_model) String(dynamic) Value(VECT_COST_MODEL_DYNAMIC)
 
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 
14a78fd236f64185fc129f18b52b20692d49305c..e7b242c9134ff17022c92f81c8b24762cfd59c6c
 100644
--- a/gcc/doc/invoke.texi

Re: [PATCH v1 0/4] Refactor long function expand_const_vector

2025-04-23 Thread Robin Dapp

These patches LGTM from myside. But please wait for other folks to comment.


The series LGTM as well.  But please wait with merging until GCC 15.1 is 
released (as requested by the release maintainers).


--
Regards
Robin



[GCC16 stage1][PATCH v2 3/3] Use the counted_by attribute of pointers in array bound checker.

2025-04-23 Thread Qing Zhao
Current array bound checker only instruments ARRAY_REF, and the INDEX
information is the 2nd operand of the ARRAY_REF.

When extending the array bound checker to pointer references with
counted_by attributes, the hardest part is to get the INDEX of the
corresponding array ref from the offset computation expression of
the pointer ref.  I.e.

Given an OFFSET expression, and the ELEMENT_SIZE,
get the index expression from the OFFSET.
For example:
  OFFSET:
   ((long unsigned int) m * (long unsigned int) SAVE_EXPR ) * 4
  ELEMENT_SIZE:
   (sizetype) SAVE_EXPR  * 4
get the index as (long unsigned int) m.

gcc/c-family/ChangeLog:

* c-gimplify.cc (ubsan_walk_array_refs_r): Instrument INDIRECT_REF
with .ACCESS_WITH_SIZE in its address computation.
* c-ubsan.cc (ubsan_instrument_bounds): Format change.
(ubsan_instrument_bounds_pointer): New function.
(get_factors_from_mul_expr): New function.
(get_index_from_offset): New function.
(get_index_from_pointer_addr_expr): New function.
(is_instrumentable_pointer_array): New function.
(ubsan_array_ref_instrumented_p): Handle INDIRECT_REF.
(ubsan_maybe_instrument_array_ref): Handle INDIRECT_REF.

gcc/testsuite/ChangeLog:

* gcc.dg/ubsan/pointer-counted-by-bounds-2.c: New test.
* gcc.dg/ubsan/pointer-counted-by-bounds-3.c: New test.
* gcc.dg/ubsan/pointer-counted-by-bounds-4.c: New test.
* gcc.dg/ubsan/pointer-counted-by-bounds-5.c: New test.
* gcc.dg/ubsan/pointer-counted-by-bounds-6.c: New test.
* gcc.dg/ubsan/pointer-counted-by-bounds.c: New test.
---
 gcc/c-family/c-gimplify.cc|   7 +
 gcc/c-family/c-ubsan.cc   | 264 --
 .../ubsan/pointer-counted-by-bounds-2.c   |  47 
 .../ubsan/pointer-counted-by-bounds-3.c   |  35 +++
 .../ubsan/pointer-counted-by-bounds-4.c   |  35 +++
 .../ubsan/pointer-counted-by-bounds-5.c   |  46 +++
 .../ubsan/pointer-counted-by-bounds-6.c   |  33 +++
 .../gcc.dg/ubsan/pointer-counted-by-bounds.c  |  46 +++
 8 files changed, 496 insertions(+), 17 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/ubsan/pointer-counted-by-bounds-2.c
 create mode 100644 gcc/testsuite/gcc.dg/ubsan/pointer-counted-by-bounds-3.c
 create mode 100644 gcc/testsuite/gcc.dg/ubsan/pointer-counted-by-bounds-4.c
 create mode 100644 gcc/testsuite/gcc.dg/ubsan/pointer-counted-by-bounds-5.c
 create mode 100644 gcc/testsuite/gcc.dg/ubsan/pointer-counted-by-bounds-6.c
 create mode 100644 gcc/testsuite/gcc.dg/ubsan/pointer-counted-by-bounds.c

diff --git a/gcc/c-family/c-gimplify.cc b/gcc/c-family/c-gimplify.cc
index c6fb7646567..bfd16e6d081 100644
--- a/gcc/c-family/c-gimplify.cc
+++ b/gcc/c-family/c-gimplify.cc
@@ -121,6 +121,13 @@ ubsan_walk_array_refs_r (tree *tp, int *walk_subtrees, 
void *data)
   walk_tree (&TREE_OPERAND (*tp, 1), ubsan_walk_array_refs_r, pset, pset);
   walk_tree (&TREE_OPERAND (*tp, 0), ubsan_walk_array_refs_r, pset, pset);
 }
+  else if (TREE_CODE (*tp) == INDIRECT_REF
+  && TREE_CODE (TREE_OPERAND (*tp, 0)) == POINTER_PLUS_EXPR
+  && TREE_CODE (TREE_OPERAND (TREE_OPERAND (*tp, 0), 0))
+   == INDIRECT_REF)
+if (is_access_with_size_p
+   (TREE_OPERAND (TREE_OPERAND (TREE_OPERAND (*tp, 0), 0), 0)))
+ubsan_maybe_instrument_array_ref (tp, false);
   return NULL_TREE;
 }
 
diff --git a/gcc/c-family/c-ubsan.cc b/gcc/c-family/c-ubsan.cc
index 78b78685469..21fb0e312f7 100644
--- a/gcc/c-family/c-ubsan.cc
+++ b/gcc/c-family/c-ubsan.cc
@@ -420,7 +420,6 @@ get_bound_from_access_with_size (tree call)
   return size;
 }
 
-
 /* Instrument array bounds for ARRAY_REFs.  We create special builtin,
that gets expanded in the sanopt pass, and make an array dimension
of it.  ARRAY is the array, *INDEX is an index to the array.
@@ -450,8 +449,7 @@ ubsan_instrument_bounds (location_t loc, tree array, tree 
*index,
   && is_access_with_size_p ((TREE_OPERAND (array, 0
{
  bound = get_bound_from_access_with_size ((TREE_OPERAND (array, 0)));
- bound = fold_build2 (MINUS_EXPR, TREE_TYPE (bound),
-  bound,
+ bound = fold_build2 (MINUS_EXPR, TREE_TYPE (bound), bound,
   build_int_cst (TREE_TYPE (bound), 1));
}
   else
@@ -554,38 +552,270 @@ ubsan_instrument_bounds (location_t loc, tree array, 
tree *index,
   *index, bound);
 }
 
-/* Return true iff T is an array that was instrumented by SANITIZE_BOUNDS.  */
+
+/* Instrument array bounds for the pointer array whose base address
+   is a call to .ACCESS_WITH_SIZE.  We create special builtin, that
+   gets expanded in the sanopt pass, and make an array dimension of
+   it.  POINTER is the pointer array's base address, *INDEX is an
+   index to the array.
+   Return NULL_TREE if no instrumentation is emitted.  */
+
+tr

[PATCH] GCN, nvptx offloading: Host/device compatibility: Itanium C++ ABI, DSO Object Destruction API [PR119853, PR119854]

2025-04-23 Thread Thomas Schwinge
'__dso_handle' for '__cxa_atexit', '__cxa_finalize'.  See
.

PR target/119853
PR target/119854
libgcc/
* config/gcn/crt0.c (_fini_array): Call
'__GCC_offload___cxa_finalize'.
* config/nvptx/gbl-ctors.c (__static_do_global_dtors): Likewise.
libgomp/
* target-cxa-dso-dtor.c: New.
* config/accel/target-cxa-dso-dtor.c: Likewise.
* Makefile.am (libgomp_la_SOURCES): Add it.
* Makefile.in: Regenerate.
* testsuite/libgomp.c++/target-cdtor-1.C: New.
* testsuite/libgomp.c++/target-cdtor-2.C: Likewise.
---
 libgcc/config/gcn/crt0.c  |  32 
 libgcc/config/nvptx/gbl-ctors.c   |  16 ++
 libgomp/Makefile.am   |   2 +-
 libgomp/Makefile.in   |   7 +-
 libgomp/config/accel/target-cxa-dso-dtor.c|  62 
 libgomp/target-cxa-dso-dtor.c |   3 +
 .../testsuite/libgomp.c++/target-cdtor-1.C| 104 +
 .../testsuite/libgomp.c++/target-cdtor-2.C| 138 ++
 8 files changed, 361 insertions(+), 3 deletions(-)
 create mode 100644 libgomp/config/accel/target-cxa-dso-dtor.c
 create mode 100644 libgomp/target-cxa-dso-dtor.c
 create mode 100644 libgomp/testsuite/libgomp.c++/target-cdtor-1.C
 create mode 100644 libgomp/testsuite/libgomp.c++/target-cdtor-2.C

diff --git a/libgcc/config/gcn/crt0.c b/libgcc/config/gcn/crt0.c
index dbd6749a47f..cc23e214cf9 100644
--- a/libgcc/config/gcn/crt0.c
+++ b/libgcc/config/gcn/crt0.c
@@ -24,6 +24,28 @@ typedef long long size_t;
 /* Provide an entry point symbol to silence a linker warning.  */
 void _start() {}
 
+
+#define PR119369_fixed 0
+
+
+/* Host/device compatibility: '__cxa_finalize'.  Dummy; if necessary,
+   overridden via libgomp 'target-cxa-dso-dtor.c'.  */
+
+#if PR119369_fixed
+extern void __GCC_offload___cxa_finalize (void *) __attribute__((weak));
+#else
+void __GCC_offload___cxa_finalize (void *) __attribute__((weak));
+
+void __attribute__((weak))
+__GCC_offload___cxa_finalize (void *dso_handle __attribute__((unused)))
+{
+}
+#endif
+
+/* There are no DSOs; this is the main program.  */
+static void * const __dso_handle = 0;
+
+
 #ifdef USE_NEWLIB_INITFINI
 
 extern void __libc_init_array (void) __attribute__((weak));
@@ -38,6 +60,11 @@ void _init_array()
 __attribute__((amdgpu_hsa_kernel ()))
 void _fini_array()
 {
+#if PR119369_fixed
+  if (__GCC_offload___cxa_finalize)
+#endif
+__GCC_offload___cxa_finalize (__dso_handle);
+
   __libc_fini_array ();
 }
 
@@ -70,6 +97,11 @@ void _init_array()
 __attribute__((amdgpu_hsa_kernel ()))
 void _fini_array()
 {
+#if PR119369_fixed
+  if (__GCC_offload___cxa_finalize)
+#endif
+__GCC_offload___cxa_finalize (__dso_handle);
+
   size_t count;
   size_t i;
 
diff --git a/libgcc/config/nvptx/gbl-ctors.c b/libgcc/config/nvptx/gbl-ctors.c
index 26268116ee0..10954ee3ab6 100644
--- a/libgcc/config/nvptx/gbl-ctors.c
+++ b/libgcc/config/nvptx/gbl-ctors.c
@@ -31,6 +31,20 @@
 extern int atexit (void (*function) (void));
 
 
+/* Host/device compatibility: '__cxa_finalize'.  Dummy; if necessary,
+   overridden via libgomp 'target-cxa-dso-dtor.c'.  */
+
+extern void __GCC_offload___cxa_finalize (void *);
+
+void __attribute__((weak))
+__GCC_offload___cxa_finalize (void *dso_handle __attribute__((unused)))
+{
+}
+
+/* There are no DSOs; this is the main program.  */
+static void * const __dso_handle = 0;
+
+
 /* Handler functions ('static', in contrast to the 'gbl-ctors.h'
prototypes).  */
 
@@ -49,6 +63,8 @@ static void __static_do_global_dtors (void);
 static void
 __static_do_global_dtors (void)
 {
+  __GCC_offload___cxa_finalize (__dso_handle);
+
   func_ptr *p = __DTOR_LIST__;
   ++p;
   for (; *p; ++p)
diff --git a/libgomp/Makefile.am b/libgomp/Makefile.am
index e3202aeb0e0..19479aea462 100644
--- a/libgomp/Makefile.am
+++ b/libgomp/Makefile.am
@@ -70,7 +70,7 @@ libgomp_la_SOURCES = alloc.c atomic.c barrier.c critical.c 
env.c error.c \
target.c splay-tree.c libgomp-plugin.c oacc-parallel.c oacc-host.c \
oacc-init.c oacc-mem.c oacc-async.c oacc-plugin.c oacc-cuda.c \
priority_queue.c affinity-fmt.c teams.c allocator.c oacc-profiling.c \
-   oacc-target.c target-indirect.c
+   oacc-target.c target-indirect.c target-cxa-dso-dtor.c
 
 include $(top_srcdir)/plugin/Makefrag.am
 
diff --git a/libgomp/Makefile.in b/libgomp/Makefile.in
index 2a0a842af52..6d22b3d3bfd 100644
--- a/libgomp/Makefile.in
+++ b/libgomp/Makefile.in
@@ -219,7 +219,8 @@ am_libgomp_la_OBJECTS = alloc.lo atomic.lo barrier.lo 
critical.lo \
oacc-parallel.lo oacc-host.lo oacc-init.lo oacc-mem.lo \
oacc-async.lo oacc-plugin.lo oacc-cuda.lo priority_queue.lo \
affinity-fmt.lo teams.lo allocator.lo oacc-profiling.lo \
-   oacc-target.lo target-indirect.lo $(am__objects_1)
+   oacc-target.lo target-indirect.lo target-cxa-d

[GCC16 stage 1][PATCH v2 0/3] extend "counted_by" attribute to pointer fields of structures

2025-04-23 Thread Qing Zhao
Hi,

This is the 2nd version of the patch set to extend "counted_by" attribute
 to pointer fields of structures.

the first version was submitted 3 months ago on 1/16/2025, and triggered
a lot of discussion on whether we need a new syntax for counted_by
attribute.

https://gcc.gnu.org/pipermail/gcc-patches/2025-January/673837.html

After a long discussion since then: 
(https://gcc.gnu.org/pipermail/gcc-patches/2025-March/677024.html)

We agreed to the following compromised solution:

1. Keep the current syntax of counted_by for lone identifier;
2. Add a new attribute "counted_by_exp" for expressions.

Although there are still some discussion going on for the new 
counted_by_exp attribute (In Clang community) 
https://discourse.llvm.org/t/rfc-bounds-safety-in-c-syntax-compatibility-with-gcc/85885

The syntax for the lone identifier is kept the same as before.

So, I'd like to resubmit my previous patch of extending "counted_by"
to pointer fields of structures. 

The whole patch set has been rebased on the latest trunk, some testing case
adjustment,  bootstrapped  and regression tested on both aarch64 and x86.

There will be a seperate patch set for the new "counted_by_exp" 
attribute later to cover the expressions cases.

The following are more details on this patch set:

For example:

struct PP {
  size_t count2;
  char other1;
  char *array2 __attribute__ ((counted_by (count2)));
  int other2;
} *pp;

specifies that the "array2" is an array that is pointed by the
pointer field, and its number of elements is given by the field
"count2" in the same structure.

There are the following importand facts about "counted_by" on pointer
fields compared to the "counted_by" on FAM fields:

1. one more new requirement for pointer fields with "counted_by" attribute:
   pp->array2 and pp->count2 can ONLY be changed by changing the whole structure
   at the same time.

2. the following feature for FAM field with "counted_by" attribute is NOT
   valid for the pointer field any more:

" One important feature of the attribute is, a reference to the
 flexible array member field uses the latest value assigned to the
 field that represents the number of the elements before that
 reference.  For example,

p->count = val1;
p->array[20] = 0;  // ref1 to p->array
p->count = val2;
p->array[30] = 0;  // ref2 to p->array

 in the above, 'ref1' uses 'val1' as the number of the elements in
 'p->array', and 'ref2' uses 'val2' as the number of elements in
 'p->array'. "

This patch set includes 3 parts:

1.Extend "counted_by" attribute to pointer fields of structures. 
2.Convert a pointer reference with counted_by attribute to .ACCESS_WITH_SIZE
and use it in builtinin-object-size.
3.Use the counted_by attribute of pointers in array bound checker.

In which, the patch 1 and 2 are simple and straightforward, however, the patch 
3  
is a little complicate due to the following reason:

Current array bound checker only instruments ARRAY_REF, and the INDEX
information is the 2nd operand of the ARRAY_REF.

When extending the array bound checker to pointer references with
counted_by attributes, the hardest part is to get the INDEX of the
corresponding array ref from the offset computation expression of
the pointer ref. 

I do need some careful review on the 3rd part of the patch. And I do wonder
for the access to pointer arrays:

struct annotated {
  int b;
  int *c __attribute__ ((counted_by (b)));
} *p_array_annotated;

p_array_annotated->c[annotated_index] = 2;

Is it possible to generate ARRAY_REF instead of INDIRECT_REF for the above 
p_array_annotated->c[annotated_index]
in C FE? then we can keep the INDEX info in the IR and avoid all the hacks 
to get the index from the OFFSET computation expression.

The whole patch set has been rebased on the latest trunk, bootstrapped 
and regression tested on both aarch64 and x86.

Let me know any comments and suggestions.
 
Thanks.

Qing


Re: [PATCH] [x86] Generate 2 FMA instructions in ix86_expand_swdivsf.

2025-04-23 Thread Hongtao Liu
On Thu, Apr 24, 2025 at 12:54 AM Jan Hubicka  wrote:
>
> > From: "hongtao.liu" 
> >
> > When FMA is available, N-R step can be rewritten with
> >
> > a / b = (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a
> >
> > which have 2 fma generated.[1]
> >
> > [1] https://bugs.llvm.org/show_bug.cgi?id=21385
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?
>
> How this behaves on CPUs where FMA has longer latency then addition when
> swdifsf is on the critical path through the loop?
For the original N-R step, addition couldn't be on the cross-iteration
critical path since it's internal inside the N-R step, only
multiplication could be on the critical path.

It's like
/* a / b = a * ((rcp(b) + rcp(b)) - (b * rcp(b) * rcp (b))) */
   x0 = rcp(b)
 /  \
e0 = x0 *b  e1 = x0 + x0
  |  /
e0 = x0 * e0/
  \   /
  x1 = e1 - e0
|
  res = a * x1 (multiplication here)

For the new N-R step, even the last operation is addition, I don't
think it can be on the cross-iteration critical path since there's
multication to get either e0/e2.
/* a / b = (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a */
 x0 = rcp(b)
|
 e0 = x0 * a
|
  e1 = e0 * b
|
 x1 = a - e1
|
  e2 = x0  * x1
|
res = e0  + e2 (addition here)



>
> Honza
> >
> >
> > gcc/ChangeLog:
> >
> >   * config/i386/i386-expand.cc (ix86_emit_swdivsf): Generate 2
> >   FMA instructions when TARGET_FMA.
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * gcc.target/i386/recip-vec-divf-fma.c: New test.
> > ---
> >  gcc/config/i386/i386-expand.cc| 44 ++-
> >  .../gcc.target/i386/recip-vec-divf-fma.c  | 12 +
> >  2 files changed, 44 insertions(+), 12 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/recip-vec-divf-fma.c
> >
> > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> > index cdfd94d3c73..4fffbfdd574 100644
> > --- a/gcc/config/i386/i386-expand.cc
> > +++ b/gcc/config/i386/i386-expand.cc
> > @@ -19256,8 +19256,6 @@ ix86_emit_swdivsf (rtx res, rtx a, rtx b, 
> > machine_mode mode)
> >e1 = gen_reg_rtx (mode);
> >x1 = gen_reg_rtx (mode);
> >
> > -  /* a / b = a * ((rcp(b) + rcp(b)) - (b * rcp(b) * rcp (b))) */
> > -
> >b = force_reg (mode, b);
> >
> >/* x0 = rcp(b) estimate */
> > @@ -19270,20 +19268,42 @@ ix86_emit_swdivsf (rtx res, rtx a, rtx b, 
> > machine_mode mode)
> >  emit_insn (gen_rtx_SET (x0, gen_rtx_UNSPEC (mode, gen_rtvec (1, b),
> >   UNSPEC_RCP)));
> >
> > -  /* e0 = x0 * b */
> > -  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, b)));
> > +  unsigned vector_size = GET_MODE_SIZE (mode);
> >
> > -  /* e0 = x0 * e0 */
> > -  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, e0)));
> > +  /* (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a
> > + N-R step with 2 fma implementation.  */
> > +  if (TARGET_FMA
> > +  || (TARGET_AVX512F && vector_size == 64)
> > +  || (TARGET_AVX512VL && (vector_size == 32 || vector_size == 16)))
> > +{
> > +  /* e0 = x0 * a  */
> > +  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, a)));
> > +  /* e1 = e0 * b - a  */
> > +  emit_insn (gen_rtx_SET (e1, gen_rtx_FMA (mode, e0, b,
> > +gen_rtx_NEG (mode, a;
> > +  /* res = - e1 * x0 + e0  */
> > +  emit_insn (gen_rtx_SET (res, gen_rtx_FMA (mode,
> > +gen_rtx_NEG (mode, e1),
> > +x0, e0)));
> > +}
> > +/* a / b = a * ((rcp(b) + rcp(b)) - (b * rcp(b) * rcp (b))) */
> > +  else
> > +{
> > +  /* e0 = x0 * b */
> > +  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, b)));
> >
> > -  /* e1 = x0 + x0 */
> > -  emit_insn (gen_rtx_SET (e1, gen_rtx_PLUS (mode, x0, x0)));
> > +  /* e1 = x0 + x0 */
> > +  emit_insn (gen_rtx_SET (e1, gen_rtx_PLUS (mode, x0, x0)));
> >
> > -  /* x1 = e1 - e0 */
> > -  emit_insn (gen_rtx_SET (x1, gen_rtx_MINUS (mode, e1, e0)));
> > +  /* e0 = x0 * e0 */
> > +  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, e0)));
> >
> > -  /* res = a * x1 */
> > -  emit_insn (gen_rtx_SET (res, gen_rtx_MULT (mode, a, x1)));
> > +  /* x1 = e1 - e0 */
> > +  emit_insn (gen_rtx_SET (x1, gen_rtx_MINUS (mode, e1, e0)));
> > +
> > +  /* res = a * x1 */
> > +  emit_insn (gen_rtx_SET (res, gen_rtx_MULT (mode, a, x1)));
> > +}
> >  }
> >
> >  /* Output code to perform a Newton-Rhapson approximation of a
> > diff --git a/gcc/testsuite/gcc.target/i386/recip-vec-divf-fma.c 
> > b/gcc/testsuite/gcc.target/i386/recip-vec-divf-fma.c
> > new file mode 100644
> > index 000..ad9e07b1eb6
> > --- /dev/null
> > +++ b/gcc/testsui

Re: [PATCH] Add a bootstrap-native build config

2025-04-23 Thread Richard Biener
On Tue, Apr 22, 2025 at 5:43 PM Andi Kleen  wrote:
>
> On 2025-04-22 13:22, Richard Biener wrote:
> > On Sat, Apr 12, 2025 at 5:09 PM Andi Kleen  wrote:
> >>
> >> From: Andi Kleen 
> >>
> >> ... that uses -march=native -mtune=native to build a compiler
> >> optimized
> >> for the host.
> >
> > -march=native implies -mtune=native so I think the latter is redundant.
>
> Ok with that change?

Put the list back in the loop.

>
> >
> >> config/ChangeLog:
> >>
> >> * bootstrap-native.mk: New file.
> >>
> >> gcc/ChangeLog:
> >>
> >> * doc/install.texi: Document bootstrap-native.
> >> ---
> >>  config/bootstrap-native.mk | 1 +
> >>  gcc/doc/install.texi   | 7 +++
> >>  2 files changed, 8 insertions(+)
> >>  create mode 100644 config/bootstrap-native.mk
> >>
> >> diff --git a/config/bootstrap-native.mk b/config/bootstrap-native.mk
> >> new file mode 100644
> >> index 000..a4a3d859408
> >> --- /dev/null
> >> +++ b/config/bootstrap-native.mk
> >> @@ -0,0 +1 @@
> >> +BOOT_CFLAGS := -march=native -mtune=native $(BOOT_CFLAGS)
> >
> > bootstrap-O3 uses
> >
> > BOOT_CFLAGS := -O3 $(filter-out -O%, $(BOOT_CFLAGS))
> >
> > so do you want to filer-out other -march/-mtune/-mcpu options?
>
> I don't think that is needed because these are usually not used (unlike
> -O)
>
> >
> > Some targets know -mcpu= instead of -march=, did you check whether
> > any of those have =native?
>
> There are some like Alpha and others dont jave it at all. That is the
> why the documentation says "if supported".

I see.

So yes, OK with the above change.

Richard.

> >
> >> diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
> >> index 4973f195daf..04a2256b97a 100644
> >> --- a/gcc/doc/install.texi
> >> +++ b/gcc/doc/install.texi
> >> @@ -3052,6 +3052,13 @@ Removes any @option{-O}-started option from
> >> @code{BOOT_CFLAGS}, and adds
> >>  @itemx @samp{bootstrap-Og}
> >>  Analogous to @code{bootstrap-O1}.
> >>
> >> +@item @samp{bootstrap-native}
> >> +@itemx @samp{bootstrap-native}
> >> +Optimize the compiler code for the build host, if supported by the
> >> +architecture. Note this only affects the compiler, not the targeted
> >> +code. If you want the later, choose options suitable to the target
> >> you
> >> +are looking for. For example @samp{--with-cpu} would be a good
> >> starting point.
> >> +
> >>  @item @samp{bootstrap-lto}
> >>  Enables Link-Time Optimization for host tools during bootstrapping.
> >>  @samp{BUILD_CONFIG=bootstrap-lto} is equivalent to adding
> >> --
> >> 2.47.1
> >>


Re: [PATCH]middle-end: Add new "max" vector cost model

2025-04-23 Thread Kyrylo Tkachov



> On 23 Apr 2025, at 08:37, Tamar Christina  wrote:
> 
> Hi All,
> 
> This patch proposes a new vector cost model called "max".  The cost model is 
> an
> intersection between two of our existing cost models.  Like `unlimited` it
> disables the costing vs scalar and assumes all vectorization to be profitable.
> 
> But unlike unlimited it does not fully disable the vector cost model.  That
> means that we still perform comparisons between vector modes.
> 
> As an example, the following:
> 
> void
> foo (char *restrict a, int *restrict b, int *restrict c,
> int *restrict d, int stride)
> {
>if (stride <= 1)
>return;
> 
>for (int i = 0; i < 3; i++)
>{
>int res = c[i];
>int t = b[i * stride];
>if (a[i] != 0)
>res = t * d[i];
>c[i] = res;
>}
> }
> 
> compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
> vectorize as it assumes scalar would be faster, and with
> -fvect-cost-model=unlimited it picks a vector type that's so big that the 
> large
> sequence generated is working on mostly inactive lanes:
> 
>...
>and p3.b, p3/z, p4.b, p4.b
>whilelo p0.s, wzr, w7
>ld1wz23.s, p3/z, [x3, #3, mul vl]
>ld1wz28.s, p0/z, [x5, z31.s, sxtw 2]
>add x0, x5, x0
>punpklo p6.h, p6.b
>ld1wz27.s, p4/z, [x0, z31.s, sxtw 2]
>and p6.b, p6/z, p0.b, p0.b
>punpklo p4.h, p7.b
>ld1wz24.s, p6/z, [x3, #2, mul vl]
>and p4.b, p4/z, p2.b, p2.b
>uqdecw  w6
>ld1wz26.s, p4/z, [x3]
>whilelo p1.s, wzr, w6
>mul z27.s, p5/m, z27.s, z23.s
>ld1wz29.s, p1/z, [x4, z31.s, sxtw 2]
>punpkhi p7.h, p7.b
>mul z24.s, p5/m, z24.s, z28.s
>and p7.b, p7/z, p1.b, p1.b
>mul z26.s, p5/m, z26.s, z30.s
>ld1wz25.s, p7/z, [x3, #1, mul vl]
>st1wz27.s, p3, [x2, #3, mul vl]
>mul z25.s, p5/m, z25.s, z29.s
>st1wz24.s, p6, [x2, #2, mul vl]
>st1wz25.s, p7, [x2, #1, mul vl]
>st1wz26.s, p4, [x2]
>...
> 
> With -fvect-cost-model=max you get more reasonable code:
> 
> foo:
>cmp w4, 1
>ble .L1
>ptrue   p7.s, vl3
>index   z0.s, #0, w4
>ld1bz29.s, p7/z, [x0]
>ld1wz30.s, p7/z, [x1, z0.s, sxtw 2]
> ptrue   p6.b, all
>cmpne   p7.b, p7/z, z29.b, #0
>ld1wz31.s, p7/z, [x3]
> mul z31.s, p6/m, z31.s, z30.s
>st1wz31.s, p7, [x2]
> .L1:
>ret
> 
> This model has been useful internally for performance exploration and 
> cost-model
> validation.  It allows us to force realistic vectorization overriding the cost
> model to be able to tell whether it's correct wrt to profitability.

Thanks for this, it looks really useful.


> 
> Bootstrapped Regtested on aarch64-none-linux-gnu,
> arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> -m32, -m64 and no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> * common.opt (vect-cost-model, simd-cost-model): Add max cost model.
> * doc/invoke.texi: Document it.
> * flag-types.h (enum vect_cost_model): Add VECT_COST_MODEL_MAX.
> * tree-vect-data-refs.cc (vect_peeling_hash_insert,
> vect_peeling_hash_choose_best_peeling,
> vect_enhance_data_refs_alignment): Use it.
> * tree-vect-loop.cc (vect_analyze_loop_costing,
> vect_estimate_min_profitable_iters): Likewise.
> 
> ---
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 
> 88d987e6ab14d9f8df7aa686efffc43418dbb42d..bd5e2e951f9388b12206d9addc736e336cd0e4ee
>  100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -3442,11 +3442,11 @@ Enable basic block vectorization (SLP) on trees.
> 
> fvect-cost-model=
> Common Joined RejectNegative Enum(vect_cost_model) Var(flag_vect_cost_model) 
> Init(VECT_COST_MODEL_DEFAULT) Optimization
> --fvect-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the cost 
> model for vectorization.
> +-fvect-cost-model=[unlimited|max|dynamic|cheap|very-cheap] Specifies the 
> cost model for vectorization.
> 
> fsimd-cost-model=
> Common Joined RejectNegative Enum(vect_cost_model) Var(flag_simd_cost_model) 
> Init(VECT_COST_MODEL_UNLIMITED) Optimization
> --fsimd-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the 
> vectorization cost model for code marked with a simd directive.
> +-fsimd-cost-model=[unlimited|max|dynamic|cheap|very-cheap] Specifies the 
> vectorization cost model for code marked with a simd directive.
> 
> Enum
> Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown 
> vectorizer cost model %qs)
> @@ -3454,6 +3454,9 @@ Name(vect_cost_model) Type(enum vect_cost_model) 
> UnknownError(unknown vectorizer
> EnumValue
> Enum(vect_cost_model) String(unlimited) Value(VECT_COST_MODEL_UNLIMITED)
> 
> +EnumValue
> +Enum(vect_cost_model) String(max) Value(VECT_COST_MODEL_MAX)
> +
> EnumValue
> Enum(vec

RE: [PATCH]middle-end: Add new "max" vector cost model

2025-04-23 Thread Richard Biener
On Wed, 23 Apr 2025, Tamar Christina wrote:

> > -Original Message-
> > From: Richard Sandiford 
> > Sent: Wednesday, April 23, 2025 9:45 AM
> > To: Tamar Christina 
> > Cc: Richard Biener ; gcc-patches@gcc.gnu.org; nd
> > 
> > Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> > 
> > Tamar Christina  writes:
> > >> -Original Message-
> > >> From: Richard Biener 
> > >> Sent: Wednesday, April 23, 2025 9:31 AM
> > >> To: Tamar Christina 
> > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Sandiford
> > >> 
> > >> Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> > >>
> > >> On Wed, 23 Apr 2025, Tamar Christina wrote:
> > >>
> > >> > Hi All,
> > >> >
> > >> > This patch proposes a new vector cost model called "max".  The cost 
> > >> > model is
> > an
> > >> > intersection between two of our existing cost models.  Like 
> > >> > `unlimited` it
> > >> > disables the costing vs scalar and assumes all vectorization to be 
> > >> > profitable.
> > >> >
> > >> > But unlike unlimited it does not fully disable the vector cost model.  
> > >> > That
> > >> > means that we still perform comparisons between vector modes.
> > >> >
> > >> > As an example, the following:
> > >> >
> > >> > void
> > >> > foo (char *restrict a, int *restrict b, int *restrict c,
> > >> >  int *restrict d, int stride)
> > >> > {
> > >> > if (stride <= 1)
> > >> > return;
> > >> >
> > >> > for (int i = 0; i < 3; i++)
> > >> > {
> > >> > int res = c[i];
> > >> > int t = b[i * stride];
> > >> > if (a[i] != 0)
> > >> > res = t * d[i];
> > >> > c[i] = res;
> > >> > }
> > >> > }
> > >> >
> > >> > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
> > >> > vectorize as it assumes scalar would be faster, and with
> > >> > -fvect-cost-model=unlimited it picks a vector type that's so big that 
> > >> > the large
> > >> > sequence generated is working on mostly inactive lanes:
> > >> >
> > >> > ...
> > >> > and p3.b, p3/z, p4.b, p4.b
> > >> > whilelo p0.s, wzr, w7
> > >> > ld1wz23.s, p3/z, [x3, #3, mul vl]
> > >> > ld1wz28.s, p0/z, [x5, z31.s, sxtw 2]
> > >> > add x0, x5, x0
> > >> > punpklo p6.h, p6.b
> > >> > ld1wz27.s, p4/z, [x0, z31.s, sxtw 2]
> > >> > and p6.b, p6/z, p0.b, p0.b
> > >> > punpklo p4.h, p7.b
> > >> > ld1wz24.s, p6/z, [x3, #2, mul vl]
> > >> > and p4.b, p4/z, p2.b, p2.b
> > >> > uqdecw  w6
> > >> > ld1wz26.s, p4/z, [x3]
> > >> > whilelo p1.s, wzr, w6
> > >> > mul z27.s, p5/m, z27.s, z23.s
> > >> > ld1wz29.s, p1/z, [x4, z31.s, sxtw 2]
> > >> > punpkhi p7.h, p7.b
> > >> > mul z24.s, p5/m, z24.s, z28.s
> > >> > and p7.b, p7/z, p1.b, p1.b
> > >> > mul z26.s, p5/m, z26.s, z30.s
> > >> > ld1wz25.s, p7/z, [x3, #1, mul vl]
> > >> > st1wz27.s, p3, [x2, #3, mul vl]
> > >> > mul z25.s, p5/m, z25.s, z29.s
> > >> > st1wz24.s, p6, [x2, #2, mul vl]
> > >> > st1wz25.s, p7, [x2, #1, mul vl]
> > >> > st1wz26.s, p4, [x2]
> > >> > ...
> > >> >
> > >> > With -fvect-cost-model=max you get more reasonable code:
> > >> >
> > >> > foo:
> > >> > cmp w4, 1
> > >> > ble .L1
> > >> > ptrue   p7.s, vl3
> > >> > index   z0.s, #0, w4
> > >> > ld1bz29.s, p7/z, [x0]
> > >> > ld1wz30.s, p7/z, [x1, z0.s, sxtw 2]
> > >> >ptrue   p6.b, all
> > >> > cmpne   p7.b, p7/z, z29.b, #0
> > >> > ld1wz31.s, p7/z, [x3]
> > >> >mul z31.s, p6/m, z31.s, z30.s
> > >> > st1wz31.s, p7, [x2]
> > >> > .L1:
> > >> > ret
> > >> >
> > >> > This model has been useful internally for performance exploration and 
> > >> > cost-
> > >> model
> > >> > validation.  It allows us to force realistic vectorization overriding 
> > >> > the cost
> > >> > model to be able to tell whether it's correct wrt to profitability.
> > >> >
> > >> > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > >> > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > >> > -m32, -m64 and no issues.
> > >> >
> > >> > Ok for master?
> > >>
> > >> Hmm.  I don't like another cost model.  Instead how about changing
> > >> 'unlimited' to still iterate through vector sizes?  Cost modeling
> > >> is really about vector vs. scalar, not vector vs. vector which is
> > >> completely under target control.  Targets should provide a way
> > >> to limit iteration, like aarch64 has with the aarch64-autovec-preference
> > >> --param or x86 has with -mprefer-vector-width.
> > >>
> > >
> > > I'm ok with changing 'unlimited' if that's preferred, but I do want to 
> > > point
> > > out that we don't have enough control with current --param or -m options
> > > to simulat

RE: [PATCH]middle-end: Add new "max" vector cost model

2025-04-23 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Wednesday, April 23, 2025 10:14 AM
> To: Tamar Christina 
> Cc: Richard Sandiford ; gcc-patches@gcc.gnu.org;
> nd 
> Subject: RE: [PATCH]middle-end: Add new "max" vector cost model
> 
> On Wed, 23 Apr 2025, Tamar Christina wrote:
> 
> > > -Original Message-
> > > From: Richard Sandiford 
> > > Sent: Wednesday, April 23, 2025 9:45 AM
> > > To: Tamar Christina 
> > > Cc: Richard Biener ; gcc-patches@gcc.gnu.org; nd
> > > 
> > > Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> > >
> > > Tamar Christina  writes:
> > > >> -Original Message-
> > > >> From: Richard Biener 
> > > >> Sent: Wednesday, April 23, 2025 9:31 AM
> > > >> To: Tamar Christina 
> > > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Sandiford
> > > >> 
> > > >> Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> > > >>
> > > >> On Wed, 23 Apr 2025, Tamar Christina wrote:
> > > >>
> > > >> > Hi All,
> > > >> >
> > > >> > This patch proposes a new vector cost model called "max".  The cost 
> > > >> > model
> is
> > > an
> > > >> > intersection between two of our existing cost models.  Like 
> > > >> > `unlimited` it
> > > >> > disables the costing vs scalar and assumes all vectorization to be 
> > > >> > profitable.
> > > >> >
> > > >> > But unlike unlimited it does not fully disable the vector cost 
> > > >> > model.  That
> > > >> > means that we still perform comparisons between vector modes.
> > > >> >
> > > >> > As an example, the following:
> > > >> >
> > > >> > void
> > > >> > foo (char *restrict a, int *restrict b, int *restrict c,
> > > >> >  int *restrict d, int stride)
> > > >> > {
> > > >> > if (stride <= 1)
> > > >> > return;
> > > >> >
> > > >> > for (int i = 0; i < 3; i++)
> > > >> > {
> > > >> > int res = c[i];
> > > >> > int t = b[i * stride];
> > > >> > if (a[i] != 0)
> > > >> > res = t * d[i];
> > > >> > c[i] = res;
> > > >> > }
> > > >> > }
> > > >> >
> > > >> > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails
> to
> > > >> > vectorize as it assumes scalar would be faster, and with
> > > >> > -fvect-cost-model=unlimited it picks a vector type that's so big 
> > > >> > that the
> large
> > > >> > sequence generated is working on mostly inactive lanes:
> > > >> >
> > > >> > ...
> > > >> > and p3.b, p3/z, p4.b, p4.b
> > > >> > whilelo p0.s, wzr, w7
> > > >> > ld1wz23.s, p3/z, [x3, #3, mul vl]
> > > >> > ld1wz28.s, p0/z, [x5, z31.s, sxtw 2]
> > > >> > add x0, x5, x0
> > > >> > punpklo p6.h, p6.b
> > > >> > ld1wz27.s, p4/z, [x0, z31.s, sxtw 2]
> > > >> > and p6.b, p6/z, p0.b, p0.b
> > > >> > punpklo p4.h, p7.b
> > > >> > ld1wz24.s, p6/z, [x3, #2, mul vl]
> > > >> > and p4.b, p4/z, p2.b, p2.b
> > > >> > uqdecw  w6
> > > >> > ld1wz26.s, p4/z, [x3]
> > > >> > whilelo p1.s, wzr, w6
> > > >> > mul z27.s, p5/m, z27.s, z23.s
> > > >> > ld1wz29.s, p1/z, [x4, z31.s, sxtw 2]
> > > >> > punpkhi p7.h, p7.b
> > > >> > mul z24.s, p5/m, z24.s, z28.s
> > > >> > and p7.b, p7/z, p1.b, p1.b
> > > >> > mul z26.s, p5/m, z26.s, z30.s
> > > >> > ld1wz25.s, p7/z, [x3, #1, mul vl]
> > > >> > st1wz27.s, p3, [x2, #3, mul vl]
> > > >> > mul z25.s, p5/m, z25.s, z29.s
> > > >> > st1wz24.s, p6, [x2, #2, mul vl]
> > > >> > st1wz25.s, p7, [x2, #1, mul vl]
> > > >> > st1wz26.s, p4, [x2]
> > > >> > ...
> > > >> >
> > > >> > With -fvect-cost-model=max you get more reasonable code:
> > > >> >
> > > >> > foo:
> > > >> > cmp w4, 1
> > > >> > ble .L1
> > > >> > ptrue   p7.s, vl3
> > > >> > index   z0.s, #0, w4
> > > >> > ld1bz29.s, p7/z, [x0]
> > > >> > ld1wz30.s, p7/z, [x1, z0.s, sxtw 2]
> > > >> >  ptrue   p6.b, all
> > > >> > cmpne   p7.b, p7/z, z29.b, #0
> > > >> > ld1wz31.s, p7/z, [x3]
> > > >> >  mul z31.s, p6/m, z31.s, z30.s
> > > >> > st1wz31.s, p7, [x2]
> > > >> > .L1:
> > > >> > ret
> > > >> >
> > > >> > This model has been useful internally for performance exploration and
> cost-
> > > >> model
> > > >> > validation.  It allows us to force realistic vectorization 
> > > >> > overriding the cost
> > > >> > model to be able to tell whether it's correct wrt to profitability.
> > > >> >
> > > >> > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > >> > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > > >> > -m32, -m64 and no issues.
> > > >> >
> > > >> > Ok for master?
> > > >>
> > > >> Hmm.  I don't like another cost model.  Instead how about changing
> > > >> 'unlimited' to still iterate through vector sizes?  Cost modeli

RE: [PATCH] Add a bootstrap-native build config

2025-04-23 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Wednesday, April 23, 2025 9:19 AM
> To: Andi Kleen ; GCC Patches 
> Subject: Re: [PATCH] Add a bootstrap-native build config
> 
> On Tue, Apr 22, 2025 at 5:43 PM Andi Kleen  wrote:
> >
> > On 2025-04-22 13:22, Richard Biener wrote:
> > > On Sat, Apr 12, 2025 at 5:09 PM Andi Kleen  wrote:
> > >>
> > >> From: Andi Kleen 
> > >>
> > >> ... that uses -march=native -mtune=native to build a compiler
> > >> optimized
> > >> for the host.
> > >
> > > -march=native implies -mtune=native so I think the latter is redundant.
> >
> > Ok with that change?
> 
> Put the list back in the loop.
> 
> >
> > >
> > >> config/ChangeLog:
> > >>
> > >> * bootstrap-native.mk: New file.
> > >>
> > >> gcc/ChangeLog:
> > >>
> > >> * doc/install.texi: Document bootstrap-native.
> > >> ---
> > >>  config/bootstrap-native.mk | 1 +
> > >>  gcc/doc/install.texi   | 7 +++
> > >>  2 files changed, 8 insertions(+)
> > >>  create mode 100644 config/bootstrap-native.mk
> > >>
> > >> diff --git a/config/bootstrap-native.mk b/config/bootstrap-native.mk
> > >> new file mode 100644
> > >> index 000..a4a3d859408
> > >> --- /dev/null
> > >> +++ b/config/bootstrap-native.mk
> > >> @@ -0,0 +1 @@
> > >> +BOOT_CFLAGS := -march=native -mtune=native $(BOOT_CFLAGS)
> > >
> > > bootstrap-O3 uses
> > >
> > > BOOT_CFLAGS := -O3 $(filter-out -O%, $(BOOT_CFLAGS))
> > >
> > > so do you want to filer-out other -march/-mtune/-mcpu options?
> >
> > I don't think that is needed because these are usually not used (unlike
> > -O)
> >
> > >
> > > Some targets know -mcpu= instead of -march=, did you check whether
> > > any of those have =native?
> >
> > There are some like Alpha and others dont jave it at all. That is the
> > why the documentation says "if supported".
> 

FWIW, Both AArch64 and Arm support native.
There's a slight difference though that unlike x86 -march=native does not
imply -mtune=native on Arm.  On AArch64 it does but only if no other
tuning options are specified.

So perhaps the original change is better?

Thanks,
Tamar

> I see.
> 
> So yes, OK with the above change.
> 
> Richard.
> 
> > >
> > >> diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
> > >> index 4973f195daf..04a2256b97a 100644
> > >> --- a/gcc/doc/install.texi
> > >> +++ b/gcc/doc/install.texi
> > >> @@ -3052,6 +3052,13 @@ Removes any @option{-O}-started option from
> > >> @code{BOOT_CFLAGS}, and adds
> > >>  @itemx @samp{bootstrap-Og}
> > >>  Analogous to @code{bootstrap-O1}.
> > >>
> > >> +@item @samp{bootstrap-native}
> > >> +@itemx @samp{bootstrap-native}
> > >> +Optimize the compiler code for the build host, if supported by the
> > >> +architecture. Note this only affects the compiler, not the targeted
> > >> +code. If you want the later, choose options suitable to the target
> > >> you
> > >> +are looking for. For example @samp{--with-cpu} would be a good
> > >> starting point.
> > >> +
> > >>  @item @samp{bootstrap-lto}
> > >>  Enables Link-Time Optimization for host tools during bootstrapping.
> > >>  @samp{BUILD_CONFIG=bootstrap-lto} is equivalent to adding
> > >> --
> > >> 2.47.1
> > >>


Re: [PATCH] Add a bootstrap-native build config

2025-04-23 Thread Jakub Jelinek
On Wed, Apr 23, 2025 at 09:36:11AM +, Tamar Christina wrote:
> On AArch64 it does but only if no other
> tuning options are specified.

That is the case on x86 as well, -march=native -mtune=znver5 will
still tune for znver5, but -march=native will tune for native.

Jakub



Re: [PATCH] Add a bootstrap-native build config

2025-04-23 Thread Jakub Jelinek
On Wed, Apr 23, 2025 at 10:05:25AM +, Tamar Christina wrote:
> > -Original Message-
> > From: Jakub Jelinek 
> > Sent: Wednesday, April 23, 2025 10:39 AM
> > To: Tamar Christina 
> > Cc: Richard Biener ; Andi Kleen
> > ; GCC Patches 
> > Subject: Re: [PATCH] Add a bootstrap-native build config
> > 
> > On Wed, Apr 23, 2025 at 09:36:11AM +, Tamar Christina wrote:
> > > On AArch64 it does but only if no other
> > > tuning options are specified.
> > 
> > That is the case on x86 as well, -march=native -mtune=znver5 will
> > still tune for znver5, but -march=native will tune for native.
> > 
> 
> But what happens with
> 
> -mtune=znver5 -march=native

The same obviously.

Jakub



Re: [PATCH] Document AArch64 changes for GCC 15

2025-04-23 Thread Richard Sandiford
Evgeny Karpov  writes:
> Tuesday, April 23, 2025
> Richard Sandiford  wrote:
>> Thanks the summary.  Does the entry below look ok?
>>
>>  Support has been added for the AArch64 MinGW target
>>(aarch64-w64-mingw32).  At present, this target
>>supports C and C++ for base Armv8-A, but with some caveats:
>>
>>  Although most variadic functions work, the implementation
>>of them is not yet complete.
>>  
>>  C++ exception handling is not yet implemented.
>>
>>Further work is planned for GCC 16.
>>  
>
> Thanks, it looks good. Maybe it is worth mentioning that gdb is not supported 
> yet.

It's probably better to stick to GCC for this.  Binutils and gdb are
separate projects on separate schedules, and people might not always
be using the latest version of everything.

Thanks,
Richard


Re: [PATCH] modulo-sched: reject loop conditions when not decrementing with one [PR 116479]

2025-04-23 Thread Jakub Jelinek
On Wed, Apr 23, 2025 at 04:46:04PM +0100, Andre Vieira (lists) wrote:
> On 23/04/2025 16:22, Jakub Jelinek wrote:
> > On Wed, Apr 23, 2025 at 03:57:58PM +0100, Andre Vieira (lists) wrote:
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr116479.c
> > > @@ -0,0 +1,20 @@
> > > +/* PR 116479 */
> > > +/* { dg-do run } */
> > > +/* { dg-additional-options "-O -funroll-loops -finline-stringops 
> > > -fmodulo-sched --param=max-iterations-computation-cost=637924687 -static 
> > > -std=c23" } */
> > > +_BitInt (13577) b;
> > > +
> > > +void
> > > +foo (char *ret)
> > > +{
> > > +  __builtin_memset (&b, 4, 697);
> > > +  *ret = 0;
> > > +}
> > > +
> > > +int
> > > +main ()
> > > +{
> > > +  char x;
> > > +  foo (&x);
> > > +  for (unsigned i = 0; i < sizeof (x); i++)
> > > +__builtin_printf ("%02x", i[(volatile unsigned char *) &x]);
> > 
> > Shouldn't these 2 lines instead be
> >if (x != 0)
> >  __builtin_abort ();
> > ?
> > 
> 
> Fair, I copied the testcase verbatim from the PR, the error-mode was a
> segfault. But I agree a check !=0 with __builtin_abort here seems more
> appropriate.  Any opinions on whether I should move it to dg with a bitint
> target?

I think there isn't anything aarch64 specific on the test, so yes,
I'd move it to gcc/testsuite/gcc.dg/bitint-123.c,
/* { dg-do run { target bitint } } */
and wrap b/foo definitions into #if __BITINT_MAXWIDTH__ >= 13577
and the main body as well (just in case some target supports smaller maximum
width than that).
Also, drop -static from dg-additional-options?

Jakub



Re: Help: Re: Questions on replacing a structure pointer reference to a call to .ACCESS_WITH_SIZE in C FE

2025-04-23 Thread Qing Zhao
Richard,

Thanks a lot for the hint.

> On Apr 23, 2025, at 04:17, Richard Biener  wrote:
> 
>> I have met the following issue when I tried to implement the following into 
>> tree-object-size.cc:
>> (And this took me quite some time, still don’t know what’s the best solution)
>> 
>>> On Apr 16, 2025, at 10:46, Qing Zhao  wrote:
>>> 
>>> 3. When generating the reference to the field member in tree-object-size, 
>>> we should guard this reference with a checking
>>>   on the pointer to the structure is valid. i.e:
>>> 
>>> struct annotated {
>>> size_t count;
>>> char array[] __attribute__((counted_by (count)));
>>> };
>>> 
>>> static size_t __attribute__((__noinline__)) size_of (struct annotated * obj)
>>> {
>>>  return __builtin_dynamic_object_size (obj, 1);
>>> }
>>> 
>>> When we try to generate the reference to obj->count when evaluating 
>>> __builtin_dynamic_object_size (obj, 1),
>>> We should generate the following:
>>> 
>>>  If (obj != NULL)
>>>* (&obj->count)
>>> 
>>> To make sure that the pointer to the structure object is valid first.
>>> 
>> 
>> Then as I generate the following size_expr in tree-object-size.cc:
>> 
>> Breakpoint 1, gimplify_size_expressions (osi=0xdf30)
>>at ../../latest-gcc-write/gcc/tree-object-size.cc:1178
>> 1178   force_gimple_operand (size_expr, &seq, true, NULL);
>> (gdb) call debug_generic_expr(size_expr)
>> _4 = obj_2(D) != 0B ? (sizetype) (int) MAX_EXPR <(sizetype) MAX_EXPR >  [(void *)&*obj_2(D)], 0> + 4, 4> : 18446744073709551615
>> 
>> When calling “force_gimple_operand” for the above size_expr, I got the 
>> following ICE in gimplify_modify_expr, at gimplify.cc:7505:
> 
> You shouldn't really force_gimple_operand to a MODIFY_EXPR but instead
> only to its RHS.

Do you mean: instead of 

force_gimple_operand (size_expr, &seq, true, NULL);

I should

1178   if (TREE_CODE (size_expr) == MODIFY_EXPR)
1179 {
1180   tree rhs = TREE_OPERAND (size_expr, 1);
1181   force_gimple_operand (rhs, &seq, true, NULL);
1182 }

?

However, with this change, I got the exactly same error at the above line 1181. 
(gdb) call debug_generic_expr(rhs)
obj_2(D) != 0B ? (sizetype) (int) MAX_EXPR <(sizetype) MAX_EXPR  
[(void *)&*obj_2(D)], 0> + 4, 4> : 18446744073709551615

The issue is still the same as before. 
So, I am wondering whether the above size expression I generated has some issue?
Or the routine “force_gimple_operand” has some bug  when the tree expr is a 
COND_EXPR expression?

Thanks.

Qing

The size_expr is a COND_EXPR:

(gdb) call debug_tree(rhs)
 
unit-size 
align:64 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 
0x7fffea282000 precision:64 min  max >

arg:0 
unit-size 
align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 
0x7fffea282b28 precision:1 min  max >
arg:0 
visited var 
def_stmt GIMPLE_NOP
version:2
ptr-info 0x7fffea091918>
arg:1 >
arg:1 
arg:0 
arg:0 
arg:0 
arg:0 
arg:0 
arg:0  arg:1 >>
arg:1 > arg:1 
>>>
arg:2  constant 18446744073709551615>>

> 
>> (gdb) c
>> Continuing.
>> during GIMPLE pass: objsz
>> dump file: a-t.c.110t.objsz1
>> In function ‘size_of’:
>> cc1: internal compiler error: in gimplify_modify_expr, at gimplify.cc:7505
>> 0x36feb67 internal_error(char const*, ...)
>> ../../latest-gcc-write/gcc/diagnostic-global-context.cc:517
>> 0x36ccd67 fancy_abort(char const*, int, char const*)
>> ../../latest-gcc-write/gcc/diagnostic.cc:1749
>> 0x14fa8ab gimplify_modify_expr
>> ../../latest-gcc-write/gcc/gimplify.cc:7505
>> 0x15354c3 gimplify_expr(tree_node**, gimple**, gimple**, bool 
>> (*)(tree_node*), int)
>> ../../latest-gcc-write/gcc/gimplify.cc:19530
>> 0x14fe1b3 gimplify_stmt(tree_node**, gimple**)
>> ../../latest-gcc-write/gcc/gimplify.cc:8458
>> ….
>> 0x1b07757 gimplify_size_expressions
>> ../../latest-gcc-write/gcc/tree-object-size.cc:1178
>> 
>> I debugged into this a little bit, and found that the following are the 
>> reason for the assertion failure in the routine “gimplify_modify_expr” of 
>> gimplify.cc:
>> 
>> 1. The assertion failure is:
>> 
>> 7502   if (gimplify_ctxp->into_ssa && is_gimple_reg (*to_p))
>> 7503 {
>> 7504   /* We should have got an SSA name from the start.  */
>> 7505   gcc_assert (TREE_CODE (*to_p) == SSA_NAME
>> 7506   || ! gimple_in_ssa_p (cfun));
>> 7507 }
>> 
>> 2. The above assertion failure is issued for the following temporary tree:
>> 
>> (gdb) call debug_generic_expr(*to_p)
>> iftmp.2
>> (gdb) call debug_generic_expr(*expr_p)
>> iftmp.2 = (sizetype) _10
>> 
>> In the above, the temporary variable “iftmp.2” triggered the assertion since 
>> it

RE: [PATCH]middle-end: Add new "max" vector cost model

2025-04-23 Thread Richard Biener
On Wed, 23 Apr 2025, Tamar Christina wrote:

> > -Original Message-
> > From: Richard Biener 
> > Sent: Wednesday, April 23, 2025 9:46 AM
> > To: Tamar Christina 
> > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Sandiford
> > 
> > Subject: RE: [PATCH]middle-end: Add new "max" vector cost model
> > 
> > On Wed, 23 Apr 2025, Tamar Christina wrote:
> > 
> > > > -Original Message-
> > > > From: Richard Biener 
> > > > Sent: Wednesday, April 23, 2025 9:37 AM
> > > > To: Tamar Christina 
> > > > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Sandiford
> > > > 
> > > > Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> > > >
> > > > On Wed, 23 Apr 2025, Richard Biener wrote:
> > > >
> > > > > On Wed, 23 Apr 2025, Tamar Christina wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > This patch proposes a new vector cost model called "max".  The cost 
> > > > > > model
> > is
> > > > an
> > > > > > intersection between two of our existing cost models.  Like 
> > > > > > `unlimited` it
> > > > > > disables the costing vs scalar and assumes all vectorization to be 
> > > > > > profitable.
> > > > > >
> > > > > > But unlike unlimited it does not fully disable the vector cost 
> > > > > > model.  That
> > > > > > means that we still perform comparisons between vector modes.
> > > > > >
> > > > > > As an example, the following:
> > > > > >
> > > > > > void
> > > > > > foo (char *restrict a, int *restrict b, int *restrict c,
> > > > > >  int *restrict d, int stride)
> > > > > > {
> > > > > > if (stride <= 1)
> > > > > > return;
> > > > > >
> > > > > > for (int i = 0; i < 3; i++)
> > > > > > {
> > > > > > int res = c[i];
> > > > > > int t = b[i * stride];
> > > > > > if (a[i] != 0)
> > > > > > res = t * d[i];
> > > > > > c[i] = res;
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic 
> > > > > > fails to
> > > > > > vectorize as it assumes scalar would be faster, and with
> > > > > > -fvect-cost-model=unlimited it picks a vector type that's so big 
> > > > > > that the large
> > > > > > sequence generated is working on mostly inactive lanes:
> > > > > >
> > > > > > ...
> > > > > > and p3.b, p3/z, p4.b, p4.b
> > > > > > whilelo p0.s, wzr, w7
> > > > > > ld1wz23.s, p3/z, [x3, #3, mul vl]
> > > > > > ld1wz28.s, p0/z, [x5, z31.s, sxtw 2]
> > > > > > add x0, x5, x0
> > > > > > punpklo p6.h, p6.b
> > > > > > ld1wz27.s, p4/z, [x0, z31.s, sxtw 2]
> > > > > > and p6.b, p6/z, p0.b, p0.b
> > > > > > punpklo p4.h, p7.b
> > > > > > ld1wz24.s, p6/z, [x3, #2, mul vl]
> > > > > > and p4.b, p4/z, p2.b, p2.b
> > > > > > uqdecw  w6
> > > > > > ld1wz26.s, p4/z, [x3]
> > > > > > whilelo p1.s, wzr, w6
> > > > > > mul z27.s, p5/m, z27.s, z23.s
> > > > > > ld1wz29.s, p1/z, [x4, z31.s, sxtw 2]
> > > > > > punpkhi p7.h, p7.b
> > > > > > mul z24.s, p5/m, z24.s, z28.s
> > > > > > and p7.b, p7/z, p1.b, p1.b
> > > > > > mul z26.s, p5/m, z26.s, z30.s
> > > > > > ld1wz25.s, p7/z, [x3, #1, mul vl]
> > > > > > st1wz27.s, p3, [x2, #3, mul vl]
> > > > > > mul z25.s, p5/m, z25.s, z29.s
> > > > > > st1wz24.s, p6, [x2, #2, mul vl]
> > > > > > st1wz25.s, p7, [x2, #1, mul vl]
> > > > > > st1wz26.s, p4, [x2]
> > > > > > ...
> > > > > >
> > > > > > With -fvect-cost-model=max you get more reasonable code:
> > > > > >
> > > > > > foo:
> > > > > > cmp w4, 1
> > > > > > ble .L1
> > > > > > ptrue   p7.s, vl3
> > > > > > index   z0.s, #0, w4
> > > > > > ld1bz29.s, p7/z, [x0]
> > > > > > ld1wz30.s, p7/z, [x1, z0.s, sxtw 2]
> > > > > > ptrue   p6.b, all
> > > > > > cmpne   p7.b, p7/z, z29.b, #0
> > > > > > ld1wz31.s, p7/z, [x3]
> > > > > > mul z31.s, p6/m, z31.s, z30.s
> > > > > > st1wz31.s, p7, [x2]
> > > > > > .L1:
> > > > > > ret
> > > > > >
> > > > > > This model has been useful internally for performance exploration 
> > > > > > and cost-
> > > > model
> > > > > > validation.  It allows us to force realistic vectorization 
> > > > > > overriding the cost
> > > > > > model to be able to tell whether it's correct wrt to profitability.
> > > > > >
> > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > > > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > > > > > -m32, -m64 and no issues.
> > > > > >
> > > > > > Ok for master?
> > > > >
> > > > > Hmm.  I don't like another cost model.  Instead how about changing
> > > > > 'unlimited' to still iterate through vector sizes?  Cost modeling
> > > > > is really about vector vs. scalar, not vector vs. vector w

RE: [PATCH]middle-end: Add new "max" vector cost model

2025-04-23 Thread Richard Biener
On Wed, 23 Apr 2025, Tamar Christina wrote:

> > -Original Message-
> > From: Richard Biener 
> > Sent: Wednesday, April 23, 2025 10:14 AM
> > To: Tamar Christina 
> > Cc: Richard Sandiford ; gcc-patches@gcc.gnu.org;
> > nd 
> > Subject: RE: [PATCH]middle-end: Add new "max" vector cost model
> > 
> > On Wed, 23 Apr 2025, Tamar Christina wrote:
> > 
> > > > -Original Message-
> > > > From: Richard Sandiford 
> > > > Sent: Wednesday, April 23, 2025 9:45 AM
> > > > To: Tamar Christina 
> > > > Cc: Richard Biener ; gcc-patches@gcc.gnu.org; nd
> > > > 
> > > > Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> > > >
> > > > Tamar Christina  writes:
> > > > >> -Original Message-
> > > > >> From: Richard Biener 
> > > > >> Sent: Wednesday, April 23, 2025 9:31 AM
> > > > >> To: Tamar Christina 
> > > > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Sandiford
> > > > >> 
> > > > >> Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> > > > >>
> > > > >> On Wed, 23 Apr 2025, Tamar Christina wrote:
> > > > >>
> > > > >> > Hi All,
> > > > >> >
> > > > >> > This patch proposes a new vector cost model called "max".  The 
> > > > >> > cost model
> > is
> > > > an
> > > > >> > intersection between two of our existing cost models.  Like 
> > > > >> > `unlimited` it
> > > > >> > disables the costing vs scalar and assumes all vectorization to be 
> > > > >> > profitable.
> > > > >> >
> > > > >> > But unlike unlimited it does not fully disable the vector cost 
> > > > >> > model.  That
> > > > >> > means that we still perform comparisons between vector modes.
> > > > >> >
> > > > >> > As an example, the following:
> > > > >> >
> > > > >> > void
> > > > >> > foo (char *restrict a, int *restrict b, int *restrict c,
> > > > >> >  int *restrict d, int stride)
> > > > >> > {
> > > > >> > if (stride <= 1)
> > > > >> > return;
> > > > >> >
> > > > >> > for (int i = 0; i < 3; i++)
> > > > >> > {
> > > > >> > int res = c[i];
> > > > >> > int t = b[i * stride];
> > > > >> > if (a[i] != 0)
> > > > >> > res = t * d[i];
> > > > >> > c[i] = res;
> > > > >> > }
> > > > >> > }
> > > > >> >
> > > > >> > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic 
> > > > >> > fails
> > to
> > > > >> > vectorize as it assumes scalar would be faster, and with
> > > > >> > -fvect-cost-model=unlimited it picks a vector type that's so big 
> > > > >> > that the
> > large
> > > > >> > sequence generated is working on mostly inactive lanes:
> > > > >> >
> > > > >> > ...
> > > > >> > and p3.b, p3/z, p4.b, p4.b
> > > > >> > whilelo p0.s, wzr, w7
> > > > >> > ld1wz23.s, p3/z, [x3, #3, mul vl]
> > > > >> > ld1wz28.s, p0/z, [x5, z31.s, sxtw 2]
> > > > >> > add x0, x5, x0
> > > > >> > punpklo p6.h, p6.b
> > > > >> > ld1wz27.s, p4/z, [x0, z31.s, sxtw 2]
> > > > >> > and p6.b, p6/z, p0.b, p0.b
> > > > >> > punpklo p4.h, p7.b
> > > > >> > ld1wz24.s, p6/z, [x3, #2, mul vl]
> > > > >> > and p4.b, p4/z, p2.b, p2.b
> > > > >> > uqdecw  w6
> > > > >> > ld1wz26.s, p4/z, [x3]
> > > > >> > whilelo p1.s, wzr, w6
> > > > >> > mul z27.s, p5/m, z27.s, z23.s
> > > > >> > ld1wz29.s, p1/z, [x4, z31.s, sxtw 2]
> > > > >> > punpkhi p7.h, p7.b
> > > > >> > mul z24.s, p5/m, z24.s, z28.s
> > > > >> > and p7.b, p7/z, p1.b, p1.b
> > > > >> > mul z26.s, p5/m, z26.s, z30.s
> > > > >> > ld1wz25.s, p7/z, [x3, #1, mul vl]
> > > > >> > st1wz27.s, p3, [x2, #3, mul vl]
> > > > >> > mul z25.s, p5/m, z25.s, z29.s
> > > > >> > st1wz24.s, p6, [x2, #2, mul vl]
> > > > >> > st1wz25.s, p7, [x2, #1, mul vl]
> > > > >> > st1wz26.s, p4, [x2]
> > > > >> > ...
> > > > >> >
> > > > >> > With -fvect-cost-model=max you get more reasonable code:
> > > > >> >
> > > > >> > foo:
> > > > >> > cmp w4, 1
> > > > >> > ble .L1
> > > > >> > ptrue   p7.s, vl3
> > > > >> > index   z0.s, #0, w4
> > > > >> > ld1bz29.s, p7/z, [x0]
> > > > >> > ld1wz30.s, p7/z, [x1, z0.s, sxtw 2]
> > > > >> >ptrue   p6.b, all
> > > > >> > cmpne   p7.b, p7/z, z29.b, #0
> > > > >> > ld1wz31.s, p7/z, [x3]
> > > > >> >mul z31.s, p6/m, z31.s, z30.s
> > > > >> > st1wz31.s, p7, [x2]
> > > > >> > .L1:
> > > > >> > ret
> > > > >> >
> > > > >> > This model has been useful internally for performance exploration 
> > > > >> > and
> > cost-
> > > > >> model
> > > > >> > validation.  It allows us to force realistic vectorization 
> > > > >> > overriding the cost
> > > > >> > model to be able to tell whether it's correct wrt to profitability.
> > > > >> >
> > > > >> > Bootstrapped Regtes

[PATCH v2] Document AArch64 changes for GCC 15

2025-04-23 Thread Richard Sandiford
Thanks for all the feedback.  I've tried to address it in the version
below.  I'll push later today if there are no further comments.

Richard


The list is structured as:

- new configurations
- command-line changes
- ACLE changes
- everything else

As usual, the list of new architectures, CPUs, and features is from a
purely mechanical trawl of the associated .def files.  I've identified
features by their architectural name to try to improve searchability.
Similarly, the list of ACLE changes includes the associated ACLE
feature macros, again to try to improve searchability.

The list summarises some of the target-specific optimisations because
it sounded like Tamar had received feedback that people found such
information interesting.

I've used the passive tense for most entries, to try to follow the
style used elsewhere.

We don't yet define __ARM_FEATURE_FAMINMAX, but I'll fix that
separately.
---
 htdocs/gcc-15/changes.html | 255 -
 1 file changed, 254 insertions(+), 1 deletion(-)

diff --git a/htdocs/gcc-15/changes.html b/htdocs/gcc-15/changes.html
index f03e29c8..958cacc1 100644
--- a/htdocs/gcc-15/changes.html
+++ b/htdocs/gcc-15/changes.html
@@ -681,7 +681,260 @@ asm (".text; %cc0: mov %cc2, %%r0; .previous;"
 
 New Targets and Target Specific Improvements
 
-
+AArch64
+
+
+  Support has been added for the AArch64 MinGW target
+(aarch64-w64-mingw32).  At present, this target
+supports C and C++ for base Armv8-A, but with some caveats:
+
+  Although most variadic functions work, the implementation
+of them is not yet complete.
+  
+  C++ exception handling is not yet implemented.
+
+Further work is planned for GCC 16.
+  
+  As noted above, support for ILP32 (-mabi=ilp32)
+has been deprecated and will be removed in a future release.
+aarch64*-elf targets no longer build the ILP32 multilibs.
+  
+  The following architecture level is now supported by
+-march and related source-level constructs
+(GCC identifiers in parentheses):
+
+  Armv9.5-A (arm9.5-a)
+
+  
+  The following CPUs are now supported by -mcpu,
+-mtune, and related source-level constructs
+(GCC identifiers in parentheses):
+
+  Apple A12 (apple-a12)
+  Apple M1 (apple-m1)
+  Apple M2 (apple-m2)
+  Apple M3 (apple-m3)
+  Arm Cortex-A520AE (cortex-a520ae)
+  Arm Cortex-A720AE (cortex-a720ae)
+  Arm Cortex-A725 (cortex-a725)
+  Arm Cortex-R82AE (cortex-r82ae)
+  Arm Cortex-X925 (cortex-x925)
+  Arm Neoverse N3 (neoverse-n3)
+  Arm Neoverse V3 (neoverse-v3)
+  Arm Neoverse V3AE (neoverse-v3ae)
+  FUJITSU-MONAKA (fujitsu-monaka)
+  NVIDIA Grace (grace)
+  NVIDIA Olympus (olympus)
+  Qualcomm Oryon-1 (oryon-1)
+
+  
+  The following features are now supported by -march,
+-mcpu, and related source-level constructs
+(GCC modifiers in parentheses):
+
+  FEAT_CPA (+cpa), enabled by default for
+Arm9.5-A and above
+  
+  FEAT_FAMINMAX (+faminmax), enabled by default for
+Arm9.5-A and above
+  
+  FEAT_FCMA (+fcma), enabled by default for Armv8.3-A
+and above
+  
+  FEAT_FLAGM2 (+flagm2), enabled by default for
+Armv8.5-A and above
+  
+  FEAT_FP8 (+fp8)
+  FEAT_FP8DOT2 (+fp8dot2)
+  FEAT_FP8DOT4 (+fp8dot4)
+  FEAT_FP8FMA (+fp8fma)
+  FEAT_FRINTTS (+frintts), enabled by default for
+Armv8.5-A and above
+  
+  FEAT_JSCVT (+jscvt), enabled by default for
+Armv8.3-A and above
+  
+  FEAT_LUT (+lut), enabled by default for
+Arm9.5-A and above
+  
+  FEAT_LRCPC2 (+rcpc2), enabled by default for
+Armv8.4-A and above
+  
+  FEAT_SME_B16B16 (+sme-b16b16)
+  FEAT_SME_F16F16 (+sme-f16f16)
+  FEAT_SME2p1 (+sme2p1)
+  FEAT_SSVE_FP8DOT2 (+ssve-fp8dot2)
+  FEAT_SSVE_FP8DOT4 (+ssve-fp8dot4)
+  FEAT_SSVE_FP8FMA (+ssve-fp8fma)
+  FEAT_SVE_B16B16 (+sve-b16b16)
+  FEAT_SVE2p1 (+sve2p1), enabled by default for
+Armv9.4-A and above
+  
+  FEAT_WFXT (+wfxt), enabled by default for
+Armv8.7-A and above
+  
+  FEAT_XS (+xs), enabled by default for
+Armv8.7-A and above
+  
+
+The features listed as being enabled by default for Armv8.7-A or earlier
+were previously only selectable using the associated architecture level.
+For example, FEAT_FCMA was previously selected by
+-march=armv8.3-a and above (as it still is), but it wasn't
+previously selectable independently.
+  
+  The -mbranch-protection feature has been extended to
+support the Guarded Control Stack (GCS) extension.  This support
+is included in -mbranch-protection=standard and can
+be enabled individually using -mbranch-protection=gcs.
+  
+  The following additional changes have been made to the
+command-line options:
+
+  In order to align with other tool

[PATCH v2] loop2_unroll: split loop exit during unrolling of uncountable loops

2025-04-23 Thread Artemiy Volkov
Hi all,

sending a v2 of
https://gcc.gnu.org/pipermail/gcc-patches/2025-April/680893.html after
fixing several issues with the original patch.  Namely, the changes
since v1 are:

- Remove the call to df_finish_pass () at the end of split_exit () and
  simply restore the previous value of the DF flags instead to avoid UAF
  errors.
- Remove the call to df_analyze () at the end of split_exit () to save
  compilation time.
- Under -O1, always call df_remove_problem (df_live) whenever there was
  a corresponding call to df_live_add_problem ().
- Restore alphabetical order in common.opt{,.urls}.

Could anyone please review/commit on my behalf if OK?

Thanks,
Artemiy

-- >8 --

Consider the (lightly modified) core_list_reverse () function from Coremark:

struct list_node *
core_list_reverse (struct list_node *list)
{
  struct list_node *next = 0, *tmp;
  #pragma GCC unroll 4
  while (list)
{
  tmp = list->next;
  list->next = next;
  next = list;
  list = tmp;
}

  return next;
}

On AArch64, this compiles to the following:

core_list_reverse:
cbz x0, .L2
ldr x1, [x0]
mov x6, 0
str x6, [x0]
mov x3, x0
cbz x1, .L2
.L4:
ldr x2, [x1]
str x3, [x1]
mov x0, x1
cbz x2, .L2
ldr x4, [x2]
str x1, [x2]
mov x0, x2
cbz x4, .L2
...
mov x0, x5
mov x3, x0
ldr x1, [x0]
str x6, [x0]
cbnzx1, .L4
.L2:
ret

The 'next' variable lives in the x0 register, which is maintained by the
"mov x0, xR" instruction at every unrolled iteration.  However, this can
be improved by removing the instruction and splitting the loop exit into
multiple exits, each corresponding to the hard register in which the
'next' variable is at a particular iteration, so that the code looks
like:

core_list_reverse:
cbz x0, .L2
mov x1, 0
.L3:
ldr x2, [x0]
str x1, [x0]
cbz x2, .L2
ldr x3, [x2]
str x0, [x2]
cbz x3, .L13
...
ldr x0, [x1]
str x3, [x1]
cbnzx0, .L3
mov x0, x1
.L2:
ret
.L13:
mov x0, x2
ret
.L14:
mov x0, x3
ret

This patch implements this transformation by splitting variables defined
in the loop and live at the (single) exit BB of an uncountable loop,
replacing each of those with a unique temporary pseudo inside the loop
and assigning these pseudos back to the original variables in the newly
split exit BBs (one per unrolled iteration).  (This is the behavior that
the GIMPLE unroller would exhibit were it capable of handling
uncountable loops.)  Afterwards, the cprop pass is able to propagate the
(split) loop variables and carry the move instructions out of the loop.

This change is primarily intended for small in-order cores on which the
latency of the move isn't hidden by the latency of the load.  The
optimization is guarded by the new -fsplit-exit-in-unroller flag that is
on by default.  The flag has been documented in doc/invoke.texi, and the
common.opt.urls file has been regenerated.

The change has been bootstrapped and regtested on i386, x86_64, and
aarch64, and additionally regtested on riscv32.  Two new testcases have
been added to demonstrate operation and interaction with
-fvariable-expansion-in-unroller.  On a Cortex-A53, this patch leads to
an ~0.5% improvement for Coremark and SPECINT2006 geomean (the only
regression being 483.xalancbmk), when compiled with -O2
-funroll-all-loops.  The compile-time increase is ~0.4%.

PR rtl-optimization/119681

gcc/ChangeLog:

* common.opt (-fsplit-exit-in-unroller): New flag.
* common.opt.urls: Regenerate.
* doc/invoke.texi (-fsplit-exit-in-unroller): Document it.
* loop-unroll.cc (split_exit): New function.
(regno_defined_inside_loop_p): New predicate.
(unroll_loop_stupid): Call split_exit ().
(has_use_with_multiple_defs_p): New predicate.

gcc/testsuite/ChangeLog:

* gcc.dg/loop-exit-split-1.c: New test.
* gcc.dg/loop-exit-split-2.c: New test.

Signed-off-by: Artemiy Volkov 
---
 gcc/common.opt   |   4 +
 gcc/common.opt.urls  |   3 +
 gcc/doc/invoke.texi  |  12 +-
 gcc/loop-unroll.cc   | 196 +++
 gcc/testsuite/gcc.dg/loop-exit-split-1.c |  31 
 gcc/testsuite/gcc.dg/loop-exit-split-2.c |  33 
 6 files changed, 277 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/loop-exit-split-1.c
 create mode 100644 gcc/testsuite/gcc.dg/loop-exit-split-2.c

diff --git a/gcc/common.opt b/gcc/common.opt
index 88d987e6ab1..b9488a916a1 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2944,6 +2944,10 @@ fsingle-precision-constant
 Common Var(flag_single_p

Re: [PATCH 07/61] Testsuite: Fix tests properly for compact-branches

2025-04-23 Thread Aleksandar Rakic
HTEC Public

Hi,

> This likely needs to be updated for the trunk.

> Before:


>  === gcc Summary ===

> # of expected passes95
> # of unexpected failures25


> After:
>  === gcc Summary ===

> # of expected passes70
> # of unexpected failures50

> Clearly not going in the right direction.  Configured as
> mips64el-linux-gnuabi64.  Running just the near-far-?.c tests.

> Jeff

I would like to inform you that the version 2 of this patch with the
appropriate ChangeLog entry is available at the following link:

https://gcc.gnu.org/pipermail/gcc-patches/2025-March/677827.html

Please find attached scripts that I used for building the GCC
cross-compiler and for running the GCC testsuite for the
mips64-r6-linux-gnu target.
The script run_mips_gcc_testsuite is meant to be run inside the
$BUILD_DIR/gcc-build directory with the following arguments:

--sys-root=$SYSROOT --test-driver=mips.exp --test-regex="near-far-?.c"

I ran the near-far-?.c tests and all of them passed:

Before:
=== gcc Summary ===

# of expected passes168

After:
=== gcc Summary ===

# of expected passes168

Kind regards,
Aleksandar

# How to Build a *GNU* Toolchain from Source

## Introduction

A *GNU Toolchain* consists of:  
- *GCC*, which can be obtained with the following command:
  ```
  $ git clone git://gcc.gnu.org/git/gcc.git
  ```
- *binutils*, which can be obtained with the following command:
  ```
  $ git clone git://sourceware.org/git/binutils-gdb.git
  ```
- *glibc*, which can be obtained with the following command:
  ```
  $ git clone https://sourceware.org/git/glibc.git
  ```
- *Linux Kernel Headers*, which can be downloaded from [here](https://mirrors.edge.kernel.org/pub/linux/kernel). Version 5.10.116 was used for this tutorial.  
  
The source directory will consist of one main directory (from this point on, referred to as the `$SOURCE_DIR`) and multiple subdirectories - one for each of the previously mentioned parts of the *GNU Toolchain*.  

Besides the source directory, there will be a build directory, which will consist of one main directory (from this point on, referred to as the `$BUILD_DIR`) and multiple subdirectories - one for a build of each of the previously mentioned parts of the *GNU Toolchain* (excluding the *Linux Kernel Headers*).  

Finally, there will be an install directory (from this point on, referred to as the `$PREFIX` directory). Within it, there will be a *system root* subdirectory (from this point on, referred to as the `$SYSROOT` directory). The `$SYSROOT` directory is the root directory in which the target system headers, libraries and run-time objects will be searched for.  

Make sure to add the `bin/` subdirectory of the `$PREFIX` directory to your `PATH` environment variable before proceeding:  
  
```
$ export PATH=$PREFIX/bin:$PATH
```
  
For this tutorial, we'll use the `mips64-r6-linux-gnu` target:  

```
$ export TARGET=mips64-r6-linux-gnu
```
  

## Step 1: Build and Install *binutils*

First, position yourself inside the `$BUILD_DIR`, create `binutils-gdb-build/` subdirectory, and position yourself inside it:  
  
```
$ cd $BUILD_DIR
$ mkdir binutils-gdb-build
$ cd binutils-gdb-build
```
  
Next, configure the build:  
  
```
$ $SOURCE_DIR/binutils-gdb/configure \
--prefix=$PREFIX \
--target=$TARGET \
--with-sysroot=$SYSROOT \
--disable-nls \
--disable-werror \
--with-arch=mips64r6 \
--with-abi=64 \
--disable-multilib
```
  
`--with-sysroot=$SYSROOT` tells *binutils* to consider the `$SYSROOT` directory as the *system root*.  

`--disable-nls` tells *binutils* not to include native language support. This is basically optional, but reduces dependencies and compile time.  

`--disable-werror` tells *binutils* to disable promoting warnings into errors (such as overflow warnings, for example).  

`--with-arch=mips64r6` and `--with-abi=64` options manually specify the target architecture and the target *ABI*, respectively. This is needed to override the default 32-bit *ABI MIPS* has (even if you specify a 64-bit target). For other targets, this should be adjusted or removed.  

Optionally, you can enable the *multilib* support with the `--enable-multilib` option. *multilib* is a mechanism to support building and running code for different *ABI*s for the same *CPU* family on a given system. Most commonly it is used to support 32-bit code on 64-bit systems and 64-bit code on 32-bit systems with a 64-bit kernel. This requires even more setup and will not be covered in this tutorial.  

Finally, build and install *binutils*:  
  
```
$ make
$ make install
```
  

## Step 2: Install *Linux Kernel Headers*

In order to install *Linux Kernel Headers*, position yourself inside the `$SOURCE_DIR/linux-5.10.116/` directory and run the following commands:  
  
```
$ make mrproper
$ make ARCH=mips INSTALL_HDR_PATH=$SYSROOT/usr headers_install
$ mak

Re: [PATCH 1/3] match: Move `(cmp (cond @0 @1 @2) @3)` simplification after the bool compare simplifcation

2025-04-23 Thread Richard Biener
On Wed, Apr 23, 2025 at 5:58 AM Andrew Pinski  wrote:
>
> This moves the `(cmp (cond @0 @1 @2) @3)` simplifcation to be after the 
> boolean comparison
> simplifcations so that we don't end up simplifing into the same thing for a 
> GIMPLE_COND.

OK.

Richard.

> gcc/ChangeLog:
>
> * match.pd: Move `(cmp (cond @0 @1 @2) @3)` simplifcation after
> the bool comparison simplifications.
>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/match.pd | 31 +--
>  1 file changed, 17 insertions(+), 14 deletions(-)
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index ba036e52837..0fe90a6edc4 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -7759,20 +7759,6 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>(cmp (bit_and@2 @0 integer_pow2p@1) @1)
>(icmp @2 { build_zero_cst (TREE_TYPE (@0)); })))
>
> -#if GIMPLE
> -/* From fold_binary_op_with_conditional_arg handle the case of
> -   rewriting (a ? b : c) > d to a ? (b > d) : (c > d) when the
> -   compares simplify.  */
> -(for cmp (simple_comparison)
> - (simplify
> -  (cmp:c (cond @0 @1 @2) @3)
> -  /* Do not move possibly trapping operations into the conditional as this
> - pessimizes code and causes gimplification issues when applied late.  */
> -  (if (!FLOAT_TYPE_P (TREE_TYPE (@3))
> -   || !operation_could_trap_p (cmp, true, false, @3))
> -   (cond @0 (cmp! @1 @3) (cmp! @2 @3)
> -#endif
> -
>  (for cmp (ge lt)
>  /* x < 0 ? ~y : y into (x >> (prec-1)) ^ y. */
>  /* x >= 0 ? ~y : y into ~((x >> (prec-1)) ^ y). */
> @@ -8119,6 +8105,23 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> replace if (x == 0) with tem = ~x; if (tem != 0) which is
> clearly less optimal and which we'll transform again in forwprop.  */
>
> +#if GIMPLE
> +/* From fold_binary_op_with_conditional_arg handle the case of
> +   rewriting (a ? b : c) > d to a ? (b > d) : (c > d) when the
> +   compares simplify.
> +   This should be after the boolean comparison simplification so
> +   that it can remove the outer comparison before appling it to
> +   the inner condtional operands.  */
> +(for cmp (simple_comparison)
> + (simplify
> +  (cmp:c (cond @0 @1 @2) @3)
> +  /* Do not move possibly trapping operations into the conditional as this
> + pessimizes code and causes gimplification issues when applied late.  */
> +  (if (!FLOAT_TYPE_P (TREE_TYPE (@3))
> +   || !operation_could_trap_p (cmp, true, false, @3))
> +   (cond @0 (cmp! @1 @3) (cmp! @2 @3)
> +#endif
> +
>  /* Transform comparisons of the form (X & Y) CMP 0 to X CMP2 Z
> where ~Y + 1 == pow2 and Z = ~Y.  */
>  (for cst (VECTOR_CST INTEGER_CST)
> --
> 2.43.0
>


Re: [PATCH 30/61] MSA: Make MSA and microMIPS R5 unsupported

2025-04-23 Thread Aleksandar Rakic
HTEC Public

Hi,

> > There are no platforms nor simulators for MSA and microMIPS R5 so
> > turning off this support for now.
> >
> > gcc/ChangeLog:
> >
> >   * config/mips/mips.cc (mips_option_override): Error out for
> >   -mmicromips -mmsa.
> OK and pushed to the trunk.
> Jeff

We have sent a patch series to binutils in the meantime:
https://sourceware.org/pipermail/binutils/2025-April/140356.html
It includes adding microMIPSR6 support.

We have also sent a patch that adds microMIPS R6 support in GCC:
https://gcc.gnu.org/pipermail/gcc-patches/2025-March/677813.html

I also realized that there is a mips32r6-generic CPU supporting
microMIPS32 Release 6 ISA and that we enabled MSA ASE for it:
https://gitlab.com/qemu-project/qemu/-/commit/5d3d52229b19509eaace662096a52dc91f712fc1

Also, this patch was dependent upon a patch that adds a support for
microMIPS R6 in GCC, where the local variable 'is_micromips' was defined.

Now I must update this patch appropriately. Please find a new version of
this patch in the attachment.

Kind regards,
Aleksandar
From 16b3207aed5e4846fde4f3ffa1253c65ef6ba056 Mon Sep 17 00:00:00 2001
From: Aleksandar Rakic 
Date: Wed, 23 Apr 2025 14:14:17 +0200
Subject: [PATCH] Make MSA and microMIPS R5 unsupported

There are no platforms nor simulators for MSA and microMIPS R5 so
turning off this support for now.

gcc/ChangeLog:

	* config/mips/mips.cc (mips_option_override): Error out for
	-mmicromips -mips32r5 -mmsa.

Cherry-picked 1009d6ff7a8d3b56e0224a6b193c5a7b3c29aa5f
from https://github.com/MIPS/gcc

Signed-off-by: Matthew Fortune 
Signed-off-by: Faraz Shahbazker 
Signed-off-by: Aleksandar Rakic 
---
 gcc/config/mips/mips.cc | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/gcc/config/mips/mips.cc b/gcc/config/mips/mips.cc
index 0d3d0263f2d..23205dfb616 100644
--- a/gcc/config/mips/mips.cc
+++ b/gcc/config/mips/mips.cc
@@ -20414,6 +20414,7 @@ static void
 mips_option_override (void)
 {
   int i, regno, mode;
+  unsigned int is_micromips;
 
   if (OPTION_SET_P (mips_isa_option))
 mips_isa_option_info = &mips_cpu_info_table[mips_isa_option];
@@ -20434,6 +20435,7 @@ mips_option_override (void)
   /* Save the base compression state and process flags as though we
  were generating uncompressed code.  */
   mips_base_compression_flags = TARGET_COMPRESSION;
+  is_micromips = TARGET_MICROMIPS;
   target_flags &= ~TARGET_COMPRESSION;
   mips_base_code_readable = mips_code_readable;
 
@@ -20678,7 +20680,7 @@ mips_option_override (void)
 	  "-mcompact-branches=never");
 }
 
-  if (is_micromips && TARGET_MSA)
+  if (is_micromips && mips_isa_rev <= 5 && TARGET_MSA)
 error ("unsupported combination: %s", "-mmicromips -mmsa");
 
   /* Require explicit relocs for MIPS R6 onwards.  This enables simplification
-- 
2.34.1



Re: [PATCH] Consider frequency in cost estimation when converting scalar to vector.

2025-04-23 Thread Jan Hubicka
> In some benchmark, I notice stv failed due to cost unprofitable, but the igain
> is inside the loop, but sse<->integer conversion is outside the loop, current 
> cost
> model doesn't consider the frequency of those gain/cost.
> The patch weights those cost with frequency just like LRA does.
> 
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for GCC16?
> 
> gcc/ChangeLog:
> 
>   * config/i386/i386-features.cc (scalar_chain::mark_dual_mode_def):
>   (general_scalar_chain::compute_convert_gain):
> ---
>  gcc/config/i386/i386-features.cc | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/gcc/config/i386/i386-features.cc 
> b/gcc/config/i386/i386-features.cc
> index c35ac24fd8a..ae0844a70c2 100644
> --- a/gcc/config/i386/i386-features.cc
> +++ b/gcc/config/i386/i386-features.cc
> @@ -337,18 +337,20 @@ scalar_chain::mark_dual_mode_def (df_ref def)
>/* Record the def/insn pair so we can later efficiently iterate over
>   the defs to convert on insns not in the chain.  */
>bool reg_new = bitmap_set_bit (defs_conv, DF_REF_REGNO (def));
> +  unsigned frequency
> += REG_FREQ_FROM_BB (BLOCK_FOR_INSN (DF_REF_INSN (def)));

I am generally trying to get rid of remaing uses of REG_FREQ since the
1 based fixed point arithmetics iot always working that well.

You can do the sums in profile_count type (doing something reasonable
when count is uninitialized) and then convert it to sreal for the final
heuristics.

Typically such code also wants skip scaling by count when optimizing for
size (since in this case we want to count statically).  Not sure how
important it is for vector code but I suppose it can happen.

Honza
>if (!bitmap_bit_p (insns, DF_REF_INSN_UID (def)))
>  {
>if (!bitmap_set_bit (insns_conv, DF_REF_INSN_UID (def))
> && !reg_new)
>   return;
> -  n_integer_to_sse++;
> +  n_integer_to_sse += frequency;
>  }
>else
>  {
>if (!reg_new)
>   return;
> -  n_sse_to_integer++;
> +  n_sse_to_integer += frequency;
>  }
>  
>if (dump_file)
> @@ -556,6 +558,8 @@ general_scalar_chain::compute_convert_gain ()
>rtx src = SET_SRC (def_set);
>rtx dst = SET_DEST (def_set);
>int igain = 0;
> +  unsigned frequency
> + = REG_FREQ_FROM_BB (BLOCK_FOR_INSN (insn));
>  
>if (REG_P (src) && REG_P (dst))
>   igain += 2 * m - ix86_cost->xmm_move;
> @@ -755,6 +759,7 @@ general_scalar_chain::compute_convert_gain ()
>   }
>   }
>  
> +  igain *= frequency;
>if (igain != 0 && dump_file)
>   {
> fprintf (dump_file, "  Instruction gain %d for ", igain);
> -- 
> 2.34.1
> 


Re: [PATCH] [x86] Generate 2 FMA instructions in ix86_expand_swdivsf.

2025-04-23 Thread Jan Hubicka
> From: "hongtao.liu" 
> 
> When FMA is available, N-R step can be rewritten with
> 
> a / b = (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a
> 
> which have 2 fma generated.[1]
> 
> [1] https://bugs.llvm.org/show_bug.cgi?id=21385
> 
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?

How this behaves on CPUs where FMA has longer latency then addition when
swdifsf is on the critical path through the loop?

Honza
> 
> 
> gcc/ChangeLog:
> 
>   * config/i386/i386-expand.cc (ix86_emit_swdivsf): Generate 2
>   FMA instructions when TARGET_FMA.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/i386/recip-vec-divf-fma.c: New test.
> ---
>  gcc/config/i386/i386-expand.cc| 44 ++-
>  .../gcc.target/i386/recip-vec-divf-fma.c  | 12 +
>  2 files changed, 44 insertions(+), 12 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/recip-vec-divf-fma.c
> 
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index cdfd94d3c73..4fffbfdd574 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -19256,8 +19256,6 @@ ix86_emit_swdivsf (rtx res, rtx a, rtx b, 
> machine_mode mode)
>e1 = gen_reg_rtx (mode);
>x1 = gen_reg_rtx (mode);
>  
> -  /* a / b = a * ((rcp(b) + rcp(b)) - (b * rcp(b) * rcp (b))) */
> -
>b = force_reg (mode, b);
>  
>/* x0 = rcp(b) estimate */
> @@ -19270,20 +19268,42 @@ ix86_emit_swdivsf (rtx res, rtx a, rtx b, 
> machine_mode mode)
>  emit_insn (gen_rtx_SET (x0, gen_rtx_UNSPEC (mode, gen_rtvec (1, b),
>   UNSPEC_RCP)));
>  
> -  /* e0 = x0 * b */
> -  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, b)));
> +  unsigned vector_size = GET_MODE_SIZE (mode);
>  
> -  /* e0 = x0 * e0 */
> -  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, e0)));
> +  /* (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a
> + N-R step with 2 fma implementation.  */
> +  if (TARGET_FMA
> +  || (TARGET_AVX512F && vector_size == 64)
> +  || (TARGET_AVX512VL && (vector_size == 32 || vector_size == 16)))
> +{
> +  /* e0 = x0 * a  */
> +  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, a)));
> +  /* e1 = e0 * b - a  */
> +  emit_insn (gen_rtx_SET (e1, gen_rtx_FMA (mode, e0, b,
> +gen_rtx_NEG (mode, a;
> +  /* res = - e1 * x0 + e0  */
> +  emit_insn (gen_rtx_SET (res, gen_rtx_FMA (mode,
> +gen_rtx_NEG (mode, e1),
> +x0, e0)));
> +}
> +/* a / b = a * ((rcp(b) + rcp(b)) - (b * rcp(b) * rcp (b))) */
> +  else
> +{
> +  /* e0 = x0 * b */
> +  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, b)));
>  
> -  /* e1 = x0 + x0 */
> -  emit_insn (gen_rtx_SET (e1, gen_rtx_PLUS (mode, x0, x0)));
> +  /* e1 = x0 + x0 */
> +  emit_insn (gen_rtx_SET (e1, gen_rtx_PLUS (mode, x0, x0)));
>  
> -  /* x1 = e1 - e0 */
> -  emit_insn (gen_rtx_SET (x1, gen_rtx_MINUS (mode, e1, e0)));
> +  /* e0 = x0 * e0 */
> +  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, e0)));
>  
> -  /* res = a * x1 */
> -  emit_insn (gen_rtx_SET (res, gen_rtx_MULT (mode, a, x1)));
> +  /* x1 = e1 - e0 */
> +  emit_insn (gen_rtx_SET (x1, gen_rtx_MINUS (mode, e1, e0)));
> +
> +  /* res = a * x1 */
> +  emit_insn (gen_rtx_SET (res, gen_rtx_MULT (mode, a, x1)));
> +}
>  }
>  
>  /* Output code to perform a Newton-Rhapson approximation of a
> diff --git a/gcc/testsuite/gcc.target/i386/recip-vec-divf-fma.c 
> b/gcc/testsuite/gcc.target/i386/recip-vec-divf-fma.c
> new file mode 100644
> index 000..ad9e07b1eb6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/recip-vec-divf-fma.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mfma -mavx2" } */
> +/* { dg-final { scan-assembler-times {(?n)vfn?m(add|sub)[1-3]*ps} 2 } } */
> +
> +typedef float v4sf __attribute__((vector_size(16)));
> +/* (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a  */
> +
> +v4sf
> +foo (v4sf a, v4sf b)
> +{
> +return a / b;
> +}
> -- 
> 2.34.1
> 


Re: [PATCH] GCN: Properly switch sections in 'gcn_hsa_declare_function_name' [PR119737]

2025-04-23 Thread Andrew Stubbs

On 22/04/2025 21:41, Thomas Schwinge wrote:

From: Andrew Pinski 

There are GCN/C++ target as well as offloading codes, where the hard-coded
section names in 'gcn_hsa_declare_function_name' do not fit, and assembly thus
fails:

 LLVM ERROR: Size expression must be absolute.

This commit progresses GCN target:

 [-FAIL: g++.dg/init/call1.C  -std=gnu++17 (internal compiler error: 
Aborted signal terminated program as)-]
 [-FAIL:-]{+PASS:+} g++.dg/init/call1.C  -std=gnu++17 (test for excess 
errors)
 [-UNRESOLVED:-]{+PASS:+} g++.dg/init/call1.C  -std=gnu++17 [-compilation 
failed to produce executable-]{+execution test+}
 [-FAIL: g++.dg/init/call1.C  -std=gnu++26 (internal compiler error: 
Aborted signal terminated program as)-]
 [-FAIL:-]{+PASS:+} g++.dg/init/call1.C  -std=gnu++26 (test for excess 
errors)
 [-UNRESOLVED:-]{+PASS:+} g++.dg/init/call1.C  -std=gnu++26 [-compilation 
failed to produce executable-]{+execution test+}
 UNSUPPORTED: g++.dg/init/call1.C  -std=gnu++98: exception handling not 
supported

..., and GCN offloading:

 [-XFAIL: libgomp.c++/target-exceptions-throw-1.C (internal compiler error: 
Aborted signal terminated program as)-]
 [-XFAIL: libgomp.c++/target-exceptions-throw-1.C PR119737 at line 7 (test 
for bogus messages, line )-]
 [-XFAIL:-]{+PASS:+} libgomp.c++/target-exceptions-throw-1.C (test for 
excess errors)
 [-UNRESOLVED:-]{+PASS:+} libgomp.c++/target-exceptions-throw-1.C 
[-compilation failed to produce executable-]{+execution test+}
 {+PASS: libgomp.c++/target-exceptions-throw-1.C output pattern test+}

 [-XFAIL: libgomp.c++/target-exceptions-throw-2.C (internal compiler error: 
Aborted signal terminated program as)-]
 [-XFAIL: libgomp.c++/target-exceptions-throw-2.C PR119737 at line 7 (test 
for bogus messages, line )-]
 [-XFAIL:-]{+PASS:+} libgomp.c++/target-exceptions-throw-2.C (test for 
excess errors)
 [-UNRESOLVED:-]{+PASS:+} libgomp.c++/target-exceptions-throw-2.C 
[-compilation failed to produce executable-]{+execution test+}
 {+PASS: libgomp.c++/target-exceptions-throw-2.C output pattern test+}

 [-XFAIL: libgomp.oacc-c++/exceptions-throw-1.C -DACC_DEVICE_TYPE_radeon=1 
-DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O2  (internal compiler error: 
Aborted signal terminated program as)-]
 [-XFAIL: libgomp.oacc-c++/exceptions-throw-1.C -DACC_DEVICE_TYPE_radeon=1 
-DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O2  PR119737 at line 7 (test for 
bogus messages, line )-]
 [-XFAIL:-]{+PASS:+} libgomp.oacc-c++/exceptions-throw-1.C 
-DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O2  
(test for excess errors)
 [-UNRESOLVED:-]{+PASS:+} libgomp.oacc-c++/exceptions-throw-1.C 
-DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O2  
[-compilation failed to produce executable-]{+execution test+}
 {+PASS: libgomp.oacc-c++/exceptions-throw-1.C -DACC_DEVICE_TYPE_radeon=1 
-DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O2  output pattern test+}

 [-XFAIL: libgomp.oacc-c++/exceptions-throw-2.C -DACC_DEVICE_TYPE_radeon=1 
-DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O2  (internal compiler error: 
Aborted signal terminated program as)-]
 [-XFAIL: libgomp.oacc-c++/exceptions-throw-2.C -DACC_DEVICE_TYPE_radeon=1 
-DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O2  PR119737 at line 9 (test for 
bogus messages, line )-]
 [-XFAIL:-]{+PASS:+} libgomp.oacc-c++/exceptions-throw-2.C 
-DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O2  
(test for excess errors)
 [-UNRESOLVED:-]{+PASS:+} libgomp.oacc-c++/exceptions-throw-2.C 
-DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O2  
[-compilation failed to produce executable-]{+execution test+}
 {+PASS: libgomp.oacc-c++/exceptions-throw-2.C -DACC_DEVICE_TYPE_radeon=1 
-DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O2  output pattern test+}

PR target/119737
gcc/
* config/gcn/gcn.cc (gcn_hsa_declare_function_name): Properly
switch sections.
libgomp/
* testsuite/libgomp.c++/target-exceptions-throw-1.C: Remove
PR119737 XFAILing.
* testsuite/libgomp.c++/target-exceptions-throw-2.C: Likewise.
* testsuite/libgomp.oacc-c++/exceptions-throw-1.C: Likewise.
* testsuite/libgomp.oacc-c++/exceptions-throw-2.C: Likewise.

Co-authored-by: Thomas Schwinge 
---
  gcc/config/gcn/gcn.cc | 6 +++---
  libgomp/testsuite/libgomp.c++/target-exceptions-throw-1.C | 3 ---
  libgomp/testsuite/libgomp.c++/target-exceptions-throw-2.C | 3 ---
  libgomp/testsuite/libgomp.oacc-c++/exceptions-throw-1.C   | 3 ---
  libgomp/testsuite/libgomp.oacc-c++/exceptions-throw-2.C   | 3 ---
  5 files changed, 3 insertions(+), 15 deletions(-)

diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index d59e87bed46..91ce8019480 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc

Re: [PATCH] modulo-sched: reject loop conditions when not decrementing with one [PR 116479]

2025-04-23 Thread Jakub Jelinek
On Wed, Apr 23, 2025 at 03:57:58PM +0100, Andre Vieira (lists) wrote:
> +++ b/gcc/testsuite/gcc.target/aarch64/pr116479.c
> @@ -0,0 +1,20 @@
> +/* PR 116479 */
> +/* { dg-do run } */
> +/* { dg-additional-options "-O -funroll-loops -finline-stringops 
> -fmodulo-sched --param=max-iterations-computation-cost=637924687 -static 
> -std=c23" } */
> +_BitInt (13577) b;
> +
> +void
> +foo (char *ret)
> +{
> +  __builtin_memset (&b, 4, 697);
> +  *ret = 0;
> +}
> +
> +int
> +main ()
> +{
> +  char x;
> +  foo (&x);
> +  for (unsigned i = 0; i < sizeof (x); i++)
> +__builtin_printf ("%02x", i[(volatile unsigned char *) &x]);

Shouldn't these 2 lines instead be
  if (x != 0)
__builtin_abort ();
?

> +}


Jakub



Re: [PATCH] modulo-sched: reject loop conditions when not decrementing with one [PR 116479]

2025-04-23 Thread Andre Vieira (lists)




On 23/04/2025 16:22, Jakub Jelinek wrote:

On Wed, Apr 23, 2025 at 03:57:58PM +0100, Andre Vieira (lists) wrote:

+++ b/gcc/testsuite/gcc.target/aarch64/pr116479.c
@@ -0,0 +1,20 @@
+/* PR 116479 */
+/* { dg-do run } */
+/* { dg-additional-options "-O -funroll-loops -finline-stringops -fmodulo-sched 
--param=max-iterations-computation-cost=637924687 -static -std=c23" } */
+_BitInt (13577) b;
+
+void
+foo (char *ret)
+{
+  __builtin_memset (&b, 4, 697);
+  *ret = 0;
+}
+
+int
+main ()
+{
+  char x;
+  foo (&x);
+  for (unsigned i = 0; i < sizeof (x); i++)
+__builtin_printf ("%02x", i[(volatile unsigned char *) &x]);


Shouldn't these 2 lines instead be
   if (x != 0)
 __builtin_abort ();
?



Fair, I copied the testcase verbatim from the PR, the error-mode was a 
segfault. But I agree a check !=0 with __builtin_abort here seems more 
appropriate.  Any opinions on whether I should move it to dg with a 
bitint target?




+}



Jakub





[PATCH] libstdc++: Minimalize temporary allocations when width is specified [PR109162]

2025-04-23 Thread Tomasz Kamiński
When width parameter is specified for formatting range, tuple or escaped
presentation of string, we used to format characters to temporary string,
and write produce sequence padded according to the spec. However, once the
estimated width of formatted representation of input is larger than the value
of spec width, it can be written directly to the output. This limits size of
required allocation, especially for large ranges.

Similarly, if precision (maximum) width is provided for string presentation,
on a prefix of sequence with estimated width not greater than precision, needs
to be buffered.

To realize above, this commit implements a new _Padding_sink specialization.
This sink holds an output iterator, a value of padding width, (optionally)
maximum width and a string buffer inherited from _Str_sink.
Then any incoming characters are treated in one of following ways, depending of
estimated width W of written sequence:
* written to string if W is smaller than padding width and maximum width (if 
present)
* ignored, if W is greater than maximum width
* written to output iterator, if W is greater than padding width

The padding sink is used instead of _Str_sink in __format::__format_padded,
__formatter_str::_M_format_escaped functions.

Furthermore __formatter_str::_M_format implementation was reworked, to:
* reduce number of instantiations by delegating to _Rg& and const _Rg& 
overloads,
* non-debug presentation is written to _Out directly or via _Padding_sink
* if maximum width is specified for debug format with non-unicode encoding,
  string size is limited to that number.

PR libstdc++/109162

libstdc++-v3/ChangeLog:

* include/bits/formatfwd.h (__simply_formattable_range): Moved from
std/format.
* include/std/format (__formatter_str::_format): Extracted escaped
string handling to separate method...
(__formatter_str::_M_format_escaped): Use __Padding_sink.
(__formatter_str::_M_format): Adjusted implementation.
(__formatter_str::_S_trunc): Extracted as namespace function...
(__format::_truncate): Extracted from __formatter_str::_S_trunc.
(__format::_Seq_sink): Removed forward declarations, made members
protected and non-final.
(_Seq_sink::_M_trim): Define.
(_Seq_sink::_M_span): Renamed from view.
(_Seq_sink::view): Returns string_view instead of span.
(__format::_Str_sink): Moved after _Seq_sink.
(__format::__format_padded): Use _Padding_sink.
* testsuite/std/format/debug.cc: Add timeout and new tests.
* testsuite/std/format/ranges/sequence.cc: Specify unicode as
encoding and new tests.
* testsuite/std/format/ranges/string.cc: Likewise.
* testsuite/std/format/tuple.cc: Likewise.
---
This is for sure 16 material, and nothing to backport.
This addressed the TODO I created in __format_padded.
OK for trunk after 15.1?

 libstdc++-v3/include/bits/formatfwd.h |   8 +
 libstdc++-v3/include/std/format   | 396 +-
 libstdc++-v3/testsuite/std/format/debug.cc| 386 -
 .../testsuite/std/format/ranges/sequence.cc   | 116 +
 .../testsuite/std/format/ranges/string.cc |  63 +++
 libstdc++-v3/testsuite/std/format/tuple.cc|  93 
 6 files changed, 957 insertions(+), 105 deletions(-)

diff --git a/libstdc++-v3/include/bits/formatfwd.h 
b/libstdc++-v3/include/bits/formatfwd.h
index 9ba658b078a..2d54ee5d30b 100644
--- a/libstdc++-v3/include/bits/formatfwd.h
+++ b/libstdc++-v3/include/bits/formatfwd.h
@@ -131,6 +131,14 @@ namespace __format
   = ranges::input_range
  && formattable, _CharT>;
 
+  // _Rg& and const _Rg& are both formattable and use same formatter
+  // specialization for their references.
+  template
+concept __simply_formattable_range
+  = __const_formattable_range<_Rg, _CharT>
+ && same_as>,
+remove_cvref_t>>;
+
   template
 using __maybe_const_range
   = __conditional_t<__const_formattable_range<_Rg, _CharT>, const _Rg, 
_Rg>;
diff --git a/libstdc++-v3/include/std/format b/libstdc++-v3/include/std/format
index 7d3067098be..355db5f2a60 100644
--- a/libstdc++-v3/include/std/format
+++ b/libstdc++-v3/include/std/format
@@ -56,7 +56,7 @@
 #include   // input_range, range_reference_t
 #include   // subrange
 #include  // ranges::copy
-#include  // back_insert_iterator
+#include  // back_insert_iterator, counted_iterator
 #include  // __is_pair
 #include   // __is_scalar_value, _Utf_view, etc.
 #include   // tuple_size_v
@@ -99,19 +99,12 @@ namespace __format
 
   // Size for stack located buffer
   template
-  constexpr size_t __stackbuf_size = 32 * sizeof(void*) / sizeof(_CharT);
+constexpr size_t __stackbuf_size = 32 * sizeof(void*) / sizeof(_CharT);
 
   // Type-erased character sinks.
   template class _Sink;
   template class _Fixedbuf_sink;
-  template class _Seq_sink;
-
-  template>
-using _Str_sink

RE: [PATCH]middle-end: Add new "max" vector cost model

2025-04-23 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Wednesday, April 23, 2025 9:37 AM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Sandiford
> 
> Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> 
> On Wed, 23 Apr 2025, Richard Biener wrote:
> 
> > On Wed, 23 Apr 2025, Tamar Christina wrote:
> >
> > > Hi All,
> > >
> > > This patch proposes a new vector cost model called "max".  The cost model 
> > > is
> an
> > > intersection between two of our existing cost models.  Like `unlimited` it
> > > disables the costing vs scalar and assumes all vectorization to be 
> > > profitable.
> > >
> > > But unlike unlimited it does not fully disable the vector cost model.  
> > > That
> > > means that we still perform comparisons between vector modes.
> > >
> > > As an example, the following:
> > >
> > > void
> > > foo (char *restrict a, int *restrict b, int *restrict c,
> > >  int *restrict d, int stride)
> > > {
> > > if (stride <= 1)
> > > return;
> > >
> > > for (int i = 0; i < 3; i++)
> > > {
> > > int res = c[i];
> > > int t = b[i * stride];
> > > if (a[i] != 0)
> > > res = t * d[i];
> > > c[i] = res;
> > > }
> > > }
> > >
> > > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
> > > vectorize as it assumes scalar would be faster, and with
> > > -fvect-cost-model=unlimited it picks a vector type that's so big that the 
> > > large
> > > sequence generated is working on mostly inactive lanes:
> > >
> > > ...
> > > and p3.b, p3/z, p4.b, p4.b
> > > whilelo p0.s, wzr, w7
> > > ld1wz23.s, p3/z, [x3, #3, mul vl]
> > > ld1wz28.s, p0/z, [x5, z31.s, sxtw 2]
> > > add x0, x5, x0
> > > punpklo p6.h, p6.b
> > > ld1wz27.s, p4/z, [x0, z31.s, sxtw 2]
> > > and p6.b, p6/z, p0.b, p0.b
> > > punpklo p4.h, p7.b
> > > ld1wz24.s, p6/z, [x3, #2, mul vl]
> > > and p4.b, p4/z, p2.b, p2.b
> > > uqdecw  w6
> > > ld1wz26.s, p4/z, [x3]
> > > whilelo p1.s, wzr, w6
> > > mul z27.s, p5/m, z27.s, z23.s
> > > ld1wz29.s, p1/z, [x4, z31.s, sxtw 2]
> > > punpkhi p7.h, p7.b
> > > mul z24.s, p5/m, z24.s, z28.s
> > > and p7.b, p7/z, p1.b, p1.b
> > > mul z26.s, p5/m, z26.s, z30.s
> > > ld1wz25.s, p7/z, [x3, #1, mul vl]
> > > st1wz27.s, p3, [x2, #3, mul vl]
> > > mul z25.s, p5/m, z25.s, z29.s
> > > st1wz24.s, p6, [x2, #2, mul vl]
> > > st1wz25.s, p7, [x2, #1, mul vl]
> > > st1wz26.s, p4, [x2]
> > > ...
> > >
> > > With -fvect-cost-model=max you get more reasonable code:
> > >
> > > foo:
> > > cmp w4, 1
> > > ble .L1
> > > ptrue   p7.s, vl3
> > > index   z0.s, #0, w4
> > > ld1bz29.s, p7/z, [x0]
> > > ld1wz30.s, p7/z, [x1, z0.s, sxtw 2]
> > >   ptrue   p6.b, all
> > > cmpne   p7.b, p7/z, z29.b, #0
> > > ld1wz31.s, p7/z, [x3]
> > >   mul z31.s, p6/m, z31.s, z30.s
> > > st1wz31.s, p7, [x2]
> > > .L1:
> > > ret
> > >
> > > This model has been useful internally for performance exploration and 
> > > cost-
> model
> > > validation.  It allows us to force realistic vectorization overriding the 
> > > cost
> > > model to be able to tell whether it's correct wrt to profitability.
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > > -m32, -m64 and no issues.
> > >
> > > Ok for master?
> >
> > Hmm.  I don't like another cost model.  Instead how about changing
> > 'unlimited' to still iterate through vector sizes?  Cost modeling
> > is really about vector vs. scalar, not vector vs. vector which is
> > completely under target control.  Targets should provide a way
> > to limit iteration, like aarch64 has with the aarch64-autovec-preference
> > --param or x86 has with -mprefer-vector-width.
> >
> > Of course changing 'unlimited' might result in somewhat of a testsuite
> > churn, but still the fix there would be to inject a proper -mXYZ
> > or --param to get the old behavior back (or even consider cycling
> > through the different aarch64-autovec-preference settings for the
> > testsuite).
> 
> Note this will completely remove the ability to reject never profitable
> vectorizations, so I'm not sure that this is what you'd want in practice.
> You want to fix cost modeling instead.
> 
> So why does it consider the scalar code to be faster with =dynamic
> and why do you think that's not possible to fix?  Don't we have
> per-loop #pragma control to force vectorization here (but maybe that
> has the 'unlimited' cost modeling issue)?
> 

The addition wasn't for the GCC testsuite usage specifically.   This is about
testing real world code wr

RE: [PATCH] Add a bootstrap-native build config

2025-04-23 Thread Tamar Christina
> -Original Message-
> From: Jakub Jelinek 
> Sent: Wednesday, April 23, 2025 10:39 AM
> To: Tamar Christina 
> Cc: Richard Biener ; Andi Kleen
> ; GCC Patches 
> Subject: Re: [PATCH] Add a bootstrap-native build config
> 
> On Wed, Apr 23, 2025 at 09:36:11AM +, Tamar Christina wrote:
> > On AArch64 it does but only if no other
> > tuning options are specified.
> 
> That is the case on x86 as well, -march=native -mtune=znver5 will
> still tune for znver5, but -march=native will tune for native.
> 

But what happens with

-mtune=znver5 -march=native

On, AArch64 this would tune for znver5.  So the order of the arguments
don't matter for this one specific rewrite case.

Thanks,
Tamar

>   Jakub



Re: [PATCH] s390: Allow 5+ argument tail-calls in some special cases [PR119873]

2025-04-23 Thread Jakub Jelinek
On Wed, Apr 23, 2025 at 04:23:37PM +0200, Stefan Schulze Frielinghaus wrote:
> > So, the following patch checks for this special case, where the argument
> > which uses %r6 is passed in a single register and it is passed default
> > definition of SSA_NAME of a PARM_DECL with the same DECL_INCOMING_RTL.
> 
> Do we really need a check for nregs==1 here?  Only, for -m31 we pass
> parameters in register pairs.  With check nregs==1 we fail on -m31 for
> the following example and without we pass:
> 
> extern int bar (int p1, int p2, int p3, long long p4, int p5);
> int foo (int p1, int p2, int p3, long long p4, int p5)
> {
>   [[gnu::musttail]] return bar (p1, p2, p3, p4, p5);
> }
> 
> Parameter p4 should be passed in r5,r6 and p5 via stack.

I guess the nregs==1 check can be dropped.
I didn't want to test it multiple times for each nregs, but guess it
will happen only for the %r6 normally as for other regs
call_used_or_fixed_reg_p will be true unless user uses -ffixed- etc.
options.

> > It won't really work at -O0 but should work for -O1 and above, at least when
> > one doesn't really try to modify the parameter conditionally and hope it 
> > will
> > be optimized away in the end.
> 
> It also fails for
> 
> extern int bar (int p1, int p2, int p3, int p4, int p5);
> int foo (int p1, int p2, int p3, int p4, int p5)
> {
>   [[gnu::musttail]] return bar (p1, p2, p3, p4, p5);
> }
> 
> since rtx_equal_p (parm, parm_rtx) does not hold for p5
> 
> (gdb) call debug_rtx(parm_rtx)
> (reg:SI 6 %r6)
> (gdb) call debug_rtx(parm)
> (reg:DI 6 %r6 [ p5+-4 ])
> 
> due to a missmatch between extended and non-extended values.  Maybe a
> check like
> 
> REGNO (parm) == REGNO (parm_rtx)
> && REG_NREGS (parm) == REG_NREGS (parm_rtx)
> 
> is sufficient?

I think it depends the details of the ABI.
I believe when the argument is SSA_NAME (D) of a PARM_DECL it means
the parameter and the argument have (GIMPLE) compatible types.
Are the upper 32 bits always sign extended (or zero extended, or
sign vs. zero extended depending on if it is signed or unsigned < 64bit
type)?
If it must be extended, then yes, testing REGNO and REG_NREGS might be
enough, if the upper bits are unspecified, I'd worry we could sometimes
try to change those bits and break the caller that way.
If {un,}signed {char,short,int} are always zero/sign extended to 64-bit,
what about BITINT_TYPEs?

Another question is if it shouldn't be handled in the PARALLEL case as well
(e.g. if some -m32 long long parameter could be partly passed in %r6 and
partly on the stack).

Jakub



Re: [PATCH] s390: Allow 5+ argument tail-calls in some special cases [PR119873]

2025-04-23 Thread Stefan Schulze Frielinghaus
Hi Jakub,

On Tue, Apr 22, 2025 at 10:41:29AM +0200, Jakub Jelinek wrote:
> Hi!
> 
> protobuf (and therefore firefox too) currently doesn't build on s390*-linux.
> The problem is that it uses [[clang::musttail]] attribute heavily, and in
> llvm (IMHO llvm bug) [[clang::musttail]] calls with 5+ arguments on
> s390*-linux are silently accepted and result in a normal non-tail call.
> In GCC we just reject those because the target hook refuses to tail call it
> (IMHO the right behavior).
> Now, the reason why that happens is as s390_function_ok_for_sibcall attempts
> to explain, the 5th argument (assuming normal <= wordsize integer or pointer
> arguments, nothing that needs 2+ registers) is passed in %r6 which is not
> call clobbered, so we can't do tail call when we'd have to change content
> of that register and then caller would assume %r6 content didn't change and
> use it again.
> In the protobuf case though, the 5th argument is always passed through
> from the caller to the musttail callee unmodified, so one can actually
> emit just jg tail_called_function or perhaps tweak some registers but
> keep %r6 untouched, and in that case I think it is just fine to tail call
> it (at least unless the stack slots used for 6+ argument can't be modified
> by the callee in the ABI and nothing checks for that).

I very much like the idea.

> 
> So, the following patch checks for this special case, where the argument
> which uses %r6 is passed in a single register and it is passed default
> definition of SSA_NAME of a PARM_DECL with the same DECL_INCOMING_RTL.

Do we really need a check for nregs==1 here?  Only, for -m31 we pass
parameters in register pairs.  With check nregs==1 we fail on -m31 for
the following example and without we pass:

extern int bar (int p1, int p2, int p3, long long p4, int p5);
int foo (int p1, int p2, int p3, long long p4, int p5)
{
  [[gnu::musttail]] return bar (p1, p2, p3, p4, p5);
}

Parameter p4 should be passed in r5,r6 and p5 via stack.

> 
> It won't really work at -O0 but should work for -O1 and above, at least when
> one doesn't really try to modify the parameter conditionally and hope it will
> be optimized away in the end.

It also fails for

extern int bar (int p1, int p2, int p3, int p4, int p5);
int foo (int p1, int p2, int p3, int p4, int p5)
{
  [[gnu::musttail]] return bar (p1, p2, p3, p4, p5);
}

since rtx_equal_p (parm, parm_rtx) does not hold for p5

(gdb) call debug_rtx(parm_rtx)
(reg:SI 6 %r6)
(gdb) call debug_rtx(parm)
(reg:DI 6 %r6 [ p5+-4 ])

due to a missmatch between extended and non-extended values.  Maybe a
check like

REGNO (parm) == REGNO (parm_rtx)
&& REG_NREGS (parm) == REG_NREGS (parm_rtx)

is sufficient?

Cheers,
Stefan

> 
> Bootstrapped/regtested on s390x-linux, ok for trunk?
> 
> I wonder if we shouldn't do this for 15.1 as well with additional
> && CALL_EXPR_MUST_TAIL_CALL (call_expr) check ideally after nregs == 1
> so that we only do that for the musttail cases where we'd otherwise
> error and not for anything else, to fix up protobuf/firefox out of the box.
> 
> 2025-04-21  Jakub Jelinek  
> 
>   PR target/119873
>   * config/s390/s390.cc (s390_call_saved_register_used): Don't return
>   true if default definition of PARM_DECL SSA_NAME of the same register
>   is passed in call saved register.
>   (s390_function_ok_for_sibcall): Adjust comment.
> 
>   * gcc.target/s390/pr119873-1.c: New test.
>   * gcc.target/s390/pr119873-2.c: New test.
> 
> --- gcc/config/s390/s390.cc.jj2025-04-14 07:26:46.441883927 +0200
> +++ gcc/config/s390/s390.cc   2025-04-21 21:57:37.457535989 +0200
> @@ -14496,7 +14496,21 @@ s390_call_saved_register_used (tree call
>  
> for (reg = 0; reg < nregs; reg++)
>   if (!call_used_or_fixed_reg_p (reg + REGNO (parm_rtx)))
> -   return true;
> +   {
> + rtx parm;
> + /* Allow passing through unmodified value from caller,
> +see PR119873.  */
> + if (nregs == 1
> + && TREE_CODE (parameter) == SSA_NAME
> + && SSA_NAME_IS_DEFAULT_DEF (parameter)
> + && SSA_NAME_VAR (parameter)
> + && TREE_CODE (SSA_NAME_VAR (parameter)) == PARM_DECL
> + && (parm = DECL_INCOMING_RTL (SSA_NAME_VAR (parameter)))
> + && REG_P (parm)
> + && rtx_equal_p (parm, parm_rtx))
> +   break;
> + return true;
> +   }
>   }
>else if (GET_CODE (parm_rtx) == PARALLEL)
>   {
> @@ -14543,8 +14557,9 @@ s390_function_ok_for_sibcall (tree decl,
>  return false;
>  
>/* Register 6 on s390 is available as an argument register but 
> unfortunately
> - "caller saved". This makes functions needing this register for arguments
> - not suitable for sibcalls.  */
> + "caller saved".  This makes functions needing this register for 
> arguments
> + not suitable for sibcalls, unless the sam

[PATCH] modulo-sched: reject loop conditions when not decrementing with one [PR 116479]

2025-04-23 Thread Andre Vieira (lists)
In the commit titled 'doloop: Add support for predicated vectorized 
loops' the doloop_condition_get function was changed to accept loops 
with decrements larger than 1.  This patch rejects such loops for 
modulo-sched.


I've put the test for this in the aarch64 testsuite, but I just realized 
even though the testcase failed for aarch64, it can and should run on 
any target that supports BitInt, should I move it to dg and add a target 
{ bitint } ? PS: I have not checked whether the testcase used to fail in 
other targets.


Bootstrapped and regtested aarch64-none-linux-gnu, 
arm-none-linux-gnueabihf, x86_64-pc-linux-gnu.


OK for trunk (and backport to 15 branch)?

gcc/ChangeLog:

PR rtl-optimization/116479
* modulo-sched.cc (doloop_register_get): Reject conditions with
decrements that are not 1.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/pr116479.c: New test.diff --git a/gcc/modulo-sched.cc b/gcc/modulo-sched.cc
index 
08af5a929e148df8b3f6f4f9c4ada564aac22cdb..002346778f447ffe4fbad803872ba03880236e34
 100644
--- a/gcc/modulo-sched.cc
+++ b/gcc/modulo-sched.cc
@@ -356,7 +356,13 @@ doloop_register_get (rtx_insn *head, rtx_insn *tail)
 reg = XEXP (condition, 0);
   else if (GET_CODE (XEXP (condition, 0)) == PLUS
   && REG_P (XEXP (XEXP (condition, 0), 0)))
-reg = XEXP (XEXP (condition, 0), 0);
+{
+  if (CONST_INT_P (XEXP (condition, 1))
+ && INTVAL (XEXP (condition, 1)) == -1)
+   reg = XEXP (XEXP (condition, 0), 0);
+  else
+   return NULL_RTX;
+}
   else
 gcc_unreachable ();
 
diff --git a/gcc/testsuite/gcc.target/aarch64/pr116479.c 
b/gcc/testsuite/gcc.target/aarch64/pr116479.c
new file mode 100644
index 
..73315c7f4d6ea93587a9a042289004782aa92190
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr116479.c
@@ -0,0 +1,20 @@
+/* PR 116479 */
+/* { dg-do run } */
+/* { dg-additional-options "-O -funroll-loops -finline-stringops 
-fmodulo-sched --param=max-iterations-computation-cost=637924687 -static 
-std=c23" } */
+_BitInt (13577) b;
+
+void
+foo (char *ret)
+{
+  __builtin_memset (&b, 4, 697);
+  *ret = 0;
+}
+
+int
+main ()
+{
+  char x;
+  foo (&x);
+  for (unsigned i = 0; i < sizeof (x); i++)
+__builtin_printf ("%02x", i[(volatile unsigned char *) &x]);
+}


Re: [PATCH] libstdc++: Update baseline symbols for powerpc-linux and powerpc64-linux

2025-04-23 Thread Jonathan Wakely
On Wed, 23 Apr 2025 at 14:41, Andreas Schwab  wrote:
>
> * config/abi/post/powerpc-linux-gnu/baseline_symbols.txt: Update.
> * config/abi/post/powerpc64-linux-gnu/32/baseline_symbols.txt: Update.
> * config/abi/post/powerpc64-linux-gnu/baseline_symbols.txt: Update.

OK for trunk, and OK for gcc-15 with RM approval.


> ---
>  .../abi/post/powerpc-linux-gnu/baseline_symbols.txt   | 11 +++
>  .../post/powerpc64-linux-gnu/32/baseline_symbols.txt  | 11 +++
>  .../abi/post/powerpc64-linux-gnu/baseline_symbols.txt | 11 +++
>  3 files changed, 33 insertions(+)
>
> diff --git 
> a/libstdc++-v3/config/abi/post/powerpc-linux-gnu/baseline_symbols.txt 
> b/libstdc++-v3/config/abi/post/powerpc-linux-gnu/baseline_symbols.txt
> index c38386543b6..b8b27d0a91b 100644
> --- a/libstdc++-v3/config/abi/post/powerpc-linux-gnu/baseline_symbols.txt
> +++ b/libstdc++-v3/config/abi/post/powerpc-linux-gnu/baseline_symbols.txt
> @@ -2270,6 +2270,10 @@ 
> FUNC:_ZNSt12__shared_ptrINSt10filesystem7__cxx114_DirELN9__gnu_cxx12_Lock_policy
>  
> FUNC:_ZNSt12__shared_ptrINSt10filesystem7__cxx114_DirELN9__gnu_cxx12_Lock_policyE2EEC2EOS5_@@GLIBCXX_3.4.28
>  
> FUNC:_ZNSt12__shared_ptrINSt10filesystem7__cxx114_DirELN9__gnu_cxx12_Lock_policyE2EEC2Ev@@GLIBCXX_3.4.27
>  
> FUNC:_ZNSt12__shared_ptrINSt10filesystem7__cxx114_DirELN9__gnu_cxx12_Lock_policyE2EEaSEOS5_@@GLIBCXX_3.4.26
> +FUNC:_ZNSt12__sso_stringC1Ev@@GLIBCXX_3.4.34
> +FUNC:_ZNSt12__sso_stringC2Ev@@GLIBCXX_3.4.34
> +FUNC:_ZNSt12__sso_stringD1Ev@@GLIBCXX_3.4.34
> +FUNC:_ZNSt12__sso_stringD2Ev@@GLIBCXX_3.4.34
>  FUNC:_ZNSt12bad_weak_ptrD0Ev@@GLIBCXX_3.4.15
>  FUNC:_ZNSt12bad_weak_ptrD1Ev@@GLIBCXX_3.4.15
>  FUNC:_ZNSt12bad_weak_ptrD2Ev@@GLIBCXX_3.4.15
> @@ -3411,6 +3415,8 @@ 
> FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_Alloc_hiderC1EPcRKS
>  
> FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_Alloc_hiderC2EPcOS3_@@GLIBCXX_3.4.23
>  
> FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_Alloc_hiderC2EPcRKS3_@@GLIBCXX_3.4.21
>  
> FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructEjc@@GLIBCXX_3.4.21
> +FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructILb0EEEvPKcj@@GLIBCXX_3.4.34
> +FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructILb1EEEvPKcj@@GLIBCXX_3.4.34
>  
> FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIN9__gnu_cxx17__normal_iteratorIPKcS4_vT_SB_St20forward_iterator_tag@@GLIBCXX_3.4.21
>  
> FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIN9__gnu_cxx17__normal_iteratorIPcS4_vT_SA_St20forward_iterator_tag@@GLIBCXX_3.4.21
>  
> FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag@@GLIBCXX_3.4.21
> @@ -3564,6 +3570,8 @@ 
> FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_Alloc_hiderC1EPwRKS
>  
> FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_Alloc_hiderC2EPwOS3_@@GLIBCXX_3.4.23
>  
> FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_Alloc_hiderC2EPwRKS3_@@GLIBCXX_3.4.21
>  
> FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_M_constructEjw@@GLIBCXX_3.4.21
> +FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_M_constructILb0EEEvPKwj@@GLIBCXX_3.4.34
> +FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_M_constructILb1EEEvPKwj@@GLIBCXX_3.4.34
>  
> FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_M_constructIN9__gnu_cxx17__normal_iteratorIPKwS4_vT_SB_St20forward_iterator_tag@@GLIBCXX_3.4.21
>  
> FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_M_constructIN9__gnu_cxx17__normal_iteratorIPwS4_vT_SA_St20forward_iterator_tag@@GLIBCXX_3.4.21
>  
> FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_M_constructIPKwEEvT_S8_St20forward_iterator_tag@@GLIBCXX_3.4.21
> @@ -4131,6 +4139,8 @@ 
> FUNC:_ZNSt8__detail15_List_node_base11_M_transferEPS0_S1_@@GLIBCXX_3.4.15
>  FUNC:_ZNSt8__detail15_List_node_base4swapERS0_S1_@@GLIBCXX_3.4.15
>  FUNC:_ZNSt8__detail15_List_node_base7_M_hookEPS0_@@GLIBCXX_3.4.15
>  FUNC:_ZNSt8__detail15_List_node_base9_M_unhookEv@@GLIBCXX_3.4.15
> +FUNC:_ZNSt8__format25__locale_encoding_to_utf8ERKSt6localeSt17basic_string_viewIcSt11char_traitsIcEEPv@@GLIBCXX_3.4.34
> +FUNC:_ZNSt8__format26__with_encoding_conversionERKSt6locale@@GLIBCXX_3.4.34
>  FUNC:_ZNSt8bad_castD0Ev@@GLIBCXX_3.4
>  FUNC:_ZNSt8bad_castD1Ev@@GLIBCXX_3.4
>  FUNC:_ZNSt8bad_castD2Ev@@GLIBCXX_3.4
> @@ -4858,6 +4868,7 @@ OBJECT:0:GLIBCXX_3.4.30
>  OBJECT:0:GLIBCXX_3.4.31
>  OBJECT:0:GLIBCXX_3.4.32
>  OBJECT:0:GLIBCXX_3.4.33
> +OBJECT:0:GLIBCXX_3.4.34
>  OBJECT:0:GLIBCXX_3.4.4
>  OBJECT:0:GLIBCXX_3.4.5
>  OBJECT:0:GLIBCXX_3.4.6
> diff --git 
> a/libstdc++-v3/config/abi/post/powerpc64-linux-gnu/32/baseline_symbols.txt 
> b/libstdc++-v3/config/abi/post/powerpc64-linux-gnu/32/baseline_symbols.txt
> index c38386543b6..b8b27d0a91b 100644
> --- a/l

[COMMITTED] testsuite: Skip g++.dg/eh/pr119507.C on Solaris/SPARC with as

2025-04-23 Thread Rainer Orth
The new g++.dg/eh/pr119507.C test FAILs on Solaris/SPARC with the native as:

FAIL: g++.dg/eh/pr119507.C  -std=gnu++17  scan-assembler-times .section[\\t 
][^\\n]*.gcc_except_table._Z6comdatv 1
FAIL: g++.dg/eh/pr119507.C  -std=gnu++17  scan-assembler-times .section[\\t 
][^\\n]*.gcc_except_table._Z7comdat1v 1
FAIL: g++.dg/eh/pr119507.C  -std=gnu++26  scan-assembler-times .section[\\t 
][^\\n]*.gcc_except_table._Z6comdatv 1
FAIL: g++.dg/eh/pr119507.C  -std=gnu++26  scan-assembler-times .section[\\t 
][^\\n]*.gcc_except_table._Z7comdat1v 1
FAIL: g++.dg/eh/pr119507.C  -std=gnu++98  scan-assembler-times .section[\\t 
][^\\n]*.gcc_except_table._Z6comdatv 1
FAIL: g++.dg/eh/pr119507.C  -std=gnu++98  scan-assembler-times .section[\\t 
][^\\n]*.gcc_except_table._Z7comdat1v 1

This happens because the syntax for COMDAT sections is vastly different
from the one used by gas.

Rather than trying to handle this, this patch just skips the test.

Tested on sparc-sun-solaris2.11 with both as and gas,
i386-pc-solaris2.11, and x86_64-pc-linux-gnu.

Committed to trunk.

Rainer

-- 
-
Rainer Orth, Center for Biotechnology, Bielefeld University


2025-04-23  Rainer Orth  

gcc/testsuite:
* g++.dg/eh/pr119507.C: Skip on sparc*-*-solaris2* && !gas.

# HG changeset patch
# Parent  ad8df6a561fc43899b59a2d336a080d06f7c38c5
testsuite: Skip g++.dg/eh/pr119507.C on Solaris/SPARC with as

diff --git a/gcc/testsuite/g++.dg/eh/pr119507.C b/gcc/testsuite/g++.dg/eh/pr119507.C
--- a/gcc/testsuite/g++.dg/eh/pr119507.C
+++ b/gcc/testsuite/g++.dg/eh/pr119507.C
@@ -1,6 +1,8 @@
 // { dg-do compile { target comdat_group } }
 // ARM EABI has its own exception handling data handling and does not use gcc_except_table
 // { dg-skip-if "!TARGET_EXCEPTION_DATA" { arm_eabi } }
+// Solaris/SPARC as uses a widely different COMDAT section syntax.
+// { dg-skip-if "Solaris/SPARC as syntax" { sparc*-*-solaris2* && { ! gas } } }
 // Force off function sections
 // Force on exceptions
 // { dg-options "-fno-function-sections -fexceptions" }


Re: [PATCH] libstdc++: Minimalize temporary allocations when width is specified [PR109162]

2025-04-23 Thread Tomasz Kaminski
On Wed, Apr 23, 2025 at 2:03 PM Tomasz Kamiński  wrote:

> When width parameter is specified for formatting range, tuple or escaped
> presentation of string, we used to format characters to temporary string,
> and write produce sequence padded according to the spec. However, once the
> estimated width of formatted representation of input is larger than the
> value
> of spec width, it can be written directly to the output. This limits size
> of
> required allocation, especially for large ranges.
>
> Similarly, if precision (maximum) width is provided for string
> presentation,
> on a prefix of sequence with estimated width not greater than precision,
> needs
> to be buffered.
>
> To realize above, this commit implements a new _Padding_sink
> specialization.
> This sink holds an output iterator, a value of padding width, (optionally)
> maximum width and a string buffer inherited from _Str_sink.
> Then any incoming characters are treated in one of following ways,
> depending of
> estimated width W of written sequence:
> * written to string if W is smaller than padding width and maximum width
> (if present)
> * ignored, if W is greater than maximum width
> * written to output iterator, if W is greater than padding width
>
> The padding sink is used instead of _Str_sink in __format::__format_padded,
> __formatter_str::_M_format_escaped functions.
>
> Furthermore __formatter_str::_M_format implementation was reworked, to:
> * reduce number of instantiations by delegating to _Rg& and const _Rg&
> overloads,
> * non-debug presentation is written to _Out directly or via _Padding_sink
> * if maximum width is specified for debug format with non-unicode encoding,
>   string size is limited to that number.
>
> PR libstdc++/109162
>
> libstdc++-v3/ChangeLog:
>
> * include/bits/formatfwd.h (__simply_formattable_range): Moved from
> std/format.
> * include/std/format (__formatter_str::_format): Extracted escaped
> string handling to separate method...
> (__formatter_str::_M_format_escaped): Use __Padding_sink.
> (__formatter_str::_M_format): Adjusted implementation.
> (__formatter_str::_S_trunc): Extracted as namespace function...
> (__format::_truncate): Extracted from __formatter_str::_S_trunc.
> (__format::_Seq_sink): Removed forward declarations, made members
> protected and non-final.
> (_Seq_sink::_M_trim): Define.
> (_Seq_sink::_M_span): Renamed from view.
> (_Seq_sink::view): Returns string_view instead of span.
> (__format::_Str_sink): Moved after _Seq_sink.
> (__format::__format_padded): Use _Padding_sink.
> * testsuite/std/format/debug.cc: Add timeout and new tests.
> * testsuite/std/format/ranges/sequence.cc: Specify unicode as
> encoding and new tests.
> * testsuite/std/format/ranges/string.cc: Likewise.
> * testsuite/std/format/tuple.cc: Likewise.
> ---
> This is for sure 16 material, and nothing to backport.
> This addressed the TODO I created in __format_padded.
> OK for trunk after 15.1?
>
>  libstdc++-v3/include/bits/formatfwd.h |   8 +
>  libstdc++-v3/include/std/format   | 396 +-
>  libstdc++-v3/testsuite/std/format/debug.cc| 386 -
>  .../testsuite/std/format/ranges/sequence.cc   | 116 +
>  .../testsuite/std/format/ranges/string.cc |  63 +++
>  libstdc++-v3/testsuite/std/format/tuple.cc|  93 
>  6 files changed, 957 insertions(+), 105 deletions(-)
>
> diff --git a/libstdc++-v3/include/bits/formatfwd.h
> b/libstdc++-v3/include/bits/formatfwd.h
> index 9ba658b078a..2d54ee5d30b 100644
> --- a/libstdc++-v3/include/bits/formatfwd.h
> +++ b/libstdc++-v3/include/bits/formatfwd.h
> @@ -131,6 +131,14 @@ namespace __format
>= ranges::input_range
>   && formattable, _CharT>;
>
> +  // _Rg& and const _Rg& are both formattable and use same formatter
> +  // specialization for their references.
> +  template
> +concept __simply_formattable_range
> +  = __const_formattable_range<_Rg, _CharT>
> + && same_as>,
> +remove_cvref_t>>;
> +
>template
>  using __maybe_const_range
>= __conditional_t<__const_formattable_range<_Rg, _CharT>, const
> _Rg, _Rg>;
> diff --git a/libstdc++-v3/include/std/format
> b/libstdc++-v3/include/std/format
> index 7d3067098be..355db5f2a60 100644
> --- a/libstdc++-v3/include/std/format
> +++ b/libstdc++-v3/include/std/format
> @@ -56,7 +56,7 @@
>  #include   // input_range, range_reference_t
>  #include   // subrange
>  #include  // ranges::copy
> -#include  // back_insert_iterator
> +#include  // back_insert_iterator, counted_iterator
>  #include  // __is_pair
>  #include   // __is_scalar_value, _Utf_view, etc.
>  #include   // tuple_size_v
> @@ -99,19 +99,12 @@ namespace __format
>
>// Size for stack located buffer
>template
> -  constexpr size_t __stackbuf_size = 32 * si

[PATCH] libstdc++: Update baseline symbols for powerpc-linux and powerpc64-linux

2025-04-23 Thread Andreas Schwab
* config/abi/post/powerpc-linux-gnu/baseline_symbols.txt: Update.
* config/abi/post/powerpc64-linux-gnu/32/baseline_symbols.txt: Update.
* config/abi/post/powerpc64-linux-gnu/baseline_symbols.txt: Update.
---
 .../abi/post/powerpc-linux-gnu/baseline_symbols.txt   | 11 +++
 .../post/powerpc64-linux-gnu/32/baseline_symbols.txt  | 11 +++
 .../abi/post/powerpc64-linux-gnu/baseline_symbols.txt | 11 +++
 3 files changed, 33 insertions(+)

diff --git 
a/libstdc++-v3/config/abi/post/powerpc-linux-gnu/baseline_symbols.txt 
b/libstdc++-v3/config/abi/post/powerpc-linux-gnu/baseline_symbols.txt
index c38386543b6..b8b27d0a91b 100644
--- a/libstdc++-v3/config/abi/post/powerpc-linux-gnu/baseline_symbols.txt
+++ b/libstdc++-v3/config/abi/post/powerpc-linux-gnu/baseline_symbols.txt
@@ -2270,6 +2270,10 @@ 
FUNC:_ZNSt12__shared_ptrINSt10filesystem7__cxx114_DirELN9__gnu_cxx12_Lock_policy
 
FUNC:_ZNSt12__shared_ptrINSt10filesystem7__cxx114_DirELN9__gnu_cxx12_Lock_policyE2EEC2EOS5_@@GLIBCXX_3.4.28
 
FUNC:_ZNSt12__shared_ptrINSt10filesystem7__cxx114_DirELN9__gnu_cxx12_Lock_policyE2EEC2Ev@@GLIBCXX_3.4.27
 
FUNC:_ZNSt12__shared_ptrINSt10filesystem7__cxx114_DirELN9__gnu_cxx12_Lock_policyE2EEaSEOS5_@@GLIBCXX_3.4.26
+FUNC:_ZNSt12__sso_stringC1Ev@@GLIBCXX_3.4.34
+FUNC:_ZNSt12__sso_stringC2Ev@@GLIBCXX_3.4.34
+FUNC:_ZNSt12__sso_stringD1Ev@@GLIBCXX_3.4.34
+FUNC:_ZNSt12__sso_stringD2Ev@@GLIBCXX_3.4.34
 FUNC:_ZNSt12bad_weak_ptrD0Ev@@GLIBCXX_3.4.15
 FUNC:_ZNSt12bad_weak_ptrD1Ev@@GLIBCXX_3.4.15
 FUNC:_ZNSt12bad_weak_ptrD2Ev@@GLIBCXX_3.4.15
@@ -3411,6 +3415,8 @@ 
FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_Alloc_hiderC1EPcRKS
 
FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_Alloc_hiderC2EPcOS3_@@GLIBCXX_3.4.23
 
FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_Alloc_hiderC2EPcRKS3_@@GLIBCXX_3.4.21
 
FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructEjc@@GLIBCXX_3.4.21
+FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructILb0EEEvPKcj@@GLIBCXX_3.4.34
+FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructILb1EEEvPKcj@@GLIBCXX_3.4.34
 
FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIN9__gnu_cxx17__normal_iteratorIPKcS4_vT_SB_St20forward_iterator_tag@@GLIBCXX_3.4.21
 
FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIN9__gnu_cxx17__normal_iteratorIPcS4_vT_SA_St20forward_iterator_tag@@GLIBCXX_3.4.21
 
FUNC:_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag@@GLIBCXX_3.4.21
@@ -3564,6 +3570,8 @@ 
FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_Alloc_hiderC1EPwRKS
 
FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_Alloc_hiderC2EPwOS3_@@GLIBCXX_3.4.23
 
FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_Alloc_hiderC2EPwRKS3_@@GLIBCXX_3.4.21
 
FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_M_constructEjw@@GLIBCXX_3.4.21
+FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_M_constructILb0EEEvPKwj@@GLIBCXX_3.4.34
+FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_M_constructILb1EEEvPKwj@@GLIBCXX_3.4.34
 
FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_M_constructIN9__gnu_cxx17__normal_iteratorIPKwS4_vT_SB_St20forward_iterator_tag@@GLIBCXX_3.4.21
 
FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_M_constructIN9__gnu_cxx17__normal_iteratorIPwS4_vT_SA_St20forward_iterator_tag@@GLIBCXX_3.4.21
 
FUNC:_ZNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEE12_M_constructIPKwEEvT_S8_St20forward_iterator_tag@@GLIBCXX_3.4.21
@@ -4131,6 +4139,8 @@ 
FUNC:_ZNSt8__detail15_List_node_base11_M_transferEPS0_S1_@@GLIBCXX_3.4.15
 FUNC:_ZNSt8__detail15_List_node_base4swapERS0_S1_@@GLIBCXX_3.4.15
 FUNC:_ZNSt8__detail15_List_node_base7_M_hookEPS0_@@GLIBCXX_3.4.15
 FUNC:_ZNSt8__detail15_List_node_base9_M_unhookEv@@GLIBCXX_3.4.15
+FUNC:_ZNSt8__format25__locale_encoding_to_utf8ERKSt6localeSt17basic_string_viewIcSt11char_traitsIcEEPv@@GLIBCXX_3.4.34
+FUNC:_ZNSt8__format26__with_encoding_conversionERKSt6locale@@GLIBCXX_3.4.34
 FUNC:_ZNSt8bad_castD0Ev@@GLIBCXX_3.4
 FUNC:_ZNSt8bad_castD1Ev@@GLIBCXX_3.4
 FUNC:_ZNSt8bad_castD2Ev@@GLIBCXX_3.4
@@ -4858,6 +4868,7 @@ OBJECT:0:GLIBCXX_3.4.30
 OBJECT:0:GLIBCXX_3.4.31
 OBJECT:0:GLIBCXX_3.4.32
 OBJECT:0:GLIBCXX_3.4.33
+OBJECT:0:GLIBCXX_3.4.34
 OBJECT:0:GLIBCXX_3.4.4
 OBJECT:0:GLIBCXX_3.4.5
 OBJECT:0:GLIBCXX_3.4.6
diff --git 
a/libstdc++-v3/config/abi/post/powerpc64-linux-gnu/32/baseline_symbols.txt 
b/libstdc++-v3/config/abi/post/powerpc64-linux-gnu/32/baseline_symbols.txt
index c38386543b6..b8b27d0a91b 100644
--- a/libstdc++-v3/config/abi/post/powerpc64-linux-gnu/32/baseline_symbols.txt
+++ b/libstdc++-v3/config/abi/post/powerpc64-linux-gnu/32/baseline_symbols.txt
@@ -2270,6 +2270,10 @@ 
FUNC:_ZNSt12__shared_ptrINSt10filesystem7__cxx114_DirELN9__gnu_cxx12_Lock_policy
 
FUNC:_ZNSt12__share

Re: [Fortran, Patch, PR119200, v1] Use correct locus while check()ing coarray functions.

2025-04-23 Thread Andre Vehreschild
Hi Harald,

thanks for the review.

> this is bordering on the obvious and thus OK, except for:

Well, it wasn't so obvious, when was able to add a mistake ;-)

I have fixed that and committed as gcc-16-94-gcc2716a3f52.

Thanks again for the review,
Andre

> 
> @@ -6967,7 +6972,8 @@ gfc_check_ucobound (gfc_expr *coarray, gfc_expr 
> *dim, gfc_expr *kind)
>   {
> if (flag_coarray == GFC_FCOARRAY_NONE)
>   {
> -  gfc_fatal_error ("Coarrays disabled at %C, use %<-fcoarray=%> to 
> enable");
> +  gfc_fatal_error ("Coarrays disabled at L, use %<-fcoarray=%> to 
> enable",
> +  gfc_current_intrinsic_where);
> return false;
>   }
> 
> A percent is missing.  It should read "%L", not "L".
> 
> > This error does not crash gfortran reliably. But valgrind
> > reports an access to uninitialized memory. I therefore do not know how to
> > test this in the testsuite.  
> 
> I don't know a reasonable way to test this either.  There is one
> existing test with dg-error "Coarrays disabled..., but the issue
> addressed here might show up only in an instrumented compiler
> (ASAN or UBSAN?).  And since each message here is emitted by
> gfc_fatal_error(), one could only test one case per testcase.
> (IMHO testing this would be insane.)
> 
> > Regtests ok on x86_64-pc-linux-gnu / F41. Ok for mainline?  
> 
> Yes, this is OK.  Thanks for the patch!
> 
> Harald
> 
> > Regards,
> > Andre  
> 


-- 
Andre Vehreschild * Email: vehre ad gmx dot de 


Re: Help: Re: Questions on replacing a structure pointer reference to a call to .ACCESS_WITH_SIZE in C FE

2025-04-23 Thread Richard Biener
On Tue, Apr 22, 2025 at 5:22 PM Qing Zhao  wrote:
>
> Hi,
>
> I have met the following issue when I tried to implement the following into 
> tree-object-size.cc:
> (And this took me quite some time, still don’t know what’s the best solution)
>
> > On Apr 16, 2025, at 10:46, Qing Zhao  wrote:
> >
> > 3. When generating the reference to the field member in tree-object-size, 
> > we should guard this reference with a checking
> >on the pointer to the structure is valid. i.e:
> >
> > struct annotated {
> >  size_t count;
> >  char array[] __attribute__((counted_by (count)));
> > };
> >
> > static size_t __attribute__((__noinline__)) size_of (struct annotated * obj)
> > {
> >   return __builtin_dynamic_object_size (obj, 1);
> > }
> >
> > When we try to generate the reference to obj->count when evaluating 
> > __builtin_dynamic_object_size (obj, 1),
> > We should generate the following:
> >
> >   If (obj != NULL)
> > * (&obj->count)
> >
> > To make sure that the pointer to the structure object is valid first.
> >
>
> Then as I generate the following size_expr in tree-object-size.cc:
>
> Breakpoint 1, gimplify_size_expressions (osi=0xdf30)
> at ../../latest-gcc-write/gcc/tree-object-size.cc:1178
> 1178   force_gimple_operand (size_expr, &seq, true, NULL);
> (gdb) call debug_generic_expr(size_expr)
> _4 = obj_2(D) != 0B ? (sizetype) (int) MAX_EXPR <(sizetype) MAX_EXPR   [(void *)&*obj_2(D)], 0> + 4, 4> : 18446744073709551615
>
> When calling “force_gimple_operand” for the above size_expr, I got the 
> following ICE in gimplify_modify_expr, at gimplify.cc:7505:

You shouldn't really force_gimple_operand to a MODIFY_EXPR but instead
only to its RHS.

> (gdb) c
> Continuing.
> during GIMPLE pass: objsz
> dump file: a-t.c.110t.objsz1
> In function ‘size_of’:
> cc1: internal compiler error: in gimplify_modify_expr, at gimplify.cc:7505
> 0x36feb67 internal_error(char const*, ...)
> ../../latest-gcc-write/gcc/diagnostic-global-context.cc:517
> 0x36ccd67 fancy_abort(char const*, int, char const*)
> ../../latest-gcc-write/gcc/diagnostic.cc:1749
> 0x14fa8ab gimplify_modify_expr
> ../../latest-gcc-write/gcc/gimplify.cc:7505
> 0x15354c3 gimplify_expr(tree_node**, gimple**, gimple**, bool 
> (*)(tree_node*), int)
> ../../latest-gcc-write/gcc/gimplify.cc:19530
> 0x14fe1b3 gimplify_stmt(tree_node**, gimple**)
> ../../latest-gcc-write/gcc/gimplify.cc:8458
> ….
> 0x1b07757 gimplify_size_expressions
> ../../latest-gcc-write/gcc/tree-object-size.cc:1178
>
> I debugged into this a little bit, and found that the following are the 
> reason for the assertion failure in the routine “gimplify_modify_expr” of 
> gimplify.cc:
>
> 1. The assertion failure is:
>
>  7502   if (gimplify_ctxp->into_ssa && is_gimple_reg (*to_p))
>  7503 {
>  7504   /* We should have got an SSA name from the start.  */
>  7505   gcc_assert (TREE_CODE (*to_p) == SSA_NAME
>  7506   || ! gimple_in_ssa_p (cfun));
>  7507 }
>
> 2. The above assertion failure is issued for the following temporary tree:
>
> (gdb) call debug_generic_expr(*to_p)
> iftmp.2
> (gdb) call debug_generic_expr(*expr_p)
> iftmp.2 = (sizetype) _10
>
> In the above, the temporary variable “iftmp.2” triggered the assertion since 
> it’s NOT a SSA_NAME but the gimple_in_ssa_p (cfun) is TRUE.
>
> 3. As I checked, this temporary variable “iftmp.2” was generated at line 5498 
> in the routine “gimplify_cond_expr” of gimplify.cc:
>
>  5477   /* If this COND_EXPR has a value, copy the values into a temporary 
> within
>  5478  the arms.  */
>  5479   if (!VOID_TYPE_P (type))
>  5480 {
> …..
>  5498   tmp = create_tmp_var (type, "iftmp”);
> ...
>  5537 }
>
> 4. And then later, this temporary created here “iftmp.2” triggered the 
> assertion failure.
>
> Right now, I have the following questions:
>
> 1. Can I generate a size_expr as complicate as the following in 
> tree-object-size.cc:
>
> _4 = obj_2(D) != 0B ? (sizetype) (int) MAX_EXPR <(sizetype) MAX_EXPR   [(void *)&*obj_2(D)], 0> + 4, 4> : 18446744073709551615
>
> 2. If Yes to 1, is this a bug in “gimplify_cond_expr”? Shall we call 
> “make_ssa_name” after the call to “create_tmp_var” if “gimple_in_ssa_p(cfun)” 
> is TRUE?
>
> 3. If No to 1, how can we check whether the pointer is zero before 
> dereference from it to access its field?
>
> Thanks a lot for any hints.
>
> Qing


RE: [PATCH]middle-end: Add new "max" vector cost model

2025-04-23 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Wednesday, April 23, 2025 9:31 AM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Sandiford
> 
> Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> 
> On Wed, 23 Apr 2025, Tamar Christina wrote:
> 
> > Hi All,
> >
> > This patch proposes a new vector cost model called "max".  The cost model 
> > is an
> > intersection between two of our existing cost models.  Like `unlimited` it
> > disables the costing vs scalar and assumes all vectorization to be 
> > profitable.
> >
> > But unlike unlimited it does not fully disable the vector cost model.  That
> > means that we still perform comparisons between vector modes.
> >
> > As an example, the following:
> >
> > void
> > foo (char *restrict a, int *restrict b, int *restrict c,
> >  int *restrict d, int stride)
> > {
> > if (stride <= 1)
> > return;
> >
> > for (int i = 0; i < 3; i++)
> > {
> > int res = c[i];
> > int t = b[i * stride];
> > if (a[i] != 0)
> > res = t * d[i];
> > c[i] = res;
> > }
> > }
> >
> > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
> > vectorize as it assumes scalar would be faster, and with
> > -fvect-cost-model=unlimited it picks a vector type that's so big that the 
> > large
> > sequence generated is working on mostly inactive lanes:
> >
> > ...
> > and p3.b, p3/z, p4.b, p4.b
> > whilelo p0.s, wzr, w7
> > ld1wz23.s, p3/z, [x3, #3, mul vl]
> > ld1wz28.s, p0/z, [x5, z31.s, sxtw 2]
> > add x0, x5, x0
> > punpklo p6.h, p6.b
> > ld1wz27.s, p4/z, [x0, z31.s, sxtw 2]
> > and p6.b, p6/z, p0.b, p0.b
> > punpklo p4.h, p7.b
> > ld1wz24.s, p6/z, [x3, #2, mul vl]
> > and p4.b, p4/z, p2.b, p2.b
> > uqdecw  w6
> > ld1wz26.s, p4/z, [x3]
> > whilelo p1.s, wzr, w6
> > mul z27.s, p5/m, z27.s, z23.s
> > ld1wz29.s, p1/z, [x4, z31.s, sxtw 2]
> > punpkhi p7.h, p7.b
> > mul z24.s, p5/m, z24.s, z28.s
> > and p7.b, p7/z, p1.b, p1.b
> > mul z26.s, p5/m, z26.s, z30.s
> > ld1wz25.s, p7/z, [x3, #1, mul vl]
> > st1wz27.s, p3, [x2, #3, mul vl]
> > mul z25.s, p5/m, z25.s, z29.s
> > st1wz24.s, p6, [x2, #2, mul vl]
> > st1wz25.s, p7, [x2, #1, mul vl]
> > st1wz26.s, p4, [x2]
> > ...
> >
> > With -fvect-cost-model=max you get more reasonable code:
> >
> > foo:
> > cmp w4, 1
> > ble .L1
> > ptrue   p7.s, vl3
> > index   z0.s, #0, w4
> > ld1bz29.s, p7/z, [x0]
> > ld1wz30.s, p7/z, [x1, z0.s, sxtw 2]
> > ptrue   p6.b, all
> > cmpne   p7.b, p7/z, z29.b, #0
> > ld1wz31.s, p7/z, [x3]
> > mul z31.s, p6/m, z31.s, z30.s
> > st1wz31.s, p7, [x2]
> > .L1:
> > ret
> >
> > This model has been useful internally for performance exploration and cost-
> model
> > validation.  It allows us to force realistic vectorization overriding the 
> > cost
> > model to be able to tell whether it's correct wrt to profitability.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > -m32, -m64 and no issues.
> >
> > Ok for master?
> 
> Hmm.  I don't like another cost model.  Instead how about changing
> 'unlimited' to still iterate through vector sizes?  Cost modeling
> is really about vector vs. scalar, not vector vs. vector which is
> completely under target control.  Targets should provide a way
> to limit iteration, like aarch64 has with the aarch64-autovec-preference
> --param or x86 has with -mprefer-vector-width.
> 

I'm ok with changing 'unlimited' if that's preferred, but I do want to point
out that we don't have enough control with current --param or -m options
to simulate all cases.

For instance for SVE there's way for us to force a smaller type to be used
and thus force an unpacking to happen.  Or there's no way to force an
unrolling with Adv. SIMD.

Basically there's not enough control over the VF to exercise some tests
reliably.  Some tests explicitly relied on unlimited just picking the first
mode.

Thanks,
Tamar

> Of course changing 'unlimited' might result in somewhat of a testsuite
> churn, but still the fix there would be to inject a proper -mXYZ
> or --param to get the old behavior back (or even consider cycling
> through the different aarch64-autovec-preference settings for the
> testsuite).
> 
> Richard.
> 
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > * common.opt (vect-cost-model, simd-cost-model): Add max cost model.
> > * doc/invoke.texi: Document it.
> > * flag-types.h (enum vect_cost_model): Add VECT_COST_MODEL_MAX.
> > * tree-vect-data-refs.cc (vect_peeling_hash_in

Re: [PATCH] Document AArch64 changes for GCC 15

2025-04-23 Thread Richard Sandiford
Evgeny Karpov  writes:
> Tuesday, April 22, 2025
> "Richard Sandiford"  wrote:
>
>> +  Support has been added for the AArch64 MinGW target
>> +    (aarch64-w64-mingw32).  At present, this target only
>> +    supports C, but further work is planned.
>> +  
>
> Thank you for the release summary for AArch64 and for mentioning the new 
> aarch64-w64-mingw32 target.
> Here is some clarification about the current upstream changes for 
> aarch64-w64-mingw32 and 
> the upstreaming status for the next release cycle:
>
> - C and C++ languages are supported.
> - 605k tests were executed with a pass rate of 89%.
> - The main tested architecture is armv8-a.
> - Optional extensions, such as SVE, are not supported.
> - SEH and variadic functions are not included in this release and are still 
> under review for upstreaming.
>
> armv8-a has been used as the main testing architecture for the new target.
> C++ code with exceptions can be compiled, and it works until an exception 
> needs to be handled.
> The current SEH patch series for binutils and GCC covers all SEH cases in the 
> Boost library testing.
> Once they are upstreamed, the Boost library test results will be very close 
> for x64 and aarch64.
>
> Variadic functions mostly work, however some changes need to be upstreamed to 
> support all cases.

Thanks the summary.  Does the entry below look ok?

  Support has been added for the AArch64 MinGW target
(aarch64-w64-mingw32).  At present, this target
supports C and C++ for base Armv8-A, but with some caveats:

  Although most variadic functions work, the implementation
of them is not yet complete.
  
  C++ exception handling is not yet implemented.

Further work is planned for GCC 16.
  

Richard


Re: [PATCH]middle-end: Add new "max" vector cost model

2025-04-23 Thread Richard Biener
On Wed, 23 Apr 2025, Tamar Christina wrote:

> Hi All,
> 
> This patch proposes a new vector cost model called "max".  The cost model is 
> an
> intersection between two of our existing cost models.  Like `unlimited` it
> disables the costing vs scalar and assumes all vectorization to be profitable.
> 
> But unlike unlimited it does not fully disable the vector cost model.  That
> means that we still perform comparisons between vector modes.
> 
> As an example, the following:
> 
> void
> foo (char *restrict a, int *restrict b, int *restrict c,
>  int *restrict d, int stride)
> {
> if (stride <= 1)
> return;
> 
> for (int i = 0; i < 3; i++)
> {
> int res = c[i];
> int t = b[i * stride];
> if (a[i] != 0)
> res = t * d[i];
> c[i] = res;
> }
> }
> 
> compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
> vectorize as it assumes scalar would be faster, and with
> -fvect-cost-model=unlimited it picks a vector type that's so big that the 
> large
> sequence generated is working on mostly inactive lanes:
> 
> ...
> and p3.b, p3/z, p4.b, p4.b
> whilelo p0.s, wzr, w7
> ld1wz23.s, p3/z, [x3, #3, mul vl]
> ld1wz28.s, p0/z, [x5, z31.s, sxtw 2]
> add x0, x5, x0
> punpklo p6.h, p6.b
> ld1wz27.s, p4/z, [x0, z31.s, sxtw 2]
> and p6.b, p6/z, p0.b, p0.b
> punpklo p4.h, p7.b
> ld1wz24.s, p6/z, [x3, #2, mul vl]
> and p4.b, p4/z, p2.b, p2.b
> uqdecw  w6
> ld1wz26.s, p4/z, [x3]
> whilelo p1.s, wzr, w6
> mul z27.s, p5/m, z27.s, z23.s
> ld1wz29.s, p1/z, [x4, z31.s, sxtw 2]
> punpkhi p7.h, p7.b
> mul z24.s, p5/m, z24.s, z28.s
> and p7.b, p7/z, p1.b, p1.b
> mul z26.s, p5/m, z26.s, z30.s
> ld1wz25.s, p7/z, [x3, #1, mul vl]
> st1wz27.s, p3, [x2, #3, mul vl]
> mul z25.s, p5/m, z25.s, z29.s
> st1wz24.s, p6, [x2, #2, mul vl]
> st1wz25.s, p7, [x2, #1, mul vl]
> st1wz26.s, p4, [x2]
> ...
> 
> With -fvect-cost-model=max you get more reasonable code:
> 
> foo:
> cmp w4, 1
> ble .L1
> ptrue   p7.s, vl3
> index   z0.s, #0, w4
> ld1bz29.s, p7/z, [x0]
> ld1wz30.s, p7/z, [x1, z0.s, sxtw 2]
>   ptrue   p6.b, all
> cmpne   p7.b, p7/z, z29.b, #0
> ld1wz31.s, p7/z, [x3]
>   mul z31.s, p6/m, z31.s, z30.s
> st1wz31.s, p7, [x2]
> .L1:
> ret
> 
> This model has been useful internally for performance exploration and 
> cost-model
> validation.  It allows us to force realistic vectorization overriding the cost
> model to be able to tell whether it's correct wrt to profitability.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu,
> arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> -m32, -m64 and no issues.
> 
> Ok for master?

Hmm.  I don't like another cost model.  Instead how about changing
'unlimited' to still iterate through vector sizes?  Cost modeling
is really about vector vs. scalar, not vector vs. vector which is
completely under target control.  Targets should provide a way
to limit iteration, like aarch64 has with the aarch64-autovec-preference
--param or x86 has with -mprefer-vector-width.

Of course changing 'unlimited' might result in somewhat of a testsuite
churn, but still the fix there would be to inject a proper -mXYZ
or --param to get the old behavior back (or even consider cycling
through the different aarch64-autovec-preference settings for the
testsuite).

Richard.

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>   * common.opt (vect-cost-model, simd-cost-model): Add max cost model.
>   * doc/invoke.texi: Document it.
>   * flag-types.h (enum vect_cost_model): Add VECT_COST_MODEL_MAX.
>   * tree-vect-data-refs.cc (vect_peeling_hash_insert,
>   vect_peeling_hash_choose_best_peeling,
>   vect_enhance_data_refs_alignment): Use it.
>   * tree-vect-loop.cc (vect_analyze_loop_costing,
>   vect_estimate_min_profitable_iters): Likewise.
> 
> ---
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 
> 88d987e6ab14d9f8df7aa686efffc43418dbb42d..bd5e2e951f9388b12206d9addc736e336cd0e4ee
>  100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -3442,11 +3442,11 @@ Enable basic block vectorization (SLP) on trees.
>  
>  fvect-cost-model=
>  Common Joined RejectNegative Enum(vect_cost_model) Var(flag_vect_cost_model) 
> Init(VECT_COST_MODEL_DEFAULT) Optimization
> --fvect-cost-model=[unlimited|dynamic|cheap|very-cheap]   Specifies the 
> cost model for vectorization.
> +-fvect-cost-model=[unlimited|max|dynamic|cheap|very-cheap]   Specifies the 
> cost model for vectorization.
>  
>  fsimd-cost-model=
>  Common Joined RejectNegative Enum(vect_cost_model) Var(flag_simd_cost_model) 
> In

Re: [PATCH] Document AArch64 changes for GCC 15

2025-04-23 Thread Richard Sandiford
Andrew Pinski  writes:
> On Tue, Apr 22, 2025 at 5:32 AM Richard Sandiford
>  wrote:
>>
>> The list is structured as:
>>
>> - new configurations
>> - command-line changes
>> - ACLE changes
>> - everything else
>>
>> As usual, the list of new architectures, CPUs, and features is from a
>> purely mechanical trawl of the associated .def files.  I've identified
>> features by their architectural name to try to improve searchability.
>> Similarly, the list of ACLE changes includes the associated ACLE
>> feature macros, again to try to improve searchability.
>>
>> The list summarises some of the target-specific optimisations because
>> it sounded like Tamar had received feedback that people found such
>> information interesting.
>>
>> I've used the passive tense for most entries, to try to follow the
>> style used elsewhere.
>>
>> We don't yet define __ARM_FEATURE_FAMINMAX, but I'll fix that
>> separately.
>>
>> How does this look?  Anything I missed?
>
> I don't see a mention that even if falkor and saphira support still
> exists, the tuning for them are mostly removed.
> (scheduler and the tag collision pass was removed).

Ah, yeah, that was deliberate.  My take was that, if anyone is going
to care, we shouldn't have removed it.  And if no-one is going to care,
there's no point mentioning it.

> Maybe a mention that the pre-RA scheduler is disabled at -O2? (I am
> not 100% sure this should be mentioned).

Oops, yes, thanks for the spot.  I forgot to go through
gcc/common/config/aarch64 properly.

Richard

>
> Those are the only 2 I saw missing.
>
> Thanks,
> Andrew Pinski
>
>>
>> I'll leave a few days for comments.
>>
>> Thanks,
>> Richard
>>
>> ---
>>  htdocs/gcc-15/changes.html | 241 -
>>  1 file changed, 240 insertions(+), 1 deletion(-)
>>
>> diff --git a/htdocs/gcc-15/changes.html b/htdocs/gcc-15/changes.html
>> index f03e29c8..dee476c7 100644
>> --- a/htdocs/gcc-15/changes.html
>> +++ b/htdocs/gcc-15/changes.html
>> @@ -681,7 +681,246 @@ asm (".text; %cc0: mov %cc2, %%r0; .previous;"
>>  
>>  New Targets and Target Specific Improvements
>>
>> -
>> +AArch64
>> +
>> +
>> +  Support has been added for the AArch64 MinGW target
>> +(aarch64-w64-mingw32).  At present, this target only
>> +supports C, but further work is planned.
>> +  
>> +
>> +  The following architecture level is now supported by
>> +-march and related source-level constructs
>> +(GCC identifiers in parentheses):
>> +
>> +  Armv9.5-A (arm9.5-a)
>> +
>> +  
>> +  The following CPUs are now supported by -mcpu,
>> +-mtune, and related source-level constructs
>> +(GCC identifiers in parentheses):
>> +
>> +  Apple A12 (apple-a12)
>> +  Apple M1 (apple-m1)
>> +  Apple M2 (apple-m2)
>> +  Apple M3 (apple-m3)
>> +  Arm Cortex-A520AE (cortex-a520ae)
>> +  Arm Cortex-A720AE (cortex-a720ae)
>> +  Arm Cortex-A725 (cortex-a725)
>> +  Arm Cortex-R82AE (cortex-r82ae)
>> +  Arm Cortex-X925 (cortex-x925)
>> +  Arm Neoverse N3 (neoverse-n3)
>> +  Arm Neoverse V3 (neoverse-v3)
>> +  Arm Neoverse V3AE (neoverse-v3ae)
>> +  FUJITSU-MONAKA (fujitsu-monaka)
>> +  NVIDIA Grace (grace)
>> +  NVIDIA Olympus (olympus)
>> +  Qualcomm Oryon-1 (oryon-1)
>> +
>> +  
>> +  The following features are now supported by -march,
>> +-mcpu, and related source-level constructs
>> +(GCC modifiers in parentheses):
>> +
>> +  FEAT_CPA (+cpa), enabled by default for
>> +Arm9.5-A and above
>> +  
>> +  FEAT_FAMINMAX (+faminmax), enabled by default for
>> +Arm9.5-A and above
>> +  
>> +  FEAT_FCMA (+fcma), enabled by default for Armv8.3-A
>> +and above
>> +  
>> +  FEAT_FLAGM2 (+flagm2), enabled by default for
>> +Armv8.5-A and above
>> +  
>> +  FEAT_FP8 (+fp8)
>> +  FEAT_FP8DOT2 (+fp8dot2)
>> +  FEAT_FP8DOT4 (+fp8dot4)
>> +  FEAT_FP8FMA (+fp8fma)
>> +  FEAT_FRINTTS (+frintts), enabled by default for
>> +Armv8.5-A and above
>> +  
>> +  FEAT_JSCVT (+jscvt), enabled by default for
>> +Armv8.3-A and above
>> +  
>> +  FEAT_LUT (+lut), enabled by default for
>> +Arm9.5-A and above
>> +  
>> +  FEAT_LRCPC2 (+rcpc2), enabled by default for
>> +Armv8.4-A and above
>> +  
>> +  FEAT_SME_B16B16 (+sme-b16b16)
>> +  FEAT_SME_F16F16 (+sme-f16f16)
>> +  FEAT_SME2p1 (+sme2p1)
>> +  FEAT_SSVE_FP8DOT2 (+ssve-fp8dot2)
>> +  FEAT_SSVE_FP8DOT4 (+ssve-fp8dot4)
>> +  FEAT_SSVE_FP8FMA (+ssve-fp8fma)
>> +  FEAT_SVE_B16B16 (+sve-b16b16)
>> +  FEAT_SVE2p1 (+sve2p1), enabled by default for
>> +Armv9.4-A and above
>> +  
>> +  FEAT_WFXT (+wfxt), enabled by default for
>> +Armv8.7-A and above
>> +  
>> +  FEAT_XS (+xs), enabled by default for
>> +Armv8.7-A and above
>> +  
>> +
>> +The features listed as being enabled by default for Ar

Re: [PATCH]middle-end: Add new "max" vector cost model

2025-04-23 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Biener 
>> Sent: Wednesday, April 23, 2025 9:31 AM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Sandiford
>> 
>> Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
>> 
>> On Wed, 23 Apr 2025, Tamar Christina wrote:
>> 
>> > Hi All,
>> >
>> > This patch proposes a new vector cost model called "max".  The cost model 
>> > is an
>> > intersection between two of our existing cost models.  Like `unlimited` it
>> > disables the costing vs scalar and assumes all vectorization to be 
>> > profitable.
>> >
>> > But unlike unlimited it does not fully disable the vector cost model.  That
>> > means that we still perform comparisons between vector modes.
>> >
>> > As an example, the following:
>> >
>> > void
>> > foo (char *restrict a, int *restrict b, int *restrict c,
>> >  int *restrict d, int stride)
>> > {
>> > if (stride <= 1)
>> > return;
>> >
>> > for (int i = 0; i < 3; i++)
>> > {
>> > int res = c[i];
>> > int t = b[i * stride];
>> > if (a[i] != 0)
>> > res = t * d[i];
>> > c[i] = res;
>> > }
>> > }
>> >
>> > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
>> > vectorize as it assumes scalar would be faster, and with
>> > -fvect-cost-model=unlimited it picks a vector type that's so big that the 
>> > large
>> > sequence generated is working on mostly inactive lanes:
>> >
>> > ...
>> > and p3.b, p3/z, p4.b, p4.b
>> > whilelo p0.s, wzr, w7
>> > ld1wz23.s, p3/z, [x3, #3, mul vl]
>> > ld1wz28.s, p0/z, [x5, z31.s, sxtw 2]
>> > add x0, x5, x0
>> > punpklo p6.h, p6.b
>> > ld1wz27.s, p4/z, [x0, z31.s, sxtw 2]
>> > and p6.b, p6/z, p0.b, p0.b
>> > punpklo p4.h, p7.b
>> > ld1wz24.s, p6/z, [x3, #2, mul vl]
>> > and p4.b, p4/z, p2.b, p2.b
>> > uqdecw  w6
>> > ld1wz26.s, p4/z, [x3]
>> > whilelo p1.s, wzr, w6
>> > mul z27.s, p5/m, z27.s, z23.s
>> > ld1wz29.s, p1/z, [x4, z31.s, sxtw 2]
>> > punpkhi p7.h, p7.b
>> > mul z24.s, p5/m, z24.s, z28.s
>> > and p7.b, p7/z, p1.b, p1.b
>> > mul z26.s, p5/m, z26.s, z30.s
>> > ld1wz25.s, p7/z, [x3, #1, mul vl]
>> > st1wz27.s, p3, [x2, #3, mul vl]
>> > mul z25.s, p5/m, z25.s, z29.s
>> > st1wz24.s, p6, [x2, #2, mul vl]
>> > st1wz25.s, p7, [x2, #1, mul vl]
>> > st1wz26.s, p4, [x2]
>> > ...
>> >
>> > With -fvect-cost-model=max you get more reasonable code:
>> >
>> > foo:
>> > cmp w4, 1
>> > ble .L1
>> > ptrue   p7.s, vl3
>> > index   z0.s, #0, w4
>> > ld1bz29.s, p7/z, [x0]
>> > ld1wz30.s, p7/z, [x1, z0.s, sxtw 2]
>> >ptrue   p6.b, all
>> > cmpne   p7.b, p7/z, z29.b, #0
>> > ld1wz31.s, p7/z, [x3]
>> >mul z31.s, p6/m, z31.s, z30.s
>> > st1wz31.s, p7, [x2]
>> > .L1:
>> > ret
>> >
>> > This model has been useful internally for performance exploration and cost-
>> model
>> > validation.  It allows us to force realistic vectorization overriding the 
>> > cost
>> > model to be able to tell whether it's correct wrt to profitability.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu,
>> > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
>> > -m32, -m64 and no issues.
>> >
>> > Ok for master?
>> 
>> Hmm.  I don't like another cost model.  Instead how about changing
>> 'unlimited' to still iterate through vector sizes?  Cost modeling
>> is really about vector vs. scalar, not vector vs. vector which is
>> completely under target control.  Targets should provide a way
>> to limit iteration, like aarch64 has with the aarch64-autovec-preference
>> --param or x86 has with -mprefer-vector-width.
>> 
>
> I'm ok with changing 'unlimited' if that's preferred, but I do want to point
> out that we don't have enough control with current --param or -m options
> to simulate all cases.
>
> For instance for SVE there's way for us to force a smaller type to be used
> and thus force an unpacking to happen.  Or there's no way to force an
> unrolling with Adv. SIMD.
>
> Basically there's not enough control over the VF to exercise some tests
> reliably.  Some tests explicitly relied on unlimited just picking the first
> mode.

FWIW, adding extra AArch64 --params sounds ok to me.  The ones we have
were just added on an as-needed/as-wanted basis, rather than as an attempt
to be complete.

After the aarch64-autovec-preference backward-compatibility controversy,
we should consider whether what we add is something that is intended for
developers and can be taken away at any time (--param), or whether it's
something that we promise to support going forward (-m).

Thanks,
Richard


RE: [PATCH]middle-end: Add new "max" vector cost model

2025-04-23 Thread Richard Biener
On Wed, 23 Apr 2025, Tamar Christina wrote:

> > -Original Message-
> > From: Richard Biener 
> > Sent: Wednesday, April 23, 2025 9:37 AM
> > To: Tamar Christina 
> > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Sandiford
> > 
> > Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> > 
> > On Wed, 23 Apr 2025, Richard Biener wrote:
> > 
> > > On Wed, 23 Apr 2025, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > This patch proposes a new vector cost model called "max".  The cost 
> > > > model is
> > an
> > > > intersection between two of our existing cost models.  Like `unlimited` 
> > > > it
> > > > disables the costing vs scalar and assumes all vectorization to be 
> > > > profitable.
> > > >
> > > > But unlike unlimited it does not fully disable the vector cost model.  
> > > > That
> > > > means that we still perform comparisons between vector modes.
> > > >
> > > > As an example, the following:
> > > >
> > > > void
> > > > foo (char *restrict a, int *restrict b, int *restrict c,
> > > >  int *restrict d, int stride)
> > > > {
> > > > if (stride <= 1)
> > > > return;
> > > >
> > > > for (int i = 0; i < 3; i++)
> > > > {
> > > > int res = c[i];
> > > > int t = b[i * stride];
> > > > if (a[i] != 0)
> > > > res = t * d[i];
> > > > c[i] = res;
> > > > }
> > > > }
> > > >
> > > > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
> > > > vectorize as it assumes scalar would be faster, and with
> > > > -fvect-cost-model=unlimited it picks a vector type that's so big that 
> > > > the large
> > > > sequence generated is working on mostly inactive lanes:
> > > >
> > > > ...
> > > > and p3.b, p3/z, p4.b, p4.b
> > > > whilelo p0.s, wzr, w7
> > > > ld1wz23.s, p3/z, [x3, #3, mul vl]
> > > > ld1wz28.s, p0/z, [x5, z31.s, sxtw 2]
> > > > add x0, x5, x0
> > > > punpklo p6.h, p6.b
> > > > ld1wz27.s, p4/z, [x0, z31.s, sxtw 2]
> > > > and p6.b, p6/z, p0.b, p0.b
> > > > punpklo p4.h, p7.b
> > > > ld1wz24.s, p6/z, [x3, #2, mul vl]
> > > > and p4.b, p4/z, p2.b, p2.b
> > > > uqdecw  w6
> > > > ld1wz26.s, p4/z, [x3]
> > > > whilelo p1.s, wzr, w6
> > > > mul z27.s, p5/m, z27.s, z23.s
> > > > ld1wz29.s, p1/z, [x4, z31.s, sxtw 2]
> > > > punpkhi p7.h, p7.b
> > > > mul z24.s, p5/m, z24.s, z28.s
> > > > and p7.b, p7/z, p1.b, p1.b
> > > > mul z26.s, p5/m, z26.s, z30.s
> > > > ld1wz25.s, p7/z, [x3, #1, mul vl]
> > > > st1wz27.s, p3, [x2, #3, mul vl]
> > > > mul z25.s, p5/m, z25.s, z29.s
> > > > st1wz24.s, p6, [x2, #2, mul vl]
> > > > st1wz25.s, p7, [x2, #1, mul vl]
> > > > st1wz26.s, p4, [x2]
> > > > ...
> > > >
> > > > With -fvect-cost-model=max you get more reasonable code:
> > > >
> > > > foo:
> > > > cmp w4, 1
> > > > ble .L1
> > > > ptrue   p7.s, vl3
> > > > index   z0.s, #0, w4
> > > > ld1bz29.s, p7/z, [x0]
> > > > ld1wz30.s, p7/z, [x1, z0.s, sxtw 2]
> > > > ptrue   p6.b, all
> > > > cmpne   p7.b, p7/z, z29.b, #0
> > > > ld1wz31.s, p7/z, [x3]
> > > > mul z31.s, p6/m, z31.s, z30.s
> > > > st1wz31.s, p7, [x2]
> > > > .L1:
> > > > ret
> > > >
> > > > This model has been useful internally for performance exploration and 
> > > > cost-
> > model
> > > > validation.  It allows us to force realistic vectorization overriding 
> > > > the cost
> > > > model to be able to tell whether it's correct wrt to profitability.
> > > >
> > > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > > > -m32, -m64 and no issues.
> > > >
> > > > Ok for master?
> > >
> > > Hmm.  I don't like another cost model.  Instead how about changing
> > > 'unlimited' to still iterate through vector sizes?  Cost modeling
> > > is really about vector vs. scalar, not vector vs. vector which is
> > > completely under target control.  Targets should provide a way
> > > to limit iteration, like aarch64 has with the aarch64-autovec-preference
> > > --param or x86 has with -mprefer-vector-width.
> > >
> > > Of course changing 'unlimited' might result in somewhat of a testsuite
> > > churn, but still the fix there would be to inject a proper -mXYZ
> > > or --param to get the old behavior back (or even consider cycling
> > > through the different aarch64-autovec-preference settings for the
> > > testsuite).
> > 
> > Note this will completely remove the ability to reject never profitable
> > vectorizations, so I'm not sure that this is what you'd want in practice.
> > You want to fix cost modeling instead.
> > 
> > So why does it consider the scalar c

Re: [PATCH]middle-end: Add new "max" vector cost model

2025-04-23 Thread Richard Biener
On Wed, 23 Apr 2025, Richard Biener wrote:

> On Wed, 23 Apr 2025, Tamar Christina wrote:
> 
> > Hi All,
> > 
> > This patch proposes a new vector cost model called "max".  The cost model 
> > is an
> > intersection between two of our existing cost models.  Like `unlimited` it
> > disables the costing vs scalar and assumes all vectorization to be 
> > profitable.
> > 
> > But unlike unlimited it does not fully disable the vector cost model.  That
> > means that we still perform comparisons between vector modes.
> > 
> > As an example, the following:
> > 
> > void
> > foo (char *restrict a, int *restrict b, int *restrict c,
> >  int *restrict d, int stride)
> > {
> > if (stride <= 1)
> > return;
> > 
> > for (int i = 0; i < 3; i++)
> > {
> > int res = c[i];
> > int t = b[i * stride];
> > if (a[i] != 0)
> > res = t * d[i];
> > c[i] = res;
> > }
> > }
> > 
> > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
> > vectorize as it assumes scalar would be faster, and with
> > -fvect-cost-model=unlimited it picks a vector type that's so big that the 
> > large
> > sequence generated is working on mostly inactive lanes:
> > 
> > ...
> > and p3.b, p3/z, p4.b, p4.b
> > whilelo p0.s, wzr, w7
> > ld1wz23.s, p3/z, [x3, #3, mul vl]
> > ld1wz28.s, p0/z, [x5, z31.s, sxtw 2]
> > add x0, x5, x0
> > punpklo p6.h, p6.b
> > ld1wz27.s, p4/z, [x0, z31.s, sxtw 2]
> > and p6.b, p6/z, p0.b, p0.b
> > punpklo p4.h, p7.b
> > ld1wz24.s, p6/z, [x3, #2, mul vl]
> > and p4.b, p4/z, p2.b, p2.b
> > uqdecw  w6
> > ld1wz26.s, p4/z, [x3]
> > whilelo p1.s, wzr, w6
> > mul z27.s, p5/m, z27.s, z23.s
> > ld1wz29.s, p1/z, [x4, z31.s, sxtw 2]
> > punpkhi p7.h, p7.b
> > mul z24.s, p5/m, z24.s, z28.s
> > and p7.b, p7/z, p1.b, p1.b
> > mul z26.s, p5/m, z26.s, z30.s
> > ld1wz25.s, p7/z, [x3, #1, mul vl]
> > st1wz27.s, p3, [x2, #3, mul vl]
> > mul z25.s, p5/m, z25.s, z29.s
> > st1wz24.s, p6, [x2, #2, mul vl]
> > st1wz25.s, p7, [x2, #1, mul vl]
> > st1wz26.s, p4, [x2]
> > ...
> > 
> > With -fvect-cost-model=max you get more reasonable code:
> > 
> > foo:
> > cmp w4, 1
> > ble .L1
> > ptrue   p7.s, vl3
> > index   z0.s, #0, w4
> > ld1bz29.s, p7/z, [x0]
> > ld1wz30.s, p7/z, [x1, z0.s, sxtw 2]
> > ptrue   p6.b, all
> > cmpne   p7.b, p7/z, z29.b, #0
> > ld1wz31.s, p7/z, [x3]
> > mul z31.s, p6/m, z31.s, z30.s
> > st1wz31.s, p7, [x2]
> > .L1:
> > ret
> > 
> > This model has been useful internally for performance exploration and 
> > cost-model
> > validation.  It allows us to force realistic vectorization overriding the 
> > cost
> > model to be able to tell whether it's correct wrt to profitability.
> > 
> > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > -m32, -m64 and no issues.
> > 
> > Ok for master?
> 
> Hmm.  I don't like another cost model.  Instead how about changing
> 'unlimited' to still iterate through vector sizes?  Cost modeling
> is really about vector vs. scalar, not vector vs. vector which is
> completely under target control.  Targets should provide a way
> to limit iteration, like aarch64 has with the aarch64-autovec-preference
> --param or x86 has with -mprefer-vector-width.
> 
> Of course changing 'unlimited' might result in somewhat of a testsuite
> churn, but still the fix there would be to inject a proper -mXYZ
> or --param to get the old behavior back (or even consider cycling
> through the different aarch64-autovec-preference settings for the
> testsuite).

Note this will completely remove the ability to reject never profitable
vectorizations, so I'm not sure that this is what you'd want in practice.
You want to fix cost modeling instead.

So why does it consider the scalar code to be faster with =dynamic
and why do you think that's not possible to fix?  Don't we have
per-loop #pragma control to force vectorization here (but maybe that
has the 'unlimited' cost modeling issue)?

Richard.

> Richard.
> 
> > Thanks,
> > Tamar
> > 
> > gcc/ChangeLog:
> > 
> > * common.opt (vect-cost-model, simd-cost-model): Add max cost model.
> > * doc/invoke.texi: Document it.
> > * flag-types.h (enum vect_cost_model): Add VECT_COST_MODEL_MAX.
> > * tree-vect-data-refs.cc (vect_peeling_hash_insert,
> > vect_peeling_hash_choose_best_peeling,
> > vect_enhance_data_refs_alignment): Use it.
> > * tree-vect-loop.cc (vect_analyze_loop_costing,
> > vect_estimate_min_profitable_iters): Likewise.
> > 
> > ---
> > diff --git a/gcc/common.opt b/gc

RE: [PATCH]middle-end: Add new "max" vector cost model

2025-04-23 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Wednesday, April 23, 2025 9:46 AM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Sandiford
> 
> Subject: RE: [PATCH]middle-end: Add new "max" vector cost model
> 
> On Wed, 23 Apr 2025, Tamar Christina wrote:
> 
> > > -Original Message-
> > > From: Richard Biener 
> > > Sent: Wednesday, April 23, 2025 9:37 AM
> > > To: Tamar Christina 
> > > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Sandiford
> > > 
> > > Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> > >
> > > On Wed, 23 Apr 2025, Richard Biener wrote:
> > >
> > > > On Wed, 23 Apr 2025, Tamar Christina wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > This patch proposes a new vector cost model called "max".  The cost 
> > > > > model
> is
> > > an
> > > > > intersection between two of our existing cost models.  Like 
> > > > > `unlimited` it
> > > > > disables the costing vs scalar and assumes all vectorization to be 
> > > > > profitable.
> > > > >
> > > > > But unlike unlimited it does not fully disable the vector cost model. 
> > > > >  That
> > > > > means that we still perform comparisons between vector modes.
> > > > >
> > > > > As an example, the following:
> > > > >
> > > > > void
> > > > > foo (char *restrict a, int *restrict b, int *restrict c,
> > > > >  int *restrict d, int stride)
> > > > > {
> > > > > if (stride <= 1)
> > > > > return;
> > > > >
> > > > > for (int i = 0; i < 3; i++)
> > > > > {
> > > > > int res = c[i];
> > > > > int t = b[i * stride];
> > > > > if (a[i] != 0)
> > > > > res = t * d[i];
> > > > > c[i] = res;
> > > > > }
> > > > > }
> > > > >
> > > > > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails 
> > > > > to
> > > > > vectorize as it assumes scalar would be faster, and with
> > > > > -fvect-cost-model=unlimited it picks a vector type that's so big that 
> > > > > the large
> > > > > sequence generated is working on mostly inactive lanes:
> > > > >
> > > > > ...
> > > > > and p3.b, p3/z, p4.b, p4.b
> > > > > whilelo p0.s, wzr, w7
> > > > > ld1wz23.s, p3/z, [x3, #3, mul vl]
> > > > > ld1wz28.s, p0/z, [x5, z31.s, sxtw 2]
> > > > > add x0, x5, x0
> > > > > punpklo p6.h, p6.b
> > > > > ld1wz27.s, p4/z, [x0, z31.s, sxtw 2]
> > > > > and p6.b, p6/z, p0.b, p0.b
> > > > > punpklo p4.h, p7.b
> > > > > ld1wz24.s, p6/z, [x3, #2, mul vl]
> > > > > and p4.b, p4/z, p2.b, p2.b
> > > > > uqdecw  w6
> > > > > ld1wz26.s, p4/z, [x3]
> > > > > whilelo p1.s, wzr, w6
> > > > > mul z27.s, p5/m, z27.s, z23.s
> > > > > ld1wz29.s, p1/z, [x4, z31.s, sxtw 2]
> > > > > punpkhi p7.h, p7.b
> > > > > mul z24.s, p5/m, z24.s, z28.s
> > > > > and p7.b, p7/z, p1.b, p1.b
> > > > > mul z26.s, p5/m, z26.s, z30.s
> > > > > ld1wz25.s, p7/z, [x3, #1, mul vl]
> > > > > st1wz27.s, p3, [x2, #3, mul vl]
> > > > > mul z25.s, p5/m, z25.s, z29.s
> > > > > st1wz24.s, p6, [x2, #2, mul vl]
> > > > > st1wz25.s, p7, [x2, #1, mul vl]
> > > > > st1wz26.s, p4, [x2]
> > > > > ...
> > > > >
> > > > > With -fvect-cost-model=max you get more reasonable code:
> > > > >
> > > > > foo:
> > > > > cmp w4, 1
> > > > > ble .L1
> > > > > ptrue   p7.s, vl3
> > > > > index   z0.s, #0, w4
> > > > > ld1bz29.s, p7/z, [x0]
> > > > > ld1wz30.s, p7/z, [x1, z0.s, sxtw 2]
> > > > >   ptrue   p6.b, all
> > > > > cmpne   p7.b, p7/z, z29.b, #0
> > > > > ld1wz31.s, p7/z, [x3]
> > > > >   mul z31.s, p6/m, z31.s, z30.s
> > > > > st1wz31.s, p7, [x2]
> > > > > .L1:
> > > > > ret
> > > > >
> > > > > This model has been useful internally for performance exploration and 
> > > > > cost-
> > > model
> > > > > validation.  It allows us to force realistic vectorization overriding 
> > > > > the cost
> > > > > model to be able to tell whether it's correct wrt to profitability.
> > > > >
> > > > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > > > > -m32, -m64 and no issues.
> > > > >
> > > > > Ok for master?
> > > >
> > > > Hmm.  I don't like another cost model.  Instead how about changing
> > > > 'unlimited' to still iterate through vector sizes?  Cost modeling
> > > > is really about vector vs. scalar, not vector vs. vector which is
> > > > completely under target control.  Targets should provide a way
> > > > to limit iteration, like aarch64 has with the aarch64-autovec-preference
> > > > --param or x86 has with -mprefer-vector-width.
> > > >
> > > > Of course changing 'unlimited' might result in somewhat of a

Re: [PATCH 2/3] gimple-fold: Return early for GIMPLE_COND with true/false

2025-04-23 Thread Richard Biener
On Wed, Apr 23, 2025 at 5:59 AM Andrew Pinski  wrote:
>
> To speed up things slightly so not needing to call all the way through
> to match and simplify, we should return early for true/false on GIMPLE_COND.

I think we'd still canonicalize the various forms matched by
gimple_cond_true/false_p
to a standard one - we should go through resimplify2 which should constant fold
the compare and in the end we do gimple_cond_make_true/false.

I'm also not sure it's worth short-cutting this, it shouldn't be common to fold
an already canonical if (0) or if (1), no?

Richard.

> gcc/ChangeLog:
>
> * gimple-fold.cc (fold_stmt_1): For GIMPLE_COND return early
> for true/false.
>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/gimple-fold.cc | 13 ++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc
> index 94d5a1ebbd7..2381a82d2b1 100644
> --- a/gcc/gimple-fold.cc
> +++ b/gcc/gimple-fold.cc
> @@ -6646,12 +6646,19 @@ fold_stmt_1 (gimple_stmt_iterator *gsi, bool inplace, 
> tree (*valueize) (tree),
>break;
>  case GIMPLE_COND:
>{
> +   gcond *gc = as_a  (stmt);
> +   /* If the cond is already true/false, just return false.  */
> +   if (gimple_cond_true_p (gc)
> +   || gimple_cond_false_p (gc))
> + {
> +   fold_undefer_overflow_warnings (false, stmt, 0);
> +   return false;
> + }
> /* Canonicalize operand order.  */
> -   tree lhs = gimple_cond_lhs (stmt);
> -   tree rhs = gimple_cond_rhs (stmt);
> +   tree lhs = gimple_cond_lhs (gc);
> +   tree rhs = gimple_cond_rhs (gc);
> if (tree_swap_operands_p (lhs, rhs))
>   {
> -   gcond *gc = as_a  (stmt);
> gimple_cond_set_lhs (gc, rhs);
> gimple_cond_set_rhs (gc, lhs);
> gimple_cond_set_code (gc,
> --
> 2.43.0
>


[PATCH] Document AArch64 changes for GCC 15

2025-04-23 Thread Evgeny Karpov
Tuesday, April 23, 2025
Richard Sandiford  wrote:
> Thanks the summary.  Does the entry below look ok?
>
>  Support has been added for the AArch64 MinGW target
>(aarch64-w64-mingw32).  At present, this target
>supports C and C++ for base Armv8-A, but with some caveats:
>
>  Although most variadic functions work, the implementation
>of them is not yet complete.
>  
>  C++ exception handling is not yet implemented.
>
>Further work is planned for GCC 16.
>  

Thanks, it looks good. Maybe it is worth mentioning that gdb is not supported 
yet.

Regards,
Evgeny

RE: [PATCH]middle-end: Add new "max" vector cost model

2025-04-23 Thread Tamar Christina
> -Original Message-
> From: Richard Sandiford 
> Sent: Wednesday, April 23, 2025 9:45 AM
> To: Tamar Christina 
> Cc: Richard Biener ; gcc-patches@gcc.gnu.org; nd
> 
> Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> 
> Tamar Christina  writes:
> >> -Original Message-
> >> From: Richard Biener 
> >> Sent: Wednesday, April 23, 2025 9:31 AM
> >> To: Tamar Christina 
> >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Sandiford
> >> 
> >> Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> >>
> >> On Wed, 23 Apr 2025, Tamar Christina wrote:
> >>
> >> > Hi All,
> >> >
> >> > This patch proposes a new vector cost model called "max".  The cost 
> >> > model is
> an
> >> > intersection between two of our existing cost models.  Like `unlimited` 
> >> > it
> >> > disables the costing vs scalar and assumes all vectorization to be 
> >> > profitable.
> >> >
> >> > But unlike unlimited it does not fully disable the vector cost model.  
> >> > That
> >> > means that we still perform comparisons between vector modes.
> >> >
> >> > As an example, the following:
> >> >
> >> > void
> >> > foo (char *restrict a, int *restrict b, int *restrict c,
> >> >  int *restrict d, int stride)
> >> > {
> >> > if (stride <= 1)
> >> > return;
> >> >
> >> > for (int i = 0; i < 3; i++)
> >> > {
> >> > int res = c[i];
> >> > int t = b[i * stride];
> >> > if (a[i] != 0)
> >> > res = t * d[i];
> >> > c[i] = res;
> >> > }
> >> > }
> >> >
> >> > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
> >> > vectorize as it assumes scalar would be faster, and with
> >> > -fvect-cost-model=unlimited it picks a vector type that's so big that 
> >> > the large
> >> > sequence generated is working on mostly inactive lanes:
> >> >
> >> > ...
> >> > and p3.b, p3/z, p4.b, p4.b
> >> > whilelo p0.s, wzr, w7
> >> > ld1wz23.s, p3/z, [x3, #3, mul vl]
> >> > ld1wz28.s, p0/z, [x5, z31.s, sxtw 2]
> >> > add x0, x5, x0
> >> > punpklo p6.h, p6.b
> >> > ld1wz27.s, p4/z, [x0, z31.s, sxtw 2]
> >> > and p6.b, p6/z, p0.b, p0.b
> >> > punpklo p4.h, p7.b
> >> > ld1wz24.s, p6/z, [x3, #2, mul vl]
> >> > and p4.b, p4/z, p2.b, p2.b
> >> > uqdecw  w6
> >> > ld1wz26.s, p4/z, [x3]
> >> > whilelo p1.s, wzr, w6
> >> > mul z27.s, p5/m, z27.s, z23.s
> >> > ld1wz29.s, p1/z, [x4, z31.s, sxtw 2]
> >> > punpkhi p7.h, p7.b
> >> > mul z24.s, p5/m, z24.s, z28.s
> >> > and p7.b, p7/z, p1.b, p1.b
> >> > mul z26.s, p5/m, z26.s, z30.s
> >> > ld1wz25.s, p7/z, [x3, #1, mul vl]
> >> > st1wz27.s, p3, [x2, #3, mul vl]
> >> > mul z25.s, p5/m, z25.s, z29.s
> >> > st1wz24.s, p6, [x2, #2, mul vl]
> >> > st1wz25.s, p7, [x2, #1, mul vl]
> >> > st1wz26.s, p4, [x2]
> >> > ...
> >> >
> >> > With -fvect-cost-model=max you get more reasonable code:
> >> >
> >> > foo:
> >> > cmp w4, 1
> >> > ble .L1
> >> > ptrue   p7.s, vl3
> >> > index   z0.s, #0, w4
> >> > ld1bz29.s, p7/z, [x0]
> >> > ld1wz30.s, p7/z, [x1, z0.s, sxtw 2]
> >> >  ptrue   p6.b, all
> >> > cmpne   p7.b, p7/z, z29.b, #0
> >> > ld1wz31.s, p7/z, [x3]
> >> >  mul z31.s, p6/m, z31.s, z30.s
> >> > st1wz31.s, p7, [x2]
> >> > .L1:
> >> > ret
> >> >
> >> > This model has been useful internally for performance exploration and 
> >> > cost-
> >> model
> >> > validation.  It allows us to force realistic vectorization overriding 
> >> > the cost
> >> > model to be able to tell whether it's correct wrt to profitability.
> >> >
> >> > Bootstrapped Regtested on aarch64-none-linux-gnu,
> >> > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> >> > -m32, -m64 and no issues.
> >> >
> >> > Ok for master?
> >>
> >> Hmm.  I don't like another cost model.  Instead how about changing
> >> 'unlimited' to still iterate through vector sizes?  Cost modeling
> >> is really about vector vs. scalar, not vector vs. vector which is
> >> completely under target control.  Targets should provide a way
> >> to limit iteration, like aarch64 has with the aarch64-autovec-preference
> >> --param or x86 has with -mprefer-vector-width.
> >>
> >
> > I'm ok with changing 'unlimited' if that's preferred, but I do want to point
> > out that we don't have enough control with current --param or -m options
> > to simulate all cases.
> >
> > For instance for SVE there's way for us to force a smaller type to be used
> > and thus force an unpacking to happen.  Or there's no way to force an
> > unrolling with Adv. SIMD.
> >
> > Basically there's not enough control over the VF to exercise some tests
> > reliably.  Some tests explicitly rel

Re: [PATCH 3/3] gimple-fold: Don't replace `bool_var != 0` with `bool_var` inside GIMPLE_COND

2025-04-23 Thread Richard Biener
On Wed, Apr 23, 2025 at 6:00 AM Andrew Pinski  wrote:
>
> Since match and simplify will simplify `bool_var != 0` to just `bool_var` and
> this is inside a GIMPLE_COND, fold_stmt will return true but nothing has 
> changed.
> So let's just reject the replacement if we are replacing with the same 
> simplification
> inside replace_stmt_with_simplification. This can speed up things slightly 
> because
> now fold_stmt won't return true on all GIMPLE_COND with `bool_var != 0` in it.

OK.  Note the same should happen for code == INTEGER_CST where we should
return false if the cond is already gimple_cond_false/true_p (but we
should still
canonicalize, thus make_true/false there or more literally just avoid
returning true
for the exact same form generated).

Richard.

> gcc/ChangeLog:
>
> * gimple-fold.cc (replace_stmt_with_simplification): Return false
> if replacing `bool_var != 0` with `bool_var` in GIMPLE_COND.
>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/gimple-fold.cc | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc
> index 2381a82d2b1..de1f1a44aa3 100644
> --- a/gcc/gimple-fold.cc
> +++ b/gcc/gimple-fold.cc
> @@ -6246,8 +6246,16 @@ replace_stmt_with_simplification (gimple_stmt_iterator 
> *gsi,
>   false, NULL_TREE)))
> gimple_cond_set_condition (cond_stmt, code, ops[0], ops[1]);
>else if (code == SSA_NAME)
> -   gimple_cond_set_condition (cond_stmt, NE_EXPR, ops[0],
> -  build_zero_cst (TREE_TYPE (ops[0])));
> +   {
> + /* If setting the gimple cond to the same thing,
> +return false as nothing changed.  */
> + if (gimple_cond_code (cond_stmt) == NE_EXPR
> + && operand_equal_p (gimple_cond_lhs (cond_stmt), ops[0])
> + && integer_zerop (gimple_cond_rhs (cond_stmt)))
> +   return false;
> + gimple_cond_set_condition (cond_stmt, NE_EXPR, ops[0],
> +build_zero_cst (TREE_TYPE (ops[0])));
> +   }
>else if (code == INTEGER_CST)
> {
>   if (integer_zerop (ops[0]))
> --
> 2.43.0
>


[PATCH] simplify-rtx: Combine bitwise operations in more cases

2025-04-23 Thread Pengfei Li
This patch transforms RTL expressions of the form (subreg (not X) off)
into (not (subreg X off)) when the subreg is an operand of a bitwise AND
or OR. This transformation can expose opportunities to combine a NOT
operation with the bitwise AND/OR.

For example, it improves the codegen of the following AArch64 NEON
intrinsics:
vandq_s64(vreinterpretq_s64_s32(vmvnq_s32(a)),
  vreinterpretq_s64_s32(b));
from:
not v0.16b, v0.16b
and v0.16b, v0.16b, v1.16b
to:
bic v0.16b, v1.16b, v0.16b

Regression tested on x86_64-linux-gnu, arm-linux-gnueabihf and
aarch64-linux-gnu.

gcc/ChangeLog:

* simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
  Add RTX simplification for bitwise AND/OR.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/simd/bic_orn_1.c: New test.
---
 gcc/simplify-rtx.cc   | 24 +++
 .../gcc.target/aarch64/simd/bic_orn_1.c   | 17 +
 2 files changed, 41 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/bic_orn_1.c

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 88d31a71c05..ed620ef5d45 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -3738,6 +3738,18 @@ simplify_context::simplify_binary_operation_1 (rtx_code 
code,
  && rtx_equal_p (XEXP (XEXP (op0, 0), 0), op1))
return simplify_gen_binary (IOR, mode, XEXP (op0, 1), op1);
 
+  /* Convert (ior (subreg (not X) off) Y) into (ior (not (subreg X off)) Y)
+to expose opportunities to combine IOR and NOT.  */
+  if (GET_CODE (op0) == SUBREG
+ && GET_CODE (SUBREG_REG (op0)) == NOT)
+   {
+ rtx new_subreg = gen_rtx_SUBREG (mode,
+  XEXP (SUBREG_REG (op0), 0),
+  SUBREG_BYTE (op0));
+ rtx new_not = simplify_gen_unary (NOT, mode, new_subreg, mode);
+ return simplify_gen_binary (IOR, mode, new_not, op1);
+   }
+
   tem = simplify_byte_swapping_operation (code, mode, op0, op1);
   if (tem)
return tem;
@@ -4274,6 +4286,18 @@ simplify_context::simplify_binary_operation_1 (rtx_code 
code,
return simplify_gen_binary (LSHIFTRT, mode, XEXP (op0, 0), XEXP 
(op0, 1));
}
 
+  /* Convert (and (subreg (not X) off) Y) into (and (not (subreg X off)) Y)
+to expose opportunities to combine AND and NOT.  */
+  if (GET_CODE (op0) == SUBREG
+ && GET_CODE (SUBREG_REG (op0)) == NOT)
+   {
+ rtx new_subreg = gen_rtx_SUBREG (mode,
+  XEXP (SUBREG_REG (op0), 0),
+  SUBREG_BYTE (op0));
+ rtx new_not = simplify_gen_unary (NOT, mode, new_subreg, mode);
+ return simplify_gen_binary (AND, mode, new_not, op1);
+   }
+
   tem = simplify_byte_swapping_operation (code, mode, op0, op1);
   if (tem)
return tem;
diff --git a/gcc/testsuite/gcc.target/aarch64/simd/bic_orn_1.c 
b/gcc/testsuite/gcc.target/aarch64/simd/bic_orn_1.c
new file mode 100644
index 000..1c66f21424e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/simd/bic_orn_1.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#include 
+
+int64x2_t bic_16b (int32x4_t a, int32x4_t b) {
+  return vandq_s64 (vreinterpretq_s64_s32 (vmvnq_s32 (a)),
+   vreinterpretq_s64_s32 (b));
+}
+
+int16x4_t orn_8b (int32x2_t a, int32x2_t b) {
+  return vorr_s16 (vreinterpret_s16_s32 (a),
+  vreinterpret_s16_s32 (vmvn_s32 (b)));
+}
+
+/* { dg-final { scan-assembler {\tbic\tv[0-9]+\.16b} } } */
+/* { dg-final { scan-assembler {\torn\tv[0-9]+\.8b} } } */
-- 
2.43.0



RE: [PATCH v2 1/3] RISC-V: Combine vec_duplicate + vadd.vv to vadd.vx on GR2VR cost

2025-04-23 Thread Li, Pan2
> Ah, I see, thanks.  So vec_dup costs 1 + 2 and vadd.vv costs 1 totalling 4 
> while vadd.vx costs 1 + 2, making it cheaper?

Yes, looks we need to just assign the GR2VR when vec_dup. I also tried diff 
cost here to see
the impact to late-combine.

+  if (rcode == VEC_DUPLICATE && SCALAR_INT_MODE_P (GET_MODE (XEXP (x, 0 {
+cost_val = get_vector_costs ()->regmove->GR2VR;
+  }

 cut line 

If GR2VR is 2, we will perform the combine as below.

 51 trying to combine definition of r135 in:
 5211: r135:RVVM1DI=vec_duplicate(r150:DI)
 53 into:
 5418: r147:RVVM1DI=r146:RVVM1DI+r135:RVVM1DI
 55   REG_DEAD r146:RVVM1DI
 56 successfully matched this instruction to *add_vx_rvvm1di:
 57 (set (reg:RVVM1DI 147 [ vect__6.8_16 ])
 58 (plus:RVVM1DI (vec_duplicate:RVVM1DI (reg:DI 150 [ x ]))
 59 (reg:RVVM1DI 146)))
 60 original cost = 8 + 4 (weighted: 39.483637), replacement cost = 4 
(weighted: 32.363637); keeping replacement
 61 rescanning insn with uid = 18.
 62 updating insn 18 in-place
 63 verify found no changes in insn with uid = 18.
 64 deleting insn 11
 65 deleting insn with uid = 11.

 cut line 

If GR2VR is 1, we will perform the combine as below.

  51   │ trying to combine definition of r135 in:
  52   │11: r135:RVVM1DI=vec_duplicate(r150:DI)
  53   │ into:
  54   │18: r147:RVVM1DI=r146:RVVM1DI+r135:RVVM1DI
  55   │   REG_DEAD r146:RVVM1DI
  56   │ successfully matched this instruction to *add_vx_rvvm1di:
  57   │ (set (reg:RVVM1DI 147 [ vect__6.8_16 ])
  58   │ (plus:RVVM1DI (vec_duplicate:RVVM1DI (reg:DI 150 [ x ]))
  59   │ (reg:RVVM1DI 146)))
  60   │ original cost = 4 + 4 (weighted: 35.923637), replacement cost = 4 
(weighted: 32.363637); keeping replacement
  61   │ rescanning insn with uid = 18.
  62   │ updating insn 18 in-place
  63   │ verify found no changes in insn with uid = 18.
  64   │ deleting insn 11
  65   │ deleting insn with uid = 11.

 cut line 

If GR2VR is 0, it will be normalized to 1 as below, thus the combine log looks 
like the same as above.

return cost > 0 ? cost : COSTS_N_INSNS (1); gcc/rtlanal.cc:5766

it looks like we need to reconcile the vadd.vv cost either here ? Or am I 
missing something here.

> With such a change the tests wouldn't pass by default (AFAICT) and we would 
> need a --param=xx.  I wouldn't worry about exposing those details to the user 
> for now as we're so early in the cycle and can easily iterate on it.  I would
> suggest just adding something in order to make the tests work as expected and 
> change things later (if needed).

I see, will append the --param patch into the series.

Pan


-Original Message-
From: Robin Dapp  
Sent: Wednesday, April 23, 2025 3:01 PM
To: Li, Pan2 ; Robin Dapp ; 
gcc-patches@gcc.gnu.org
Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; jeffreya...@gmail.com; Chen, 
Ken ; Liu, Hongtao ; Robin Dapp 

Subject: Re: [PATCH v2 1/3] RISC-V: Combine vec_duplicate + vadd.vv to vadd.vx 
on GR2VR cost

>> The only thing I think we want for the patch (as Pan also raised last time) 
>> is 
>> the param to set those .vx costs to zero in order to ensure the tests test 
>> the 
>> right thing (--param=vx_preferred/gr2vr_cost or something).
>
> I see, shall we start a new series for this? AFAIK, we may need some more 
> alignment
> for something like --param=xx that exposing to end-user.
>
>> According to patchwork the tests you add pass but shouldn't they actually 
>> fail 
>> with a GR2VR cost of 2?  I must be missing something.
>
> For now the cost of GR2VR is 2, take test vx_vadd-1-i64.c for example, the 
> vec_dup + vadd.vv
> has higher cost than vadd.vx, thus perform the late-combine as below.

Ah, I see, thanks.  So vec_dup costs 1 + 2 and vadd.vv costs 1 totalling 4 
while vadd.vx costs 1 + 2, making it cheaper?

IMHO vec_dup should just cost 2 (=GR2VR) rather than 3.  All it does is 
broadcast (no additional operation), while vadd.vx performs the broadcast (cost 
2) as well as an operation (cost 1).  So vec_dup + vadd.vv should cost 3, the 
same as vadd.vx.  In late combine when comparing costs we scale the them by 
"frequency".  The vadd.vx inside the loop should have higher frequency making 
it more costly by default.

With such a change the tests wouldn't pass by default (AFAICT) and we would 
need a --param=xx.  I wouldn't worry about exposing those details to the user 
for now as we're so early in the cycle and can easily iterate on it.  I would
suggest just adding something in order to make the tests work as expected and 
change things later (if needed).

-- 
Regards
 Robin



Re: [PATCH] Add std::deque shrink_to_fit test

2025-04-23 Thread François Dumont

AFAICT I've never got proper validation for this small patch.

Is it ok to commit ?

Thanks


On 14/04/2025 22:25, François Dumont wrote:



On 14/04/2025 08:29, Tomasz Kaminski wrote:



On Sun, Apr 13, 2025 at 12:13 PM François Dumont 
 wrote:



On 11/04/2025 08:36, Tomasz Kaminski wrote:



On Thu, Apr 10, 2025 at 10:47 PM Jonathan Wakely
 wrote:

On 10/04/25 22:36 +0200, François Dumont wrote:
>After running the test with -fno-exceptions option we
rather need this
>patch.
>
>Ok to commit ?
>
>François
>
>
>On 10/04/2025 21:08, François Dumont wrote:
>>Hi
>>
>>    No problem detected now that we really test std::deque
>>shrink_to_fit implementation.
>>
>>    libstdc++: Add std::deque<>::shrink_to_fit test
>>
>>    The existing test is currently testing std::vector.
Make it test
>>std::deque.
>>
>>    libstdc++-v3/ChangeLog:
>>
>>    *
>>testsuite/23_containers/deque/capacity/shrink_to_fit.cc:
Adapt test
>>    to check std::deque shrink_to_fit method.
>>
>>Tested under Linux x64.
>>
>>Ok to commit ?
>>
>>François

>diff --git
a/libstdc++-v3/testsuite/23_containers/deque/capacity/shrink_to_fit.cc
b/libstdc++-v3/testsuite/23_containers/deque/capacity/shrink_to_fit.cc
>index 7cb67079214..9c8b3a926e8 100644
>---
a/libstdc++-v3/testsuite/23_containers/deque/capacity/shrink_to_fit.cc
>+++
b/libstdc++-v3/testsuite/23_containers/deque/capacity/shrink_to_fit.cc
>@@ -1,4 +1,5 @@
> // { dg-do run { target c++11 } }
>+// { dg-add-options no_pch }

Tests using replacement_memory_operators.h need:

// { dg-require-effective-target std_allocator_new }
// { dg-xfail-run-if "AIX operator new" { powerpc-ibm-aix* } }

See e.g. 23_containers/unordered_set/96088.cc


Thanks, I new I needed to add something like that but then forgot
to amend my test.




>
> // 2010-01-08  Paolo Carlini  
>
>@@ -19,18 +20,39 @@
> // with this library; see the file COPYING3.  If not see
> // .
>
>-#include 
>+#define _GLIBCXX_DEQUE_BUF_SIZE sizeof(int) * 3

Couldn't the test just create more elements, instead of
modifying the
internals? We should test it using the default parameters, no?


Sounds better indeed, done using std::__deque_buf_size extension.



In from_range test I have used a class that contains some padding,
to be able to fill the deque buffer, while still test with few
elements,

struct EightInBuf

{
  EightInBuf(int x) : elems{x}
  { }

 private:
   int elems[512 / (sizeof(int) * 8)];

  friend constexpr bool operator==(EightInBuf const& lhs, int rhs)
  { return lhs.elems[0] == rhs; }
 }; 






It is a nice alternative even if you are still relying on the 512
implementation detail hidden by _GLIBCXX_DEQUE_BUF_SIZE macro.

This is why I preferred to use__deque_buf_size.

When using _GLIBCXX_DEBUG this function will be defined inside 
std::__cxx1998 namespace, not in the std directly.
There is a macro _GLIBCXX_STD_C that can be used to refer to it, but 
it works only inside std namespace.


Fixed in this new version.

Ok to commit ?

François



Re: [PATCH] Consider frequency in cost estimation when converting scalar to vector.

2025-04-23 Thread Hongtao Liu
On Thu, Apr 24, 2025 at 12:50 AM Jan Hubicka  wrote:
>
> > In some benchmark, I notice stv failed due to cost unprofitable, but the 
> > igain
> > is inside the loop, but sse<->integer conversion is outside the loop, 
> > current cost
> > model doesn't consider the frequency of those gain/cost.
> > The patch weights those cost with frequency just like LRA does.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for GCC16?
> >
> > gcc/ChangeLog:
> >
> >   * config/i386/i386-features.cc (scalar_chain::mark_dual_mode_def):
> >   (general_scalar_chain::compute_convert_gain):
> > ---
> >  gcc/config/i386/i386-features.cc | 9 +++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/gcc/config/i386/i386-features.cc 
> > b/gcc/config/i386/i386-features.cc
> > index c35ac24fd8a..ae0844a70c2 100644
> > --- a/gcc/config/i386/i386-features.cc
> > +++ b/gcc/config/i386/i386-features.cc
> > @@ -337,18 +337,20 @@ scalar_chain::mark_dual_mode_def (df_ref def)
> >/* Record the def/insn pair so we can later efficiently iterate over
> >   the defs to convert on insns not in the chain.  */
> >bool reg_new = bitmap_set_bit (defs_conv, DF_REF_REGNO (def));
> > +  unsigned frequency
> > += REG_FREQ_FROM_BB (BLOCK_FOR_INSN (DF_REF_INSN (def)));
>
> I am generally trying to get rid of remaing uses of REG_FREQ since the
> 1 based fixed point arithmetics iot always working that well.
>
> You can do the sums in profile_count type (doing something reasonable
> when count is uninitialized) and then convert it to sreal for the final
> heuristics.
Thanks for the suggestion, let me try.
>
> Typically such code also wants skip scaling by count when optimizing for
> size (since in this case we want to count statically).  Not sure how
> important it is for vector code but I suppose it can happen.
>
> Honza
> >if (!bitmap_bit_p (insns, DF_REF_INSN_UID (def)))
> >  {
> >if (!bitmap_set_bit (insns_conv, DF_REF_INSN_UID (def))
> > && !reg_new)
> >   return;
> > -  n_integer_to_sse++;
> > +  n_integer_to_sse += frequency;
> >  }
> >else
> >  {
> >if (!reg_new)
> >   return;
> > -  n_sse_to_integer++;
> > +  n_sse_to_integer += frequency;
> >  }
> >
> >if (dump_file)
> > @@ -556,6 +558,8 @@ general_scalar_chain::compute_convert_gain ()
> >rtx src = SET_SRC (def_set);
> >rtx dst = SET_DEST (def_set);
> >int igain = 0;
> > +  unsigned frequency
> > + = REG_FREQ_FROM_BB (BLOCK_FOR_INSN (insn));
> >
> >if (REG_P (src) && REG_P (dst))
> >   igain += 2 * m - ix86_cost->xmm_move;
> > @@ -755,6 +759,7 @@ general_scalar_chain::compute_convert_gain ()
> >   }
> >   }
> >
> > +  igain *= frequency;
> >if (igain != 0 && dump_file)
> >   {
> > fprintf (dump_file, "  Instruction gain %d for ", igain);
> > --
> > 2.34.1
> >



-- 
BR,
Hongtao


[committed] testsuite: Require fstack_protector for no-stack-protector-attr-3.C

2025-04-23 Thread Dimitar Dimitrov
The test fails on pru-unknown-elf with:
   cc1plus: warning: '-fstack-protector' not supported for this target

Even though the compiled functions have the feature disabled using an
attribute, the command line option is still not supported by some targets.

Tested x86_64-pc-linux-gnu and ensured that g++.sum is the same with and
without this patch.

Pushed to trunk as obvious.

gcc/testsuite/ChangeLog:

* g++.dg/no-stack-protector-attr-3.C: Require effective target
fstack_protector.

Signed-off-by: Dimitar Dimitrov 
---
 gcc/testsuite/g++.dg/no-stack-protector-attr-3.C | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/testsuite/g++.dg/no-stack-protector-attr-3.C 
b/gcc/testsuite/g++.dg/no-stack-protector-attr-3.C
index 147c2b79f78..b858d706bb5 100644
--- a/gcc/testsuite/g++.dg/no-stack-protector-attr-3.C
+++ b/gcc/testsuite/g++.dg/no-stack-protector-attr-3.C
@@ -6,6 +6,7 @@
 /* { dg-additional-options "-fno-PIE" { target ia32 } } */
 
 /* { dg-do compile { target { ! hppa*-*-* } } } */
+/* { dg-require-effective-target fstack_protector } */
 
 int __attribute__((no_stack_protector)) foo()
 {
-- 
2.49.0



[GCC16 stage1][PATCH v2 1/3] Extend "counted_by" attribute to pointer fields of structures.

2025-04-23 Thread Qing Zhao
For example:
struct PP {
  size_t count2;
  char other1;
  char *array2 __attribute__ ((counted_by (count2)));
  int other2;
} *pp;

specifies that the "array2" is an array that is pointed by the
pointer field, and its number of elements is given by the field
"count2" in the same structure.

gcc/c-family/ChangeLog:

* c-attribs.cc (handle_counted_by_attribute): Accept counted_by
attribute for pointer fields.

gcc/c/ChangeLog:

* c-decl.cc (verify_counted_by_attribute): Change the 2nd argument
to a vector of fields with counted_by attribute. Verify all fields
in this vector.
(finish_struct): Collect all the fields with counted_by attribute
to a vector and pass this vector to verify_counted_by_attribute.

gcc/ChangeLog:

* doc/extend.texi: Extend counted_by attribute to pointer fields in
structures. Add one more requirement to pointers with counted_by
attribute.

gcc/testsuite/ChangeLog:

* gcc.dg/flex-array-counted-by.c: Update test.
* gcc.dg/pointer-counted-by-2.c: New test.
* gcc.dg/pointer-counted-by-3.c: New test.
* gcc.dg/pointer-counted-by.c: New test.
---
 gcc/c-family/c-attribs.cc|  15 ++-
 gcc/c/c-decl.cc  |  91 +++--
 gcc/doc/extend.texi  |  35 -
 gcc/testsuite/gcc.dg/flex-array-counted-by.c |   2 +-
 gcc/testsuite/gcc.dg/pointer-counted-by-2.c  |  10 ++
 gcc/testsuite/gcc.dg/pointer-counted-by-3.c  | 127 +++
 gcc/testsuite/gcc.dg/pointer-counted-by.c|  70 ++
 7 files changed, 299 insertions(+), 51 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pointer-counted-by-2.c
 create mode 100644 gcc/testsuite/gcc.dg/pointer-counted-by-3.c
 create mode 100644 gcc/testsuite/gcc.dg/pointer-counted-by.c

diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc
index 5a0e3d328ba..51d42999578 100644
--- a/gcc/c-family/c-attribs.cc
+++ b/gcc/c-family/c-attribs.cc
@@ -2906,16 +2906,18 @@ handle_counted_by_attribute (tree *node, tree name,
" declaration %q+D", name, decl);
   *no_add_attrs = true;
 }
-  /* This attribute only applies to field with array type.  */
-  else if (TREE_CODE (TREE_TYPE (decl)) != ARRAY_TYPE)
+  /* This attribute only applies to field with array type or pointer type.  */
+  else if (TREE_CODE (TREE_TYPE (decl)) != ARRAY_TYPE
+  && TREE_CODE (TREE_TYPE (decl)) != POINTER_TYPE)
 {
   error_at (DECL_SOURCE_LOCATION (decl),
-   "%qE attribute is not allowed for a non-array field",
-   name);
+   "%qE attribute is not allowed for a non-array"
+   " or non-pointer field", name);
   *no_add_attrs = true;
 }
   /* This attribute only applies to a C99 flexible array member type.  */
-  else if (! c_flexible_array_member_type_p (TREE_TYPE (decl)))
+  else if (TREE_CODE (TREE_TYPE (decl)) == ARRAY_TYPE
+  && !c_flexible_array_member_type_p (TREE_TYPE (decl)))
 {
   error_at (DECL_SOURCE_LOCATION (decl),
"%qE attribute is not allowed for a non-flexible"
@@ -2930,7 +2932,8 @@ handle_counted_by_attribute (tree *node, tree name,
   *no_add_attrs = true;
 }
   /* Issue error when there is a counted_by attribute with a different
- field as the argument for the same flexible array member field.  */
+ field as the argument for the same flexible array member or
+ pointer field.  */
   else if (old_counted_by != NULL_TREE)
 {
   tree old_fieldname = TREE_VALUE (TREE_VALUE (old_counted_by));
diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc
index 8c420f22976..53e7b726ee6 100644
--- a/gcc/c/c-decl.cc
+++ b/gcc/c/c-decl.cc
@@ -9448,56 +9448,62 @@ c_update_type_canonical (tree t)
 }
 }
 
-/* Verify the argument of the counted_by attribute of the flexible array
-   member FIELD_DECL is a valid field of the containing structure,
-   STRUCT_TYPE, Report error and remove this attribute when it's not.  */
+/* Verify the argument of the counted_by attribute of each of the
+   FIELDS_WITH_COUNTED_BY is a valid field of the containing structure,
+   STRUCT_TYPE, Report error and remove the corresponding attribute
+   when it's not.  */
 
 static void
-verify_counted_by_attribute (tree struct_type, tree field_decl)
+verify_counted_by_attribute (tree struct_type,
+auto_vec *fields_with_counted_by)
 {
-  tree attr_counted_by = lookup_attribute ("counted_by",
-  DECL_ATTRIBUTES (field_decl));
-
-  if (!attr_counted_by)
-return;
+  for (tree field_decl : *fields_with_counted_by)
+{
+  tree attr_counted_by = lookup_attribute ("counted_by",
+   DECL_ATTRIBUTES (field_decl));
 
-  /* If there is an counted_by attribute attached to the field,
- verify it.  */
+  if (!attr_counted_by)
+   continue;
 
-  tree

[GCC16 stage1][PATCH v2 2/3] Convert a pointer reference with counted_by attribute to .ACCESS_WITH_SIZE and use it in builtinin-object-size.

2025-04-23 Thread Qing Zhao
gcc/c/ChangeLog:

* c-typeck.cc (build_counted_by_ref): Handle pointers with counted_by.
(build_access_with_size_for_counted_by): Likewise.

gcc/ChangeLog:

* tree-object-size.cc (access_with_size_object_size): Handle pointers
with counted_by.
(collect_object_sizes_for): Likewise.

gcc/testsuite/ChangeLog:

* gcc.dg/pointer-counted-by-4.c: New test.
* gcc.dg/pointer-counted-by-5.c: New test.
* gcc.dg/pointer-counted-by-6.c: New test.
* gcc.dg/pointer-counted-by-7.c: New test.
* gcc.dg/pointer-counted-by-8.c: New test.
---
 gcc/c/c-typeck.cc   | 42 --
 gcc/testsuite/gcc.dg/pointer-counted-by-4.c | 63 +
 gcc/testsuite/gcc.dg/pointer-counted-by-5.c | 48 
 gcc/testsuite/gcc.dg/pointer-counted-by-6.c | 47 +++
 gcc/testsuite/gcc.dg/pointer-counted-by-7.c | 30 ++
 gcc/testsuite/gcc.dg/pointer-counted-by-8.c | 30 ++
 gcc/tree-object-size.cc | 11 +++-
 7 files changed, 251 insertions(+), 20 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pointer-counted-by-4.c
 create mode 100644 gcc/testsuite/gcc.dg/pointer-counted-by-5.c
 create mode 100644 gcc/testsuite/gcc.dg/pointer-counted-by-6.c
 create mode 100644 gcc/testsuite/gcc.dg/pointer-counted-by-7.c
 create mode 100644 gcc/testsuite/gcc.dg/pointer-counted-by-8.c

diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc
index 55d896e02df..7cb19f4a239 100644
--- a/gcc/c/c-typeck.cc
+++ b/gcc/c/c-typeck.cc
@@ -2928,8 +2928,8 @@ should_suggest_deref_p (tree datum_type)
 
 /* For a SUBDATUM field of a structure or union DATUM, generate a REF to
the object that represents its counted_by per the attribute counted_by
-   attached to this field if it's a flexible array member field, otherwise
-   return NULL_TREE.
+   attached to this field if it's a flexible array member or a pointer
+   field, otherwise return NULL_TREE.
Set COUNTED_BY_TYPE to the TYPE of the counted_by field.
For example, if:
 
@@ -2950,7 +2950,9 @@ static tree
 build_counted_by_ref (tree datum, tree subdatum, tree *counted_by_type)
 {
   tree type = TREE_TYPE (datum);
-  if (!c_flexible_array_member_type_p (TREE_TYPE (subdatum)))
+  tree sub_type = TREE_TYPE (subdatum);
+  if (!c_flexible_array_member_type_p (sub_type)
+  && TREE_CODE (sub_type) != POINTER_TYPE)
 return NULL_TREE;
 
   tree attr_counted_by = lookup_attribute ("counted_by",
@@ -2981,8 +2983,11 @@ build_counted_by_ref (tree datum, tree subdatum, tree 
*counted_by_type)
 }
 
 /* Given a COMPONENT_REF REF with the location LOC, the corresponding
-   COUNTED_BY_REF, and the COUNTED_BY_TYPE, generate an INDIRECT_REF
-   to a call to the internal function .ACCESS_WITH_SIZE.
+   COUNTED_BY_REF, and the COUNTED_BY_TYPE, generate the corresponding
+   call to the internal function .ACCESS_WITH_SIZE.
+
+   Generate an INDIRECT_REF to a call to the internal function
+   .ACCESS_WITH_SIZE.
 
REF
 
@@ -2992,17 +2997,15 @@ build_counted_by_ref (tree datum, tree subdatum, tree 
*counted_by_type)
(TYPE_OF_ARRAY *)0))
 
NOTE: The return type of this function is the POINTER type pointing
-   to the original flexible array type.
-   Then the type of the INDIRECT_REF is the original flexible array type.
-
-   The type of the first argument of this function is a POINTER type
-   to the original flexible array type.
+   to the original flexible array type or the original pointer type.
+   Then the type of the INDIRECT_REF is the original flexible array type
+   or the original pointer type.
 
The 4th argument of the call is a constant 0 with the TYPE of the
object pointed by COUNTED_BY_REF.
 
-   The 6th argument of the call is a constant 0 with the pointer TYPE
-   to the original flexible array type.
+   The 6th argument of the call is a constant 0 of the same TYPE as
+   the return type of the call.
 
   */
 static tree
@@ -3010,11 +3013,16 @@ build_access_with_size_for_counted_by (location_t loc, 
tree ref,
   tree counted_by_ref,
   tree counted_by_type)
 {
-  gcc_assert (c_flexible_array_member_type_p (TREE_TYPE (ref)));
-  /* The result type of the call is a pointer to the flexible array type.  */
+  gcc_assert (c_flexible_array_member_type_p (TREE_TYPE (ref))
+ || TREE_CODE (TREE_TYPE (ref)) == POINTER_TYPE);
+  bool is_fam = c_flexible_array_member_type_p (TREE_TYPE (ref));
+  tree first_param = is_fam ? array_to_pointer_conversion (loc, ref)
+: build_unary_op (loc, ADDR_EXPR, ref, false);
+
+  /* The result type of the call is a pointer to the original type
+ of the ref.  */
   tree result_type = c_build_pointer_type (TREE_TYPE (ref));
-  tree first_param
-= c_fully_fold (array_to_pointer_conversion (loc, ref), false, NULL);
+  first_param = c_fully_fold (first_param, false, NULL);