[Bug tree-optimization/116265] New: Missing optimization: Vectorization of modulo operator

2024-08-06 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116265

Bug ID: 116265
   Summary: Missing optimization: Vectorization of modulo operator
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: enhancement
  Priority: P3
 Component: tree-optimization
  Assignee: jschmitz at gcc dot gnu.org
  Reporter: jschmitz at gcc dot gnu.org
  Target Milestone: ---

On aarch64 Neoverse-v2, GCC does not vectorize the modulo operator in loops if
the second operand is a memory reference, as in the test case below, even with
-Ofast.

I am planning to fix this and would like advice on where best to implement it.

void foo (unsigned int *x, unsigned int *y, int n)
{
  for (int i = 0; i < n; ++i)
    x[i] = x[i] % y[i];
}

compiles to

ldr w5, [x0, x2]
ldr w4, [x1, x2]
udiv w3, w5, w4
msub w3, w3, w4, w5
str w3, [x0, x2]
add x2, x2, 4
cmp x6, x2
bne .L3
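
The scalar sequence above computes the remainder as a division followed by a
multiply-subtract. A minimal C sketch of that identity follows; the helper
names mod_via_div and foo_via_div are hypothetical and not part of GCC:

```c
#include <assert.h>

/* Hypothetical helper: remainder computed the way the udiv + msub
   pair above does it, i.e. a - (a / b) * b.  */
unsigned int mod_via_div (unsigned int a, unsigned int b)
{
  assert (b != 0);             /* division by zero is undefined in C */
  unsigned int q = a / b;      /* corresponds to udiv */
  return a - q * b;            /* corresponds to msub */
}

/* Same loop as foo, written with the helper instead of %.  */
void foo_via_div (unsigned int *x, unsigned int *y, int n)
{
  for (int i = 0; i < n; ++i)
    x[i] = mod_via_div (x[i], y[i]);
}
```

A vectorized version would apply the same two steps lane by lane, provided
the target offers a vector division.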

[Bug tree-optimization/116265] Missing optimization: Vectorization of modulo operator

2024-08-06 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116265

Jennifer Schmitz  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Last reconfirmed||2024-08-07

[Bug tree-optimization/101390] Expand vector mod as vector div + multiply-subtract

2024-08-07 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101390

Jennifer Schmitz  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |jschmitz at gcc dot gnu.org

--- Comment #5 from Jennifer Schmitz  ---
I will work on this task and would be grateful for advice on where to best
implement this optimization. 
I already looked at vect_recog_divmod_pattern, but if I'm not mistaken this
function is intended for cases where the second operand is an integer constant.
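
To make the distinction concrete, here is a small sketch contrasting the two
loop shapes: one with an integer-constant divisor (the case the comment above
attributes to vect_recog_divmod_pattern) and one with a divisor loaded from
memory (the case of this PR). The function names and the constant 7 are
illustrative only.

```c
#include <assert.h>

/* Divisor is an integer constant known at compile time.  */
void mod_by_constant (unsigned int *x, int n)
{
  for (int i = 0; i < n; ++i)
    x[i] = x[i] % 7u;
}

/* Divisor is loaded from memory on every iteration.  */
void mod_by_memory (unsigned int *x, const unsigned int *y, int n)
{
  for (int i = 0; i < n; ++i)
    {
      assert (y[i] != 0);      /* division by zero is undefined in C */
      x[i] = x[i] % y[i];
    }
}
```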

[Bug tree-optimization/101390] Expand vector mod as vector div + multiply-subtract

2024-08-07 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101390

--- Comment #7 from Jennifer Schmitz  ---
Thank you for the reply. It seems I have been looking in the right places.
I'm a new member of the GCC community, so I'm still getting familiar with many
parts of the code base. I have been trying to find out where the related case
with the division operator is implemented, as this seems a natural place to
also implement the modulo operator. It does not seem to be handled in
vect_recog_divmod_pattern. Do you still think vect_recog_divmod_pattern is the
right location to implement this, or can you point me to the implementation of
the same case with division?

[Bug target/116365] Add user-friendly arguments to --param aarch64-autovec-preference=N

2024-08-22 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116365

Jennifer Schmitz  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Jennifer Schmitz  ---
fixed in GCC 15.1

[Bug tree-optimization/101390] Expand vector mod as vector div + multiply-subtract

2024-08-22 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101390

Jennifer Schmitz  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #10 from Jennifer Schmitz  ---
fixed in GCC 15.1

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2024-08-22 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 101390, which changed state.

Bug 101390 Summary: Expand vector mod as vector div + multiply-subtract
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101390

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/116569] [15 Regression] ICE in to_constant, at poly-int.h:592

2024-09-06 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116569

--- Comment #5 from Jennifer Schmitz  ---
I looked into the issue and summarize below what I found:

My current fix that checks for the support of the mod optab for vectors looks
like this:

@@ -894,7 +894,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 /* X - (X / Y) * Y is the same as X % Y.  */
 (simplify
  (minus (convert1? @0) (convert2? (mult:c (trunc_div @@0 @@1) @1)))
- (if (INTEGRAL_TYPE_P (type) || VECTOR_INTEGER_TYPE_P (type))
+ (if (INTEGRAL_TYPE_P (type)
+      || (VECTOR_INTEGER_TYPE_P (type)
+          && target_supports_op_p (type, TRUNC_MOD_EXPR, optab_vector)))
   (convert (trunc_mod @0 @1))))

However, the test fold-minus-1.c fails, because the simplification is not
applied anymore:

/* { dg-options "-O -fdump-tree-gimple" } */
void f(vec*x,vec*y){
  *x -= *x / *y * *y;
}

/* { dg-final { scan-tree-dump-times "%" 1 "gimple"} } */
/* { dg-final { scan-tree-dump-not "/" "gimple"} } */
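
For readers without the testsuite at hand, the following is a self-contained
sketch of what the test exercises, using the GNU C vector extension. The
typedef is an assumption (the excerpt above does not show how vec is
declared), and g and forms_agree are hypothetical helpers added for
illustration.

```c
/* Assumed typedef: a GNU C vector of four ints (GCC/Clang extension).  */
typedef int vec __attribute__ ((vector_size (4 * sizeof (int))));

/* The shape matched by the pattern: X - (X / Y) * Y.  */
void f (vec *x, vec *y)
{
  *x -= *x / *y * *y;
}

/* The form the simplification rewrites it to: X % Y.  */
void g (vec *x, vec *y)
{
  *x %= *y;
}

/* Returns 1 if both forms agree lane by lane for the given inputs.  */
int forms_agree (const vec *a, vec *b)
{
  vec r1 = *a, r2 = *a;
  f (&r1, b);
  g (&r2, b);
  for (int i = 0; i < 4; ++i)
    if (r1[i] != r2[i])
      return 0;
  return 1;
}
```

Whether the compiler emits a vector modulo, a vector division plus
multiply-subtract, or scalar code for g depends on the target, which is the
point of contention in this PR.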

I looked into applying the simplification in early tree passes only instead of
checking for support of the mod optab and found functions like
optimize_vectors_before_lowering_p that use the PROP_gimple_xxx macros (in
tree-pass.h) as mask.

I tried different PROP_xxx macros and all tests (fold-minus-1.c; the minimal
testcase Kyrill posted that produced the ICE; and my previous vect-mod tests)
run successfully for

@@ -896,7 +896,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (minus (convert1? @0) (convert2? (mult:c (trunc_div @@0 @@1) @1)))
  (if (INTEGRAL_TYPE_P (type)
       || (VECTOR_INTEGER_TYPE_P (type)
-          && target_supports_op_p (type, TRUNC_MOD_EXPR, optab_vector)))
+          && (!cfun || (cfun->curr_properties & PROP_gimple_any) == 0)))
   (convert (trunc_mod @0 @1))))

However, I don't think PROP_gimple_any is exactly what I want, and I
haven't found anything that fits perfectly.

Any advice on how to proceed?

[Bug tree-optimization/116569] [15 Regression] ICE in to_constant, at poly-int.h:592

2024-09-06 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116569

--- Comment #7 from Jennifer Schmitz  ---
Thanks for the quick reply. I tried

(simplify
 (minus (convert1? @0) (convert2? (mult:c (trunc_div @@0 @@1) @1)))
 (if (INTEGRAL_TYPE_P (type)
      || (VECTOR_INTEGER_TYPE_P (type)
          && optimize_vectors_before_lowering_p ()))
  (convert (trunc_mod @0 @1))))

and the result is that the test case still ICEs, but fold-minus-1.c passes.

[Bug tree-optimization/116831] [15 Regression] ICE with trunc mod vectorising for SVE

2024-10-10 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116831

Jennifer Schmitz  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Jennifer Schmitz  ---
fixed in GCC 15.1

[Bug tree-optimization/86710] 3 missing logarithm optimizations

2024-10-11 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86710

Jennifer Schmitz  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Jennifer Schmitz  ---
fixed in GCC 15.1

[Bug tree-optimization/116826] Optimise log (1.0 / x) into -log (x)

2024-10-11 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116826

Jennifer Schmitz  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Jennifer Schmitz  ---
fixed in GCC 15.1

[Bug tree-optimization/117093] Missing detection of REV64 vector permute

2024-10-31 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117093

Jennifer Schmitz  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |jschmitz at gcc dot gnu.org
 Status|NEW |ASSIGNED
 CC||jschmitz at gcc dot gnu.org

--- Comment #5 from Jennifer Schmitz  ---
.

[Bug tree-optimization/116826] Optimise log (1.0 / x) into -log (x)

2024-09-24 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116826

Jennifer Schmitz  changed:

   What|Removed |Added

   Last reconfirmed||2024-09-24
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
 CC||jschmitz at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |jschmitz at gcc dot gnu.org

--- Comment #1 from Jennifer Schmitz  ---
.

[Bug tree-optimization/116569] [15 Regression] ICE in to_constant, at poly-int.h:592

2024-09-18 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116569

Jennifer Schmitz  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #15 from Jennifer Schmitz  ---
fixed in GCC 15.1

[Bug tree-optimization/86710] 3 missing logarithm optimizations

2024-09-25 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86710

Jennifer Schmitz  changed:

   What|Removed |Added

 CC||jschmitz at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |jschmitz at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #2 from Jennifer Schmitz  ---
.

[Bug tree-optimization/116831] [15 Regression] ICE with trunc mod vectorising for SVE

2024-10-02 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116831

Jennifer Schmitz  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |jschmitz at gcc dot gnu.org

[Bug target/106329] No optimization for SVE pfalse predicate

2024-10-24 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106329

Jennifer Schmitz  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |jschmitz at gcc dot gnu.org
Version|12.1.0  |15.0
 Status|NEW |ASSIGNED

--- Comment #2 from Jennifer Schmitz  ---
.

[Bug testsuite/117704] gcc.dg/tree-ssa/pow_fold_1.c FAILs on 32-bit x86

2024-11-28 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117704

Jennifer Schmitz  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Jennifer Schmitz  ---
Fixed in GCC 15 by
https://gcc.gnu.org/pipermail/gcc-patches/2024-November/669910.html

[Bug testsuite/117704] gcc.dg/tree-ssa/pow_fold_1.c FAILs on 32-bit x86

2024-11-20 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117704

Jennifer Schmitz  changed:

   What|Removed |Added

   Last reconfirmed||2024-11-20
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |jschmitz at gcc dot gnu.org
 Status|UNCONFIRMED |ASSIGNED

--- Comment #2 from Jennifer Schmitz  ---
.

[Bug tree-optimization/117093] Missing detection of REV64 vector permute

2024-11-15 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117093

Jennifer Schmitz  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Jennifer Schmitz  ---
fixed in GCC 15

[Bug tree-optimization/117093] Missing detection of REV64 vector permute

2024-11-16 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117093

--- Comment #9 from Jennifer Schmitz  ---
Thanks for reporting it, I'll look into it on Monday.

[Bug target/106329] No optimization for SVE pfalse predicate

2024-12-05 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106329

Jennifer Schmitz  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Jennifer Schmitz  ---
fixed in GCC 15.

[Bug tree-optimization/114999] A few missing optimizations due to `a - b` and `b - a` not being detected as negatives of each other

2024-12-11 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114999

Jennifer Schmitz  changed:

   What|Removed |Added

 CC||jschmitz at gcc dot gnu.org

--- Comment #10 from Jennifer Schmitz  ---
We are also optimizing ABS expressions and improving codegen for the following
types of test cases (for T in {uint8_t, int8_t, uint16_t, int16_t, uint32_t,
int32_t, uint64_t, int64_t, __uint128_t, __int128_t}):

T src(T x, T y)
{
  T diff1 = x - y;
  T diff2 = y - x;
  return x > y ? diff1 : diff2;
}

T tgt(T x, T y)
{
  T diff = x - y;
  return x > y ? diff : -diff;
}

This seems to be a subset of the transformations described here, so it would
be good to coordinate the work: We have code ready that covers our test cases,
but would also be happy to look at other optimizations mentioned above.
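
As a quick sanity check of the src/tgt equivalence above, here is a sketch
that instantiates both functions for T = uint32_t and T = int32_t. The
_u32/_s32 suffixes and the check_* helpers are hypothetical names chosen for
this example.

```c
#include <assert.h>
#include <stdint.h>

/* T = uint32_t: unsigned arithmetic wraps, so both forms are well
   defined for all inputs.  */
uint32_t src_u32 (uint32_t x, uint32_t y)
{
  uint32_t diff1 = x - y;
  uint32_t diff2 = y - x;
  return x > y ? diff1 : diff2;
}

uint32_t tgt_u32 (uint32_t x, uint32_t y)
{
  uint32_t diff = x - y;
  return x > y ? diff : -diff;
}

/* T = int32_t: callers must avoid inputs whose difference overflows,
   since signed overflow is undefined behaviour in C.  */
int32_t src_s32 (int32_t x, int32_t y)
{
  int32_t diff1 = x - y;
  int32_t diff2 = y - x;
  return x > y ? diff1 : diff2;
}

int32_t tgt_s32 (int32_t x, int32_t y)
{
  int32_t diff = x - y;
  return x > y ? diff : -diff;
}

/* Assert that both forms agree for one pair of inputs.  */
void check_u32 (uint32_t x, uint32_t y)
{
  assert (src_u32 (x, y) == tgt_u32 (x, y));
}

void check_s32 (int32_t x, int32_t y)
{
  assert (src_s32 (x, y) == tgt_s32 (x, y));
}
```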

[Bug target/117978] Optimise 128-bit-predicated SVE loads to Advanced SIMD LDRs

2024-12-12 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117978

Jennifer Schmitz  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |jschmitz at gcc dot gnu.org
 Status|NEW |ASSIGNED

[Bug tree-optimization/114999] A few missing optimizations due to `a - b` and `b - a` not being detected as negatives of each other

2025-01-14 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114999

--- Comment #12 from Jennifer Schmitz  ---
Created attachment 60149
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60149&action=edit
Proposed patch for detecting abs diff for signed integers

[Bug target/119009] New: AArch64: Commit 'Node clones share order' causes regression in Snappy workload for -mcpu=neoverse-v2 with LTO

2025-02-25 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119009

Bug ID: 119009
   Summary: AArch64: Commit 'Node clones share order' causes
regression in Snappy workload for -mcpu=neoverse-v2
with LTO
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jschmitz at gcc dot gnu.org
CC: mjires at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64

Created attachment 60581
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60581&action=edit
Script to reproduce snappy regression

The commit 'Node clones share order'
(https://gcc.gnu.org/g:0895aef01c64c317b489811dbe4ac55f9c13aab3) causes a
performance regression in the Snappy workload for AArch64 with
-mcpu=neoverse-v2 and LTO: the test UIOVecSink/0 shows ~25% longer runtime.

In the attachment is a script to reproduce the regression. It builds GCC from
commits bad3714b and 0895aef0 and runs Snappy with
-O3 -Wl,-z,muldefs -lm -flto=auto -Wl,-sort-section=name -mcpu=neoverse-v2

Use the script like this:
parentdir= ./instructions_to_reproduce.sh

[Bug target/118999] New: AArch64: Switching off early scheduling causes regressions in Snappy workload for -mcpu=neoverse-v2

2025-02-24 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118999

Bug ID: 118999
   Summary: AArch64: Switching off early scheduling causes
regressions in Snappy workload for -mcpu=neoverse-v2
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jschmitz at gcc dot gnu.org
  Target Milestone: ---

Created attachment 60573
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60573&action=edit
Script to reproduce snappy regression

The commit that switched off early scheduling for AArch64
(https://gcc.gnu.org/g:c5db3f50bdf34ea96fd193a2a66d686401053bd2) causes changes
in performance for the Snappy workload for -mcpu=neoverse-v2, including runtime
increases of up to 20%.

In the attachment is a script to reproduce the regressions. It builds GCC from
commit c5db3f50 and runs Snappy with and without the -fschedule-insns option
(in addition to the other flags for which we saw the regression). Use it like
this:

parentdir= ./snappy_script.sh

As of today, we observed the following runtime changes (for Ofast_VLA; values
are percentages; positive values mean that running Snappy WITHOUT
-fschedule-insns has longer runtime than WITH -fschedule-insns):

BM_UFlat/5/2 -2.12766
BM_UValidate/1/1 12.9032
BM_UValidate/1/2 13.6905
BM_UValidate/2/1 8.21918
BM_UValidate/2/2 8.3
BM_UValidate/3/1 5.88235
BM_UValidate/3/2 6.12245
BM_UValidate/5/1 12.5
BM_UValidate/5/2 6.10329
BM_UValidate/6/1 18.4906
BM_UValidate/6/2 15.8458
BM_UValidate/7/1 20.3024
BM_UValidate/7/2 16.3934
BM_UValidate/8/1 9.34066
BM_UValidate/8/2 9.49367
BM_UValidate/9/1 8.51852
BM_UValidate/9/2 9.42623
BM_UValidateMedley 2.24829
BM_UIOVecSource/6/1 3.21285
BM_UIOVecSource/7/1 4.2654
BM_UIOVecSource/11/1 2.32558
BM_UIOVecSink/0 21.1726
BM_UIOVecSink/3 4.83871
BM_UFlatSink/11/1 2.02808
BM_ZFlat/6/1 2.03252
BM_ZFlat/7/1 4.2654

In the past, we have also seen regressions in other tests, such as UFlat/3/2
and UFlat/3/1.
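
The report does not spell out how the percentages are computed. One plausible
reading, given here only as an assumption, is the relative change against the
run WITH -fschedule-insns as baseline. The function name is hypothetical:

```c
#include <assert.h>

/* Assumed convention: relative runtime change in percent, measured
   against the baseline run.  A positive result means the candidate
   run took longer than the baseline.  */
double runtime_change_percent (double baseline_runtime,
                               double candidate_runtime)
{
  assert (baseline_runtime > 0.0);
  return (candidate_runtime - baseline_runtime) / baseline_runtime * 100.0;
}
```

Under this reading, the baseline is the runtime with -fschedule-insns and the
candidate is the runtime without it.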

[Bug tree-optimization/114999] A few missing optimizations due to `a - b` and `b - a` not being detected as negatives of each other

2025-02-20 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114999

--- Comment #13 from Jennifer Schmitz  ---
Created attachment 60540
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60540&action=edit
Patch for improving codegen of absolute differences of unsigned integers in
aarch64

This patch builds on top of the previous one, improving codegen for the same
test cases for unsigned integers (32-bit and 64-bit) for aarch64. The patch
adds a new define_insn_and_split pattern in the aarch64 backend.

[Bug target/118999] [15 regression] AArch64: Switching off early scheduling (r15-6661-gc5db3f50bdf34e) causes regressions in Snappy workload for -mcpu=neoverse-v2

2025-03-10 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118999

--- Comment #2 from Jennifer Schmitz  ---
Thanks for looking into this. The regression looks to have been resolved by:
AArch64: Enable early scheduling for -O3 and higher (PR118351)
On our machines, the runtimes are back to normal. Do you still see the
regressions? If not, feel free to close the ticket.

[Bug ipa/119009] [15 regression] AArch64: Commit 'Node clones share order' (r15-6345-g0895aef01c64c3) causes regression in Snappy workload for -mcpu=neoverse-v2 with LTO

2025-03-05 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119009

--- Comment #4 from Jennifer Schmitz  ---
Thanks for looking into this. Indeed, the runtime has recovered in the
meantime. From our side, we can close the PR.

[Bug target/117978] Optimise 128-bit-predicated SVE loads to Advanced SIMD LDRs

2025-03-17 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117978

--- Comment #6 from Jennifer Schmitz  ---
Created attachment 60790
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60790&action=edit
Proposed patch for folding SVE load/store with certain ptrue patterns to
LDR/STR

[Bug tree-optimization/119706] [12/13/14 regression] ICE in gimple pass 'dom' for -O3 -mcpu=grace --param=aarch64-autovec-preference=sve-only

2025-04-10 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119706

--- Comment #7 from Jennifer Schmitz  ---
Great, thanks a lot for the quick fix!

[Bug tree-optimization/119706] New: [15 regression] ICE in gimple pass 'dom' for -O3 -mcpu=grace --param=aarch64-autovec-preference=sve-only

2025-04-10 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119706

Bug ID: 119706
   Summary: [15 regression] ICE in gimple pass 'dom' for -O3
-mcpu=grace
--param=aarch64-autovec-preference=sve-only
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jschmitz at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64

Created attachment 61057
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61057&action=edit
Test case for reproducing the ICE

For the attached test case (reduced from the RAJAPerf kernel
Basic_MULTI_REDUCE), there is an ICE when compiling it with
-O3 -mcpu=grace --param=aarch64-autovec-preference=sve-only:

during GIMPLE pass: dom
testcase.i: In member function ‘void a::basic::u::v(cd, a::b)’:
testcase.i:170:6: internal compiler error: in maybe_canonicalize_mem_ref_addr,
at gimple-fold.cc:6394
  170 | void u::v(cd, b) {
  |  ^
0x2622a9b internal_error(char const*, ...)
.././../../src/gcc/diagnostic-global-context.cc:517
0x844b57 fancy_abort(char const*, int, char const*)
.././../../src/gcc/diagnostic.cc:1749
0xf4ff5b maybe_canonicalize_mem_ref_addr
.././../../src/gcc/gimple-fold.cc:6394
0xf5c117 fold_stmt_1
.././../../src/gcc/gimple-fold.cc:6499
0x1497cd7 dom_opt_dom_walker::optimize_stmt(basic_block_def*,
gimple_stmt_iterator*, bool*)
.././../../src/gcc/tree-ssa-dom.cc:2352
0x149951f dom_opt_dom_walker::before_dom_children(basic_block_def*)
.././../../src/gcc/tree-ssa-dom.cc:1747
0x22f08e3 dom_walker::walk(basic_block_def*)
.././../../src/gcc/domwalk.cc:311
0x1499e13 execute
.././../../src/gcc/tree-ssa-dom.cc:939

The gimple expression
MEM  [(double *)POLY_INT_CST [16B, 16B] + ivtmp_97 * 8]
does not pass the assertion
gcc_checking_assert (TREE_CODE (TREE_OPERAND (*t, 0)) == DEBUG_EXPR_DECL ||
is_gimple_mem_ref_addr (TREE_OPERAND (*t, 0))).

[Bug tree-optimization/119606] New: [15 regression] Commit 'Optimize string constructor' causes regression in Snappy workload for -mcpu=neoverse-v2 with LTO

2025-04-03 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119606

Bug ID: 119606
   Summary: [15 regression] Commit 'Optimize string constructor'
causes regression in Snappy workload for
-mcpu=neoverse-v2 with LTO
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jschmitz at gcc dot gnu.org
CC: hubicka at ucw dot cz
  Target Milestone: ---
Target: aarch64

Created attachment 60969
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60969&action=edit
Script to reproduce snappy regression

The commit that optimizes string constructors
(https://gcc.gnu.org/g:9c5505a35d9d71705464f9254f55407192d31ec3) causes changes
in performance for the Snappy workload for -mcpu=neoverse-v2, including some
regressions.

In the attachment is a script to reproduce the regressions. It builds GCC from
commits 37f35ebc and 9c5505a3 and runs Snappy with -O3 -Wl,-z,muldefs -lm
-flto=auto -Wl,--sort-section=name -mcpu=neoverse-v2.

Use it like this:
parentdir= ./snappy_script.sh

As of today, we observed the following runtime changes (values are percentages;
positive values mean that running Snappy from commit 9c5505a3 has longer
runtime than from commit 37f35ebc):

BM_UFlat/4/2 2.92308
BM_UValidate/5/2 -2.9106
BM_UValidate/7/1 2.29277
BM_UValidate/11/1 5.47945
BM_UIOVecSource/0/1 4.00891
BM_UIOVecSource/0/2 6.37636
BM_UIOVecSource/2/1 -3.59375
BM_UIOVecSource/2/2 2.8754
BM_UIOVecSource/4/2 4.42478
BM_UIOVecSource/5/2 2.42424
BM_UIOVecSource/10/2 8.71985
BM_UIOVecSink/3 3.1746
BM_UFlatSink/10/2 2.41935
BM_ZFlat/0/1 3.24826
BM_ZFlat/0/2 6.54952
BM_ZFlat/1/2 2.00501
BM_ZFlat/2/2 4.46735
BM_ZFlat/4/2 4.5045
BM_ZFlat/5/2 2.47678
BM_ZFlat/10/2 9.17782

In the past, we have also seen regressions in other tests, such as UFlat/6/1
and UFlat/6/2.

[Bug libstdc++/119606] [15 regression] Commit 'Optimize string constructor' causes regression in Snappy workload for -mcpu=neoverse-v2 with LTO

2025-04-03 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119606

--- Comment #6 from Jennifer Schmitz  ---
(In reply to Jan Hubicka from comment #5)
> the patch to string constructor should be kind of orthogonal to PR86590.
> I downloaded snappy and perfed it on znver3 machine and while I see there
> are some strings involved, I do not see anything obvious.
> 
> Is there a way to localize the problem? 
> Can I run only one of the benchmarks that changed most?

According to the Snappy documentation, there is an option --benchmark_filter
that can be added to the execution command, e.g.

./snappy_benchmark --benchmark_filter=BM_ZFlat_10_2 ...

Does that work for you?

[Bug libstdc++/119606] [15/16 regression] Commit 'Optimize string constructor' causes regression in Snappy workload for -mcpu=neoverse-v2 with LTO

2025-04-24 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119606

--- Comment #7 from Jennifer Schmitz  ---
For another regression in the Snappy workload
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119910), we found that it was
caused by an alignment issue. I added -falign-functions=32 -falign-loops=32
-falign-jumps=32 -falign-labels=32 to the compile flags and could not reproduce
the regressions seen below anymore.
Perf profiling of a run with BM_ZFlat/10/2 showed that the hot sections have
the same assembly sequences, but the addresses are shifted.

[Bug target/119910] [15 regression] Commit 'combine: Allow 2->2 combinations...' causes regression in Snappy workload for -mcpu=neoverse-v2

2025-04-24 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119910

--- Comment #3 from Jennifer Schmitz  ---
Yes, it seems to be an alignment problem: I took a look with perf at the hot
sections and the assembly sequence is the same. But objdump of the benchmark
executable showed that the number of nops differs slightly between the commits
and the addresses of the hot sections are shifted.
Indeed, adding -falign-functions=32 -falign-loops=32 -falign-jumps=32
-falign-labels=32 to the build flags gets rid of the regressions.
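
For illustration only, the sketch below shows a per-function analogue of
-falign-functions=32 using the GNU C aligned function attribute, together
with a check of where the function entry lands. It is not what the benchmark
build used; sum_bytes and entry_offset_mod32 are hypothetical names, and the
pointer-to-integer cast assumes a common flat-address platform such as
AArch64 or x86-64.

```c
#include <stdint.h>

/* Hypothetical hot function, forced onto a 32-byte boundary.  */
__attribute__ ((aligned (32), noinline))
unsigned int sum_bytes (const unsigned char *p, unsigned int n)
{
  unsigned int s = 0;
  for (unsigned int i = 0; i < n; ++i)
    s += p[i];
  return s;
}

/* Offset of the function entry within its 32-byte block.
   0 means the entry sits exactly on a 32-byte boundary.  */
unsigned int entry_offset_mod32 (void)
{
  return (unsigned int) ((uintptr_t) &sum_bytes % 32u);
}
```

Comparing such offsets for the hot functions between two builds is one way to
confirm that only code placement, and not the instruction sequence, differs.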

[Bug rtl-optimization/119910] New: [15 regression] Commit 'combine: Allow 2->2 combinations...' causes regression in Snappy workload for -mcpu=neoverse-v2

2025-04-23 Thread jschmitz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119910

Bug ID: 119910
   Summary: [15 regression] Commit 'combine: Allow 2->2
combinations...' causes regression in Snappy workload
for -mcpu=neoverse-v2
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jschmitz at gcc dot gnu.org
CC: rsandifo at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64

Created attachment 61177
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61177&action=edit
Script to reproduce snappy regression

The commit 'combine: Allow 2->2 combinations, but with a tweak [PR116398]'
(https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=4d7a634f6d41029811cdcbd5f7282b5b07890094)
causes changes in performance for the Snappy workload for -mcpu=neoverse-v2,
including some regressions.

In the attachment is a script to reproduce the regressions. It builds GCC from
commits 546f28f83ce and 4d7a634f6d4 and runs Snappy with -O3 -Wl,-z,muldefs -lm
-mcpu=neoverse-v2.

Use it like this:
parentdir= ./snappy_script.sh

As of today, we observed the following runtime changes (values are percentages;
positive values mean that running Snappy from commit 4d7a634f6d4 has longer
runtime than from commit 546f28f83ce):

BM_UFlat/5/1 -2.39362
BM_UValidate/5/2 2.85714
BM_UValidate/6/2 2.04461
BM_UValidate/10/2 -2.79503
BM_UValidate/11/2 -5.4321
BM_UIOVecSource/0/1 18.0723
BM_UIOVecSource/5/1 5.10949
BM_UIOVecSource/10/1 8.59951
BM_UIOVecSource/11/1 2.39044
BM_UIOVecSink/3 2.2
BM_UFlatSink/7/1 3.22581
BM_UFlatSink/7/2 3.85164
BM_ZFlat/0/1 19.0184
BM_ZFlat/3/1 -2.08333
BM_ZFlat/3/2 -2.51799
BM_ZFlat/5/1 4.41176
BM_ZFlat/10/1 9.02256
BM_ZFlat/11/1 2.39044

In the past, we have also seen regressions in several of the UValidate tests.