[Bug tree-optimization/116265] New: Missing optimization: Vectorization of modulo operator
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116265

            Bug ID: 116265
           Summary: Missing optimization: Vectorization of modulo operator
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: jschmitz at gcc dot gnu.org
          Reporter: jschmitz at gcc dot gnu.org
  Target Milestone: ---

On aarch64 Neoverse-v2, GCC does not vectorize the modulo operator in loops if
the second operand is a memory reference, as in the test case below, even with
-Ofast. I am planning to fix this and would like advice on where best to
implement it.

void foo (unsigned int *x, unsigned int *y, int n)
{
  for (int i = 0; i < n; ++i)
    x[i] = x[i] % y[i];
}

compiles to

        ldr     w5, [x0, x2]
        ldr     w4, [x1, x2]
        udiv    w3, w5, w4
        msub    w3, w3, w4, w5
        str     w3, [x0, x2]
        add     x2, x2, 4
        cmp     x6, x2
        bne     .L3
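The udiv/msub pair in the scalar code computes the remainder as x - (x / y) * y. A minimal C sketch of that identity follows; the names umod_via_div_msub, mod_reference, mod_div_msub and mod_variants_agree are hypothetical helpers for illustration and are not part of the report or of GCC.

```c
#include <assert.h>

/* Hypothetical helper: remainder written as a division followed by a
   multiply-subtract, mirroring the udiv + msub pair in the assembly. */
unsigned int umod_via_div_msub(unsigned int a, unsigned int b)
{
    unsigned int q = a / b;   /* udiv */
    return a - q * b;         /* msub */
}

/* The loop from the report, using the % operator directly. */
void mod_reference(unsigned int *x, const unsigned int *y, int n)
{
    for (int i = 0; i < n; ++i)
        x[i] = x[i] % y[i];
}

/* The same loop expressed through the div + multiply-subtract form. */
void mod_div_msub(unsigned int *x, const unsigned int *y, int n)
{
    for (int i = 0; i < n; ++i)
        x[i] = umod_via_div_msub(x[i], y[i]);
}

/* Hypothetical self-check: both loops agree on a few inputs
   (all divisors are non-zero). */
void mod_variants_agree(void)
{
    unsigned int a[4] = {10u, 7u, 4294967295u, 0u};
    unsigned int b[4] = {10u, 7u, 4294967295u, 0u};
    const unsigned int d[4] = {3u, 7u, 10u, 5u};

    mod_reference(a, d, 4);
    mod_div_msub(b, d, 4);
    for (int i = 0; i < 4; ++i)
        assert(a[i] == b[i]);
}
```

Calling mod_variants_agree() from a test driver checks the two forms against each other; a vectorized version would apply the same division and multiply-subtract per lane.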
[Bug tree-optimization/116265] Missing optimization: Vectorization of modulo operator
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116265

Jennifer Schmitz changed:

           What    |Removed                     |Added
             Status|UNCONFIRMED                 |ASSIGNED
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-08-07
[Bug tree-optimization/101390] Expand vector mod as vector div + multiply-subtract
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101390

Jennifer Schmitz changed:

           What    |Removed                     |Added
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org|jschmitz at gcc dot gnu.org

--- Comment #5 from Jennifer Schmitz ---
I will work on this task and would be grateful for advice on where to best
implement this optimization. I already looked at vect_recog_divmod_pattern, but
if I'm not mistaken this function is intended for cases where the second
operand is an integer constant.
[Bug tree-optimization/101390] Expand vector mod as vector div + multiply-subtract
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101390

--- Comment #7 from Jennifer Schmitz ---
Thank you for the reply. It seems I have been looking in the right places. I'm
a new member of the GCC community, so I'm still getting familiar with many
parts of the code base.
I have been trying to find out where the related case with the division
operator is implemented, as this seems a natural place to also implement the
modulo operator. This does not seem to happen in vect_recog_divmod_pattern.
Do you still think vect_recog_divmod_pattern is the right location to implement
this, or can you point me to the implementation of the same case with division?
[Bug target/116365] Add user-friendly arguments to --param aarch64-autovec-preference=N
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116365

Jennifer Schmitz changed:

           What    |Removed                     |Added
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #3 from Jennifer Schmitz ---
fixed in GCC 15.1
[Bug tree-optimization/101390] Expand vector mod as vector div + multiply-subtract
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101390

Jennifer Schmitz changed:

           What    |Removed                     |Added
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #10 from Jennifer Schmitz ---
fixed in GCC 15.1
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947

Bug 53947 depends on bug 101390, which changed state.

Bug 101390 Summary: Expand vector mod as vector div + multiply-subtract
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101390

           What    |Removed                     |Added
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
[Bug tree-optimization/116569] [15 Regression] ICE in to_constant, at poly-int.h:592
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116569

--- Comment #5 from Jennifer Schmitz ---
I looked into the issue and summarize below what I found:
My current fix that checks for the support of the mod optab for vectors looks
like this:

@@ -894,7 +894,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 /* X - (X / Y) * Y is the same as X % Y.  */
 (simplify
  (minus (convert1? @0) (convert2? (mult:c (trunc_div @@0 @@1) @1)))
- (if (INTEGRAL_TYPE_P (type) || VECTOR_INTEGER_TYPE_P (type))
+ (if (INTEGRAL_TYPE_P (type)
+      || (VECTOR_INTEGER_TYPE_P (type)
+          && target_supports_op_p (type, TRUNC_MOD_EXPR, optab_vector)))
   (convert (trunc_mod @0 @1

However, the test fold-minus-1.c fails, because the simplification is not
applied anymore:

/* { dg-options "-O -fdump-tree-gimple" } */
void f(vec*x,vec*y){
  *x -= *x / *y * *y;
}
/* { dg-final { scan-tree-dump-times "%" 1 "gimple"} } */
/* { dg-final { scan-tree-dump-not "/" "gimple"} } */

I looked into applying the simplification in early tree passes only, instead of
checking for support of the mod optab, and found functions like
optimize_vectors_before_lowering_p that use the PROP_gimple_xxx macros (in
tree-pass.h) as a mask.
I tried different PROP_xxx macros, and all tests (fold-minus-1.c; the minimal
testcase Kyrill posted that produced the ICE; and my previous vect-mod tests)
run successfully for

@@ -896,7 +896,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (minus (convert1? @0) (convert2? (mult:c (trunc_div @@0 @@1) @1)))
  (if (INTEGRAL_TYPE_P (type)
       || (VECTOR_INTEGER_TYPE_P (type)
-          && target_supports_op_p (type, TRUNC_MOD_EXPR, optab_vector)))
+          && (!cfun || (cfun->curr_properties & PROP_gimple_any) == 0)))
   (convert (trunc_mod @0 @1

However, I don't think PROP_gimple_any is exactly what I want, and I haven't
found anything that fits perfectly. Any advice on how to proceed?
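For readers unfamiliar with the pattern, the identity that the simplify rule relies on can be checked at the source level with GNU vector extensions. This is only a sketch: the typedef of vec is an assumption (the test's own definition is not quoted in the comment), and the function names are hypothetical. It needs a compiler that supports vector_size, such as GCC or Clang.

```c
#include <assert.h>

/* Assumed definition: four 32-bit ints in one 16-byte vector. */
typedef int vec __attribute__((vector_size(16)));

/* The shape matched by the simplify rule: X - (X / Y) * Y. */
vec sub_div_mul(vec x, vec y)
{
    return x - x / y * y;
}

/* What the rule rewrites it to: X % Y (truncating modulo). */
vec trunc_mod_vec(vec x, vec y)
{
    return x % y;
}

/* Hypothetical helper: evaluate the unsimplified form on a splat of
   (a, b) and return lane 0, so results can be compared as scalars. */
int sub_div_mul_lane0(int a, int b)
{
    vec x = {a, a, a, a};
    vec y = {b, b, b, b};
    vec r = sub_div_mul(x, y);
    return r[0];
}

/* Hypothetical self-check: both forms agree lane by lane. */
void fold_identity_holds(void)
{
    vec x = {17, -17, 100, -5};
    vec y = {5, 5, -7, 3};
    vec a = sub_div_mul(x, y);
    vec b = trunc_mod_vec(x, y);
    for (int i = 0; i < 4; ++i)
        assert(a[i] == b[i]);
}
```

The two forms agree for any non-zero divisor that does not overflow the division; whether the compiler may rewrite one into the other for a given vector type is the question the target_supports_op_p and PROP_gimple_any variants above try to answer.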
[Bug tree-optimization/116569] [15 Regression] ICE in to_constant, at poly-int.h:592
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116569

--- Comment #7 from Jennifer Schmitz ---
Thanks for the quick reply. I tried

(simplify
 (minus (convert1? @0) (convert2? (mult:c (trunc_div @@0 @@1) @1)))
 (if (INTEGRAL_TYPE_P (type)
      || (VECTOR_INTEGER_TYPE_P (type)
          && optimize_vectors_before_lowering_p ()))
  (convert (trunc_mod @0 @1

and the result is that the test case still ICEs, but fold-minus-1.c passes.
[Bug tree-optimization/116831] [15 Regression] ICE with trunc mod vectorising for SVE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116831

Jennifer Schmitz changed:

           What    |Removed                     |Added
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #7 from Jennifer Schmitz ---
fixed in GCC 15.1
[Bug tree-optimization/86710] 3 missing logarithm optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86710

Jennifer Schmitz changed:

           What    |Removed                     |Added
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #4 from Jennifer Schmitz ---
fixed in GCC 15.1
[Bug tree-optimization/116826] Optimise log (1.0 / x) into -log (x)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116826

Jennifer Schmitz changed:

           What    |Removed                     |Added
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #5 from Jennifer Schmitz ---
fixed in GCC 15.1
[Bug tree-optimization/117093] Missing detection of REV64 vector permute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117093

Jennifer Schmitz changed:

           What    |Removed                     |Added
           Assignee|unassigned at gcc dot gnu.org|jschmitz at gcc dot gnu.org
             Status|NEW                         |ASSIGNED
                 CC|                            |jschmitz at gcc dot gnu.org

--- Comment #5 from Jennifer Schmitz ---
.
[Bug tree-optimization/116826] Optimise log (1.0 / x) into -log (x)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116826

Jennifer Schmitz changed:

           What    |Removed                     |Added
   Last reconfirmed|                            |2024-09-24
             Status|UNCONFIRMED                 |ASSIGNED
     Ever confirmed|0                           |1
                 CC|                            |jschmitz at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org|jschmitz at gcc dot gnu.org

--- Comment #1 from Jennifer Schmitz ---
.
[Bug tree-optimization/116569] [15 Regression] ICE in to_constant, at poly-int.h:592
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116569

Jennifer Schmitz changed:

           What    |Removed                     |Added
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #15 from Jennifer Schmitz ---
fixed in GCC 15.1
[Bug tree-optimization/86710] 3 missing logarithm optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86710

Jennifer Schmitz changed:

           What    |Removed                     |Added
                 CC|                            |jschmitz at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org|jschmitz at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #2 from Jennifer Schmitz ---
.
[Bug tree-optimization/116831] [15 Regression] ICE with trunc mod vectorising for SVE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116831

Jennifer Schmitz changed:

           What    |Removed                     |Added
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org|jschmitz at gcc dot gnu.org
[Bug target/106329] No optimization for SVE pfalse predicate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106329

Jennifer Schmitz changed:

           What    |Removed                     |Added
           Assignee|unassigned at gcc dot gnu.org|jschmitz at gcc dot gnu.org
            Version|12.1.0                      |15.0
             Status|NEW                         |ASSIGNED

--- Comment #2 from Jennifer Schmitz ---
.
[Bug testsuite/117704] gcc.dg/tree-ssa/pow_fold_1.c FAILs on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117704

Jennifer Schmitz changed:

           What    |Removed                     |Added
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #3 from Jennifer Schmitz ---
Fixed in GCC 15 by
https://gcc.gnu.org/pipermail/gcc-patches/2024-November/669910.html
[Bug testsuite/117704] gcc.dg/tree-ssa/pow_fold_1.c FAILs on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117704

Jennifer Schmitz changed:

           What    |Removed                     |Added
   Last reconfirmed|                            |2024-11-20
     Ever confirmed|0                           |1
           Assignee|unassigned at gcc dot gnu.org|jschmitz at gcc dot gnu.org
             Status|UNCONFIRMED                 |ASSIGNED

--- Comment #2 from Jennifer Schmitz ---
.
[Bug tree-optimization/117093] Missing detection of REV64 vector permute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117093

Jennifer Schmitz changed:

           What    |Removed                     |Added
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #7 from Jennifer Schmitz ---
fixed in GCC 15
[Bug tree-optimization/117093] Missing detection of REV64 vector permute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117093 --- Comment #9 from Jennifer Schmitz --- Thanks for reporting it, I'll look into it on Monday.
[Bug target/106329] No optimization for SVE pfalse predicate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106329

Jennifer Schmitz changed:

           What    |Removed                     |Added
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #4 from Jennifer Schmitz ---
fixed in GCC 15.
[Bug tree-optimization/114999] A few missing optimizations due to `a - b` and `b - a` not being detected as negatives of each other
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114999

Jennifer Schmitz changed:

           What    |Removed                     |Added
                 CC|                            |jschmitz at gcc dot gnu.org

--- Comment #10 from Jennifer Schmitz ---
We are also optimizing ABS expressions and improving codegen for the following
types of test cases (for T in {uint8_t, int8_t, uint16_t, int16_t, uint32_t,
int32_t, uint64_t, int64_t, __uint128_t, __int128_t}):

T src(T x, T y)
{
  T diff1 = x - y;
  T diff2 = y - x;
  return x > y ? diff1 : diff2;
}

T tgt(T x, T y)
{
  T diff = x - y;
  return x > y ? diff : -diff;
}

This seems to be a subset of the transformations described here, so it would be
good to coordinate the work: We have code ready that covers our test cases, but
would also be happy to look at other optimizations mentioned above.
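As a sketch of why src and tgt are interchangeable, the two functions can be instantiated for one of the listed types and compared directly. The choice of uint32_t and the helper src_tgt_agree are assumptions made for illustration only; for unsigned types the negation wraps modulo 2^32, so -(x - y) equals y - x.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: T is instantiated as uint32_t for this illustration. */
typedef uint32_t T;

/* Source form: both differences are computed, one is selected. */
T src(T x, T y)
{
    T diff1 = x - y;
    T diff2 = y - x;
    return x > y ? diff1 : diff2;
}

/* Target form: one difference, negated when x <= y. */
T tgt(T x, T y)
{
    T diff = x - y;
    return x > y ? diff : -diff;
}

/* Hypothetical self-check over a few pairs, including wrap-around. */
void src_tgt_agree(void)
{
    const T vals[5] = {0u, 1u, 3u, 10u, 4294967295u};
    for (int i = 0; i < 5; ++i)
        for (int j = 0; j < 5; ++j)
            assert(src(vals[i], vals[j]) == tgt(vals[i], vals[j]));
}
```

For the signed types in the list, the subtraction itself can overflow, so a comparable check would have to stay within inputs whose difference is representable.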
[Bug target/117978] Optimise 128-bit-predicated SVE loads to Advanced SIMD LDRs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117978

Jennifer Schmitz changed:

           What    |Removed                     |Added
           Assignee|unassigned at gcc dot gnu.org|jschmitz at gcc dot gnu.org
             Status|NEW                         |ASSIGNED
[Bug tree-optimization/114999] A few missing optimizations due to `a - b` and `b - a` not being detected as negatives of each other
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114999 --- Comment #12 from Jennifer Schmitz --- Created attachment 60149 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60149&action=edit Proposed patch for detecting abs diff for signed integers
[Bug target/119009] New: AArch64: Commit 'Node clones share order' causes regression in Snappy workload for -mcpu=neoverse-v2 with LTO
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119009

            Bug ID: 119009
           Summary: AArch64: Commit 'Node clones share order' causes
                    regression in Snappy workload for -mcpu=neoverse-v2
                    with LTO
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jschmitz at gcc dot gnu.org
                CC: mjires at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Created attachment 60581
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60581&action=edit
Script to reproduce snappy regression

The commit 'Node clones share order'
(https://gcc.gnu.org/g:0895aef01c64c317b489811dbe4ac55f9c13aab3) causes a
performance regression in the Snappy workload for AArch64 with
-mcpu=neoverse-v2 and LTO: the test UIOVecSink/0 shows ~25% longer runtime.

The attachment contains a script to reproduce the regression. It builds GCC
from commits bad3714b and 0895aef0 and runs Snappy with
-O3 -Wl,-z,muldefs -lm -flto=auto -Wl,-sort-section=name -mcpu=neoverse-v2

Use the script like this:
parentdir= ./instructions_to_reproduce.sh
[Bug target/118999] New: AArch64: Switching off early scheduling causes regressions in Snappy workload for -mcpu=neoverse-v2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118999

            Bug ID: 118999
           Summary: AArch64: Switching off early scheduling causes
                    regressions in Snappy workload for -mcpu=neoverse-v2
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jschmitz at gcc dot gnu.org
  Target Milestone: ---

Created attachment 60573
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60573&action=edit
Script to reproduce snappy regression

The commit that switched off early scheduling for AArch64
(https://gcc.gnu.org/g:c5db3f50bdf34ea96fd193a2a66d686401053bd2) causes changes
in performance for the Snappy workload for -mcpu=neoverse-v2, including runtime
increases of up to 20%.

The attachment contains a script to reproduce the regressions. It builds GCC
from commit c5db3f50 and runs Snappy with and without the -fschedule-insns
option (in addition to the other flags for which we saw the regression).
Use it like this:
parentdir= ./snappy_script.sh

As of today, we observed the following runtime changes (for Ofast_VLA; values
are percentages; positive values mean that running Snappy WITHOUT
-fschedule-insns has longer runtime than WITH -fschedule-insns):

BM_UFlat/5/2            -2.12766
BM_UValidate/1/1        12.9032
BM_UValidate/1/2        13.6905
BM_UValidate/2/1         8.21918
BM_UValidate/2/2         8.3
BM_UValidate/3/1         5.88235
BM_UValidate/3/2         6.12245
BM_UValidate/5/1        12.5
BM_UValidate/5/2         6.10329
BM_UValidate/6/1        18.4906
BM_UValidate/6/2        15.8458
BM_UValidate/7/1        20.3024
BM_UValidate/7/2        16.3934
BM_UValidate/8/1         9.34066
BM_UValidate/8/2         9.49367
BM_UValidate/9/1         8.51852
BM_UValidate/9/2         9.42623
BM_UValidateMedley       2.24829
BM_UIOVecSource/6/1      3.21285
BM_UIOVecSource/7/1      4.2654
BM_UIOVecSource/11/1     2.32558
BM_UIOVecSink/0         21.1726
BM_UIOVecSink/3          4.83871
BM_UFlatSink/11/1        2.02808
BM_ZFlat/6/1             2.03252
BM_ZFlat/7/1             4.2654

In the past, we have also seen regressions in other tests, such as UFlat/3/2
and UFlat/3/1.
[Bug tree-optimization/114999] A few missing optimizations due to `a - b` and `b - a` not being detected as negatives of each other
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114999 --- Comment #13 from Jennifer Schmitz --- Created attachment 60540 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60540&action=edit Patch for improving codegen of absolute differences of unsigned integers in aarch64 This patch builds on top of the previous one, improving codegen for the same test cases for unsigned integers (32-bit and 64-bit) for aarch64. The patch adds a new define_insn_and_split pattern in the aarch64 backend.
[Bug target/118999] [15 regression] AArch64: Switching off early scheduling (r15-6661-gc5db3f50bdf34e) causes regressions in Snappy workload for -mcpu=neoverse-v2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118999 --- Comment #2 from Jennifer Schmitz --- Thanks for looking into this. The regression looks to have been resolved by: AArch64: Enable early scheduling for -O3 and higher (PR118351) On our machines, the runtimes are back to normal. Do you still see the regressions? If not, feel free to close the ticket.
[Bug ipa/119009] [15 regression] AArch64: Commit 'Node clones share order' (r15-6345-g0895aef01c64c3) causes regression in Snappy workload for -mcpu=neoverse-v2 with LTO
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119009 --- Comment #4 from Jennifer Schmitz --- Thanks for looking into this. Indeed, the runtime has recovered in the meantime. From our side, we can close the PR.
[Bug target/117978] Optimise 128-bit-predicated SVE loads to Advanced SIMD LDRs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117978 --- Comment #6 from Jennifer Schmitz --- Created attachment 60790 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60790&action=edit Proposed patch for folding SVE load/store with certain ptrue patterns to LDR/STR
[Bug tree-optimization/119706] [12/13/14 regression] ICE in gimple pass 'dom' for -O3 -mcpu=grace --param=aarch64-autovec-preference=sve-only
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119706 --- Comment #7 from Jennifer Schmitz --- Great, thanks a lot for the quick fix!
[Bug tree-optimization/119706] New: [15 regression] ICE in gimple pass 'dom' for -O3 -mcpu=grace --param=aarch64-autovec-preference=sve-only
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119706

            Bug ID: 119706
           Summary: [15 regression] ICE in gimple pass 'dom' for -O3
                    -mcpu=grace --param=aarch64-autovec-preference=sve-only
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jschmitz at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Created attachment 61057
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61057&action=edit
Test case for reproducing the ICE

For the attached test case (reduced from the RAJAPerf kernel
Basic_MULTI_REDUCE), there is an ICE when compiling it with
-O3 -mcpu=grace --param=aarch64-autovec-preference=sve-only:

during GIMPLE pass: dom
testcase.i: In member function ‘void a::basic::u::v(cd, a::b)’:
testcase.i:170:6: internal compiler error: in maybe_canonicalize_mem_ref_addr, at gimple-fold.cc:6394
  170 | void u::v(cd, b) {
      |      ^
0x2622a9b internal_error(char const*, ...)
        .././../../src/gcc/diagnostic-global-context.cc:517
0x844b57 fancy_abort(char const*, int, char const*)
        .././../../src/gcc/diagnostic.cc:1749
0xf4ff5b maybe_canonicalize_mem_ref_addr
        .././../../src/gcc/gimple-fold.cc:6394
0xf5c117 fold_stmt_1
        .././../../src/gcc/gimple-fold.cc:6499
0x1497cd7 dom_opt_dom_walker::optimize_stmt(basic_block_def*, gimple_stmt_iterator*, bool*)
        .././../../src/gcc/tree-ssa-dom.cc:2352
0x149951f dom_opt_dom_walker::before_dom_children(basic_block_def*)
        .././../../src/gcc/tree-ssa-dom.cc:1747
0x22f08e3 dom_walker::walk(basic_block_def*)
        .././../../src/gcc/domwalk.cc:311
0x1499e13 execute
        .././../../src/gcc/tree-ssa-dom.cc:939

The gimple expression
MEM [(double *)POLY_INT_CST [16B, 16B] + ivtmp_97 * 8]
does not pass the assertion
gcc_checking_assert (TREE_CODE (TREE_OPERAND (*t, 0)) == DEBUG_EXPR_DECL
                     || is_gimple_mem_ref_addr (TREE_OPERAND (*t, 0))).
[Bug tree-optimization/119606] New: [15 regression] Commit 'Optimize string constructor' causes regression in Snappy workload for -mcpu=neoverse-v2 with LTO
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119606

            Bug ID: 119606
           Summary: [15 regression] Commit 'Optimize string constructor'
                    causes regression in Snappy workload for
                    -mcpu=neoverse-v2 with LTO
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jschmitz at gcc dot gnu.org
                CC: hubicka at ucw dot cz
  Target Milestone: ---
            Target: aarch64

Created attachment 60969
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60969&action=edit
Script to reproduce snappy regression

The commit that optimizes string constructors
(https://gcc.gnu.org/g:9c5505a35d9d71705464f9254f55407192d31ec3) causes changes
in performance for the Snappy workload for -mcpu=neoverse-v2, including some
regressions.

The attachment contains a script to reproduce the regressions. It builds GCC
from commits 37f35ebc and 9c5505a3 and runs Snappy with
-O3 -Wl,-z,muldefs -lm -flto=auto -Wl,--sort-section=name -mcpu=neoverse-v2.
Use it like this:
parentdir= ./snappy_script.sh

As of today, we observed the following runtime changes (values are percentages;
positive values mean that running Snappy from commit 9c5505a3 has longer
runtime than from commit 37f35ebc):

BM_UFlat/4/2             2.92308
BM_UValidate/5/2        -2.9106
BM_UValidate/7/1         2.29277
BM_UValidate/11/1        5.47945
BM_UIOVecSource/0/1      4.00891
BM_UIOVecSource/0/2      6.37636
BM_UIOVecSource/2/1     -3.59375
BM_UIOVecSource/2/2      2.8754
BM_UIOVecSource/4/2      4.42478
BM_UIOVecSource/5/2      2.42424
BM_UIOVecSource/10/2     8.71985
BM_UIOVecSink/3          3.1746
BM_UFlatSink/10/2        2.41935
BM_ZFlat/0/1             3.24826
BM_ZFlat/0/2             6.54952
BM_ZFlat/1/2             2.00501
BM_ZFlat/2/2             4.46735
BM_ZFlat/4/2             4.5045
BM_ZFlat/5/2             2.47678
BM_ZFlat/10/2            9.17782

In the past, we have also seen regressions in other tests, such as UFlat/6/1
and UFlat/6/2.
[Bug libstdc++/119606] [15 regression] Commit 'Optimize string constructor' causes regression in Snappy workload for -mcpu=neoverse-v2 with LTO
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119606

--- Comment #6 from Jennifer Schmitz ---
(In reply to Jan Hubicka from comment #5)
> the patch to string constructor should be kind of orthogonal to PR86590.
> I downloaded snappy and perfed it on znver3 machine and while I see there
> are some strings involved, I do not see anything obvious.
>
> Is there a way to localize the problem?
> Can I run only one of the benchmarks that changed most?

According to the Snappy documentation, there is an option --benchmark_filter
that can be added to the execution command, e.g.
./snappy_benchmark --benchmark_filter=BM_ZFlat_10_2 ...
Does that work for you?
[Bug libstdc++/119606] [15/16 regression] Commit 'Optimize string constructor' causes regression in Snappy workload for -mcpu=neoverse-v2 with LTO
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119606 --- Comment #7 from Jennifer Schmitz --- For another regression in the Snappy workload (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119910), we found that it was caused by an alignment issue. I added -falign-functions=32 -falign-loops=32 -falign-jumps=32 -falign-labels=32 to the compile flags and could not reproduce the regressions seen below anymore. Perf profiling of a run with BM_ZFlat/10/2 showed that the hot sections have the same assembly sequences, but the addresses are shifted.
[Bug target/119910] [15 regression] Commit 'combine: Allow 2->2 combinations...' causes regression in Snappy workload for -mcpu=neoverse-v2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119910

--- Comment #3 from Jennifer Schmitz ---
Yes, it seems to be an alignment problem: I took a look with perf at the hot
sections and the assembly sequence is the same. But objdump of the benchmark
executable showed that the number of nops differs slightly between the commits
and the addresses of the hot sections are shifted.
Indeed, adding
-falign-functions=32 -falign-loops=32 -falign-jumps=32 -falign-labels=32
to the build flags gets rid of the regressions.
[Bug rtl-optimization/119910] New: [15 regression] Commit 'combine: Allow 2->2 combinations...' causes regression in Snappy workload for -mcpu=neoverse-v2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119910

            Bug ID: 119910
           Summary: [15 regression] Commit 'combine: Allow 2->2
                    combinations...' causes regression in Snappy workload
                    for -mcpu=neoverse-v2
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jschmitz at gcc dot gnu.org
                CC: rsandifo at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Created attachment 61177
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61177&action=edit
Script to reproduce snappy regression

The commit 'combine: Allow 2->2 combinations, but with a tweak [PR116398]'
(https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=4d7a634f6d41029811cdcbd5f7282b5b07890094)
causes changes in performance for the Snappy workload for -mcpu=neoverse-v2,
including some regressions.

The attachment contains a script to reproduce the regressions. It builds GCC
from commits 546f28f83ce and 4d7a634f6d4 and runs Snappy with
-O3 -Wl,-z,muldefs -lm -mcpu=neoverse-v2.
Use it like this:
parentdir= ./snappy_script.sh

As of today, we observed the following runtime changes (values are percentages;
positive values mean that running Snappy from commit 4d7a634f6d4 has longer
runtime than from commit 546f28f83ce):

BM_UFlat/5/1            -2.39362
BM_UValidate/5/2         2.85714
BM_UValidate/6/2         2.04461
BM_UValidate/10/2       -2.79503
BM_UValidate/11/2       -5.4321
BM_UIOVecSource/0/1     18.0723
BM_UIOVecSource/5/1      5.10949
BM_UIOVecSource/10/1     8.59951
BM_UIOVecSource/11/1     2.39044
BM_UIOVecSink/3          2.2
BM_UFlatSink/7/1         3.22581
BM_UFlatSink/7/2         3.85164
BM_ZFlat/0/1            19.0184
BM_ZFlat/3/1            -2.08333
BM_ZFlat/3/2            -2.51799
BM_ZFlat/5/1             4.41176
BM_ZFlat/10/1            9.02256
BM_ZFlat/11/1            2.39044

In the past, we have also seen regressions in several of the UValidate tests.