[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 Bug 26163 depends on bug 89430, which changed state. Bug 89430 Summary: A missing ifcvt optimization to generate csel https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug tree-optimization/89430] A missing ifcvt optimization to generate csel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430 Martin Jambor changed: What|Removed |Added Status|NEW |RESOLVED CC||jamborm at gcc dot gnu.org Resolution|--- |FIXED --- Comment #11 from Martin Jambor --- (In reply to Jeffrey A. Law from comment #10) > Fixed on the trunk. So marking as fixed.
[Bug tree-optimization/80635] [8/9/10 regression] std::optional and bogus -Wmaybe-uninitialized warning
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80635 --- Comment #51 from Martin Jambor --- (In reply to Andrew Pinski from comment #48) > This should also work too: > diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c > index ea8594db193..83b1d981439 100644 > --- a/gcc/tree-sra.c > +++ b/gcc/tree-sra.c > @@ -2499,6 +2499,7 @@ analyze_access_subtree (struct access *root, struct > access *parent, > For integral types this means the precision has to match. > Avoid assumptions based on the integral type kind, too. */ >if (INTEGRAL_TYPE_P (root->type) > + && TREE_CODE (root->type) != BOOLEAN_TYPE > && (TREE_CODE (root->type) != INTEGER_TYPE > || TYPE_PRECISION (root->type) != root->size) > /* But leave bitfield accesses alone. */ > > CUT Well, this re-introduces bug PR 52244 and makes the associated testcase fail. PR 52244 fix specifically aimed to disallow boolean replacements. (In reply to Jeffrey A. Law from comment #50) > Reassigning to Martin Jambor since the real fix is to avoid creating the > V_C_E in the first place. I hoped that changing SRA to emit a NOP_EXPR instead of V_C_E would help, but unfortunately it doesn't. I've been looking at this for the whole evening yesterday and ATM I do not see how I could avoid conversion without reintroducing PR 52244 (in the general case - this is another consequence of the fact that SRA is not flow sensitive).
[Bug tree-optimization/93435] [8/9/10 Regression] Hang with -O2 on innocuous looking code with GCC 8.3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93435 --- Comment #8 from Martin Jambor --- The issue actually started with my r8-344-2bba75411e1 and it is basically a perfect SRA bomb: it makes SRA sub-access propagation across assignments create gazillions of accesses and then replacements, because they facilitate forward propagation (and as the ccp3 dump shows, they do). I already have a patch that simply limits the number of replacements to a param, defaulting to 128, which makes the testcase compilation finish in about 9 seconds on my machine. However, SRA analysis still takes 7 seconds of that, so I'm looking at capping the propagation earlier. That takes more book-keeping, so at least for backports, I'd like to use the simpler approach on released branches.
[Bug tree-optimization/93435] [8/9 Regression] Hang with -O2 on innocuous looking code with GCC 8.3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93435 Martin Jambor changed: What|Removed |Added Summary|[8/9/10 Regression] Hang|[8/9 Regression] Hang with |with -O2 on innocuous |-O2 on innocuous looking |looking code with GCC 8.3 |code with GCC 8.3 --- Comment #10 from Martin Jambor --- Fixed on trunk with https://gcc.gnu.org/pipermail/gcc-patches/2020-March/542390.html
[Bug ipa/94360] New: 6% run-time regression of 502.gcc_r against GCC 9 when compiled with -O2 and both PGO and LTO
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94360 Bug ID: 94360 Summary: 6% run-time regression of 502.gcc_r against GCC 9 when compiled with -O2 and both PGO and LTO Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: jamborm at gcc dot gnu.org CC: hubicka at gcc dot gnu.org, marxin at gcc dot gnu.org Blocks: 26163 Target Milestone: --- Host: x86_64-linux Target: x86_64-linux

When built with current trunk/master at -O2 with generic march/mtune and both PGO and LTO, SPEC 2017 INTrate 502.gcc_r is 6% slower than with GCC 9 when run on an AMD Zen2-based CPU, and about 4.8% slower on Intel Cascade Lake.

Looking at how the run-time of the benchmark evolved over the course of the GCC 10 development cycle, the first and biggest regression (9%) comes with:

commit 2925cad2151842daa387950e62d989090e47c91d
Author: Jan Hubicka
Date: Thu Oct 3 17:08:21 2019 +0200

    params.def (PARAM_INLINE_HEURISTICS_HINT_PERCENT, [...]): New.

    * params.def (PARAM_INLINE_HEURISTICS_HINT_PERCENT,
    PARAM_INLINE_HEURISTICS_HINT_PERCENT_O2): New.
    * doc/invoke.texi (inline-heuristics-hint-percent,
    inline-heuristics-hint-percent-O2): Document.
    * tree-inline.c (inline_insns_single, inline_insns_auto): Add new hint
    attribute.
    (can_inline_edge_by_limits_p): Use it.

    From-SVN: r276516

Then between Wed Nov 6 (72d6aeecd95) and Mon Nov 18 (58c036c8354) it improved to about 103% of GCC 9 run-time (I did not find out exactly what caused it because in much of this range the compiler was segfaulting in the LTO phase).

Eventually, the benchmark regresses to the current 106% of GCC 9 run-time with Honza's:

- 9340d34599e Convert inliner to function specific param infrastructure, or
- 1e83bd7003e Convert inliner to new param infrastructure.

The former cannot be built without the latter.

Symbol profiles are:

trunk (26b3e568a60):

Overhead  Samples  Shared Object         Symbol
   4.04%    42371  cpugcc_r_peak.pgolto  bitmap_ior_into
   2.91%    30281  cpugcc_r_peak.pgolto  df_worklist_dataflow
   2.24%    23342  cpugcc_r_peak.pgolto  df_note_compute
   1.92%    20120  cpugcc_r_peak.pgolto  bitmap_set_bit
   1.75%    18148  cpugcc_r_peak.pgolto  rest_of_handle_fast_dce.lto_priv.0
   1.58%    16580  libc-2.31.so          __memset_avx2_unaligned_erms
   1.40%    14514  cpugcc_r_peak.pgolto  extract_new_fences_from.lto_priv.0
   1.39%    14732  libc-2.31.so          _int_malloc
   1.33%    13824  cpugcc_r_peak.pgolto  bitmap_copy
   1.24%    12962  cpugcc_r_peak.pgolto  bitmap_bit_p
   1.19%    12346  cpugcc_r_peak.pgolto  bitmap_and
   1.18%    12242  cpugcc_r_peak.pgolto  df_lr_local_compute.lto_priv.0
   1.02%    10618  cpugcc_r_peak.pgolto  cleanup_cfg.isra.0

vs gcc 9 (releases/gcc-9.3.0):

Overhead  Samples  Shared Object         Symbol
   6.81%    66967  cpugcc_r_peak.pgolto  df_worklist_dataflow
   2.83%    28063  cpugcc_r_peak.pgolto  bitmap_ior_into
   2.80%    27489  cpugcc_r_peak.pgolto  df_note_compute.lto_priv.0
   2.17%    21334  cpugcc_r_peak.pgolto  rest_of_handle_fast_dce.lto_priv.0
   1.69%    16671  libc-2.31.so          __memset_avx2_unaligned_erms
   1.51%    14876  cpugcc_r_peak.pgolto  try_optimize_cfg.lto_priv.0
   1.50%    14990  libc-2.31.so          _int_malloc
   1.50%    14715  cpugcc_r_peak.pgolto  extract_new_fences_from.lto_priv.0
   1.36%    13406  cpugcc_r_peak.pgolto  df_lr_local_compute.lto_priv.0
   1.20%    11926  cpugcc_r_peak.pgolto  remove_unused_locals
   1.06%    10433  cpugcc_r_peak.pgolto  sched_analyze_insn
   1.04%    10210  cpugcc_r_peak.pgolto  init_alias_analysis
   1.04%    10188  cpugcc_r_peak.pgolto  prescan_insns_for_dce.lto_priv.0
   1.00%     9876  cpugcc_r_peak.pgolto  compute_transp

Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 [Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
[Bug tree-optimization/94364] New: 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364 Bug ID: 94364 Summary: 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128 Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jamborm at gcc dot gnu.org Blocks: 26163 Target Milestone: --- Host: x86_64-linux Target: x86_64-linux

SPEC 2017 INTrate benchmark 505.mcf_r, when compiled with options -Ofast -march=native -mtune=native, is 8% slower than when we also use option -mprefer-vector-width=128. I have observed it on both AMD Zen2 and Intel Cascade Lake Server CPUs (using master revision 26b3e568a60). Better vector width selection would therefore bring about a noticeable speed-up.

Symbol profiles (collected on AMD Rome):

-Ofast -march=native -mtune=native:

Overhead  Samples  Shared Object    Symbol
  28.64%   462302  mcf_r_peak.mine  spec_qsort
  21.58%   348703  mcf_r_peak.mine  cost_compare
  15.81%   255029  mcf_r_peak.mine  primal_bea_mpp
  15.58%   251176  mcf_r_peak.mine  replace_weaker_arc
   7.37%   118646  mcf_r_peak.mine  arc_compare
   6.53%   105337  mcf_r_peak.mine  price_out_impl
   1.38%    22276  mcf_r_peak.mine  update_tree

-Ofast -march=native -mtune=native -mprefer-vector-width=128:

Overhead  Samples  Shared Object    Symbol
  23.57%   354536  mcf_r_peak.mine  spec_qsort
  23.51%   353767  mcf_r_peak.mine  cost_compare
  16.98%   255104  mcf_r_peak.mine  primal_bea_mpp
  16.65%   249891  mcf_r_peak.mine  replace_weaker_arc
   7.29%   109267  mcf_r_peak.mine  arc_compare
   7.09%   106380  mcf_r_peak.mine  price_out_impl
   1.53%    22968  mcf_r_peak.mine  update_tree

Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 [Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
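To see where the extra time goes, the two profiles in this report can be compared symbol by symbol. The following is only a rough sketch: the sample counts are copied from the report, while the helper name `sample_deltas` and the idea of ranking symbols by absolute sample difference are illustrative assumptions, not part of the original analysis.

```python
# Sample counts per symbol, copied from the two perf profiles in the report.
NATIVE = {
    "spec_qsort": 462302,
    "cost_compare": 348703,
    "primal_bea_mpp": 255029,
    "replace_weaker_arc": 251176,
    "arc_compare": 118646,
    "price_out_impl": 105337,
    "update_tree": 22276,
}
VECTOR_WIDTH_128 = {
    "spec_qsort": 354536,
    "cost_compare": 353767,
    "primal_bea_mpp": 255104,
    "replace_weaker_arc": 249891,
    "arc_compare": 109267,
    "price_out_impl": 106380,
    "update_tree": 22968,
}


def sample_deltas(slow, fast):
    """Hypothetical helper: per-symbol sample difference, largest first."""
    deltas = [(symbol, slow[symbol] - fast[symbol]) for symbol in slow]
    return sorted(deltas, key=lambda item: abs(item[1]), reverse=True)


deltas = sample_deltas(NATIVE, VECTOR_WIDTH_128)
# Ratio of the samples in the listed symbols only (the profiles are truncated).
listed_ratio = sum(NATIVE.values()) / sum(VECTOR_WIDTH_128.values())

if __name__ == "__main__":
    for symbol, delta in deltas:
        print(f"{symbol:>20}: {delta:+d}")
    print(f"listed samples, native / width-128: {listed_ratio:.3f}")
```

With these numbers nearly all of the difference sits in spec_qsort, with arc_compare a distant second, which suggests looking at how the sorting code is vectorized first.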
[Bug gcov-profile/94369] New: 505.mcf_r is 6-7% slower at -Ofast -march=native with PGO+LTO than with just LTO
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94369 Bug ID: 94369 Summary: 505.mcf_r is 6-7% slower at -Ofast -march=native with PGO+LTO than with just LTO Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: gcov-profile Assignee: unassigned at gcc dot gnu.org Reporter: jamborm at gcc dot gnu.org CC: marxin at gcc dot gnu.org Blocks: 26163 Target Milestone: --- Host: x86_64-linux Target: x86_64-linux

SPEC 2017 INTrate benchmark 505.mcf_r, when compiled with options -Ofast -march=native -mtune=native, is 6-7% slower when compiled with both PGO and LTO than when built with just LTO. I have observed this on both AMD Zen2 (7%) and Intel Cascade Lake (6%) server CPUs. The train run cannot be very bad because without LTO, PGO improves run-time by 15% on both systems. This is with master revision 26b3e568a60.

Profiling results (from an AMD CPU):

LTO:

Overhead  Samples  Shared Object    Symbol
  39.53%   518450  mcf_r_peak.mine  spec_qsort.constprop.0
  22.13%   289745  mcf_r_peak.mine  master.constprop.0
  19.00%   248641  mcf_r_peak.mine  replace_weaker_arc
   9.37%   122669  mcf_r_peak.mine  main
   8.60%   112601  mcf_r_peak.mine  spec_qsort.constprop.1

PGO+LTO:

Overhead  Samples  Shared Object    Symbol
  40.13%   562770  mcf_r_peak.mine  spec_qsort.constprop.0
  21.68%   303543  mcf_r_peak.mine  master.constprop.0
  18.24%   255236  mcf_r_peak.mine  replace_weaker_arc
  10.32%   144433  mcf_r_peak.mine  main
   8.07%   112775  mcf_r_peak.mine  arc_compare

Perhaps I should note that we have patched qsort in the benchmark to work with strict aliasing even with LTO. But the performance gap is there also with -fno-strict-aliasing.

Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 [Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 Bug 26163 depends on bug 90056, which changed state. Bug 90056 Summary: 548.exchange2_r regressions on AMD Zen https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90056 What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |MOVED
[Bug middle-end/90056] 548.exchange2_r regressions on AMD Zen
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90056 Martin Jambor changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |MOVED

--- Comment #2 from Martin Jambor ---
(In reply to Martin Jambor from comment #0)
> As of revision 270053, the 548.exchange2_r benchmark from SPEC 2017
> INTrate suite suffered a number of smaller regressions on AMD Zen
> CPUs:
>
> - At -O2, it is 4.5% slower than when compiled with GCC 7

I am about to file a specific bug about exchange at -O2.

> - At -Ofast, it is 4.7% slower than when compiled with GCC 8

This is no longer true.

> - At -Ofast -march=native -mtune=native, this difference is 6.9%

Again, I will file a more specific bug about -Ofast -march=native in a little while.

> - At -Ofast and native tuning, it is 6% slower with PGO than
> without it.

I can still see this in my measurements on a Zen1-based CPU but not in those done on AMD Zen2 or Intel Cascade Lake. So I am not sure if we care. I'll be happy to file a specific bug if we do.
[Bug tree-optimization/94373] New: 548.exchange2_r run time is 7-12% worse than GCC 9 at -O2 and generic march/mtune
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94373 Bug ID: 94373 Summary: 548.exchange2_r run time is 7-12% worse than GCC 9 at -O2 and generic march/mtune Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jamborm at gcc dot gnu.org CC: rguenth at gcc dot gnu.org Blocks: 26163 Target Milestone: --- Host: x86_64-linux Target: x86_64-linux When compiled with just -O2, SPEC 2017 INTrate benchmark 548.exchange2_r runs slower than when compiled with GCC 9.2. It is: - 8% slower on AMD Zen2-based server CPU (rev. 26b3e568a60) - 12% slower on Intel Cascade Lake server CPU (rev. abe13e1847f) - 7% slower on AMD Zen1-based server CPU (rev. 26b3e568a60) During GCC 10 development cycle the benchmark was relatively noisy and the run time was increasing in many small steps, but between October 7 and November 15 we were doing 3% better than GCC 9 (on Zen2). Specifically the following commit brought about the improvement: commit 806bdf4e40d31cf55744c876eb9f17654de36b99 Author: Richard Biener Date: Mon Oct 7 07:53:45 2019 + re PR tree-optimization/91975 (worse code for small array copy using pointer arithmetic than array indexing) 2019-10-07 Richard Biener PR tree-optimization/91975 * tree-ssa-loop-ivcanon.c (constant_after_peeling): Consistently handle invariants. From-SVN: r276645 But it was undone by its revert: commit f0af4848ac40d2342743c9b16416310d61db85b5 Author: Richard Biener Date: Fri Nov 15 09:09:16 2019 + re PR tree-optimization/92039 (Spurious -Warray-bounds warnings building 32-bit glibc) 2019-11-15 Richard Biener PR tree-optimization/92039 PR tree-optimization/91975 * tree-ssa-loop-ivcanon.c (constant_after_peeling): Revert previous change, treat invariants consistently as non-constant. (tree_estimate_loop_size): Ternary ops with just the first op constant are not optimized away. * gcc.dg/tree-ssa/cunroll-2.c: Revert to state previous to unroller adjustment. 
* g++.dg/tree-ssa/ivopts-3.C: Likewise. From-SVN: r278281 On the Intel machine, reverting the revert fixes the regression too. Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 [Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
[Bug tree-optimization/94375] New: 548.exchange2_r run time is 8-18% worse than GCC 9 at -Ofast -march=native
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94375 Bug ID: 94375 Summary: 548.exchange2_r run time is 8-18% worse than GCC 9 at -Ofast -march=native Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jamborm at gcc dot gnu.org Blocks: 26163 Target Milestone: --- Host: x86_64-linux Target: x86_64-linux

When compiled with trunk revision 26b3e568a60 and options -Ofast -march=native -mtune=native, SPEC 2017 INTrate benchmark 548.exchange2_r runs 19% slower on AMD Zen2 and 12% slower on Intel Cascade Lake than when built with GCC 9.2.

It appears that the main culprit is the vectorizer: switching it off recovers the performance (it is in fact even some 4% better than GCC 9 on AMD).

Side note: with --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 one can get exchange to run 25% faster still, but that is a different issue.

This started happening in the autumn but not exactly at one point, as the following table of run-times relative to GCC 9.2 shows.

Revision                    time
d82f38123b5 (Nov 14 2019)   117%
d9adca6e663 (Nov 5 2019)    117%
bf037872d3c (Oct 24 2019)   101%
77ef339456f (Oct 14 2019)   118%
38a734350fd (Oct 3 2019)    100%
d469a71e5a0 (Sep 23 2019)   101%

Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 [Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
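The remark that the regression did not appear "exactly at one point" can be made concrete by walking the measured revisions in chronological order and noting where the run-time crosses a threshold. This is a minimal sketch: the revisions and percentages come from the table in this report, whereas the 110% threshold and the helper name `state_changes` are assumptions picked purely for illustration.

```python
# (revision, date, run-time in % of GCC 9.2), oldest first, from the report.
MEASUREMENTS = [
    ("d469a71e5a0", "Sep 23 2019", 101),
    ("38a734350fd", "Oct 3 2019", 100),
    ("77ef339456f", "Oct 14 2019", 118),
    ("bf037872d3c", "Oct 24 2019", 101),
    ("d9adca6e663", "Nov 5 2019", 117),
    ("d82f38123b5", "Nov 14 2019", 117),
]

# Assumed cut-off separating "fine" from "regressed"; not from the report.
THRESHOLD = 110


def state_changes(rows, threshold):
    """Hypothetical helper: revision pairs between which the state flips."""
    changes = []
    for previous, current in zip(rows, rows[1:]):
        if (previous[2] > threshold) != (current[2] > threshold):
            changes.append((previous[0], current[0]))
    return changes


changes = state_changes(MEASUREMENTS, THRESHOLD)

if __name__ == "__main__":
    for good_or_bad_from, good_or_bad_to in changes:
        print(f"state flips between {good_or_bad_from} and {good_or_bad_to}")
```

Under that threshold the state flips three times within the measured range, so a single bisection over the whole range would not be reliable; each interval would need to be bisected separately.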
[Bug middle-end/90056] 548.exchange2_r regressions on AMD Zen
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90056 --- Comment #3 from Martin Jambor --- So replaced with more specific bugs for newer hardware: PR94373 and PR94375.
[Bug middle-end/87528] Popcount changes caused 531.deepsjeng_r run-time regression on Skylake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528 --- Comment #8 from Martin Jambor --- Do I understand correctly that this is fixed?
[Bug tree-optimization/94375] 548.exchange2_r run time is 8-18% worse than GCC 9 at -Ofast -march=native
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94375 --- Comment #3 from Martin Jambor --- (In reply to Hongtao.liu from comment #1) > Try -mprefer-vector-width=128,256-bit vectorization is not helpful for 548 > according to our experience. I have seen this helping on one system running SLES 15.1 and with trunk abe13e1847f (Feb 17 2020) but not on another running openSUSE Tumbleweed and with trunk revision 26b3e568a60 (Mar 23 2020). So, from my perspective, perhaps it helps, perhaps it doesn't.
[Bug target/94400] New: 531.deepsjeng_r is 7% slower at -O2 -march=znver2 than GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94400 Bug ID: 94400 Summary: 531.deepsjeng_r is 7% slower at -O2 -march=znver2 than GCC 9 Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jamborm at gcc dot gnu.org CC: hubicka at gcc dot gnu.org Blocks: 26163 Target Milestone: --- Host: x86_64-linux Target: x86_64-linux When compiled with -O2 -march=native and run on an AMD Zen2 CPU, 531.deepsjeng_r runs about 7% slower. This can be bisected to a single commit: commit a9a4edf0e71bbac9f1b5dcecdcf9250111d16889 Author: Jan Hubicka Date: Sat Nov 30 22:25:24 2019 +0100 Update max_bb_count in execute_fixup_cfg * tree-cfg.c (execute_fixup_cfg): Update also max_bb_count when scaling happen. From-SVN: r278879 Surprisingly, I cannot see a similar problem on an Intel Cascade Lake server CPU, but I have confirmed the above on two different Rome systems (one running SLES, one openSUSE Tumbleweed). Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 [Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
[Bug gcov-profile/94369] 505.mcf_r is 6-7% slower at -Ofast -march=native with PGO+LTO than with just LTO
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94369 --- Comment #3 from Martin Jambor ---
I did not save the reported number of samples but from the raw sample numbers and percentage points it seems so:

(562770/0.4013)/(518450/0.3953) = 1.069

Nevertheless, I did save separately obtained perf stat numbers which also look similar (and the number of branches might be a clue):

LTO:

     326083.03 msec task-clock:u                #    0.999 CPUs utilized
             0      context-switches:u          #    0.000 K/sec
             0      cpu-migrations:u            #    0.000 K/sec
          8821      page-faults:u               #    0.027 K/sec
 1080945983089      cycles:u                    #                                       (83.33%)
   21883016095      stalled-cycles-frontend:u   #    2.02% frontend cycles idle         (83.33%)
  435184347885      stalled-cycles-backend:u    #   40.26% backend cycles idle          (83.33%)
  847570680279      instructions:u              #    0.78 insn per cycle
                                                #    0.51 stalled cycles per insn       (83.34%)
  147428907202      branches:u                  #  452.121 M/sec                        (83.33%)
   13395643229      branch-misses:u             #    9.09% of all branches              (83.33%)

 326.436794016 seconds time elapsed
 325.869528000 seconds user
   0.086873000 seconds sys

vs. PGO+LTO:

     347929.80 msec task-clock:u                #    0.999 CPUs utilized
             0      context-switches:u          #    0.000 K/sec
             0      cpu-migrations:u            #    0.000 K/sec
          8535      page-faults:u               #    0.025 K/sec
 1153803509197      cycles:u                    #                                       (83.33%)
   19911862620      stalled-cycles-frontend:u   #    1.73% frontend cycles idle         (83.33%)
  476343319558      stalled-cycles-backend:u    #   41.28% backend cycles idle          (83.33%)
  894092414890      instructions:u              #    0.77 insn per cycle
                                                #    0.53 stalled cycles per insn       (83.33%)
  173999066006      branches:u                  #  500.098 M/sec                        (83.33%)
   13698979291      branch-misses:u             #    7.87% of all branches              (83.34%)

 348.308607033 seconds time elapsed
 347.711752000 seconds user
   0.090975000 seconds sys
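The estimate in this comment, and the relative growth of the individual perf stat counters, can be reproduced with a few lines of arithmetic. This is just a sketch using the figures quoted in the comment; the names `estimated_total_samples`, `COUNTERS` and `counter_ratios` are made up for the example.

```python
# Top symbol in each profile: (raw samples, share of all samples).
LTO_TOP = (518450, 0.3953)
PGO_LTO_TOP = (562770, 0.4013)


def estimated_total_samples(samples, share):
    """Estimate the total sample count from a single profile row."""
    return samples / share


sample_ratio = (estimated_total_samples(*PGO_LTO_TOP)
                / estimated_total_samples(*LTO_TOP))

# perf stat counters as (LTO, PGO+LTO) pairs, copied from the comment.
COUNTERS = {
    "cycles": (1080945983089, 1153803509197),
    "instructions": (847570680279, 894092414890),
    "branches": (147428907202, 173999066006),
    "branch-misses": (13395643229, 13698979291),
}

counter_ratios = {name: pgo / lto for name, (lto, pgo) in COUNTERS.items()}

if __name__ == "__main__":
    print(f"estimated total samples, PGO+LTO / LTO: {sample_ratio:.3f}")
    for name, ratio in counter_ratios.items():
        print(f"{name:>14}: {ratio:.3f}")
```

On these numbers cycles and instructions grow by roughly 5-7%, while the number of executed branches grows by about 18%, which fits the suspicion that the branch count is the clue.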
[Bug target/90234] 503.bwaves_r is 6% slower on Zen1 CPUs at -Ofast with native march/mtune than with generic ones
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90234 Martin Jambor changed: What|Removed |Added Summary|503.bwaves_r is 6% slower |503.bwaves_r is 6% slower |on Zen CPUs at -Ofast with |on Zen1 CPUs at -Ofast with |native march/mtune than |native march/mtune than |with generic ones |with generic ones --- Comment #1 from Martin Jambor --- I can still see this issue on a Zen1 machine as of trunk revision abe13e1847f (Feb 17 2020) but not on Zen2 machines (in both cases targeting native ISAs).
[Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406 Bug ID: 94406 Summary: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jamborm at gcc dot gnu.org CC: andre.simoesdiasvieira at arm dot com Blocks: 26163 Target Milestone: --- Host: x86_64-linux Target: x86_64-linux SPEC 2017 FPrate benchmark 503.bwaves_r compiled with -Ofast -march=native -mtune=native runs 11% slower on AMD Zen2 CPUs when built with trunk (revision abe13e1847f) than when compiled with GCC 9.2. Bisecting led to commit: commit 1297712fb4af6c6bfd827e0f0a9695b14669f87d Author: Andre Vieira Date: Thu Oct 31 09:49:47 2019 + [vect]Make vect-epilogues-nomask=1 default This patch turns epilogue vectorization on by default for all targets. From-SVN: r277659 If we use current trunk but build also with option --param vect-epilogues-nomask=0 we get run-time on par with GCC 9. This is also the reason why generic march/tuning or building with -mprefer-vector-width=128 currently results in faster code than simple -march=native. Interestingly, I do not see this issue on an Intel Cascade Lake Server CPU, even though the epilogue is created there too - judging by CFG of the hottest function which looks the same. And I am not sure to what extent it tells anything at all, but I accidentally also perf'ed load-to-store-stall events and in the slow version, the reported "samples" was 10% higher and the reported "event count" shot up 2.8 times(!). Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 [Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
[Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406 --- Comment #1 from Martin Jambor ---
For the record, the collected profiles both for the traditional "cycles:u" event and (originally unintended) "ls_stlf:u" event are below:

-Ofast -march=native -mtune=native

# Samples: 894K of event 'cycles:u'
# Event count (approx.): 735979402525
#
# Overhead  Samples  Command          Shared Object                 Symbol
    67.18%   599542  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
    11.40%   102686  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
    11.37%   101388  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] bi_cgstab_block_
     6.95%    62694  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
     1.88%    16957  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
     1.01%     9023  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned

# Samples: 769K of event 'ls_stlf:u'
# Event count (approx.): 154704730574
#
# Overhead  Samples  Command          Shared Object                 Symbol
    94.59%   612921  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
     1.83%    88259  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
     1.12%    13615  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
     1.11%    43093  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
     1.05%     8746  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned

-Ofast -march=native -mtune=native --param vect-epilogues-nomask=0

# Samples: 816K of event 'cycles:u'
# Event count (approx.): 671104061807
#
# Overhead  Samples  Command          Shared Object                 Symbol
    64.07%   521532  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
    12.50%   102670  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
    12.39%   100777  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] bi_cgstab_block_
     7.60%    62641  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
     2.06%    16925  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
     1.17%     9531  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned

# Samples: 705K of event 'ls_stlf:u'
# Event count (approx.): 55009340780
#
# Overhead  Samples  Command          Shared Object                 Symbol
    86.26%   532930  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
     5.15%    88270  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
     3.17%    13696  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
     3.06%    57149  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
     1.59%     9226  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned
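The headline figures of these profiles can be compared directly. The sketch below only divides the sample totals and approximate event counts reported in the perf headers; the dictionary layout and the helper name `ratio` are illustrative assumptions.

```python
# (samples in thousands, approximate event count) from the perf headers.
WITH_EPILOGUES = {
    "cycles:u": (894, 735979402525),
    "ls_stlf:u": (769, 154704730574),
}
WITHOUT_EPILOGUES = {  # built with --param vect-epilogues-nomask=0
    "cycles:u": (816, 671104061807),
    "ls_stlf:u": (705, 55009340780),
}


def ratio(event, field):
    """Hypothetical helper: slow/fast ratio; field 0 = samples, 1 = count."""
    return WITH_EPILOGUES[event][field] / WITHOUT_EPILOGUES[event][field]


cycles_count_ratio = ratio("cycles:u", 1)
stlf_sample_ratio = ratio("ls_stlf:u", 0)
stlf_count_ratio = ratio("ls_stlf:u", 1)

if __name__ == "__main__":
    print(f"cycles:u event count ratio:  {cycles_count_ratio:.3f}")
    print(f"ls_stlf:u sample ratio:      {stlf_sample_ratio:.3f}")
    print(f"ls_stlf:u event count ratio: {stlf_count_ratio:.3f}")
```

The ls_stlf:u sample count is about 9% higher in the slow build while its event count is about 2.8 times higher, in line with the figures quoted when the bug was filed; the cycles:u event count differs by roughly 10%.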
[Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406 --- Comment #2 from Martin Jambor --- And for completeness, LNT sees this too and has just managed to catch the regression: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=276.427.0&plot.1=295.427.0&
[Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406 --- Comment #3 from Martin Jambor --- One more data point: a binary compiled for cascadelake does not run on Zen2, but one compiled for znver2 runs on Cascade Lake and it makes no difference in run-time. If disabling epilogues helps on Intel, the difference is less than 2%.
[Bug gcov-profile/94410] New: 511.povray_r is 11% slower built at -O2 PGO+LTO than with GCC 9 and same options
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94410 Bug ID: 94410 Summary: 511.povray_r is 11% slower built at -O2 PGO+LTO than with GCC 9 and same options Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: gcov-profile Assignee: unassigned at gcc dot gnu.org Reporter: jamborm at gcc dot gnu.org CC: hubicka at gcc dot gnu.org, marxin at gcc dot gnu.org Blocks: 26163 Target Milestone: --- Host: x86_64-linux Target: x86_64-linux SPEC 2017 FPrate benchmark 511.povray_r runs 11 % slower on AMD Zen2 CPU and 10% slower on Intel Cascade Lake server CPU when built with -O2 (generic march/tuning) and both PGO and LTO with trunk (revision 26b3e568a60) than when compiled with the same options with GCC 9. Bisecting revealed that the slowdown was introduced with: commit 2925cad2151842daa387950e62d989090e47c91d Author: Jan Hubicka Date: Thu Oct 3 17:08:21 2019 +0200 params.def (PARAM_INLINE_HEURISTICS_HINT_PERCENT, [...]): New. * params.def (PARAM_INLINE_HEURISTICS_HINT_PERCENT, PARAM_INLINE_HEURISTICS_HINT_PERCENT_O2): New. * doc/invoke.texi (inline-heuristics-hint-percent, inline-heuristics-hint-percent-O2): Document. * tree-inline.c (inline_insns_single, inline_insns_auto): Add new hint attribute. (can_inline_edge_by_limits_p): Use it. From-SVN: r276516 The revision just before it was even 9% and 7% faster than GCC 9 on AMD and Intel respectively. Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 [Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
[Bug gcov-profile/94410] 511.povray_r is 11% slower built at -O2 PGO+LTO than with GCC 9 and same options
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94410 Martin Jambor changed: What|Removed |Added See Also||https://gcc.gnu.org/bugzill ||a/show_bug.cgi?id=94360 --- Comment #1 from Martin Jambor --- PR94360 is another O2 PGO+LTO bug where the commit caused a slowdown.
[Bug ipa/94360] 6% run-time regression of 502.gcc_r against GCC 9 when compiled with -O2 and both PGO and LTO
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94360 --- Comment #2 from Martin Jambor --- PR94410 is another O2 PGO+LTO bug where g:2925cad2151 caused a slowdown.
[Bug gcov-profile/90364] 521.wrf_r is 8-17% slower with PGO at -Ofast and native march/mtune
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364 Martin Jambor changed: What|Removed |Added Last reconfirmed|2019-05-06 00:00:00 |2020-3-30 Summary|521.wrf_r is 9.5 % slower |521.wrf_r is 8-17% slower |with PGO on Zen CPUs at |with PGO at -Ofast and |-Ofast and native |native march/mtune |march/mtune | --- Comment #9 from Martin Jambor --- The problem still persists across the board, causing: - 17% regression against non-PGO on AMD Zen2 CPU, - 8% regression against non-PGO on AMD Zen1 CPU, and - 12% regression against non-PGO on Intel Cascade Lake server CPU. All of the above is at -Ofast -march=native, by the way. At just -O2 (and generic -march) PGO actually helps by 25-27% on all three systems, so I would double check before blaming specinvoke (though of course it might be the culprit).
[Bug middle-end/90283] 519.lbm_r is 7%-10% slower with -Ofast -march=native and both LTO and PGO than with GCC 8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90283 --- Comment #5 from Martin Jambor --- The numbers from this year are: - on Intel Cascade Lake server CPU the regression disappeared, if there ever was one, I don't have Skylake numbers this year. - On AMD Zen1 CPU, the measured regression is 20% compared to GCC 8 (15% compared to GCC 9) but that most likely means we hit the known code-placement problem again. - On AMD Zen2 CPU, there is actually 6.8% regression compared to GCC 8 (and only negligible one compared to GCC 9). It may or may not be the same problem we were looking at last year. In any event, probably not very pressing, given the behavior of the benchmark :-/
[Bug gcov-profile/94410] 511.povray_r is 11% slower built at -O2 PGO+LTO than with GCC 9 and same options
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94410 --- Comment #2 from Martin Jambor --- For the record, SPEC 2006 453.povray is similarly affected, the commit makes it run 26% slower.
[Bug ipa/90151] 554.roms_r regression on x86_64 at -O2 and generic march/mtune
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90151 --- Comment #1 from Martin Jambor --- This year's numbers: - on AMD Zen1, we are still 7.2% worse than GCC 7 - on AMD Zen2, the regression is 4.6% - on Intel Cascade Lake server CPU, it is 5.4% This is all -O2, so perhaps not that important for a Fortran benchmark.
[Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406 --- Comment #4 from Martin Jambor --- For the record, on AMD Zen2 at least, SPEC 2006 410.bwaves also runs about 12% faster with --param vect-epilogues-nomask=0 (and otherwise with -Ofast -march=native -mtune=native).
[Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427 Bug ID: 94427 Summary: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9 Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: jamborm at gcc dot gnu.org CC: rguenth at gcc dot gnu.org Blocks: 26163 Target Milestone: --- Host: x86_64-linux Target: x86_64-linux SPECINT 2006 benchmark 456.hmmer runs 18% slower on AMD Zen2 CPUs, 15% on AMD Zen1 CPUs and 8% on Intel Cascade Lake server CPUs when built with trunk (revision 26b3e568a60) and just -Ofast (so with generic march/mtune) than when compiled wth GCC 9. Bisecting the regression leads to commit: commit 14ec49a7537004633b7fff859178cbebd288ca1d Author: Richard Biener Date: Tue Jul 2 07:35:23 2019 + re PR tree-optimization/58483 (missing optimization opportunity for const std::vector compared to std::array) 2019-07-02 Richard Biener PR tree-optimization/58483 * tree-ssa-scopedtables.c (avail_expr_hash): Use OEP_ADDRESS_OF for MEM_REF base hashing. (equal_mem_array_ref_p): Likewise for base comparison. * gcc.dg/tree-ssa/ssa-dom-cse-8.c: New testcase. From-SVN: r272922 Collected profiles are weird, almost the other way round I would expect them to be, because the *slow* version spends less time in cold section - but both spend IMHO too much time there. The following data were collected on AMD Zen2 but those from Intel are similar in this regard. What is different is that on Intel perf stat reports doubling of branch misses - and because it has older perf it does not report front/back-end stalls. 
Before the aforementioned revision:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

      163360.87 msec task-clock:u               #    0.992 CPUs utilized
              0      context-switches:u         #    0.000 K/sec
              0      cpu-migrations:u           #    0.000 K/sec
           7639      page-faults:u              #    0.047 K/sec
   525635661818      cycles:u                   #
      809847511      stalled-cycles-frontend:u  #    0.15% frontend cycles idle     (83.35%)
   299331255326      stalled-cycles-backend:u   #   56.95% backend cycles idle      (83.30%)
  1757801907547      instructions:u             #    3.34  insn per cycle
                                                #    0.17  stalled cycles per insn  (83.34%)
   133496985084      branches:u                 #  817.191 M/sec                    (83.35%)
      682351923      branch-misses:u            #    0.51% of all branches          (83.31%)

  164.659685804 seconds time elapsed
  163.32542 seconds user
  0.022183000 seconds sys

# Samples: 637K of event 'cycles:u'
# Event count (approx.): 527143782584
#
# Overhead  Samples  Shared Object            Symbol
# ...
#
    58.43%   372284  hmmer_peak.mine-std-gen  [.] P7Viterbi
    35.12%   223887  hmmer_peak.mine-std-gen  [.] P7Viterbi.cold
     2.59%    16418  hmmer_peak.mine-std-gen  [.] FChoose
     2.51%    15906  hmmer_peak.mine-std-gen  [.] sre_random

At the aforementioned revision:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

      191483.84 msec task-clock:u               #    0.994 CPUs utilized
              0      context-switches:u         #    0.000 K/sec
              0      cpu-migrations:u           #    0.000 K/sec
           7639      page-faults:u              #    0.040 K/sec
   622159384711      cycles:u                   #
      817604010      stalled-cycles-frontend:u  #    0.13% frontend cycles idle     (83.31%)
   439972264588      stalled-cycles-backend:u   #   70.72% backend cycles idle      (83.34%)
  1707838992202      instructions:u             #    2.75  insn per cycle
                                                #    0.26  stalled cycles per insn  (83.35%)
    91309384910      branches:u                 #  476.852 M/sec                    (83.32%)
      655463713      branch-misses:u            #    0.72% of all branches          (83.33%)

  192.564513355 seconds time elapsed
  191.443774000 seconds user
  0.023978000 seconds sys

# Samples: 752K of event 'cycles:u'
# Event count (approx.): 622947549968
#
# Overhead  Samples  Shared Object             Symbol
#
#
    83.68%   629645  hmmer_peak.small-std-gen
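As a quick sanity check on the two perf stat summaries, the derived figures can be recomputed from the raw counters. The following is only an illustrative sketch: the struct and helper names are invented for this note, and the counter values are copied from the two runs quoted in the report.

```c
#include <assert.h>

/* Raw counters copied from the two perf stat runs quoted in the report.
   The struct and helper names are hypothetical, used only for this note. */
struct perf_run {
    double task_clock_msec;
    double cycles;
    double instructions;
    double branches;
    double branch_misses;
};

static const struct perf_run before_rev = {
    163360.87, 525635661818.0, 1757801907547.0, 133496985084.0, 682351923.0
};

static const struct perf_run at_rev = {
    191483.84, 622159384711.0, 1707838992202.0, 91309384910.0, 655463713.0
};

/* Instructions retired per cycle. */
static double ipc(const struct perf_run *r)
{
    assert(r->cycles > 0.0);
    return r->instructions / r->cycles;
}

/* Branch misses as a percentage of all executed branches. */
static double branch_miss_pct(const struct perf_run *r)
{
    assert(r->branches > 0.0);
    return 100.0 * r->branch_misses / r->branches;
}

/* Relative increase in task-clock time of `slow` over `fast`, in percent. */
static double slowdown_pct(const struct perf_run *fast,
                           const struct perf_run *slow)
{
    assert(fast->task_clock_msec > 0.0);
    return 100.0 * (slow->task_clock_msec / fast->task_clock_msec - 1.0);
}
```

Calling `slowdown_pct(&before_rev, &at_rev)` gives a task-clock increase of roughly 17%, and `ipc` drops from about 3.34 to about 2.75, matching the figures perf printed. The slower build also executes roughly a third fewer branches while its miss rate goes up.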
[Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427 --- Comment #1 from Martin Jambor ---
OK, so it turns out the identified commit only allows us to shoot ourselves in the foot - and there is one too few branches, not too many. The hottest loop, consuming most of the time, is:

  Percent       Instructions
    0.03 │ fb0:┌─+add    -0x8(%r9,%rcx,4),%eax
    5.03 │     │  mov    %eax,-0x4(%r13,%rcx,4)
    2.48 │     │  mov    -0x8(%r8,%rcx,4),%esi
    0.02 │     │  add    -0x8(%rdx,%rcx,4),%esi
    0.06 │     │  cmp    %eax,%esi
    4.49 │     │  cmovge %esi,%eax
   17.17 │     │  mov    %ecx,%esi
    0.03 │     │  cmp    $0xc521974f,%eax
    3.50 │     │  cmovl  %ebx,%eax          <--- this used to be a branch
   21.84 │     │  mov    %eax,-0x4(%r13,%rcx,4)
    3.88 │     │  add    $0x1,%rcx
    0.00 │     │  cmp    %rdi,%rcx
    0.04 │     └──jne    fb0

where the marked conditional move was a branch one revision before, because after fwprop3 the IL looked like:

  [local count: 955630217]:
  # cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14), [fast_algorithms.c:142:53] cstore_249(15)>
  [fast_algorithms.c:142:49] MEM [(void *)_72] = cstore_281;
  [fast_algorithms.c:143:13] _78 = [fast_algorithms.c:143:13] *_72;
  [fast_algorithms.c:143:10] if (_78 < -987654321)
    goto ; [50.00%]
  else
    goto ; [50.00%]

  [local count: 477815109]:

  [local count: 955630217]:
  # cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16), [fast_algorithms.c:143:33] cstore_281(17)>
  [fast_algorithms.c:143:29] MEM [(void *)_72] = cstore_250;

The aforementioned revision turned this into more optimized code:

  [local count: 955630217]:
  # cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14), [fast_algorithms.c:142:53] _73(15)>
  [fast_algorithms.c:143:10] if (cstore_281 < -987654321)
    goto ; [50.00%]
  else
    goto ; [50.00%]

  [local count: 477815109]:

  [local count: 955630217]:
  # cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16), [fast_algorithms.c:143:33] cstore_281(17)>
  [fast_algorithms.c:143:29] MEM [(void *)_72] = cstore_250;

Which phiopt3 then changed to:

  cstore_248 = MAX_EXPR ;
  [fast_algorithms.c:143:29] MEM [(void *)_72] = cstore_248;

and the expander apparently always expands MAX_EXPR into a conditional move if it can(?). When I hacked phiopt not to do the transformation for - ehm - any GIMPLE_COND statement originating from source line 143, I recovered the original run-time of the benchmark. On both AMD and Intel.
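The source-level shape behind this can be sketched in plain C. This is a simplified illustration, not the hmmer code: the function and parameter names are made up, and only the constant -987654321 is taken from the IL dump. The first function stores a candidate and then reloads and clamps the stored value; the second expresses the same computation as a maximum, which is the form phiopt produces once the reload has been forwarded.

```c
#include <assert.h>
#include <stddef.h>

#define NEG_INFTY (-987654321)  /* constant taken from the IL dump */

/* Conditional store followed by a reload-and-clamp, written with two
   if statements (hypothetical names, not the benchmark's source).  */
static void clamp_with_branches(int *mc, const int *cand, size_t n)
{
    assert(mc != NULL && cand != NULL);
    for (size_t k = 0; k < n; k++) {
        if (cand[k] > mc[k])
            mc[k] = cand[k];
        if (mc[k] < NEG_INFTY)
            mc[k] = NEG_INFTY;
    }
}

static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* The same computation with the clamp expressed as a maximum, which is
   what a MAX_EXPR corresponds to at the source level.  */
static void clamp_with_max(int *mc, const int *cand, size_t n)
{
    assert(mc != NULL && cand != NULL);
    for (size_t k = 0; k < n; k++)
        mc[k] = max_int(max_int(mc[k], cand[k]), NEG_INFTY);
}
```

Both functions compute identical results. Whether a given compiler emits a branch or a conditional move for either form depends on the target and options, so this only shows the semantic equivalence that makes the transformation legal, not the resulting code generation.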
[Bug target/94364] 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364 --- Comment #2 from Martin Jambor --- (In reply to Richard Biener from comment #1) > Huh, looks like this is the (patched by us) memory copying done in > spec_qsort? Yes > I wonder if you can re-measure with our patching undone but then with > -fno-strict-aliasing (though I think that only was required with LTO). > The difference indeed goes away :-/ The current code we're benchmarking (when not using LTO) is slower in both cases :-/ > How large are the objects sorted in mcf? It's always pointers, 8 bytes.
[Bug tree-optimization/94375] 548.exchange2_r run time is 8-18% worse than GCC 9 at -Ofast -march=native
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94375 --- Comment #6 from Martin Jambor --- (In reply to Richard Biener from comment #2) > Do we ever hit the vectorized paths? What's the best way to find out? If I open the disassembled code in perf report and search for ymm, some of these (groups of) instructions have (very few) samples, but more often they don't.
[Bug target/94364] 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364 Martin Jambor changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |WONTFIX --- Comment #6 from Martin Jambor --- OK, I'm going to close this given that this problem is specific to our mcf patch which we decided to change and the issue cannot easily be avoided in the compiler.
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 Bug 26163 depends on bug 94364, which changed state. Bug 94364 Summary: 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364 What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |WONTFIX
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 Bug 53947 depends on bug 94364, which changed state. Bug 94364 Summary: 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364 What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |WONTFIX
[Bug ipa/92676] [10 Regression] lto1: error: comdat-local function called by construct.constprop outside its comdat since r278669
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92676 Martin Jambor changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #5 from Martin Jambor --- Fixed.
[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621 Martin Jambor changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |jamborm at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #8 from Martin Jambor --- Let me have a look
[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621 --- Comment #9 from Martin Jambor --- (In reply to Jan Hubicka from comment #3) > The testcase builds for me now, but this is Martin's code that's questionable :-) Git blame points correctly to me but before new IPA-SRA the assert used to be: gcc_assert (!node || !node->clone.combined_args_to_skip); and was added by Honza in 2012 (in 66a20fc2a7de). > (apparently > checking that we did not forget to apply param adjustments) AFAIU no, quite the opposite, it checks that we are not going to apply param adjustment twice to a call, which is in a way what we are about to do. We find ourselves looking at a call statement with parameters already adjusted and the decl in the statement being the IPA-CP created one. In the cgraph edge, however, the callee's decl is one created during save_inline_function_body. Because redirect_call_stmt_to_callee decides whether it has to do anything by comparing decls, it thinks it has to redirect and remove params and... BOOM. When I wrote that the call had already been adjusted that actually was not entirely true. The call was already created that way in expand_thunk, because it is in an expanded artificial thunk of the IPA-CP clone. The assumption was that because the decl would be the correct one from the start, no additional redirection would be taking place. That perhaps wasn't the best idea as save_inline_function_body can clearly violate that (and in future some IPA pass might want to redirect the edge too). Having said that, I am not sure where to best fix this so late in the GCC 10 development cycle.
[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621 --- Comment #13 from Martin Jambor --- (In reply to Jan Hubicka from comment #12) > > Having said that, I am not sure where to best fix this so late in the > > GCC 10 development cycle. > > So the problem is that thunk is expanded on the adjusted decl but we > still keep the adjustments and later fail to apply them? > > I guess we have two options: > 1) force thunk expansion to happen on original decls (before cloning) > so the body ends up being same as for ordinary function I was thinking about this too. I will try to look into expand_thunk whether I can leave the call statement mostly alone (apart from the thunk transform itself, of course). > 2) remove the adjustments after expansion - this should IMO work > under the assumption that optimization passes don't insert > non-trivial code into the thunk before they expand the thunk (i.e. > if you want to adjust it in ipa-sra you will want to first produce > the thunk and then do adjustement) > It seems to me that 2 should be not that hard to implement > Does that make sense? Unfortunately I don't think so. The adjustment is attached to the callee (just like in the past the skip_args bitmap was - and we're only skipping arguments in the testcase), so you cannot just remove it in one caller. Or am I missing something?
[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621 --- Comment #14 from Martin Jambor --- Actually, we should be able to simply skip applying adjustments, if e->caller->former_thunk_p(). I'm playing with a patch.
[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621 --- Comment #15 from Martin Jambor --- It turns out that no, recursive inlining will happily put an adjusted and not yet adjusted call into the same function which was formerly a thunk.
[Bug gcov-profile/94472] New: 400.perlbench is slower when compiled at -O2 with both PGO and LTO on AMD Zen CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94472 Bug ID: 94472 Summary: 400.perlbench is slower when compiled at -O2 with both PGO and LTO on AMD Zen CPUs Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: gcov-profile Assignee: unassigned at gcc dot gnu.org Reporter: jamborm at gcc dot gnu.org CC: hubicka at gcc dot gnu.org, marxin at gcc dot gnu.org Blocks: 26163 Target Milestone: --- Host: x86_64-linux Target: x86_64-linux 400.perlbench is slower when compiled at -O2 (and generic march/mtune) with both PGO and LTO when compiled with master (26b3e568a60) than when built with GCC 9, on Zen2 by 13% and on Zen1 by 7%. The performance is comparable on an Intel Cascade Lake server CPU. I attempted bisecting the problems on the Zen2 CPU but was only partially successful because a lot of the slowdown seemed to have happened gradually. The first bigger slowdown - almost 4% - came with: 562d1e9556777988ae46c5d1357af2636bc272ea is the first bad commit commit 562d1e9556777988ae46c5d1357af2636bc272ea Author: Jan Hubicka Date: Wed Oct 2 16:01:47 2019 + cif-code.def (MAX_INLINE_INSNS_SINGLE_O2_LIMIT, [...]): New. * cif-code.def (MAX_INLINE_INSNS_SINGLE_O2_LIMIT, MAX_INLINE_INSNS_AUTO_O2_LIMIT): New. ... From-SVN: r276469 About the same performance loss was then introduced by: commit 2925cad2151842daa387950e62d989090e47c91d Author: Jan Hubicka Date: Thu Oct 3 17:08:21 2019 +0200 params.def (PARAM_INLINE_HEURISTICS_HINT_PERCENT, [...]): New. * params.def (PARAM_INLINE_HEURISTICS_HINT_PERCENT, PARAM_INLINE_HEURISTICS_HINT_PERCENT_O2): New. * doc/invoke.texi (inline-heuristics-hint-percent, inline-heuristics-hint-percent-O2): Document. * tree-inline.c (inline_insns_single, inline_insns_auto): Add new hint attribute. (can_inline_edge_by_limits_p): Use it. Finally, throughout March the benchmark was quite jumpy, but it again ended up about 5% slower than at the beginning of the month.
Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 [Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621 --- Comment #16 from Martin Jambor --- The following workaround works for the testcase but would need to be generalized for a chain of former_decl_of's to be universal, I'm afraid: diff --git a/gcc/cgraph.c b/gcc/cgraph.c index 6b780f80eb3..241b996151a 100644 --- a/gcc/cgraph.c +++ b/gcc/cgraph.c @@ -1467,7 +1467,8 @@ cgraph_edge::redirect_call_stmt_to_callee (cgraph_edge *e) if (e->indirect_unknown_callee - || decl == e->callee->decl) + || decl == e->callee->decl + || decl == e->callee->former_clone_of) return e->call_stmt; if (flag_checking && decl) diff --git a/gcc/ipa-inline-transform.c b/gcc/ipa-inline-transform.c index eed992d314d..a6675768552 100644 --- a/gcc/ipa-inline-transform.c +++ b/gcc/ipa-inline-transform.c @@ -588,6 +588,7 @@ save_inline_function_body (struct cgraph_node *node) first_clone->next_sibling_clone = NULL; gcc_assert (!first_clone->prev_sibling_clone); } + first_clone->former_clone_of = node->decl; first_clone->clone_of = NULL; /* Now node in question has no clones. */
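The generalization mentioned above, following a whole chain of former clones instead of a single link, might look roughly like the toy model below. This is a sketch under loud assumptions: `toy_decl` and `toy_decl_matches` are hypothetical stand-ins and not GCC's cgraph data structures (in GCC the `former_clone_of` field lives on the cgraph node and holds a decl), and the chain is assumed to be acyclic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a declaration; NOT GCC's real representation.  */
struct toy_decl {
    const char *name;
    const struct toy_decl *former_clone_of;  /* NULL when there is none */
};

/* Return true if STMT_DECL is CALLEE_DECL itself or any declaration
   reachable from it by following former_clone_of links.  The chain is
   assumed to be acyclic.  */
static bool toy_decl_matches(const struct toy_decl *stmt_decl,
                             const struct toy_decl *callee_decl)
{
    assert(stmt_decl != NULL);
    for (const struct toy_decl *d = callee_decl; d != NULL;
         d = d->former_clone_of)
        if (d == stmt_decl)
            return true;
    return false;
}
```

A single comparison against `callee->former_clone_of`, as in the workaround, only covers a chain of length one. The loop also accepts the case where the body was saved more than once.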
[Bug tree-optimization/93435] [8/9 Regression] Hang with -O2 on innocuous looking code with GCC 8.3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93435 --- Comment #13 from Martin Jambor --- The problematic behavior of SRA is now fixed on master and both open release branches so I consider my work done here. I'm leaving the bug open in case Jeff wants to add some DSE limiter like he wrote in comment #5.
[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621 --- Comment #17 from Martin Jambor --- Created attachment 48208 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48208&action=edit WIP patch This is the current version of my patch to fix this. I think that at least for the purposes of JIT I need to find a place to deallocate the new summary - but that can only happen after all inlining is done. Then I'll add that, re-base and submit it to the mailing list.
[Bug tree-optimization/94482] [8/9/10 Regression] Inserting into vector with optimization enabled on x86 generates incorrect result
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94482 --- Comment #21 from Martin Jambor --- As Richi already found out, the path in sra_modify_expr handling type incompatible replacement does not work when the replaced expr comes from within a BIT_FIELD_REF - it does only half of what is necessary. A conservative (not yet much tested) fix would be to emit a full RMW: *** /tmp/UTN9NX_tree-sra.c Mon Apr 6 15:28:23 2020 --- gcc/tree-sra.c Mon Apr 6 15:22:40 2020 *** sra_modify_expr (tree *expr, gimple_stmt *** 3742,3768 ref = build_ref_for_model (loc, orig_expr, 0, access, gsi, false); ! if (write) { gassign *stmt; if (access->grp_partial_lhs) ! ref = force_gimple_operand_gsi (gsi, ref, true, NULL_TREE, !false, GSI_NEW_STMT); ! stmt = gimple_build_assign (repl, ref); gimple_set_location (stmt, loc); ! gsi_insert_after (gsi, stmt, GSI_NEW_STMT); } ! else { gassign *stmt; if (access->grp_partial_lhs) ! repl = force_gimple_operand_gsi (gsi, repl, true, NULL_TREE, !true, GSI_SAME_STMT); ! stmt = gimple_build_assign (ref, repl); gimple_set_location (stmt, loc); ! gsi_insert_before (gsi, stmt, GSI_SAME_STMT); } } else --- 3742,3771 ref = build_ref_for_model (loc, orig_expr, 0, access, gsi, false); ! if (!write || bfr) { gassign *stmt; + tree src = repl; if (access->grp_partial_lhs) ! src = force_gimple_operand_gsi (gsi, repl, true, NULL_TREE, !true, GSI_SAME_STMT); ! stmt = gimple_build_assign (ref, src); gimple_set_location (stmt, loc); ! gsi_insert_before (gsi, stmt, GSI_SAME_STMT); } ! if (bfr) ! ref = unshare_expr (ref); ! if (write || bfr) { gassign *stmt; if (access->grp_partial_lhs) ! ref = force_gimple_operand_gsi (gsi, ref, true, NULL_TREE, !false, GSI_NEW_STMT); ! stmt = gimple_build_assign (repl, ref); gimple_set_location (stmt, loc); ! gsi_insert_after (gsi, stmt, GSI_NEW_STMT); } } else But I wonder whether we care about type incompatibility within a B_F_R at all - isn't B_F_R also an implicit V_C_E, always looking at the binary image? 
So perhaps something as simple as the following might work? diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c index b2056b58750..d22b03814d2 100644 --- a/gcc/tree-sra.c +++ b/gcc/tree-sra.c @@ -3736,7 +3736,7 @@ sra_modify_expr (tree *expr, gimple_stmt_iterator *gsi, bool write) be accessed as a different type too, potentially creating a need for type conversion (see PR42196) and when scalarized unions are involved in assembler statements (see PR42398). */ - if (!useless_type_conversion_p (type, access->type)) + if (!bfr && !useless_type_conversion_p (type, access->type)) { tree ref; I'll test both options ...and it seems we need the RMW one to handle REALPART_EXPR and IMAGPART_EXPR.
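The "full RMW" idea can be illustrated outside the compiler with a small model. This is an assumption-laden sketch, not SRA code: `repl` plays the role of the scalar replacement, `mem` the original aggregate, and the function name is invented. A partial write through the byte image is only safe if the replacement is first flushed to memory and then reloaded afterwards, which is what the first patch does when `bfr` is set.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a partial write into an aggregate that currently
   lives in a scalar replacement REPL.  MEM is the aggregate's storage.  */
static uint64_t partial_write_rmw(uint64_t repl, unsigned char mem[8],
                                  size_t byte_off, uint16_t val)
{
    assert(byte_off + sizeof val <= 8);
    memcpy(mem, &repl, sizeof repl);            /* flush replacement to memory */
    memcpy(mem + byte_off, &val, sizeof val);   /* partial write on the image  */
    memcpy(&repl, mem, sizeof repl);            /* reload the replacement      */
    return repl;
}
```

Doing only the reload (or only the flush) would either lose the bytes that the partial write did not touch or lose the write itself, which corresponds to the "only half of what is necessary" behaviour described above.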
[Bug ipa/94434] [AArch64][SVE] ICE caused by incompatibility of SRA and svst3 builtin-function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94434 Martin Jambor changed: What|Removed |Added Ever confirmed|0 |1 Last reconfirmed||2020-04-09 Component|tree-optimization |ipa Status|UNCONFIRMED |ASSIGNED CC||jamborm at gcc dot gnu.org, ||marxin at gcc dot gnu.org
[Bug tree-optimization/94482] [8/9 Regression] Inserting into vector with optimization enabled on x86 generates incorrect result
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94482 Martin Jambor changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |jamborm at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #24 from Martin Jambor --- Fixed on trunk, will backport in a week or so.
[Bug ipa/94434] [AArch64][SVE] ICE caused by incompatibility of SRA and svst3 builtin-function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94434 Martin Jambor changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |jamborm at gcc dot gnu.org --- Comment #1 from Martin Jambor --- Created attachment 48248 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48248&action=edit Proposed fix After our discussion on the mailing list, I'm currently testing this patch
[Bug ipa/94434] [AArch64][SVE] ICE caused by incompatibility of SRA and svst3 builtin-function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94434 Martin Jambor changed: What|Removed |Added Attachment #48248|0 |1 is obsolete|| --- Comment #2 from Martin Jambor --- Created attachment 48249 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48249&action=edit Proposed fix without a stupid pasto The previous attachment had an obvious pasto in it, this is what I'm testing.
[Bug ipa/92550] [10 Regression] FAIL: gcc.dg/ipa/ipa-sra-8.c execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92550 --- Comment #3 from Martin Jambor --- Almost certainly started with new IPA-SRA (r275982 or as we now call it gcc-10-3311-gff6686d2e5f). I looked at dumps from a cross-compiler and the funny bit is, however, that new IPA-SRA simply does nothing. That is not as it should be. Because foo is not versionable, the pass does not even look at it and then cannot do anything because it has not seen a call to get_a. But of course it should still analyze outgoing calls to allow IPA-SRA of callees. But that is merely a missed optimization, not this miscompilation. It looks almost as if it was simply the expansion of a misaligned structure copy that is broken on (this?) strict-alignment target. I also believe the test case does not run successfully when compiled with earlier revisions and option -fno-ipa-sra.
[Bug target/92550] [10 Regression] FAIL: gcc.dg/ipa/ipa-sra-8.c execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92550 Martin Jambor changed: What|Removed |Added Component|ipa |target --- Comment #4 from Martin Jambor --- Not an IPA issue.
[Bug ipa/94434] [AArch64][SVE] ICE caused by incompatibility of SRA and svst3 builtin-function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94434 --- Comment #3 from Martin Jambor --- I have proposed the patch on the mailing list: https://gcc.gnu.org/pipermail/gcc-patches/2020-April/543658.html
[Bug ipa/94434] [AArch64][SVE] ICE caused by incompatibility of SRA and svst3 builtin-function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94434 Martin Jambor changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #5 from Martin Jambor --- Fixed.
[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621 --- Comment #18 from Martin Jambor --- I posted a patch to fix this for review to the mailing list: https://gcc.gnu.org/pipermail/gcc-patches/2020-April/543659.html
[Bug tree-optimization/94598] [10 Regression] ICE in verify_sra_access_forest, at tree-sra.c:2360 with -O1 or higher since r10-6321-g636e80eea24b780f
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94598 --- Comment #2 from Martin Jambor --- For arrays of size 1, get_ref_base_and_extent knows that the expression can only access the one element even if the index is a variable. It seems it does not happen if the ARRAY_REF is within a COMPONENT_REF, an expression created by new total scalarization. I'll adjust the assert for GCC 10 but will also have a look at why get_ref_base_and_extent does that.
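For illustration, the kind of expression being described, an ARRAY_REF with a variable index nested inside a COMPONENT_REF on a one-element array, can be written in C as below. The type and function names are hypothetical and this is not the testcase from the bug; it only shows the expression shape.

```c
#include <assert.h>

/* Hypothetical types: a one-element array of structures.  */
struct elem {
    int x;
    int y;
};

struct holder {
    struct elem arr[1];
};

/* h->arr[idx].y is a field access (COMPONENT_REF) wrapped around an array
   access (ARRAY_REF) with a variable index.  Because the array has a single
   element, the only in-bounds index is 0, so a valid access can only touch
   that one element.  */
static int read_y(const struct holder *h, int idx)
{
    assert(idx == 0);
    return h->arr[idx].y;
}
```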
[Bug tree-optimization/94598] [10 Regression] ICE in verify_sra_access_forest, at tree-sra.c:2360 with -O1 or higher since r10-6321-g636e80eea24b780f
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94598 --- Comment #3 from Martin Jambor --- I'm going to test the following: --- a/gcc/tree-sra.c +++ b/gcc/tree-sra.c @@ -2357,9 +2357,11 @@ verify_sra_access_forest (struct access *root) gcc_assert (base == first_base); gcc_assert (offset == access->offset); gcc_assert (access->grp_unscalarizable_region + || access->grp_total_scalarization || size == max_size); - gcc_assert (!is_gimple_reg_type (access->type) - || max_size == access->size); + gcc_assert (!access->grp_unscalarizable_region + || !is_gimple_reg_type (access->type) + || size == access->size); gcc_assert (reverse == access->reverse); if (access->first_child)
[Bug tree-optimization/94598] [10 Regression] ICE in verify_sra_access_forest, at tree-sra.c:2360 with -O1 or higher since r10-6321-g636e80eea24b780f
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94598 --- Comment #4 from Martin Jambor --- I proposed the fix on the mailing list: https://gcc.gnu.org/pipermail/gcc-patches/2020-April/543909.html (Note that the one in comment #3 has a small but important typo.)
[Bug tree-optimization/94598] [10 Regression] ICE in verify_sra_access_forest, at tree-sra.c:2360 with -O1 or higher since r10-6321-g636e80eea24b780f
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94598 Martin Jambor changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #6 from Martin Jambor --- Fixed, thanks for reporting.
[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621 Martin Jambor changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #20 from Martin Jambor --- Fixed for GCC 10, see the review email thread for caveats/future plans about this.
[Bug ipa/93385] [10 Regression] wrong code with u128 modulo at -O2 -fno-dce -fno-ipa-cp -fno-tree-dce
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93385 --- Comment #17 from Martin Jambor --- Created attachment 48302 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48302&action=edit Untested fix I'm playing with this - only very mildly tested - fix.
[Bug ipa/93385] [10 Regression] wrong code with u128 modulo at -O2 -fno-dce -fno-ipa-cp -fno-tree-dce
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93385 --- Comment #22 from Martin Jambor --- (In reply to Jakub Jelinek from comment #18) > Comment on attachment 48302 [details] > Untested fix > > + /* IPA-SRA does not analyze other types of statements. */ > + gcc_unreachable (); > Won't this ICE on any is_gimple_debug stmt? Those should be just ignored > and normal SSA_NAME handling should DTRT for those. Yeah, it most probably will, I wrote it was only very mildly tested (i.e. I only ran IPA testcases on it) - I wanted to post what I had before I had to stop working on this for a few hours. > As for PHIs, can you just gsi_remove them? > Looking at tree-ssa-dce.c, it uses remove_phi_node rather than > gsi_remove for PHIs. And for non-PHIs, it calls release_defs after > gsi_remove. You are again most probably right, I keep forgetting about this. > > Plus, I think in isra_track_scalar_value_uses for non-is_gimple_{debug,call} > we should punt if !flag_tree_dce, i.e. when user asked not to perform dead > code elimination. Though, guess that hunk should be added only after this > is tested (and perhaps the testcase or its copy should use > -fdisable-tree-dce or whatever other way to avoid doing DCE even when > flag_tree_dce is non-zero. OK, that makes sense. I'd slightly prefer the patch in comment #11 for this so that direct passes of a parameter to another function without any modification is still not considered as doing DCE - but I also do not really care too much.
[Bug ipa/93385] [10 Regression] wrong code with u128 modulo at -O2 -fno-dce -fno-ipa-cp -fno-tree-dce
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93385 --- Comment #25 from Martin Jambor --- (In reply to rguent...@suse.de from comment #21) > Btw, I'd much prefer to not first copy the stmts and then remove them. > Instead the DCE "analysis" can be done on the original IL and stmts > be "marked" to be elided during copying. That saves generating > SSA names and gimple stmts rather than needing to remove them after the > fact. It is of course easy to change the patch to do the analysis on the original and just create a hash_set of statements/SSA_NAMES to not copy. I'll do that. As far as remapping the removed values to ERROR_MARK, I'm not sure. We'd need to remap some SSA_NAMES of the same DECL differently than other names (e.g. default-definition of the removed PARM_DECL would get remapped to ERROR_MARK but not other SSA_NAMES and similarly for other SSA_NAMES derived from those default-defs) ...and ATM I do not know to what extent that is a problem. But I can try.
[Bug ipa/93385] [10 Regression] wrong code with u128 modulo at -O2 -fno-dce -fno-ipa-cp -fno-tree-dce
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93385 --- Comment #30 from Martin Jambor --- Created attachment 48320 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48320&action=edit Todays WIP patch This is today's (still very much) WIP patch. - It marks statements which should not be copied before copying and then skips them. - It does map SSA_NAMEs which should not survive to error_mark_node. - Processing of calls is however still necessary, we cannot leave error_mark_nodes in the IL (until call redirection deals with it based on callee info). But: - It ICEs on gcc.dg/torture/pr48063.c. I understand the problem: IPA-CP attempts to replace a floating-point parameter with an integer constant and fails, but this fools the new DCE thingy into thinking some analysis declared the parameter unused even though it is used. I'll have to make ipa_param_body_adjustments aware of tree_map. (The original idea was to make it part of tree_map but for some reason I gave up on that.) - There are three libgomp C++ ICEs that I know about which I have not even looked at. I have not attempted any bootstrap yet. I have not yet tested anything other than C/C++/Fortran. - The new hash maps, or at least the one for statements, might be better placed in copy_body_data, the current place is just more convenient for the moment. I do not care too much. - Information currently stored in m_dead_ssas might be obtainable from decl_map in copy_body_data. - I have not thought about debug statements yet and just ignored them for now. I do want to handle them after other things work. Any feedback welcome.
[Bug tree-optimization/94482] [8/9 Regression] Inserting into vector with optimization enabled on x86 generates incorrect result
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94482 Martin Jambor changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #29 from Martin Jambor --- So this particular bug is fixed on trunk and both open release branches. Evan, if the issue you described in comment #25 persists even with a patched compiler, I suggest you open a new bug.
[Bug ipa/94472] 400.perlbench is slower when compiled at -O2 with both PGO and LTO on AMD Zen CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94472 --- Comment #3 from Martin Jambor --- My benchmarking setup is currently gone so unfortunately no, not easily. I'll be re-measuring everything on a different computer with a slightly different CPU model soon, so after that I guess I could. But it is most likely the limits, yes.
[Bug ipa/94856] [10 Regression] ICE: Segmentation fault (in clone_of_p); or ICE: verify_cgraph_node failed (error: edge points to wrong declaration) since r10-4944-g1e83bd7003e03160
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94856 --- Comment #7 from Martin Jambor --- The "edge points to wrong decl" case is a verifier error. We have a method which (in the course of IPA-CP) loses its this pointer because it is unused and the pass then does not clone all the this-adjusting thunks and just makes the calls go straight to the new clone - and then the verifier complains that the edge does not seem to point to a clone of what it used to. This looked weird because the verifier actually has logic detecting this case but it turns out that it is confused by the inliner's body-saving mechanism, which invents a new decl for the base function. The inliner's body-saving mechanism should correctly set former_clone_of and then we can detect this case too. Then we pass this particular round of verification but the subsequent one fails because we have inlined the function into its former thunk - which subsequently does not have any callees, but the verifier still accesses them and segfaults just like in the original -fopenacc case. That is why the following (as yet untested) patch most likely fixes that case too: diff --git a/gcc/cgraph.c b/gcc/cgraph.c index 72d7cb54301..2a9813df2d9 100644 --- a/gcc/cgraph.c +++ b/gcc/cgraph.c @@ -3104,15 +3104,17 @@ clone_of_p (cgraph_node *node, cgraph_node *node2) return false; /* In case of instrumented expanded thunks, which can have multiple calls in them, we do not know how to continue and just have to be -optimistic. */ - if (node->callees->next_callee) +optimistic. The same applies if all calls have already been inlined +into the thunk. 
*/ + if (!node->callees || node->callees->next_callee) return true; node = node->callees->callee->ultimate_alias_target (); if (!node2->clone.param_adjustments || node2->clone.param_adjustments->first_param_intact_p ()) return false; - if (node2->former_clone_of == node->decl) + if (node2->former_clone_of == node->decl + || node2->former_clone_of == node->former_clone_of) return true; cgraph_node *n2 = node2; diff --git a/gcc/ipa-inline-transform.c b/gcc/ipa-inline-transform.c index be60bbccb5c..e9e21cc0296 100644 --- a/gcc/ipa-inline-transform.c +++ b/gcc/ipa-inline-transform.c @@ -607,6 +607,8 @@ save_inline_function_body (struct cgraph_node *node) } } *ipa_saved_clone_sources->get_create (first_clone) = prev_body_holder; + first_clone->former_clone_of += node->former_clone_of ? node->former_clone_of : node->decl; first_clone->clone_of = NULL; /* Now node in question has no clones. */
[Bug ipa/94856] [10 Regression] ICE: Segmentation fault (in clone_of_p); or ICE: verify_cgraph_node failed (error: edge points to wrong declaration) since r10-4944-g1e83bd7003e03160
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94856 --- Comment #8 from Martin Jambor --- I proposed the patch on the mailing list: https://gcc.gnu.org/pipermail/gcc-patches/2020-April/544943.html
[Bug ipa/94856] [10/11 Regression] ICE: Segmentation fault (in clone_of_p); or ICE: verify_cgraph_node failed (error: edge points to wrong declaration) since r10-4944-g1e83bd7003e03160
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94856 Martin Jambor changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #11 from Martin Jambor --- Fixed on both master and the newly created gcc-10 branch.
[Bug libgomp/68033] OpenMP: ICE with teams distribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68033 Martin Jambor changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #4 from Martin Jambor --- Confirmed, this got fixed at some point in the GCC 7 development cycle. So let's close the bug. Thanks for having a look.
[Bug target/95336] Bad code gen omnetpp_r aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95336 --- Comment #6 from Martin Jambor ---
(In reply to Erick Ochoa from comment #0)
[...]
> I did a bisection from
>
> commit f47f687a97260b1a1305cbf2d7ee3d74b2916a74
> Author: Richard Biener
> Date: Thu Apr 25 17:58:56 2019 +
>
> to:
>
> commit 4945b4c2c8628bdd61b348ea5bd1f9b72537a36e (HEAD)
> Author: Martin Liska
> Date: Tue May 26 09:01:41 2020 +0200
>
> and I found that the following commit may have introduced the error:
>
> commit ff6686d2e5f797d6c6a36ad14a7084bc1dc350e4
> Author: Martin Jambor
> Date: Fri Sep 20 00:25:04 2019 +0200
>

Can you please try the previous revision (6889a3acfee) but with the option -fno-ipa-sra? If it fails, it means that the previous implementation of IPA-SRA hid some other error (we have already had an aliasing bug like that). In that case it would be great if you could bisect again, this time with this option.
[Bug debug/95343] New: IPA-SRA can result in bad debug info about removed function arguments
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95343
Bug ID: 95343
Summary: IPA-SRA can result in bad debug info about removed function arguments
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: debug
Assignee: unassigned at gcc dot gnu.org
Reporter: jamborm at gcc dot gnu.org
Target Milestone: ---
Host: x86_64-linux
Target: x86_64-linux

Created attachment 48608 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48608&action=edit
Testcase

ipa_param_adjustments::modify_call does not properly account for extra arguments left over from clone materialization when recording debug info. Therefore, when the attached testcase is compiled with -O2 or higher and run in gdb with a breakpoint set at line 20, where we examine the value of parameter i, gdb incorrectly reports 4, even though it should be 2.
[Bug debug/95343] IPA-SRA can result in wrong debug info about removed function arguments
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95343 Martin Jambor changed:

What            |Removed                       |Added
Assignee        |unassigned at gcc dot gnu.org |jamborm at gcc dot gnu.org
Summary         |IPA-SRA can result in bad     |IPA-SRA can result in wrong
                |debug info about removed      |debug info about removed
                |function arguments            |function arguments
Status          |UNCONFIRMED                   |ASSIGNED
Last reconfirmed|                              |2020-05-26
Ever confirmed  |0                             |1

--- Comment #1 from Martin Jambor --- The simplest fix, which will make i be reported as "optimized out", is the following. But I am testing a patch which can make gdb actually show the correct 4. Still, the following is usable for gcc 10 if the full patch is deemed too risky:

diff --git a/gcc/ipa-param-manipulation.c b/gcc/ipa-param-manipulation.c
index 978916057f0..2a04f7b3ce5 100644
--- a/gcc/ipa-param-manipulation.c
+++ b/gcc/ipa-param-manipulation.c
@@ -787,7 +787,12 @@ ipa_param_adjustments::modify_call (gcall *stmt,
           if (!is_gimple_reg (old_parm) || kept[i])
             continue;
           tree origin = DECL_ORIGIN (old_parm);
-          tree arg = gimple_call_arg (stmt, i);
+          int index;
+          if (transitive_remapping)
+            index = index_map[i];
+          else
+            index = i;
+          tree arg = gimple_call_arg (stmt, index);
 
           if (!useless_type_conversion_p (TREE_TYPE (origin), TREE_TYPE (arg)))
             {
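The index selection in the patch above can be sketched in isolation. This is only an illustration under stated assumptions: arg_index_for_param is a hypothetical helper, index_map is modelled as a std::vector<int> rather than GCC's own vector type, and the map contents used with it are invented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring the if/else in the patch: returns the
// position in the call statement's argument list that corresponds to
// original parameter i.  When earlier clone materialization has left extra
// arguments in the call, position i no longer lines up with parameter i, so
// the map is consulted; otherwise the identity mapping is used.
int arg_index_for_param(int i, bool transitive_remapping,
                        const std::vector<int> &index_map)
{
  if (!transitive_remapping)
    return i;
  // Assumption of this sketch: the map has an entry for every parameter.
  assert(i >= 0 && static_cast<std::size_t>(i) < index_map.size());
  return index_map[static_cast<std::size_t>(i)];
}
```

Reading argument i directly, as the pre-patch code did, picks up whichever argument happens to sit at that position, which is consistent with gdb showing the value of a different argument in the report.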
[Bug debug/95343] IPA-SRA can result in wrong debug info about removed function arguments
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95343 --- Comment #2 from Martin Jambor --- (In reply to Martin Jambor from comment #1) > ...I am testing a patch which can make gdb actually show > the correct 4. I meant the correct value 2, of course.
[Bug web/95380] ipcp-unit-growth was renamed to ipa-cp-unit-growth
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95380 --- Comment #4 from Martin Jambor --- (In reply to Martin Liška from comment #3) > Fixed for master, not planning to backport that. Why not? Are any of the parameters only in GCC 11? Should I prepare a special GCC 10 patch just to address the ipcp-unit-growth -> ipa-cp-unit-growth change then?
[Bug ipa/93385] [10/11 Regression] wrong code with u128 modulo at -O2 -fno-dce -fno-ipa-cp -fno-tree-dce
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93385 --- Comment #35 from Martin Jambor --- I have proposed a patch series that deals with this issue, including proper adjustments to debug info, on the mailing list: https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546702.html
[Bug tree-optimization/95113] [10/11 Regression] Wrong code w/ -O2 -fexceptions -fnon-call-exceptions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95113 --- Comment #4 from Martin Jambor --- (In reply to Arseny Solokha from comment #3) > > Indeed, -fno-ipa-sra fixes it. So, a duplicate of PR93385? Similar, but not quite the same. I have proposed a fix on the mailing list: https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546703.html
[Bug debug/95343] IPA-SRA can result in wrong debug info about removed function arguments
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95343 --- Comment #3 from Martin Jambor --- I have proposed a patch series on the mailing list to address PR 93385 and the last patch in it also addresses this issue and allows gdb to print the correct value of the removed parameter: https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546705.html
[Bug tree-optimization/95113] [10/11 Regression] Wrong code w/ -O2 -fexceptions -fnon-call-exceptions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95113 --- Comment #7 from Martin Jambor --- Fixed. Thanks for reporting.
[Bug ipa/93385] [10/11 Regression] wrong code with u128 modulo at -O2 -fno-dce -fno-ipa-cp -fno-tree-dce
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93385 Bug 93385 depends on bug 95113, which changed state. Bug 95113 Summary: [10/11 Regression] Wrong code w/ -O2 -fexceptions -fnon-call-exceptions https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95113 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug tree-optimization/95113] [10/11 Regression] Wrong code w/ -O2 -fexceptions -fnon-call-exceptions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95113 Martin Jambor changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #8 from Martin Jambor --- ...and marking it as such.
[Bug bootstrap/95970] gcc/go/gofrontend/types.cc:1474:34: warning: ‘this’ pointer null
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95970 Martin Jambor changed: What|Removed |Added Last reconfirmed||2020-06-29 Ever confirmed|0 |1 Status|UNCONFIRMED |NEW CC||ian at airs dot com --- Comment #1 from Martin Jambor --- I hit this today too (and it indeed prevents go bootstrap), so I guess it's confirmed. Ian, can you have a look whether the warning is correct? I glanced at the code only for a little while but it looks so to me.
[Bug ipa/96040] [10/11 Regression] Compiled code causes SIGBUS at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96040 Martin Jambor changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |jamborm at gcc dot gnu.org --- Comment #4 from Martin Jambor --- I'll have a look
[Bug ipa/96040] [10/11 Regression] Compiled code causes SIGBUS at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96040 --- Comment #5 from Martin Jambor --- IPA-split puts the double access to the union in the .part function and keeps only the long int access in the "original" function. IPA-SRA thinks it can work with that but the code in "transitive" call parameter splitting apparently does not handle this case properly. The easiest fix and probably the one most suitable for backporting is to prevent splitting of such unions with the following:

--- a/gcc/ipa-sra.c
+++ b/gcc/ipa-sra.c
@@ -3271,7 +3271,9 @@ all_callee_accesses_present_p (isra_param_desc *param_desc,
         continue;
       param_access *pacc = find_param_access (param_desc, argacc->unit_offset,
                                               argacc->unit_size);
-      if (!pacc || !pacc->certain)
+      if (!pacc
+          || !pacc->certain
+          || !types_compatible_p (argacc->type, pacc->type))
         return false;
     }
   return true;

Alternatively, we can of course handle the type mismatch and insert appropriate V_C_E:

diff --git a/gcc/ipa-param-manipulation.c b/gcc/ipa-param-manipulation.c
index 2cc4bc79dc1..de9bad78712 100644
--- a/gcc/ipa-param-manipulation.c
+++ b/gcc/ipa-param-manipulation.c
@@ -641,6 +641,12 @@ ipa_param_adjustments::modify_call (gcall *stmt,
                       && trans_map[j].unit_offset == apm->unit_offset)
                     {
                       repl = trans_map[j].repl;
+                      if (!useless_type_conversion_p (apm->type, TREE_TYPE (repl)))
+                        {
+                          repl = build1 (VIEW_CONVERT_EXPR, apm->type, repl);
+                          repl = force_gimple_operand_gsi (&gsi, repl, true, NULL, true,
+                                                           GSI_SAME_STMT);
+                        }
                       break;
                     }
                   if (repl)
[Bug ipa/96040] [10/11 Regression] Compiled code causes SIGBUS at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96040 --- Comment #7 from Martin Jambor --- Yes, IPA-SRA identifies accesses by both offset and size, so the situation would not have happened if the size was different.
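The lookup by offset and size, combined with the extra type test from the first patch in this thread, can be modelled in a few lines. This is a simplified sketch and not IPA-SRA's data structures: TypeKind, Access, find_access and all_callee_accesses_present are hypothetical names, and type compatibility is reduced to plain equality of an enum.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified model of a parameter access; the real
// param_access carries more state and refers to tree types, not an enum.
enum class TypeKind { LongInt, Double };

struct Access {
  unsigned unit_offset;
  unsigned unit_size;
  TypeKind type;
  bool certain;
};

// Finds an access by offset and size only, so two accesses to the same
// bytes of a union with different types are indistinguishable here.
const Access *find_access(const std::vector<Access> &accesses,
                          unsigned offset, unsigned size)
{
  for (const Access &a : accesses)
    if (a.unit_offset == offset && a.unit_size == size)
      return &a;
  return nullptr;
}

// Mirrors the shape of the patched check: every access the callee needs
// must exist in the caller, be certain, and have a matching type.
bool all_callee_accesses_present(const std::vector<Access> &caller,
                                 const std::vector<Access> &callee)
{
  for (const Access &arg : callee)
    {
      // Assumption of this sketch: zero-sized accesses never occur.
      assert(arg.unit_size > 0);
      const Access *p = find_access(caller, arg.unit_offset, arg.unit_size);
      if (!p || !p->certain || p->type != arg.type)
        return false;
    }
  return true;
}
```

In this model, a long int access and a double access at the same offset with the same size collide in find_access, and only the type comparison tells them apart; an access of a different size is never found in the first place, which is the point made in comment #7.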
[Bug ipa/96040] [10/11 Regression] Compiled code causes SIGBUS at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96040 --- Comment #9 from Martin Jambor --- True. Richi expressed preference for avoiding the transform when there are type mismatches, so I'm currently bootstrapping that. I guess we can always revisit the decision if we ever discover it would be really beneficial to perform the split.
[Bug ipa/96040] [10/11 Regression] Compiled code causes SIGBUS at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96040 Martin Jambor changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #12 from Martin Jambor --- Fixed.
[Bug ipa/96291] [10/11 Regression] -flto fails as "internal compiler error: Segmentation fault" during IPA pass: cp incall_for_symbol_thunks_and_aliases()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96291 Martin Jambor changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |jamborm at gcc dot gnu.org --- Comment #2 from Martin Jambor --- I guess I should take a look
[Bug ipa/96235] Segmentation fault with "-Og -fno-dce -fno-tree-dce -finline-small-functions -fipa-sra"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96235 --- Comment #6 from Martin Jambor --- (In reply to Martin Liška from comment #4) > It seems to me something related to IPA SRA. > @Martin: Can you please take a look? I will, but the options -fno-dce -fno-tree-dce strongly suggest this is a duplicate of PR 93385.
[Bug ipa/96291] [10/11 Regression] -flto fails as "internal compiler error: Segmentation fault" during IPA pass: cp incall_for_symbol_thunks_and_aliases()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96291 Martin Jambor changed: What|Removed |Added Assignee|jamborm at gcc dot gnu.org |slyfox at inbox dot ru --- Comment #8 from Martin Jambor --- Sergei's patch is correct (I just suggested writing the condition differently).
[Bug ipa/96235] Segmentation fault with "-Og -fno-dce -fno-tree-dce -finline-small-functions -fipa-sra"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96235 Martin Jambor changed: What|Removed |Added Resolution|--- |DUPLICATE Status|NEW |RESOLVED --- Comment #8 from Martin Jambor --- It is clearly a duplicate of PR 93385. What was the reason to switch off DCE in the first place? Was it just meant as a stress test for the compiler? I'll try to come up with a somewhat less controversial patch for the problem. *** This bug has been marked as a duplicate of bug 93385 ***
[Bug ipa/93385] [10/11 Regression] wrong code with u128 modulo at -O2 -fno-dce -fno-ipa-cp -fno-tree-dce
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93385 Martin Jambor changed: What|Removed |Added CC||suochenyao at 163 dot com --- Comment #37 from Martin Jambor --- *** Bug 96235 has been marked as a duplicate of this bug. ***
[Bug target/84481] [8/9/10/11 Regression] 429.mcf with -O2 regresses by ~6% and ~4%, depending on tuning, on Zen compared to GCC 7.2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84481 --- Comment #12 from Martin Jambor --- I can once again confirm the slowdown on a zen1-based machine (commit 6e1e0decc9e vs gcc 7.5) but it is not present on a zen2-based one. I wonder whether the bug should be marked as WONTFIX.
[Bug target/84490] [8/9/10/11 regression] 436.cactusADM regressed by 6-8% percent with -Ofast on Zen and Haswell, compared to gcc 7.2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84490 --- Comment #15 from Martin Jambor --- The problem sometimes is still there, sometimes it isn't: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=37.100.0&plot.1=27.100.0&; I wonder whether we should keep this bug open, the benchmark seems too erratic.
[Bug target/90234] 503.bwaves_r is 6% slower on Zen1/Zen2 CPUs at -Ofast with native march/mtune than with generic ones
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90234 Martin Jambor changed:

What   |Removed                     |Added
Summary|503.bwaves_r is 6% slower   |503.bwaves_r is 6% slower
       |on Zen1 CPUs at -Ofast with |on Zen1/Zen2 CPUs at -Ofast
       |native march/mtune than     |with native march/mtune
       |with generic ones           |than with generic ones

--- Comment #2 from Martin Jambor --- I spoke too soon: I can see this in the May GCC 10.1 data on a zen1 machine and also in current master (6e1e0decc9e) on a zen2 machine, still about 6% in both cases. (GCC 9 does not have this problem on zen2 but does on zen1, so it looks a bit fragile.)
[Bug tree-optimization/96730] [10/11 Regression] ICE on x86_64-linux-gnu with `-O1` to `-O3` (in verify_sra_access_forest, at tree-sra.c:2352) since r10-6320-g5b9e89c922dc2e7e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96730 Martin Jambor changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |jamborm at gcc dot gnu.org --- Comment #2 from Martin Jambor --- Mine.
[Bug tree-optimization/96730] [10/11 Regression] ICE on x86_64-linux-gnu with `-O1` to `-O3` (in verify_sra_access_forest, at tree-sra.c:2352) since r10-6320-g5b9e89c922dc2e7e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96730 --- Comment #3 from Martin Jambor --- I have proposed a fix on the mailing list: https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552488.html
[Bug tree-optimization/96730] [10/11 Regression] ICE on x86_64-linux-gnu with `-O1` to `-O3` (in verify_sra_access_forest, at tree-sra.c:2352) since r10-6320-g5b9e89c922dc2e7e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96730 Martin Jambor changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #6 from Martin Jambor --- Fixed, thanks for reporting.