[Bug ipa/92372] [10 Regression] ICE in ipa_update_overall_fn_summary at gcc/ipa-fnsummary.c:3671 since r277780
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92372

Jan Hubicka changed:
  What       | Removed  | Added
  Resolution | ---      | FIXED
  Status     | ASSIGNED | RESOLVED

--- Comment #9 from Jan Hubicka ---
Fixed.
[Bug ipa/93351] [10 Regression] ICE in ipa_update_overall_fn_summary at gcc/ipa-fnsummary.c:4014
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93351

Jan Hubicka changed:
  What       | Removed | Added
  Status     | NEW     | RESOLVED
  Resolution | ---     | DUPLICATE

--- Comment #2 from Jan Hubicka ---
This crash is also due to the flatten attribute on an alias. The testcase takes a really long time to build since it is an inline bomb: it produces tons of template instantiations and then flattens them. Template instantiation consumes 2GB of memory, inlining 3GB. It would be interesting to check whether clang behaves better, but it does not like the preprocessed file.

*** This bug has been marked as a duplicate of bug 92372 ***
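As a hedged illustration of why flatten over many template instantiations behaves like an inline bomb, here is a minimal sketch. It is not the PR 93351 testcase, which also puts the attribute on an alias. GCC documents `flatten` as inlining every call in the function body, including the calls that such inlining introduces, where possible. The names `sum_to` and `flattened_sum` are hypothetical.

```cpp
// Hypothetical stand-in for a deep template call graph: sum_to<N> calls
// sum_to<N - 1>, so using sum_to<32> instantiates 33 function templates.
template <int N>
int sum_to() {
  return N + sum_to<N - 1>();
}

// Explicit specialization that ends the recursion.
template <>
int sum_to<0>() {
  return 0;
}

// flatten asks the inliner to inline every call in this body, including
// the calls exposed by that inlining, where possible. On top of a large
// template call graph, this is how one function turns into an inline bomb.
__attribute__((flatten)) int flattened_sum() {
  return sum_to<32>();
}
```

With optimization enabled, `flattened_sum` is expected to end up without calls to the `sum_to` instantiations. In a real codebase the same mechanism multiplies code size and compile-time memory.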
[Bug ipa/92372] [10 Regression] ICE in ipa_update_overall_fn_summary at gcc/ipa-fnsummary.c:3671 since r277780
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92372

--- Comment #10 from Jan Hubicka ---
*** Bug 93351 has been marked as a duplicate of this bug. ***
[Bug ipa/93369] [10 regression] g++.dg/lto/pr64076 fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93369

--- Comment #15 from Jan Hubicka ---
The testcase has an ODR violation that makes the comdat groups go out of sync. So I guess it is just a matter of finding a way to keep the verifier from ICEing. With release settings, however, the testcase compiles quietly, so I do not think this is a release blocker (P1).
[Bug ipa/94202] [10 Regression] ICE in do_estimate_edge_time, at ipa-inline-analysis.c:222
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94202

Jan Hubicka changed:
  What       | Removed  | Added
  Status     | ASSIGNED | RESOLVED
  Resolution | ---      | FIXED

--- Comment #3 from Jan Hubicka ---
Fixed. Probably not important enough to backport, even though the bug is present in all active branches.
[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621

Jan Hubicka changed:
  What | Removed | Added
  CC   |         | mjambor at suse dot cz

--- Comment #3 from Jan Hubicka ---
The testcase builds for me now, but this is Martin's code (apparently a check that we did not forget to apply param adjustments). Martin, was this fixed?

Honza
[Bug ipa/93347] [10 Regression] ICE: verify_cgraph_node failed (error: calls_comdat_local is set outside of a comdat group)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93347

Jan Hubicka changed:
  What       | Removed  | Added
  Resolution | ---      | FIXED
  Status     | ASSIGNED | RESOLVED

--- Comment #3 from Jan Hubicka ---
Fixed. I noticed that some of the tests are not devirtualized, so we may move that into a new PR.
[Bug c++/94243] Missed C++ front-end devirtualizations from Clang testsuite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94243

Jan Hubicka changed:
  What | Removed | Added
  CC   |         | jason at redhat dot com

--- Comment #1 from Jan Hubicka ---
Jason, I wonder if those are all valid transformations?

Honza
[Bug c++/94243] New: Missed C++ front-end devirtualizations from Clang testsuite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94243

            Bug ID: 94243
           Summary: Missed C++ front-end devirtualizations from Clang testsuite
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

While working on PR93347 I noticed that we do not devirtualize the following testcases, which Clang's testsuite checks to be devirtualized:

namespace Test2a {
  struct A {
    virtual ~A() final {}
    virtual int f();
  };

  // CHECK-LABEL: define i32 @_ZN6Test2a1fEPNS_1AE
  int f(A *a) {
    // CHECK: call i32 @_ZN6Test2a1A1fEv
    return a->f();
  }
}

Here I guess the final destructor makes the whole class final?

namespace Test4 {
  struct A {
    virtual void f();
    virtual int operator-();
  };

  struct B final : A {
    virtual void f();
    virtual int operator-();
  };

  // CHECK-LABEL: define void @_ZN5Test41fEPNS_1BE
  void f(B* d) {
    // CHECK: call void @_ZN5Test41B1fEv
    static_cast(d)->f();
    // CHECK: call i32 @_ZN5Test41BngEv
    -static_cast(*d);
  }
}

Here I am not sure: I think parameter d may point to an instance of struct A, so is it a Clang bug to devirtualize?

namespace Test5 {
  struct A {
    virtual void f();
    virtual int operator-();
  };

  struct B : A {
    virtual void f();
    virtual int operator-();
  };

  struct C final : B {
  };

  // CHECK-LABEL: define void @_ZN5Test51fEPNS_1CE
  void f(C* d) {
    // FIXME: It should be possible to devirtualize this case, but that is
    // not implemented yet.
    // CHECK: getelementptr
    // CHECK-NEXT: %[[FUNC:.*]] = load
    // CHECK-NEXT: call void %[[FUNC]]
    static_cast(d)->f();
  }

  // CHECK-LABEL: define void @_ZN5Test53fopEPNS_1CE
  void fop(C* d) {
    // FIXME: It should be possible to devirtualize this case, but that is
    // not implemented yet.
    // CHECK: getelementptr
    // CHECK-NEXT: %[[FUNC:.*]] = load
    // CHECK-NEXT: call i32 %[[FUNC]]
    -static_cast(*d);
  }
}

This seems similar to me.

namespace Test7 {
  struct foo {
    virtual void g() {}
  };

  struct bar {
    virtual int f() { return 0; }
  };

  struct zed final : public foo, public bar {
    int z;
    virtual int f() {return z;}
  };

  // CHECK-LABEL: define i32 @_ZN5Test71fEPNS_3zedE
  int f(zed *z) {
    // CHECK: alloca
    // CHECK-NEXT: store
    // CHECK-NEXT: load
    // CHECK-NEXT: call i32 @_ZN5Test73zed1fEv
    // CHECK-NEXT: ret
    return static_cast(z)->f();
  }
}

namespace Test8 {
  struct A { virtual ~A() {} };
  struct B {
    int b;
    virtual int foo() { return b; }
  };
  struct C final : A, B {
  };
  // CHECK-LABEL: define i32 @_ZN5Test84testEPNS_1CE
  int test(C *c) {
    // CHECK: %[[THIS:.*]] = phi
    // CHECK-NEXT: call i32 @_ZN5Test81B3fooEv(%"struct.Test8::B"* %[[THIS]])
    return static_cast(c)->foo();
  }
}

namespace Test9 {
  struct A {
    int a;
  };
  struct B {
    int b;
  };
  struct C : public B, public A {
  };
  struct RA {
    virtual A *f() {
      return 0;
    }
    virtual A *operator-() {
      return 0;
    }
  };
  struct RC final : public RA {
    virtual C *f() {
      C *x = new C();
      x->a = 1;
      x->b = 2;
      return x;
    }
    virtual C *operator-() {
      C *x = new C();
      x->a = 1;
      x->b = 2;
      return x;
    }
  };
  // CHECK: define {{.*}} @_ZN5Test91fEPNS_2RCE
  A *f(RC *x) {
    // FIXME: It should be possible to devirtualize this case, but that is
    // not implemented yet.
    // CHECK: load
    // CHECK: bitcast
    // CHECK: [[F_PTR_RA:%.+]] = bitcast
    // CHECK: [[VTABLE:%.+]] = load {{.+}} [[F_PTR_RA]]
    // CHECK: [[VFN:%.+]] = getelementptr inbounds {{.+}} [[VTABLE]], i{{[0-9]+}} 0
    // CHECK-NEXT: %[[FUNC:.*]] = load {{.+}} [[VFN]]
    return static_cast(x)->f();
  }

  // CHECK: define {{.*}} @_ZN5Test93fopEPNS_2RCE
  A *fop(RC *x) {
    // FIXME: It should be possible to devirtualize this case, but that is
    // not implemented yet.
    // CHECK: load
    // CHECK: bitcast
    // CHECK: [[F_PTR_RA:%.+]] = bitcast
    // CHECK: [[VTABLE:%.+]] = load {{.+}} [[F_PTR_RA]]
    // CHECK: [[VFN:%.+]] = getelementptr inbounds {{.+}} [[VTABLE]], i{{[0-9]+}} 1
    // CHECK-NEXT: %[[FUNC:.*]] = load {{.+}} [[VFN]]
    // CHECK-NEXT: = call {{.*}} %[[FUNC]]
    return -static_cast(*x);
  }
}

namespace Test10 {
  struct A {
    virtual int f();
  };

  struct B : A {
    int f() final;
  };

  // CHECK-LABEL: define i32 @_ZN6Test101fEPNS_1BE
  int f(B *b) {
    // CHECK: call i32 @_ZN6Test101B1fEv
    return static_cast(b)->f();
  }
}
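The Test4 and Test10 cases depend on a language rule that a small sketch can make concrete. The sketch illustrates C++ semantics, not GCC's or Clang's devirtualization code, and the classes and return values are hypothetical. In a program without undefined behavior, a `B*` for a `final` class `B` can only point to an object whose dynamic type is exactly `B`. Making such a pointer refer to a plain `A` object through a downcasting `static_cast` is already undefined. A call through `static_cast<A *>(d)` must therefore reach `B::f`. By the same reasoning, a `final` member function is the final overrider for every object reachable through a pointer to its class.

```cpp
// Illustrative classes shaped like Test4 and Test10; names and return
// values are made up for this sketch.
struct A {
  virtual ~A() = default;
  virtual int f() { return 1; }
};

// B is final, so no class can derive from it: a B* to a live object
// points to an object whose dynamic type is exactly B.
struct B final : A {
  int f() override { return 2; }
};

// The call goes through A*, but the final overrider must be B::f, so a
// compiler is allowed to bind this call statically.
int call_via_base(B *d) {
  return static_cast<A *>(d)->f();
}

// Test10-style case: the class is not final, but the member function is.
struct Base {
  virtual ~Base() = default;
  virtual int g() { return 10; }
};

struct Derived : Base {
  int g() final { return 20; }  // no further override is possible
};

// Every object reachable through a Derived* has Derived::g as its final
// overrider, so this call can be devirtualized as well.
int call_final_member(Derived *p) {
  return static_cast<Base *>(p)->g();
}
```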
[Bug lto/91028] [10 Regression] g++.dg/lto/alias-2 FAILs with -fno-use-linker-plugin
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91028

Jan Hubicka changed:
  What     | Removed                    | Added
  Assignee | hubicka at gcc dot gnu.org | unassigned at gcc dot gnu.org
  Status   | ASSIGNED                   | WAITING

--- Comment #3 from Jan Hubicka ---
I believe this was fixed a while ago by adding the loop. It no longer fails with -fno-use-linker-plugin. Is it OK on Solaris?
[Bug ipa/62051] [8/9/10 Regression] Undefined reference to vtable with -O2 and -fdevirtualize-speculatively
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62051

Jan Hubicka changed:
  What             | Removed | Added
  Target Milestone | 8.5     | 11.0

--- Comment #23 from Jan Hubicka ---
This is a bit of a grey area regarding what we can and cannot refer to in the presence of visibilities, and I hope codebases have now adapted to GCC's behaviour. I think we can delay this until after GCC 10, so I am re-targeting it.
[Bug tree-optimization/91322] [10 regression] alias-4 test failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91322

--- Comment #8 from Jan Hubicka ---
Do we have a compile farm machine where this can be reproduced?
[Bug ipa/93369] [10 regression] g++.dg/lto/pr64076 fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93369

--- Comment #17 from Jan Hubicka ---
Note that to fully fix the problem we need to resolve the way aliases work. In this case the ODR violation makes one COMDAT section contain only the ctor, while the other contains the ctor and its thunk. The first COMDAT wins, which makes the thunk call an alias of a symbol that is prevailed by a different COMDAT. This still works without LTO; to imitate correctly what happens in the linker, we need the ability to load both constructors:
https://gcc.gnu.org/pipermail/gcc-patches/2020-March/542733.html
For invalid code like this that does not matter much, but the patch also has a valid testcase. I can, however, also patch around this and silence the verifier ICE, but that would be just a symptomatic workaround.
[Bug tree-optimization/91322] [10 regression] alias-4 test failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91322

--- Comment #11 from Jan Hubicka ---
The problem is that on ARM sizeof (short) == sizeof (int) and LTO will glob all short and int pointers together. So this is a missed optimization only. We do this globbing more or less by design. For GCC 11 I plan to refine type merging a bit again, but until then we could either xfail this testcase or change int to long, which is 4 bytes. Not a release blocker, though.

I would welcome it if someone could test the testcase adjustment (I was doing LTO by hand):

diff --git a/gcc/testsuite/g++.dg/lto/alias-4_0.C b/gcc/testsuite/g++.dg/lto/alias-4_0.C
index 410c3140baf..0ab12adef5b 100644
--- a/gcc/testsuite/g++.dg/lto/alias-4_0.C
+++ b/gcc/testsuite/g++.dg/lto/alias-4_0.C
@@ -5,7 +5,7 @@ short *ptr_init, **ptr=&ptr_init;
 __attribute__ ((used))
 struct a {
-  int *aptr;
+  long *aptr;
 } a, *aptr=&a;
 void
[Bug ipa/93369] [10 regression] g++.dg/lto/pr64076 fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93369

--- Comment #19 from Jan Hubicka ---
The reason we get a link failure is that we treat mismatched comdats differently from the linker. While the linker chooses the comdat that wins and eliminates the other one, we keep the other symbol and end up compiling it. This leads to interesting issues with "half comdats", which I am aiming to solve with the patch for proper handling of aliases. I think updating the testcase with -shared is the way to go for this P1, and we can discuss the alias issue later (probably for 10.2, since it is a bit involved and very old).

Honza
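The linker side of this can be sketched as a toy model. Everything here is hypothetical, including the group and symbol names, and it is not GCC or linker source. Real ELF linkers implement the same rule through section groups: the first COMDAT group seen with a given signature wins, and later copies are dropped as a whole. In the PR 93369 scenario, one object's group holds only the constructor while the other holds the constructor plus its thunk. If the constructor-only copy wins, the thunk disappears at link time, whereas, per the comment above, GCC keeps that symbol and compiles it.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model: a COMDAT group is identified by its signature and defines
// a list of symbols.
struct ComdatGroup {
  std::string signature;
  std::vector<std::string> symbols;
};

using ObjectFile = std::vector<ComdatGroup>;

// Linker-style resolution: the first group seen with a given signature is
// kept, and every later group with the same signature is discarded whole,
// including symbols that only the discarded copy defines.
std::set<std::string> resolve_comdats(const std::vector<ObjectFile> &objects) {
  std::map<std::string, const ComdatGroup *> kept;
  for (const ObjectFile &object : objects)
    for (const ComdatGroup &group : object)
      kept.emplace(group.signature, &group);  // no-op if already present

  std::set<std::string> defined;
  for (const auto &entry : kept)
    for (const std::string &symbol : entry.second->symbols)
      defined.insert(symbol);
  return defined;
}
```

With `tu1` holding `{"ctor"}` and `tu2` holding `{"ctor", "ctor_thunk"}` under the same signature, `resolve_comdats({tu1, tu2})` defines `ctor` but not `ctor_thunk`.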
[Bug tree-optimization/91322] [10 regression] g++.dg/lto/alias-4_0.C test failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91322

Jan Hubicka changed:
  What       | Removed | Added
  Status     | NEW     | RESOLVED
  Resolution | ---     | FIXED

--- Comment #17 from Jan Hubicka ---
So it turned out that ODR-based TBAA was disabled for this struct because, on ARM, the builtin va_list type has the same structure. I fixed the failure by adjusting the structure; in the next stage 1 we can make ODR-based TBAA not give up in this case.
[Bug middle-end/94539] [10 Regression] gcc.dg/alias-14.c fails on gcc 10, succeeds on gcc 9, when turned into an execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94539

--- Comment #2 from Jan Hubicka ---
Hmm, the testcase is mine, so I will take a look (and make it dg-do run :)

Honza
[Bug gcov-profile/93401] [9 regression] It is no longer possible to use -fprofile-generate= on setups with different instrumentation and feedback directories
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93401

Jan Hubicka changed:
  What       | Removed  | Added
  Status     | ASSIGNED | RESOLVED
  Summary    | [9/10 regression] It is no longer possible to use -fprofile-generate= on setups with different instrumentation and feedback directories | [9 regression] It is no longer possible to use -fprofile-generate= on setups with different instrumentation and feedback directories
  Resolution | ---      | FIXED

--- Comment #14 from Jan Hubicka ---
Resolved on 10 so far. It may make sense to backport this to 9 and possibly earlier branches.
[Bug c++/94955] New: ICE in to_wide
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94955

            Bug ID: 94955
           Summary: ICE in to_wide
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Created attachment 48454
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48454&action=edit
proposed patch

This was reported to me by Mark Williams (who also wrote the testcase and the proposed patch).

% g++ -std=gnu++17 bug.ii -S -o bug.s
bug.ii: In function ‘void d()’:
bug.ii:6:32: internal compiler error: in sign_mask, at wide-int.h:855
    6 | void d() { short e = e >> b::c(); }
      |                                ^
0xa56be2 generic_wide_int >::sign_mask() const
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/wide-int.h:855
0xa56be2 bool wi::neg_p > >(generic_wide_int > const&, signop)
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/wide-int.h:1836
0xa56be2 tree_int_cst_sgn(tree_node const*)
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/tree.c:7386
0x1065e85 cp_build_binary_op(op_location_t const&, tree_code, tree_node*, tree_node*, int)
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/typeck.c:5613
0xfa9629 build_new_op_1
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/call.c:6501
0xfa931d build_new_op(op_location_t const&, tree_code, int, tree_node*, tree_node*, tree_node*, tree_node**, int)
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/call.c:6547
0x106267f build_x_binary_op(op_location_t const&, tree_code, tree_node*, tree_code, tree_node*, tree_code, tree_node**, int)
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/typeck.c:4248
0x10162f0 cp_parser_binary_expression
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:9684
0x10157a4 cp_parser_assignment_expression
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:9824
0x1015380 cp_parser_constant_expression
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:10118
0x1015380 cp_parser_initializer_clause
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:23148
0x1015380 cp_parser_initializer
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:23086
0x1008ab0 cp_parser_init_declarator
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:20780
0x1006144 cp_parser_simple_declaration
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:13689
0x101df42 cp_parser_declaration_statement
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:13121
0x101a67c cp_parser_statement
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:11434
0x101a38a cp_parser_statement_seq_opt
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:11800
0x101a38a cp_parser_compound_statement
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:11750
0x101a0f9 cp_parser_function_body
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:22992
0x101a0f9 cp_parser_ctor_initializer_opt_and_function_body
    /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:23043

The problem was that it had previously used fold_for_warn to find an INTEGER_CST, and assumed that cp_fold_rvalue would too. But fold_for_warn handles some edge cases that cp_fold_rvalue does not, and in this case we end up with a NOP_EXPR instead of the INTEGER_CST.
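The failure mode described above is code that expects an INTEGER_CST but receives a NOP_EXPR wrapping one. The backtrace shows `tree_int_cst_sgn` being reached from `cp_build_binary_op` for `e >> b::c()`. A toy expression type can sketch this. Everything below is hypothetical and far simpler than GCC's trees: `Code`, `Node`, the two folding functions and `constant_sign`. It is not the proposed patch. It only illustrates checking the node kind before treating a folded operand as a constant.

```cpp
#include <memory>
#include <optional>

// Toy expression nodes: either an integer constant or a conversion
// (loosely like NOP_EXPR) wrapping another node.
enum class Code { IntegerCst, NopExpr };

struct Node {
  Code code;
  long value;                     // used only when code == IntegerCst
  std::shared_ptr<Node> operand;  // used only when code == NopExpr
};

// A folder that looks through conversions, in the spirit of fold_for_warn.
const Node *fold_through_conversions(const Node *n) {
  while (n->code == Code::NopExpr)
    n = n->operand.get();
  return n;
}

// A folder that leaves the conversion in place, as cp_fold_rvalue did in
// the case described above.
const Node *fold_keeping_conversions(const Node *n) { return n; }

// Sign of a (possibly) constant shift count. Instead of assuming the folder
// produced a constant, which is the assumption behind the ICE, check the
// node kind and return std::nullopt when it is not a constant.
std::optional<int> constant_sign(const Node *folded) {
  if (folded->code != Code::IntegerCst)
    return std::nullopt;
  return (folded->value > 0) - (folded->value < 0);
}
```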
[Bug c++/94955] [10 regression] ICE in to_wide
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94955

--- Comment #2 from Jan Hubicka ---
Created attachment 48455
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48455&action=edit
testcase
[Bug c++/94955] [10 regression] ICE in to_wide
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94955

Jan Hubicka changed:
  What   | Removed | Added
  Status | WAITING | NEW
[Bug lto/48200] Implement function attribute for symbol versioning (.symver)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48200

--- Comment #44 from Jan Hubicka ---
Thanks, I am happy we now have a real-world use of the symver attribute. I have a WIP patch for better control over symbol visibility, but I ran into problems with gas limitations, which were fixed by HJ about two weeks ago. I will try to update the patch and aim to backport it to GCC 10.2.
[Bug tree-optimization/95539] New: Vectorizer ICE in dr_misalignment, at tree-vectorizer.h:1433
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95539

            Bug ID: 95539
           Summary: Vectorizer ICE in dr_misalignment, at tree-vectorizer.h:1433
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Created attachment 48675
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48675&action=edit
testcase

Building the testcase with -O3 leads to:

/aux/hubicka/firefox-2019-3/security/nss/lib/freebl/gcm-x86.c:35:1: internal compiler error: in dr_misalignment, at tree-vectorizer.h:1433
   35 | gcm_HashMult_hw(gcmHashContext *ghash, const unsigned char *buf,
      | ^~~
0xccc57e dr_misalignment(dr_vec_info*) [clone .isra.0]
    ../../gcc/tree-vectorizer.h:1433
0xb5ea92 aligned_access_p
    ../../gcc/tree-vectorizer.h:1451
0xb5ea92 vect_supportable_dr_alignment(vec_info*, dr_vec_info*, bool)
    ../../gcc/tree-vect-data-refs.c:6512
0x933803 vect_get_load_cost(vec_info*, _stmt_vec_info*, int, bool, unsigned int*, unsigned int*, vec*, vec*, bool)
    ../../gcc/tree-vect-stmts.c:1211
0x950966 vect_model_load_cost
    ../../gcc/tree-vect-stmts.c:1185
0x950966 vectorizable_load
    ../../gcc/tree-vect-stmts.c:8877
0x964260 vect_analyze_stmt(vec_info*, _stmt_vec_info*, bool*, _slp_tree*, _slp_instance*, vec*)
    ../../gcc/tree-vect-stmts.c:11126
0x972af1 vect_slp_analyze_node_operations_1
    ../../gcc/tree-vect-slp.c:2697
0x972af1 vect_slp_analyze_node_operations
    ../../gcc/tree-vect-slp.c:2858
0x9728fe vect_slp_analyze_node_operations
    ../../gcc/tree-vect-slp.c:2816
0x9728fe vect_slp_analyze_node_operations
    ../../gcc/tree-vect-slp.c:2816
0x9728fe vect_slp_analyze_node_operations
    ../../gcc/tree-vect-slp.c:2816
0x972f76 vect_slp_analyze_operations(vec_info*)
    ../../gcc/tree-vect-slp.c:2937
0x9812e4 vect_slp_analyze_bb_1
    ../../gcc/tree-vect-slp.c:3264
0x9812e4 vect_slp_bb_region
    ../../gcc/tree-vect-slp.c:3325
0x9812e4 vect_slp_bb(basic_block_def*)
    ../../gcc/tree-vect-slp.c:3460
0x981c32 execute
    ../../gcc/tree-vectorizer.c:1320
[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #4 from Jan Hubicka ---
There were changes to the -O2 inliner. I have
 - enabled auto-inlining,
 - reduced early inlining a bit,
 - reduced the limits for inlining functions declared inline.
The latter two were needed to keep code size under control and did well on overall -O2 SPEC and Firefox performance (without FDO; with FDO we indeed had some performance loss and code size gains, which I plan to revisit). This should not be visible on the Linux kernel, though, since it forces inlining with always_inline.

The linked patch to enable -O3 by default does not make much sense to me. I will see if I can reproduce the Phoronix benchmarks; indeed those workloads are not typical -O2 workloads and may be affected by the inline limits.

Honza
[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #5 from Jan Hubicka ---
OK, I started by checking Himeno, where Phoronix reports 4377->2681. On my notebook (Intel(R) Core(TM) i7-6600U CPU) there may be a regression of around 1-5% that is not inliner-related.

GCC 10
Loop executed for 7445 times
Gosa : 2.924613e-08
MFLOPS measured : 2346.645663
cpu : 50.172505
Score based on Pentium III 600MHz using Fortran 77: 28.617630

GCC 9
Loop executed for 8253 times
Gosa : 9.062229e-09
MFLOPS measured : 2454.019320
cpu : 53.184180
Score based on Pentium III 600MHz using Fortran 77: 29.927065

The internal loops and inlining look almost identical.
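To put the two sets of Himeno numbers side by side, they can be expressed as relative changes. With the MFLOPS figures quoted above, GCC 10 is about 4.4% slower than GCC 9 locally, inside the 1-5% range mentioned. Phoronix's 4377->2681 corresponds to a drop of roughly 39%. A minimal helper for this arithmetic (the name `percent_change` is hypothetical):

```cpp
#include <cassert>

// Relative change from `baseline` to `updated`, in percent.
// Negative values mean the updated figure is lower than the baseline.
double percent_change(double baseline, double updated) {
  assert(baseline != 0.0);
  return (updated - baseline) / baseline * 100.0;
}
```

For example, `percent_change(2454.019320, 2346.645663)` is about -4.38, and `percent_change(4377.0, 2681.0)` is about -38.75.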
[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #6 from Jan Hubicka ---
Coremark.

GCC 9 run1:
CoreMark Size: 666
Total ticks : 12310
Total time (secs): 12.31
Iterations/Sec : 24370.430544
Iterations : 30
Compiler version : GCC9.3.1 20200406 [revision 6db837a5288ee3ca5ec504fbd5a765817e556ac2]
Compiler flags : -O2 -DPERFORMANCE_RUN=1 -lrt

GCC 9 run2:
CoreMark Size: 666
Total ticks : 12471
Total time (secs): 12.471000
Iterations/Sec : 24055.809478
Iterations : 30
Compiler version : GCC9.3.1 20200406 [revision 6db837a5288ee3ca5ec504fbd5a765817e556ac2]
Compiler flags : -O2 -DPERFORMANCE_RUN=1 -lrt

GCC 10 run1:
CoreMark Size: 666
Total ticks : 15269
Total time (secs): 15.269000
Iterations/Sec : 26196.869474
Iterations : 40
Compiler version : GCC10.1.1 20200507 [revision dd38686d9c810cecbaa80bb82ed91caaa58ad635]
Compiler flags : -O2 -DPERFORMANCE_RUN=1 -lrt

GCC 10 run2:
CoreMark Size: 666
Total ticks : 11770
Total time (secs): 11.77
Iterations/Sec : 25488.530161
Iterations : 30
Compiler version : GCC10.1.1 20200507 [revision dd38686d9c810cecbaa80bb82ed91caaa58ad635]
Compiler flags : -O2 -DPERFORMANCE_RUN=1 -lrt
[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #7 from Jan Hubicka ---
X265

GCC 9:
y4m [info]: 1920x1080 fps 30/1 i420p8 frames 0 - 599 of 600
raw [info]: output file: /dev/null
x265 [info]: HEVC encoder version 3.1.2+1-76650bab70f9
x265 [info]: build info [Linux][GCC 9.3.1][64 bit][noasm] 8bit
x265 [info]: using cpu capabilities: none!
x265 [info]: Main profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 4 threads
x265 [info]: Slices : 1
x265 [info]: frame threads / pool features : 2 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge : hex / 57 / 2 / 3
x265 [info]: Keyframe min / max / scenecut / bias: 25 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt: 20 / 4 / 2
x265 [info]: b-pyramid / weightp / weightb : 1 / 1 / 0
x265 [info]: References / ref-limit cu / depth : 3 / off / on
x265 [info]: AQ: mode / str / qg-size / cu-tree : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress: CRF-28.0 / 0.60
x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip signhide tmvp b-intra
x265 [info]: tools: strong-intra-smoothing lslices=6 deblock sao
x265 [info]: frame I: 3, Avg QP:27.57 kb/s: 14018.64
x265 [info]: frame P:146, Avg QP:28.84 kb/s: 4313.98
x265 [info]: frame B:451, Avg QP:35.29 kb/s: 204.06
x265 [info]: Weighted P-Frames: Y:0.0% UV:0.0%
x265 [info]: consecutive B-frames: 0.7% 0.0% 0.0% 94.6% 4.7%
encoded 600 frames in 279.98s (2.14 fps), 1273.22 kb/s, Avg QP:33.68
1056.04user 1.31system 4:40.01elapsed 377%CPU (0avgtext+0avgdata 432688maxresident)k
0inputs+0outputs (0major+102385minor)pagefaults 0swaps

GCC 10:
y4m [info]: 1920x1080 fps 30/1 i420p8 frames 0 - 599 of 600
raw [info]: output file: /dev/null
x265 [info]: HEVC encoder version 3.1.2+1-76650bab70f9
x265 [info]: build info [Linux][GCC 10.1.1][64 bit][noasm] 8bit
x265 [info]: using cpu capabilities: none!
x265 [info]: Main profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 4 threads
x265 [info]: Slices : 1
x265 [info]: frame threads / pool features : 2 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge : hex / 57 / 2 / 3
x265 [info]: Keyframe min / max / scenecut / bias: 25 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt: 20 / 4 / 2
x265 [info]: b-pyramid / weightp / weightb : 1 / 1 / 0
x265 [info]: References / ref-limit cu / depth : 3 / off / on
x265 [info]: AQ: mode / str / qg-size / cu-tree : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress: CRF-28.0 / 0.60
x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip signhide tmvp b-intra
x265 [info]: tools: strong-intra-smoothing lslices=6 deblock sao
x265 [info]: frame I: 3, Avg QP:27.57 kb/s: 14018.64
x265 [info]: frame P:146, Avg QP:28.84 kb/s: 4313.98
x265 [info]: frame B:451, Avg QP:35.29 kb/s: 204.06
x265 [info]: Weighted P-Frames: Y:0.0% UV:0.0%
x265 [info]: consecutive B-frames: 0.7% 0.0% 0.0% 94.6% 4.7%
encoded 600 frames in 292.63s (2.05 fps), 1273.22 kb/s, Avg QP:33.68
1079.80user 1.76system 4:52.65elapsed 369%CPU (0avgtext+0avgdata 427464maxresident)k
0inputs+0outputs (0major+73644minor)pagefaults 0swaps

So a 5% difference instead of 50%. This is a codebase that I would build with -O3. Looking at the perf reports, there is a difference in inlining.

GCC 9:
8.74% x265 libx265.so.176 [.] (anonymous namespace)::satd_8x4
5.67% x265 libx265.so.176 [.] (anonymous namespace)::filterVertical_sp_c<8>
4.44% x265 libx265.so.176 [.] (anonymous namespace)::pixelavg_pp<8, 8>
4.11% x265 libx265.so.176 [.] (anonymous namespace)::psyCost_pp<3>
3.81% x265 libx265.so.176 [.] (anonymous namespace)::interp_horiz_ps_c<8, 64, 64>
3.33% x265 libx265.so.176 [.] (anonymous namespace)::sad<8, 8>
3.29% x265 libx265.so.176 [.] partialButterfly32

GCC 10:
9.17% x265 libx265.so.176 [.] (anonymous namespace)::_sa8d_8x8
8.70% x265 libx265.so.176 [.] (anonymous namespace)::satd_8x4
5.80% x265 libx265.so.176 [.] (anonymous namespace)::pixelavg_pp<8, 8>
5.55% x265 libx265.so.176 [.] (anonymous namespace)::filterVertical_sp_c<8>
3.90% x265 libx265.so.176 [.] (anonymous namespace)::sad<8, 8>
3.71% x265 libx265.so.176 [.] (anonymous namespace)::interp_horiz_ps_c<8, 64, 64>
3.48% x265 libx265.so.176 [.] (anonymous namespace)::sad_x4<8, 8>

I build with cmake ../source/ -DCMAKE_CXX_FLAGS=-O2 -DCMAKE_CXX_
[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #8 from Jan Hubicka ---
This is the build without the release flags override, as Phoronix seems to do it:

GCC 9:
y4m [info]: 1920x1080 fps 30/1 i420p8 frames 0 - 599 of 600
raw [info]: output file: /dev/null
x265 [info]: HEVC encoder version 3.1.2+1-76650bab70f9
x265 [info]: build info [Linux][GCC 9.3.1][64 bit][noasm] 8bit
x265 [info]: using cpu capabilities: none!
x265 [info]: Main profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 4 threads
x265 [info]: Slices : 1
x265 [info]: frame threads / pool features : 2 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge : hex / 57 / 2 / 3
x265 [info]: Keyframe min / max / scenecut / bias: 25 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt: 20 / 4 / 2
x265 [info]: b-pyramid / weightp / weightb : 1 / 1 / 0
x265 [info]: References / ref-limit cu / depth : 3 / off / on
x265 [info]: AQ: mode / str / qg-size / cu-tree : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress: CRF-28.0 / 0.60
x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip signhide tmvp b-intra
x265 [info]: tools: strong-intra-smoothing lslices=6 deblock sao
x265 [info]: frame I: 3, Avg QP:27.57 kb/s: 14018.64
x265 [info]: frame P:146, Avg QP:28.84 kb/s: 4313.98
x265 [info]: frame B:451, Avg QP:35.29 kb/s: 204.06
x265 [info]: Weighted P-Frames: Y:0.0% UV:0.0%
x265 [info]: consecutive B-frames: 0.7% 0.0% 0.0% 94.6% 4.7%
encoded 600 frames in 171.30s (3.50 fps), 1273.22 kb/s, Avg QP:33.68
599.58user 1.62system 2:51.33elapsed 350%CPU (0avgtext+0avgdata 416976maxresident)k
225384inputs+0outputs (0major+95380minor)pagefaults 0swaps

GCC 10:
y4m [info]: 1920x1080 fps 30/1 i420p8 frames 0 - 599 of 600
raw [info]: output file: /dev/null
x265 [info]: HEVC encoder version 3.1.2+1-76650bab70f9
x265 [info]: build info [Linux][GCC 10.1.1][64 bit][noasm] 8bit
x265 [info]: using cpu capabilities: none!
x265 [info]: Main profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 4 threads
x265 [info]: Slices : 1
x265 [info]: frame threads / pool features : 2 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge : hex / 57 / 2 / 3
x265 [info]: Keyframe min / max / scenecut / bias: 25 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt: 20 / 4 / 2
x265 [info]: b-pyramid / weightp / weightb : 1 / 1 / 0
x265 [info]: References / ref-limit cu / depth : 3 / off / on
x265 [info]: AQ: mode / str / qg-size / cu-tree : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress: CRF-28.0 / 0.60
x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip signhide tmvp b-intra
x265 [info]: tools: strong-intra-smoothing lslices=6 deblock sao
x265 [info]: frame I: 3, Avg QP:27.57 kb/s: 14018.64
x265 [info]: frame P:146, Avg QP:28.84 kb/s: 4313.98
x265 [info]: frame B:451, Avg QP:35.29 kb/s: 204.06
x265 [info]: Weighted P-Frames: Y:0.0% UV:0.0%
x265 [info]: consecutive B-frames: 0.7% 0.0% 0.0% 94.6% 4.7%
encoded 600 frames in 168.97s (3.55 fps), 1273.22 kb/s, Avg QP:33.68
592.69user 1.89system 2:49.00elapsed 351%CPU (0avgtext+0avgdata 416184maxresident)k
476408inputs+0outputs (1major+95191minor)pagefaults 0swaps

So a small improvement.
[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #9 from Jan Hubicka ---
SciMark

GCC 9:
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov)     **
**                                                              **
Using 2.00 seconds min time per kenel.
Composite Score: 1062.28
FFT Mflops: 189.17(N=1048576)
SOR Mflops: 947.53(1000 x 1000)
MonteCarlo: Mflops: 710.10
Sparse matmult Mflops: 1402.08(N=10, nz=100)
LU Mflops: 2062.49(M=1000, N=1000)

GCC 10:
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov)     **
**                                                              **
Using 2.00 seconds min time per kenel.
Composite Score: 1176.22
FFT Mflops: 201.17(N=1048576)
SOR Mflops: 961.33(1000 x 1000)
MonteCarlo: Mflops: 708.62
Sparse matmult Mflops: 1639.66(N=10, nz=100)
LU Mflops: 2370.30(M=1000, N=1000)

So again, around a 10% improvement for GCC 10.
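SciMark2 reports its composite score as the mean of the five kernel Mflops values. The figures above are consistent with that: averaging the kernels reproduces both composites to within rounding. A small sketch of that calculation follows. The type and function names are hypothetical, and the kernel order is the order printed above.

```cpp
#include <array>
#include <numeric>

// SciMark2 kernels in the order printed above:
// FFT, SOR, MonteCarlo, Sparse matmult, LU (all in Mflops).
using KernelScores = std::array<double, 5>;

// Composite score as the arithmetic mean of the kernel scores.
double composite_score(const KernelScores &kernels) {
  return std::accumulate(kernels.begin(), kernels.end(), 0.0) / kernels.size();
}
```

With the GCC 9 and GCC 10 figures this gives about 1062.27 and 1176.22. That is a gain of roughly 10.7%, in line with the "around 10%" in the comment.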
[Bug ipa/96482] [10/11 Regression] Combination of -finline-small-functions and ipa-cp optimisations causes incorrect values being passed to a function since r279523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96482

--- Comment #4 from Jan Hubicka ---
That patch makes CCP actually use the bit information that ipa-cp determines. Before, if I remember correctly, we used it only to detect pointer alignments. So it looks like a propagation bug uncovered by the change. A smaller testcase or reproduction steps would indeed be welcome.
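A toy model of a known-bits value helps show why a latent ipa-cp propagation bug could surface once CCP consumes the bits. It uses hypothetical names and is loosely modelled on the value/mask pairs used for bit-level constant propagation, not on GCC's data structures. It contrasts the old narrow use, deriving alignment from trailing known-zero bits, with using the same information to fold expressions. Once the bits feed folding, a wrongly propagated bit directly changes computed values. That would be consistent with the incorrect values reported in this PR.

```cpp
#include <cstdint>

// Toy known-bits value: a bit set in `mask` is unknown; a bit clear in
// `mask` is known and equal to the corresponding bit of `value`.
struct KnownBits {
  std::uint64_t value;
  std::uint64_t mask;
};

// Narrow use: count trailing bits known to be zero. For a pointer, n such
// bits mean the pointer is aligned to 2^n bytes.
unsigned known_trailing_zeros(KnownBits b) {
  std::uint64_t known_zero = ~b.mask & ~b.value;
  unsigned n = 0;
  while (n < 64 && ((known_zero >> n) & 1u))
    ++n;
  return n;
}

// Broader use: if every bit selected by `c` is known, `x & c` folds to a
// constant. Returns true and stores the folded value in `out` if so.
bool fold_bit_and(KnownBits x, std::uint64_t c, std::uint64_t &out) {
  if (x.mask & c)
    return false;  // at least one selected bit is unknown
  out = x.value & c;
  return true;
}
```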
[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

Jan Hubicka changed:
  What       | Removed     | Added
  Status     | UNCONFIRMED | RESOLVED
  Resolution | ---         | WORKSFORME

--- Comment #16 from Jan Hubicka ---
It seems that the benchmarks were flawed. We can reopen if Phoronix succeeds in reproducing them.
[Bug ipa/92074] [10 regression] 26% performance regression on Spec2017 548.exchange2_r
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92074

--- Comment #6 from Jan Hubicka ---
Author: hubicka
Date: Wed Oct 23 14:45:24 2019
New Revision: 277333

URL: https://gcc.gnu.org/viewcvs?rev=277333&root=gcc&view=rev
Log:
        PR ipa/92074
        * params.def (inline-heuristics-hint-percent): Set to 600.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/params.def
[Bug middle-end/92153] [10 Regression] ICE / segmentation fault, use-after-free at gcc/ggc-page.c:1159
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92153 --- Comment #3 from Jan Hubicka --- Author: hubicka Date: Fri Oct 25 11:17:38 2019 New Revision: 277443 URL: https://gcc.gnu.org/viewcvs?rev=277443&root=gcc&view=rev Log: Backport ggc_trim Backport from mainline 2019-10-18 Jakub Jelinek PR middle-end/92153 * ggc-page.c (release_pages): Read g->alloc_size before free rather than after it. 2019-10-11 Jan Hubicka * ggc-page.c (release_pages): Output statistics when !quiet_flag. (ggc_collect): Dump later to not interfere with release_page dump. (ggc_trim): New function. * ggc-none.c (ggc_trim): New. * ggc.h (ggc_trim): Declare. * lto-partition.c (add_symbol_to_partition_1): Update. (undo_parittion): Update. Modified: branches/gcc-9-branch/gcc/ChangeLog branches/gcc-9-branch/gcc/ggc-none.c branches/gcc-9-branch/gcc/ggc-page.c branches/gcc-9-branch/gcc/ggc.h branches/gcc-9-branch/gcc/lto/ChangeLog branches/gcc-9-branch/gcc/lto/lto.c
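The core of the PR middle-end/92153 fix is an ordering rule: read a field of a heap object before freeing it, not after. A minimal sketch of that pattern follows; `struct page_group`, `release_groups` and `push_group` are hypothetical simplifications and not the real ggc-page.c code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a group of allocated pages.  */
struct page_group
{
  struct page_group *next;
  size_t alloc_size;
};

/* Free every group in LIST and return the number of bytes released.
   Both NEXT and ALLOC_SIZE are read before free (); reading them
   afterwards would be a use-after-free.  */
static size_t release_groups (struct page_group *list)
{
  size_t released = 0;
  while (list)
    {
      struct page_group *next = list->next;
      released += list->alloc_size;   /* read before free */
      free (list);
      list = next;
    }
  return released;
}

/* Example helper: push a group of SIZE bytes onto LIST.  */
static struct page_group *push_group (struct page_group *list, size_t size)
{
  struct page_group *g = malloc (sizeof *g);
  assert (g != NULL);
  g->next = list;
  g->alloc_size = size;
  return g;
}
```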
[Bug ipa/92242] [10 regression] LTO ICE in ipa_get_cs_argument_count ipa-prop.h:598
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92242 --- Comment #3 from Jan Hubicka --- Author: hubicka Date: Mon Oct 28 08:19:56 2019 New Revision: 277504 URL: https://gcc.gnu.org/viewcvs?rev=277504&root=gcc&view=rev Log: PR ipa/92242 * ipa-fnsummary.c (ipa_merge_fn_summary_after_inlining): Check for missing EDGE_REF * ipa-prop.c (update_jump_functions_after_inlining): Likewise. Modified: trunk/gcc/ChangeLog trunk/gcc/ipa-fnsummary.c trunk/gcc/ipa-prop.c
[Bug ipa/92242] [10 regression] LTO ICE in ipa_get_cs_argument_count ipa-prop.h:598
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92242 Jan Hubicka changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #5 from Jan Hubicka --- Thanks for the confirmation (and the testcase). Sadly, I am not sure how to put it into the testsuite, but given that other tests also broke, I hope this patch is tested sufficiently. Honza
[Bug ipa/92278] [10 regression] LTO ICE ipa_get_ith_polymorhic_call_context ipa-prop.h:616
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92278 Jan Hubicka changed: What|Removed |Added CC||mjambor at suse dot cz --- Comment #3 from Jan Hubicka --- Since no -O0 code is involved here, I am not sure why the summary went missing. We probably should debug that. I think my patch from today silences the ICE, however. Martin, do you have any idea?
[Bug ipa/92254] [10 regression] ICE LTO in inline_small_functions, at ipa-inline.c:2000
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92254 Jan Hubicka changed: What|Removed |Added CC||mjambor at suse dot cz --- Comment #3 from Jan Hubicka --- Similarly here. It seems like a previously latent bug is showing up now.
[Bug ipa/92394] New: operand_equal_p should compare as base+offset when comparing addresses
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92394 Bug ID: 92394 Summary: operand_equal_p should compare as base+offset when comparing addresses Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org CC: marxin at gcc dot gnu.org Target Milestone: --- Compiling Firefox, one gets many of:
false returned: '' in operand_equal_p at ../../gcc/ipa-icf-gimple.c:259
false returned: 'operand_equal_p failed' in compare_operand at ../../gcc/ipa-icf-gimple.c:303
false returned: 'memory operands are different' in compare_gimple_assign at ../../gcc/ipa-icf-gimple.c:621
different statement for code: GIMPLE_ASSIGN (compare_bb:468):
_6 = &self_5->D.1557805.D.1541362.D.1218628.D.20474;
_6 = &self_5->D.1593155;
false returned: '' in equals_private at ../../gcc/ipa-icf.c:885
Equals called for: _finalize/10691342:_finalize/10809461 with result: false
Here operand_equal_p seems overly conservative (assuming that base+offset match). When comparing operands inside an ADDR_EXPR, it does not need to care about the actual access path.
[Bug ipa/92394] operand_equal_p should compare as base+offset when comparing addresses
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92394 --- Comment #1 from Jan Hubicka --- The following testcase is mergeable:
struct a {int a; int b;};
struct b {int c; short d;};
void *
retadr1(struct a *a)
{
  return &a->b;
}
void *
retadr2(struct b *a)
{
  return &a->d;
}
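The base+offset idea can be sketched for this testcase: both functions return their pointer parameter plus a constant byte offset, so comparing (base, offset) pairs instead of access paths shows that they compute the same address. `struct addr_key` and the helper functions below are hypothetical and do not describe how operand_equal_p is implemented; the two offsets agree on common ABIs, where both fields directly follow a single int.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The testcase from the comment above.  */
struct a { int a; int b; };
struct b { int c; short d; };

/* Hypothetical address key: which pointer parameter the address is based
   on and the constant byte offset from it.  The access path (struct a
   versus struct b) is deliberately not part of the key.  */
struct addr_key
{
  int base_parm;   /* index of the pointer parameter used as base */
  size_t offset;   /* byte offset of the addressed field */
};

static bool addr_key_equal (struct addr_key x, struct addr_key y)
{
  return x.base_parm == y.base_parm && x.offset == y.offset;
}

/* &a->b in retadr1: parameter 0 plus offsetof (struct a, b).  */
static struct addr_key retadr1_key (void)
{
  return (struct addr_key) { 0, offsetof (struct a, b) };
}

/* &a->d in retadr2: parameter 0 plus offsetof (struct b, d).  */
static struct addr_key retadr2_key (void)
{
  return (struct addr_key) { 0, offsetof (struct b, d) };
}
```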
[Bug ipa/92394] operand_equal_p should compare as base+offset when comparing addresses
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92394 Jan Hubicka changed: What|Removed |Added Status|NEW |UNCONFIRMED Last reconfirmed|2019-11-06 00:00:00 | Version|10.0|unknown Assignee|marxin at gcc dot gnu.org |unassigned at gcc dot gnu.org Target Milestone|10.0|--- Ever confirmed|1 |0 --- Comment #2 from Jan Hubicka --- These are statistics of the reasons why ICF fails:
   6523 false returned: 'different tree types' in compatible_types_p at ../../gcc/ipa-icf-gimple.c:203
   7521 false returned: 'parameter types are not compatible' in equals_wpa at ../../gcc/ipa-icf.c:637
  12973 false returned: 'memory operands are different' in compare_gimple_call at ../../gcc/ipa-icf-gimple.c:582
  14799 false returned: 'decl_or_type flags are different' in equals_wpa at ../../gcc/ipa-icf.c:570
  16052 false returned: 'inline attributes are different' in compare_referenced_symbol_properties at ../../gcc/ipa-icf.c:344
  20962 false returned: 'references to virtual tables cannot be merged' in compare_referenced_symbol_properties at ../../gcc/ipa-icf.c:364
  72431 false returned: 'call function types are not compatible' in compare_gimple_call at ../../gcc/ipa-icf-gimple.c:566
  80695 false returned: 'result types are different' in equals_wpa at ../../gcc/ipa-icf.c:619
  84475 false returned: 'types are not compatible' in compatible_types_p at ../../gcc/ipa-icf-gimple.c:209
 117458 false returned: '' in compare_gimple_call at ../../gcc/ipa-icf-gimple.c:545
 388866 false returned: 'THIS pointer ODR type mismatch' in equals_wpa at ../../gcc/ipa-icf.c:675
 391183 false returned: 'types are not same for ODR' in compatible_polymorphic_types_p at ../../gcc/ipa-icf-gimple.c:194
 618107 false returned: '' in operand_equal_p at ../../gcc/ipa-icf-gimple.c:259
2953032 false returned: 'memory operands are different' in compare_gimple_assign at ../../gcc/ipa-icf-gimple.c:621
3083711 false returned: 'operand_equal_p failed' in compare_operand at ../../gcc/ipa-icf-gimple.c:303
3156681 false returned: '' in equals_private at ../../gcc/ipa-icf.c:885
So 2.9M function bodies are streamed in only to be rejected because memory operands are different. Honza
[Bug ipa/92394] operand_equal_p should compare as base+offset when comparing addresses
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92394 --- Comment #3 from Jan Hubicka --- These are the corresponding stats from GCC 9, so we definitely load a lot more bodies now:
  13228 false returned: 'memory operands are different' (compare_gimple_call:785)
  14011 false returned: 'decl_or_type flags are different' (equals_wpa:577)
  15619 false returned: 'types are not compatible' (compatible_types_p:233)
  16877 false returned: (compare_cst_or_decl:341)
  17365 false returned: 'references to virtual tables cannot be merged' (compare_referenced_symbol_properties:370)
  19423 false returned: (compare_operand:478)
  28816 false returned: (compare_operand:509)
  87413 false returned: 'memory operands are different' (compare_gimple_assign:824)
 199751 false returned: 'THIS pointer ODR type mismatch' (equals_wpa:682)
 201097 false returned: 'types are not same for ODR' (compatible_polymorphic_types_p:218)
 375744 false returned: 'parameter type is not compatible' (compatible_parm_types_p:509)
 457840 false returned: '' (equals_private:890)
 783534 false returned: 'alias sets are different' (compatible_types_p:244)
GCC 9 merges 40k functions, while trunk merges 30k.
[Bug lto/92406] [10 Regression] ICE in ipa_call_summary at ipa-fnsummary.h:253 with lto and pgo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92406 --- Comment #4 from Jan Hubicka --- Created attachment 47193 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47193&action=edit Proposed patch Hi, does this patch fix the problem? Honza
[Bug lto/92406] [10 Regression] ICE in ipa_call_summary at ipa-fnsummary.h:253 with lto and pgo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92406 --- Comment #7 from Jan Hubicka --- Author: hubicka Date: Thu Nov 7 17:08:11 2019 New Revision: 277927 URL: https://gcc.gnu.org/viewcvs?rev=277927&root=gcc&view=rev Log: PR ipa/92406 * ipa-fnsummary.c (analyze_function_body): Use get_create to copy summary. Modified: trunk/gcc/ChangeLog trunk/gcc/ipa-fnsummary.c
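The fix replaces a summary lookup that may return NULL with one that allocates on demand. The sketch below models that get versus get_create distinction with a hypothetical fixed-size table in plain C; GCC's real summaries are C++ templates, and every name here (`summary_get`, `summary_get_create`, `summary_copy`) is invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_NODES 16

/* Hypothetical per-node summary.  */
struct summary { int size; int time; };

/* Table indexed by node uid; NULL means "no summary yet".  */
static struct summary *table[MAX_NODES];

/* Return the summary of node UID, or NULL if none exists.  */
static struct summary *summary_get (int uid)
{
  assert (uid >= 0 && uid < MAX_NODES);
  return table[uid];
}

/* Return the summary of node UID, allocating a zeroed one if needed.  */
static struct summary *summary_get_create (int uid)
{
  assert (uid >= 0 && uid < MAX_NODES);
  if (!table[uid])
    table[uid] = calloc (1, sizeof (struct summary));
  return table[uid];
}

/* Copy the summary of SRC to DST.  Using summary_get for DST would
   return NULL (and crash on the store) when DST has no summary yet.  */
static void summary_copy (int dst, int src)
{
  const struct summary *s = summary_get (src);
  if (!s)
    return;
  struct summary *d = summary_get_create (dst);
  assert (d != NULL);
  *d = *s;
}
```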
[Bug ipa/92471] [ICE] lto1 segmentation fault: ipa-profile.c ipa_get_cs_argument_count (args=0x0)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92471 Jan Hubicka changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #6 from Jan Hubicka --- Fixed.
[Bug ipa/92471] [ICE] lto1 segmentation fault: ipa-profile.c ipa_get_cs_argument_count (args=0x0)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92471 --- Comment #5 from Jan Hubicka --- Author: hubicka Date: Tue Nov 12 19:31:04 2019 New Revision: 278100 URL: https://gcc.gnu.org/viewcvs?rev=278100&root=gcc&view=rev Log: PR ipa/92471 * ipa-profile.c (check_argument_count): Break out from ...; watch for missing summaries. (ipa_profile): Here. Modified: trunk/gcc/ChangeLog trunk/gcc/ipa-profile.c
[Bug ipa/92498] [10 regression] gcc.dg/tree-prof/crossmodule-indircall-1.c fails starting with r278100
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92498 --- Comment #1 from Jan Hubicka --- Author: hubicka Date: Wed Nov 13 19:44:35 2019 New Revision: 278157 URL: https://gcc.gnu.org/viewcvs?rev=278157&root=gcc&view=rev Log: PR ipa/92498 * ipa-profile.c (check_argument_count): Do not ICE when descriptors is NULL. (ipa_profile): Fix reversed test. Modified: trunk/gcc/ChangeLog trunk/gcc/ipa-profile.c
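Both ChangeLog entries come down to the same defensive pattern: a check that reads summaries must tolerate missing ones instead of dereferencing them. The sketch below is a hypothetical simplification (the types and `argument_count_ok` are not GCC's), and it assumes that missing information should simply not be reported as a mismatch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the call-site argument
   summary and the callee's parameter descriptors.  */
struct edge_args { int count; };
struct node_params { const int *descriptors; int descriptor_count; };

/* Return false only when a mismatch between the number of actual
   arguments and formal parameters can be proven.  A missing summary
   (NULL INFO or ARGS) or a missing descriptor array is treated as
   "no information" rather than being dereferenced.  */
static bool argument_count_ok (const struct node_params *info,
                               const struct edge_args *args)
{
  if (info == NULL || info->descriptors == NULL || args == NULL)
    return true;
  return info->descriptor_count == args->count;
}
```

A caller would then skip the transformation when `argument_count_ok` returns false; getting the sense of that test backwards is the kind of mistake the second fix ("Fix reversed test") refers to.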
[Bug ipa/92421] [10 Regression] ICE in inline_small_functions, at ipa-inline.c:2001 since r277759
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92421 --- Comment #6 from Jan Hubicka --- Author: hubicka Date: Wed Nov 13 21:02:11 2019 New Revision: 278159 URL: https://gcc.gnu.org/viewcvs?rev=278159&root=gcc&view=rev Log: PR c++/92421 * ipa-prop.c (update_indirect_edges_after_inlining): Mark parameter as used. * ipa-inline.c (recursive_inlining): Reset node cache after inlining. (inline_small_functions): Remove checking ifdef. * ipa-inline-analysis.c (do_estimate_edge_time): Verify cache consistency. * g++.dg/torture/pr92421.C: New testcase. Added: trunk/gcc/testsuite/g++.dg/torture/pr92421.C Modified: trunk/gcc/ChangeLog trunk/gcc/ipa-inline-analysis.c trunk/gcc/ipa-inline.c trunk/gcc/ipa-prop.c trunk/gcc/testsuite/ChangeLog
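Resetting a cache after inlining and verifying it in checking builds can be illustrated with a toy memoized estimate. The cost model, `struct edge_cache` and both functions are hypothetical; only the pattern (invalidate on change, recompute and compare when verifying) mirrors the ChangeLog.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical memoized estimate for one call edge.  */
struct edge_cache { bool valid; int time; };

/* Size of the callee body; the estimate depends on it.  */
static int callee_size = 10;

/* Placeholder cost model.  */
static int compute_edge_time (void)
{
  return 2 * callee_size + 1;
}

/* Return the (possibly cached) estimate.  With VERIFY set, recompute it
   and check that the cached value agrees, in the spirit of a checking
   build.  */
static int estimate_edge_time (struct edge_cache *c, bool verify)
{
  if (c->valid)
    {
      if (verify)
        assert (c->time == compute_edge_time ());
      return c->time;
    }
  c->time = compute_edge_time ();
  c->valid = true;
  return c->time;
}

/* Changing the callee (e.g. by inlining into it) must reset the cache;
   forgetting this is what the verification is meant to catch.  */
static void inline_into_callee (struct edge_cache *c, int added_size)
{
  callee_size += added_size;
  c->valid = false;
}
```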
[Bug c/66825] RFE: Add attributes for symbol versioning.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66825 Jan Hubicka changed: What|Removed |Added Status|NEW |RESOLVED CC||hubicka at gcc dot gnu.org Resolution|--- |DUPLICATE --- Comment #2 from Jan Hubicka --- We have an earlier bug on this. I am going to attach a WIP patch there. *** This bug has been marked as a duplicate of bug 48200 ***
[Bug lto/48200] Implement function attribute for symbol versioning (.symver)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48200 Jan Hubicka changed: What|Removed |Added CC||carlos at redhat dot com --- Comment #37 from Jan Hubicka --- *** Bug 66825 has been marked as a duplicate of this bug. ***
[Bug testsuite/92520] [10 Regression] new test case gcc/testsuite/gcc.dg/ipa/inline-9.c in r278220 is unresolved
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92520 --- Comment #1 from Jan Hubicka --- Author: hubicka Date: Fri Nov 15 08:19:16 2019 New Revision: 278279 URL: https://gcc.gnu.org/viewcvs?rev=278279&root=gcc&view=rev Log: PR testsuite/92520 * gcc.dg/ipa/inline-9.c: Fix template. Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/ipa/inline-9.c
[Bug testsuite/92520] [10 Regression] new test case gcc/testsuite/gcc.dg/ipa/inline-9.c in r278220 is unresolved
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92520 Jan Hubicka changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #2 from Jan Hubicka --- I have fixed the testcase in r278279
[Bug lto/48200] Implement function attribute for symbol versioning (.symver)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48200 --- Comment #40 from Jan Hubicka --- I posted an initial patch here: https://gcc.gnu.org/ml/gcc-patches/2019-11/msg01334.html
[Bug ipa/92528] [10 Regression] ICE in ipa_get_parm_lattices since r278219
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92528 Jan Hubicka changed: What|Removed |Added Assignee|hubicka at gcc dot gnu.org |fxue at os dot amperecomputing.com --- Comment #6 from Jan Hubicka --- This is the same issue I hit in the Firefox build, which we discussed at:
https://gcc.gnu.org/ml/gcc-patches/2019-11/msg01351.html
Feng is right that ipa_set_jf_unknown is missing a clear of agg.
> I checked update_jump_functions_after_inlining(), and found one suspicious
> place:
>   for (i = 0; i < count; i++)
>     {
>       struct ipa_jump_func *dst = ipa_get_ith_jump_func (args, i);
>       if (!top)
>         {
>           ipa_set_jf_unknown (dst);
> <<<<<<<<<<<<<<<<< we should also invalidate dst->agg.items.
Yes, the following patch fixes it:
Index: ipa-prop.c
===
--- ipa-prop.c	(revision 278222)
+++ ipa-prop.c	(working copy)
@@ -514,6 +514,8 @@ ipa_set_jf_unknown (struct ipa_jump_func
   jfunc->type = IPA_JF_UNKNOWN;
   jfunc->bits = NULL;
   jfunc->m_vr = NULL;
+  jfunc->agg.by_ref = false;
+  jfunc->agg.items = NULL;
 }
 
 /* Set JFUNC to be a copy of another jmp (to be used by jump function
>           continue;
>         }
>       class ipa_polymorphic_call_context *dst_ctx
>         = ipa_get_ith_polymorhic_call_context (args, i); <<<< An irrelevant
> point: and should we also do some kind of cleaning on dst_ctx?
There is no need to clear the polymorphic call context. It does not refer to the parameters of the caller. If it was valid for all possible contexts, it is still valid.
So I think ipa_set_jf_unknown shall not clear bits and m_vr. Honza
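Why the aggregate part has to be dropped as well can be shown with a heavily simplified, hypothetical jump function. The field names loosely follow the patch (`agg.by_ref`, `agg.items`), but `n_items`, the enum and `set_jf_unknown` are invented for the example and do not match ipa-prop.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum jf_type { JF_UNKNOWN, JF_CONST, JF_PASS_THROUGH };

/* Known contents of an aggregate passed to the callee.  */
struct agg_info
{
  bool by_ref;
  const int *items;   /* known values, NULL if none */
  int n_items;
};

/* Hypothetical, heavily simplified jump function.  */
struct jump_func
{
  enum jf_type type;
  struct agg_info agg;
};

/* Mark JF as unknown.  Resetting only TYPE would leave AGG describing
   the previous argument, and consumers that look at agg.items would
   keep using stale values; so the aggregate part is cleared too.  */
static void set_jf_unknown (struct jump_func *jf)
{
  jf->type = JF_UNKNOWN;
  jf->agg.by_ref = false;
  jf->agg.items = NULL;
  jf->agg.n_items = 0;
}
```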
[Bug ipa/92508] [10 Regression] ICE in do_estimate_edge_time, at ipa-inline-analysis.c:223 since r278159
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92508 --- Comment #8 from Jan Hubicka --- Aha, that makes sense: for sreal it is not guaranteed that a == a * 1 / 1, and the code was inconsistent about guarding the no-op scales. Thanks for tracking this down! I suppose it would also make sense to pre-compute 1/1 and use it instead of divisions. I will look into it after fixing other issues. Honza
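One way to keep no-op scales consistent is to short-circuit them explicitly and to compute the ratio once, using the same formula on every path. The sketch uses `double` instead of GCC's `sreal` purely for illustration; `scale` and `scale_all` are hypothetical helpers.

```c
#include <assert.h>

/* Scale VALUE by NUM/DEN.  The no-op case NUM == DEN returns VALUE
   untouched instead of relying on the arithmetic rounding back to
   VALUE, which a limited-precision type does not guarantee.  */
static double scale (double value, double num, double den)
{
  assert (den != 0.0);
  if (num == den)
    return value;
  return value * (num / den);
}

/* Scale N values by NUM/DEN, computing the ratio only once.  Uses the
   same guard and the same formula as scale (), so both paths agree.  */
static void scale_all (double *values, int n, double num, double den)
{
  assert (den != 0.0);
  if (num == den)
    return;
  const double factor = num / den;
  for (int i = 0; i < n; i++)
    values[i] *= factor;
}
```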
[Bug ipa/92535] New: [10 regression] ICF is relatively expensive and became less effective
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92535 Bug ID: 92535 Summary: [10 regression] ICF is relatively expensive and became less effective Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org CC: marxin at gcc dot gnu.org Target Milestone: --- Created attachment 47274 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47274&action=edit Memory use graph for linktime for GCC10
ICF is currently very conservative when optimizing libxul.so, saving only about 1.5% of the text segment:
$ bloaty libxul.so -- libxul.so.old2
     VM SIZE                         FILE SIZE
 ++ GROWING ++
  +1.5% +1.21Mi .text                +1.21Mi  +1.5%
  +4.4%  +351Ki .eh_frame             +351Ki  +4.4%
  +6.0%  +102Ki .eh_frame_hdr         +102Ki  +6.0%
  [ = ]       0 .strtab              +62.4Ki  +0.2%
  +0.5% +52.6Ki .rela.dyn            +52.6Ki  +0.5%
  +0.1% +19.6Ki .rodata              +19.6Ki  +0.1%
  +0.4% +13.2Ki .data.rel.ro.local   +13.2Ki  +0.4%
  +1.3% +9.97Ki .data.rel.ro         +9.97Ki  +1.3%
  +0.2%     +12 .gcc_except_table        +12  +0.2%
 -- SHRINKING --
  [ = ]       0 .symtab              -10.0Ki  -0.1%
  -0.0%     -64 .data                    -64  -0.0%
  -0.0%     -16 .bss                       0  [ = ]
 -+-+-+-+-+-+-+ MIXED +-+-+-+-+-+-+-
   +76%    +124 [Unmapped]           -3.04Ki -77.5%
  +1.3% +1.75Mi TOTAL                +1.79Mi  +0.9%
This used to be 7% with GCC 5 (on Firefox from 2015). At the same time, ICF is relatively expensive, both memory-wise and compile-time-wise. It increases peak memory use from 6GB to 7.5GB and compile time from:
real    8m57.454s
user    91m8.020s
sys     6m20.372s
to
real    9m41.361s
user    91m47.076s
sys     6m16.760s
For GCC 9 the code size improvement is 2.3%, and the build time changes from:
real    7m53.778s
user    76m10.368s
sys     6m55.324s
to
real    8m14.613s
user    72m57.932s
sys     6m32.792s
and peak memory use goes from 8GB to 10GB.
[Bug ipa/92535] [10 regression] ICF is relatively expensive and became less effective
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92535 --- Comment #1 from Jan Hubicka --- Created attachment 47275 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47275&action=edit memory use of GCC10 with icf disabled
[Bug ipa/92535] [10 regression] ICF is relatively expensive and became less effective
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92535 --- Comment #3 from Jan Hubicka --- Created attachment 47277 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47277&action=edit Memory use of gcc9 with ICF disabled
[Bug ipa/92535] [10 regression] ICF is relatively expensive and became less effective
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92535 --- Comment #2 from Jan Hubicka --- Created attachment 47276 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47276&action=edit Memory use of gcc9
[Bug ipa/92535] [10 regression] ICF is relatively expensive and became less effective
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92535 --- Comment #4 from Jan Hubicka --- I forgot the bloaty report for GCC 9 with ICF disabled:
$ bloaty libxul.so -- libxul.so.old
     VM SIZE                         FILE SIZE
 ++ GROWING ++
  +2.3% +1.87Mi .text                +1.87Mi  +2.3%
  +5.4%  +423Ki .eh_frame             +423Ki  +5.4%
  +7.1%  +122Ki .eh_frame_hdr         +122Ki  +7.1%
  +0.6% +61.3Ki .rela.dyn            +61.3Ki  +0.6%
  +0.2% +29.8Ki .rodata              +29.8Ki  +0.2%
  +0.4% +14.1Ki .data.rel.ro.local   +14.1Ki  +0.4%
  +1.5% +12.0Ki .data.rel.ro         +12.0Ki  +1.5%
  +0.1%    +224 .data                   +224  +0.1%
 -- SHRINKING --
  [ = ]       0 .strtab               -291Ki  -0.7%
  [ = ]       0 .symtab              -46.6Ki  -0.3%
 -69.6%    -240 [Unmapped]           -3.12Ki -73.1%
  -0.0%    -120 .bss                       0  [ = ]
  +1.8% +2.51Mi TOTAL                +2.18Mi  +1.1%
[Bug ipa/92508] [10 Regression] ICE in do_estimate_edge_time, at ipa-inline-analysis.c:223 since r278159
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92508 --- Comment #15 from Jan Hubicka --- Author: hubicka Date: Mon Nov 18 19:28:53 2019 New Revision: 278419 URL: https://gcc.gnu.org/viewcvs?rev=278419&root=gcc&view=rev Log: PR ipa/92508 * ipa-inline.c (inline_small_functions): Add new edges after reseting caches. * ipa-inline-analysis.c (do_estimate_edge_time): Fix sanity check. Modified: trunk/gcc/ChangeLog trunk/gcc/ipa-inline-analysis.c trunk/gcc/ipa-inline.c
[Bug ipa/92508] [10 Regression] ICE in do_estimate_edge_time, at ipa-inline-analysis.c:223 since r278159
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92508 Jan Hubicka changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #16 from Jan Hubicka --- Fixed all three problems.
[Bug ipa/92476] [10 regression] SEGV in cgraph_edge_brings_value_p
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92476 Jan Hubicka changed: What|Removed |Added Assignee|hubicka at gcc dot gnu.org |mjambor at suse dot cz --- Comment #3 from Jan Hubicka --- Martin, this problem is caused by ipa-cp deciding to clone a function which has a thunk associated with it. create_virtual_clone then copies the thunk (which is your code) and expands all thunks. This turns the thunk into a real function, and because ipa-cp does not produce summaries for thunks, we now ICE because the summary is missing. I tried the following to compute the missing summary:
Index: cgraphclones.c
===
--- cgraphclones.c	(revision 278390)
+++ cgraphclones.c	(working copy)
@@ -80,6 +80,11 @@ along with GCC; see the file COPYING3.
 #include "tree-inline.h"
 #include "dumpfile.h"
 #include "gimple-pretty-print.h"
+#include "alloc-pool.h"
+#include "symbol-summary.h"
+#include "tree-vrp.h"
+#include "ipa-prop.h"
+#include "ipa-fnsummary.h"
 
 /* Create clone of edge in the node N represented by CALL_EXPR the callgraph.  */
@@ -268,6 +273,8 @@ cgraph_node::expand_all_artificial_thunk
 	      thunk->thunk.thunk_p = false;
 	      thunk->analyze ();
 	    }
+	  ipa_analyze_node (thunk);
+	  inline_analyze_function (thunk);
 	  thunk->expand_all_artificial_thunks ();
 	}
       else
but that moves the ICE later:
hubicka@lomikamen-jh:/aux/hubicka/trunk5/build-lto/gcc$ ./xgcc -B ./ -O2 a.C -m32
during IPA pass: cp
a.C:40:1: internal compiler error: in ipa_get_parm_lattices, at ipa-cp.c:388
   40 | }
      | ^
0x234db7d ipa_get_parm_lattices
	../../gcc/ipa-cp.c:388
0x23595b1 ipcp_store_bits_results
	../../gcc/ipa-cp.c:5417
0x2359c7a ipcp_driver
	../../gcc/ipa-cp.c:5558
0x2359e58 execute
	../../gcc/ipa-cp.c:5647
Please submit a full bug report,
which is caused by the fact that we have no lattices for that function (since it was not considered by the propagator). I also wonder why this seems to show up with 32-bit only. Honza
[Bug c++/55135] Segfault of gcc on a big file
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55135 --- Comment #30 from Jan Hubicka --- Reconfirmed that we still take ages to build the testcase (the early inliner is still running for me). The early inliner issue here is caused by tree-inline removing individual clones one by one. Each time a clone is removed, a new clone becomes the root of the clone tree, and it takes a long time to update all the pointers.
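A toy model makes the cost visible: if every clone records the current root of its tree and each removal promotes a new root, removing N clones one at a time performs N(N-1)/2 pointer updates. `struct clone` and `remove_one_by_one` are hypothetical and do not reflect GCC's actual cgraph representation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical clone record: the index of the current root of its
   clone tree and whether it has been removed.  */
struct clone
{
  size_t root;
  bool removed;
};

/* Remove the N clones in C one at a time, always removing the current
   root.  After each removal the next clone becomes the root and every
   remaining clone's root index is rewritten.  Returns the number of
   root updates performed, which is N * (N - 1) / 2.  */
static size_t remove_one_by_one (struct clone *c, size_t n)
{
  size_t updates = 0;
  for (size_t r = 0; r < n; r++)
    {
      c[r].removed = true;
      for (size_t i = r + 1; i < n; i++)
        {
          c[i].root = r + 1;   /* clone r + 1 is the new root */
          updates++;
        }
    }
  return updates;
}
```

Removing the group in a single step, or in an order that never changes the root, avoids the quadratic re-pointing; which of these fits tree-inline is not stated in the comment above.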
[Bug ipa/44563] GCC uses a lot of RAM when compiling a large numbers of functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44563 Jan Hubicka changed: What|Removed |Added Status|ASSIGNED|NEW Assignee|hubicka at gcc dot gnu.org |unassigned at gcc dot gnu.org --- Comment #38 from Jan Hubicka --- It is GCC 10, but I finally managed to implement the incremental update here. Memory use is about 1.1GB, but the inliner finishes quite quickly:
Time variable usr sys wall GGC
phase setup : 0.00 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 1237 kB ( 0%)
phase parsing : 1.29 ( 2%) 1.24 ( 6%) 2.54 ( 3%) 247897 kB ( 6%)
phase lang. deferred : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
phase opt and generate : 56.81 ( 98%) 19.35 ( 94%) 76.27 ( 97%) 3859026 kB ( 94%)
garbage collection : 0.84 ( 1%) 0.10 ( 0%) 0.93 ( 1%) 0 kB ( 0%)
dump files : 3.28 ( 6%) 1.85 ( 9%) 5.30 ( 7%) 0 kB ( 0%)
callgraph construction : 0.70 ( 1%) 0.28 ( 1%) 1.07 ( 1%) 99328 kB ( 2%)
callgraph optimization : 1.38 ( 2%) 0.74 ( 4%) 2.03 ( 3%) 1026 kB ( 0%)
callgraph functions expansion : 47.27 ( 81%) 15.51 ( 75%) 62.89 ( 80%) 2827825 kB ( 69%)
callgraph ipa passes : 8.19 ( 14%) 3.26 ( 16%) 11.45 ( 15%) 709147 kB ( 17%)
ipa function summary : 0.34 ( 1%) 0.08 ( 0%) 0.43 ( 1%) 97794 kB ( 2%)
ipa dead code removal : 0.25 ( 0%) 0.01 ( 0%) 0.27 ( 0%) 0 kB ( 0%)
ipa inheritance graph : 0.01 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 0 kB ( 0%)
ipa devirtualization : 0.04 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 0 kB ( 0%)
ipa cp : 0.23 ( 0%) 0.02 ( 0%) 0.27 ( 0%) 7169 kB ( 0%)
ipa inlining heuristics : 0.19 ( 0%) 0.00 ( 0%) 0.22 ( 0%) 0 kB ( 0%)
ipa function splitting : 0.02 ( 0%) 0.01 ( 0%) 0.06 ( 0%) 0 kB ( 0%)
ipa comdats : 0.05 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 0 kB ( 0%)
ipa various optimizations : 0.06 ( 0%) 0.00 ( 0%) 0.06 ( 0%) 0 kB ( 0%)
ipa reference : 0.10 ( 0%) 0.00 ( 0%) 0.11 ( 0%) 0 kB ( 0%)
ipa profile : 0.07 ( 0%) 0.00 ( 0%) 0.06 ( 0%) 0 kB ( 0%)
ipa pure const : 0.45 ( 1%) 0.15 ( 1%) 0.47 ( 1%) 0 kB ( 0%)
ipa icf : 0.22 ( 0%) 0.01 ( 0%) 0.23 ( 0%) 0 kB ( 0%)
ipa SRA : 0.13 ( 0%) 0.00 ( 0%) 0.14 ( 0%) 5120 kB ( 0%)
ipa free lang data : 0.04 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 0 kB ( 0%)
ipa free inline summary : 0.08 ( 0%) 0.00 ( 0%) 0.07 ( 0%) 0 kB ( 0%)
cfg construction : 0.07 ( 0%) 0.01 ( 0%) 0.19 ( 0%) 0 kB ( 0%)
cfg cleanup : 0.73 ( 1%) 0.23 ( 1%) 0.95 ( 1%) 0 kB ( 0%)
trivially dead code : 0.30 ( 1%) 0.06 ( 0%) 0.30 ( 0%) 0 kB ( 0%)
df scan insns : 0.81 ( 1%) 0.21 ( 1%) 0.93 ( 1%) 3072 kB ( 0%)
df multiple defs : 0.28 ( 0%) 0.06 ( 0%) 0.41 ( 1%) 0 kB ( 0%)
df reaching defs : 1.48 ( 3%) 0.20 ( 1%) 1.63 ( 2%) 0 kB ( 0%)
df live regs : 1.12 ( 2%) 0.26 ( 1%) 1.33 ( 2%) 0 kB ( 0%)
df live&initialized regs : 0.51 ( 1%) 0.19 ( 1%) 0.66 ( 1%) 0 kB ( 0%)
df must-initialized regs : 0.11 ( 0%) 0.06 ( 0%) 0.14 ( 0%) 0 kB ( 0%)
df use-def / def-use chains : 0.36 ( 1%) 0.04 ( 0%) 0.43 ( 1%) 0 kB ( 0%)
df reg dead/unused notes : 1.69 ( 3%) 0.20 ( 1%) 1.81 ( 2%) 12288 kB ( 0%)
register information : 0.38 ( 1%) 0.04 ( 0%) 0.39 ( 0%) 0 kB ( 0%)
alias analysis : 0.82 ( 1%) 0.17 ( 1%) 1.15 ( 1%) 36865 kB ( 1%)
alias stmt walking : 0.06 ( 0%) 0.04 ( 0%) 0.07 ( 0%) 0 kB ( 0%)
register scan : 0.07 ( 0%) 0.03 ( 0%) 0.11 ( 0%) 0 kB ( 0%)
rebuild jump labels : 0.16 ( 0%) 0.06 ( 0%) 0.14 ( 0%) 0 kB ( 0%)
preprocessing : 0.39 ( 1%) 0.32 ( 2%) 0.49 ( 1%) 44508 kB ( 1%)
lexical analysis : 0.32 ( 1%) 0.39 ( 2%) 0.73 ( 1%) 0 kB ( 0%)
parser (global) : 0.11 ( 0%)
[Bug ipa/60243] IPA is slow on large cgraph tree
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60243 --- Comment #27 from Jan Hubicka --- The profile_estimate issue is still here; the inliner and early inliner issues seem solved. It seems that ipa_profile just orders the nodes for propagation the wrong way: we propagate from callers to callees, while the toposorter orders them for propagation the opposite way. operand_scan seems slow too.
Time variable usr sys wall GGC
phase setup : 0.00 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 1237 kB ( 0%)
phase parsing : 6.63 ( 9%) 6.77 ( 77%) 13.41 ( 17%) 655497 kB ( 20%)
phase opt and generate : 64.47 ( 91%) 2.07 ( 23%) 66.57 ( 83%) 2603397 kB ( 80%)
garbage collection : 0.64 ( 1%) 0.00 ( 0%) 0.65 ( 1%) 0 kB ( 0%)
dump files : 0.05 ( 0%) 0.01 ( 0%) 0.04 ( 0%) 0 kB ( 0%)
callgraph construction : 0.91 ( 1%) 0.01 ( 0%) 0.83 ( 1%) 399235 kB ( 12%)
callgraph optimization : 0.37 ( 1%) 0.00 ( 0%) 0.43 ( 1%) 0 kB ( 0%)
callgraph functions expansion : 15.98 ( 22%) 1.20 ( 14%) 17.18 ( 21%) 297309 kB ( 9%)
callgraph ipa passes : 40.57 ( 57%) 0.40 ( 5%) 40.99 ( 51%) 617751 kB ( 19%)
ipa function summary : 0.14 ( 0%) 0.00 ( 0%) 0.14 ( 0%) 1807 kB ( 0%)
ipa dead code removal : 0.22 ( 0%) 0.00 ( 0%) 0.24 ( 0%) 0 kB ( 0%)
ipa cp : 0.97 ( 1%) 0.03 ( 0%) 1.03 ( 1%) 327514 kB ( 10%)
ipa inlining heuristics : 0.72 ( 1%) 0.00 ( 0%) 0.63 ( 1%) 84183 kB ( 3%)
ipa function splitting : 0.02 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 0 kB ( 0%)
ipa various optimizations : 0.69 ( 1%) 0.20 ( 2%) 0.89 ( 1%) 128398 kB ( 4%)
ipa reference : 0.05 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 0 kB ( 0%)
ipa profile : 18.24 ( 26%) 0.00 ( 0%) 18.25 ( 23%) 0 kB ( 0%)
ipa pure const : 0.45 ( 1%) 0.00 ( 0%) 0.46 ( 1%) 0 kB ( 0%)
ipa icf : 0.17 ( 0%) 0.02 ( 0%) 0.17 ( 0%) 0 kB ( 0%)
ipa SRA : 0.21 ( 0%) 0.00 ( 0%) 0.21 ( 0%) 102 kB ( 0%)
ipa free inline summary : 0.03 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 0 kB ( 0%)
cfg cleanup : 0.00 ( 0%) 0.01 ( 0%) 0.02 ( 0%) 0 kB ( 0%)
trivially dead code : 0.12 ( 0%) 0.03 ( 0%) 0.12 ( 0%) 0 kB ( 0%)
df scan insns : 0.85 ( 1%) 0.14 ( 2%) 1.28 ( 2%) 46 kB ( 0%)
df multiple defs : 0.30 ( 0%) 0.06 ( 1%) 0.31 ( 0%) 0 kB ( 0%)
df reaching defs : 0.69 ( 1%) 0.05 ( 1%) 0.63 ( 1%) 0 kB ( 0%)
df live regs : 0.49 ( 1%) 0.02 ( 0%) 0.57 ( 1%) 0 kB ( 0%)
df live&initialized regs : 0.19 ( 0%) 0.01 ( 0%) 0.12 ( 0%) 0 kB ( 0%)
df must-initialized regs : 0.10 ( 0%) 0.00 ( 0%) 0.10 ( 0%) 0 kB ( 0%)
df use-def / def-use chains : 0.44 ( 1%) 0.05 ( 1%) 0.40 ( 1%) 0 kB ( 0%)
df reg dead/unused notes : 1.35 ( 2%) 0.09 ( 1%) 1.15 ( 1%) 747 kB ( 0%)
register information : 0.16 ( 0%) 0.00 ( 0%) 0.18 ( 0%) 0 kB ( 0%)
alias analysis : 0.16 ( 0%) 0.00 ( 0%) 0.11 ( 0%) 436 kB ( 0%)
alias stmt walking : 0.49 ( 1%) 0.07 ( 1%) 0.67 ( 1%) 0 kB ( 0%)
register scan : 0.04 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
rebuild jump labels : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
preprocessing : 2.37 ( 3%) 2.37 ( 27%) 4.49 ( 6%) 383477 kB ( 12%)
lexical analysis : 1.88 ( 3%) 2.13 ( 24%) 4.20 ( 5%) 0 kB ( 0%)
parser (global) : 0.01 ( 0%) 0.01 ( 0%) 0.03 ( 0%) 1442 kB ( 0%)
parser function body : 2.19 ( 3%) 2.26 ( 26%) 4.50 ( 6%) 270577 kB ( 8%)
early inlining heuristics : 2.80 ( 4%) 0.03 ( 0%) 2.81 ( 4%) 3076 kB ( 0%)
inline parameters : 6.43 ( 9%) 0.14 ( 2%) 6.74 ( 8%) 31127 kB ( 1%)
integration : 0.17 ( 0%) 0.00 ( 0%) 0.08 ( 0%) 6789 kB ( 0%)
tree gimplify : 1.01 ( 1%) 0.03 ( 0%) 1.15 ( 1%) 610970 kB ( 19%)
tree eh : 0.50 ( 1%) 0.03 ( 0%) 0.44 ( 1%) 0 kB ( 0%)
tree CFG construction : 3.50 ( 5%) 0.02 ( 0%) 3.74 ( 5%) 628087 kB ( 19%)
tree CFG cleanup
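The ordering problem described in the comment above can be shown on a hypothetical four-node call chain: a postorder lists callees before callers, which suits callee-to-caller propagation, so a caller-to-callee sweep has to walk it in reverse to finish in a single pass. This illustrates the general issue only; it is not ipa-profile.c code, and the dump alone does not establish how much of the 18 seconds spent in ipa profile is due to extra iterations.

```c
#include <assert.h>
#include <stdbool.h>

#define NODES 4

/* Hypothetical call chain main -> f1 -> f2 -> f3 with nodes numbered
   0..3; callee[i] is the only callee of node i, or -1 for none.  */
static const int callee[NODES] = { 1, 2, 3, -1 };

/* A postorder lists callees before callers (f3, f2, f1, main), the
   natural order for callee-to-caller propagation.  */
static const int postorder[NODES] = { 3, 2, 1, 0 };

/* Make one sweep over ORDER (backwards if REVERSE), propagating a "hot"
   flag from callers to callees, starting with only main hot.  Returns
   the number of nodes marked hot after that single sweep.  */
static int hot_after_one_sweep (const int order[NODES], bool reverse)
{
  bool hot[NODES] = { true, false, false, false };
  for (int k = 0; k < NODES; k++)
    {
      int node = order[reverse ? NODES - 1 - k : k];
      if (hot[node] && callee[node] >= 0)
        hot[callee[node]] = true;
    }
  int count = 0;
  for (int i = 0; i < NODES; i++)
    if (hot[i])
      count++;
  return count;
}
```

Walking the postorder forwards reaches only f1 in one sweep (main plus f1 are hot), so a fixed point needs repeated sweeps; walking it backwards marks the whole chain at once.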
[Bug tree-optimization/92632] New: Calculix regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92632 Bug ID: 92632 Summary: Calculix regression Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- LNT testing shows a 137% regression of calculix with LTO and PGO:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=288.170.0
The range is between Revision: fbbadf0018292a93 (2019-11-15 03:28) and Revision: 1e9cd853b7ecae82 (2019-11-18 02:22).
The diff from this range is:
+2019-11-18  Hongtao Liu
+
+	PR target/92448
+	* config/i386/i386-expand.c (ix86_expand_set_or_cpymem):
+	Replace TARGET_AVX128_OPTIMAL with TARGET_AVX256_SPLIT_REGS.
+	* config/i386/i386-option.c (ix86_vec_cost): Ditto.
+	(ix86_reassociation_width): Ditto.
+	* config/i386/i386-options.c (ix86_option_override_internal):
+	Replace TARGET_AVX128_OPTIAML with
+	ix86_tune_features[X86_TUNE_AVX128_OPTIMAL]
+	* config/i386/i386.h (TARGET_AVX256_SPLIT_REGS): New macro.
+	(TARGET_AVX128_OPTIMAL): Deleted.
+	* config/i386/x86-tune.def (X86_TUNE_AVX256_SPLIT_REGS): New
+	DEF_TUNE.
+
+2019-11-16  Segher Boessenkool
+
+	* config/rs6000/rs6000.md (cceq_ior_compare): Rename to...
+	(@cceq_ior_compare_ for GPR): ... this.  Allow GPR instead of
+	just SI.
+	(cceq_rev_compare): Rename to...
+	(@cceq_rev_compare_ for GPR): ... this.  Allow GPR instead of
+	just SI.
+	(define_split for tf_): Add SImode first argument to
+	gen_cceq_ior_compare.
+
+2019-11-16  Segher Boessenkool
+
+	* common/config/powerpcspe: Delete.
+
+2019-11-16  Richard Sandiford
+
+	* config/aarch64/aarch64-sve.md (aarch64_wrffr): Wrap the FFRT
+	output in UNSPEC_WRFFR.
+
+2019-11-16  Richard Sandiford
+
+	* tree-data-ref.c (create_intersect_range_checks_index): Rewrite
+	the index tests to have the form (unsigned T) (B - A + bias) <= limit.
+
+2019-11-16  Richard Sandiford
+
+	* tree-data-ref.c (create_intersect_range_checks_index)
+	(create_intersect_range_checks): Print dump messages.
+
+2019-11-16  Richard Sandiford
+
+	* tree-data-ref.c (dump_alias_pair): New function.
+	(prune_runtime_alias_test_list): Use it to dump each merged alias pair.
+
+2019-11-16  Richard Sandiford
+
+	* tree-data-ref.h (DR_ALIAS_MIXED_STEPS): New flag.
+	* tree-data-ref.c (prune_runtime_alias_test_list): Set it when
+	merging data references with different steps.
+	(create_intersect_range_checks_index): Take a
+	dr_with_seg_len_pair_t instead of two dr_with_seg_lens.
+	Bail out if DR_ALIAS_MIXED_STEPS is set.
+	(create_intersect_range_checks): Take a dr_with_seg_len_pair_t
+	instead of two dr_with_seg_lens.  Update call to
+	create_intersect_range_checks_index.
+	(create_runtime_alias_checks): Update call accordingly.
+
+2019-11-16  Richard Sandiford
+
+	* tree-data-ref.h (DR_ALIAS_RAW, DR_ALIAS_WAR, DR_ALIAS_WAW)
+	(DR_ALIAS_ARBITRARY, DR_ALIAS_SWAPPED, DR_ALIAS_UNSWAPPED): New flags.
+	(dr_with_seg_len_pair_t::sequencing): New enum.
+	(dr_with_seg_len_pair_t::flags): New member variable.
+	(dr_with_seg_len_pair_t::dr_with_seg_len_pair_t): Take a sequencing
+	parameter and initialize the flags member variable.
+	* tree-loop-distribution.c (compute_alias_check_pairs): Update
+	call accordingly.
+	* tree-vect-data-refs.c (vect_prune_runtime_alias_test_list): Likewise.
+	Ensure the two data references in an alias pair are in statement
+	order, if there is a defined order.
+	* tree-data-ref.c (prune_runtime_alias_test_list): Use
+	DR_ALIAS_SWAPPED and DR_ALIAS_UNSWAPPED to record whether we've
+	swapped the references in a dr_with_seg_len_pair_t.  OR together
+	the flags when merging two dr_with_seg_len_pair_ts.  After merging,
+	try to restore the original dr_with_seg_len order, updating the
+	flags if that fails.
+
+2019-11-16  Richard Sandiford
+
+	* tree-data-ref.c (prune_runtime_alias_test_list): Delay
+	swapping the dr_as based on init values until we've decided
+	whether to merge them.
+
+2019-11-16  Richard Sandiford
+
+	* tree-data-ref.c (prune_runtime_alias_test_list): Sort the
+	two accesses in each dr_with_seg_len_pair_t before trying to
+	combine separate dr_with_seg_len_pair_ts.
+	* tree-loop-distribution.c (compute_alias_check_pairs): Don't do
+	that here.
+	* tree-vect-data-refs.c (vect_prune_runtime_alias_test_list): Likewise.
+
+2019-11-16  Richard Sandiford
+
+	* config/aarch64/aarch64-sve.md
+	(scatter_store): Extend to...
+	(scatt
[Bug tree-optimization/92645] New: Hand written vector code is 450 times slower when compiled with GCC compared to Clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92645 Bug ID: 92645 Summary: Hand written vector code is 450 times slower when compiled with GCC compared to Clang Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- Hi, attached are preprocessed files for Skia in which the Clang ifdefs were removed, so we get roughly the same file for GCC and Clang. The internal loop of _ZN3hsw16blit_row_color32EPjPKjij, _ZN3hsw16blit_row_color32EPjPKjij, _ZN3hsw16blit_row_color32EPjPKjij and _ZN3hsw16blit_row_color32EPjPKjij looks a lot worse when compiled by GCC than by Clang. I also added flatten to eliminate the inlining difference. Clang has heuristics that make functions with hand-written vector code hot. GCC code packs via the stack:
  0.43 │ mov %ax,0xae(%rsp)
  0.03 │ movzbl 0x78(%rsp),%eax
  0.02 │ mov %cx,0xd8(%rsp)
  0.02 │ mov %ax,0xb0(%rsp)
  0.54 │ vpextrb $0x9,%xmm5,%eax
  0.16 │ mov %ax,0xb2(%rsp)
  0.51 │ vpextrb $0xa,%xmm5,%eax
  0.21 │ mov %ax,0xb4(%rsp)
  0.16 │ vpextrb $0xb,%xmm5,%eax
  0.46 │ mov %ax,0xb6(%rsp)
  0.24 │ vpextrb $0xc,%xmm5,%eax
  0.28 │ mov %ax,0xb8(%rsp)
  0.41 │ vpextrb $0xd,%xmm5,%eax
  0.20 │ mov %ax,0xba(%rsp)
  0.47 │ vpextrb $0xe,%xmm5,%eax
  0.92 │ mov %ax,0xbc(%rsp)
  0.72 │ vpextrb $0xf,%xmm5,%eax
  1.24 │ mov %ax,0xbe(%rsp)
 10.94 │ vmovdqa 0xa0(%rsp),%ymm4
  0.02 │ mov %cx,0xda(%rsp)
  0.00 │ mov %cx,0xdc(%rsp)
       │ mov %cx,0xde(%rsp)
 10.34 │ vpmullw 0xc0(%rsp),%ymm4,%ymm0
  2.05 │ vpaddw %ymm1,%ymm0,%ymm0
  0.50 │ vpaddw %ymm3,%ymm0,%ymm0
  0.00 │ mov %r9,0x58(%rsp)
  0.52 │ vpsrlw $0x8,%ymm0,%ymm0
  0.39 │ vpextrw $0x0,%xmm0,%eax
  0.69 │ mov %al,%r8b
  0.17 │ vpextrw $0x1,%xmm0,%eax
  0.51 │ mov %r8,0x50(%rsp)
  6.87 │ vmovdqa 0x50(%rsp),%xmm5
  1.08 │ vpinsrb $0x1,%eax,%xmm5,%xmm1
  0.00 │ vpextrw $0x2,%xmm0,%eax
  0.73 │ vpinsrb $0x2,%eax,%xmm1,%xmm1
  0.02 │ vpextrw $0x3,%xmm0,%eax
  0.75 │ vpinsrb $0x3,%eax,%xmm1,%xmm1
  0.10 │ vpextrw $0x4,%xmm0,%eax
  0.98 │ vpinsrb $0x4,%eax,%xmm1,%xmm1
  0.16 │ vpextrw $0x5,%xmm0,%eax
  1.00 │ vpinsrb $0x5,%eax,%xmm1,%xmm1
  0.22 │ vpextrw $0x6,%xmm0,%eax
  1.10 │ vpinsrb $0x6,%eax,%xmm1,%xmm1
  0.30 │ vpextrw $0x7,%xmm0,%eax
  0.31 │ vextracti128 $0x1,%ymm0,%xmm0
  0.90 │ vpinsrb $0x7,%eax,%xmm1,%xmm6
  0.21 │ vpextrw $0x0,%xmm0,%eax
  0.35 │ vmovaps %xmm6,0x50(%rsp)
  1.15 │ mov 0x58(%rsp),%r9
  0.13 │ mov 0x50(%rsp),%r8
  0.29 │ mov %al,%r9b
  0.49 │ mov %r8,0x50(%rsp)
  0.07 │ vpextrw $0x1,%xmm0,%eax
  0.45 │ mov %r9,0x58(%rsp)
  7.08 │ vmovdqa 0x50(%rsp),%xmm7
  1.19 │ vpinsrb $0x9,%eax,%xmm7,%xmm1
  0.00 │ vpextrw $0x2,%xmm0,%eax
  0.78 │ vpinsrb $0xa,%eax,%xmm1,%xmm1
  0.00 │ vpextrw $0x3,%xmm0,%eax
  0.77 │ vpinsrb $0xb,%eax,%xmm1,%xmm1
  0.01 │ vpextrw $0x4,%xmm0,%eax
  0.86 │ vpinsrb $0xc,%eax,%xmm1,%xmm1
  0.03 │ vpextrw $0x5,%xmm0,%eax
  0.88 │ vpinsrb $0xd,%eax,%xmm1,%xmm1
  0.04 │ vpextrw $0x6,%xmm0,%eax
  0.97 │ vpinsrb $0xe,%eax,%xmm1,%xmm1
  0.08 │ vpextrw $0x7,%xmm0,%eax
  1.44 │ vpinsrb $0xf,%eax,%xmm1,%xmm0
  1.37 │ vpextrd $0x1,%xmm0,%eax
  0.13 │ vinsertps $0xe,%xmm0,%xmm0,%xmm1
  0.02 │ vmovaps %xmm0,0x50(%rsp)
  2.17 │ vpinsrd $0x1,%eax,%xmm1,%xmm1
Clang code:
Percent│ vpmullw %ymm0,%ymm2,%ymm2
       │ vpaddw %ymm1,%ymm2,%ymm2
       │ vpsrlw $0x8,%ymm2,%ymm2
       │ vextracti128 $0x1,%ymm2,%xmm3
       │ vpackuswb %xmm3,%xmm2,%xmm2
       │ vmovdqu %xmm2,(%rdi)
       │ add $0x10,%rsi
       │ add $0x10,%rdi
       │ mov %r9d,%eax
       │ cmp $0x4,%r9d
       │ │ jae 39179b0
       │ │ jmp 3917a02
       │ mov %edx,%eax
  0.29 │ cmp $0x4,%r9d
  0.00 │ │ jb 3917a02
  0.07 │ nop
  3.95 │ vpmovzxbw (%rsi),%ymm2
 13.41 │ vpmullw %ymm0,%ymm2,%ymm2
 13.87
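For reference, the Clang listing follows a widen, multiply, add, shift, narrow pattern: vpmovzxbw zero-extends bytes to 16-bit lanes, vpmullw and vpaddw do the arithmetic, vpsrlw $0x8 shifts, and vpackuswb narrows back to bytes. A scalar sketch of that per-byte computation, assuming the same multiplier and addend in every lane; `scale_bytes`, `scale` and `bias` are hypothetical names, not the Skia blit_row_color32 code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for the pattern in the Clang listing.
   Each byte is widened to 16 bits, multiplied and offset with 16-bit
   wrap-around (like vpmullw/vpaddw), shifted right by 8 (vpsrlw $0x8)
   and narrowed back to a byte (vpackuswb).  */
static void
scale_bytes (uint8_t *dst, const uint8_t *src, size_t n,
             uint16_t scale, uint16_t bias)
{
  for (size_t i = 0; i < n; i++)
    {
      /* Compute in int, then truncate modulo 2^16 like the 16-bit lanes.  */
      uint16_t wide = (uint16_t) (src[i] * scale + bias);
      /* After >> 8 the value fits in 8 bits, so narrowing is exact.  */
      dst[i] = (uint8_t) (wide >> 8);
    }
}
```

Written as a simple counted loop over fixed-width types, this is the form auto-vectorizers usually target; whether GCC 10 produces the Clang-like sequence for it is not something this report establishes.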
[Bug tree-optimization/92645] Hand written vector code is 450 times slower when compiled with GCC compared to Clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92645 --- Comment #1 from Jan Hubicka --- Created attachment 47340 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47340&action=edit Clang source
[Bug tree-optimization/92645] Hand written vector code is 450 times slower when compiled with GCC compared to Clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92645 --- Comment #2 from Jan Hubicka --- Created attachment 47341 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47341&action=edit clang output with -O2 -mavx2 -mf16c -mfma
[Bug tree-optimization/92645] Hand written vector code is 450 times slower when compiled with GCC compared to Clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92645 --- Comment #3 from Jan Hubicka --- Created attachment 47342 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47342&action=edit GCC source
[Bug tree-optimization/92645] Hand written vector code is 450 times slower when compiled with GCC compared to Clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92645 --- Comment #4 from Jan Hubicka --- Created attachment 47343 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47343&action=edit GCC 10 output
[Bug bootstrap/92680] New: PGO bootstrap is broken with --with-build-config=bootstrap-lto-lean and in-itree mpfr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92680 Bug ID: 92680 Summary: PGO bootstrap is broken with --with-build-config=bootstrap-lto-lean and in-itree mpfr Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: bootstrap Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- A build with bootstrap-lto-lean and in-tree mpfr fails with a profile mismatch on set_d.o. This is caused by the fact that mpfr actually misconfigures itself with LTO: its configure script scans the assembly to detect the format of long double, which gives the wrong answer with LTO and leads to a suboptimal configuration.
[Bug other/92681] New: PGO bootstrap is broken with --with-build-config=bootstrap-lto-lean is not training non-C++ frontends
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92681 Bug ID: 92681 Summary: PGO bootstrap is broken with --with-build-config=bootstrap-lto-lean is not training non-C++ frontends Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: other Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- This definitely leads to a suboptimal compile-time experience with Ada, Fortran, Go, etc.
[Bug tree-optimization/92711] New: GCC 10 libxul.so -fprofile-generate binary is 360MB while clang needs only 163MB.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92711 Bug ID: 92711 Summary: GCC 10 libxul.so -fprofile-generate binary is 360MB while clang needs only 163MB. Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- It seems that profiling became more expensive in GCC 10 compared to Clang or previous GCC releases. The Clang binary is here: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/H_iSouCVTha9mEw9y5XO5Q/runs/0/artifacts/public/build/target.tar.bz2 A more or less comparable GCC build is here: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/NOUqVShcSMaJn5j3g5nEYg/runs/0/artifacts/public/build/target.tar.bz2 It also seems that profile streaming is slower in the GCC build. This matters because Firefox forks multiple times on startup and again when creating a new tab, and that triggers profile data streamout.
[Bug tree-optimization/92711] GCC 10 libxul.so -fprofile-generate binary is 360MB while clang needs only 163MB.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92711 --- Comment #1 from Jan Hubicka --- https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/ObkoHsHHSriQdU0Twc12Wg/runs/0/artifacts/public/build/target.tar.bz2 This is a GCC 9 build: 310MB, so still a lot bigger than Clang, but better than GCC 10.
[Bug tree-optimization/92711] GCC 10 libxul.so -fprofile-generate binary is 360MB while clang needs only 163MB.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92711 Jan Hubicka changed: What|Removed |Added CC||mliska at suse dot cz Blocks||45375 --- Comment #2 from Jan Hubicka --- Actually, what I thought was a GCC 9 build is in fact a GCC 10 build. It seems that today's profile fixes made the binary noticeably smaller, which is promising, but it is still very large. Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375 [Bug 45375] [meta-bug] Issues with building Mozilla (i.e. Firefox) with LTO
[Bug tree-optimization/92711] GCC 10 libxul.so -fprofile-generate binary is 360MB while clang needs only 163MB.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92711 --- Comment #3 from Jan Hubicka --- A proper GCC 9 -fprofile-generate build is 296MB: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/aMGsffWPQ1qzjgj4LIqcwQ/runs/0/artifacts/public/build/target.tar.bz2 So that is about a 5% regression compared to GCC 9.
[Bug ipa/92737] New: cgraph_node and varpool_node needs explicit constructor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92737 Bug ID: 92737 Summary: cgraph_node and varpool_node needs explicit constructor Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org CC: marxin at gcc dot gnu.org Target Milestone: --- cgraph_node and varpool_node are non-PODs, but they are still allocated via alloc_cleared, and we rely on various flags being set to 0.
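A minimal sketch of what an explicit constructor buys, assuming a hypothetical `node` type standing in for cgraph_node/varpool_node and std::malloc standing in for GCC's allocator (neither is GCC's code): default member initializers document the "all flags start at 0" state, and placement new runs the constructor on the raw storage instead of treating cleared memory as a live object.

```cpp
#include <cassert>
#include <cstdlib>
#include <new>

// Hypothetical non-POD node; the virtual destructor is only here to make
// the example type non-POD.  Defaults are stated in the type itself
// rather than relying on the allocator to zero the memory.
struct node
{
  virtual ~node () = default;
  bool analyzed = false;
  bool definition = false;
  int order = 0;
};

// Hypothetical allocator: obtain raw storage, then construct in place.
static node *
create_node ()
{
  void *mem = std::malloc (sizeof (node));
  if (!mem)
    throw std::bad_alloc ();
  return new (mem) node ();
}

// Matching teardown: run the destructor, then release the storage.
static void
destroy_node (node *n)
{
  n->~node ();
  std::free (n);
}
```

The design point is that the initial state lives in the class definition, so it survives a change of allocator, whereas alloc_cleared only happens to produce the right values for the fields that should be zero.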
[Bug tree-optimization/92738] New: [10 regression] Large code size growth for -O2 binaries between 2019-05-19...2019-05-29
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92738 Bug ID: 92738 Summary: [10 regression] Large code size growth for -O2 binaries between 2019-05-19...2019-05-29 Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: ---
[Bug tree-optimization/92738] [10 regression] Large code size growth for -O2 binaries between 2019-05-19...2019-05-29
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92738 --- Comment #1 from Jan Hubicka --- This is seen on https://lnt.opensuse.org/db_default/v4/SPEC/graph?highlight_run=7361&plot.574=31.574.4
[Bug tree-optimization/92738] [10 regression] Large code size growth for -O2 binaries between 2019-05-19...2019-05-29
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92738 --- Comment #2 from Jan Hubicka --- https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=10.542.4&highlight_run=7354 shows a shorter range:
+2019-05-24 Jakub Jelinek
+
+ * tree-core.h (enum omp_clause_code): Add OMP_CLAUSE__CONDTEMP_.
+ * tree.h (OMP_CLAUSE_DECL): Use OMP_CLAUSE__CONDTEMP_ instead of
+ OMP_CLAUSE__REDUCTEMP_.
+ * tree.c (omp_clause_num_ops, omp_clause_code_name): Add
+ OMP_CLAUSE__CONDTEMP_.
+2019-05-19 Segher Boessenkool
+
+ * config/rs6000/constraints.md (define_register_constraint "wo"):
+ Delete.
+ * config/rs6000/rs6000.h (enum r6000_reg_class_enum): Delete
+ RS6000_CONSTRAINT_wo.
+ * config/rs6000/rs6000.c (rs6000_debug_reg_global): Adjust.
+ (rs6000_init_hard_regno_mode_ok): Adjust.
+ * config/rs6000/rs6000.md: Replace "wo" constraint by "wa" with "p9v".
+ * config/rs6000/altivec.md: Ditto.
+ * doc/md.texi (Machine Constraints): Adjust.
+
 2019-05-18 Iain Sandoe
It may be easy to bisect.
[Bug tree-optimization/92740] New: induct2 (from polyhedron) regresses 267% with -O2 -ftree-vectorize -ftree-slp-vectorize -fvect-cost-modes=dynamic compared to -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92740 Bug ID: 92740 Summary: induct2 (from polyhedron) regresses 267% with -O2 -ftree-vectorize -ftree-slp-vectorize -fvect-cost-modes=dynamic compared to -O2 Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- This is on zen2 hardware.
[Bug tree-optimization/92740] induct2 (from polyhedron) regresses 267% with -O2 -ftree-vectorize -ftree-slp-vectorize -fvect-cost-modes=dynamic compared to -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92740 --- Comment #1 from Jan Hubicka --- There is also a 75% regression on fft2 and 5% on rnflow2. The induct2 regression reproduces on Kaby Lake; fft2 and rnflow2 seem to be Zen-specific.
[Bug tree-optimization/92825] New: Unnecesary stack protection and missed SLP vectorization in Firefox's LightPixel.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92825 Bug ID: 92825 Summary: Unnecesary stack protection and missed SLP vectorization in Firefox's LightPixel. Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- Created attachment 47428 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47428&action=edit full testcase
uint32_t
DiffuseLightingSoftware::LightPixel(const Point3D& aNormal,
                                    const Point3D& aVectorToLight,
                                    uint32_t aColor) {
  Float dotNL = std::max(0.0f, aNormal.DotProduct(aVectorToLight));
  Float diffuseNL = mDiffuseConstant * dotNL;

  union {
    uint32_t bgra;
    uint8_t components[4];
  } color = {aColor};
  color.components[B8G8R8A8_COMPONENT_BYTEOFFSET_B] = umin(
      uint32_t(diffuseNL * color.components[B8G8R8A8_COMPONENT_BYTEOFFSET_B]),
      255U);
  color.components[B8G8R8A8_COMPONENT_BYTEOFFSET_G] = umin(
      uint32_t(diffuseNL * color.components[B8G8R8A8_COMPONENT_BYTEOFFSET_G]),
      255U);
  color.components[B8G8R8A8_COMPONENT_BYTEOFFSET_R] = umin(
      uint32_t(diffuseNL * color.components[B8G8R8A8_COMPONENT_BYTEOFFSET_R]),
      255U);
  color.components[B8G8R8A8_COMPONENT_BYTEOFFSET_A] = 255;
  return color.bgra;
}
(full testcase attached)
Building with -O3 -fstack-protector-strong results in slower code with GCC 10 than with GCC 9 or Clang.
GCC produces:
       │ 04390e20 const&,
       │ _ZN7mozilla3gfx12_GLOBAL__N_124SpecularLightingSoftware10LightPixelERKNS0_12Point3DTypedINS0_12UnknownUnitsEfEES7_j():
  0.19 │ push %rbp
  0.60 │ pxor %xmm5,%xmm5
  0.05 │ mov %rsp,%rbp
  0.12 │ push %rbx
  0.65 │ sub $0x18,%rsp
  0.33 │ movss 0x4(%rdx),%xmm0
  0.10 │ movss (%rdx),%xmm1
  0.58 │ mov %fs:0x28,%rax
  0.03 │ mov %rax,-0x18(%rbp)
  0.22 │ xor %eax,%eax
  0.07 │ movss pw_32+0x1588,%xmm3
  1.58 │ addss 0x8(%rdx),%xmm3
  0.67 │ addss %xmm5,%xmm0
  0.23 │ addss %xmm5,%xmm1
       │ movaps %xmm0,%xmm2
  0.41 │ movaps %xmm1,%xmm4
  0.87 │ mulss %xmm0,%xmm2
  0.28 │ mulss %xmm1,%xmm4
  3.71 │ addss %xmm2,%xmm4
  0.14 │ movaps %xmm3,%xmm2
  0.04 │ mulss %xmm3,%xmm2
  1.99 │ addss %xmm2,%xmm4
  0.15 │ movss 0x4(%rsi),%xmm2
  9.39 │ sqrtss %xmm4,%xmm4
  8.90 │ divss %xmm4,%xmm0
  2.10 │ divss %xmm4,%xmm3
  1.08 │ mulss %xmm0,%xmm2
  0.01 │ movss 0x8(%rsi),%xmm0
while Clang produces:
Percent│ _ZN7mozilla3gfx12_GLOBAL__N_124SpecularLightingSoftware10LightPixelERKNS0_12Point3DTypedINS0_12UnknownUnitsEfEES7_j():
  0.11 │ xorps %xmm0,%xmm0
  0.83 │ movss 0x4(%rdx),%xmm1
  3.29 │ addss %xmm0,%xmm1
  0.03 │ movss (%rdx),%xmm2
  0.08 │ movss 0x8(%rdx),%xmm3
  0.04 │ unpcklps %xmm2,%xmm3
  0.59 │ movss mozilla::gfx::ConvertComponentTransferFunctionToFilter(mozilla::gfx::ComponentTransferAttributes const&, int, int, mozilla::gfx::DrawTarget*, RefPtr
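Per GCC's documentation, -fstack-protector-strong also protects functions that contain local array definitions, and the `mov %fs:0x28,%rax` / `mov %rax,-0x18(%rbp)` pair in the GCC listing is the canary setup. One plausible trigger in the LightPixel code quoted above (an assumption, not confirmed in the report) is the union's `uint8_t components[4]` member. A sketch of the same color update using shifts instead of a byte array, assuming the BGRA offsets are B=0, G=1, R=2, A=3 as in a little-endian layout; `scale_bgra` is a hypothetical name and the real B8G8R8A8_COMPONENT_BYTEOFFSET_* values are not shown in the report:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using std::uint32_t;

// Hypothetical shift-based version of the color update in LightPixel.
// ASSUMES byte offsets B=0, G=1, R=2, A=3.  Like the original, it
// assumes diffuseNL * c is representable in uint32_t.
static uint32_t
scale_bgra (uint32_t color, float diffuseNL)
{
  uint32_t result = 0xFFu << 24;               // A channel forced to 255
  for (int shift = 0; shift < 24; shift += 8)  // B, G and R channels
    {
      uint32_t c = (color >> shift) & 0xFFu;
      uint32_t v = std::min<uint32_t> (uint32_t (diffuseNL * c), 255u);
      result |= v << shift;
    }
  return result;
}
```

Whether GCC then drops the canary or SLP-vectorizes the three channels is not established by this report; the sketch only removes the local array from the source.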
[Bug tree-optimization/92834] New: misssed SLP vectorization in LightPixel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92834 Bug ID: 92834 Summary: misssed SLP vectorization in LightPixel Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- Created attachment 47431 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47431&action=edit simplified testcase Clang is able to vectorize LightPixel, which leads to an improvement of about 10% in the rasterflood-svg Firefox benchmark.
[Bug tree-optimization/92825] Unnecesary stack protection in Firefox's LightPixel.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92825 Jan Hubicka changed: What|Removed |Added Summary|Unnecesary stack protection and missed SLP vectorization in Firefox's LightPixel.|Unnecesary stack protection in Firefox's LightPixel. --- Comment #2 from Jan Hubicka --- I have filed a separate bug for the SLP issue so that we do not mix multiple things in one PR. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92834
[Bug ipa/92809] [10 regression] error: calls_comdat_local is set outside of a comdat group
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92809 Jan Hubicka changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |WORKSFORME --- Comment #3 from Jan Hubicka --- This one works for me and should be fixed now by 2019-12-05 Jan Hubicka * ipa-inline-transform.c (inline_call): Fix maintenatnce of comdat_local
[Bug tree-optimization/92834] misssed SLP vectorization in LightPixel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92834 --- Comment #2 from Jan Hubicka --- Created attachment 47436 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47436&action=edit Clang assembly from perf It is a clang 9 build: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7d7ee02817ab1ea39a6415862ab7889f5e416598&selectedJob=278948829 It has full logs and the binary, too. /builds/worker/fetches/sccache/sccache /builds/worker/fetches/clang/bin/clang++ -o Unified_cpp_gfx_2d2.o -c -flto=thin -I/builds/worker/workspace/build/src/obj-firefox/dist/system_wrappers -include /builds/worker/workspace/build/src/config/gcc_hidden.h -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -fstack-protector-strong -DMOZILLA_CLIENT -include /builds/worker/workspace/build/src/obj-firefox/mozilla-config.h -Qunused-arguments -Qunused-arguments -Wall -Wbitfield-enum-conversion -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wshadow-field-in-constructor-modified -Wsign-compare -Wtype-limits -Wunreachable-code -Wunreachable-code-return -Wwrite-strings -Wno-invalid-offsetof -Wclass-varargs -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wloop-analysis -Wc++1z-compat -Wc++2a-compat -Wcomma -Wimplicit-fallthrough -Werror=non-literal-null-conversion -Wstring-conversion -Wtautological-overlap-compare -Wtautological-unsigned-enum-zero-compare -Wtautological-unsigned-zero-compare -Wno-error=tautological-type-limit-compare -Wno-inline-new-delete -Wno-error=type-limits -Wno-error=pessimizing-move -Wno-error=nonnull -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=backend-plugin -Wno-error=return-std-move -Wno-error=atomic-alignment -Wformat -Wformat-security -Wno-gnu-zero-variadic-macro-arguments -Wno-unknown-warning-option -Wno-return-type-c-linkage -D_GLIBCXX_USE_CXX11_ABI=0 -fno-sized-deallocation -fno-aligned-new -fcrash-diagnostics-dir=/builds/worker/artifacts -fno-strict-aliasing -fno-strict-aliasing -fno-exceptions -fno-rtti -fno-exceptions 
-fno-math-errno -pthread -pipe -I/builds/worker/workspace/build/src/obj-firefox/dist/stl_wrappers -DNDEBUG=1 -DTRIMMED=1 -DUSE_SSE2 -DOS_POSIX=1 -DOS_LINUX=1 -DUSE_CAIRO -DMOZ2D_HAS_MOZ_CAIRO -DMOZ_ENABLE_FREETYPE -DSTATIC_EXPORTABLE_JS_API -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -I/builds/worker/workspace/build/src/gfx/2d -I/builds/worker/workspace/build/src/obj-firefox/gfx/2d -I/builds/worker/workspace/build/src/obj-firefox/ipc/ipdl/_ipdlheaders -I/builds/worker/workspace/build/src/ipc/chromium/src -I/builds/worker/workspace/build/src/ipc/glue -I/builds/worker/workspace/build/src/gfx/skia -I/builds/worker/workspace/build/src/gfx/skia/skia -I/builds/worker/workspace/build/src/obj-firefox/dist/include -I/builds/worker/workspace/build/src/obj-firefox/dist/include/nspr -I/builds/worker/workspace/build/src/obj-firefox/dist/include/nss -fPIC -g -Xclang -load -Xclang /builds/worker/workspace/build/src/obj-firefox/build/clang-plugin/libclang-plugin.so -Xclang -add-plugin -Xclang moz-check -O2 -fno-omit-frame-pointer -funwind-tables -Werror -Wno-error=shadow -I/builds/worker/workspace/build/src/obj-firefox/dist/include/cairo -I/usr/include/freetype2 -MD -MP -MF .deps/Unified_cpp_gfx_2d2.o.pp Unified_cpp_gfx_2d2.cpp
[Bug c++/92831] CWG1299 extend_ref_init_temps_1 punts on COND_EXPRs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92831 Jan Hubicka changed: What|Removed |Added CC||hubicka at gcc dot gnu.org --- Comment #7 from Jan Hubicka --- Thank you! I wonder if your fix could also come with an optional warning that would let us fix occurrences of this in Firefox, since requiring unreleased compilers is not cool.
[Bug tree-optimization/92860] New: [8,9,10 regression] Global flags affected by -O settings are clobbered by optimize attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92860 Bug ID: 92860 Summary: [8,9,10 regression] Global flags affected by -O settings are clobbered by optimize attribute Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- Hi, the following testcase:
void linker_error();
__attribute__ ((optimize("-O0")))
int a ()
{
}
static int remove_me ()
{
  linker_error ();
}
void main()
{
}
builds with GCC 6 but not with GCC 8, GCC 9 and GCC 10:
hubicka@lomikamen-jh:/aux/hubicka/trunk4/gcc$ gcc -O2 t.c
hubicka@lomikamen-jh:/aux/hubicka/trunk4/gcc$ /aux/hubicka/trunk-install/bin/gcc -O2 t.c
/usr/local/bin/ld: /tmp/cckSFE5R.o: in function `remove_me':
t.c:(.text+0x17): undefined reference to `linker_error'
collect2: error: ld returned 1 exit status
The problem is that while processing the optimize attribute for a, we overwrite flag_toplevel_reorder, which is affected by the -O settings but is not marked as Optimization. I suppose there are other cases like this.
[Bug tree-optimization/92860] [8,9,10 regression] Global flags affected by -O settings are clobbered by optimize attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92860 --- Comment #1 from Jan Hubicka --- Author: hubicka Date: Sun Dec 8 13:50:32 2019 New Revision: 279089 URL: https://gcc.gnu.org/viewcvs?rev=279089&root=gcc&view=rev Log: PR tree-optimization/92860 * common.opt (fprofile-reorder-functions, ftoplevel-reorder): Add Optimization flag. Modified: trunk/gcc/ChangeLog trunk/gcc/common.opt
[Bug tree-optimization/92860] [8/9/10 regression] Global flags affected by -O settings are clobbered by optimize attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92860 Jan Hubicka changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2019-12-08 Summary|[8,9,10 regression] Global flags affected by -O settings are clobbered by optimize attribute|[8/9/10 regression] Global flags affected by -O settings are clobbered by optimize attribute Ever confirmed|0 |1 --- Comment #2 from Jan Hubicka --- Partly fixed on trunk. I think there are other flags/params missing the Optimization attribute that behave the same way.
[Bug tree-optimization/92924] New: [10 regression] reproducible indirect call profile merging causes 80% slowdown in Firefox pref-reftest-singletons id-getter microbenchmarks
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92924 Bug ID: 92924 Summary: [10 regression] reproducible indirect call profile merging causes 80% slowdown in Firefox pref-reftest-singletons id-getter microbenchmarks Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- During the train run, in firefox2019-release-9test/dom/bindings, the function
;; Function mozilla::dom::binding_detail::GenericGetter (_ZN7mozilla3dom14binding_detail13GenericGetterINS1_16NormalThisPolicyENS1_15ThrowExceptionsEEEbP9JSContextjPN2JS5ValueE, funcdef_no=39965, decl_uid=943222, cgraph_uid=24044, symbol_order=25045)
calls the function get_id most of the time. With GCC 9 we get:
Indirect call value:939751711 match:139135227 all:140993325.
Indirect call -> direct call from other module getter_18=> 939751711 (will resolve only with LTO)
With GCC 10 we get:
Trying transformations on stmt ok_20 = getter_18 (cx_131(D), D.1007269, self_129, D.1007259);
Indirect call counter all: 140957778, values: [2135000278:-1], [401302964:3804], [1203869319:12375], [429856732:6018].
So the profile completely omits get_id and we fail to inline it. This has quite a large performance impact on Firefox in general, since it seems to affect DOM tree manipulation quite badly.
[Bug tree-optimization/92924] [10 regression] reproducible indirect call profile merging causes 80% slowdown in Firefox pref-reftest-singletons id-getter microbenchmarks
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92924 Jan Hubicka changed: What|Removed |Added CC||hubicka at gcc dot gnu.org, ||mliska at suse dot cz --- Comment #1 from Jan Hubicka --- This is caused by Martin's TOP_N_PROFILE work.
[Bug bootstrap/92653] [10 Regression] PGO bootstrap is broken with --with-build-config=bootstrap-lto-lean
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92653 Jan Hubicka changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #2 from Jan Hubicka --- The underlying updating issue was fixed last week.
[Bug rtl-optimization/92925] New: RTl expansion throws away alignment info
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92925 Bug ID: 92925 Summary: RTl expansion throws away alignment info Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- Hi, this testcase originally started as an attempt to produce a self-contained reproducer for an ipa-cp bug. The problem is that RTL expansion is too limited and refuses to produce aligned moves for me.
struct a {long a1; long a2;};
struct b {long b; struct a a[10];};
struct c {long c; struct b b;__int128 e;};
int l;
__attribute__ ((noinline))
static void set(struct b *bptr)
{
  for (int i=0;i<l;i+=2)
    bptr->a[i]=(struct a){};
}
test ()
{
  struct c c;
  set (&c.b);
}
Here ipa-cp propagates that BPTR is always aligned to 16 with misalignment 8. This should let expansion use movaps for the "bptr->a[i]=(struct a){};" constructions, but it does not:
set:
.LFB0:
        .cfi_startproc
        movl    l(%rip), %ecx
        testl   %ecx, %ecx
        jle     .L1
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L3:
        movslq  %eax, %rdx
        pxor    %xmm0, %xmm0
        addl    $2, %eax
        salq    $4, %rdx
        movups  %xmm0, 8(%rdi,%rdx)
        cmpl    %ecx, %eax
        jl      .L3
.L1:
Overall the loop codegen is quite bad.
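The alignment fact that ipa-cp propagates here can also be stated by hand with GCC's three-argument __builtin_assume_aligned (pointer, alignment, misalignment). A minimal sketch, where `set_assumed` is a hypothetical variant of `set` taking the bound as a parameter: with bptr 8 bytes past a 16-byte boundary, each bptr->a[i] (at offset 8 inside struct b) is 16-byte aligned. Whether expansion then emits movaps is exactly what this report questions, so treat it as a way to express the assumption, not a known workaround.

```c
#include <assert.h>

struct a { long a1; long a2; };
struct b { long b; struct a a[10]; };
struct c { long c; struct b b; __int128 e; };  /* 16-byte aligned via __int128 */

/* Hypothetical variant of set(): tell the compiler that bptr is 8 bytes
   past a 16-byte boundary, the same fact ipa-cp derives for &c.b.
   Passing a pointer that violates this is undefined behavior.  */
static void
set_assumed (struct b *bptr, int n)
{
  struct b *p = __builtin_assume_aligned (bptr, 16, 8);
  for (int i = 0; i < n; i += 2)
    p->a[i] = (struct a){0};
}
```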
[Bug tree-optimization/92924] [10 regression] reproducible indirect call profile merging causes 80% slowdown in Firefox pref-reftest-singletons id-getter microbenchmarks
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92924 --- Comment #2 from Jan Hubicka --- Increasing the number of entries does not seem to help: Indirect call counter all: 140960933, values: [429856732:-1], [484692916:1218], [1203869319:12593], [245854587:8179], [1829590552:52], [401302964:7072], [839575652:1422], [2041842690:854], [1646699888:428], [1259057892:1485], [1777186207:1066], [901349086:1276], [2135000278:93], [1926702874:1281], [2135000278:108], [717405103:513].
[Bug tree-optimization/92924] [10 regression] reproducible indirect call profile merging causes 80% slowdown in Firefox pref-reftest-singletons id-getter microbenchmarks
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92924 Jan Hubicka changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2019-12-13 Ever confirmed|0 |1 --- Comment #3 from Jan Hubicka --- I hacked libgcov to make merging no longer reproducible:
Index: libgcov-merge.c
===
--- libgcov-merge.c (revision 279167)
+++ libgcov-merge.c (working copy)
@@ -130,12 +130,27 @@ merge_topn_values_set (gcov_type *counte
             }
         }
+      if (j == GCOV_TOPN_VALUES)
+        {
+          int min = 0;
+          for (j = 1; j < GCOV_TOPN_VALUES; j++)
+            if (counters[2 * j + 1] < counters[2 * min + 1])
+              min = j;
+          if (counters[2 * min + 1] < read_counters[2 * i + 1])
+            {
+              counters[2 * min] = read_counters[2 * i];
+              counters[2 * min + 1] = read_counters[2 * i + 1];
+            }
+        }
+
+#if 0
       /* We haven't found a slot, bail out.  */
       if (j == GCOV_TOPN_VALUES)
         {
           counters[1] = -1;
           return;
         }
+#endif
     }
 }
With this I now get:
Trying transformations on stmt ok_20 = getter_18 (cx_131(D), D.1007269, self_129, D.1007259);
Indirect call counter all: 140964179, values: [939751711:140005207], [2105057161:149880], [708289787:11], [484692916:60283], [1777186207:5], [245854587:38900], [1967741779:28458], [1785108787:23272], [429856732:17057], [401533446:13488], [1203869319:10772], [183365365:9606], [401302964:7243], [824316005:3379], [758688187:2121], [1528155396:1983].
/aux/hubicka/firefox-2019-2/dom/bindings/BindingUtils.cpp:3035:19: missed: Indirect call -> direct call from other module getter_18=> 939751711 (will resolve only with LTO)
So the histogram of destinations is indeed greatly dominated by one destination, but there are very many others (not all are listed, since I started dropping them). One way to make reproducible merging better is to drop destinations with small trip counts before merging, but I am not sure it would help everywhere.
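The two merge policies above can be compared on a simplified model of a top-N counter: a fixed array of (value, count) pairs, where a count of -1 in slot 0 marks the whole counter as invalid (as in the `[2135000278:-1]` entries of the dumps). This is NOT libgcov's code; `struct topn`, `topn_merge` and `TOPN` are hypothetical names. The reproducible policy invalidates the counter once no slot matches or is free; the experimental patch instead evicts the smallest slot, which keeps a late-arriving dominant target but makes the result depend on merge order.

```c
#include <assert.h>
#include <stdint.h>

#define TOPN 4  /* hypothetical; stands in for GCOV_TOPN_VALUES */

/* Simplified top-N counter: value/count pairs, count[0] == -1 = invalid.  */
struct topn { int64_t value[TOPN]; int64_t count[TOPN]; };

/* Merge one (value, count) pair.  EVICT == 0: reproducible policy, give
   up when no slot matches or is free.  EVICT != 0: replace the smallest
   slot if the incoming count is larger, as the patch above does.  */
static void
topn_merge (struct topn *t, int64_t value, int64_t count, int evict)
{
  if (t->count[0] == -1)
    return;                                   /* already invalid */
  for (int j = 0; j < TOPN; j++)
    {
      if (t->count[j] != 0 && t->value[j] == value)
        {
          t->count[j] += count;               /* same target: accumulate */
          return;
        }
      if (t->count[j] == 0)
        {
          t->value[j] = value;                /* free slot: take it */
          t->count[j] = count;
          return;
        }
    }
  if (!evict)
    {
      t->count[0] = -1;                       /* no slot: invalidate */
      return;
    }
  int min = 0;
  for (int j = 1; j < TOPN; j++)
    if (t->count[j] < t->count[min])
      min = j;
  if (t->count[min] < count)
    {
      t->value[min] = value;
      t->count[min] = count;
    }
}
```

Filling the counter with four rare targets and then merging one dominant target shows the difference: the reproducible policy loses everything, while eviction keeps the dominant target.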
[Bug tree-optimization/92924] [10 regression] reproducible indirect call profile merging causes 80% slowdown in Firefox pref-reftest-singletons id-getter microbenchmarks
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92924 --- Comment #4 from Jan Hubicka --- Looking into how the getter variable is determined (vp_35 is a function parameter):
_124 = MEM[(const struct Value *)vp_35(D)].asBits_;
_125 = _124 ^ 18446181123756130304;
_126 = (struct JSObject *) _125
...
_50 = MEM[(struct Function *)_126].jitinfo
...
getter_60 = _50->D.102800.getter;
ok_64 = getter_60 (cx_325(D), D.1007269, self_323, D.1007259)
It seems our jump functions would need a lot of work to handle this.
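The dump shows why this is hard for jump functions: the object pointer is not passed around directly but recovered from loaded bits by an XOR with the constant 18446181123756130304, which is 0xFFFE000000000000 (2^64 - 2^49). A small sketch of that step, assuming (not shown in the dump) that the constant is the tag SpiderMonkey's NaN-boxed JS::Value places in the high bits of object values; `OBJECT_TAG`, `box_object` and `unbox_object` are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: the XOR constant from the dump is an object tag in a
   NaN-boxed value.  18446181123756130304 == 0xFFFE000000000000.  */
#define OBJECT_TAG UINT64_C(0xFFFE000000000000)

/* Hypothetical: tag a 64-bit object address.  */
static uint64_t
box_object (uint64_t addr)
{
  return addr ^ OBJECT_TAG;
}

/* Hypothetical: the `_125 = _124 ^ 18446181123756130304` step.  */
static uint64_t
unbox_object (uint64_t bits)
{
  return bits ^ OBJECT_TAG;
}
```

To resolve `getter_60`, an IPA analysis would have to follow a load through vp_35, this XOR, and two further loads, which is far beyond what a simple pass-through or constant jump function describes.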
[Bug tree-optimization/93055] New: accumulation loops in stepanov_vector benchmark use more instruction level parpallelism
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93055 Bug ID: 93055 Summary: accumulation loops in stepanov_vector benchmark use more instruction level parpallelism Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- The stepanov_vector benchmark from https://gitlab.com/chriscox/CppPerformanceBenchmarks gets poor codegen in TestOneType. Built with -march=bdver1 -O3 (but the regression happens on Core too). Clang compiles the accumulation loops for testOneType as follows:
       │    vpxor  %xmm0,%xmm0,%xmm0
       │    vpxor  %xmm1,%xmm1,%xmm1
       │    vpxor  %xmm2,%xmm2,%xmm2
  0.05 │    vpxor  %xmm3,%xmm3,%xmm3
       │    data16 nopw %cs:0x0(%rax,%rax,1)
  6.95 │300:┌─→vpaddd 0x5f0(%rsp,%rcx,4),%xmm0,%xmm0
  0.05 │    │  vpaddd 0x600(%rsp,%rcx,4),%xmm1,%xmm1
  7.13 │    │  vpaddd 0x610(%rsp,%rcx,4),%xmm2,%xmm2
  0.16 │    │  vpaddd 0x620(%rsp,%rcx,4),%xmm3,%xmm3
       │    │  add    $0x10,%rcx
       │    │  cmp    $0x7dc,%rcx
  7.04 │    └──jne    300
  0.07 │    vpaddd %xmm0,%xmm1,%xmm0
  1.61 │    vpaddd %xmm0,%xmm2,%xmm0
       │    vpaddd %xmm0,%xmm3,%xmm0
       │    vpshuf $0x4e,%xmm0,%xmm1
  0.07 │    vpaddd %xmm1,%xmm0,%xmm0
  0.02 │    vpshuf $0xe5,%xmm0,%xmm1
while GCC 10 does:
       │1c0:   vxorps %xmm0,%xmm0,%xmm0
       │       mov    %rbx,%rax
       │       nop
  2.25 │1d0:┌─→vpaddd (%rax),%xmm0,%xmm0
  0.01 │    │  lea    0x2100(%rsp),%rdi
  0.95 │    │  add    $0x10,%rax
  1.04 │    │  cmp    %rax,%rdi
  2.24 │    └──jne    1d0
Which runs slower:
test    description                                             absolute   operations   ratio with
number                                                          time       per second   test0
0       "int32_t accumulate pointer verify2"                    1.06 sec   12440.17 M   1.00
1       "int32_t accumulate vector iterator"                    1.06 sec   12458.15 M   1.00
2       "int32_t accumulate pointer reverse reverse"            1.06 sec   12440.34 M   1.00
3       "int32_t accumulate vector reverse_iterator reverse"    1.05 sec   12602.74 M   0.99
4       "int32_t accumulate vector iterator reverse reverse"    1.04 sec   12749.27 M   0.98
5       "int32_t accumulate array Riterator reverse reverse"    1.06 sec   12486.26 M   1.00
Total absolute time for int32_t Vector Accumulate: 6.32 sec
int32_t Vector Accumulate Penalty: 0.99
compared to:
test    description                                             absolute   operations   ratio with
number                                                          time       per second   test0
0       "int32_t accumulate pointer verify2"                    2.29 sec   5773.60 M    1.00
1       "int32_t accumulate vector iterator"                    2.27 sec   5806.96 M    0.99
2       "int32_t accumulate pointer reverse reverse"            2.26 sec   5830.72 M    0.99
3       "int32_t accumulate vector reverse_iterator reverse"    2.27 sec   5827.45 M    0.99
4       "int32_t accumulate vector iterator reverse reverse"    2.27 sec   5821.29 M    0.99
5       "int32_t accumulate array Riterator reverse reverse"    2.27 sec   5826.58 M    0.99
Total absolute time for int32_t Vector Accumulate: 13.62 sec
int32_t Vector Accumulate Penalty: 0.99
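The Clang loop keeps four independent vector accumulators (%xmm0 to %xmm3) and combines them after the loop, while the GCC loop has a single accumulator, so every vpaddd waits on the previous one. A scalar sketch of the same transformation at the source level; `sum_one` and `sum_four` are hypothetical names, and in the compiled benchmark each accumulator would be a vector rather than a scalar:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add depends on the previous one, so the loop is
   bound by the latency of a single dependency chain.  */
static uint32_t
sum_one (const uint32_t *p, size_t n)
{
  uint32_t s = 0;
  for (size_t i = 0; i < n; i++)
    s += p[i];
  return s;
}

/* Four independent accumulators, analogous to %xmm0..%xmm3 in the Clang
   listing, combined at the end.  Unsigned wrap-around makes the
   reassociation exact.  */
static uint32_t
sum_four (const uint32_t *p, size_t n)
{
  uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4)
    {
      s0 += p[i];
      s1 += p[i + 1];
      s2 += p[i + 2];
      s3 += p[i + 3];
    }
  for (; i < n; i++)          /* leftover elements */
    s0 += p[i];
  return (s0 + s1) + (s2 + s3);
}
```

The four chains can issue in parallel, which is the instruction-level parallelism the summary asks for; the result is identical because integer addition can be reassociated freely.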