[Bug ipa/92372] [10 Regression] ICE in ipa_update_overall_fn_summary at gcc/ipa-fnsummary.c:3671 since r277780

2020-03-19 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92372

Jan Hubicka  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #9 from Jan Hubicka  ---
Fixed.

[Bug ipa/93351] [10 Regression] ICE in ipa_update_overall_fn_summary at gcc/ipa-fnsummary.c:4014

2020-03-19 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93351

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #2 from Jan Hubicka  ---
This crash is also due to the flatten attribute on an alias.
It takes a really long time to build since it is an inline bomb.
It produces tons of template instantiations and then flattens them.
Template instantiations consume 2GB of memory, inlining 3GB.
It would be interesting to check if clang behaves better, but it does not like
the preprocessed file.
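
For readers unfamiliar with the attribute, here is a minimal sketch of what
flatten requests. It is not the testcase from this PR, and all function names
are made up for illustration. As I understand the GCC documentation, flatten
asks that every call inside the marked function be inlined where possible.
The crash discussed here involved the attribute ending up on an alias rather
than on a function definition, which this sketch deliberately avoids.

```cpp
#include <cassert>

// Hypothetical helpers; the names are illustrative only.
static int square(int x) { return x * x; }
static int sum_of_squares(int a, int b) { return square(a) + square(b); }

// flatten asks the compiler to inline, where possible, every call made
// inside this function (here sum_of_squares and, in turn, square).
__attribute__((flatten)) int hypot_squared(int a, int b)
{
  return sum_of_squares(a, b);
}

// The attribute affects code generation only, not the computed value.
void flatten_demo()
{
  assert(hypot_squared(3, 4) == 25);
}
```

With deep template instantiation chains the same request can multiply code
size quickly, which is presumably what "inline bomb" refers to above.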

*** This bug has been marked as a duplicate of bug 92372 ***

[Bug ipa/92372] [10 Regression] ICE in ipa_update_overall_fn_summary at gcc/ipa-fnsummary.c:3671 since r277780

2020-03-19 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92372

--- Comment #10 from Jan Hubicka  ---
*** Bug 93351 has been marked as a duplicate of this bug. ***

[Bug ipa/93369] [10 regression] g++.dg/lto/pr64076 fails

2020-03-19 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93369

--- Comment #15 from Jan Hubicka  ---
The testcase has an ODR violation that makes comdat groups go out of sync. So I
guess it is just about finding a way to keep the verifier from ICEing.
With release settings the testcase will, however, quietly compile, so I do not
think this is a release blocker (P1).

[Bug ipa/94202] [10 Regression] ICE in do_estimate_edge_time, at ipa-inline-analysis.c:222

2020-03-19 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94202

Jan Hubicka  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Jan Hubicka  ---
Fixed. Probably not important enough to backport even though the bug is present
in all active branches.

[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567

2020-03-20 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621

Jan Hubicka  changed:

   What|Removed |Added

 CC||mjambor at suse dot cz

--- Comment #3 from Jan Hubicka  ---
The testcase builds for me now, but this is Martin's code (apparently checking
that we did not forget to apply param adjustments)
Martin, was this fixed?

Honza

[Bug ipa/93347] [10 Regression] ICE: verify_cgraph_node failed (error: calls_comdat_local is set outside of a comdat group)

2020-03-20 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93347

Jan Hubicka  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #3 from Jan Hubicka  ---
Fixed. I noticed that some of the tests are not devirtualized, so we may move
that into a new PR.

[Bug c++/94243] Missed C++ front-end devirtualizations from Clang testsuite

2020-03-20 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94243

Jan Hubicka  changed:

   What|Removed |Added

 CC||jason at redhat dot com

--- Comment #1 from Jan Hubicka  ---
Jason,
I wonder if those are all valid transformations?
Honza

[Bug c++/94243] New: Missed C++ front-end devirtualizations from Clang testsuite

2020-03-20 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94243

Bug ID: 94243
   Summary: Missed C++ front-end devirtualizations from Clang
testsuite
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

While working on PR93347 I noticed that we do not devirtualize the following
testcases that clang's testsuite tests to be devirtualized:

namespace Test2a {
  struct A {
virtual ~A() final {}
virtual int f();
  };

  // CHECK-LABEL: define i32 @_ZN6Test2a1fEPNS_1AE
  int f(A *a) {
// CHECK: call i32 @_ZN6Test2a1A1fEv
return a->f();
  }
}

Here I guess the final destructor makes the whole class final?
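
A minimal sketch of why that guess looks plausible, assuming my reading of the
rules on final is right: any class derived from A declares (implicitly or
explicitly) a destructor that overrides A's virtual destructor, and overriding
a final function is ill-formed. The namespace name and the body of A::f are
invented for the example.

```cpp
#include <cassert>

namespace Sketch2a {
  struct A {
    virtual ~A() final {}
    virtual int f() { return 42; }   // body invented for the sketch
  };

  // Uncommenting this is expected to be rejected: the implicit ~D()
  // would override the final destructor A::~A().
  // struct D : A {};

  // If no derived class can exist, a->f() can only reach A::f.
  int f(A *a) { return a->f(); }
}
```

So although A is not declared final itself, no object of a derived type can be
created, and the call through a->f() has a single possible target.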

namespace Test4 {
  struct A {
virtual void f();
virtual int operator-();
  };

  struct B final : A {
virtual void f();
virtual int operator-();
  };

  // CHECK-LABEL: define void @_ZN5Test41fEPNS_1BE
  void f(B* d) {
// CHECK: call void @_ZN5Test41B1fEv
static_cast<A*>(d)->f();
// CHECK: call i32 @_ZN5Test41BngEv
-static_cast<A&>(*d);
  }
}

Here I am not sure. I think parameter d may point to an instance of struct A,
so is it Clang's bug to devirtualize?
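
A small sketch of the other side of that question, with invented function
bodies and names: d has static type B*, and B is final, so any object that d
validly points to should have dynamic type B. Converting to A* changes only
the static type. If that reading is right, calling B::f directly would be a
valid transformation.

```cpp
#include <cassert>

namespace Sketch4 {
  struct A {
    virtual int f() { return 1; }
    virtual int operator-() { return 10; }
    virtual ~A() {}
  };

  struct B final : A {
    int f() override { return 2; }
    int operator-() override { return 20; }
  };

  // The casts only change the static type; dispatch still reaches B,
  // the dynamic type of the object d points to.
  int call_f(B *d) { return static_cast<A *>(d)->f(); }
  int call_neg(B *d) { return -static_cast<A &>(*d); }
}
```

Calling call_f and call_neg on a B object returns the values from B's
overriders, which is the behaviour a devirtualized call would also produce.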

namespace Test5 {
  struct A {
virtual void f();
virtual int operator-();
  };

  struct B : A {
virtual void f();
virtual int operator-();
  };

  struct C final : B {
  };

  // CHECK-LABEL: define void @_ZN5Test51fEPNS_1CE
  void f(C* d) {
// FIXME: It should be possible to devirtualize this case, but that is
// not implemented yet.
// CHECK: getelementptr
// CHECK-NEXT: %[[FUNC:.*]] = load
// CHECK-NEXT: call void %[[FUNC]]
static_cast<A*>(d)->f();
  }
  // CHECK-LABEL: define void @_ZN5Test53fopEPNS_1CE
  void fop(C* d) {
// FIXME: It should be possible to devirtualize this case, but that is
// not implemented yet.
// CHECK: getelementptr
// CHECK-NEXT: %[[FUNC:.*]] = load
// CHECK-NEXT: call i32 %[[FUNC]]
-static_cast<A&>(*d);
  }
}

This seems similar to me.

namespace Test7 {
  struct foo {
virtual void g() {}
  };

  struct bar {
virtual int f() { return 0; }
  };

  struct zed final : public foo, public bar {
int z;
virtual int f() {return z;}
  };

  // CHECK-LABEL: define i32 @_ZN5Test71fEPNS_3zedE
  int f(zed *z) {
// CHECK: alloca
// CHECK-NEXT: store
// CHECK-NEXT: load
// CHECK-NEXT: call i32 @_ZN5Test73zed1fEv
// CHECK-NEXT: ret
return static_cast<bar*>(z)->f();
  }
}

namespace Test8 {
  struct A { virtual ~A() {} };
  struct B {
int b;
virtual int foo() { return b; }
  };
  struct C final : A, B {  };
  // CHECK-LABEL: define i32 @_ZN5Test84testEPNS_1CE
  int test(C *c) {
// CHECK: %[[THIS:.*]] = phi
// CHECK-NEXT: call i32 @_ZN5Test81B3fooEv(%"struct.Test8::B"* %[[THIS]])
return static_cast<B*>(c)->foo();
  }
}

namespace Test9 {
  struct A {
int a;
  };
  struct B {
int b;
  };
  struct C : public B, public A {
  };
  struct RA {
virtual A *f() {
  return 0;
}
virtual A *operator-() {
  return 0;
}
  };
  struct RC final : public RA {
virtual C *f() {
  C *x = new C();
  x->a = 1;
  x->b = 2;
  return x;
}
virtual C *operator-() {
  C *x = new C();
  x->a = 1;
  x->b = 2;
  return x;
}
  };
  // CHECK: define {{.*}} @_ZN5Test91fEPNS_2RCE
  A *f(RC *x) {
// FIXME: It should be possible to devirtualize this case, but that is
// not implemented yet.
// CHECK: load
// CHECK: bitcast
// CHECK: [[F_PTR_RA:%.+]] = bitcast
// CHECK: [[VTABLE:%.+]] = load {{.+}} [[F_PTR_RA]]
// CHECK: [[VFN:%.+]] = getelementptr inbounds {{.+}} [[VTABLE]], i{{[0-9]+}} 0
// CHECK-NEXT: %[[FUNC:.*]] = load {{.+}} [[VFN]]
return static_cast<RA*>(x)->f();
  }
  // CHECK: define {{.*}} @_ZN5Test93fopEPNS_2RCE
  A *fop(RC *x) {
// FIXME: It should be possible to devirtualize this case, but that is
// not implemented yet.
// CHECK: load
// CHECK: bitcast
// CHECK: [[F_PTR_RA:%.+]] = bitcast
// CHECK: [[VTABLE:%.+]] = load {{.+}} [[F_PTR_RA]]
// CHECK: [[VFN:%.+]] = getelementptr inbounds {{.+}} [[VTABLE]], i{{[0-9]+}} 1
// CHECK-NEXT: %[[FUNC:.*]] = load {{.+}} [[VFN]]
// CHECK-NEXT: = call {{.*}} %[[FUNC]]
return -static_cast<RA&>(*x);
  }
}

namespace Test10 {
  struct A {
virtual int f();
  };

  struct B : A {
int f() final;
  };

  // CHECK-LABEL: define i32 @_ZN6Test101fEPNS_1BE
  int f(B *b) {
// CHECK: call i32 @_ZN6Test101B1fEv
return static_cast<A*>(b)->f();
  }
}

[Bug lto/91028] [10 Regression] g++.dg/lto/alias-2 FAILs with -fno-use-linker-plugin

2020-03-20 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91028

Jan Hubicka  changed:

   What|Removed |Added

   Assignee|hubicka at gcc dot gnu.org |unassigned at gcc dot gnu.org
 Status|ASSIGNED|WAITING

--- Comment #3 from Jan Hubicka  ---
I believe this was fixed a while ago by adding the loop. It no longer fails
with -fno-use-linker-plugin. Is it OK on Solaris?

[Bug ipa/62051] [8/9/10 Regression] Undefined reference to vtable with -O2 and -fdevirtualize-speculatively

2020-03-21 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62051

Jan Hubicka  changed:

   What|Removed |Added

   Target Milestone|8.5 |11.0

--- Comment #23 from Jan Hubicka  ---
This is a bit of a grey area of what we can/cannot refer to in the presence of
visibilities, and I hope codebases have now adapted to GCC behaviour.  I think
we could delay this post GCC10, so re-targeting.

[Bug tree-optimization/91322] [10 regression] alias-4 test failure

2020-04-03 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91322

--- Comment #8 from Jan Hubicka  ---
Do we have compile farm machine where this can be reproduced?

[Bug ipa/93369] [10 regression] g++.dg/lto/pr64076 fails

2020-04-04 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93369

--- Comment #17 from Jan Hubicka  ---
Note that to fully fix the problem we need to resolve the way aliases work.
In this case the ODR violation makes one COMDAT section contain only the ctor,
while the other contains the ctor and its thunk.  The first COMDAT wins, which
makes the thunk call an alias of a symbol prevailed by a different COMDAT.
This still works w/o LTO, and to imitate what happens in the linker correctly
we need the ability to load both constructors:

https://gcc.gnu.org/pipermail/gcc-patches/2020-March/542733.html

For invalid code like this that does not matter much, but the patch also has a
valid testcase.

I can, however, also patch around and silence the verifier ICE, but it would be
just a symptomatic workaround.

[Bug tree-optimization/91322] [10 regression] alias-4 test failure

2020-04-04 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91322

--- Comment #11 from Jan Hubicka  ---
The problem is that on ARM sizeof (short) == sizeof (int)
and LTO will glob all short and int pointers together.  So this is missed
optimization only.

We do this globbing sort of by design. For GCC11 I plan to refine type merging
again a bit, but until then we could either xfail this testcase or change int to
long, which is 4 bytes.

Not a release blocker though.

I would welcome it if someone could test the testcase adjustment (I was doing
LTO by hand):

diff --git a/gcc/testsuite/g++.dg/lto/alias-4_0.C b/gcc/testsuite/g++.dg/lto/alias-4_0.C
index 410c3140baf..0ab12adef5b 100644
--- a/gcc/testsuite/g++.dg/lto/alias-4_0.C
+++ b/gcc/testsuite/g++.dg/lto/alias-4_0.C
@@ -5,7 +5,7 @@ short *ptr_init, **ptr=&ptr_init;

 __attribute__ ((used))
 struct a {
-  int *aptr;
+  long *aptr;
 } a, *aptr=&a;

 void

[Bug ipa/93369] [10 regression] g++.dg/lto/pr64076 fails

2020-04-09 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93369

--- Comment #19 from Jan Hubicka  ---
The reason why we get a link failure is that we treat mismatched comdats
differently from the linker.  While the linker chooses the comdat that wins and
eliminates the other one, we keep the other symbol and end up compiling it,
which leads to interesting issues with a "half comdat" that I am aiming to solve
with the patch for proper handling of aliases.

I think updating the testcase with -shared is the way to go for this P1, and we
can discuss the alias issue later (probably for 10.2, since it is a bit involved
and very old).

Honza

[Bug tree-optimization/91322] [10 regression] g++.dg/lto/alias-4_0.C test failure

2020-04-09 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91322

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #17 from Jan Hubicka  ---
So this turned out to be ODR-based TBAA being disabled for this struct, since on
ARM the builtin va_list type has the same structure.
I fixed the failure by adjusting the structure, and in the next stage1 we can
make ODR TBAA not give up in this case.

[Bug middle-end/94539] [10 Regression] gcc.dg/alias-14.c fails on gcc 10, succeeds on gcc 9, when turned into an execution test

2020-04-09 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94539

--- Comment #2 from Jan Hubicka  ---
Hmm, the testcase is mine so I will take a look (and make it dg-do-run :)
Honza

[Bug gcov-profile/93401] [9 regression] It is no longer possible to use -fprofile-generate= on setups with different instrumentation and feedback directories

2020-04-16 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93401

Jan Hubicka  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
Summary|[9/10 regression] It is no  |[9 regression] It is no
   |longer possible to use  |longer possible to use
   |-fprofile-generate= on |-fprofile-generate= on
   |setups with different   |setups with different
   |instrumentation and |instrumentation and
   |feedback directories|feedback directories
 Resolution|--- |FIXED

--- Comment #14 from Jan Hubicka  ---
Resolved on 10 so far. It may make sense to backport this to 9 and possibly
earlier branches.

[Bug c++/94955] New: ICE in to_wide

2020-05-05 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94955

Bug ID: 94955
   Summary: ICE in to_wide
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Created attachment 48454
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48454&action=edit
proposed patch

This was reported to me by Mark Williams (who also did the testcase and
proposed patch)

% g++ -std=gnu++17 bug.ii -S -o bug.s
bug.ii: In function ‘void d()’:
bug.ii:6:32: internal compiler error: in sign_mask, at wide-int.h:855
6 | void d() { short e = e >> b::c(); }
  |^
0xa56be2 generic_wide_int >::sign_mask()
const
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/wide-int.h:855
0xa56be2 bool wi::neg_p >
>(generic_wide_int > const&, signop)
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/wide-int.h:1836
0xa56be2 tree_int_cst_sgn(tree_node const*)
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/tree.c:7386
0x1065e85 cp_build_binary_op(op_location_t const&, tree_code, tree_node*,
tree_node*, int)
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/typeck.c:5613
0xfa9629 build_new_op_1
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/call.c:6501
0xfa931d build_new_op(op_location_t const&, tree_code, int, tree_node*,
tree_node*, tree_node*, tree_node**, int)
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/call.c:6547
0x106267f build_x_binary_op(op_location_t const&, tree_code, tree_node*,
tree_code, tree_node*, tree_code, tree_node**, int)
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/typeck.c:4248
0x10162f0 cp_parser_binary_expression
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:9684
0x10157a4 cp_parser_assignment_expression
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:9824
0x1015380 cp_parser_constant_expression
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:10118
0x1015380 cp_parser_initializer_clause
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:23148
0x1015380 cp_parser_initializer
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:23086
0x1008ab0 cp_parser_init_declarator
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:20780
0x1006144 cp_parser_simple_declaration
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:13689
0x101df42 cp_parser_declaration_statement
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:13121
0x101a67c cp_parser_statement
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:11434
0x101a38a cp_parser_statement_seq_opt
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:11800
0x101a38a cp_parser_compound_statement
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:11750
0x101a0f9 cp_parser_function_body
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:22992
0x101a0f9 cp_parser_ctor_initializer_opt_and_function_body
/home/engshare/third-party2/gcc/9.x/src/gcc-10.x/gcc/cp/parser.c:23043

The problem was that it had previously used fold_for_warn to find an
INTEGER_CST, and assumed that cp_fold_rvalue would too. But fold_for_warn
handles some edge cases that cp_fold_rvalue does not, and in this case we end
up with a NOP_EXPR instead of the INTEGER_CST.

[Bug c++/94955] [10 regression] ICE in to_wide

2020-05-05 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94955

--- Comment #2 from Jan Hubicka  ---
Created attachment 48455
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48455&action=edit
testcase

[Bug c++/94955] [10 regression] ICE in to_wide

2020-05-05 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94955

Jan Hubicka  changed:

   What|Removed |Added

 Status|WAITING |NEW

[Bug lto/48200] Implement function attribute for symbol versioning (.symver)

2020-05-06 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48200

--- Comment #44 from Jan Hubicka  ---
Thanks, I am happy we now have real-world use of the symver attribute.  I have
a WIP patch for better control over the symbol visibility, but I ran into
problems with gas limitations, which were fixed by HJ about two weeks ago.
I will try to update the patch and aim for backporting to gcc 10.2.

[Bug tree-optimization/95539] New: Vectorizer ICE in dr_misalignment, at tree-vectorizer.h:1433

2020-06-04 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95539

Bug ID: 95539
   Summary: Vectorizer ICE in dr_misalignment, at
tree-vectorizer.h:1433
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Created attachment 48675
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48675&action=edit
testcase

Building the testcase with -O3 leads to:
/aux/hubicka/firefox-2019-3/security/nss/lib/freebl/gcm-x86.c:35:1: internal
compiler error: in dr_misalignment, at tree-vectorizer.h:1433
   35 | gcm_HashMult_hw(gcmHashContext *ghash, const unsigned char *buf,
  | ^~~
0xccc57e dr_misalignment(dr_vec_info*) [clone .isra.0]
../../gcc/tree-vectorizer.h:1433
0xb5ea92 aligned_access_p
../../gcc/tree-vectorizer.h:1451
0xb5ea92 vect_supportable_dr_alignment(vec_info*, dr_vec_info*, bool)
../../gcc/tree-vect-data-refs.c:6512
0x933803 vect_get_load_cost(vec_info*, _stmt_vec_info*, int, bool, unsigned
int*, unsigned int*, vec*,
vec*, bool)
../../gcc/tree-vect-stmts.c:1211
0x950966 vect_model_load_cost
../../gcc/tree-vect-stmts.c:1185
0x950966 vectorizable_load
../../gcc/tree-vect-stmts.c:8877
0x964260 vect_analyze_stmt(vec_info*, _stmt_vec_info*, bool*, _slp_tree*,
_slp_instance*, vec*)
../../gcc/tree-vect-stmts.c:11126
0x972af1 vect_slp_analyze_node_operations_1
../../gcc/tree-vect-slp.c:2697
0x972af1 vect_slp_analyze_node_operations
../../gcc/tree-vect-slp.c:2858
0x9728fe vect_slp_analyze_node_operations
../../gcc/tree-vect-slp.c:2816
0x9728fe vect_slp_analyze_node_operations
../../gcc/tree-vect-slp.c:2816
0x9728fe vect_slp_analyze_node_operations
../../gcc/tree-vect-slp.c:2816
0x972f76 vect_slp_analyze_operations(vec_info*)
../../gcc/tree-vect-slp.c:2937
0x9812e4 vect_slp_analyze_bb_1
../../gcc/tree-vect-slp.c:3264
0x9812e4 vect_slp_bb_region
../../gcc/tree-vect-slp.c:3325
0x9812e4 vect_slp_bb(basic_block_def*)
../../gcc/tree-vect-slp.c:3460
0x981c32 execute
../../gcc/tree-vectorizer.c:1320

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-27 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #4 from Jan Hubicka  ---
There were changes to the -O2 inliner.  I have
 - enabled auto-inlining
 - reduced early inlining a bit
 - reduced limits for inlining functions declared inline
The latter two were needed to keep code size under control and did well on
overall -O2 spec and Firefox performance (without FDO; with FDO we indeed had
some performance loss and code size gains, which I plan to revisit).
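
To make the distinction concrete, here is a hedged sketch with made-up
functions. small_helper is not declared inline, so inlining it at -O2 depends
on auto-inlining (-finline-functions, which the GCC 10 release notes list as
newly enabled at -O2). declared_inline falls under the separate limits for
functions declared inline. Whether either is actually inlined depends on the
limits in effect, so the comments describe candidates, not guarantees.

```cpp
#include <cassert>

// Not declared inline: a candidate for auto-inlining only.
static int small_helper(int x) { return 3 * x + 1; }

// Declared inline: a candidate under the limits for inline functions.
inline int declared_inline(int x) { return x / 2; }

// One Collatz-style step built from the two helpers above.
int collatz_step(int x)
{
  return (x % 2 != 0) ? small_helper(x) : declared_inline(x);
}
```

Compiling such a file with -O2 -fdump-ipa-inline and reading the dump is one
way to see which of the calls the inliner actually took.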

This should not be visible on linux kernel though since it does always inline.
The linked patch to enable -O3 by default does not make too much sense to me. 

I will see if I can reproduce phoronix benchmarks - indeed those workloads are
not typical -O2 workloads and may be affected by the inline limits.

Honza

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-27 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #5 from Jan Hubicka  ---
OK, I started by checking Himeno, where phoronix reports 4377->2681.
On my notebook (Intel(R) Core(TM) i7-6600U CPU) there may be around a 1-5%
regression that is not inliner related:

GCC 10
 Loop executed for 7445 times
 Gosa : 2.924613e-08 
 MFLOPS measured : 2346.645663  cpu : 50.172505
 Score based on Pentium III 600MHz using Fortran 77: 28.617630

GCC 9
 Loop executed for 8253 times
 Gosa : 9.062229e-09 
 MFLOPS measured : 2454.019320  cpu : 53.184180
 Score based on Pentium III 600MHz using Fortran 77: 29.927065

The internal loops and inlining look almost identical.
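
For reference, the relative changes behind these numbers can be worked out as
in the sketch below. The helper name is made up, and it assumes that the
phoronix figures 4377 -> 2681 are MFLOPS for GCC 9 -> GCC 10.

```cpp
#include <cassert>

// Relative change of `now` versus `before`, in percent.
// A negative result means `now` is lower than `before`.
double percent_change(double before, double now)
{
  return (now - before) / before * 100.0;
}

// MFLOPS measured locally (quoted above) and reported by phoronix.
const double local_gcc9 = 2454.019320;
const double local_gcc10 = 2346.645663;
const double phoronix_gcc9 = 4377.0;   // assumed to be the GCC 9 figure
const double phoronix_gcc10 = 2681.0;  // assumed to be the GCC 10 figure
```

percent_change(local_gcc9, local_gcc10) comes out at roughly -4.4, against
roughly -38.7 for the phoronix pair.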

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-27 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #6 from Jan Hubicka  ---
Coremark.

GCC 9 run1:
CoreMark Size: 666
Total ticks  : 12310
Total time (secs): 12.31
Iterations/Sec   : 24370.430544
Iterations   : 30
Compiler version : GCC9.3.1 20200406 [revision
6db837a5288ee3ca5ec504fbd5a765817e556ac2]
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt

GCC 9 run2:
CoreMark Size: 666
Total ticks  : 12471
Total time (secs): 12.471000
Iterations/Sec   : 24055.809478
Iterations   : 30
Compiler version : GCC9.3.1 20200406 [revision
6db837a5288ee3ca5ec504fbd5a765817e556ac2]
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt


GCC 10 run1:
CoreMark Size: 666
Total ticks  : 15269
Total time (secs): 15.269000
Iterations/Sec   : 26196.869474
Iterations   : 40
Compiler version : GCC10.1.1 20200507 [revision
dd38686d9c810cecbaa80bb82ed91caaa58ad635]
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt

GCC 10 run2:
CoreMark Size: 666
Total ticks  : 11770
Total time (secs): 11.77
Iterations/Sec   : 25488.530161
Iterations   : 30
Compiler version : GCC10.1.1 20200507 [revision
dd38686d9c810cecbaa80bb82ed91caaa58ad635]
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-28 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #7 from Jan Hubicka  ---
X265
GCC 9:
y4m  [info]: 1920x1080 fps 30/1 i420p8 frames 0 - 599 of 600
raw  [info]: output file: /dev/null
x265 [info]: HEVC encoder version 3.1.2+1-76650bab70f9
x265 [info]: build info [Linux][GCC 9.3.1][64 bit][noasm] 8bit
x265 [info]: using cpu capabilities: none!
x265 [info]: Main profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 4 threads
x265 [info]: Slices  : 1
x265 [info]: frame threads / pool features   : 2 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge : hex / 57 / 2 / 3
x265 [info]: Keyframe min / max / scenecut / bias: 25 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt: 20 / 4 / 2
x265 [info]: b-pyramid / weightp / weightb   : 1 / 1 / 0
x265 [info]: References / ref-limit  cu / depth  : 3 / off / on
x265 [info]: AQ: mode / str / qg-size / cu-tree  : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress: CRF-28.0 / 0.60
x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip signhide tmvp b-intra
x265 [info]: tools: strong-intra-smoothing lslices=6 deblock sao
x265 [info]: frame I:  3, Avg QP:27.57  kb/s: 14018.64  
x265 [info]: frame P:146, Avg QP:28.84  kb/s: 4313.98 
x265 [info]: frame B:451, Avg QP:35.29  kb/s: 204.06  
x265 [info]: Weighted P-Frames: Y:0.0% UV:0.0%
x265 [info]: consecutive B-frames: 0.7% 0.0% 0.0% 94.6% 4.7% 

encoded 600 frames in 279.98s (2.14 fps), 1273.22 kb/s, Avg QP:33.68
1056.04user 1.31system 4:40.01elapsed 377%CPU (0avgtext+0avgdata
432688maxresident)k
0inputs+0outputs (0major+102385minor)pagefaults 0swaps


GCC 10:
y4m  [info]: 1920x1080 fps 30/1 i420p8 frames 0 - 599 of 600
raw  [info]: output file: /dev/null
x265 [info]: HEVC encoder version 3.1.2+1-76650bab70f9
x265 [info]: build info [Linux][GCC 10.1.1][64 bit][noasm] 8bit
x265 [info]: using cpu capabilities: none!
x265 [info]: Main profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 4 threads
x265 [info]: Slices  : 1
x265 [info]: frame threads / pool features   : 2 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge : hex / 57 / 2 / 3
x265 [info]: Keyframe min / max / scenecut / bias: 25 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt: 20 / 4 / 2
x265 [info]: b-pyramid / weightp / weightb   : 1 / 1 / 0
x265 [info]: References / ref-limit  cu / depth  : 3 / off / on
x265 [info]: AQ: mode / str / qg-size / cu-tree  : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress: CRF-28.0 / 0.60
x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip signhide tmvp b-intra
x265 [info]: tools: strong-intra-smoothing lslices=6 deblock sao
x265 [info]: frame I:  3, Avg QP:27.57  kb/s: 14018.64  
x265 [info]: frame P:146, Avg QP:28.84  kb/s: 4313.98 
x265 [info]: frame B:451, Avg QP:35.29  kb/s: 204.06  
x265 [info]: Weighted P-Frames: Y:0.0% UV:0.0%
x265 [info]: consecutive B-frames: 0.7% 0.0% 0.0% 94.6% 4.7% 

encoded 600 frames in 292.63s (2.05 fps), 1273.22 kb/s, Avg QP:33.68
1079.80user 1.76system 4:52.65elapsed 369%CPU (0avgtext+0avgdata
427464maxresident)k
0inputs+0outputs (0major+73644minor)pagefaults 0swaps

So 5% difference instead of 50%. This is a codebase that I would build with
-O3.  Looking at perf reports there is a difference in inlining.

GCC 9:
   8.74%  x265 libx265.so.176   [.] (anonymous namespace)::satd_8x4
   5.67%  x265 libx265.so.176   [.] (anonymous
namespace)::filterVertical_sp_c<8>
   4.44%  x265 libx265.so.176   [.] (anonymous
namespace)::pixelavg_pp<8, 8>
   4.11%  x265 libx265.so.176   [.] (anonymous
namespace)::psyCost_pp<3>   
   3.81%  x265 libx265.so.176   [.] (anonymous
namespace)::interp_horiz_ps_c<8, 64, 64>
   3.33%  x265 libx265.so.176   [.] (anonymous namespace)::sad<8, 8>
   3.29%  x265 libx265.so.176   [.] partialButterfly32

GCC 10:
   9.17%  x265 libx265.so.176   [.] (anonymous namespace)::_sa8d_8x8
   8.70%  x265 libx265.so.176   [.] (anonymous namespace)::satd_8x4 
   5.80%  x265 libx265.so.176   [.] (anonymous
namespace)::pixelavg_pp<8, 8>
   5.55%  x265 libx265.so.176   [.] (anonymous
namespace)::filterVertical_sp_c<8> 
   3.90%  x265 libx265.so.176   [.] (anonymous namespace)::sad<8, 8>
   3.71%  x265 libx265.so.176   [.] (anonymous
namespace)::interp_horiz_ps_c<8, 64, 64> 
   3.48%  x265 libx265.so.176   [.] (anonymous namespace)::sad_x4<8, 8>

I build with 
cmake ../source/ -DCMAKE_CXX_FLAGS=-O2 -DCMAKE_CXX_

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-28 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #8 from Jan Hubicka  ---
This is the build without the release flags override, as seems to be done by
phoronix:

GCC 9:
y4m  [info]: 1920x1080 fps 30/1 i420p8 frames 0 - 599 of 600
raw  [info]: output file: /dev/null
x265 [info]: HEVC encoder version 3.1.2+1-76650bab70f9
x265 [info]: build info [Linux][GCC 9.3.1][64 bit][noasm] 8bit
x265 [info]: using cpu capabilities: none!
x265 [info]: Main profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 4 threads
x265 [info]: Slices  : 1
x265 [info]: frame threads / pool features   : 2 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge : hex / 57 / 2 / 3
x265 [info]: Keyframe min / max / scenecut / bias: 25 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt: 20 / 4 / 2
x265 [info]: b-pyramid / weightp / weightb   : 1 / 1 / 0
x265 [info]: References / ref-limit  cu / depth  : 3 / off / on
x265 [info]: AQ: mode / str / qg-size / cu-tree  : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress: CRF-28.0 / 0.60
x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip signhide tmvp b-intra
x265 [info]: tools: strong-intra-smoothing lslices=6 deblock sao
x265 [info]: frame I:  3, Avg QP:27.57  kb/s: 14018.64  
x265 [info]: frame P:146, Avg QP:28.84  kb/s: 4313.98 
x265 [info]: frame B:451, Avg QP:35.29  kb/s: 204.06  
x265 [info]: Weighted P-Frames: Y:0.0% UV:0.0%
x265 [info]: consecutive B-frames: 0.7% 0.0% 0.0% 94.6% 4.7% 

encoded 600 frames in 171.30s (3.50 fps), 1273.22 kb/s, Avg QP:33.68
599.58user 1.62system 2:51.33elapsed 350%CPU (0avgtext+0avgdata
416976maxresident)k
225384inputs+0outputs (0major+95380minor)pagefaults 0swaps

GCC 10:
y4m  [info]: 1920x1080 fps 30/1 i420p8 frames 0 - 599 of 600
raw  [info]: output file: /dev/null
x265 [info]: HEVC encoder version 3.1.2+1-76650bab70f9
x265 [info]: build info [Linux][GCC 10.1.1][64 bit][noasm] 8bit
x265 [info]: using cpu capabilities: none!
x265 [info]: Main profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 4 threads
x265 [info]: Slices  : 1
x265 [info]: frame threads / pool features   : 2 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge : hex / 57 / 2 / 3
x265 [info]: Keyframe min / max / scenecut / bias: 25 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt: 20 / 4 / 2
x265 [info]: b-pyramid / weightp / weightb   : 1 / 1 / 0
x265 [info]: References / ref-limit  cu / depth  : 3 / off / on
x265 [info]: AQ: mode / str / qg-size / cu-tree  : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress: CRF-28.0 / 0.60
x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip signhide tmvp b-intra
x265 [info]: tools: strong-intra-smoothing lslices=6 deblock sao
x265 [info]: frame I:  3, Avg QP:27.57  kb/s: 14018.64  
x265 [info]: frame P:146, Avg QP:28.84  kb/s: 4313.98 
x265 [info]: frame B:451, Avg QP:35.29  kb/s: 204.06  
x265 [info]: Weighted P-Frames: Y:0.0% UV:0.0%
x265 [info]: consecutive B-frames: 0.7% 0.0% 0.0% 94.6% 4.7% 

encoded 600 frames in 168.97s (3.55 fps), 1273.22 kb/s, Avg QP:33.68
592.69user 1.89system 2:49.00elapsed 351%CPU (0avgtext+0avgdata
416184maxresident)k
476408inputs+0outputs (1major+95191minor)pagefaults 0swaps

So a small improvement.
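
A quick cross-check of the figures, as a sketch with made-up helper names: the
fps values follow from 600 frames divided by the encode time, and the
difference between the two compilers can be expressed as a percentage of the
GCC 9 time.

```cpp
#include <cassert>

// Frames per second from a frame count and the elapsed seconds.
double fps(int frames, double seconds) { return frames / seconds; }

// How much sooner run b finished than run a, in percent of a's time.
// A negative result means b was slower.
double percent_faster(double seconds_a, double seconds_b)
{
  return (seconds_a - seconds_b) / seconds_a * 100.0;
}
```

With the times above, percent_faster(171.30, 168.97) is about 1.4, while the
-O2 times from the previous comment, percent_faster(279.98, 292.63), give
about -4.5.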

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-07-28 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

--- Comment #9 from Jan Hubicka  ---
scimark
GCC 9:
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
Composite Score: 1062.28
FFT Mflops:   189.17(N=1048576)
SOR Mflops:   947.53(1000 x 1000)
MonteCarlo: Mflops:   710.10
Sparse matmult  Mflops:  1402.08(N=10, nz=100)
LU  Mflops:  2062.49(M=1000, N=1000)

GCC 10:
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
Composite Score: 1176.22
FFT Mflops:   201.17(N=1048576)
SOR Mflops:   961.33(1000 x 1000)
MonteCarlo: Mflops:   708.62
Sparse matmult  Mflops:  1639.66(N=10, nz=100)
LU  Mflops:  2370.30(M=1000, N=1000)

So again around 10% improvement for gcc10

[Bug ipa/96482] [10/11 Regression] Combination of -finline-small-functions and ipa-cp optimisations causes incorrect values being passed to a function since r279523

2020-08-10 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96482

--- Comment #4 from Jan Hubicka  ---
That patch makes ccp actually use the bit info that ipa-cp determines. Before,
we used it only to detect pointer alignments, if I remember correctly. So it
looks like a propagation bug uncovered by the change.  A smaller testcase or
reproduction steps would indeed be welcome.

[Bug ipa/96337] [10/11 Regression] GCC 10.2: twice as slow for -O2 -march=x86-64 vs. GCC 9.3/8.4

2020-09-19 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337

Jan Hubicka  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |WORKSFORME

--- Comment #16 from Jan Hubicka  ---
It seems that the benchmarks were flawed. We could reopen if phoronix succeeds
in reproducing them.

[Bug ipa/92074] [10 regression] 26% performance regression on Spec2017 548.exchange2_r

2019-10-23 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92074

--- Comment #6 from Jan Hubicka  ---
Author: hubicka
Date: Wed Oct 23 14:45:24 2019
New Revision: 277333

URL: https://gcc.gnu.org/viewcvs?rev=277333&root=gcc&view=rev
Log:
PR ipa/92074
* params.def (inline-heuristics-hint-percent): Set to 600.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/params.def

[Bug middle-end/92153] [10 Regression] ICE / segmentation fault, use-after-free at gcc/ggc-page.c:1159

2019-10-25 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92153

--- Comment #3 from Jan Hubicka  ---
Author: hubicka
Date: Fri Oct 25 11:17:38 2019
New Revision: 277443

URL: https://gcc.gnu.org/viewcvs?rev=277443&root=gcc&view=rev
Log:
Backport ggc_trim
Backport from mainline

2019-10-18  Jakub Jelinek  
PR middle-end/92153
* ggc-page.c (release_pages): Read g->alloc_size before free rather
than after it.

2019-10-11  Jan Hubicka  

* ggc-page.c (release_pages): Output statistics when !quiet_flag.
(ggc_collect): Dump later to not interfere with release_page dump.
(ggc_trim): New function.
* ggc-none.c (ggc_trim): New.
* ggc.h (ggc_trim): Declare.

* lto-partition.c (add_symbol_to_partition_1): Update.
(undo_parittion): Update.

Modified:
branches/gcc-9-branch/gcc/ChangeLog
branches/gcc-9-branch/gcc/ggc-none.c
branches/gcc-9-branch/gcc/ggc-page.c
branches/gcc-9-branch/gcc/ggc.h
branches/gcc-9-branch/gcc/lto/ChangeLog
branches/gcc-9-branch/gcc/lto/lto.c

[Bug ipa/92242] [10 regression] LTO ICE in ipa_get_cs_argument_count ipa-prop.h:598

2019-10-28 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92242

--- Comment #3 from Jan Hubicka  ---
Author: hubicka
Date: Mon Oct 28 08:19:56 2019
New Revision: 277504

URL: https://gcc.gnu.org/viewcvs?rev=277504&root=gcc&view=rev
Log:

PR ipa/92242
* ipa-fnsummary.c (ipa_merge_fn_summary_after_inlining): Check
for missing EDGE_REF
* ipa-prop.c (update_jump_functions_after_inlining): Likewise.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-fnsummary.c
trunk/gcc/ipa-prop.c

[Bug ipa/92242] [10 regression] LTO ICE in ipa_get_cs_argument_count ipa-prop.h:598

2019-10-28 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92242

Jan Hubicka  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Jan Hubicka  ---
Thanks for the confirmation (and the testcase). Sadly, I am not sure how to put
it into the testsuite, but given that other tests also broke, I hope this patch
is tested sufficiently.

Honza

[Bug ipa/92278] [10 regression] LTO ICE ipa_get_ith_polymorhic_call_context ipa-prop.h:616

2019-10-30 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92278

Jan Hubicka  changed:

   What|Removed |Added

 CC||mjambor at suse dot cz

--- Comment #3 from Jan Hubicka  ---
Since there is no -O0 code involved here, I am not sure why the summary went
missing.  We probably should debug that. I think the patch I committed today
silences the ICE, however.

Martin, do you have any idea?

[Bug ipa/92254] [10 regression] ICE LTO in inline_small_functions, at ipa-inline.c:2000

2019-10-30 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92254

Jan Hubicka  changed:

   What|Removed |Added

 CC||mjambor at suse dot cz

--- Comment #3 from Jan Hubicka  ---
Similarly here. It seems like a previously latent bug showing up now.

[Bug ipa/92394] New: operand_equal_p should compare as base+offset when comparing addresses

2019-11-06 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92394

Bug ID: 92394
   Summary: operand_equal_p should compare as base+offset when
comparing addresses
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Compiling firefox one gets many of:
  false returned: '' in operand_equal_p at ../../gcc/ipa-icf-gimple.c:259   
  false returned: 'operand_equal_p failed' in compare_operand at ../../gcc/ipa-icf-gimple.c:303
  false returned: 'memory operands are different' in compare_gimple_assign at ../../gcc/ipa-icf-gimple.c:621
  different statement for code: GIMPLE_ASSIGN (compare_bb:468): 
_6 = &self_5->D.1557805.D.1541362.D.1218628.D.20474;
_6 = &self_5->D.1593155;
  false returned: '' in equals_private at ../../gcc/ipa-icf.c:885   
Equals called for: _finalize/10691342:_finalize/10809461 with result: false 

Here operand_equal_p seems overly conservative (assuming that the base+offset
match). When comparing operands inside an ADDR_EXPR it does not need to care
about the actual access path.

[Bug ipa/92394] operand_equal_p should compare as base+offset when comparing addresses

2019-11-06 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92394

--- Comment #1 from Jan Hubicka  ---
The two functions in the following testcase are mergeable:

struct a {int a; int b;};
struct b {int c; short d;};

void *
retadr1(struct a *a)
{
  return &a->b;
}
void *
retadr2(struct b *a)
{
  return &a->d;
}
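
One way to see why the two functions are candidates for merging: both return
the incoming pointer plus the same byte offset, even though the access paths
name different struct types. The sketch below repeats the testcase and adds a
helper, same_returned_offset, which is hypothetical and not part of the bug
report. It assumes the usual layout where no padding is inserted before the
second member of either struct.

```c
#include <assert.h>
#include <stddef.h>

struct a { int a; int b; };
struct b { int c; short d; };

void *retadr1(struct a *a) { return &a->b; }
void *retadr2(struct b *a) { return &a->d; }

/* Hypothetical helper (not from the bug report): returns 1 when both
   functions add the same byte offset to their argument.  */
int same_returned_offset(void)
{
  struct a x = { 0, 0 };
  struct b y = { 0, 0 };
  ptrdiff_t off1 = (char *) retadr1(&x) - (char *) &x;
  ptrdiff_t off2 = (char *) retadr2(&y) - (char *) &y;

  /* The measured offsets agree with what offsetof reports.  */
  assert(off1 == (ptrdiff_t) offsetof(struct a, b));
  assert(off2 == (ptrdiff_t) offsetof(struct b, d));
  return off1 == off2;
}
```

Calling same_returned_offset() is expected to return 1 on common ABIs, which
is the base+offset equality the comparison could rely on instead of the access
path.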

[Bug ipa/92394] operand_equal_p should compare as base+offset when comparing addresses

2019-11-06 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92394

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |UNCONFIRMED
   Last reconfirmed|2019-11-06 00:00:00 |
Version|10.0|unknown
   Assignee|marxin at gcc dot gnu.org  |unassigned at gcc dot gnu.org
   Target Milestone|10.0|---
 Ever confirmed|1   |0

--- Comment #2 from Jan Hubicka  ---
These are the statistics of the reasons why ICF fails:
   6523   false returned: 'different tree types' in compatible_types_p at ../../gcc/ipa-icf-gimple.c:203
   7521   false returned: 'parameter types are not compatible' in equals_wpa at ../../gcc/ipa-icf.c:637
  12973   false returned: 'memory operands are different' in compare_gimple_call at ../../gcc/ipa-icf-gimple.c:582
  14799   false returned: 'decl_or_type flags are different' in equals_wpa at ../../gcc/ipa-icf.c:570
  16052   false returned: 'inline attributes are different' in compare_referenced_symbol_properties at ../../gcc/ipa-icf.c:344
  20962   false returned: 'references to virtual tables cannot be merged' in compare_referenced_symbol_properties at ../../gcc/ipa-icf.c:364
  72431   false returned: 'call function types are not compatible' in compare_gimple_call at ../../gcc/ipa-icf-gimple.c:566
  80695   false returned: 'result types are different' in equals_wpa at ../../gcc/ipa-icf.c:619
  84475   false returned: 'types are not compatible' in compatible_types_p at ../../gcc/ipa-icf-gimple.c:209
 117458   false returned: '' in compare_gimple_call at ../../gcc/ipa-icf-gimple.c:545
 388866   false returned: 'THIS pointer ODR type mismatch' in equals_wpa at ../../gcc/ipa-icf.c:675
 391183   false returned: 'types are not same for ODR' in compatible_polymorphic_types_p at ../../gcc/ipa-icf-gimple.c:194
 618107   false returned: '' in operand_equal_p at ../../gcc/ipa-icf-gimple.c:259
2953032   false returned: 'memory operands are different' in compare_gimple_assign at ../../gcc/ipa-icf-gimple.c:621
3083711   false returned: 'operand_equal_p failed' in compare_operand at ../../gcc/ipa-icf-gimple.c:303
3156681   false returned: '' in equals_private at ../../gcc/ipa-icf.c:885


So about 2.9M function bodies are streamed in only to fail on differing memory
operands.

Honza

[Bug ipa/92394] operand_equal_p should compare as base+offset when comparing addresses

2019-11-06 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92394

--- Comment #3 from Jan Hubicka  ---
These are the corresponding stats from GCC 9, so we definitely load a lot more
bodies now:
  13228   false returned: 'memory operands are different' (compare_gimple_call:785)
  14011   false returned: 'decl_or_type flags are different' (equals_wpa:577)
  15619   false returned: 'types are not compatible' (compatible_types_p:233)
  16877   false returned: (compare_cst_or_decl:341)
  17365   false returned: 'references to virtual tables cannot be merged' (compare_referenced_symbol_properties:370)
  19423   false returned: (compare_operand:478)
  28816   false returned: (compare_operand:509)
  87413   false returned: 'memory operands are different' (compare_gimple_assign:824)
 199751   false returned: 'THIS pointer ODR type mismatch' (equals_wpa:682)
 201097   false returned: 'types are not same for ODR' (compatible_polymorphic_types_p:218)
 375744   false returned: 'parameter type is not compatible' (compatible_parm_types_p:509)
 457840   false returned: '' (equals_private:890)
 783534   false returned: 'alias sets are different' (compatible_types_p:244)


gcc 9 merges 40k functions, while trunk 30k.

[Bug lto/92406] [10 Regression] ICE in ipa_call_summary at ipa-fnsummary.h:253 with lto and pgo

2019-11-07 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92406

--- Comment #4 from Jan Hubicka  ---
Created attachment 47193
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47193&action=edit
Proposed patch

Hi,
does this patch fix the problem?
Honza

[Bug lto/92406] [10 Regression] ICE in ipa_call_summary at ipa-fnsummary.h:253 with lto and pgo

2019-11-07 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92406

--- Comment #7 from Jan Hubicka  ---
Author: hubicka
Date: Thu Nov  7 17:08:11 2019
New Revision: 277927

URL: https://gcc.gnu.org/viewcvs?rev=277927&root=gcc&view=rev
Log:

PR ipa/92406
* ipa-fnsummary.c (analyze_function_body): Use get_create to copy
summary.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-fnsummary.c

[Bug ipa/92471] [ICE] lto1 segmentation fault: ipa-profile.c ipa_get_cs_argument_count (args=0x0)

2019-11-12 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92471

Jan Hubicka  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Jan Hubicka  ---
Fixed.

[Bug ipa/92471] [ICE] lto1 segmentation fault: ipa-profile.c ipa_get_cs_argument_count (args=0x0)

2019-11-12 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92471

--- Comment #5 from Jan Hubicka  ---
Author: hubicka
Date: Tue Nov 12 19:31:04 2019
New Revision: 278100

URL: https://gcc.gnu.org/viewcvs?rev=278100&root=gcc&view=rev
Log:
PR ipa/92471
* ipa-profile.c (check_argument_count): Break out from ...;
watch for missing summaries.
(ipa_profile): Here.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-profile.c

[Bug ipa/92498] [10 regression] gcc.dg/tree-prof/crossmodule-indircall-1.c fails starting with r278100

2019-11-13 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92498

--- Comment #1 from Jan Hubicka  ---
Author: hubicka
Date: Wed Nov 13 19:44:35 2019
New Revision: 278157

URL: https://gcc.gnu.org/viewcvs?rev=278157&root=gcc&view=rev
Log:
PR ipa/92498
* ipa-profile.c (check_argument_count): Do not ICE when descriptors
is NULL.
(ipa_profile): Fix reversed test.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-profile.c

[Bug ipa/92421] [10 Regression] ICE in inline_small_functions, at ipa-inline.c:2001 since r277759

2019-11-13 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92421

--- Comment #6 from Jan Hubicka  ---
Author: hubicka
Date: Wed Nov 13 21:02:11 2019
New Revision: 278159

URL: https://gcc.gnu.org/viewcvs?rev=278159&root=gcc&view=rev
Log:

PR c++/92421
* ipa-prop.c (update_indirect_edges_after_inlining):
Mark parameter as used.
* ipa-inline.c (recursive_inlining): Reset node cache
after inlining.
(inline_small_functions): Remove checking ifdef.
* ipa-inline-analysis.c (do_estimate_edge_time): Verify
cache consistency.
* g++.dg/torture/pr92421.C: New testcase.

Added:
trunk/gcc/testsuite/g++.dg/torture/pr92421.C
Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-inline-analysis.c
trunk/gcc/ipa-inline.c
trunk/gcc/ipa-prop.c
trunk/gcc/testsuite/ChangeLog

[Bug c/66825] RFE: Add attributes for symbol versioning.

2019-11-14 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66825

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 CC||hubicka at gcc dot gnu.org
 Resolution|--- |DUPLICATE

--- Comment #2 from Jan Hubicka  ---
We have an earlier bug on this. I am going to attach a WIP patch there.

*** This bug has been marked as a duplicate of bug 48200 ***

[Bug lto/48200] Implement function attribute for symbol versioning (.symver)

2019-11-14 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48200

Jan Hubicka  changed:

   What|Removed |Added

 CC||carlos at redhat dot com

--- Comment #37 from Jan Hubicka  ---
*** Bug 66825 has been marked as a duplicate of this bug. ***

[Bug testsuite/92520] [10 Regression] new test case gcc/testsuite/gcc.dg/ipa/inline-9.c in r278220 is unresolved

2019-11-15 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92520

--- Comment #1 from Jan Hubicka  ---
Author: hubicka
Date: Fri Nov 15 08:19:16 2019
New Revision: 278279

URL: https://gcc.gnu.org/viewcvs?rev=278279&root=gcc&view=rev
Log:
PR testsuite/92520
* gcc.dg/ipa/inline-9.c: Fix template.

Modified:
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/ipa/inline-9.c

[Bug testsuite/92520] [10 Regression] new test case gcc/testsuite/gcc.dg/ipa/inline-9.c in r278220 is unresolved

2019-11-15 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92520

Jan Hubicka  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from Jan Hubicka  ---
I have fixed the testcase in r278279

[Bug lto/48200] Implement function attribute for symbol versioning (.symver)

2019-11-15 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48200

--- Comment #40 from Jan Hubicka  ---
I posted initial patch here
https://gcc.gnu.org/ml/gcc-patches/2019-11/msg01334.html

[Bug ipa/92528] [10 Regression] ICE in ipa_get_parm_lattices since r278219

2019-11-15 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92528

Jan Hubicka  changed:

   What|Removed |Added

   Assignee|hubicka at gcc dot gnu.org |fxue at os dot amperecomputing.com

--- Comment #6 from Jan Hubicka  ---
This is the same issue I hit in the Firefox build and that we discussed at:
https://gcc.gnu.org/ml/gcc-patches/2019-11/msg01351.html
Feng is right that ipa_set_jf_unknown is missing a clear of agg.

> I checked update_jump_functions_after_inlining(), and found one suspicious 
> place:

>  for (i = 0; i < count; i++)
>{
>  struct ipa_jump_func *dst = ipa_get_ith_jump_func (args, i);
>  if (!top)
>{
>  ipa_set_jf_unknown (dst);
>  <<<<<<<<<<<<<<<<<   we should also invalidate dst->agg.items.

Yes, the following patch fixes it:

Index: ipa-prop.c
===
--- ipa-prop.c  (revision 278222)
+++ ipa-prop.c  (working copy)
@@ -514,6 +514,8 @@ ipa_set_jf_unknown (struct ipa_jump_func
   jfunc->type = IPA_JF_UNKNOWN;
   jfunc->bits = NULL;
   jfunc->m_vr = NULL;
+  jfunc->agg.by_ref = false;
+  jfunc->agg.items = NULL;
 }

 /* Set JFUNC to be a copy of another jmp (to be used by jump function

>  continue;
>}
>  class ipa_polymorphic_call_context *dst_ctx
>= ipa_get_ith_polymorhic_call_context (args, i);   <<<< An irrelevant 
> point:  and should we also do some kind of cleaning on dst_ctx?

There is no need to clear polymorphic call context. It does not refer to the
parameters of caller. If it was valid for all possible contexts it is still
valid. 

So I think ipa_set_jf_unknown shall not clear bits and m_vr.

Honza

[Bug ipa/92508] [10 Regression] ICE in do_estimate_edge_time, at ipa-inline-analysis.c:223 since r278159

2019-11-15 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92508

--- Comment #8 from Jan Hubicka  ---
Aha, that makes sense: for sreal it is not guaranteed that
  a == a * 1 / 1
and the code was inconsistent about guarding the noop scales.
Thanks for tracking this down! I suppose it would also make sense to
pre-compute 1/1 and use it instead of divisions.  I will look into it after
fixing other issues.
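
To illustrate the general hazard with limited-precision arithmetic, here is a
minimal sketch using a toy type with a 4-bit significand. It is NOT GCC's
sreal; the type toy_real and all toy_* helpers are made up for this example.
It only shows that scaling a value and then dividing the scale back out need
not return the original value, so a path that skips the scaling and a path
that performs it can disagree.

```c
#include <assert.h>

/* Toy value sig * 2^exp with a 4-bit significand (hypothetical type,
   not GCC's sreal).  */
struct toy_real { unsigned sig; int exp; };

/* Truncate the significand until it fits in 4 bits.  */
struct toy_real toy_normalize(struct toy_real r)
{
  while (r.sig >= 16) { r.sig >>= 1; r.exp++; }
  return r;
}

struct toy_real toy_from_int(unsigned v)
{
  struct toy_real r = { v, 0 };
  return toy_normalize(r);
}

struct toy_real toy_mul(struct toy_real a, struct toy_real b)
{
  struct toy_real r = { a.sig * b.sig, a.exp + b.exp };
  return toy_normalize(r);
}

struct toy_real toy_div(struct toy_real a, struct toy_real b)
{
  assert(b.sig != 0);
  /* Shift the dividend left by 8 bits to keep some quotient precision.  */
  struct toy_real r = { (a.sig << 8) / b.sig, a.exp - b.exp - 8 };
  return toy_normalize(r);
}

unsigned toy_to_int(struct toy_real r)
{
  return r.exp >= 0 ? r.sig << r.exp : r.sig >> -r.exp;
}

/* Compute a * s / s in the toy arithmetic.  */
unsigned toy_scale_round_trip(unsigned a, unsigned s)
{
  struct toy_real ra = toy_from_int(a), rs = toy_from_int(s);
  return toy_to_int(toy_div(toy_mul(ra, rs), rs));
}
```

In this toy arithmetic toy_scale_round_trip(11, 3) yields 10 rather than 11,
because the product 33 is truncated to 32 before the division. A sanity check
that compares a scaled and an unscaled computation for equality can therefore
fail unless both paths apply the same operations.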

Honza

[Bug ipa/92535] New: [10 regression] ICF is relatively expensive and became less effective

2019-11-15 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92535

Bug ID: 92535
   Summary: [10 regression] ICF is relatively expensive and became
less effective
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Created attachment 47274
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47274&action=edit
Memory use graph for linktime for GCC10

ICF is currently very conservative when optimizing libxul.so, saving only about
1.5% of the text segment:

$ bloaty libxul.so -- libxul.so.old2
 VM SIZE  FILE SIZE
 ++ GROWING++
  +1.5% +1.21Mi .text  +1.21Mi  +1.5%
  +4.4%  +351Ki .eh_frame   +351Ki  +4.4%
  +6.0%  +102Ki .eh_frame_hdr   +102Ki  +6.0%
  [ = ]   0 .strtab+62.4Ki  +0.2%
  +0.5% +52.6Ki .rela.dyn  +52.6Ki  +0.5%
  +0.1% +19.6Ki .rodata+19.6Ki  +0.1%
  +0.4% +13.2Ki .data.rel.ro.local +13.2Ki  +0.4%
  +1.3% +9.97Ki .data.rel.ro   +9.97Ki  +1.3%
  +0.2% +12 .gcc_except_table  +12  +0.2%

 -- SHRINKING  --
  [ = ]   0 .symtab-10.0Ki  -0.1%
  -0.0% -64 .data  -64  -0.0%
  -0.0% -16 .bss 0  [ = ]

 -+-+-+-+-+-+-+ MIXED  +-+-+-+-+-+-+-
   +76%+124 [Unmapped] -3.04Ki -77.5%

  +1.3% +1.75Mi TOTAL  +1.79Mi  +0.9%

This used to be 7% in GCC5 (at Firefox from 2015)

At the same time it is relatively expensive memory wise and compile time wise.

It increases peak memory use from 6GB to 7.5GB and compile time from:
real    8m57.454s
user    91m8.020s
sys     6m20.372s

to

real    9m41.361s
user    91m47.076s
sys     6m16.760s

For GCC 9 the code size improvement is 2.3%, and the build time changes from:
real    7m53.778s
user    76m10.368s
sys     6m55.324s

to

real    8m14.613s
user    72m57.932s
sys     6m32.792s

and peak memory use goes from 8GB to 10GB.

[Bug ipa/92535] [10 regression] ICF is relatively expensive and became less effective

2019-11-15 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92535

--- Comment #1 from Jan Hubicka  ---
Created attachment 47275
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47275&action=edit
memory use of GCC10 with icf disabled

[Bug ipa/92535] [10 regression] ICF is relatively expensive and became less effective

2019-11-15 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92535

--- Comment #3 from Jan Hubicka  ---
Created attachment 47277
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47277&action=edit
Memory use of gcc9 with ICF disabled

[Bug ipa/92535] [10 regression] ICF is relatively expensive and became less effective

2019-11-15 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92535

--- Comment #2 from Jan Hubicka  ---
Created attachment 47276
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47276&action=edit
Memory use of gcc9

[Bug ipa/92535] [10 regression] ICF is relatively expensive and became less effective

2019-11-15 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92535

--- Comment #4 from Jan Hubicka  ---
Forgot bloaty report for GCC9 and disabling ICF
$ bloaty libxul.so -- libxul.so.old 
 VM SIZE  FILE SIZE
 ++ GROWING++
  +2.3% +1.87Mi .text  +1.87Mi  +2.3%
  +5.4%  +423Ki .eh_frame   +423Ki  +5.4%
  +7.1%  +122Ki .eh_frame_hdr   +122Ki  +7.1%
  +0.6% +61.3Ki .rela.dyn  +61.3Ki  +0.6%
  +0.2% +29.8Ki .rodata+29.8Ki  +0.2%
  +0.4% +14.1Ki .data.rel.ro.local +14.1Ki  +0.4%
  +1.5% +12.0Ki .data.rel.ro   +12.0Ki  +1.5%
  +0.1%+224 .data +224  +0.1%

 -- SHRINKING  --
  [ = ]   0 .strtab -291Ki  -0.7%
  [ = ]   0 .symtab-46.6Ki  -0.3%
 -69.6%-240 [Unmapped] -3.12Ki -73.1%
  -0.0%-120 .bss 0  [ = ]

  +1.8% +2.51Mi TOTAL  +2.18Mi  +1.1%

[Bug ipa/92508] [10 Regression] ICE in do_estimate_edge_time, at ipa-inline-analysis.c:223 since r278159

2019-11-18 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92508

--- Comment #15 from Jan Hubicka  ---
Author: hubicka
Date: Mon Nov 18 19:28:53 2019
New Revision: 278419

URL: https://gcc.gnu.org/viewcvs?rev=278419&root=gcc&view=rev
Log:

PR ipa/92508
* ipa-inline.c (inline_small_functions): Add new edges after reseting
caches.
* ipa-inline-analysis.c (do_estimate_edge_time): Fix sanity check.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-inline-analysis.c
trunk/gcc/ipa-inline.c

[Bug ipa/92508] [10 Regression] ICE in do_estimate_edge_time, at ipa-inline-analysis.c:223 since r278159

2019-11-18 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92508

Jan Hubicka  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #16 from Jan Hubicka  ---
Fixed all three problems.

[Bug ipa/92476] [10 regression] SEGV in cgraph_edge_brings_value_p

2019-11-18 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92476

Jan Hubicka  changed:

   What|Removed |Added

   Assignee|hubicka at gcc dot gnu.org |mjambor at suse dot cz

--- Comment #3 from Jan Hubicka  ---
Martin,
this problem is caused by ipa-cp deciding to clone a function which has a thunk
associated with it.  create_virtual_clone then copies the thunk (which is your
code) and expands all thunks.  This turns the thunk into a real function, and
because ipa-cp does not produce summaries for thunks we now ICE because the
summary is missing.  I tried the following to compute the missing summary:
Index: cgraphclones.c
===
--- cgraphclones.c  (revision 278390)
+++ cgraphclones.c  (working copy)
@@ -80,6 +80,11 @@ along with GCC; see the file COPYING3.
 #include "tree-inline.h"
 #include "dumpfile.h"
 #include "gimple-pretty-print.h"
+#include "alloc-pool.h"
+#include "symbol-summary.h"
+#include "tree-vrp.h"
+#include "ipa-prop.h"
+#include "ipa-fnsummary.h"

 /* Create clone of edge in the node N represented by CALL_EXPR
the callgraph.  */
@@ -268,6 +273,8 @@ cgraph_node::expand_all_artificial_thunk
thunk->thunk.thunk_p = false;
thunk->analyze ();
  }
+   ipa_analyze_node (thunk);
+   inline_analyze_function (thunk);
thunk->expand_all_artificial_thunks ();
   }
 else

but that moves the ICE later:
hubicka@lomikamen-jh:/aux/hubicka/trunk5/build-lto/gcc$ ./xgcc -B ./ -O2 a.C -m32
during IPA pass: cp
a.C:40:1: internal compiler error: in ipa_get_parm_lattices, at ipa-cp.c:388
   40 | }
  | ^
0x234db7d ipa_get_parm_lattices
../../gcc/ipa-cp.c:388
0x23595b1 ipcp_store_bits_results
../../gcc/ipa-cp.c:5417
0x2359c7a ipcp_driver
../../gcc/ipa-cp.c:5558
0x2359e58 execute
../../gcc/ipa-cp.c:5647
Please submit a full bug report,

which is caused by the fact that we have no lattices for that function (since it
was not considered by the propagator).

I also wonder why these seem to show up with 32bit only.
Honza

[Bug c++/55135] Segfault of gcc on a big file

2019-11-21 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55135

--- Comment #30 from Jan Hubicka  ---
Reconfirmed that we still take ages to build the testcase (early inliner is
still running for me)

The early inliner issue here is caused by tree-inline removing individual
clones one by one.  Each time a clone is removed, a new clone becomes the root
of the clone tree and it takes a long time to update all pointers.
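
A rough sketch of why this is slow, under a simplified model that is an
assumption of this example and not GCC's actual cgraph data structure: every
clone keeps a pointer to the current root, and each removal of the root forces
all remaining clones to be re-pointed at the new root. The type toy_clone and
the function count_repoint_updates are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy clone node: each non-root clone points at the current root.  */
struct toy_clone { struct toy_clone *clone_of; };

/* Remove the current root n - 1 times, promoting the next clone to root
   each time, and return how many pointer updates that needed in total.  */
unsigned long count_repoint_updates(size_t n)
{
  assert(n > 0);
  struct toy_clone *nodes = calloc(n, sizeof *nodes);
  unsigned long updates = 0;

  if (!nodes)
    return 0;
  for (size_t i = 1; i < n; i++)
    nodes[i].clone_of = &nodes[0];

  for (size_t root = 0; root + 1 < n; root++) {
    size_t new_root = root + 1;

    /* The promoted clone no longer points at anything.  */
    nodes[new_root].clone_of = NULL;
    updates++;
    /* Every remaining clone must be re-pointed at the new root.  */
    for (size_t i = new_root + 1; i < n; i++) {
      nodes[i].clone_of = &nodes[new_root];
      updates++;
    }
  }
  free(nodes);
  return updates;
}
```

In this model n clones need n * (n - 1) / 2 pointer updates, so the work grows
quadratically with the number of clones, whereas dropping the whole tree at
once would need none.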

[Bug ipa/44563] GCC uses a lot of RAM when compiling a large numbers of functions

2019-11-21 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44563

Jan Hubicka  changed:

   What|Removed |Added

 Status|ASSIGNED|NEW
   Assignee|hubicka at gcc dot gnu.org |unassigned at gcc dot gnu.org

--- Comment #38 from Jan Hubicka  ---
 it is GCC10 but I finally managed to implement the incremental update
here.
Memory use is about 1.1GB but inliner finishes quite quickly:

Time variable                          usr            sys           wall                GGC
 phase setup                    :   0.00 (  0%)   0.00 (  0%)   0.00 (  0%)    1237 kB (  0%)
 phase parsing                  :   1.29 (  2%)   1.24 (  6%)   2.54 (  3%)  247897 kB (  6%)
 phase lang. deferred           :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 phase opt and generate         :  56.81 ( 98%)  19.35 ( 94%)  76.27 ( 97%) 3859026 kB ( 94%)
 garbage collection             :   0.84 (  1%)   0.10 (  0%)   0.93 (  1%)       0 kB (  0%)
 dump files                     :   3.28 (  6%)   1.85 (  9%)   5.30 (  7%)       0 kB (  0%)
 callgraph construction         :   0.70 (  1%)   0.28 (  1%)   1.07 (  1%)   99328 kB (  2%)
 callgraph optimization         :   1.38 (  2%)   0.74 (  4%)   2.03 (  3%)    1026 kB (  0%)
 callgraph functions expansion  :  47.27 ( 81%)  15.51 ( 75%)  62.89 ( 80%) 2827825 kB ( 69%)
 callgraph ipa passes           :   8.19 ( 14%)   3.26 ( 16%)  11.45 ( 15%)  709147 kB ( 17%)
 ipa function summary           :   0.34 (  1%)   0.08 (  0%)   0.43 (  1%)   97794 kB (  2%)
 ipa dead code removal          :   0.25 (  0%)   0.01 (  0%)   0.27 (  0%)       0 kB (  0%)
 ipa inheritance graph          :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 ipa devirtualization           :   0.04 (  0%)   0.00 (  0%)   0.04 (  0%)       0 kB (  0%)
 ipa cp                         :   0.23 (  0%)   0.02 (  0%)   0.27 (  0%)    7169 kB (  0%)
 ipa inlining heuristics        :   0.19 (  0%)   0.00 (  0%)   0.22 (  0%)       0 kB (  0%)
 ipa function splitting         :   0.02 (  0%)   0.01 (  0%)   0.06 (  0%)       0 kB (  0%)
 ipa comdats                    :   0.05 (  0%)   0.00 (  0%)   0.05 (  0%)       0 kB (  0%)
 ipa various optimizations      :   0.06 (  0%)   0.00 (  0%)   0.06 (  0%)       0 kB (  0%)
 ipa reference                  :   0.10 (  0%)   0.00 (  0%)   0.11 (  0%)       0 kB (  0%)
 ipa profile                    :   0.07 (  0%)   0.00 (  0%)   0.06 (  0%)       0 kB (  0%)
 ipa pure const                 :   0.45 (  1%)   0.15 (  1%)   0.47 (  1%)       0 kB (  0%)
 ipa icf                        :   0.22 (  0%)   0.01 (  0%)   0.23 (  0%)       0 kB (  0%)
 ipa SRA                        :   0.13 (  0%)   0.00 (  0%)   0.14 (  0%)    5120 kB (  0%)
 ipa free lang data             :   0.04 (  0%)   0.00 (  0%)   0.04 (  0%)       0 kB (  0%)
 ipa free inline summary        :   0.08 (  0%)   0.00 (  0%)   0.07 (  0%)       0 kB (  0%)
 cfg construction               :   0.07 (  0%)   0.01 (  0%)   0.19 (  0%)       0 kB (  0%)
 cfg cleanup                    :   0.73 (  1%)   0.23 (  1%)   0.95 (  1%)       0 kB (  0%)
 trivially dead code            :   0.30 (  1%)   0.06 (  0%)   0.30 (  0%)       0 kB (  0%)
 df scan insns                  :   0.81 (  1%)   0.21 (  1%)   0.93 (  1%)    3072 kB (  0%)
 df multiple defs               :   0.28 (  0%)   0.06 (  0%)   0.41 (  1%)       0 kB (  0%)
 df reaching defs               :   1.48 (  3%)   0.20 (  1%)   1.63 (  2%)       0 kB (  0%)
 df live regs                   :   1.12 (  2%)   0.26 (  1%)   1.33 (  2%)       0 kB (  0%)
 df live&initialized regs       :   0.51 (  1%)   0.19 (  1%)   0.66 (  1%)       0 kB (  0%)
 df must-initialized regs       :   0.11 (  0%)   0.06 (  0%)   0.14 (  0%)       0 kB (  0%)
 df use-def / def-use chains    :   0.36 (  1%)   0.04 (  0%)   0.43 (  1%)       0 kB (  0%)
 df reg dead/unused notes       :   1.69 (  3%)   0.20 (  1%)   1.81 (  2%)   12288 kB (  0%)
 register information           :   0.38 (  1%)   0.04 (  0%)   0.39 (  0%)       0 kB (  0%)
 alias analysis                 :   0.82 (  1%)   0.17 (  1%)   1.15 (  1%)   36865 kB (  1%)
 alias stmt walking             :   0.06 (  0%)   0.04 (  0%)   0.07 (  0%)       0 kB (  0%)
 register scan                  :   0.07 (  0%)   0.03 (  0%)   0.11 (  0%)       0 kB (  0%)
 rebuild jump labels            :   0.16 (  0%)   0.06 (  0%)   0.14 (  0%)       0 kB (  0%)
 preprocessing                  :   0.39 (  1%)   0.32 (  2%)   0.49 (  1%)   44508 kB (  1%)
 lexical analysis               :   0.32 (  1%)   0.39 (  2%)   0.73 (  1%)       0 kB (  0%)
 parser (global):   0.11 (  0%)   

[Bug ipa/60243] IPA is slow on large cgraph tree

2019-11-21 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60243

--- Comment #27 from Jan Hubicka  ---
The profile_estimate issue is still here; the inliner and early inliner issues
seem solved. It seems that ipa_profile just orders the nodes for propagation
the wrong way: we propagate from callers to callees, while the toposorter is
meant for propagation in the opposite direction.
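
The cost of a mismatched visiting order can be sketched on a toy call chain.
This is an assumed model, not the ipa_profile code: node i calls node i + 1, a
flag is propagated from callers to callees, and the whole node list is swept
repeatedly until nothing changes. The function sweeps_until_fixed_point and
the constant TOY_MAX_NODES are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_NODES 1024

/* Toy call chain: node i calls node i + 1.  A flag set on node 0 is
   propagated from callers to callees until nothing changes.  Returns the
   number of sweeps over all nodes, including the final sweep that
   detects the fixed point.  */
unsigned sweeps_until_fixed_point(size_t n, int callers_first)
{
  unsigned char flag[TOY_MAX_NODES];
  unsigned sweeps = 0;
  int changed;

  assert(n >= 1 && n <= TOY_MAX_NODES);
  memset(flag, 0, sizeof flag);
  flag[0] = 1;

  do {
    changed = 0;
    for (size_t k = 1; k < n; k++) {
      /* Visit callees after their callers, or in the reverse order.  */
      size_t i = callers_first ? k : n - k;
      if (flag[i - 1] && !flag[i]) {
        flag[i] = 1;
        changed = 1;
      }
    }
    sweeps++;
  } while (changed);

  return sweeps;
}
```

With six nodes, visiting callers first needs two sweeps, while visiting callees
first needs six, since each sweep moves the flag by only one node. On a long
chain the mismatched order therefore turns a linear propagation into a
quadratic one in this model.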

operand_scan seems slow too.

Time variable   usr   sys  wall
  GGC
 phase setup:   0.00 (  0%)   0.00 (  0%)   0.00 (  0%)
   1237 kB (  0%)
 phase parsing  :   6.63 (  9%)   6.77 ( 77%)  13.41 ( 17%)
 655497 kB ( 20%)
 phase opt and generate :  64.47 ( 91%)   2.07 ( 23%)  66.57 ( 83%)
2603397 kB ( 80%)
 garbage collection :   0.64 (  1%)   0.00 (  0%)   0.65 (  1%)
  0 kB (  0%)
 dump files :   0.05 (  0%)   0.01 (  0%)   0.04 (  0%)
  0 kB (  0%)
 callgraph construction :   0.91 (  1%)   0.01 (  0%)   0.83 (  1%)
 399235 kB ( 12%)
 callgraph optimization :   0.37 (  1%)   0.00 (  0%)   0.43 (  1%)
  0 kB (  0%)
 callgraph functions expansion  :  15.98 ( 22%)   1.20 ( 14%)  17.18 ( 21%)
 297309 kB (  9%)
 callgraph ipa passes   :  40.57 ( 57%)   0.40 (  5%)  40.99 ( 51%)
 617751 kB ( 19%)
 ipa function summary   :   0.14 (  0%)   0.00 (  0%)   0.14 (  0%)
   1807 kB (  0%)
 ipa dead code removal  :   0.22 (  0%)   0.00 (  0%)   0.24 (  0%)
  0 kB (  0%)
 ipa cp :   0.97 (  1%)   0.03 (  0%)   1.03 (  1%)
 327514 kB ( 10%)
 ipa inlining heuristics:   0.72 (  1%)   0.00 (  0%)   0.63 (  1%)
  84183 kB (  3%)
 ipa function splitting :   0.02 (  0%)   0.00 (  0%)   0.05 (  0%)
  0 kB (  0%)
 ipa various optimizations  :   0.69 (  1%)   0.20 (  2%)   0.89 (  1%)
 128398 kB (  4%)
 ipa reference  :   0.05 (  0%)   0.00 (  0%)   0.05 (  0%)
  0 kB (  0%)
 ipa profile:  18.24 ( 26%)   0.00 (  0%)  18.25 ( 23%)
  0 kB (  0%)
 ipa pure const :   0.45 (  1%)   0.00 (  0%)   0.46 (  1%)
  0 kB (  0%)
 ipa icf:   0.17 (  0%)   0.02 (  0%)   0.17 (  0%)
  0 kB (  0%)
 ipa SRA:   0.21 (  0%)   0.00 (  0%)   0.21 (  0%)
102 kB (  0%)
 ipa free inline summary:   0.03 (  0%)   0.00 (  0%)   0.04 (  0%)
  0 kB (  0%)
 cfg cleanup:   0.00 (  0%)   0.01 (  0%)   0.02 (  0%)
  0 kB (  0%)
 trivially dead code:   0.12 (  0%)   0.03 (  0%)   0.12 (  0%)
  0 kB (  0%)
 df scan insns  :   0.85 (  1%)   0.14 (  2%)   1.28 (  2%)
 46 kB (  0%)
 df multiple defs   :   0.30 (  0%)   0.06 (  1%)   0.31 (  0%)
  0 kB (  0%)
 df reaching defs   :   0.69 (  1%)   0.05 (  1%)   0.63 (  1%)
  0 kB (  0%)
 df live regs   :   0.49 (  1%)   0.02 (  0%)   0.57 (  1%)
  0 kB (  0%)
 df live&initialized regs   :   0.19 (  0%)   0.01 (  0%)   0.12 (  0%)
  0 kB (  0%)
 df must-initialized regs   :   0.10 (  0%)   0.00 (  0%)   0.10 (  0%)
  0 kB (  0%)
 df use-def / def-use chains:   0.44 (  1%)   0.05 (  1%)   0.40 (  1%)
  0 kB (  0%)
 df reg dead/unused notes   :   1.35 (  2%)   0.09 (  1%)   1.15 (  1%)
 747 kB (  0%)
 register information   :   0.16 (  0%)   0.00 (  0%)   0.18 (  0%)
   0 kB (  0%)
 alias analysis :   0.16 (  0%)   0.00 (  0%)   0.11 (  0%)
436 kB (  0%)
 alias stmt walking :   0.49 (  1%)   0.07 (  1%)   0.67 (  1%)
  0 kB (  0%)
 register scan  :   0.04 (  0%)   0.00 (  0%)   0.01 (  0%)
  0 kB (  0%)
 rebuild jump labels:   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)
  0 kB (  0%)
 preprocessing  :   2.37 (  3%)   2.37 ( 27%)   4.49 (  6%)
 383477 kB ( 12%)
 lexical analysis   :   1.88 (  3%)   2.13 ( 24%)   4.20 (  5%)
  0 kB (  0%)
 parser (global):   0.01 (  0%)   0.01 (  0%)   0.03 (  0%)
   1442 kB (  0%)
 parser function body   :   2.19 (  3%)   2.26 ( 26%)   4.50 (  6%)
 270577 kB (  8%)
 early inlining heuristics  :   2.80 (  4%)   0.03 (  0%)   2.81 (  4%)
   3076 kB (  0%)
 inline parameters  :   6.43 (  9%)   0.14 (  2%)   6.74 (  8%)
  31127 kB (  1%)
 integration:   0.17 (  0%)   0.00 (  0%)   0.08 (  0%)
   6789 kB (  0%)
 tree gimplify  :   1.01 (  1%)   0.03 (  0%)   1.15 (  1%)
 610970 kB ( 19%)
 tree eh:   0.50 (  1%)   0.03 (  0%)   0.44 (  1%)
  0 kB (  0%)
 tree CFG construction  :   3.50 (  5%)   0.02 (  0%)   3.74 (  5%)
 628087 kB ( 19%)
 tree CFG cleanup   

[Bug tree-optimization/92632] New: Calculix regression

2019-11-22 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92632

Bug ID: 92632
   Summary: Calculix regression
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

LNT testing shows a 137% regression of calculix with LTO and PGO:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=288.170.0
The range is between
Revision: fbbadf0018292a93 (2019-11-15 03:28)
and
Revision: 1e9cd853b7ecae82 (2019-11-18 02:22)

The diff from this range is:
+2019-11-18  Hongtao Liu  
+
+   PR target/92448
+   * config/i386/i386-expand.c (ix86_expand_set_or_cpymem):
+   Replace TARGET_AVX128_OPTIMAL with TARGET_AVX256_SPLIT_REGS.
+   * config/i386/i386-option.c (ix86_vec_cost): Ditto.
+   (ix86_reassociation_width): Ditto.
+   * config/i386/i386-options.c (ix86_option_override_internal):
+   Replace TARGET_AVX128_OPTIAML with
+   ix86_tune_features[X86_TUNE_AVX128_OPTIMAL]
+   * config/i386/i386.h (TARGET_AVX256_SPLIT_REGS): New macro.
+   (TARGET_AVX128_OPTIMAL): Deleted.
+   * config/i386/x86-tune.def (X86_TUNE_AVX256_SPLIT_REGS): New
+   DEF_TUNE.
+
+2019-11-16  Segher Boessenkool  
+
+   * config/rs6000/rs6000.md (cceq_ior_compare): Rename to...
+   (@cceq_ior_compare_ for GPR): ... this.  Allow GPR instead of
+   just SI.
+   (cceq_rev_compare): Rename to...
+   (@cceq_rev_compare_ for GPR): ... this.  Allow GPR instead of
+   just SI.
+   (define_split for tf_): Add SImode first argument to
+   gen_cceq_ior_compare.
+
+2019-11-16  Segher Boessenkool  
+
+   * common/config/powerpcspe: Delete.
+
+2019-11-16  Richard Sandiford  
+
+   * config/aarch64/aarch64-sve.md (aarch64_wrffr): Wrap the FFRT
+   output in UNSPEC_WRFFR.
+
+2019-11-16  Richard Sandiford  
+
+   * tree-data-ref.c (create_intersect_range_checks_index): Rewrite
+   the index tests to have the form (unsigned T) (B - A + bias) <= limit.
+
+2019-11-16  Richard Sandiford  
+
+   * tree-data-ref.c (create_intersect_range_checks_index)
+   (create_intersect_range_checks): Print dump messages.
+
+2019-11-16  Richard Sandiford  
+
+   * tree-data-ref.c (dump_alias_pair): New function.
+   (prune_runtime_alias_test_list): Use it to dump each merged alias pair.
+
+2019-11-16  Richard Sandiford  
+
+   * tree-data-ref.h (DR_ALIAS_MIXED_STEPS): New flag.
+   * tree-data-ref.c (prune_runtime_alias_test_list): Set it when
+   merging data references with different steps.
+   (create_intersect_range_checks_index): Take a
+   dr_with_seg_len_pair_t instead of two dr_with_seg_lens.
+   Bail out if DR_ALIAS_MIXED_STEPS is set.
+   (create_intersect_range_checks): Take a dr_with_seg_len_pair_t
+   instead of two dr_with_seg_lens.  Update call to
+   create_intersect_range_checks_index.
+   (create_runtime_alias_checks): Update call accordingly.
+
+2019-11-16  Richard Sandiford  
+
+   * tree-data-ref.h (DR_ALIAS_RAW, DR_ALIAS_WAR, DR_ALIAS_WAW)
+   (DR_ALIAS_ARBITRARY, DR_ALIAS_SWAPPED, DR_ALIAS_UNSWAPPED): New flags.
+   (dr_with_seg_len_pair_t::sequencing): New enum.
+   (dr_with_seg_len_pair_t::flags): New member variable.
+   (dr_with_seg_len_pair_t::dr_with_seg_len_pair_t): Take a sequencing
+   parameter and initialize the flags member variable.
+   * tree-loop-distribution.c (compute_alias_check_pairs): Update
+   call accordingly.
+   * tree-vect-data-refs.c (vect_prune_runtime_alias_test_list): Likewise.
+   Ensure the two data references in an alias pair are in statement
+   order, if there is a defined order.
+   * tree-data-ref.c (prune_runtime_alias_test_list): Use
+   DR_ALIAS_SWAPPED and DR_ALIAS_UNSWAPPED to record whether we've
+   swapped the references in a dr_with_seg_len_pair_t.  OR together
+   the flags when merging two dr_with_seg_len_pair_ts.  After merging,
+   try to restore the original dr_with_seg_len order, updating the
+   flags if that fails.
+
+2019-11-16  Richard Sandiford  
+
+   * tree-data-ref.c (prune_runtime_alias_test_list): Delay
+   swapping the dr_as based on init values until we've decided
+   whether to merge them.
+
+2019-11-16  Richard Sandiford  
+
+   * tree-data-ref.c (prune_runtime_alias_test_list): Sort the
+   two accesses in each dr_with_seg_len_pair_t before trying to
+   combine separate dr_with_seg_len_pair_ts.
+   * tree-loop-distribution.c (compute_alias_check_pairs): Don't do
+   that here.
+   * tree-vect-data-refs.c (vect_prune_runtime_alias_test_list): Likewise.
+
+2019-11-16  Richard Sandiford  
+
+   * config/aarch64/aarch64-sve.md
+   (scatter_store): Extend to...
+   (scatt

[Bug tree-optimization/92645] New: Hand written vector code is 450 times slower when compiled with GCC compared to Clang

2019-11-24 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92645

Bug ID: 92645
   Summary: Hand written vector code is 450 times slower when
compiled with GCC compared to Clang
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Hi,
the attached are preprocessed files for Skia with the Clang ifdefs removed, so
we get roughly the same file for GCC and Clang. The inner loops of
_ZN3hsw16blit_row_color32EPjPKjij, _ZN3hsw16blit_row_color32EPjPKjij,
_ZN3hsw16blit_row_color32EPjPKjij and _ZN3hsw16blit_row_color32EPjPKjij
look a lot worse when compiled by GCC than by Clang.

I also added flatten to eliminate the inlining difference. Clang has heuristics
that make functions with hand-written vector code hot.

GCC code packs via stack:
  0.43 │   mov  %ax,0xae(%rsp)
  0.03 │   movzbl   0x78(%rsp),%eax
  0.02 │   mov  %cx,0xd8(%rsp)
  0.02 │   mov  %ax,0xb0(%rsp)
  0.54 │   vpextrb  $0x9,%xmm5,%eax
  0.16 │   mov  %ax,0xb2(%rsp)
  0.51 │   vpextrb  $0xa,%xmm5,%eax
  0.21 │   mov  %ax,0xb4(%rsp)
  0.16 │   vpextrb  $0xb,%xmm5,%eax
  0.46 │   mov  %ax,0xb6(%rsp)
  0.24 │   vpextrb  $0xc,%xmm5,%eax
  0.28 │   mov  %ax,0xb8(%rsp)
  0.41 │   vpextrb  $0xd,%xmm5,%eax
  0.20 │   mov  %ax,0xba(%rsp)
  0.47 │   vpextrb  $0xe,%xmm5,%eax
  0.92 │   mov  %ax,0xbc(%rsp)
  0.72 │   vpextrb  $0xf,%xmm5,%eax
  1.24 │   mov  %ax,0xbe(%rsp)
 10.94 │   vmovdqa  0xa0(%rsp),%ymm4
  0.02 │   mov  %cx,0xda(%rsp)
  0.00 │   mov  %cx,0xdc(%rsp)
       │   mov  %cx,0xde(%rsp)
 10.34 │   vpmullw  0xc0(%rsp),%ymm4,%ymm0
  2.05 │   vpaddw   %ymm1,%ymm0,%ymm0
  0.50 │   vpaddw   %ymm3,%ymm0,%ymm0
  0.00 │   mov  %r9,0x58(%rsp)
  0.52 │   vpsrlw   $0x8,%ymm0,%ymm0
  0.39 │   vpextrw  $0x0,%xmm0,%eax
  0.69 │   mov  %al,%r8b
  0.17 │   vpextrw  $0x1,%xmm0,%eax
  0.51 │   mov  %r8,0x50(%rsp)
  6.87 │   vmovdqa  0x50(%rsp),%xmm5
  1.08 │   vpinsrb  $0x1,%eax,%xmm5,%xmm1
  0.00 │   vpextrw  $0x2,%xmm0,%eax
  0.73 │   vpinsrb  $0x2,%eax,%xmm1,%xmm1
  0.02 │   vpextrw  $0x3,%xmm0,%eax

  0.75 │   vpinsrb  $0x3,%eax,%xmm1,%xmm1
  0.10 │   vpextrw  $0x4,%xmm0,%eax
  0.98 │   vpinsrb  $0x4,%eax,%xmm1,%xmm1
  0.16 │   vpextrw  $0x5,%xmm0,%eax
  1.00 │   vpinsrb  $0x5,%eax,%xmm1,%xmm1
  0.22 │   vpextrw  $0x6,%xmm0,%eax
  1.10 │   vpinsrb  $0x6,%eax,%xmm1,%xmm1
  0.30 │   vpextrw  $0x7,%xmm0,%eax
  0.31 │   vextracti128 $0x1,%ymm0,%xmm0
  0.90 │   vpinsrb  $0x7,%eax,%xmm1,%xmm6
  0.21 │   vpextrw  $0x0,%xmm0,%eax
  0.35 │   vmovaps  %xmm6,0x50(%rsp)
  1.15 │   mov  0x58(%rsp),%r9
  0.13 │   mov  0x50(%rsp),%r8
  0.29 │   mov  %al,%r9b
  0.49 │   mov  %r8,0x50(%rsp)
  0.07 │   vpextrw  $0x1,%xmm0,%eax
  0.45 │   mov  %r9,0x58(%rsp)
  7.08 │   vmovdqa  0x50(%rsp),%xmm7
  1.19 │   vpinsrb  $0x9,%eax,%xmm7,%xmm1
  0.00 │   vpextrw  $0x2,%xmm0,%eax
  0.78 │   vpinsrb  $0xa,%eax,%xmm1,%xmm1
  0.00 │   vpextrw  $0x3,%xmm0,%eax
  0.77 │   vpinsrb  $0xb,%eax,%xmm1,%xmm1
  0.01 │   vpextrw  $0x4,%xmm0,%eax
  0.86 │   vpinsrb  $0xc,%eax,%xmm1,%xmm1
  0.03 │   vpextrw  $0x5,%xmm0,%eax
  0.88 │   vpinsrb  $0xd,%eax,%xmm1,%xmm1
  0.04 │   vpextrw  $0x6,%xmm0,%eax
  0.97 │   vpinsrb  $0xe,%eax,%xmm1,%xmm1
  0.08 │   vpextrw  $0x7,%xmm0,%eax
  1.44 │   vpinsrb  $0xf,%eax,%xmm1,%xmm0
  1.37 │   vpextrd  $0x1,%xmm0,%eax
  0.13 │   vinsertps $0xe,%xmm0,%xmm0,%xmm1
  0.02 │   vmovaps  %xmm0,0x50(%rsp)
  2.17 │   vpinsrd  $0x1,%eax,%xmm1,%xmm1



Clang code:

Percent│   vpmullw  %ymm0,%ymm2,%ymm2
       │   vpaddw   %ymm1,%ymm2,%ymm2
       │   vpsrlw   $0x8,%ymm2,%ymm2
       │   vextracti128 $0x1,%ymm2,%xmm3
       │   vpackuswb %xmm3,%xmm2,%xmm2
       │   vmovdqu  %xmm2,(%rdi)
       │   add  $0x10,%rsi
       │   add  $0x10,%rdi
       │   mov  %r9d,%eax
       │   cmp  $0x4,%r9d
       │   jae  39179b0
       │   jmp  3917a02
       │   mov  %edx,%eax
  0.29 │   cmp  $0x4,%r9d
  0.00 │   jb   3917a02
  0.07 │   nop
  3.95 │   vpmovzxbw (%rsi),%ymm2
 13.41 │   vpmullw  %ymm0,%ymm2,%ymm2
 13.87

[Bug tree-optimization/92645] Hand written vector code is 450 times slower when compiled with GCC compared to Clang

2019-11-24 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92645

--- Comment #1 from Jan Hubicka  ---
Created attachment 47340
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47340&action=edit
Clang source

[Bug tree-optimization/92645] Hand written vector code is 450 times slower when compiled with GCC compared to Clang

2019-11-24 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92645

--- Comment #2 from Jan Hubicka  ---
Created attachment 47341
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47341&action=edit
clang output with -O2 -mavx2 -mf16c -mfma

[Bug tree-optimization/92645] Hand written vector code is 450 times slower when compiled with GCC compared to Clang

2019-11-24 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92645

--- Comment #3 from Jan Hubicka  ---
Created attachment 47342
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47342&action=edit
GCC source

[Bug tree-optimization/92645] Hand written vector code is 450 times slower when compiled with GCC compared to Clang

2019-11-24 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92645

--- Comment #4 from Jan Hubicka  ---
Created attachment 47343
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47343&action=edit
GCC 10 output

[Bug bootstrap/92680] New: PGO bootstrap is broken with --with-build-config=bootstrap-lto-lean and in-itree mpfr

2019-11-26 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92680

Bug ID: 92680
   Summary: PGO bootstrap is broken with
--with-build-config=bootstrap-lto-lean and in-itree
mpfr
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: bootstrap
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

A build with bootstrap-lto-lean and in-tree mpfr fails with a profile mismatch
on set_d.o.  This is caused by the fact that mpfr misconfigures itself with
LTO. Its configure script scans assembly to detect the format of long double,
and this gives the wrong answer with LTO, leading to a suboptimal configuration.

[Bug other/92681] New: PGO bootstrap is broken with --with-build-config=bootstrap-lto-lean is not training non-C++ frontends

2019-11-26 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92681

Bug ID: 92681
   Summary: PGO bootstrap is broken with
--with-build-config=bootstrap-lto-lean is not training
non-C++ frontends
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

This definitely leads to a suboptimal compile-time experience with Ada, Fortran,
Go, etc.

[Bug tree-optimization/92711] New: GCC 10 libxul.so -fprofile-generate binary is 360MB while clang needs only 163MB.

2019-11-28 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92711

Bug ID: 92711
   Summary: GCC 10 libxul.so -fprofile-generate binary is 360MB
while clang needs only 163MB.
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

It seems that profiling became more expensive in GCC 10 compared to Clang or
previous GCC releases.
The Clang binary is here
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/H_iSouCVTha9mEw9y5XO5Q/runs/0/artifacts/public/build/target.tar.bz2
and a more or less comparable GCC build is here
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/NOUqVShcSMaJn5j3g5nEYg/runs/0/artifacts/public/build/target.tar.bz2
It also seems that profile streaming is slower in the GCC build (which is
important since Firefox forks multiple times on startup and again when creating
a new tab, and each fork triggers profile data stream-out).

[Bug tree-optimization/92711] GCC 10 libxul.so -fprofile-generate binary is 360MB while clang needs only 163MB.

2019-11-28 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92711

--- Comment #1 from Jan Hubicka  ---
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/ObkoHsHHSriQdU0Twc12Wg/runs/0/artifacts/public/build/target.tar.bz2
This is a GCC 9 build: 310MB, so still a lot bigger than Clang, but better than
GCC 10.

[Bug tree-optimization/92711] GCC 10 libxul.so -fprofile-generate binary is 360MB while clang needs only 163MB.

2019-11-28 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92711

Jan Hubicka  changed:

   What|Removed |Added

 CC||mliska at suse dot cz
 Blocks||45375

--- Comment #2 from Jan Hubicka  ---
Actually, what I thought was a GCC 9 build is in fact a GCC 10 build.  It seems
that today's profile fixes made the binary noticeably smaller, which is
promising.  But it is still very large.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375
[Bug 45375] [meta-bug] Issues with building Mozilla (i.e. Firefox) with LTO

[Bug tree-optimization/92711] GCC 10 libxul.so -fprofile-generate binary is 360MB while clang needs only 163MB.

2019-11-28 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92711

--- Comment #3 from Jan Hubicka  ---
Proper GCC 9 -fprofile-generate build is 296MB
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/aMGsffWPQ1qzjgj4LIqcwQ/runs/0/artifacts/public/build/target.tar.bz2
So GCC 10 (310MB after the profile fixes) is about a 5% regression compared to
GCC 9.

[Bug ipa/92737] New: cgraph_node and varpool_node needs explicit constructor

2019-11-30 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92737

Bug ID: 92737
   Summary: cgraph_node and varpool_node needs explicit
constructor
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

cgraph_node and varpool_node are non-PODs, but they are still allocated via
alloc_cleared, and we rely on various flags being set to 0.
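
To illustrate the concern, here is a minimal sketch of what an explicit
constructor buys. It uses a hypothetical toy_node type and plain malloc, not
GCC's real cgraph_node or its allocator. The storage is deliberately filled
with garbage before construction to show that the flags no longer depend on
the allocator clearing memory.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <new>
#include <type_traits>

// Hypothetical stand-in for a symbol-table node; NOT GCC's real cgraph_node.
struct toy_node
{
  // The explicit constructor gives every field a defined initial value,
  // so correctness no longer depends on the allocator clearing memory.
  toy_node () : order (0), analyzed (false), inlined_to (nullptr) {}
  virtual ~toy_node () {}  // a virtual member makes the type a non-POD

  int order;
  bool analyzed;
  toy_node *inlined_to;
};

static_assert (!std::is_trivially_default_constructible<toy_node>::value,
               "toy_node needs its constructor to be run");

// Allocate raw storage, fill it with garbage on purpose, then construct.
inline toy_node *
make_toy_node ()
{
  void *mem = std::malloc (sizeof (toy_node));
  if (!mem)
    return nullptr;
  std::memset (mem, 0xff, sizeof (toy_node));
  return new (mem) toy_node ();
}

// Run the destructor explicitly and release the raw storage.
inline void
destroy_toy_node (toy_node *node)
{
  assert (node != nullptr);
  node->~toy_node ();
  std::free (node);
}
```

With a constructor like this, a node obtained from any allocator starts with
its flags cleared, which is the property the cleared allocation provides only
implicitly.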

[Bug tree-optimization/92738] New: [10 regression] Large code size growth for -O2 binaries between 2019-05-19...2019-05-29

2019-11-30 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92738

Bug ID: 92738
   Summary: [10 regression] Large code size growth for -O2
binaries between 2019-05-19...2019-05-29
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

[Bug tree-optimization/92738] [10 regression] Large code size growth for -O2 binaries between 2019-05-19...2019-05-29

2019-11-30 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92738

--- Comment #1 from Jan Hubicka  ---
This is seen on
https://lnt.opensuse.org/db_default/v4/SPEC/graph?highlight_run=7361&plot.574=31.574.4

[Bug tree-optimization/92738] [10 regression] Large code size growth for -O2 binaries between 2019-05-19...2019-05-29

2019-11-30 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92738

--- Comment #2 from Jan Hubicka  ---
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=10.542.4&highlight_run=7354
shows a shorter range:
+2019-05-24  Jakub Jelinek  
+
+   * tree-core.h (enum omp_clause_code): Add OMP_CLAUSE__CONDTEMP_.
+   * tree.h (OMP_CLAUSE_DECL): Use OMP_CLAUSE__CONDTEMP_ instead of
+   OMP_CLAUSE__REDUCTEMP_.
+   * tree.c (omp_clause_num_ops, omp_clause_code_name): Add
+   OMP_CLAUSE__CONDTEMP_.

+2019-05-19  Segher Boessenkool  
+
+   * config/rs6000/constraints.md (define_register_constraint "wo"):
+   Delete.
+   * config/rs6000/rs6000.h (enum r6000_reg_class_enum): Delete
+   RS6000_CONSTRAINT_wo.
+   * config/rs6000/rs6000.c (rs6000_debug_reg_global): Adjust.
+   (rs6000_init_hard_regno_mode_ok): Adjust.
+   * config/rs6000/rs6000.md: Replace "wo" constraint by "wa" with "p9v".
+   * config/rs6000/altivec.md: Ditto.
+   * doc/md.texi (Machine Constraints): Adjust.
+
 2019-05-18  Iain Sandoe  

It may be easy to bisect.

[Bug tree-optimization/92740] New: induct2 (from polyhedron) regresses 267% with -O2 -ftree-vectorize -ftree-slp-vectorize -fvect-cost-modes=dynamic compared to -O2

2019-11-30 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92740

Bug ID: 92740
   Summary: induct2 (from polyhedron) regresses 267% with -O2
-ftree-vectorize -ftree-slp-vectorize
-fvect-cost-modes=dynamic compared to -O2
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

This is on zen2 hardware.

[Bug tree-optimization/92740] induct2 (from polyhedron) regresses 267% with -O2 -ftree-vectorize -ftree-slp-vectorize -fvect-cost-modes=dynamic compared to -O2

2019-11-30 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92740

--- Comment #1 from Jan Hubicka  ---
There is also a 75% regression on fft2 and 5% on rnflow2.
Induct2 reproduces on Kaby Lake; fft2 and rnflow seem Zen-specific.

[Bug tree-optimization/92825] New: Unnecesary stack protection and missed SLP vectorization in Firefox's LightPixel.

2019-12-05 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92825

Bug ID: 92825
   Summary: Unnecesary stack protection and missed SLP
vectorization in Firefox's LightPixel.
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Created attachment 47428
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47428&action=edit
full testcase

uint32_t DiffuseLightingSoftware::LightPixel(const Point3D& aNormal,
                                             const Point3D& aVectorToLight,
                                             uint32_t aColor) {
  Float dotNL = std::max(0.0f, aNormal.DotProduct(aVectorToLight));
  Float diffuseNL = mDiffuseConstant * dotNL;

  union {
    uint32_t bgra;
    uint8_t components[4];
  } color = {aColor};
  color.components[B8G8R8A8_COMPONENT_BYTEOFFSET_B] = umin(
      uint32_t(diffuseNL * color.components[B8G8R8A8_COMPONENT_BYTEOFFSET_B]),
      255U);
  color.components[B8G8R8A8_COMPONENT_BYTEOFFSET_G] = umin(
      uint32_t(diffuseNL * color.components[B8G8R8A8_COMPONENT_BYTEOFFSET_G]),
      255U);
  color.components[B8G8R8A8_COMPONENT_BYTEOFFSET_R] = umin(
      uint32_t(diffuseNL * color.components[B8G8R8A8_COMPONENT_BYTEOFFSET_R]),
      255U);
  color.components[B8G8R8A8_COMPONENT_BYTEOFFSET_A] = 255;
  return color.bgra;
}

(full testcase attached)
Building with -O3 -fstack-protector-strong results in slower code with GCC 10
than with GCC 9 or Clang.
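
For experimenting outside Firefox, the per-channel part of the function can be
sketched as a standalone routine. This is only an approximation of the
testcase: it uses shifts instead of the union, assumes the channel order
B, G, R, A from the lowest byte up, and the names light_pixel and
scale_channel are made up for the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Scale one 8-bit channel and clamp to 255, like umin(uint32_t(...), 255U).
static inline uint32_t
scale_channel (float scale, uint32_t channel)
{
  return std::min (uint32_t (scale * float (channel)), 255u);
}

// Apply the same scale to B, G and R and force alpha to 255.
// Assumed layout (hypothetical here): B in bits 0-7, G 8-15, R 16-23, A 24-31.
uint32_t
light_pixel (float diffuse_nl, uint32_t bgra)
{
  // The original clamps the dot product at 0, so the scale is non-negative.
  assert (diffuse_nl >= 0.0f);
  uint32_t b = scale_channel (diffuse_nl, bgra & 0xffu);
  uint32_t g = scale_channel (diffuse_nl, (bgra >> 8) & 0xffu);
  uint32_t r = scale_channel (diffuse_nl, (bgra >> 16) & 0xffu);
  return (0xffu << 24) | (r << 16) | (g << 8) | b;
}
```

The three channels go through identical scale, convert and clamp steps, which
is the kind of isomorphic straight-line code an SLP vectorizer looks for.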

GCC produces:
   │ 04390e20  const&,
   │
_ZN7mozilla3gfx12_GLOBAL__N_124SpecularLightingSoftware10LightPixelERKNS0_12Point3DTypedINS0_12UnknownUnitsEfEES7_j():
  0.19 │   push  %rbp
  0.60 │   pxor  %xmm5,%xmm5
  0.05 │   mov   %rsp,%rbp
  0.12 │   push  %rbx
  0.65 │   sub   $0x18,%rsp
  0.33 │   movss 0x4(%rdx),%xmm0
  0.10 │   movss (%rdx),%xmm1
  0.58 │   mov   %fs:0x28,%rax
  0.03 │   mov   %rax,-0x18(%rbp)
  0.22 │   xor   %eax,%eax
  0.07 │   movss pw_32+0x1588,%xmm3
  1.58 │   addss 0x8(%rdx),%xmm3
  0.67 │   addss %xmm5,%xmm0
  0.23 │   addss %xmm5,%xmm1
   │   movaps %xmm0,%xmm2
  0.41 │   movaps %xmm1,%xmm4
  0.87 │   mulss %xmm0,%xmm2
  0.28 │   mulss %xmm1,%xmm4
  3.71 │   addss %xmm2,%xmm4
  0.14 │   movaps%xmm3,%xmm2
  0.04 │   mulss %xmm3,%xmm2
  1.99 │   addss %xmm2,%xmm4
  0.15 │   movss 0x4(%rsi),%xmm2
  9.39 │   sqrtss %xmm4,%xmm4
  8.90 │   divss %xmm4,%xmm0
  2.10 │   divss %xmm4,%xmm3
  1.08 │   mulss %xmm0,%xmm2
  0.01 │   movss 0x8(%rsi),%xmm0

while Clang produces:
Percent│
_ZN7mozilla3gfx12_GLOBAL__N_124SpecularLightingSoftware10LightPixelERKNS0_12Point3DTypedINS0_12UnknownUnitsEfEES7_j():
  0.11 │   xorps %xmm0,%xmm0
  0.83 │   movss 0x4(%rdx),%xmm1
  3.29 │   addss %xmm0,%xmm1
  0.03 │   movss (%rdx),%xmm2
  0.08 │   movss 0x8(%rdx),%xmm3
  0.04 │   unpcklps  %xmm2,%xmm3
  0.59 │   movss
mozilla::gfx::ConvertComponentTransferFunctionToFilter(mozilla::gfx::ComponentTransferAttributes
const&, int, int, mozilla::gfx::DrawTarget*, RefPtr

[Bug tree-optimization/92834] New: misssed SLP vectorization in LightPixel

2019-12-05 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92834

Bug ID: 92834
   Summary: misssed SLP vectorization in LightPixel
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Created attachment 47431
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47431&action=edit
simplified testcase

Clang is able to vectorize LightPixel, which leads to about a 10% improvement in
the rasterflood-svg Firefox benchmark.

[Bug tree-optimization/92825] Unnecesary stack protection in Firefox's LightPixel.

2019-12-05 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92825

Jan Hubicka  changed:

   What|Removed |Added

Summary|Unnecesary stack protection |Unnecesary stack protection
   |and missed SLP  |in Firefox's LightPixel.
   |vectorization in Firefox's  |
   |LightPixel. |

--- Comment #2 from Jan Hubicka  ---
I have filed a separate bug for the SLP issue so we do not mix multiple things
in one PR.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92834

[Bug ipa/92809] [10 regression] error: calls_comdat_local is set outside of a comdat group

2019-12-06 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92809

Jan Hubicka  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |WORKSFORME

--- Comment #3 from Jan Hubicka  ---
This one works for me and should be fixed now by
2019-12-05  Jan Hubicka  

* ipa-inline-transform.c (inline_call): Fix maintenatnce of
comdat_local

[Bug tree-optimization/92834] misssed SLP vectorization in LightPixel

2019-12-06 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92834

--- Comment #2 from Jan Hubicka  ---
Created attachment 47436
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47436&action=edit
Clang assembly from perf

This is a Clang 9 build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7d7ee02817ab1ea39a6415862ab7889f5e416598&selectedJob=278948829
It has full logs and the binary, too.

/builds/worker/fetches/sccache/sccache /builds/worker/fetches/clang/bin/clang++
-o Unified_cpp_gfx_2d2.o -c -flto=thin
-I/builds/worker/workspace/build/src/obj-firefox/dist/system_wrappers -include
/builds/worker/workspace/build/src/config/gcc_hidden.h -U_FORTIFY_SOURCE
-D_FORTIFY_SOURCE=2 -fstack-protector-strong -DMOZILLA_CLIENT -include
/builds/worker/workspace/build/src/obj-firefox/mozilla-config.h
-Qunused-arguments -Qunused-arguments -Wall -Wbitfield-enum-conversion
-Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith
-Wshadow-field-in-constructor-modified -Wsign-compare -Wtype-limits
-Wunreachable-code -Wunreachable-code-return -Wwrite-strings
-Wno-invalid-offsetof -Wclass-varargs -Wfloat-overflow-conversion
-Wfloat-zero-conversion -Wloop-analysis -Wc++1z-compat -Wc++2a-compat -Wcomma
-Wimplicit-fallthrough -Werror=non-literal-null-conversion -Wstring-conversion
-Wtautological-overlap-compare -Wtautological-unsigned-enum-zero-compare
-Wtautological-unsigned-zero-compare -Wno-error=tautological-type-limit-compare
-Wno-inline-new-delete -Wno-error=type-limits -Wno-error=pessimizing-move
-Wno-error=nonnull -Wno-error=deprecated-declarations -Wno-error=array-bounds
-Wno-error=backend-plugin -Wno-error=return-std-move
-Wno-error=atomic-alignment -Wformat -Wformat-security
-Wno-gnu-zero-variadic-macro-arguments -Wno-unknown-warning-option
-Wno-return-type-c-linkage -D_GLIBCXX_USE_CXX11_ABI=0 -fno-sized-deallocation
-fno-aligned-new -fcrash-diagnostics-dir=/builds/worker/artifacts
-fno-strict-aliasing -fno-strict-aliasing -fno-exceptions -fno-rtti
-fno-exceptions -fno-math-errno -pthread -pipe
-I/builds/worker/workspace/build/src/obj-firefox/dist/stl_wrappers -DNDEBUG=1
-DTRIMMED=1 -DUSE_SSE2 -DOS_POSIX=1 -DOS_LINUX=1 -DUSE_CAIRO
-DMOZ2D_HAS_MOZ_CAIRO -DMOZ_ENABLE_FREETYPE -DSTATIC_EXPORTABLE_JS_API
-DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL
-I/builds/worker/workspace/build/src/gfx/2d
-I/builds/worker/workspace/build/src/obj-firefox/gfx/2d
-I/builds/worker/workspace/build/src/obj-firefox/ipc/ipdl/_ipdlheaders
-I/builds/worker/workspace/build/src/ipc/chromium/src
-I/builds/worker/workspace/build/src/ipc/glue
-I/builds/worker/workspace/build/src/gfx/skia
-I/builds/worker/workspace/build/src/gfx/skia/skia
-I/builds/worker/workspace/build/src/obj-firefox/dist/include
-I/builds/worker/workspace/build/src/obj-firefox/dist/include/nspr
-I/builds/worker/workspace/build/src/obj-firefox/dist/include/nss -fPIC -g
-Xclang -load -Xclang
/builds/worker/workspace/build/src/obj-firefox/build/clang-plugin/libclang-plugin.so
-Xclang -add-plugin -Xclang moz-check -O2 -fno-omit-frame-pointer
-funwind-tables -Werror -Wno-error=shadow
-I/builds/worker/workspace/build/src/obj-firefox/dist/include/cairo
-I/usr/include/freetype2  -MD -MP -MF .deps/Unified_cpp_gfx_2d2.o.pp  
Unified_cpp_gfx_2d2.cpp

[Bug c++/92831] CWG1299 extend_ref_init_temps_1 punts on COND_EXPRs

2019-12-07 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92831

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #7 from Jan Hubicka  ---
Thank you!  I wonder if your fix could also come with an optional warning that
would let us fix occurrences of this in Firefox, since requiring unreleased
compilers is not cool.

[Bug tree-optimization/92860] New: [8,9,10 regression] Global flags affected by -O settings are clobbered by optimize attribute

2019-12-08 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92860

Bug ID: 92860
   Summary: [8,9,10 regression] Global flags affected by -O
settings are clobbered by optimize attribute
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Hi,
the following testcase:
void linker_error();
__attribute__ ((optimize("-O0")))
int a ()
{
}
static int remove_me ()
{
  linker_error ();
}
void
main()
{
}

builds with GCC6 but not with GCC8, GCC9 and GCC10:
hubicka@lomikamen-jh:/aux/hubicka/trunk4/gcc$ gcc -O2 t.c
hubicka@lomikamen-jh:/aux/hubicka/trunk4/gcc$
/aux/hubicka/trunk-install/bin/gcc -O2 t.c
/usr/local/bin/ld: /tmp/cckSFE5R.o: in function `remove_me':
t.c:(.text+0x17): undefined reference to `linker_error'
collect2: error: ld returned 1 exit status

The problem is that while processing the optimize attribute for a we overwrite
flag_toplevel_reorder, which is affected by the optimization level but not
marked as Optimization.  I suppose there are other cases like this.
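
The mechanism can be modelled with a toy sketch. The toy_flags struct and both
functions below are hypothetical and do not mirror GCC's option-handling code.
The sketch only shows the pattern described above: handling a per-function
optimize attribute recomputes everything derived from -O, and afterwards only
the flags that are tracked as per-function state get their global value back.

```cpp
#include <cassert>

// Toy model of two flags; hypothetical, not GCC's real option structures.
struct toy_flags
{
  int optimize;           // the -O level, always saved and restored
  bool toplevel_reorder;  // derived from -O; saved only if marked
};

// Toy rule for -O-dependent defaults: reordering is enabled from -O1 up.
static void
apply_level (toy_flags &flags, int level)
{
  assert (level >= 0);
  flags.optimize = level;
  flags.toplevel_reorder = level > 0;
}

// Model handling of __attribute__((optimize("-O<level>"))) on one function.
// Returns the global flags as they look after the attribute was processed.
toy_flags
process_optimize_attribute (toy_flags global, int level, bool reorder_is_marked)
{
  toy_flags saved = global;
  apply_level (global, level);      // recompute all -O-derived defaults
  global.optimize = saved.optimize; // tracked flag: restored afterwards
  if (reorder_is_marked)
    global.toplevel_reorder = saved.toplevel_reorder;
  return global;                    // unmarked flags stay clobbered
}
```

In this model, marking the flag is what keeps the global -O2 value alive after
an optimize("-O0") attribute has been seen.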

[Bug tree-optimization/92860] [8,9,10 regression] Global flags affected by -O settings are clobbered by optimize attribute

2019-12-08 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92860

--- Comment #1 from Jan Hubicka  ---
Author: hubicka
Date: Sun Dec  8 13:50:32 2019
New Revision: 279089

URL: https://gcc.gnu.org/viewcvs?rev=279089&root=gcc&view=rev
Log:
PR tree-optimization/92860
* common.opt (fprofile-reorder-functions, ftoplevel-reorder): Add
Optimization flag.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/common.opt

[Bug tree-optimization/92860] [8/9/10 regression] Global flags affected by -O settings are clobbered by optimize attribute

2019-12-08 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92860

Jan Hubicka  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2019-12-08
Summary|[8,9,10 regression] Global  |[8/9/10 regression] Global
   |flags affected by -O|flags affected by -O
   |settings are clobbered by   |settings are clobbered by
   |optimize attribute  |optimize attribute
 Ever confirmed|0   |1

--- Comment #2 from Jan Hubicka  ---
Partly fixed on trunk - I think we have other flags/params missing the
Optimization attribute that behave the same way.

[Bug tree-optimization/92924] New: [10 regression] reproducible indirect call profile merging causes 80% slowdown in Firefox pref-reftest-singletons id-getter microbenchmarks

2019-12-12 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92924

Bug ID: 92924
   Summary: [10 regression] reproducible indirect call profile
merging causes 80% slowdown in Firefox
pref-reftest-singletons id-getter microbenchmarks
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

During the train run, in firefox2019-release-9test/dom/bindings function
;; Function
mozilla::dom::binding_detail::GenericGetter
(_ZN7mozilla3dom14binding_detail13GenericGetterINS1_16NormalThisPolicyENS1_15ThrowExceptionsEEEbP9JSContextjPN2JS5ValueE,
funcdef_no=39965, decl_uid=943222, cgraph_uid=24044, symbol_order=25045)

calls function get_id most of the time.  With GCC 9 we get:

Indirect call value:939751711 match:139135227 all:140993325.
Indirect call -> direct call from other module getter_18=> 939751711 (will
resolve only with LTO)

With GCC 10 we get:

Trying transformations on stmt ok_20 = getter_18 (cx_131(D), D.1007269,
self_129, D.1007259);
Indirect call counterall: 140957778, values: [2135000278:-1], [401302964:3804],
[1203869319:12375], [429856732:6018].

So the profile completely omits get_id and we fail to inline it. This has quite
a large performance impact on Firefox in general, since it seems to affect DOM
tree manipulation quite badly.
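
To make the comparison concrete, the GCC 9 counters above can be read as a hit
ratio: "match" out of "all" executions of the indirect call went to the
dominant target. A minimal sketch in C follows; the helper names and the
threshold parameter are hypothetical and are not taken from GCC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not a GCC function: fraction of executions of an
   indirect call that went to the recorded dominant target.  */
static double
dominant_target_ratio (int64_t match, int64_t all)
{
  assert (all > 0 && match >= 0 && match <= all);
  return (double) match / (double) all;
}

/* Illustrative policy only: treat the call as worth speculating on when the
   dominant target covers at least THRESHOLD of all executions.  */
static int
worth_promoting (int64_t match, int64_t all, double threshold)
{
  return dominant_target_ratio (match, all) >= threshold;
}
```

With the GCC 9 numbers (match 139135227, all 140993325) the ratio is roughly
0.987, while none of the per-target counts in the GCC 10 dump comes anywhere
near the total of 140957778.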

[Bug tree-optimization/92924] [10 regression] reproducible indirect call profile merging causes 80% slowdown in Firefox pref-reftest-singletons id-getter microbenchmarks

2019-12-12 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92924

Jan Hubicka  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org,
   ||mliska at suse dot cz

--- Comment #1 from Jan Hubicka  ---
This is caused by Martin's TOP_N_PROFILE work.

[Bug bootstrap/92653] [10 Regression] PGO bootstrap is broken with --with-build-config=bootstrap-lto-lean

2019-12-12 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92653

Jan Hubicka  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from Jan Hubicka  ---
The underlying updating issue was fixed last week.

[Bug rtl-optimization/92925] New: RTl expansion throws away alignment info

2019-12-12 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92925

Bug ID: 92925
   Summary: RTl expansion throws away alignment info
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Hi,
this testcase originally started as an attempt to produce a self-contained
reproducer for an ipa-cp bug.  The problem is that RTL expansion is too limited
and refuses to produce aligned moves for me.
struct a {long a1; long a2;};
struct b {long b; struct a a[10];};
struct c {long c; struct b b;__int128 e;};
int l;
__attribute__ ((noinline))
static void
set(struct b *bptr)
{
  for (int i=0;i<l;i+=2) bptr->a[i]=(struct a){};
}
test ()
{
  struct c c;
  set (&c.b);
}

Here ipa-cp propagates that BPTR is always aligned to 16 with misalignment 8.
This should let expansion use movaps for the "bptr->a[i]=(struct a){};"
constructions, but it does not.
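
The alignment argument can be checked with a little arithmetic. The sketch
below is only an illustration: the helper names are hypothetical, and it
assumes an LP64 target such as x86-64 Linux, where long is 8 bytes, so that
the array a starts at offset 8 in struct b and each element is 16 bytes.

```c
#include <assert.h>
#include <stddef.h>

struct a { long a1; long a2; };
struct b { long b; struct a a[10]; };

/* Hypothetical helper: given that a pointer P satisfies
   P % ALIGN == MISALIGN, return (P + OFFSET) % ALIGN.  */
static size_t
misalign_after_offset (size_t align, size_t misalign, size_t offset)
{
  assert (align != 0 && misalign < align);
  return (misalign + offset) % align;
}

/* Byte offset of bptr->a[i] from bptr.  */
static size_t
offset_of_elem (size_t i)
{
  return offsetof (struct b, a) + i * sizeof (struct a);
}
```

Under the LP64 assumption, misalign_after_offset (16, 8, offset_of_elem (i))
is 0 for every i, which is why a 16-byte aligned store would be valid for each
bptr->a[i].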

set:
.LFB0:
        .cfi_startproc
        movl    l(%rip), %ecx
        testl   %ecx, %ecx
        jle     .L1
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L3:
        movslq  %eax, %rdx
        pxor    %xmm0, %xmm0
        addl    $2, %eax
        salq    $4, %rdx
        movups  %xmm0, 8(%rdi,%rdx)
        cmpl    %ecx, %eax
        jl      .L3
.L1:

Overall the loop codegen is quite bad.

[Bug tree-optimization/92924] [10 regression] reproducible indirect call profile merging causes 80% slowdown in Firefox pref-reftest-singletons id-getter microbenchmarks

2019-12-13 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92924

--- Comment #2 from Jan Hubicka  ---
Increasing number of entries does not seem to help:
Indirect call counterall: 140960933, values: [429856732:-1], [484692916:1218],
[1203869319:12593], [245854587:8179], [1829590552:52], [401302964:7072],
[839575652:1422], [2041842690:854], [1646699888:428], [1259057892:1485],
[1777186207:1066], [901349086:1276], [2135000278:93], [1926702874:1281],
[2135000278:108], [717405103:513].

[Bug tree-optimization/92924] [10 regression] reproducible indirect call profile merging causes 80% slowdown in Firefox pref-reftest-singletons id-getter microbenchmarks

2019-12-13 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92924

Jan Hubicka  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2019-12-13
 Ever confirmed|0   |1

--- Comment #3 from Jan Hubicka  ---
I hacked libgcov to make merging no longer reproducible:
Index: libgcov-merge.c
===
--- libgcov-merge.c (revision 279167)
+++ libgcov-merge.c (working copy)
@@ -130,12 +130,27 @@ merge_topn_values_set (gcov_type *counte
}
}

+  if (j == GCOV_TOPN_VALUES)
+   {
+ int min = 0;
+ for (j = 1; j < GCOV_TOPN_VALUES; j++)
+   if (counters[2 * j + 1] < counters[2 * min + 1])
+ min = j;
+ if (counters[2 * min + 1] < read_counters[2 * i + 1])
+   {
+  counters[2 * min] = read_counters[2 * i];
+  counters[2 * min + 1] = read_counters[2 * i + 1];
+   }
+   }
+
+#if 0
   /* We haven't found a slot, bail out.  */
   if (j == GCOV_TOPN_VALUES)
{
  counters[1] = -1;
  return;
}
+#endif
 }
 }

with this I now get:
Trying transformations on stmt ok_20 = getter_18 (cx_131(D), D.1007269,
self_129, D.1007259);
Indirect call counterall: 140964179, values: [939751711:140005207],
[2105057161:149880], [708289787:11], [484692916:60283], [1777186207:5],
[245854587:38900], [1967741779:28458], [1785108787:23272], [429856732:17057],
[401533446:13488], [1203869319:10772], [183365365:9606], [401302964:7243],
[824316005:3379], [758688187:2121], [1528155396:1983].
/aux/hubicka/firefox-2019-2/dom/bindings/BindingUtils.cpp:3035:19: missed:
Indirect call -> direct call from other module getter_18=> 939751711 (will
resolve only with LTO)

So the histogram of destinations is indeed greatly dominated by one destination,
but there are very many others (not all are listed since I started dropping
them).

One way to make reproducible merging better is to drop destinations with small
trip counts before merging, but I am not sure it would help everywhere.
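
The idea of dropping rarely taken destinations before merging can be sketched
as follows. This is not libgcov code: the struct topn_entry layout, the
prune_small_targets name and the "count below all / divisor" cutoff are
assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical (value, count) pair; libgcov stores these differently.  */
struct topn_entry
{
  int64_t value;   /* Identifier of the call target.  */
  int64_t count;   /* How many times the target was called.  */
};

/* Keep only entries whose count is at least ALL / DIVISOR, compacting them
   to the front of ENTRIES.  Return the number of entries kept.  */
static size_t
prune_small_targets (struct topn_entry *entries, size_t n,
                     int64_t all, int64_t divisor)
{
  assert (divisor > 0);
  int64_t cutoff = all / divisor;
  size_t kept = 0;
  for (size_t i = 0; i < n; i++)
    if (entries[i].count >= cutoff)
      entries[kept++] = entries[i];
  return kept;
}
```

Applied to the first four entries of the dump above with all = 140964179 and a
divisor of 1000, only the targets 939751711 and 2105057161 survive. Whether a
cutoff of this kind helps for other workloads is exactly the open question
raised above.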

[Bug tree-optimization/92924] [10 regression] reproducible indirect call profile merging causes 80% slowdown in Firefox pref-reftest-singletons id-getter microbenchmarks

2019-12-13 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92924

--- Comment #4 from Jan Hubicka  ---
Looking into how getter variable is determined:

vp_35 is function parameter
_124 = MEM[(const struct Value *)vp_35(D)].asBits_;
_125 = _124 ^ 18446181123756130304;
_126 = (struct JSObject *) _125
...
_50 = MEM[(struct Function *)_126].jitinfo
...
getter_60 = _50->D.102800.getter;
ok_64 = getter_60 (cx_325(D), D.1007269, self_323, D.1007259)

Seems our jump functions would need a lot of work to handle this.

[Bug tree-optimization/93055] New: accumulation loops in stepanov_vector benchmark use more instruction level parpallelism

2019-12-23 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93055

Bug ID: 93055
   Summary: accumulation loops in stepanov_vector benchmark use
more instruction level parpallelism
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

stepanov_vector benchmark from
https://gitlab.com/chriscox/CppPerformanceBenchmarks gets poor codegen on
TestOneType

Built with -march=bdver1 -O3 (but the regression happens on core too)

Clang compiles accumulation loops for testOneType as follows:

       │        vpxor  %xmm0,%xmm0,%xmm0
       │        vpxor  %xmm1,%xmm1,%xmm1
       │        vpxor  %xmm2,%xmm2,%xmm2
  0.05 │        vpxor  %xmm3,%xmm3,%xmm3
       │        data16 nopw %cs:0x0(%rax,%rax,1)
  6.95 │ 300:┌─→vpaddd 0x5f0(%rsp,%rcx,4),%xmm0,%xmm0
  0.05 │     │  vpaddd 0x600(%rsp,%rcx,4),%xmm1,%xmm1
  7.13 │     │  vpaddd 0x610(%rsp,%rcx,4),%xmm2,%xmm2
  0.16 │     │  vpaddd 0x620(%rsp,%rcx,4),%xmm3,%xmm3
       │     │  add    $0x10,%rcx
       │     │  cmp    $0x7dc,%rcx
  7.04 │     └──jne    300
  0.07 │        vpaddd %xmm0,%xmm1,%xmm0
  1.61 │        vpaddd %xmm0,%xmm2,%xmm0
       │        vpaddd %xmm0,%xmm3,%xmm0
       │        vpshuf $0x4e,%xmm0,%xmm1
  0.07 │        vpaddd %xmm1,%xmm0,%xmm0
  0.02 │        vpshuf $0xe5,%xmm0,%xmm1

while GCC10 does:

       │ 1c0:   vxorps %xmm0,%xmm0,%xmm0
       │        mov    %rbx,%rax
       │        nop
  2.25 │ 1d0:┌─→vpaddd (%rax),%xmm0,%xmm0
  0.01 │     │  lea    0x2100(%rsp),%rdi
  0.95 │     │  add    $0x10,%rax
  1.04 │     │  cmp    %rax,%rdi
  2.24 │     └──jne    1d0

Which runs slower:

test    description                                           absolute   operations   ratio with
number                                                        time       per second   test0

 0 "int32_t accumulate pointer verify2"                       1.06 sec   12440.17 M   1.00
 1 "int32_t accumulate vector iterator"                       1.06 sec   12458.15 M   1.00
 2 "int32_t accumulate pointer reverse reverse"               1.06 sec   12440.34 M   1.00
 3 "int32_t accumulate vector reverse_iterator reverse"       1.05 sec   12602.74 M   0.99
 4 "int32_t accumulate vector iterator reverse reverse"       1.04 sec   12749.27 M   0.98
 5 "int32_t accumulate array Riterator reverse reverse"       1.06 sec   12486.26 M   1.00

Total absolute time for int32_t Vector Accumulate: 6.32 sec 

int32_t Vector Accumulate Penalty: 0.99 

compared to:
test    description                                           absolute   operations   ratio with
number                                                        time       per second   test0

 0 "int32_t accumulate pointer verify2"                       2.29 sec   5773.60 M    1.00
 1 "int32_t accumulate vector iterator"                       2.27 sec   5806.96 M    0.99
 2 "int32_t accumulate pointer reverse reverse"               2.26 sec   5830.72 M    0.99
 3 "int32_t accumulate vector reverse_iterator reverse"       2.27 sec   5827.45 M    0.99
 4 "int32_t accumulate vector iterator reverse reverse"       2.27 sec   5821.29 M    0.99
 5 "int32_t accumulate array Riterator reverse reverse"       2.27 sec   5826.58 M    0.99

Total absolute time for int32_t Vector Accumulate: 13.62 sec

int32_t Vector Accumulate Penalty: 0.99
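
The difference between the two loops can be reproduced at source level. The
Clang loop above uses four vector accumulators and the GCC loop uses one. A
minimal sketch in C follows, with hypothetical function names and scalar
rather than vector accumulators: the first function carries a single
dependency chain through every addition, while the second keeps four
independent partial sums and combines them at the end.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each addition depends on the previous one.
   Unsigned arithmetic is used so that wraparound is well defined.  */
static int32_t
sum_single (const int32_t *v, size_t n)
{
  uint32_t s = 0;
  for (size_t i = 0; i < n; i++)
    s += (uint32_t) v[i];
  return (int32_t) s;
}

/* Four accumulators: the four additions in the loop body are independent
   of each other, so they can be in flight at the same time.  */
static int32_t
sum_four (const int32_t *v, size_t n)
{
  uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4)
    {
      s0 += (uint32_t) v[i];
      s1 += (uint32_t) v[i + 1];
      s2 += (uint32_t) v[i + 2];
      s3 += (uint32_t) v[i + 3];
    }
  /* Handle the remaining 0 to 3 elements.  */
  for (; i < n; i++)
    s0 += (uint32_t) v[i];
  return (int32_t) (s0 + s1 + s2 + s3);
}
```

Both functions return the same sum because the additions wrap modulo 2^32.
Only the length of the dependency chain differs.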
