[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2020-03-05 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 89430, which changed state.

Bug 89430 Summary: A missing ifcvt optimization to generate csel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/89430] A missing ifcvt optimization to generate csel

2020-03-05 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89430

Martin Jambor  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 CC||jamborm at gcc dot gnu.org
 Resolution|--- |FIXED

--- Comment #11 from Martin Jambor  ---
(In reply to Jeffrey A. Law from comment #10)
> Fixed on the trunk.

So marking as fixed.

[Bug tree-optimization/80635] [8/9/10 regression] std::optional and bogus -Wmaybe-uninitialized warning

2020-03-17 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80635

--- Comment #51 from Martin Jambor  ---
(In reply to Andrew Pinski from comment #48)
> This should also work too:
> diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
> index ea8594db193..83b1d981439 100644
> --- a/gcc/tree-sra.c
> +++ b/gcc/tree-sra.c
> @@ -2499,6 +2499,7 @@ analyze_access_subtree (struct access *root, struct access *parent,
>   For integral types this means the precision has to match.
>  Avoid assumptions based on the integral type kind, too.  */
>if (INTEGRAL_TYPE_P (root->type)
> + && TREE_CODE (root->type) != BOOLEAN_TYPE
>   && (TREE_CODE (root->type) != INTEGER_TYPE
>   || TYPE_PRECISION (root->type) != root->size)
>   /* But leave bitfield accesses alone.  */
> 
>  CUT 

Well, this re-introduces PR 52244 and makes the associated testcase
fail.  The fix for PR 52244 specifically aimed to disallow boolean
replacements.

(In reply to Jeffrey A. Law from comment #50)
> Reassigning to Martin Jambor since the real fix is to avoid creating the
> V_C_E in the first place.

I hoped that changing SRA to emit a NOP_EXPR instead of a V_C_E would
help, but unfortunately it doesn't.  I spent the whole of yesterday
evening looking at this and at the moment I do not see how I could
avoid the conversion without reintroducing PR 52244 (in the general
case; this is another consequence of the fact that SRA is not flow
sensitive).

[Bug tree-optimization/93435] [8/9/10 Regression] Hang with -O2 on innocuous looking code with GCC 8.3

2020-03-19 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93435

--- Comment #8 from Martin Jambor  ---
The issue actually started with my r8-344-2bba75411e1 and it is
basically a perfect SRA bomb: it makes SRA sub-access propagation
across assignments create gazillions of accesses and then
replacements, because they facilitate forward propagation (and as the
ccp3 dumps show, they do).

I already have a patch that simply limits the number of replacements
to a param, defaulting to 128, which makes the testcase compilation
finish in about 9 seconds on my machine.  However, SRA analysis still
takes 7 seconds of that, so I'm looking at capping the propagation
earlier.  That takes more book-keeping, so at least for backports, I'd
like to use the simpler approach on released branches.

[Bug tree-optimization/93435] [8/9 Regression] Hang with -O2 on innocuous looking code with GCC 8.3

2020-03-20 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93435

Martin Jambor  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[8/9/10 Regression] Hang    |[8/9 Regression] Hang with
                   |with -O2 on innocuous       |-O2 on innocuous looking
                   |looking code with GCC 8.3   |code with GCC 8.3

--- Comment #10 from Martin Jambor  ---
Fixed on trunk with
https://gcc.gnu.org/pipermail/gcc-patches/2020-March/542390.html

[Bug ipa/94360] New: 6% run-time regression of 502.gcc_r against GCC 9 when compiled with -O2 and both PGO and LTO

2020-03-27 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94360

Bug ID: 94360
   Summary: 6% run-time regression of 502.gcc_r against GCC 9 when
compiled with -O2 and both PGO and LTO
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jamborm at gcc dot gnu.org
CC: hubicka at gcc dot gnu.org, marxin at gcc dot gnu.org
Blocks: 26163
  Target Milestone: ---
  Host: x86_64-linux
Target: x86_64-linux

When built with current trunk/master at -O2 with generic march/mtune
and both PGO and LTO, SPEC 2017 INTrate 502.gcc_r is 6% slower than
with GCC 9 when run on an AMD Zen2-based CPU, and about 4.8% slower on
Intel Cascade Lake.

Looking at how the run-time of the benchmark evolved over the course
of GCC 10 development cycle, the first and biggest regression (9%)
comes with:

  commit 2925cad2151842daa387950e62d989090e47c91d
  Author: Jan Hubicka 
  Date:   Thu Oct 3 17:08:21 2019 +0200

params.def (PARAM_INLINE_HEURISTICS_HINT_PERCENT, [...]): New.

* params.def (PARAM_INLINE_HEURISTICS_HINT_PERCENT,
PARAM_INLINE_HEURISTICS_HINT_PERCENT_O2): New.
* doc/invoke.texi (inline-heuristics-hint-percent,
inline-heuristics-hint-percent-O2): Document.
* tree-inline.c (inline_insns_single, inline_insns_auto): Add new
hint attribute.
(can_inline_edge_by_limits_p): Use it.

   From-SVN: r276516

Then between Wed Nov 6 (72d6aeecd95) and Mon Nov 18 (58c036c8354) it
improved to about 103% of GCC 9 run-time (I did not find out exactly
what caused it because in much of this range the compiler was
segfaulting in the LTO phase).  Eventually, the benchmark regresses to
the current 106% of GCC 9 run-time with Honza's:

  - 9340d34599e Convert inliner to function specific param infrastructure, or
  - 1e83bd7003e Convert inliner to new param infrastructure.

The former cannot be built without the latter.

Symbol profiles are:

trunk (26b3e568a60):
  Overhead  Samples  Shared Object         Symbol
  ........  .......  ....................  ..................................

     4.04%    42371  cpugcc_r_peak.pgolto  bitmap_ior_into
     2.91%    30281  cpugcc_r_peak.pgolto  df_worklist_dataflow
     2.24%    23342  cpugcc_r_peak.pgolto  df_note_compute
     1.92%    20120  cpugcc_r_peak.pgolto  bitmap_set_bit
     1.75%    18148  cpugcc_r_peak.pgolto  rest_of_handle_fast_dce.lto_priv.0
     1.58%    16580  libc-2.31.so          __memset_avx2_unaligned_erms
     1.40%    14514  cpugcc_r_peak.pgolto  extract_new_fences_from.lto_priv.0
     1.39%    14732  libc-2.31.so          _int_malloc
     1.33%    13824  cpugcc_r_peak.pgolto  bitmap_copy
     1.24%    12962  cpugcc_r_peak.pgolto  bitmap_bit_p
     1.19%    12346  cpugcc_r_peak.pgolto  bitmap_and
     1.18%    12242  cpugcc_r_peak.pgolto  df_lr_local_compute.lto_priv.0
     1.02%    10618  cpugcc_r_peak.pgolto  cleanup_cfg.isra.0


vs gcc 9 (releases/gcc-9.3.0):


  Overhead  Samples  Shared Object         Symbol
  ........  .......  ....................  ..................................

     6.81%    66967  cpugcc_r_peak.pgolto  df_worklist_dataflow
     2.83%    28063  cpugcc_r_peak.pgolto  bitmap_ior_into
     2.80%    27489  cpugcc_r_peak.pgolto  df_note_compute.lto_priv.0
     2.17%    21334  cpugcc_r_peak.pgolto  rest_of_handle_fast_dce.lto_priv.0
     1.69%    16671  libc-2.31.so          __memset_avx2_unaligned_erms
     1.51%    14876  cpugcc_r_peak.pgolto  try_optimize_cfg.lto_priv.0
     1.50%    14990  libc-2.31.so          _int_malloc
     1.50%    14715  cpugcc_r_peak.pgolto  extract_new_fences_from.lto_priv.0
     1.36%    13406  cpugcc_r_peak.pgolto  df_lr_local_compute.lto_priv.0
     1.20%    11926  cpugcc_r_peak.pgolto  remove_unused_locals
     1.06%    10433  cpugcc_r_peak.pgolto  sched_analyze_insn
     1.04%    10210  cpugcc_r_peak.pgolto  init_alias_analysis
     1.04%    10188  cpugcc_r_peak.pgolto  prescan_insns_for_dce.lto_priv.0
     1.00%     9876  cpugcc_r_peak.pgolto  compute_transp


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

[Bug tree-optimization/94364] New: 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128

2020-03-27 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364

Bug ID: 94364
   Summary: 505.mcf_r is 8% faster when compiled with
-mprefer-vector-width=128
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jamborm at gcc dot gnu.org
Blocks: 26163
  Target Milestone: ---
  Host: x86_64-linux
Target: x86_64-linux

SPEC 2017 INTrate benchmark 505.mcf_r, when compiled with options
-Ofast -march=native -mtune=native, is 8% slower than when we also use
option -mprefer-vector-width=128.  I have observed it on both AMD Zen2
and Intel Cascade Lake Server CPUs (using master revision 26b3e568a60).

Better vector width selection would therefore bring about noticeable
speed-up.


Symbol profiles (collected on AMD Rome):

-Ofast -march=native -mtune=native:

  Overhead  Samples  Shared Object    Symbol
  ........  .......  ...............  ..................

    28.64%   462302  mcf_r_peak.mine  spec_qsort
    21.58%   348703  mcf_r_peak.mine  cost_compare
    15.81%   255029  mcf_r_peak.mine  primal_bea_mpp
    15.58%   251176  mcf_r_peak.mine  replace_weaker_arc
     7.37%   118646  mcf_r_peak.mine  arc_compare
     6.53%   105337  mcf_r_peak.mine  price_out_impl
     1.38%    22276  mcf_r_peak.mine  update_tree

-Ofast -march=native -mtune=native -mprefer-vector-width=128:

  Overhead  Samples  Shared Object    Symbol
  ........  .......  ...............  ..................

    23.57%   354536  mcf_r_peak.mine  spec_qsort
    23.51%   353767  mcf_r_peak.mine  cost_compare
    16.98%   255104  mcf_r_peak.mine  primal_bea_mpp
    16.65%   249891  mcf_r_peak.mine  replace_weaker_arc
     7.29%   109267  mcf_r_peak.mine  arc_compare
     7.09%   106380  mcf_r_peak.mine  price_out_impl
     1.53%    22968  mcf_r_peak.mine  update_tree
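
As a rough cross-check of the two profiles above, the per-symbol sample counts can be compared directly.  The following is only an illustrative sketch: it assumes both runs were sampled at the same rate so that raw sample counts are comparable, it covers only the symbols listed in the report, and the variable names are made up for this example.

```python
# Sample counts copied from the two perf profiles above (AMD Rome).
# Assumption: both runs used the same sampling rate, so counts are comparable.
default_width = {
    "spec_qsort": 462302,
    "cost_compare": 348703,
    "primal_bea_mpp": 255029,
    "replace_weaker_arc": 251176,
    "arc_compare": 118646,
    "price_out_impl": 105337,
    "update_tree": 22276,
}
width_128 = {
    "spec_qsort": 354536,
    "cost_compare": 353767,
    "primal_bea_mpp": 255104,
    "replace_weaker_arc": 249891,
    "arc_compare": 109267,
    "price_out_impl": 106380,
    "update_tree": 22968,
}

# Extra samples per symbol in the default build relative to the 128-bit build.
deltas = {sym: default_width[sym] - width_128[sym] for sym in default_width}

total_default = sum(default_width.values())
total_128 = sum(width_128.values())
slowdown = total_default / total_128

for sym, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{sym:20s} {delta:+8d}")
print(f"listed samples: {total_default} vs {total_128} ({slowdown:.3f}x)")
```

Under these assumptions, almost all of the extra samples (107766 of 111556) fall into spec_qsort, and the listed symbols together account for about 7.7% more samples, which is in line with the reported 8% difference.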



[Bug gcov-profile/94369] New: 505.mcf_r is 6-7% slower at -Ofast -march=native with PGO+LTO than with just LTO

2020-03-27 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94369

Bug ID: 94369
   Summary: 505.mcf_r is 6-7% slower at -Ofast -march=native with
PGO+LTO than with just LTO
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: gcov-profile
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jamborm at gcc dot gnu.org
CC: marxin at gcc dot gnu.org
Blocks: 26163
  Target Milestone: ---
  Host: x86_64-linux
Target: x86_64-linux

SPEC 2017 INTrate benchmark 505.mcf_r, when compiled with options
-Ofast -march=native -mtune=native, is 6-7% slower when compiled with
both PGO and LTO than when built with just LTO.  I have observed this
on both AMD Zen2 (7%) and Intel Cascade Lake (6%) server CPUs.  The
train run cannot be very bad because without LTO, PGO improves
run-time by 15% on both systems.  This is with master revision
26b3e568a60.

Profiling results (from an AMD CPU):

LTO:

  Overhead  Samples  Shared Object    Symbol
  ........  .......  ...............  ......................

    39.53%   518450  mcf_r_peak.mine  spec_qsort.constprop.0
    22.13%   289745  mcf_r_peak.mine  master.constprop.0
    19.00%   248641  mcf_r_peak.mine  replace_weaker_arc
     9.37%   122669  mcf_r_peak.mine  main
     8.60%   112601  mcf_r_peak.mine  spec_qsort.constprop.1

PGO+LTO:

  Overhead  Samples  Shared Object    Symbol
  ........  .......  ...............  ......................

    40.13%   562770  mcf_r_peak.mine  spec_qsort.constprop.0
    21.68%   303543  mcf_r_peak.mine  master.constprop.0
    18.24%   255236  mcf_r_peak.mine  replace_weaker_arc
    10.32%   144433  mcf_r_peak.mine  main
     8.07%   112775  mcf_r_peak.mine  arc_compare

Perhaps I should note that we have patched qsort in the benchmark to
work with strict aliasing even with LTO.  But the performance gap is
there also with -fno-strict-aliasing.



[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2020-03-27 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 90056, which changed state.

Bug 90056 Summary: 548.exchange2_r regressions on AMD Zen
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90056

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |MOVED

[Bug middle-end/90056] 548.exchange2_r regressions on AMD Zen

2020-03-27 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90056

Martin Jambor  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |MOVED

--- Comment #2 from Martin Jambor  ---
(In reply to Martin Jambor from comment #0)
> As of revision 270053, the 548.exchange2_r benchmark from SPEC 2017
> INTrate suite suffered a number of smaller regressions on AMD Zen
> CPUs:
> 
>   - At -O2, it is 4.5% slower than when compiled with GCC 7

I am about to file a specific bug about exchange at -O2.

>   - At -Ofast, it is 4.7% slower than when compiled with GCC 8

This is no longer true.

>   - At -Ofast -march=native -mtune=native, this difference is 6.9%

Again, I will file a more specific bug about -Ofast -march=native in a
little while.

>   - At -Ofast and native tuning, it is 6% slower with PGO than
> without it.

I can still see this in my measurements on a Zen1-based CPU but not in
those done on AMD Zen2 or Intel Cascade Lake.  So I am not sure if we
care.  I'll be happy to file a specific bug if we do.

[Bug tree-optimization/94373] New: 548.exchange2_r run time is 7-12% worse than GCC 9 at -O2 and generic march/mtune

2020-03-27 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94373

Bug ID: 94373
   Summary: 548.exchange2_r run time is 7-12% worse than GCC 9 at
-O2 and generic march/mtune
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jamborm at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Blocks: 26163
  Target Milestone: ---
  Host: x86_64-linux
Target: x86_64-linux

When compiled with just -O2, SPEC 2017 INTrate benchmark
548.exchange2_r runs slower than when compiled with GCC 9.2. It is:

-  8% slower on AMD Zen2-based server CPU (rev. 26b3e568a60)
- 12% slower on Intel Cascade Lake server CPU (rev. abe13e1847f)
-  7% slower on AMD Zen1-based server CPU (rev. 26b3e568a60)

During the GCC 10 development cycle the benchmark was relatively noisy
and the run time was increasing in many small steps, but between
October 7 and November 15 we were doing 3% better than GCC 9 (on Zen2).
Specifically, the following commit brought about the improvement:

  commit 806bdf4e40d31cf55744c876eb9f17654de36b99
  Author: Richard Biener 
  Date:   Mon Oct 7 07:53:45 2019 +

re PR tree-optimization/91975 (worse code for small array copy using
pointer arithmetic than array indexing)

2019-10-07  Richard Biener  

PR tree-optimization/91975
* tree-ssa-loop-ivcanon.c (constant_after_peeling): Consistently
handle invariants.

From-SVN: r276645

But it was undone by its revert:

  commit f0af4848ac40d2342743c9b16416310d61db85b5
  Author: Richard Biener 
  Date:   Fri Nov 15 09:09:16 2019 +

re PR tree-optimization/92039 (Spurious -Warray-bounds warnings building
32-bit glibc)

2019-11-15  Richard Biener  

PR tree-optimization/92039
PR tree-optimization/91975
* tree-ssa-loop-ivcanon.c (constant_after_peeling): Revert
previous change, treat invariants consistently as non-constant.
(tree_estimate_loop_size): Ternary ops with just the first op
constant are not optimized away.

* gcc.dg/tree-ssa/cunroll-2.c: Revert to state previous to
unroller adjustment.
* g++.dg/tree-ssa/ivopts-3.C: Likewise.

From-SVN: r278281

On the Intel machine, reverting the revert fixes the regression too.



[Bug tree-optimization/94375] New: 548.exchange2_r run time is 8-18% worse than GCC 9 at -Ofast -march=native

2020-03-27 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94375

Bug ID: 94375
   Summary: 548.exchange2_r run time is 8-18% worse than GCC 9 at
-Ofast -march=native
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jamborm at gcc dot gnu.org
Blocks: 26163
  Target Milestone: ---
  Host: x86_64-linux
Target: x86_64-linux

When compiled with trunk revision 26b3e568a60 and options -Ofast
-march=native -mtune=native, SPEC 2017 INTrate benchmark
548.exchange2_r runs 19% slower on AMD Zen2 and 12% slower on Intel
Cascade Lake than when built with GCC 9.2.

It appears that the main culprit is the vectorizer: switching it off
recovers the performance (it is in fact even some 4% better than GCC
9 on AMD).

Side note: with --param ipa-cp-eval-threshold=1 --param
ipa-cp-unit-growth=80 one can get exchange that is 25% faster yet, but
that is a different issue.

This started happening in the autumn but not exactly at one point, as
the following table of run-times relative to GCC 9.2 shows. 

Revision                    Run-time
..........................  ........
d82f38123b5 (Nov 14 2019)   117%
d9adca6e663 (Nov 5 2019)    117%
bf037872d3c (Oct 24 2019)   101%
77ef339456f (Oct 14 2019)   118%
38a734350fd (Oct 3 2019)    100%
d469a71e5a0 (Sep 23 2019)   101%
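
To make the "not exactly at one point" observation concrete, the table can be scanned for adjacent measurements that differ by a large amount.  This is just a sketch over the six data points above; the 10-point threshold and the helper name big_jumps are arbitrary choices made for this example, not anything used in the actual investigation.

```python
# Run-times relative to GCC 9.2 (in percent), copied from the table above
# and listed oldest first.
measurements = [
    ("d469a71e5a0", "Sep 23 2019", 101),
    ("38a734350fd", "Oct 3 2019", 100),
    ("77ef339456f", "Oct 14 2019", 118),
    ("bf037872d3c", "Oct 24 2019", 101),
    ("d9adca6e663", "Nov 5 2019", 117),
    ("d82f38123b5", "Nov 14 2019", 117),
]

THRESHOLD = 10  # percentage points; an arbitrary cut-off chosen for this sketch


def big_jumps(rows, threshold=THRESHOLD):
    """Return (old_rev, new_rev, change) for adjacent rows whose run-times
    differ by more than `threshold` percentage points."""
    jumps = []
    for (old_rev, _, old_pct), (new_rev, _, new_pct) in zip(rows, rows[1:]):
        change = new_pct - old_pct
        if abs(change) > threshold:
            jumps.append((old_rev, new_rev, change))
    return jumps


for old_rev, new_rev, change in big_jumps(measurements):
    print(f"{old_rev}..{new_rev}: {change:+d} points")
```

With these numbers the scan reports three ranges (a regression, a recovery and another regression), each of which would be a separate candidate range for bisection.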



[Bug middle-end/90056] 548.exchange2_r regressions on AMD Zen

2020-03-27 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90056

--- Comment #3 from Martin Jambor  ---
So replaced with more specific bugs for newer hardware: PR94373 and PR94375.

[Bug middle-end/87528] Popcount changes caused 531.deepsjeng_r run-time regression on Skylake

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528

--- Comment #8 from Martin Jambor  ---
Do I understand correctly that this is fixed?

[Bug tree-optimization/94375] 548.exchange2_r run time is 8-18% worse than GCC 9 at -Ofast -march=native

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94375

--- Comment #3 from Martin Jambor  ---
(In reply to Hongtao.liu from comment #1)
> Try -mprefer-vector-width=128,256-bit vectorization is not helpful for 548
> according to our experience.

I have seen this helping on one system running SLES 15.1 and with
trunk abe13e1847f (Feb 17 2020) but not on another running openSUSE
Tumbleweed and with trunk revision 26b3e568a60 (Mar 23 2020).  So,
from my perspective, perhaps it helps, perhaps it doesn't.

[Bug target/94400] New: 531.deepsjeng_r is 7% slower at -O2 -march=znver2 than GCC 9

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94400

Bug ID: 94400
   Summary: 531.deepsjeng_r is 7% slower at -O2 -march=znver2 than
GCC 9
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jamborm at gcc dot gnu.org
CC: hubicka at gcc dot gnu.org
Blocks: 26163
  Target Milestone: ---
  Host: x86_64-linux
Target: x86_64-linux

When compiled with -O2 -march=native and run on an AMD Zen2 CPU,
531.deepsjeng_r runs about 7% slower than when built with GCC 9.  This
can be bisected to a single commit:

commit a9a4edf0e71bbac9f1b5dcecdcf9250111d16889
Author: Jan Hubicka 
Date:   Sat Nov 30 22:25:24 2019 +0100

Update max_bb_count in execute_fixup_cfg

* tree-cfg.c (execute_fixup_cfg): Update also max_bb_count when
scaling happen.

From-SVN: r278879

Surprisingly, I cannot see a similar problem on an Intel Cascade Lake
server CPU, but I have confirmed the above on two different Rome
systems (one running SLES, one openSUSE Tumbleweed).



[Bug gcov-profile/94369] 505.mcf_r is 6-7% slower at -Ofast -march=native with PGO+LTO than with just LTO

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94369

--- Comment #3 from Martin Jambor  ---
I did not save the reported number of samples but from the raw sample
numbers and percentage points it seems so:

 (562770/0.4013)/(518450/0.3953) = 1.069
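
The same estimate spelled out step by step, as a small sketch: dividing a symbol's sample count by its overhead share approximates the total number of samples in a run, and the ratio of the two totals approximates the slowdown.  The variable names are made up for this example; the numbers are those of spec_qsort.constprop.0 in the two profiles from the original report.

```python
# Samples and overhead share of spec_qsort.constprop.0 in the two profiles
# from the original report (LTO vs. PGO+LTO).
lto_samples, lto_share = 518450, 0.3953
pgo_lto_samples, pgo_lto_share = 562770, 0.4013

# samples / share estimates the total number of samples in each run.
lto_total = lto_samples / lto_share
pgo_lto_total = pgo_lto_samples / pgo_lto_share

ratio = pgo_lto_total / lto_total
print(f"estimated totals: {lto_total:.0f} vs {pgo_lto_total:.0f}, "
      f"ratio {ratio:.3f}")
```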

Nevertheless, I did save separately obtained perf stat numbers which
also look similar (and the number of branches might be a clue):

LTO:

     326083.03 msec task-clock:u              #    0.999 CPUs utilized
             0      context-switches:u        #    0.000 K/sec
             0      cpu-migrations:u          #    0.000 K/sec
          8821      page-faults:u             #    0.027 K/sec
 1080945983089      cycles:u                  #                                   (83.33%)
   21883016095      stalled-cycles-frontend:u #    2.02% frontend cycles idle     (83.33%)
  435184347885      stalled-cycles-backend:u  #   40.26% backend cycles idle      (83.33%)
  847570680279      instructions:u            #    0.78  insn per cycle
                                              #    0.51  stalled cycles per insn  (83.34%)
  147428907202      branches:u                #  452.121 M/sec                    (83.33%)
   13395643229      branch-misses:u           #    9.09% of all branches          (83.33%)
 326.436794016 seconds time elapsed

 325.869528000 seconds user
   0.086873000 seconds sys

vs. PGO+LTO:

     347929.80 msec task-clock:u              #    0.999 CPUs utilized
             0      context-switches:u        #    0.000 K/sec
             0      cpu-migrations:u          #    0.000 K/sec
          8535      page-faults:u             #    0.025 K/sec
 1153803509197      cycles:u                  #                                   (83.33%)
   19911862620      stalled-cycles-frontend:u #    1.73% frontend cycles idle     (83.33%)
  476343319558      stalled-cycles-backend:u  #   41.28% backend cycles idle      (83.33%)
  894092414890      instructions:u            #    0.77  insn per cycle
                                              #    0.53  stalled cycles per insn  (83.33%)
  173999066006      branches:u                #  500.098 M/sec                    (83.33%)
   13698979291      branch-misses:u           #    7.87% of all branches          (83.34%)

 348.308607033 seconds time elapsed

 347.711752000 seconds user
   0.090975000 seconds sys
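
To see which counters move the most between the two runs, the perf stat numbers above can be turned into PGO+LTO / LTO ratios.  This is only a sketch over the figures quoted in this comment; the dictionary names are made up for the example, and the counters were multiplexed (about 83% coverage each), so the ratios are approximate.

```python
# Counter values copied from the two perf stat outputs above.
lto = {
    "task-clock:u (msec)": 326083.03,
    "cycles:u": 1080945983089,
    "instructions:u": 847570680279,
    "branches:u": 147428907202,
    "branch-misses:u": 13395643229,
}
pgo_lto = {
    "task-clock:u (msec)": 347929.80,
    "cycles:u": 1153803509197,
    "instructions:u": 894092414890,
    "branches:u": 173999066006,
    "branch-misses:u": 13698979291,
}

# Ratio of each counter in the PGO+LTO run to the same counter in the LTO run.
ratios = {name: pgo_lto[name] / lto[name] for name in lto}

for name, value in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {value:.3f}")
```

By this measure the number of executed branches grows by about 18%, noticeably more than instructions (about 5.5%) or cycles (about 6.7%), which is why the branch count looks like a possible clue.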

[Bug target/90234] 503.bwaves_r is 6% slower on Zen1 CPUs at -Ofast with native march/mtune than with generic ones

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90234

Martin Jambor  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|503.bwaves_r is 6% slower   |503.bwaves_r is 6% slower
                   |on Zen CPUs at -Ofast with  |on Zen1 CPUs at -Ofast with
                   |native march/mtune than     |native march/mtune than
                   |with generic ones           |with generic ones

--- Comment #1 from Martin Jambor  ---
I can still see this issue on a Zen1 machine as of trunk revision
abe13e1847f (Feb 17 2020) but not on Zen2 machines (in both cases
targeting native ISAs).

[Bug target/94406] New: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

Bug ID: 94406
   Summary: 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9
with -Ofast -march=native
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jamborm at gcc dot gnu.org
CC: andre.simoesdiasvieira at arm dot com
Blocks: 26163
  Target Milestone: ---
  Host: x86_64-linux
Target: x86_64-linux

SPEC 2017 FPrate benchmark 503.bwaves_r compiled with -Ofast
-march=native -mtune=native runs 11% slower on AMD Zen2 CPUs when
built with trunk (revision abe13e1847f) than when compiled with GCC
9.2.

Bisecting led to commit:

  commit 1297712fb4af6c6bfd827e0f0a9695b14669f87d
  Author: Andre Vieira 
  Date:   Thu Oct 31 09:49:47 2019 +

[vect]Make vect-epilogues-nomask=1 default

This patch turns epilogue vectorization on by default for all targets.


  From-SVN: r277659

If we use current trunk but build also with option
--param vect-epilogues-nomask=0 we get run-time on par with GCC 9.

This is also the reason why generic march/tuning or building with
-mprefer-vector-width=128 currently results in faster code than simple
-march=native.

Interestingly, I do not see this issue on an Intel Cascade Lake Server
CPU, even though the epilogue is created there too - judging by CFG of
the hottest function which looks the same.

And I am not sure to what extent it tells anything at all, but I
accidentally also perf'ed load-to-store-stall events and in the slow
version, the reported "samples" was 10% higher and the reported "event
count" shot up 2.8 times(!).



[Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

--- Comment #1 from Martin Jambor  ---
For the record, the collected profiles both for the traditional
"cycles:u" event and (originally unintended) "ls_stlf:u" event are
below:

-Ofast -march=native -mtune=native

# Samples: 894K of event 'cycles:u'
# Event count (approx.): 735979402525
#
# Overhead  Samples  Command          Shared Object                 Symbol
# ........  .......  ...............  ............................  ...........................
#
    67.18%   599542  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
    11.40%   102686  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
    11.37%   101388  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] bi_cgstab_block_
     6.95%    62694  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
     1.88%    16957  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
     1.01%     9023  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned


# Samples: 769K of event 'ls_stlf:u'
# Event count (approx.): 154704730574
#
# Overhead  Samples  Command          Shared Object                 Symbol
# ........  .......  ...............  ............................  ...........................
#
    94.59%   612921  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
     1.83%    88259  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
     1.12%    13615  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
     1.11%    43093  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
     1.05%     8746  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned



-Ofast -march=native -mtune=native --param vect-epilogues-nomask=0

# Samples: 816K of event 'cycles:u'
# Event count (approx.): 671104061807
#
# Overhead  Samples  Command          Shared Object                 Symbol
# ........  .......  ...............  ............................  ...........................
#
    64.07%   521532  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
    12.50%   102670  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
    12.39%   100777  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] bi_cgstab_block_
     7.60%    62641  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
     2.06%    16925  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
     1.17%     9531  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned

# Samples: 705K of event 'ls_stlf:u'
# Event count (approx.): 55009340780
#
# Overhead  Samples  Command          Shared Object                 Symbol
# ........  .......  ...............  ............................  ...........................
#
    86.26%   532930  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] mat_times_vec_
     5.15%    88270  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] shell_
     3.17%    13696  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] flux_
     3.06%    57149  bwaves_r_peak.e  bwaves_r_peak.experiment-m64  [.] jacobian_
     1.59%     9226  bwaves_r_peak.e  libc-2.31.so                  [.] __memset_avx2_unaligned
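
For reference, the approximate event counts in the four profile headers above can be compared directly.  A minimal sketch, with made-up variable names, that only divides the default-build counts by the counts obtained with --param vect-epilogues-nomask=0:

```python
# "Event count (approx.)" values copied from the perf report headers above.
default_build = {
    "cycles:u": 735979402525,
    "ls_stlf:u": 154704730574,
}
no_epilogues = {  # built with --param vect-epilogues-nomask=0
    "cycles:u": 671104061807,
    "ls_stlf:u": 55009340780,
}

# Ratio of each event count in the default build to the no-epilogue build.
ratios = {event: default_build[event] / no_epilogues[event]
          for event in default_build}

for event, value in ratios.items():
    print(f"{event:10s} {value:.2f}x")
```

The cycle count is roughly 10% higher in the default build, while the ls_stlf event count is about 2.8 times higher, matching the figures mentioned in the original report.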

[Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

--- Comment #2 from Martin Jambor  ---
And for completeness, LNT sees this too and has just managed to catch the
regression:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=276.427.0&plot.1=295.427.0&;

[Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

--- Comment #3 from Martin Jambor  ---
One more data point: a binary compiled for cascadelake does not run on
Zen2, but one compiled for znver2 runs on Cascade Lake, and it makes
no difference in run-time there.

If disabling epilogues helps on Intel, the difference is less than 2%.

[Bug gcov-profile/94410] New: 511.povray_r is 11% slower built at -O2 PGO+LTO than with GCC 9 and same options

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94410

Bug ID: 94410
   Summary: 511.povray_r is 11% slower built at -O2 PGO+LTO than
with GCC 9 and same options
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: gcov-profile
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jamborm at gcc dot gnu.org
CC: hubicka at gcc dot gnu.org, marxin at gcc dot gnu.org
Blocks: 26163
  Target Milestone: ---
  Host: x86_64-linux
Target: x86_64-linux

SPEC 2017 FPrate benchmark 511.povray_r runs 11% slower on AMD Zen2
CPU and 10% slower on Intel Cascade Lake server CPU when built with
-O2 (generic march/tuning) and both PGO and LTO with trunk (revision
26b3e568a60) than when compiled with the same options with GCC 9.

Bisecting revealed that the slowdown was introduced with:

commit 2925cad2151842daa387950e62d989090e47c91d
Author: Jan Hubicka 
Date:   Thu Oct 3 17:08:21 2019 +0200

params.def (PARAM_INLINE_HEURISTICS_HINT_PERCENT, [...]): New.

* params.def (PARAM_INLINE_HEURISTICS_HINT_PERCENT,
PARAM_INLINE_HEURISTICS_HINT_PERCENT_O2): New.
* doc/invoke.texi (inline-heuristics-hint-percent,
inline-heuristics-hint-percent-O2): Document.
* tree-inline.c (inline_insns_single, inline_insns_auto): Add new
hint attribute.
(can_inline_edge_by_limits_p): Use it.

From-SVN: r276516

The revision just before it was even 9% and 7% faster than GCC 9 on
AMD and Intel respectively.



[Bug gcov-profile/94410] 511.povray_r is 11% slower built at -O2 PGO+LTO than with GCC 9 and same options

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94410

Martin Jambor  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94360

--- Comment #1 from Martin Jambor  ---
PR94360 is another O2 PGO+LTO bug where the commit caused a slowdown.

[Bug ipa/94360] 6% run-time regression of 502.gcc_r against GCC 9 when compiled with -O2 and both PGO and LTO

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94360

--- Comment #2 from Martin Jambor  ---
PR94410 is another O2 PGO+LTO bug where g:2925cad2151 caused a slowdown.

[Bug gcov-profile/90364] 521.wrf_r is 8-17% slower with PGO at -Ofast and native march/mtune

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364

Martin Jambor  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2019-05-06 00:00:00         |2020-3-30
            Summary|521.wrf_r is 9.5 % slower   |521.wrf_r is 8-17% slower
                   |with PGO on Zen CPUs at     |with PGO at -Ofast and
                   |-Ofast and native           |native march/mtune
                   |march/mtune                 |

--- Comment #9 from Martin Jambor  ---
The problem still persists across the board, causing:

- 17% regression against non-PGO on AMD Zen2 CPU,
-  8% regression against non-PGO on AMD Zen1 CPU, and
- 12% regression against non-PGO on Intel Cascade Lake server CPU.

All of the above is at -Ofast -march=native, by the way.  At just -O2
(and generic -march), PGO actually helps by 25-27% on all three
systems, so I would double-check before blaming specinvoke (though of
course it might be the culprit).

[Bug middle-end/90283] 519.lbm_r is 7%-10% slower with -Ofast -march=native and both LTO and PGO than with GCC 8

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90283

--- Comment #5 from Martin Jambor  ---
The numbers from this year are:

- On Intel Cascade Lake server CPU the regression has disappeared, if
  there ever was one; I don't have Skylake numbers this year.

- On AMD Zen1 CPU, the measured regression is 20% compared to GCC 8
  (15% compared to GCC 9) but that most likely means we hit the known
  code-placement problem again.

- On AMD Zen2 CPU, there is actually a 6.8% regression compared to GCC
  8 (and only a negligible one compared to GCC 9).  It may or may not be
  the same problem we were looking at last year.  In any event,
  probably not very pressing, given the behavior of the benchmark :-/

[Bug gcov-profile/94410] 511.povray_r is 11% slower built at -O2 PGO+LTO than with GCC 9 and same options

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94410

--- Comment #2 from Martin Jambor  ---
For the record, SPEC 2006 453.povray is similarly affected; the commit
makes it run 26% slower.

[Bug ipa/90151] 554.roms_r regression on x86_64 at -O2 and generic march/mtune

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90151

--- Comment #1 from Martin Jambor  ---
This year's numbers:

- on AMD Zen1, we are still 7.2% worse than GCC 7
- on AMD Zen2, the regression is 4.6%
- on Intel Cascade Lake server CPU, it is 5.4%

This is all -O2, so perhaps not that important for a Fortran
benchmark.

[Bug target/94406] 503.bwaves_r is 11% slower on Zen2 CPUs than GCC 9 with -Ofast -march=native

2020-03-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

--- Comment #4 from Martin Jambor  ---
For the record, on AMD Zen2 at least, SPEC 2006 410.bwaves also runs
about 12% faster with --param vect-epilogues-nomask=0 (and otherwise
with -Ofast -march=native -mtune=native).

[Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9

2020-03-31 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427

Bug ID: 94427
   Summary: 456.hmmer is 8-17% slower when compiled at -Ofast than
with GCC 9
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jamborm at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Blocks: 26163
  Target Milestone: ---
  Host: x86_64-linux
Target: x86_64-linux

SPECINT 2006 benchmark 456.hmmer runs 18% slower on AMD Zen2 CPUs, 15%
on AMD Zen1 CPUs and 8% on Intel Cascade Lake server CPUs when built
with trunk (revision 26b3e568a60) and just -Ofast (so with generic
march/mtune) than when compiled with GCC 9.

Bisecting the regression leads to commit:

  commit 14ec49a7537004633b7fff859178cbebd288ca1d
  Author: Richard Biener 
  Date:   Tue Jul 2 07:35:23 2019 +

re PR tree-optimization/58483 (missing optimization opportunity for const
std::vector compared to std::array)

2019-07-02  Richard Biener  

PR tree-optimization/58483
* tree-ssa-scopedtables.c (avail_expr_hash): Use OEP_ADDRESS_OF
for MEM_REF base hashing.
(equal_mem_array_ref_p): Likewise for base comparison.

* gcc.dg/tree-ssa/ssa-dom-cse-8.c: New testcase.

From-SVN: r272922


Collected profiles are weird, almost the opposite of what I would
expect them to be, because the *slow* version spends less time in the cold
section - but both spend IMHO too much time there.  The following data
were collected on AMD Zen2 but those from Intel are similar in this
regard.  What is different is that on Intel perf stat reports doubling
of branch misses - and because it has an older perf it does not report
front/back-end stalls.

Before the aforementioned revision:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     163360.87 msec task-clock:u               #    0.992 CPUs utilized
             0      context-switches:u         #    0.000 K/sec
             0      cpu-migrations:u           #    0.000 K/sec
          7639      page-faults:u              #    0.047 K/sec
  525635661818      cycles:u                   #
     809847511      stalled-cycles-frontend:u  #    0.15% frontend cycles idle     (83.35%)
  299331255326      stalled-cycles-backend:u   #   56.95% backend cycles idle      (83.30%)
 1757801907547      instructions:u             #    3.34  insn per cycle
                                               #    0.17  stalled cycles per insn  (83.34%)
  133496985084      branches:u                 #  817.191 M/sec                    (83.35%)
     682351923      branch-misses:u            #    0.51% of all branches          (83.31%)

 164.659685804 seconds time elapsed

 163.32542 seconds user
   0.022183000 seconds sys

# Samples: 637K of event 'cycles:u'
# Event count (approx.): 527143782584
#
# Overhead   Samples  Shared Object            Symbol
#
    58.43%    372284  hmmer_peak.mine-std-gen  [.] P7Viterbi
    35.12%    223887  hmmer_peak.mine-std-gen  [.] P7Viterbi.cold
     2.59%     16418  hmmer_peak.mine-std-gen  [.] FChoose
     2.51%     15906  hmmer_peak.mine-std-gen  [.] sre_random


At the aforementioned revision:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     191483.84 msec task-clock:u               #    0.994 CPUs utilized
             0      context-switches:u         #    0.000 K/sec
             0      cpu-migrations:u           #    0.000 K/sec
          7639      page-faults:u              #    0.040 K/sec
  622159384711      cycles:u                   #
     817604010      stalled-cycles-frontend:u  #    0.13% frontend cycles idle     (83.31%)
  439972264588      stalled-cycles-backend:u   #   70.72% backend cycles idle      (83.34%)
 1707838992202      instructions:u             #    2.75  insn per cycle
                                               #    0.26  stalled cycles per insn  (83.35%)
   91309384910      branches:u                 #  476.852 M/sec                    (83.32%)
     655463713      branch-misses:u            #    0.72% of all branches          (83.33%)

 192.564513355 seconds time elapsed

 191.443774000 seconds user
   0.023978000 seconds sys

# Samples: 752K of event 'cycles:u'
# Event count (approx.): 622947549968
#
# Overhead   Samples  Shared Object             Symbol
#
    83.68%    629645  hmmer_peak.small-std-gen
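
The two perf stat runs above can be cross-checked with a little arithmetic.  The sketch below only restates the quoted counter values; the helper name `ratio` and the constant names are made up for illustration.

```c
#include <stdint.h>

/* Counter values copied from the two perf stat runs quoted above. */
static const uint64_t cycles_before   = 525635661818ULL;
static const uint64_t insns_before    = 1757801907547ULL;
static const uint64_t branches_before = 133496985084ULL;

static const uint64_t cycles_after    = 622159384711ULL;
static const uint64_t insns_after     = 1707838992202ULL;
static const uint64_t branches_after  = 91309384910ULL;

/* Hypothetical helper: plain floating-point quotient of two counters. */
static double ratio(uint64_t num, uint64_t den)
{
    return (double) num / (double) den;
}
```

With these inputs, `ratio(insns_before, cycles_before)` and `ratio(insns_after, cycles_after)` reproduce the 3.34 and 2.75 instructions per cycle that perf prints, `ratio(cycles_after, cycles_before)` shows roughly 18% more cycles, and `ratio(branches_after, branches_before)` shows that the slow binary executes about a third fewer branches.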

[Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9

2020-03-31 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427

--- Comment #1 from Martin Jambor  ---
OK, so it turns out the identified commit only allows us to shoot
ourselves in the foot - and there are too few branches, not too many.

The hottest loop, consuming most of the time, is:

Percent Instructions

  0.03 │ fb0:┌─+add -0x8(%r9,%rcx,4),%eax
  5.03 │ │  mov %eax,-0x4(%r13,%rcx,4)
  2.48 │ │  mov -0x8(%r8,%rcx,4),%esi
  0.02 │ │  add -0x8(%rdx,%rcx,4),%esi
  0.06 │ │  cmp %eax,%esi
  4.49 │ │  cmovge  %esi,%eax
 17.17 │ │  mov %ecx,%esi
  0.03 │ │  cmp $0xc521974f,%eax
  3.50 │ │  cmovl   %ebx,%eax   <--- this used to be a branch
 21.84 │ │  mov %eax,-0x4(%r13,%rcx,4)
  3.88 │ │  add $0x1,%rcx
  0.00 │ │  cmp %rdi,%rcx
  0.04 │ └──jne fb0

where the marked conditional move was a branch one revision before,
because, after fwprop3 the IL looked like:

   [local count: 955630217]:
  # cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14),
[fast_algorithms.c:142:53] cstore_249(15)>
  [fast_algorithms.c:142:49] MEM  [(void *)_72] = cstore_281;
  [fast_algorithms.c:143:13] _78 = [fast_algorithms.c:143:13] *_72;
  [fast_algorithms.c:143:10] if (_78 < -987654321)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 477815109]:

   [local count: 955630217]:
  # cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16),
[fast_algorithms.c:143:33] cstore_281(17)>
  [fast_algorithms.c:143:29] MEM  [(void *)_72] = cstore_250;

The aforementioned revision turned this into more optimized code:

   [local count: 955630217]:
  # cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14),
[fast_algorithms.c:142:53] _73(15)>
  [fast_algorithms.c:143:10] if (cstore_281 < -987654321)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 477815109]:

   [local count: 955630217]:
  # cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16),
[fast_algorithms.c:143:33] cstore_281(17)>
  [fast_algorithms.c:143:29] MEM  [(void *)_72] = cstore_250;

Which then phiopt3 changed to:

  cstore_248 = MAX_EXPR ;
  [fast_algorithms.c:143:29] MEM  [(void *)_72] = cstore_248;

and expander apparently always expands MAX_EXPR into a conditional
move if it can(?).

When I hacked phiopt not to do the transformation for - ehm - any
GIMPLE_COND statement originating from source line 143, I recovered
the original run-time of the benchmark.  On both AMD and Intel.
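
A simplified C sketch of the two shapes discussed above.  The function and macro names are invented for illustration and this is not the hmmer source; only the constant 987654321 comes from the IL dump (0xc521974f in the disassembly is the same value, -987654321, as a 32-bit pattern).

```c
#define TOY_INFTY 987654321   /* the constant visible in the IL above */

/* Branchy form: the shape that used to end up as a conditional jump. */
static int clamp_branch(int sc)
{
    if (sc < -TOY_INFTY)
        sc = -TOY_INFTY;
    return sc;
}

/* MAX form: what the MAX_EXPR produced by phiopt computes.  Per the
   comment above, this is apparently expanded as a conditional move. */
static int clamp_max(int sc)
{
    return sc > -TOY_INFTY ? sc : -TOY_INFTY;
}

/* Toy loop in the spirit of the hot loop: take the larger of the old
   value and a sum, then clamp it from below. */
static void toy_update(int *mc, const int *a, const int *b, int n)
{
    for (int k = 0; k < n; k++) {
        int sc = a[k] + b[k];
        if (sc > mc[k])
            mc[k] = sc;
        mc[k] = clamp_branch(mc[k]);
    }
}
```

Both clamp functions return the same value for every input, so the difference is purely in how the result is computed.  A compiler is free to turn either source form into either machine form, which is why the experiment above had to be done inside phiopt rather than in the source.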

[Bug target/94364] 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128

2020-04-01 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364

--- Comment #2 from Martin Jambor  ---
(In reply to Richard Biener from comment #1)
> Huh, looks like this is the (patched by us) memory copying done in
> spec_qsort?

Yes

> I wonder if you can re-measure with our patching undone but then with
> -fno-strict-aliasing (though I think that only was required with LTO).
>

The difference indeed goes away :-/  The current code we're
benchmarking (when not using LTO) is slower in both cases :-/

> How large are the objects sorted in mcf?

It's always pointers, 8 bytes.

[Bug tree-optimization/94375] 548.exchange2_r run time is 8-18% worse than GCC 9 at -Ofast -march=native

2020-04-01 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94375

--- Comment #6 from Martin Jambor  ---
(In reply to Richard Biener from comment #2)
> Do we ever hit the vectorized paths?

What's the best way to find out?  If I open the disassembled code in
perf report and search for ymm, some of these (groups of) instructions
have (very few) samples, but more often they don't.

[Bug target/94364] 505.mcf_r is 8% faster when compiled with -mprefer-vector-width=128

2020-04-02 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364

Martin Jambor  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |WONTFIX

--- Comment #6 from Martin Jambor  ---
OK, I'm going to close this given that this problem is specific to our
mcf patch which we decided to change and the issue cannot easily be
avoided in the compiler.

[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2020-04-02 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 94364, which changed state.

Bug 94364 Summary: 505.mcf_r is 8% faster when compiled with 
-mprefer-vector-width=128
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |WONTFIX

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2020-04-02 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 94364, which changed state.

Bug 94364 Summary: 505.mcf_r is 8% faster when compiled with 
-mprefer-vector-width=128
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94364

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |WONTFIX

[Bug ipa/92676] [10 Regression] lto1: error: comdat-local function called by construct.constprop outside its comdat since r278669

2020-04-02 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92676

Martin Jambor  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Martin Jambor  ---
Fixed.

[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567

2020-04-02 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621

Martin Jambor  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |jamborm at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #8 from Martin Jambor  ---
Let me have a look

[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567

2020-04-02 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621

--- Comment #9 from Martin Jambor  ---
(In reply to Jan Hubicka from comment #3)
> The testcase builds for me now, but this is Martin's code

that's questionable :-) Git blame points correctly to me but before
new IPA-SRA the assert used to be:

  gcc_assert (!node || !node->clone.combined_args_to_skip);

and was added by Honza in 2012 (in 66a20fc2a7de).

> (apparently
> checking that we did not forget to apply param adjustments)

AFAIU no, quite the opposite, it checks that we are not going to apply
param adjustment twice to a call, which is in a way what we are about
to do.

We find ourselves looking at a call statement with parameters already
adjusted and the decl in the statement being the IPA-CP created one.
In the cgraph edge, however, the callee's decl is one created during
save_inline_function_body.  Because redirect_call_stmt_to_callee
decides whether it has to do anything by comparing decls, it thinks it
has to redirect and remove params and... BOOM.

When I wrote that the call had already been adjusted that actually was
not entirely true.  The call was already created that way in
expand_thunk, because it is in an expanded artificial thunk of the
IPA-CP clone.

The assumption was that because the decl would be the correct one from
the start, no additional redirection would be taking place.  That
perhaps wasn't the best idea as save_inline_function_body can clearly
violate that (and in future some IPA pass might want to redirect the
edge too).

Having said that, I am not sure where to best fix this so late in the
GCC 10 development cycle.

[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567

2020-04-03 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621

--- Comment #13 from Martin Jambor  ---
(In reply to Jan Hubicka from comment #12)
> > Having said that, I am not sure where to best fix this so late in the
> > GCC 10 development cycle.
> 
> So the problem is that thunk is expanded on the adjusted decl but we
> still keep the adjustments and later fail to apply them?
> 
> I guess we have two options:
>  1) force thunk expansion to happen on original decls (before cloning)
> so the body ends up being same as for ordinary function

I was thinking about this too.  I will try to look into expand_thunk
whether I can leave the call statement mostly alone (apart from the
thunk transform itself, of course).

>  2) remove the adjustments after expansion - this should IMO work
> under the assumption that optimization passes don't insert
> non-trivial code into the thunk before they expand the thunk (i.e.
> if you want to adjust it in ipa-sra you will want to first produce
> the thunk and then do adjustement)
> It seems to me that 2 should be not that hard to implement
> Does that make sense?

Unfortunately I don't think so.  The adjustment is attached to the
callee (just like in the past the skip_args bitmap was - and we're
only skipping arguments in the testcase), so you cannot just remove it
in one caller.  Or am I missing something?

[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567

2020-04-03 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621

--- Comment #14 from Martin Jambor  ---
Actually, we should be able to simply skip applying adjustments, if
e->caller->former_thunk_p().  I'm playing with a patch.

[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567

2020-04-03 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621

--- Comment #15 from Martin Jambor  ---
It turns out that no, recursive inlining will happily put an adjusted and not
yet adjusted call into the same function which was formerly a thunk.

[Bug gcov-profile/94472] New: 400.perlbench is slower when compiled at -O2 with both PGO and LTO on AMD Zen CPUs

2020-04-03 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94472

Bug ID: 94472
   Summary: 400.perlbench is slower when compiled at -O2 with both
PGO and LTO on AMD Zen CPUs
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: gcov-profile
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jamborm at gcc dot gnu.org
CC: hubicka at gcc dot gnu.org, marxin at gcc dot gnu.org
Blocks: 26163
  Target Milestone: ---
  Host: x86_64-linux
Target: x86_64-linux

400.perlbench is slower when compiled at -O2 (and generic march/mtune)
with both PGO and LTO when compiled with master (26b3e568a60) than
when built with GCC 9, on Zen2 by 13% and on Zen1 by 7%.  The
performance is comparable on Intel Cascade Lake server CPU.

I attempted bisecting the problems on the Zen2 CPU but was only
partially successful because a lot of the slowdown seemed to have
happened gradually.  The first bigger slowdown - almost 4% - came
with:

  562d1e9556777988ae46c5d1357af2636bc272ea is the first bad commit
  commit 562d1e9556777988ae46c5d1357af2636bc272ea
  Author: Jan Hubicka 
  Date:   Wed Oct 2 16:01:47 2019 +

cif-code.def (MAX_INLINE_INSNS_SINGLE_O2_LIMIT, [...]): New.


* cif-code.def (MAX_INLINE_INSNS_SINGLE_O2_LIMIT,
MAX_INLINE_INSNS_AUTO_O2_LIMIT): New.

  ...
From-SVN: r276469

About the same performance loss was then introduced by:

commit 2925cad2151842daa387950e62d989090e47c91d
Author: Jan Hubicka 
Date:   Thu Oct 3 17:08:21 2019 +0200

params.def (PARAM_INLINE_HEURISTICS_HINT_PERCENT, [...]): New.

* params.def (PARAM_INLINE_HEURISTICS_HINT_PERCENT,
PARAM_INLINE_HEURISTICS_HINT_PERCENT_O2): New.
* doc/invoke.texi (inline-heuristics-hint-percent,
inline-heuristics-hint-percent-O2): Document.
* tree-inline.c (inline_insns_single, inline_insns_auto): Add new
hint attribute.
(can_inline_edge_by_limits_p): Use it.


And finally, throughout March the benchmark was quite jumpy but again
ended up about 5% slower than at the beginning of the month.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567

2020-04-03 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621

--- Comment #16 from Martin Jambor  ---
The following workaround works for the testcase but would need to be
generalized for a chain of former_decl_of's to be universal, I'm afraid:

diff --git a/gcc/cgraph.c b/gcc/cgraph.c
index 6b780f80eb3..241b996151a 100644
--- a/gcc/cgraph.c
+++ b/gcc/cgraph.c
@@ -1467,7 +1467,8 @@ cgraph_edge::redirect_call_stmt_to_callee (cgraph_edge *e)


   if (e->indirect_unknown_callee
-  || decl == e->callee->decl)
+  || decl == e->callee->decl
+  || decl == e->callee->former_clone_of)
 return e->call_stmt;

   if (flag_checking && decl)
diff --git a/gcc/ipa-inline-transform.c b/gcc/ipa-inline-transform.c
index eed992d314d..a6675768552 100644
--- a/gcc/ipa-inline-transform.c
+++ b/gcc/ipa-inline-transform.c
@@ -588,6 +588,7 @@ save_inline_function_body (struct cgraph_node *node)
   first_clone->next_sibling_clone = NULL;
   gcc_assert (!first_clone->prev_sibling_clone);
 }
+  first_clone->former_clone_of = node->decl;
   first_clone->clone_of = NULL;

   /* Now node in question has no clones.  */
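
A toy model of what generalizing the check to a whole chain could look like.  This is a sketch under stated assumptions, not GCC code: `struct toy_decl` stands in for a declaration, and the `former_clone_of` link is stored on the declaration itself here only to keep the example small.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a function declaration; not GCC's tree type. */
struct toy_decl {
    const char *name;
    const struct toy_decl *former_clone_of;   /* NULL when there is none */
};

/* Return true when the declaration found in the call statement already
   denotes the callee, either directly or through any number of
   former_clone_of links, so that no redirection would be needed. */
static bool toy_call_already_points_to(const struct toy_decl *stmt_decl,
                                       const struct toy_decl *callee_decl)
{
    for (const struct toy_decl *d = callee_decl; d != NULL;
         d = d->former_clone_of)
        if (d == stmt_decl)
            return true;
    return false;
}
```

The workaround in the diff corresponds to looking at the first link only; the loop simply keeps following links until it either finds the statement's declaration or runs out of them.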

[Bug tree-optimization/93435] [8/9 Regression] Hang with -O2 on innocuous looking code with GCC 8.3

2020-04-03 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93435

--- Comment #13 from Martin Jambor  ---
The problematic behavior of SRA is now fixed on master and both open
release branches so I consider my work done here.

I'm leaving the bug open in case Jeff wants to add some DSE limiter
like he wrote in comment #5.

[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567

2020-04-06 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621

--- Comment #17 from Martin Jambor  ---
Created attachment 48208
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48208&action=edit
WIP patch

This is the current version of my patch to fix this.  I think that at
least for the purposes of JIT I need to find a place to deallocate the
new summary - but that can only happen after all inlining is done.
Then I'll add that, re-base and submit it to the mailing list.

[Bug tree-optimization/94482] [8/9/10 Regression] Inserting into vector with optimization enabled on x86 generates incorrect result

2020-04-06 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94482

--- Comment #21 from Martin Jambor  ---
As Richi already found out, the path in sra_modify_expr handling type
incompatible replacement does not work when the replaced expr comes
from within a BIT_FIELD_REF - it does only half of what is necessary.

A conservative (not yet much tested) fix would be to emit a full RMW:

*** /tmp/UTN9NX_tree-sra.c  Mon Apr  6 15:28:23 2020
--- gcc/tree-sra.c  Mon Apr  6 15:22:40 2020
*** sra_modify_expr (tree *expr, gimple_stmt
*** 3742,3768 

  ref = build_ref_for_model (loc, orig_expr, 0, access, gsi, false);

! if (write)
{
  gassign *stmt;

  if (access->grp_partial_lhs)
!   ref = force_gimple_operand_gsi (gsi, ref, true, NULL_TREE,
!false, GSI_NEW_STMT);
! stmt = gimple_build_assign (repl, ref);
  gimple_set_location (stmt, loc);
! gsi_insert_after (gsi, stmt, GSI_NEW_STMT);
}
! else
{
  gassign *stmt;

  if (access->grp_partial_lhs)
!   repl = force_gimple_operand_gsi (gsi, repl, true, NULL_TREE,
!true, GSI_SAME_STMT);
! stmt = gimple_build_assign (ref, repl);
  gimple_set_location (stmt, loc);
! gsi_insert_before (gsi, stmt, GSI_SAME_STMT);
}
}
else
--- 3742,3771 

  ref = build_ref_for_model (loc, orig_expr, 0, access, gsi, false);

! if (!write || bfr)
{
  gassign *stmt;
+ tree src = repl;

  if (access->grp_partial_lhs)
!   src = force_gimple_operand_gsi (gsi, repl, true, NULL_TREE,
!true, GSI_SAME_STMT);
! stmt = gimple_build_assign (ref, src);
  gimple_set_location (stmt, loc);
! gsi_insert_before (gsi, stmt, GSI_SAME_STMT);
}
! if (bfr)
!   ref = unshare_expr (ref);
! if (write || bfr)
{
  gassign *stmt;

  if (access->grp_partial_lhs)
!   ref = force_gimple_operand_gsi (gsi, ref, true, NULL_TREE,
!false, GSI_NEW_STMT);
! stmt = gimple_build_assign (repl, ref);
  gimple_set_location (stmt, loc);
! gsi_insert_after (gsi, stmt, GSI_NEW_STMT);
}
}
else

But I wonder whether we care about type incompatibility within a B_F_R
at all - isn't B_F_R also an implicit V_C_E, always looking at the
binary image?  So perhaps something as simple as the following might
work?

diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index b2056b58750..d22b03814d2 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -3736,7 +3736,7 @@ sra_modify_expr (tree *expr, gimple_stmt_iterator *gsi, bool write)
  be accessed as a different type too, potentially creating a need for
  type conversion (see PR42196) and when scalarized unions are involved
  in assembler statements (see PR42398).  */
-  if (!useless_type_conversion_p (type, access->type))
+  if (!bfr && !useless_type_conversion_p (type, access->type))
{
  tree ref;

I'll test both options ...and it seems we need the RMW one to handle
REALPART_EXPR and IMAGPART_EXPR.
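
Why a partial write needs both halves of the read-modify-write can be illustrated outside the compiler.  The following is a toy model, not SRA code: `repl` plays the scalar replacement, `agg` plays the original aggregate's memory, and all names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Toy aggregate: two 32-bit lanes that SRA would have replaced with a
   single 64-bit scalar of a different type. */
typedef struct { uint32_t lane[2]; } toy_agg;

_Static_assert(sizeof(toy_agg) == sizeof(uint64_t), "toy layout assumption");

/* Only the second half: write one lane into the aggregate and reload
   the replacement from it.  The aggregate was never brought up to date
   from the replacement, so the untouched lane is stale. */
static uint64_t toy_write_lane_half(toy_agg *agg, int i, uint32_t v)
{
    uint64_t repl;
    agg->lane[i] = v;                  /* the partial write itself */
    memcpy(&repl, agg, sizeof repl);   /* repl = agg */
    return repl;
}

/* Full RMW: store the replacement into the aggregate first, perform
   the partial write, then load the replacement back. */
static uint64_t toy_write_lane_rmw(uint64_t repl, toy_agg *agg,
                                   int i, uint32_t v)
{
    memcpy(agg, &repl, sizeof repl);   /* agg = repl */
    agg->lane[i] = v;                  /* the partial write itself */
    memcpy(&repl, agg, sizeof repl);   /* repl = agg */
    return repl;
}
```

Starting from a replacement holding lanes {1, 2} and a stale aggregate, writing 7 into lane 0 gives {7, 2} with the full RMW, while the half version loses the 2.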

[Bug ipa/94434] [AArch64][SVE] ICE caused by incompatibility of SRA and svst3 builtin-function

2020-04-09 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94434

Martin Jambor  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2020-04-09
  Component|tree-optimization   |ipa
 Status|UNCONFIRMED |ASSIGNED
 CC||jamborm at gcc dot gnu.org,
   ||marxin at gcc dot gnu.org

[Bug tree-optimization/94482] [8/9 Regression] Inserting into vector with optimization enabled on x86 generates incorrect result

2020-04-09 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94482

Martin Jambor  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |jamborm at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #24 from Martin Jambor  ---
Fixed on trunk, will backport in a week or so.

[Bug ipa/94434] [AArch64][SVE] ICE caused by incompatibility of SRA and svst3 builtin-function

2020-04-09 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94434

Martin Jambor  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |jamborm at gcc dot 
gnu.org

--- Comment #1 from Martin Jambor  ---
Created attachment 48248
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48248&action=edit
Proposed fix

After our discussion on the mailing list, I'm currently testing this patch

[Bug ipa/94434] [AArch64][SVE] ICE caused by incompatibility of SRA and svst3 builtin-function

2020-04-09 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94434

Martin Jambor  changed:

   What|Removed |Added

  Attachment #48248|0   |1
is obsolete||

--- Comment #2 from Martin Jambor  ---
Created attachment 48249
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48249&action=edit
Proposed fix without a stupid pasto

The previous attachment had an obvious pasto in it, this is what I'm testing.

[Bug ipa/92550] [10 Regression] FAIL: gcc.dg/ipa/ipa-sra-8.c execution test

2020-04-09 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92550

--- Comment #3 from Martin Jambor  ---
Almost certainly started with new IPA-SRA (r275982 or as we now call
it gcc-10-3311-gff6686d2e5f).  I looked at dumps from a cross-compiler
and the funny bit is, however, that new IPA-SRA simply does nothing.

That is not as it should be.  Because foo is not versionable, the pass
does not even look at it and then cannot do anything because it has
not seen a call to get_a.  But of course it should still analyze
outgoing calls to allow IPA-SRA of callees.

But that is merely a missed optimization, not this miscompilation.  It
looks almost as if it was simply the expand of misaligned structure
copy that is broken on (this?) strict-aliasing target.  I also believe
the test case does not successfully run when compiled with earlier
revisions and option -fno-ipa-sra.

[Bug target/92550] [10 Regression] FAIL: gcc.dg/ipa/ipa-sra-8.c execution test

2020-04-09 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92550

Martin Jambor  changed:

   What|Removed |Added

  Component|ipa |target

--- Comment #4 from Martin Jambor  ---
Not an IPA issue.

[Bug ipa/94434] [AArch64][SVE] ICE caused by incompatibility of SRA and svst3 builtin-function

2020-04-09 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94434

--- Comment #3 from Martin Jambor  ---
I have proposed the patch on the mailing list:
https://gcc.gnu.org/pipermail/gcc-patches/2020-April/543658.html

[Bug ipa/94434] [AArch64][SVE] ICE caused by incompatibility of SRA and svst3 builtin-function

2020-04-14 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94434

Martin Jambor  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Martin Jambor  ---
Fixed.

[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567

2020-04-14 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621

--- Comment #18 from Martin Jambor  ---
I posted a patch to fix this for review to the mailing list:

https://gcc.gnu.org/pipermail/gcc-patches/2020-April/543659.html

[Bug tree-optimization/94598] [10 Regression] ICE in verify_sra_access_forest, at tree-sra.c:2360 with -O1 or higher since r10-6321-g636e80eea24b780f

2020-04-15 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94598

--- Comment #2 from Martin Jambor  ---
For arrays of size 1, get_ref_base_and_extent knows that the expression can
only access the one element even if the index is a variable.  It seems it does
not happen if the ARRAY_REF is within a COMPONENT_REF, an expression created by
new total scalarization.  I'll adjust the assert for GCC 10 but will also have
a look at why get_ref_base_and_extent does that.
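
A toy model of the size versus max_size distinction mentioned above.  It is only a sketch with hypothetical names and arbitrary units, and it does not reproduce what get_ref_base_and_extent actually does.

```c
#include <stdbool.h>

/* Toy result: where an access starts, how big it is, and how big the
   region is that it might touch. */
struct toy_extent { long offset; long size; long max_size; };

/* Extent of an access to one element of an array with nelts elements
   of elt_size units each. */
static struct toy_extent toy_array_ref_extent(long nelts, long elt_size,
                                              bool index_is_constant,
                                              long index)
{
    struct toy_extent e;
    e.size = elt_size;
    if (index_is_constant || nelts == 1) {
        /* Exactly one element can be touched: either the named one or,
           for a one-element array, the only one there is. */
        e.offset = (index_is_constant ? index : 0) * elt_size;
        e.max_size = elt_size;
    } else {
        /* Variable index: any element of the array may be touched. */
        e.offset = 0;
        e.max_size = nelts * elt_size;
    }
    return e;
}
```

For a one-element array the model reports max_size equal to size even with a variable index, which is the situation the assert discussed in this bug has to cope with.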

[Bug tree-optimization/94598] [10 Regression] ICE in verify_sra_access_forest, at tree-sra.c:2360 with -O1 or higher since r10-6321-g636e80eea24b780f

2020-04-15 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94598

--- Comment #3 from Martin Jambor  ---
I'm going to test the following:

--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -2357,9 +2357,11 @@ verify_sra_access_forest (struct access *root)
   gcc_assert (base == first_base);
   gcc_assert (offset == access->offset);
   gcc_assert (access->grp_unscalarizable_region
+ || access->grp_total_scalarization
  || size == max_size);
-  gcc_assert (!is_gimple_reg_type (access->type)
- || max_size == access->size);
+  gcc_assert (!access->grp_unscalarizable_region
+ || !is_gimple_reg_type (access->type)
+ || size == access->size);
   gcc_assert (reverse == access->reverse);

   if (access->first_child)

[Bug tree-optimization/94598] [10 Regression] ICE in verify_sra_access_forest, at tree-sra.c:2360 with -O1 or higher since r10-6321-g636e80eea24b780f

2020-04-15 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94598

--- Comment #4 from Martin Jambor  ---
I proposed the fix on the mailing list:
https://gcc.gnu.org/pipermail/gcc-patches/2020-April/543909.html

(Note that the one in comment #3 has a small but important typo.)

[Bug tree-optimization/94598] [10 Regression] ICE in verify_sra_access_forest, at tree-sra.c:2360 with -O1 or higher since r10-6321-g636e80eea24b780f

2020-04-16 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94598

Martin Jambor  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Martin Jambor  ---
Fixed, thanks for reporting.

[Bug ipa/93621] [10 Regression] ICE in redirect_call_stmt_to_callee, at cgraph.c:1443 since r10-5567

2020-04-16 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93621

Martin Jambor  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #20 from Martin Jambor  ---
Fixed for GCC 10, see the review email thread for caveats/future plans about
this.

[Bug ipa/93385] [10 Regression] wrong code with u128 modulo at -O2 -fno-dce -fno-ipa-cp -fno-tree-dce

2020-04-17 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93385

--- Comment #17 from Martin Jambor  ---
Created attachment 48302
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48302&action=edit
Untested fix

I'm playing with this - only very mildly tested - fix.

[Bug ipa/93385] [10 Regression] wrong code with u128 modulo at -O2 -fno-dce -fno-ipa-cp -fno-tree-dce

2020-04-17 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93385

--- Comment #22 from Martin Jambor  ---
(In reply to Jakub Jelinek from comment #18)
> Comment on attachment 48302 [details]
> Untested fix
> 
> + /* IPA-SRA does not analyze other types of statements.  */
> + gcc_unreachable ();
> Won't this ICE on any is_gimple_debug stmt?  Those should be just ignored
> and normal SSA_NAME handling should DTRT for those.

Yeah, it most probably will, I wrote it was only very mildly tested
(i.e. I only ran IPA testcases on it) - I wanted to post what I had
before I had to stop working on this for a few hours.

> As for PHIs, can you just gsi_remove them?
> Looking at tree-ssa-dce.c, it uses remove_phi_node rather than
> gsi_remove for PHIs.  And for non-PHIs, it calls release_defs after
> gsi_remove.

You are again most probably right, I keep forgetting about this.

> 
> Plus, I think in isra_track_scalar_value_uses for non-is_gimple_{debug,call}
> we should punt if !flag_tree_dce, i.e. when user asked not to perform dead
> code elimination.  Though, guess that hunk should be added only after this
> is tested (and perhaps the testcase or its copy should use
> -fdisable-tree-dce or whatever other way to avoid doing DCE even when
> flag_tree_dce is non-zero.

OK, that makes sense.  I'd slightly prefer the patch in comment #11
for this so that directly passing a parameter to another function
without any modification is still not considered as doing DCE - but I
also do not really care too much.

[Bug ipa/93385] [10 Regression] wrong code with u128 modulo at -O2 -fno-dce -fno-ipa-cp -fno-tree-dce

2020-04-17 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93385

--- Comment #25 from Martin Jambor  ---
(In reply to rguent...@suse.de from comment #21)
> Btw, I'd much prefer to not first copy the stmts and then remove them.
> Instead the DCE "analysis" can be done on the original IL and stmts
> be "marked" to be elided during copying.  That saves generating
> SSA names and gimple stmts rather than needing to remove them after the
> fact.

It is of course easy to change the patch to do the analysis on the
original and just create a hash_set of statements/SSA_NAMES to not
copy.  I'll do that.

As far as remapping the removed values to ERROR_MARK, I'm not sure.
We'd need to remap some SSA_NAMES of the same DECL differently than
other names (e.g. default-definition of the removed PARM_DECL would
get remapped to ERROR_MARK but not other SSA_NAMES and similarly for
other SSA_NAMES derived from those default-defs) ...and ATM I do not
know to what extent that is a problem.  But I can try.
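
For illustration only, the "analyze first, then skip while copying"
idea can be sketched on a toy model.  ToyStmt and copy_body_skipping
are hypothetical names invented for this sketch; they are not GCC
types or functions, and the real patch works on gimple statements and
SSA_NAMEs rather than integer ids.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Toy stand-in for a GIMPLE statement; not a real GCC type.
struct ToyStmt {
  int id;            // unique statement id
  std::string text;  // printable form, for illustration only
};

// Copy BODY, leaving out every statement whose id is in DEAD.
// The analysis that fills DEAD runs on the original body, so nothing
// has to be created first and removed afterwards.
std::vector<ToyStmt> copy_body_skipping(const std::vector<ToyStmt> &body,
                                        const std::unordered_set<int> &dead) {
  std::vector<ToyStmt> copy;
  for (const ToyStmt &s : body)
    if (dead.count(s.id) == 0)
      copy.push_back(s);
  // Skipping can only shrink the body, never grow it.
  assert(copy.size() <= body.size());
  return copy;
}
```

The set is filled before any copying starts, so the copy loop never
allocates anything for a statement that is going to be dropped.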

[Bug ipa/93385] [10 Regression] wrong code with u128 modulo at -O2 -fno-dce -fno-ipa-cp -fno-tree-dce

2020-04-20 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93385

--- Comment #30 from Martin Jambor  ---
Created attachment 48320
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48320&action=edit
Todays WIP patch

This is today's (still very much) WIP patch.

- It marks statements which should not be copied before copying them
  and then skips them.
- It does map SSA_NAMEs which should not survive to error_mark_node.
- Processing of calls is however still necessary, we cannot leave
  error_mark_nodes in the IL (until call redirection deals with it
  based on callee info).

But:

- It ICEs on gcc.dg/torture/pr48063.c.  I understand the problem,
  IPA-CP attempts to replace a floating-point parameter with an
  integer constant and fails but this fools the new DCE thingy into
  thinking some analysis declared the parameter unused even though it
  is used.  I'll have to make ipa_param_body_adjustments aware of
  tree_map.  (The original idea was to make it part of tree_map but
  for some reason I gave up on that.)

- There are three libgomp C++ ICEs that I know about which I have not
  even looked at.  I have not attempted any bootstrap yet.  I have not
  yet tested anything other than C/C++/Fortran.

- The new hash maps, or at least the one for statements, might be
  better placed in copy_body_data, the current place is just more
  convenient for the moment.  I do not care too much.

- Information currently stored in m_dead_ssas might be obtainable from
  decl_map in copy_body_data.

- I have not thought about debug statements yet and just ignored them
  for now.  I do want to handle them after other things work.

Any feedback welcome.

[Bug tree-optimization/94482] [8/9 Regression] Inserting into vector with optimization enabled on x86 generates incorrect result

2020-04-21 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94482

Martin Jambor  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #29 from Martin Jambor  ---
So this particular bug is fixed on trunk and both open release branches.

Evan, if the issue you described in comment #25 persists even with
a patched compiler, I suggest you open a new bug.

[Bug ipa/94472] 400.perlbench is slower when compiled at -O2 with both PGO and LTO on AMD Zen CPUs

2020-04-28 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94472

--- Comment #3 from Martin Jambor  ---
My benchmarking setup is currently gone so unfortunately no, not easily.  I'll
be re-measuring everything on a different computer with a slightly different
CPU model soon, so after that I guess I could.  But it is most likely the
limits, yes.

[Bug ipa/94856] [10 Regression] ICE: Segmentation fault (in clone_of_p); or ICE: verify_cgraph_node failed (error: edge points to wrong declaration) since r10-4944-g1e83bd7003e03160

2020-04-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94856

--- Comment #7 from Martin Jambor  ---
The "edge points to wrong decl" case is a verifier error.  We have a
method which (in the course of IPA-CP) loses its this pointer because
it is unused and the pass then does not clone all the this-adjusting
thunks and just makes the calls go straight to the new clone - and
then the verifier complains that the edge does not seem to point to a
clone of what it used to.  This looked weird because the verifier
actually has logic detecting this case but it turns out that it is
confused by the inliner's body-saving mechanism, which invents a new
decl for the base function.

The inliner's body-saving mechanism should correctly set former_clone_of
and then we can detect this case too.  Then we pass this particular
round of verification but the subsequent one fails because we have
inlined the function into its former thunk - which subsequently does
not have any callees, but the verifier still accesses them and segfaults
just like in the original -fopenacc case.  That is why the following
(as yet untested) patch most likely fixes that case too:

diff --git a/gcc/cgraph.c b/gcc/cgraph.c
index 72d7cb54301..2a9813df2d9 100644
--- a/gcc/cgraph.c
+++ b/gcc/cgraph.c
@@ -3104,15 +3104,17 @@ clone_of_p (cgraph_node *node, cgraph_node *node2)
return false;
   /* In case of instrumented expanded thunks, which can have multiple calls
 in them, we do not know how to continue and just have to be
-optimistic.  */
-  if (node->callees->next_callee)
+optimistic.  The same applies if all calls have already been inlined
+into the thunk.  */
+  if (!node->callees || node->callees->next_callee)
return true;
   node = node->callees->callee->ultimate_alias_target ();

   if (!node2->clone.param_adjustments
  || node2->clone.param_adjustments->first_param_intact_p ())
return false;
-  if (node2->former_clone_of == node->decl)
+  if (node2->former_clone_of == node->decl
+ || node2->former_clone_of == node->former_clone_of)
return true;

   cgraph_node *n2 = node2;
diff --git a/gcc/ipa-inline-transform.c b/gcc/ipa-inline-transform.c
index be60bbccb5c..e9e21cc0296 100644
--- a/gcc/ipa-inline-transform.c
+++ b/gcc/ipa-inline-transform.c
@@ -607,6 +607,8 @@ save_inline_function_body (struct cgraph_node *node)
}
 }
   *ipa_saved_clone_sources->get_create (first_clone) = prev_body_holder;
+  first_clone->former_clone_of
+= node->former_clone_of ? node->former_clone_of : node->decl;
   first_clone->clone_of = NULL;

   /* Now node in question has no clones.  */
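
The guard added to clone_of_p above can be illustrated in isolation.
This is only a toy model: ToyEdge, ToyNode, must_be_optimistic and
single_callee are hypothetical names made up for the sketch and do not
exist in cgraph.c.

```cpp
#include <cassert>

// Toy stand-ins for cgraph_edge / cgraph_node; not the real GCC types.
struct ToyEdge {
  ToyEdge *next_callee = nullptr;
};
struct ToyNode {
  ToyEdge *callees = nullptr;  // null once every call has been inlined
};

// Returns true when the thunk cannot be analysed by following its single
// callee: it has either no outgoing call at all or more than one.
// Testing for null first is what keeps the second test from crashing.
bool must_be_optimistic(const ToyNode &thunk) {
  return !thunk.callees || thunk.callees->next_callee;
}

// Only valid when must_be_optimistic returned false.
const ToyEdge *single_callee(const ToyNode &thunk) {
  assert(!must_be_optimistic(thunk));
  return thunk.callees;
}
```

Without the null test, a thunk whose only call has been inlined would
make the next_callee access dereference a null pointer, which is the
kind of segfault described above.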

[Bug ipa/94856] [10 Regression] ICE: Segmentation fault (in clone_of_p); or ICE: verify_cgraph_node failed (error: edge points to wrong declaration) since r10-4944-g1e83bd7003e03160

2020-04-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94856

--- Comment #8 from Martin Jambor  ---
I proposed the patch on the mailing list:
https://gcc.gnu.org/pipermail/gcc-patches/2020-April/544943.html

[Bug ipa/94856] [10/11 Regression] ICE: Segmentation fault (in clone_of_p); or ICE: verify_cgraph_node failed (error: edge points to wrong declaration) since r10-4944-g1e83bd7003e03160

2020-04-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94856

Martin Jambor  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #11 from Martin Jambor  ---
Fixed on both master and the newly created gcc-10 branch.

[Bug libgomp/68033] OpenMP: ICE with teams distribute

2020-05-12 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68033

Martin Jambor  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Martin Jambor  ---
Confirmed, this got fixed at some point in the GCC 7 development cycle.  So
let's close the bug.   Thanks for having a look.

[Bug target/95336] Bad code gen omnetpp_r aarch64

2020-05-26 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95336

--- Comment #6 from Martin Jambor  ---
(In reply to Erick Ochoa from comment #0)

[...]

> I did a bisection from
> 
> commit f47f687a97260b1a1305cbf2d7ee3d74b2916a74
> Author: Richard Biener 
> Date:   Thu Apr 25 17:58:56 2019 +
> 
> to:
> 
> commit 4945b4c2c8628bdd61b348ea5bd1f9b72537a36e (HEAD)
> Author: Martin Liska 
> Date:   Tue May 26 09:01:41 2020 +0200
> 
> and I found that the following commit may have introduced the error:
> 
> commit ff6686d2e5f797d6c6a36ad14a7084bc1dc350e4
> Author: Martin Jambor 
> Date:   Fri Sep 20 00:25:04 2019 +0200
> 

Can you please try the previous revision (6889a3acfee) but with option
-fno-ipa-sra ?  If it fails, it means that the previous implementation
of IPA-SRA hid some other error (we have already had an aliasing bug
like that) - in that case it would be great if you could bisect again,
this time with this option.

[Bug debug/95343] New: IPA-SRA can result in bad debug info about removed function arguments

2020-05-26 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95343

Bug ID: 95343
   Summary: IPA-SRA can result in bad debug info about removed
function arguments
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: debug
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jamborm at gcc dot gnu.org
  Target Milestone: ---
  Host: x86_64-linux
Target: x86_64-linux

Created attachment 48608
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48608&action=edit
Testcase

ipa_param_adjustments::modify_call does not properly account for extra
arguments left over from clone materialization when recording debug
info.  Therefore, when the attached testcase is compiled with -O2 or
higher and run in gdb with a breakpoint set at line 20, where we
examine the value of parameter i, it incorrectly reports 4, even
though it should be 2.

[Bug debug/95343] IPA-SRA can result in wrong debug info about removed function arguments

2020-05-26 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95343

Martin Jambor  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |jamborm at gcc dot gnu.org
Summary|IPA-SRA can result in bad   |IPA-SRA can result in wrong
   |debug info about removed|debug info about removed
   |function arguments  |function arguments
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2020-05-26
 Ever confirmed|0   |1

--- Comment #1 from Martin Jambor  ---
The simplest fix which will make i reported as "optimized out" is the
following.  But I am testing a patch which can make gdb actually show
the correct 4.  Still, the following is usable for gcc 10 if the full
patch is deemed too risky:

diff --git a/gcc/ipa-param-manipulation.c b/gcc/ipa-param-manipulation.c
index 978916057f0..2a04f7b3ce5 100644
--- a/gcc/ipa-param-manipulation.c
+++ b/gcc/ipa-param-manipulation.c
@@ -787,7 +787,12 @@ ipa_param_adjustments::modify_call (gcall *stmt,
  if (!is_gimple_reg (old_parm) || kept[i])
continue;
  tree origin = DECL_ORIGIN (old_parm);
- tree arg = gimple_call_arg (stmt, i);
+ int index;
+ if (transitive_remapping)
+   index = index_map[i];
+ else
+   index = i;
+ tree arg = gimple_call_arg (stmt, index);

  if (!useless_type_conversion_p (TREE_TYPE (origin), TREE_TYPE (arg)))
{
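
As a rough illustration of the index selection in the hunk above,
here is a toy version outside of GCC.  The function name
arg_for_original_param is hypothetical, and the contents and direction
of index_map are an assumption made for the sketch; the real data
structures in ipa-param-manipulation.c are more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pick the actual argument that corresponds to original parameter I.
// INDEX_MAP is only consulted when TRANSITIVE_REMAPPING is set, i.e. when
// the call's argument list has already been rewritten once and position I
// no longer holds the argument for parameter I.
int arg_for_original_param(const std::vector<int> &call_args,
                           const std::vector<std::size_t> &index_map,
                           bool transitive_remapping, std::size_t i) {
  std::size_t index = transitive_remapping ? index_map[i] : i;
  assert(index < call_args.size());
  return call_args[index];
}
```

With arguments {4, 2} and a map that sends position 0 to position 1,
the unmapped lookup yields 4 while the remapped one yields 2.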

[Bug debug/95343] IPA-SRA can result in wrong debug info about removed function arguments

2020-05-26 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95343

--- Comment #2 from Martin Jambor  ---
(In reply to Martin Jambor from comment #1)
> ...I am testing a patch which can make gdb actually show
> the correct 4. 

I meant the correct value 2, of course.

[Bug web/95380] ipcp-unit-growth was renamed to ipa-cp-unit-growth

2020-05-28 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95380

--- Comment #4 from Martin Jambor  ---
(In reply to Martin Liška from comment #3)
> Fixed for master, not planning to backport that.

Why not?  Are any of the parameters only in GCC 11?

Should I prepare a special GCC 10 patch just to address the ipcp-unit-growth ->
ipa-cp-unit-growth change then?

[Bug ipa/93385] [10/11 Regression] wrong code with u128 modulo at -O2 -fno-dce -fno-ipa-cp -fno-tree-dce

2020-05-28 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93385

--- Comment #35 from Martin Jambor  ---
I have proposed a patch series that deals with this issue, including proper
adjustments to debug info, on the mailing list:

https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546702.html

[Bug tree-optimization/95113] [10/11 Regression] Wrong code w/ -O2 -fexceptions -fnon-call-exceptions

2020-05-28 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95113

--- Comment #4 from Martin Jambor  ---
(In reply to Arseny Solokha from comment #3)
> 
> Indeed, -fno-ipa-sra fixes it. So, a duplicate of PR93385?

Similar, but not quite the same.  I have proposed a fix on the mailing
list: https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546703.html

[Bug debug/95343] IPA-SRA can result in wrong debug info about removed function arguments

2020-05-28 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95343

--- Comment #3 from Martin Jambor  ---
I have proposed a patch series on the mailing list to address PR 93385 and the
last patch in it also addresses this issue and allows gdb to print the correct
value of the removed parameter:

https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546705.html

[Bug tree-optimization/95113] [10/11 Regression] Wrong code w/ -O2 -fexceptions -fnon-call-exceptions

2020-06-08 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95113

--- Comment #7 from Martin Jambor  ---
Fixed.  Thanks for reporting.

[Bug ipa/93385] [10/11 Regression] wrong code with u128 modulo at -O2 -fno-dce -fno-ipa-cp -fno-tree-dce

2020-06-08 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93385
Bug 93385 depends on bug 95113, which changed state.

Bug 95113 Summary: [10/11 Regression] Wrong code w/ -O2 -fexceptions -fnon-call-exceptions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95113

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/95113] [10/11 Regression] Wrong code w/ -O2 -fexceptions -fnon-call-exceptions

2020-06-08 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95113

Martin Jambor  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #8 from Martin Jambor  ---
...and marking it as such.

[Bug bootstrap/95970] gcc/go/gofrontend/types.cc:1474:34: warning: ‘this’ pointer null

2020-06-29 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95970

Martin Jambor  changed:

   What|Removed |Added

   Last reconfirmed||2020-06-29
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
 CC||ian at airs dot com

--- Comment #1 from Martin Jambor  ---
I hit this today too (and it indeed prevents go bootstrap), so I guess it's
confirmed.  Ian, can you have a look whether the warning is correct?  I glanced
at the code only for a little while but it looks so to me.

[Bug ipa/96040] [10/11 Regression] Compiled code causes SIGBUS at -O2

2020-07-03 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96040

Martin Jambor  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |jamborm at gcc dot gnu.org

--- Comment #4 from Martin Jambor  ---
I'll have a look

[Bug ipa/96040] [10/11 Regression] Compiled code causes SIGBUS at -O2

2020-07-03 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96040

--- Comment #5 from Martin Jambor  ---
IPA-split puts the double access to the union in the .part function
and keeps only the long int access in the "original" function.
IPA-SRA thinks it can work with that but the code in "transitive" call
parameter splitting apparently does not handle this case properly.

The easiest fix and probably the one most suitable for backporting is
to prevent splitting of such unions with the following:

--- a/gcc/ipa-sra.c
+++ b/gcc/ipa-sra.c
@@ -3271,7 +3271,9 @@ all_callee_accesses_present_p (isra_param_desc *param_desc,
continue;
   param_access *pacc = find_param_access (param_desc, argacc->unit_offset,
  argacc->unit_size);
-  if (!pacc || !pacc->certain)
+  if (!pacc
+ || !pacc->certain
+ || !types_compatible_p (argacc->type, pacc->type))
return false;
 }
   return true;


Alternatively, we can of course handle the type mismatch and insert
an appropriate VIEW_CONVERT_EXPR (V_C_E):

diff --git a/gcc/ipa-param-manipulation.c b/gcc/ipa-param-manipulation.c
index 2cc4bc79dc1..de9bad78712 100644
--- a/gcc/ipa-param-manipulation.c
+++ b/gcc/ipa-param-manipulation.c
@@ -641,6 +641,12 @@ ipa_param_adjustments::modify_call (gcall *stmt,
&& trans_map[j].unit_offset == apm->unit_offset)
  {
repl = trans_map[j].repl;
+   if (!useless_type_conversion_p (apm->type, TREE_TYPE (repl)))
+ {
+   repl = build1 (VIEW_CONVERT_EXPR, apm->type, repl);
+   repl = force_gimple_operand_gsi (&gsi, repl, true, NULL, true,
+GSI_SAME_STMT);
+ }
break;
  }
   if (repl)

[Bug ipa/96040] [10/11 Regression] Compiled code causes SIGBUS at -O2

2020-07-03 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96040

--- Comment #7 from Martin Jambor  ---
Yes, IPA-SRA identifies accesses by both offset and size, so the situation
would not have happened if the size was different.
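
To make the point concrete, here is a small stand-alone sketch of a
lookup keyed by offset and size, followed by the extra type check from
the first patch in comment #5.  ToyAccess, find_access and
callee_access_usable are hypothetical names for this sketch only, and
comparing type names as strings merely stands in for
types_compatible_p.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for an IPA-SRA access record; not the real param_access.
struct ToyAccess {
  unsigned unit_offset;
  unsigned unit_size;
  std::string type;  // type name, compared textually in this sketch
  bool certain;
};

// Look an access up by offset and size only.
const ToyAccess *find_access(const std::vector<ToyAccess> &accs,
                             unsigned offset, unsigned size) {
  for (const ToyAccess &a : accs)
    if (a.unit_offset == offset && a.unit_size == size)
      return &a;
  return nullptr;
}

// Accept a caller-side access only if the callee has a certain access at
// the same offset and size *and* with the same type.
bool callee_access_usable(const std::vector<ToyAccess> &callee_accs,
                          const ToyAccess &arg) {
  assert(arg.unit_size > 0);
  const ToyAccess *p = find_access(callee_accs, arg.unit_offset, arg.unit_size);
  return p && p->certain && p->type == arg.type;
}
```

A long and a double member of a union share both offset and size, so
only the type comparison tells them apart; an access of a different
size would already fail the lookup.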

[Bug ipa/96040] [10/11 Regression] Compiled code causes SIGBUS at -O2

2020-07-03 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96040

--- Comment #9 from Martin Jambor  ---
True. Richi expressed a preference for avoiding the transform when there are type
mismatches, so I'm currently bootstrapping that.  I guess we can always revisit
the decision if we ever discover it would be really beneficial to perform the
split.

[Bug ipa/96040] [10/11 Regression] Compiled code causes SIGBUS at -O2

2020-07-04 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96040

Martin Jambor  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #12 from Martin Jambor  ---
Fixed.

[Bug ipa/96291] [10/11 Regression] -flto fails as "internal compiler error: Segmentation fault" during IPA pass: cp incall_for_symbol_thunks_and_aliases()

2020-07-23 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96291

Martin Jambor  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |jamborm at gcc dot gnu.org

--- Comment #2 from Martin Jambor  ---
I guess I should take a look

[Bug ipa/96235] Segmentation fault with "-Og -fno-dce -fno-tree-dce -finline-small-functions -fipa-sra"

2020-07-23 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96235

--- Comment #6 from Martin Jambor  ---
(In reply to Martin Liška from comment #4)
> It seems to me something related to IPA SRA.
> @Martin: Can you please take a look?

I will but -fno-dce -fno-tree-dce strongly suggests this is a duplicate of PR
93385.

[Bug ipa/96291] [10/11 Regression] -flto fails as "internal compiler error: Segmentation fault" during IPA pass: cp incall_for_symbol_thunks_and_aliases()

2020-07-27 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96291

Martin Jambor  changed:

   What|Removed |Added

   Assignee|jamborm at gcc dot gnu.org |slyfox at inbox dot ru

--- Comment #8 from Martin Jambor  ---
Sergei's patch is correct (I just suggested to write the condition
differently).

[Bug ipa/96235] Segmentation fault with "-Og -fno-dce -fno-tree-dce -finline-small-functions -fipa-sra"

2020-07-27 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96235

Martin Jambor  changed:

   What|Removed |Added

 Resolution|--- |DUPLICATE
 Status|NEW |RESOLVED

--- Comment #8 from Martin Jambor  ---
It is clearly a duplicate of PR 93385.

What was the reason to switch off DCE in the first place?  Was it just meant as
a stress test for the compiler?

I'll try to come up with a somewhat less controversial patch for the problem.

*** This bug has been marked as a duplicate of bug 93385 ***

[Bug ipa/93385] [10/11 Regression] wrong code with u128 modulo at -O2 -fno-dce -fno-ipa-cp -fno-tree-dce

2020-07-27 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93385

Martin Jambor  changed:

   What|Removed |Added

 CC||suochenyao at 163 dot com

--- Comment #37 from Martin Jambor  ---
*** Bug 96235 has been marked as a duplicate of this bug. ***

[Bug target/84481] [8/9/10/11 Regression] 429.mcf with -O2 regresses by ~6% and ~4%, depending on tuning, on Zen compared to GCC 7.2

2020-07-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84481

--- Comment #12 from Martin Jambor  ---
I can once again confirm the slowdown on a zen1-based machine (commit
6e1e0decc9e vs gcc 7.5) but it is not present on a zen2-based one.  I wonder
whether the bug should be marked as WONTFIX.

[Bug target/84490] [8/9/10/11 regression] 436.cactusADM regressed by 6-8% percent with -Ofast on Zen and Haswell, compared to gcc 7.2

2020-07-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84490

--- Comment #15 from Martin Jambor  ---
The problem sometimes is still there, sometimes it isn't:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=37.100.0&plot.1=27.100.0&;

I wonder whether we should keep this bug open, the benchmark seems too
erratic.

[Bug target/90234] 503.bwaves_r is 6% slower on Zen1/Zen2 CPUs at -Ofast with native march/mtune than with generic ones

2020-07-30 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90234

Martin Jambor  changed:

   What|Removed |Added

Summary|503.bwaves_r is 6% slower   |503.bwaves_r is 6% slower
   |on Zen1 CPUs at -Ofast with |on Zen1/Zen2 CPUs at -Ofast
   |native march/mtune than |with native march/mtune
   |with generic ones   |than with generic ones

--- Comment #2 from Martin Jambor  ---
I spoke too soon, I can see this in May gcc 10.1 data on a zen1 machine and also
in current master (6e1e0decc9e) on a zen2 machine, still about 6% in both
cases.

(GCC 9 does not have this problem on zen2 but does on zen1 so it looks a bit
fragile).

[Bug tree-optimization/96730] [10/11 Regression] ICE on x86_64-linux-gnu with `-O1` to `-O3` (in verify_sra_access_forest, at tree-sra.c:2352) since r10-6320-g5b9e89c922dc2e7e

2020-08-24 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96730

Martin Jambor  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |jamborm at gcc dot gnu.org

--- Comment #2 from Martin Jambor  ---
Mine.

[Bug tree-optimization/96730] [10/11 Regression] ICE on x86_64-linux-gnu with `-O1` to `-O3` (in verify_sra_access_forest, at tree-sra.c:2352) since r10-6320-g5b9e89c922dc2e7e

2020-08-24 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96730

--- Comment #3 from Martin Jambor  ---
I have proposed a fix on the mailing list:
https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552488.html

[Bug tree-optimization/96730] [10/11 Regression] ICE on x86_64-linux-gnu with `-O1` to `-O3` (in verify_sra_access_forest, at tree-sra.c:2352) since r10-6320-g5b9e89c922dc2e7e

2020-08-25 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96730

Martin Jambor  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Martin Jambor  ---
Fixed, thanks for reporting.
