[Bug tree-optimization/86054] New: [8.1/9 Regression] SPEC CPU2006 416.gamess miscompare after r259592 with march=skylake-avx512

2018-06-05 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86054

Bug ID: 86054
   Summary: [8.1/9 Regression] SPEC CPU2006 416.gamess miscompare
after r259592 with march=skylake-avx512
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Created attachment 44237
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44237&action=edit
reproducer

r259592 introduced a runfail due to a miscompare in SPEC CPU2006 416.gamess.
Minimal optset to reproduce is "-O3 -march=skylake-avx512".
There is no miscompare when vectorization is disabled with
"-fno-tree-vectorize".

This benchmark has some known issues described in its documentation
(https://www.spec.org/cpu2006/Docs/416.gamess.html):
"Some arrays are accessed past the end of the defined array size. This will,
however, not cause memory access faults" and
"The argument array sizes defined in some subroutines do not match the size of
the actual argument passed"

The problem is in the "JKBCDF" function in "grd2c.F" and is related to these
issues with wrong argument array sizes.
Patching the benchmark source fixes the problem (though such patching is not
allowed by SPEC rules):
"DIMENSION ABV(5,1),CV(18,1)" -> "DIMENSION ABV(5,NUMG),CV(18,NUMG)"

Reproducer is attached.
The symptom, both here and in the original 416.gamess case, is that some loop
iterations are skipped.
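For illustration only, here is a hedged C analogue of that pattern (all names
are made up, this is not the gamess source): the declared bound of the slow
dimension is 1, but the code indexes it up to numg-1, relying on the actual
storage being contiguous and larger:
---
/* Analogue of "DIMENSION ABV(5,1)" indexed as ABV(i,g) with g > 1: the
   accesses are formally outside the declared bounds, so optimizations that
   trust the declared bounds (vectorization, aggressive loop optimizations)
   may drop the "impossible" iterations - the skipped-iterations symptom
   described above. */
double abv[1][5];   /* declared too small; the real storage is larger */

void touch_all_groups(int numg)
{
  int g;
  for (g = 0; g < numg; g++)
    abv[g][0] += 1.0;   /* formally out of bounds for g > 0 */
}
---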

[Bug tree-optimization/86054] [8/9 Regression] SPEC CPU2006 416.gamess miscompare after r259592 with march=skylake-avx512

2018-06-05 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86054

--- Comment #2 from Alexander Nesterovskiy  ---
Thanks, "-fno-aggressive-loop-optimizations" helps.

[Bug tree-optimization/83326] [8 Regression] SPEC CPU2017 648.exchange2_s ~6% performance regression with r255267 (reproducer attached)

2017-12-21 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83326

--- Comment #6 from Alexander Nesterovskiy  ---
Thanks! I see a performance gain on 648.exchange2_s (~6% on Broadwell and ~3% on
Skylake-X), bringing performance back to the r255266 level (the Skylake-X
regression was ~3%).
The loops are unrolled with 2 and 3 iterations again. It's definitely fixed.

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-01-30 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #20 from Alexander Nesterovskiy  ---
I've made test runs on Broadwell and Skylake, RHEL 7.3.
410.bwaves became faster after r256990, but not as fast as it was with r253678.
Comparing 410.bwaves performance, "-Ofast -funroll-loops -flto
-ftree-parallelize-loops=4": 

rev   perf. relative to r253678, %
r253678   100%
r253679   54%
...
r256989   54%
r256990   71%

The CPU time distribution became flatter (~34% in thread0, ~22% in each of
threads1-3), but a lot of time is spent spinning in
libgomp.so.1.0.0/gomp_barrier_wait_end -> do_wait -> do_spin
and
libgomp.so.1.0.0/gomp_team_barrier_wait_end -> do_wait -> do_spin
r253678: spin time is ~10% of CPU time
r256990: spin time is ~30% of CPU time

[Bug ipa/84149] New: [8 Regression] SPEC CPU2017 505.mcf/605.mcf ~10% performance regression with r256888

2018-01-31 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84149

Bug ID: 84149
   Summary: [8 Regression] SPEC CPU2017 505.mcf/605.mcf ~10%
performance regression with r256888
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.nesterovskiy at intel dot com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Minimal options to reproduce the regression (x86, 64-bit):
-O3 -flto

The reason behind the regression is that since r256888 the cost_compare function
is no longer inlined into spec_qsort.
These two functions are in different source files.
I've managed to force cost_compare to be inlined by creating, in the same source
file, a copy of the spec_qsort function with explicit calls to cost_compare.
This restored performance to the r256887 level.
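A minimal C sketch of that workaround idea (illustrative only; the real
cost_compare signature and the SPEC sorting code differ): the comparator lives
in the same translation unit and is called directly instead of through a
function pointer, so no cross-module inlining is needed:
---
#include <stddef.h>

typedef struct { long cost; } elem_t;            /* stand-in element type */

static int cost_compare(const elem_t *a, const elem_t *b)
{
  return (a->cost > b->cost) - (a->cost < b->cost);
}

/* Copy of the sorting routine with explicit, direct calls to cost_compare.
   A simple insertion sort stands in for spec_qsort's real algorithm. */
void sort_direct(elem_t *base, size_t n)
{
  size_t i, j;
  for (i = 1; i < n; i++) {
    elem_t key = base[i];
    for (j = i; j > 0 && cost_compare(&base[j - 1], &key) > 0; j--)
      base[j] = base[j - 1];
    base[j] = key;
  }
}
---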

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-02-02 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #23 from Alexander Nesterovskiy  ---
Created attachment 43326
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43326&action=edit
r253678 vs r256990

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-02-02 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #24 from Alexander Nesterovskiy  ---
Yes, it looks like more time is being spent on synchronization.
r256990 really changes the way autopar works:
For r253679...r256989 most of the work was done in the main thread0 (thread0
~91%, threads1-3 ~3% each).
For r256990 the distribution is the same as for r253678 (thread0 ~34%,
threads1-3 ~22% each), but a lot of time is spent spinning.
I've attached a chart comparing r253678 and r256990 on the same time scale
(~0.5 sec).
The libgomp.so.1.0.0 code executed in thread1 is wait functions in both cases,
and for r256990 they are called more often.

Setting OMP_WAIT_POLICY doesn't change much:
for ACTIVE - performance is nearly the same as the default
for PASSIVE - there is a serious performance drop for r256990 (which looks
reasonable given the many thread sleeps/wake-ups)

Changing parloops-schedule also has no positive effect:
r253678 performance is mostly the same for static, guided and dynamic
r256990 performance is best with static, which is the default

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-02-08 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #26 from Alexander Nesterovskiy  ---
Created attachment 43361
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43361&action=edit
r253678 vs r256990_work_spin

[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2018-02-08 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #27 from Alexander Nesterovskiy  ---
The place of interest here is a loop in the mat_times_vec function.
For r253678, mat_times_vec.constprop._loopfn.0 is created by autopar.
For r256990, mat_times_vec is inlined into bi_cgstab_block and three functions
are created by autopar:
bi_cgstab_block.constprop._loopfn.3
bi_cgstab_block.constprop._loopfn.6
bi_cgstab_block.constprop._loopfn.10
The sum of effective CPU time for these functions across all four threads is
very close for r253678 and r256990.
That looks reasonable, since in both cases the same amount of computation is
being done.
But there is a significant difference in spinning/wait time.

Measuring with OMP_WAIT_POLICY=ACTIVE seems to be more informative - threads
never sleep, they are either working or spinning (thanks, Jakub).
r253678 case:
Main thread0:      ~0%  of thread time spinning (~100% working)
Worker threads1-3: ~45% of thread time spinning (~55% working)
r256990 case:
Main thread0:      ~20% of thread time spinning (~80% working)
Worker threads1-3: ~50% of thread time spinning (~50% working)

I've attached a second chart comparing CPU time for both cases (r253678 vs
r256990_work_spin); I think it illustrates the difference better than the first
one.

[Bug tree-optimization/86702] New: [8/9 Regression] SPEC CPU2006 400.perlbench, CPU2017 500.perlbench_r ~3% performance drop after r262247

2018-07-27 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86702

Bug ID: 86702
   Summary: [8/9 Regression] SPEC CPU2006 400.perlbench, CPU2017
500.perlbench_r ~3% performance drop after r262247
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Created attachment 44453
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44453&action=edit
reproducer

It looks like branch probability information is lost in some cases after
r262247, during tree-switchlower1.

As a result there are performance drops of ~3% for SPEC CPU2006/2017 perlbench
with some particular compilation options/HW configurations, because of heavier
spilling/filling in a hot block.

It can be illustrated with a small example:
---
$ cat > reproducer.c
int foo(int bar)
{
  switch (bar)
  {
    case 0:
      return bar + 5;
    case 1:
      return bar - 4;
    case 2:
      return bar + 3;
    case 3:
      return bar - 2;
    case 4:
      return bar + 1;
    case 5:
      return bar;
    default:
      return 0;
  }
}
^Z
[2]+  Stopped cat > reproducer.c
$ ./r262246/bin/gcc -m64 -c -o /dev/null -O1
-fdump-tree-switchlower1=r262246_168t.switchlower1 reproducer.c
$ ./r262247/bin/gcc -m64 -c -o /dev/null -O1
-fdump-tree-switchlower1=r262247_168t.switchlower1 reproducer.c
$ cat r262246_168t.switchlower1

;; Function foo (foo, funcdef_no=0, decl_uid=2007, cgraph_uid=1,
symbol_order=0)

beginning to process the following SWITCH statement (reproducer.c:3) : ---
switch (bar_2(D))  [14.29%], case 0:  [57.14%], case 1: 
[14.29%], case 2:  [57.14%], case 3:  [14.29%], case 4 ... 5: 
[57.14%]>

;; GIMPLE switch case clusters: JT:0-5
Removing basic block 9
Merging blocks 2 and 8
Merging blocks 2 and 7

Symbols to be put in SSA form
{ D.2019 }
Incremental SSA update started at block: 0
Number of blocks in CFG: 7
Number of blocks to update: 6 ( 86%)


foo (int bar)
{
  int _1;

   [local count: 1073419798]:
  switch (bar_2(D))  [14.29%], case 0:  [57.14%], case 1:
 [14.29%], case 2:  [57.14%], case 3:  [14.29%], case 4 ... 5: 
[57.14%]>

   [local count: 613382737]:
:
  goto ; [100.00%]

   [local count: 153391689]:
:
  goto ; [100.00%]

   [local count: 153391689]:
:

   [local count: 1073741825]:
  # _1 = PHI <0(5), -3(2), 1(4), 5(3)>
:
  return _1;

}


$ cat r262247_168t.switchlower1

;; Function foo (foo, funcdef_no=0, decl_uid=2007, cgraph_uid=1,
symbol_order=0)

beginning to process the following SWITCH statement (reproducer.c:3) : ---
switch (bar_2(D))  [14.29%], case 0:  [57.14%], case 1: 
[14.29%], case 2:  [57.14%], case 3:  [14.29%], case 4 ... 5: 
[57.14%]>

;; GIMPLE switch case clusters: JT:0-5
Removing basic block 7
Merging blocks 2 and 8
Merging blocks 2 and 9

Symbols to be put in SSA form
{ D.2019 }
Incremental SSA update started at block: 0
Number of blocks in CFG: 7
Number of blocks to update: 6 ( 86%)


foo (int bar)
{
  int _1;

   [local count: 1073419798]:
  switch (bar_2(D))  [INV], case 0:  [INV], case 1: 
[INV], case 2:  [INV], case 3:  [INV], case 4 ... 5:  [INV]>

   [local count: 613382737]:
:
  goto ; [100.00%]

   [local count: 153391689]:
:
  goto ; [100.00%]

   [local count: 153391689]:
:

   [local count: 1073741825]:
  # _1 = PHI <0(5), -3(2), 1(4), 5(3)>
:
  return _1;

}
---

The same happens on current trunk (I tried r263027).

[Bug tree-optimization/86702] [9 Regression] SPEC CPU2006 400.perlbench, CPU2017 500.perlbench_r ~3% performance drop after r262247

2018-08-02 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86702

--- Comment #4 from Alexander Nesterovskiy  ---
I've noticed performance regressions on different targets and with different
compilation options, not only highly optimized ones like "-march=skylake-avx512
-Ofast -flto -funroll-loops" but with "-O2" too.
The simplest case is 500.perlbench_r with "-O2" on Broadwell, executed in one
copy.

The performance drop is not in one particular place but is "spread" over the
whole S_regmatch function, which is really big.
My guess was that losing these probabilities affects the passes that follow
tree-switchlower1.
And that is what I see in the generated assembly - somewhat different
spilling/filling and a different order of blocks.

[Bug tree-optimization/87214] New: [9 Regression] SPEC CPU2017, CPU2006 520/620, 403 runfails after r263772 with march=skylake-avx512

2018-09-04 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87214

Bug ID: 87214
   Summary: [9 Regression] SPEC CPU2017, CPU2006 520/620, 403
runfails after r263772 with march=skylake-avx512
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

There are runfails for the following benchmarks since r263772:
SPEC2017 520/620: (Segmentation fault, minimal optset to reproduce: "-O3
-march=skylake-avx512 -flto")
SPEC2006 403: (SPEC miscompare, minimal optset to reproduce: "-O3
-march=skylake-avx512")

Running 520.omnetpp_r under GDB:
---
...
Program received signal SIGSEGV, Segmentation fault.
0x004a611e in isName (s=, this=) at
simulator/ccomponent.cc:143
143 if (paramv[i].isName(parname))
(gdb) backtrace
#0  0x004a611e in isName (s=, this=) at
simulator/ccomponent.cc:143
#1  cComponent::findPar (this=0x76633380, parname=0x76603548 "bs") at
simulator/ccomponent.cc:143
#2  0x004a87b3 in cComponent::par(char const*) () at
simulator/ccomponent.cc:133
#3  0x004b676d in cNEDNetworkBuilder::doParam(cComponent*,
ParamElement*, bool) () at simulator/cnednetworkbuilder.cc:179
#4  0x004b8610 in doParams (isSubcomponent=false, paramsNode=, component=0x76633380, this=0x7fffaaf0) at
simulator/cnednetworkbuilder.cc:139
#5  cNEDNetworkBuilder::addParametersAndGatesTo(cComponent*, cNEDDeclaration*)
() at simulator/cnednetworkbuilder.cc:105
#6  0x0048843b in addParametersAndGatesTo (module=0x76633380,
this=) at /include/c++/9.0.0/bits/stl_tree.h:211
#7  cModuleType::create(char const*, cModule*, int, int) () at
simulator/ccomponenttype.cc:156
#8  0x0045916f in setupNetwork (network=,
this=0x7653bc40) at simulator/cnamedobject.h:117
#9  Cmdenv::run() () at simulator/cmdenv.cc:253
#10 0x005186ec in EnvirBase::run(int, char**, cConfiguration*) () at
simulator/envirbase.cc:230
#11 0x0043d60d in setupUserInterface(int, char**, cConfiguration*)
[clone .constprop.112] () at simulator/startup.cc:234
#12 0x0042446a in main (argc=1, argv=0x7fffb1c8) at
simulator/main.cc:39
---

403.gcc miscompares: 200.s, g23.s, scilab.s.
For example:
---
$ diff -u g23_ref.s g23.s | head -n 16
--- g23_ref.s
+++ g23.s
@@ -1746,19 +1746,19 @@
 	testq	%rbx, %rbx
 	jne	.L904
 	movq	%r12, %rdx
-	xorl	%r8d, %r8d
+	xorl	%esi, %esi
 	negq	%rdx
 .L905:
 	addq	%rcx, %rdx
-	leaq	(%rax,%r8), %rax
+	leaq	(%rax,%rsi), %rax
 	leaq	1(%rdx), %rcx
-	cmpq	%r8, %rax
+	cmpq	%rsi, %rax

---

Unfortunately I didn't manage to create a reproducer.

[Bug tree-optimization/84419] New: [8 Regression] SPEC CPU2017/CPU2006 521/621, 527/627, 554/654, 445, 454, 481, 416 runfails after r256628 with march=skylake-avx512

2018-02-16 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84419

Bug ID: 84419
   Summary: [8 Regression] SPEC CPU2017/CPU2006 521/621, 527/627,
554/654, 445, 454, 481, 416 runfails after r256628
with march=skylake-avx512
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

There are runfails for the following benchmarks since r256628:
SPEC2017 fp-rate/fp-speed:
 521/621
 527/627
 554/654
SPEC2006 int-speed:
 445
SPEC2006 fp-rate/fp-speed:
 454
 481
 416

Minimal optset to reproduce is "-O3 -march=skylake-avx512".

Reverting the r256628 changes in tree-ssa-loop-ivopts.c fixes the problem in
current revisions (r257682 at least).

[Bug tree-optimization/84419] [8 Regression] SPEC CPU2017/CPU2006 521/621, 527/627, 554/654, 445, 454, 481, 416 runfails after r256628 with march=skylake-avx512

2018-02-16 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84419

--- Comment #2 from Alexander Nesterovskiy  ---
I've made quite a small reproducer:
---
$ cat reproducer.c
#include <stdio.h>
#include <string.h>

#define SIZE 400

int  foo[SIZE];
char bar[SIZE];

void __attribute__ ((noinline)) foo_func(void)
{
  int i;
  for (i = 1; i < SIZE; i++)
    if (bar[i])
      foo[i] = 1;
}

int main()
{
  memset(bar, 1, sizeof(bar));
  foo_func();
  return 0;
}
$ gcc -O3 -march=skylake-avx512 -g reproducer.c
$ ./a.out
Segmentation fault  (core dumped)

[Bug target/82862] [8 Regression] SPEC CPU2006 465.tonto performance regression with r253975 (up to 40% drop for particular loop)

2018-02-19 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

--- Comment #8 from Alexander Nesterovskiy  ---
I'd say that it's not just fixed but improved, with an impressive gain.

It is about +4% on HSW AVX2 and about +8% on SKX AVX512 after r257734 (compared
to r257732) for the 465.tonto SPEC rate.
Compared to the reference r253973 it is about +2% on HSW AVX2 and +18% on SKX
AVX512 (AVX512 has been greatly improved in the last 3 months).

[Bug tree-optimization/84419] [8 Regression] SPEC CPU2017/CPU2006 521/621, 527/627, 554/654, 445, 454, 481, 416 runfails after r256628 with march=skylake-avx512

2018-02-19 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84419

--- Comment #5 from Alexander Nesterovskiy  ---
Yes, it looks like the problem is with unaligned access (there is no fail in
the reproducer when the loop starts with i=0).
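For example, a hypothetical variant of foo_func from the reproducer in comment
#2 (it uses SIZE, foo and bar from there) that starts at i=0 and, per the
above, does not fail:
---
void __attribute__ ((noinline)) foo_func_from_zero(void)
{
  int i;
  for (i = 0; i < SIZE; i++)   /* i = 0 instead of i = 1 */
    if (bar[i])
      foo[i] = 1;
}
---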
It seems that your patch works - there are no runfails for the reproducer, 445,
521, 527, 554 (tested on the SPEC train workload).
I'll report back once the other benchmarks finish.

[Bug tree-optimization/84419] [8 Regression] SPEC CPU2017/CPU2006 521/621, 527/627, 554/654, 445, 454, 481, 416 runfails after r256628 with march=skylake-avx512

2018-02-20 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84419

--- Comment #6 from Alexander Nesterovskiy  ---
All the mentioned SPEC CPU2017/CPU2006 benchmarks (521/621, 527/627, 554/654,
445, 454, 481, 416) have finished successfully. The patch was applied on top of
r257732.

[Bug middle-end/82344] [8 Regression] SPEC CPU2006 435.gromacs ~10% performance regression with trunk@250855

2018-03-27 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82344

--- Comment #7 from Alexander Nesterovskiy  ---
Yes, I've checked it - current performance is back at about the previous level,
and execution of this piece of code takes the same amount of time.

[Bug tree-optimization/82220] New: [8 Regression] SPEC CPU2006 482.sphinx3 ~10% performance regression with trunk@250416

2017-09-15 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82220

Bug ID: 82220
   Summary: [8 Regression] SPEC CPU2006 482.sphinx3 ~10%
performance regression with trunk@250416
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

The calculation of the min_profitable_iters threshold was changed in r250416.
In 482.sphinx3 a wrong path is chosen in the hottest loop, which leads to a ~45%
performance drop for that particular loop and a ~10% performance drop for the
whole test.

According to perf report:
---
Overhead    Samples   Symbol
trunk@250416 (and up-to-date trunk@252756)
  36.91% --> 527019   vector_gautbl_eval_logs3
  28.03%     400316   mgau_eval
   8.90%     127071   subvq_mgau_shortlist

trunk@250415
  31.72%     402114   mgau_eval
  29.36% --> 372126   vector_gautbl_eval_logs3
   9.85%     124815   subvq_mgau_shortlist
---

Compiled with: "-Ofast -funroll-loops -march=core-avx2 -mfpmath=sse"

[Bug tree-optimization/82220] [8 Regression] SPEC CPU2006 482.sphinx3 ~10% performance regression with trunk@250416

2017-09-15 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82220

--- Comment #2 from Alexander Nesterovskiy  ---
Yes, I've applied the patch and it looks like it helped:
---
Overhead    Samples   Symbol
trunk@252796 + patch
  31.57%     412037   mgau_eval
  30.54% --> 397608   vector_gautbl_eval_logs3
   9.78%     127468   subvq_mgau_shortlist
---
And I see in perf that execution now goes exactly the same way as it does for
r250415 (the corresponding parts of the function bodies are actually being
executed).

[Bug rtl-optimization/82344] New: [8 Regression] SPEC CPU2006 435.gromacs ~10% performance regression with trunk@250855

2017-09-27 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82344

Bug ID: 82344
   Summary: [8 Regression] SPEC CPU2006 435.gromacs ~10%
performance regression with trunk@250855
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Created attachment 42246
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42246&action=edit
r250854 vs r250855 generated code comparison

Compilation options that affect the regression: "-Ofast -march=core-avx2
-mfpmath=sse"

The regression appeared after r250855, though it looks like this commit is not
guilty by itself but rather reveals something in other stages.

Changes at the 123t.reassoc1 stage lead to slightly different code generation
in the stages that follow it.

The place of interest is in the "inl1130" subroutine (file "innerf.f") - it is
part of a big loop with 9 similar expressions on 4-byte float variables:
---
y1 = 1.0/sqrt(x1)
y2 = 1.0/sqrt(x2)
y3 = 1.0/sqrt(x3)
y4 = 1.0/sqrt(x4)
y5 = 1.0/sqrt(x5)
y6 = 1.0/sqrt(x6)
y7 = 1.0/sqrt(x7)
y8 = 1.0/sqrt(x8)
y9 = 1.0/sqrt(x9)
---

When compiled with "-ffast-math", 1/sqrt is calculated with the "vrsqrtss"
instruction followed by a Newton-Raphson step using four "vmulss", one "vaddss"
and two constants.
Like here (part of the r250854 code):
---
vrsqrtss xmm12, xmm12, xmm7
vmulss   xmm7,  xmm12, xmm7
vmulss   xmm0,  xmm12, DWORD PTR .LC2[rip]
vmulss   xmm8,  xmm7,  xmm12
vaddss   xmm5,  xmm8,  DWORD PTR .LC1[rip]
vmulss   xmm1,  xmm5,  xmm0
---
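For reference, a hedged C sketch of the refinement this sequence implements
(the standard one-step Newton-Raphson for rsqrt; the exact values of .LC1/.LC2
are an assumption based on that formula):
---
/* r is the vrsqrtss approximation of 1/sqrt(x); one Newton-Raphson step gives
   y = r * (1.5f - 0.5f * x * r * r),
   which matches the sequence above as (x*r*r + LC1) * (r*LC2) with the assumed
   constants LC1 = -3.0f and LC2 = -0.5f. */
float refine_rsqrt(float x, float r)
{
  return (x * r * r - 3.0f) * (r * -0.5f);
}
---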
Input values (x1-x9) are mostly in xmm registers (x2 and x7 are in memory),
output values (y1-y9) are in xmm registers.

After r250855 the .LC2 constant goes into xmm7 and x7 also goes into an xmm
register.
This leads to a lack of temporary registers and, as a result, worse instruction
interleaving.
See the attached picture with parts of the assembly listings, where the
corresponding y=1/sqrt parts are painted the same color.

Finally, these 9 lines of code execute about twice as slowly, which leads to the
~10% performance regression for the whole test.

I've made two independent attempts to change the code in order to verify the above.

1. To be sure that we lose performance exactly in this part of the loop, I just
pasted ~60 assembly instructions from the previous revision into the new one
(after proper renaming, of course). This restored performance.

2. To be sure that the problem is due to a lack of temporary registers, I moved
the calculation of 1/sqrt for the last line into a function call. Like here:
---
... in another module, to disable inlining:
function myrsqrt(x)
  implicit none
  real*4 x
  real*4 myrsqrt
  myrsqrt = 1.0/sqrt(x);
  return
end function myrsqrt

...

y1 = 1.0/sqrt(x1)
y2 = 1.0/sqrt(x2)
y3 = 1.0/sqrt(x3)
y4 = 1.0/sqrt(x4)
y5 = 1.0/sqrt(x5)
y6 = 1.0/sqrt(x6)
y7 = 1.0/sqrt(x7)
y8 = 1.0/sqrt(x8)
y9 = myrsqrt(x9)
---
Even with the call/ret overhead this also restored performance, since it freed
some registers.

[Bug fortran/82362] New: [8 Regression] SPEC CPU2006 436.cactusADM ~7% performance deviation with trunk@251713

2017-09-29 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82362

Bug ID: 82362
   Summary: [8 Regression] SPEC CPU2006 436.cactusADM ~7%
performance deviation with trunk@251713
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: fortran
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

r251713 brings a reasonable improvement to alloca. However, there is a side
effect of this patch - 436.cactusADM performance became unstable when compiled
with
-Ofast -march=core-avx2 -mfpmath=sse -funroll-loops
The impact is more noticeable when compiled with auto-parallelization
(-ftree-parallelize-loops=N).

Comparing performance for a particular set of 7 runs
(relative to the median performance of r251711):
r251711: 92,8%   92,9%   93,0%   106,7%  107,0%  107,0%  107,2%
r251713: 99,5%   99,6%   99,8%   100,0%  100,3%  100,6%  100,6%

r251711 is pretty stable, while r251713 is +7% faster on some runs and -7%
slower on others.

There are a few dynamic arrays in the body of the Bench_StaggeredLeapfrog2
subroutine in StaggeredLeapfrog2.fppized.f.
When compiled with "-fstack-arrays" (the default for "-Ofast") the arrays are
allocated by alloca.
The allocated memory size is rounded up to 16 bytes in r251713 with code like
"size = (size + 15) & -16".
In prior revisions it differs in just one byte: "size = (size + 22) & -16",
which may actually just waste an extra 16 bytes for each array, depending on
the initial "size" value.
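A minimal C sketch of that rounding arithmetic (illustrative, not GCC source):
---
#include <stddef.h>

/* Both forms clear the low 4 bits; they differ only in the bias added before
   masking, so for some sizes the older form reserves 16 bytes more per array,
   as noted above. */
size_t round_r251713(size_t size) { return (size + 15) & (size_t) -16; }
size_t round_prior  (size_t size) { return (size + 22) & (size_t) -16; }
---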

The actual r251713 code, built with:
gfortran -S -masm=intel -o StaggeredLeapfrog2.fppized_r251713.s
-O3 -fstack-arrays -march=core-avx2 -mfpmath=sse -funroll-loops
-ftree-parallelize-loops=8 StaggeredLeapfrog2.fppized.f

lea rax, [15+r13*8] ; size = <...> + 15
shr rax, 4  ; zero-out
sal rax, 4  ; lower 4 bits
sub rsp, rax
mov QWORD PTR [rbp-4984], rsp   ; Array 1
sub rsp, rax
mov QWORD PTR [rbp-4448], rsp   ; Array 2
sub rsp, rax 
mov QWORD PTR [rbp-4784], rsp   ; Array 3 ... and so on


Aligning rsp to the cache line size (on each allocation, or even once at the
beginning) brings performance to stable high values:

lea rax, [15+r13*8] 
shr rax, 4
sal rax, 4
shr rsp, 6  ; Align rsp to
shl rsp, 6  ; 64-byte border
sub rsp, rax
mov QWORD PTR [rbp-4984], rsp
sub rsp, rax
mov QWORD PTR [rbp-4448], rsp
sub rsp, rax 
mov QWORD PTR [rbp-4784], rsp


Performance of the 64-byte-aligned version,
compared to the same r251711 median performance:
106,7%  107,0%  107,0%  107,1%  107,1%  107,2%  107,4%

Maybe what is needed here is some kind of option to force array alignment in
gfortran (like "-align array64byte" for ifort)?
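For comparison, a minimal C sketch of explicit per-object 64-byte alignment
(illustrative only, not an answer to the gfortran stack-array question):
---
#include <stdalign.h>

/* Request a 64-byte (cache line) aligned array explicitly (C11). */
alignas(64) static double work[1024];
---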

[Bug fortran/82362] [8 Regression] SPEC CPU2006 436.cactusADM ~7% performance deviation with trunk@251713

2017-10-02 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82362

--- Comment #4 from Alexander Nesterovskiy  ---
(In reply to Richard Biener from comment #2)
> I suppose you swapped the revs here
Yep, sorry. It's supposed to be:
r251711: 99,5%   99,6%   99,8%   100,0%  100,3%  100,6%  100,6%
r251713: 92,8%   92,9%   93,0%   106,7%  107,0%  107,0%  107,2%

[Bug tree-optimization/82604] New: [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used

2017-10-18 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

Bug ID: 82604
   Summary: [8 Regression] SPEC CPU2006 410.bwaves ~50%
performance regression with trunk@253679 when
ftree-parallelize-loops is used
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Minimal options to reproduce the regression (4 threads is just an example;
there can be more):
-Ofast -funroll-loops -flto -ftree-parallelize-loops=4

Auto-parallelization became mostly useless for 410.bwaves after r253679.
CPU time is distributed like this:
         Thread0  Thread1  Thread2  Thread3
r253679: ~91%     ~3%      ~3%      ~3%
r253678: ~34%     ~22%     ~22%     ~22%

Linking with "-fopt-info-loop-optimized" shows that half as many loops are
parallelized:
---
gfortran -Ofast -funroll-loops -flto -ftree-parallelize-loops=4 -g
-fopt-info-loop-optimized=loop.optimized *.o
grep parallelizing loop.optimized -c
---
r253679: 19
r253678: 38

The most important missed parallelization is
"block_solver.f:170:0: note: parallelizing outer loop 2"
in the hottest function "mat_times_vec".

[Bug tree-optimization/82862] New: [8 Regression] SPEC CPU2006 465.tonto performance regression with trunk@253975 (up to 40% drop for particular loop)

2017-11-06 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

Bug ID: 82862
   Summary: [8 Regression] SPEC CPU2006 465.tonto performance
regression with trunk@253975 (up to 40% drop for
particular loop)
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Created attachment 42552
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42552&action=edit
reproducer

The regression is clearly noticeable when 465.tonto is compiled with:
-Ofast -march=core-avx2 -mfpmath=sse -funroll-loops

Changes in the cost model lead to changes in the unrolling and vectorization of
a few loops and increase their execution time by up to 60%.
The whole 465.tonto benchmark regression is not that big, about 2-4%, simply
because the affected loops are less than 10% of the whole workload.

Compiling with "-fopt-info-all-optall=all.optimized" and grepping for the
particular line:
r253973:
 shell1quartet.fppized.f90:4086:0: note: loop unrolled 7 times
 shell1quartet.fppized.f90:4086:0: note: loop unrolled 7 times

r253975:
 shell1quartet.fppized.f90:4086:0: note: loop vectorized
 shell1quartet.fppized.f90:4086:0: note: loop vectorized
 shell1quartet.fppized.f90:4086:0: note: loop with 6 iterations completely
unrolled
 shell1quartet.fppized.f90:4086:0: note: loop with 6 iterations completely
unrolled
 shell1quartet.fppized.f90:4086:0: note: loop unrolled 3 times
 shell1quartet.fppized.f90:4086:0: note: loop unrolled 1 times

There was a change introduced by r254012: 
 shell1quartet.fppized.f90:4086:0: note: loop vectorized
 shell1quartet.fppized.f90:4086:0: note: loop vectorized
 shell1quartet.fppized.f90:4086:0: note: loop with 3 iterations completely
unrolled
 shell1quartet.fppized.f90:4086:0: note: loop with 3 iterations completely
unrolled
 shell1quartet.fppized.f90:4086:0: note: loop unrolled 3 times
 shell1quartet.fppized.f90:4086:0: note: loop unrolled 1 times

But there is still a degradation of up to 40% for these particular loops.

Reproducer is attached.

[Bug tree-optimization/83326] New: [8 Regression] SPEC CPU2017 648.exchange2_s ~6% performance regression with r255267 (reproducer attached)

2017-12-08 Thread alexander.nesterovskiy at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83326

Bug ID: 83326
   Summary: [8 Regression] SPEC CPU2017 648.exchange2_s ~6%
performance regression with r255267 (reproducer
attached)
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Created attachment 42815
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42815&action=edit
exchange2_reproducer

The benchmark regression is clearly noticeable on Broadwell/Haswell with:
-m64 -Ofast -funroll-loops -march=core-avx2 -mfpmath=sse -flto -fopenmp

The attached reproducer demonstrates ~27% longer execution with:
-m64 -O[3|fast] -funroll-loops

There are 18 similar lines in the 648.exchange2_s source code whose execution
time changed noticeably after r255267.
They look like:
---
some_int_array(index1+1:index1+2, 1:3, index2) =
some_int_array(index1+1:index1+2, 1:3, index2) - 10
---

"-fopt-info-loop-optimized" shows that each of these lines is unrolled with 2
iterations and with 3 iterations by r255266.
This seems to be reasonable since we see in source that two rows and three
columns are being modified.
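A hedged C analogue of such a line (the array shape and names are made up),
just to show where the 2- and 3-iteration loops come from:
---
/* Analogue of: some_int_array(index1+1:index1+2, 1:3, index2) =
                some_int_array(index1+1:index1+2, 1:3, index2) - 10
   A 2 x 3 block is updated, hence an inner loop with 2 iterations and an outer
   loop with 3 iterations that can both be completely unrolled. */
void update_block(int a[][3][100], int index1, int index2)
{
  int row, col;
  for (col = 0; col < 3; col++)       /* 1:3                -> 3 iterations */
    for (row = 0; row < 2; row++)     /* index1+1:index1+2  -> 2 iterations */
      a[index2][col][index1 + row] -= 10;
}
---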
For a particular line, r255266:
---
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled
(header execution count 22444250)
exchange2.fppized.f90:1135:0: note: loop with 3 iterations completely unrolled
(header execution count 16831504)
---

For r255267 it goes another way:
---
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled
(header execution count 14963581)
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled
(header execution count 11221564)
---