[Bug tree-optimization/86054] New: [8.1/9 Regression] SPEC CPU2006 416.gamess miscompare after r259592 with march=skylake-avx512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86054

            Bug ID: 86054
           Summary: [8.1/9 Regression] SPEC CPU2006 416.gamess miscompare
                    after r259592 with march=skylake-avx512
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Created attachment 44237
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44237&action=edit
reproducer

r259592 introduced a runfail due to a miscompare in SPEC CPU2006 416.gamess.
The minimal option set to reproduce is "-O3 -march=skylake-avx512".
There is no miscompare when vectorization is disabled with "-fno-tree-vectorize".

This benchmark has some known issues described in its documentation
(https://www.spec.org/cpu2006/Docs/416.gamess.html):
"Some arrays are accessed past the end of the defined array size. This will,
however, not cause memory access faults" and "The argument array sizes defined
in some subroutines do not match the size of the actual argument passed".

The problem is in the "JKBCDF" function in "grd2c.F" and is related to these
mismatched argument array sizes. Patching the benchmark source solves the
problem (although such patching is not allowed by SPEC rules):
"DIMENSION ABV(5,1),CV(18,1)" -> "DIMENSION ABV(5,NUMG),CV(18,NUMG)"

The reproducer is attached. The symptom here and in the original 416.gamess
case is that some loop iterations are skipped.
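As a rough illustration of why a mismatched declared size can make a vectorized
loop skip iterations, here is a minimal compile-only C analogue (hypothetical
names, not the benchmark source): the array is declared with extent 1 while the
loop writes further, so the out-of-bounds accesses are undefined and aggressive
loop optimizations may bound the trip count by the declared size.

/* Hypothetical compile-only sketch, not the benchmark code: 'small_array' is
 * declared with extent 1 (by analogy with ABV(5,1)), but the loop writes past
 * it.  Since those accesses are undefined, -faggressive-loop-optimizations is
 * allowed to assume the loop runs at most once, and later passes (such as the
 * vectorizer) may then drop the remaining iterations. */
static int small_array[1];

void fill(int n)
{
    for (int i = 0; i < n; i++)
        small_array[i] = i;   /* i >= 1 is past the declared size */
}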
[Bug tree-optimization/86054] [8/9 Regression] SPEC CPU2006 416.gamess miscompare after r259592 with march=skylake-avx512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86054

--- Comment #2 from Alexander Nesterovskiy ---
Thanks, "-fno-aggressive-loop-optimizations" helps.
[Bug tree-optimization/83326] [8 Regression] SPEC CPU2017 648.exchange2_s ~6% performance regression with r255267 (reproducer attached)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83326

--- Comment #6 from Alexander Nesterovskiy ---
Thanks! I see a performance gain on 648.exchange2_s (~6% on Broadwell and ~3%
on Skylake-X), bringing performance back to the r255266 level (the Skylake-X
regression was ~3%). And the loops are unrolled with 2 and 3 iterations again.
It's surely fixed.
[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #20 from Alexander Nesterovskiy ---
I've made test runs on Broadwell and Skylake, RHEL 7.3.
410.bwaves became faster after r256990, but not as fast as it was at r253678.

Comparing 410.bwaves performance with
"-Ofast -funroll-loops -flto -ftree-parallelize-loops=4":

rev        perf. relative to r253678, %
r253678    100%
r253679     54%
...
r256989     54%
r256990     71%

The CPU time distribution became more flat (~34% in thread0, ~22% in each of
threads1-3), but a lot of time is spent spinning in
libgomp.so.1.0.0/gomp_barrier_wait_end -> do_wait -> do_spin and
libgomp.so.1.0.0/gomp_team_barrier_wait_end -> do_wait -> do_spin:

r253678: spin time is ~10% of CPU time
r256990: spin time is ~30% of CPU time
[Bug ipa/84149] New: [8 Regression] SPEC CPU2017 505.mcf/605.mcf ~10% performance regression with r256888
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84149

            Bug ID: 84149
           Summary: [8 Regression] SPEC CPU2017 505.mcf/605.mcf ~10%
                    performance regression with r256888
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: ipa
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
                CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Minimal options to reproduce the regression (x86, 64-bit): -O3 -flto

The reason behind the regression is that since r256888 the cost_compare
function is no longer inlined into spec_qsort. The two functions are in
different source files. I've managed to force cost_compare to be inlined by
creating, in the same source file, a copy of the spec_qsort function with
explicit calls to cost_compare. This restored performance to the r256887
level.
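The workaround described above can be sketched roughly as follows (the element
type and all names besides cost_compare/spec_qsort are placeholders, not the
SPEC sources): a local sorting routine that calls the comparator directly gives
the compiler a visible call target to inline, whereas the cross-file spec_qsort
only ever sees a function pointer.

/* Sketch under assumed names: a qsort-style routine in another translation
 * unit receives the comparator through a function pointer, which blocks
 * inlining; a local copy with a direct call does not. */
#include <stddef.h>

typedef struct { double cost; } arc_t;          /* hypothetical element type */

static int cost_compare(const void *a, const void *b)
{
    const arc_t *x = a, *y = b;
    return (x->cost > y->cost) - (x->cost < y->cost);
}

/* Local insertion sort standing in for the copied spec_qsort: the call to
 * cost_compare is direct, so the compiler is free to inline it. */
static void local_sort(arc_t *base, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        arc_t key = base[i];
        size_t j = i;
        while (j > 0 && cost_compare(&key, &base[j - 1]) < 0) {
            base[j] = base[j - 1];
            j--;
        }
        base[j] = key;
    }
}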
[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #23 from Alexander Nesterovskiy ---
Created attachment 43326
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43326&action=edit
r253678 vs r256990
[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #24 from Alexander Nesterovskiy ---
Yes, it looks like more time is being spent on synchronization. r256990 really
changes the way autopar works:
For r253679...r256989 most of the work was done in the main thread0
(thread0 ~91%, threads1-3 ~3% each).
For r256990 the distribution is the same as for r253678 (thread0 ~34%,
threads1-3 ~22% each), but a lot of time is spent spinning.

I've attached a chart comparing r253678 and r256990 on the same time scale
(~0.5 sec). The libgomp.so.1.0.0 code executed in thread1 is the wait
functions in both cases, and for r256990 they are called more often.

Setting OMP_WAIT_POLICY doesn't change much:
- ACTIVE: performance is nearly the same as the default
- PASSIVE: there is a serious performance drop for r256990 (which looks
  reasonable, given the large number of thread sleeps/wake-ups)

Changing parloops-schedule also has no positive effect:
- r253678 performance is mostly the same for static, guided and dynamic
- r256990 performance is best with static, which is the default
[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #26 from Alexander Nesterovskiy ---
Created attachment 43361
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43361&action=edit
r253678 vs r256990_work_spin
[Bug tree-optimization/82604] [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

--- Comment #27 from Alexander Nesterovskiy ---
The place of interest here is a loop in the mat_times_vec function.
For r253678, autopar creates mat_times_vec.constprop._loopfn.0.
For r256990, mat_times_vec is inlined into bi_cgstab_block and autopar creates
three functions:
bi_cgstab_block.constprop._loopfn.3
bi_cgstab_block.constprop._loopfn.6
bi_cgstab_block.constprop._loopfn.10

The sum of effective CPU time for these functions across all four threads is
very close for r253678 and r256990. That looks reasonable, since in both cases
the same amount of calculation is being done. But there is a significant
difference in spinning/wait time.

Measuring with OMP_WAIT_POLICY=ACTIVE seems to be more informative - threads
never sleep, they are either working or spinning, thanks to Jakub.

r253678 case:
Main thread0:       ~0% of thread time spinning (~100% working)
Worker threads1-3: ~45% of thread time spinning (~55% working)

r256990 case:
Main thread0:      ~20% of thread time spinning (~80% working)
Worker threads1-3: ~50% of thread time spinning (~50% working)

I've attached a second chart comparing CPU time for both cases (r253678 vs
r256990_work_spin); I think it illustrates the difference better than the
first one.
[Bug tree-optimization/86702] New: [8/9 Regression] SPEC CPU2006 400.perlbench, CPU2017 500.perlbench_r ~3% performance drop after r262247
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86702

            Bug ID: 86702
           Summary: [8/9 Regression] SPEC CPU2006 400.perlbench, CPU2017
                    500.perlbench_r ~3% performance drop after r262247
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Created attachment 44453
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44453&action=edit
reproducer

It looks like some branch probability information is lost in some cases after
r262247, during tree-switchlower1. As a result there are performance drops of
~3% for SPEC CPU2006/2017 perlbench with some particular compilation
options/HW configurations, because of heavier spilling/filling in a hot block.

It can be illustrated with a small example:
---
$ cat > reproducer.c
int foo(int bar)
{
    switch (bar)
    {
        case 0: return bar + 5;
        case 1: return bar - 4;
        case 2: return bar + 3;
        case 3: return bar - 2;
        case 4: return bar + 1;
        case 5: return bar;
        default: return 0;
    }
}
^Z
[2]+  Stopped                 cat > reproducer.c

$ ./r262246/bin/gcc -m64 -c -o /dev/null -O1 -fdump-tree-switchlower1=r262246_168t.switchlower1 reproducer.c
$ ./r262247/bin/gcc -m64 -c -o /dev/null -O1 -fdump-tree-switchlower1=r262247_168t.switchlower1 reproducer.c

$ cat r262246_168t.switchlower1

;; Function foo (foo, funcdef_no=0, decl_uid=2007, cgraph_uid=1, symbol_order=0)

beginning to process the following SWITCH statement (reproducer.c:3) : ---
switch (bar_2(D))  [14.29%], case 0:  [57.14%], case 1:  [14.29%],
case 2:  [57.14%], case 3:  [14.29%], case 4 ... 5:  [57.14%]>

;; GIMPLE switch case clusters: JT:0-5
Removing basic block 9
Merging blocks 2 and 8
Merging blocks 2 and 7
Symbols to be put in SSA form
{ D.2019 }
Incremental SSA update started at block: 0
Number of blocks in CFG: 7
Number of blocks to update: 6 ( 86%)

foo (int bar)
{
  int _1;

   [local count: 1073419798]:
  switch (bar_2(D))  [14.29%], case 0:  [57.14%], case 1:  [14.29%],
case 2:  [57.14%], case 3:  [14.29%], case 4 ... 5:  [57.14%]>

   [local count: 613382737]:
:
  goto ; [100.00%]

   [local count: 153391689]:
:
  goto ; [100.00%]

   [local count: 153391689]:
:

   [local count: 1073741825]:
  # _1 = PHI <0(5), -3(2), 1(4), 5(3)>
:
  return _1;

}

$ cat r262247_168t.switchlower1

;; Function foo (foo, funcdef_no=0, decl_uid=2007, cgraph_uid=1, symbol_order=0)

beginning to process the following SWITCH statement (reproducer.c:3) : ---
switch (bar_2(D))  [14.29%], case 0:  [57.14%], case 1:  [14.29%],
case 2:  [57.14%], case 3:  [14.29%], case 4 ... 5:  [57.14%]>

;; GIMPLE switch case clusters: JT:0-5
Removing basic block 7
Merging blocks 2 and 8
Merging blocks 2 and 9
Symbols to be put in SSA form
{ D.2019 }
Incremental SSA update started at block: 0
Number of blocks in CFG: 7
Number of blocks to update: 6 ( 86%)

foo (int bar)
{
  int _1;

   [local count: 1073419798]:
  switch (bar_2(D))  [INV], case 0:  [INV], case 1:  [INV], case 2:  [INV],
case 3:  [INV], case 4 ... 5:  [INV]>

   [local count: 613382737]:
:
  goto ; [100.00%]

   [local count: 153391689]:
:
  goto ; [100.00%]

   [local count: 153391689]:
:

   [local count: 1073741825]:
  # _1 = PHI <0(5), -3(2), 1(4), 5(3)>
:
  return _1;

}
---
The same happens with current trunk (I tried r263027).
[Bug tree-optimization/86702] [9 Regression] SPEC CPU2006 400.perlbench, CPU2017 500.perlbench_r ~3% performance drop after r262247
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86702

--- Comment #4 from Alexander Nesterovskiy ---
I've noticed performance regressions on different targets and with different
compilation options, not only highly optimized ones like "-march=skylake-avx512
-Ofast -flto -funroll-loops" but with "-O2" as well. The simplest case is
500.perlbench_r with "-O2" on Broadwell, executed in one copy.

The performance drop is not in one particular place but is "spread" over the
whole S_regmatch function, which is really big. My guess is that losing these
probabilities affects the passes that follow tree-switchlower1, and that is
what I see in the generated assembly - somewhat different spilling/filling and
a different order of blocks.
[Bug tree-optimization/87214] New: [9 Regression] SPEC CPU2017, CPU2006 520/620, 403 runfails after r263772 with march=skylake-avx512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87214

            Bug ID: 87214
           Summary: [9 Regression] SPEC CPU2017, CPU2006 520/620, 403
                    runfails after r263772 with march=skylake-avx512
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

There are runfails for the following benchmarks since r263772:

SPEC2017 520/620: Segmentation fault, minimal option set to reproduce:
                  "-O3 -march=skylake-avx512 -flto"
SPEC2006 403:     SPEC miscompare, minimal option set to reproduce:
                  "-O3 -march=skylake-avx512"

Running 520.omnetpp_r under GDB:
---
...
Program received signal SIGSEGV, Segmentation fault.
0x004a611e in isName (s=, this=) at simulator/ccomponent.cc:143
143             if (paramv[i].isName(parname))
(gdb) backtrace
#0  0x004a611e in isName (s=, this=) at simulator/ccomponent.cc:143
#1  cComponent::findPar (this=0x76633380, parname=0x76603548 "bs")
    at simulator/ccomponent.cc:143
#2  0x004a87b3 in cComponent::par(char const*) ()
    at simulator/ccomponent.cc:133
#3  0x004b676d in cNEDNetworkBuilder::doParam(cComponent*, ParamElement*, bool) ()
    at simulator/cnednetworkbuilder.cc:179
#4  0x004b8610 in doParams (isSubcomponent=false, paramsNode=,
    component=0x76633380, this=0x7fffaaf0)
    at simulator/cnednetworkbuilder.cc:139
#5  cNEDNetworkBuilder::addParametersAndGatesTo(cComponent*, cNEDDeclaration*) ()
    at simulator/cnednetworkbuilder.cc:105
#6  0x0048843b in addParametersAndGatesTo (module=0x76633380, this=)
    at /include/c++/9.0.0/bits/stl_tree.h:211
#7  cModuleType::create(char const*, cModule*, int, int) ()
    at simulator/ccomponenttype.cc:156
#8  0x0045916f in setupNetwork (network=, this=0x7653bc40)
    at simulator/cnamedobject.h:117
#9  Cmdenv::run() () at simulator/cmdenv.cc:253
#10 0x005186ec in EnvirBase::run(int, char**, cConfiguration*) ()
    at simulator/envirbase.cc:230
#11 0x0043d60d in setupUserInterface(int, char**, cConfiguration*)
    [clone .constprop.112] () at simulator/startup.cc:234
#12 0x0042446a in main (argc=1, argv=0x7fffb1c8) at simulator/main.cc:39
---

403.gcc miscompares: 200.s, g23.s, scilab.s. For example:
---
$ diff -u g23_ref.s g23.s | head -n 16
--- g23_ref.s
+++ g23.s
@@ -1746,19 +1746,19 @@
        testq   %rbx, %rbx
        jne     .L904
        movq    %r12, %rdx
-       xorl    %r8d, %r8d
+       xorl    %esi, %esi
        negq    %rdx
 .L905:
        addq    %rcx, %rdx
-       leaq    (%rax,%r8), %rax
+       leaq    (%rax,%rsi), %rax
        leaq    1(%rdx), %rcx
-       cmpq    %r8, %rax
+       cmpq    %rsi, %rax
---

Unfortunately I didn't manage to create a reproducer.
[Bug tree-optimization/84419] New: [8 Regression] SPEC CPU2017/CPU2006 521/621, 527/627, 554/654, 445, 454, 481, 416 runfails after r256628 with march=skylake-avx512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84419

            Bug ID: 84419
           Summary: [8 Regression] SPEC CPU2017/CPU2006 521/621, 527/627,
                    554/654, 445, 454, 481, 416 runfails after r256628 with
                    march=skylake-avx512
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

There are runfails for the following benchmarks since r256628:

SPEC2017 fp-rate/fp-speed: 521/621, 527/627, 554/654
SPEC2006 int-speed:        445
SPEC2006 fp-rate/fp-speed: 454, 481, 416

The minimal option set to reproduce is "-O3 -march=skylake-avx512".
Reverting the changes in tree-ssa-loop-ivopts.c fixes the problem in current
revisions (r257682 at least).
[Bug tree-optimization/84419] [8 Regression] SPEC CPU2017/CPU2006 521/621, 527/627, 554/654, 445, 454, 481, 416 runfails after r256628 with march=skylake-avx512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84419

--- Comment #2 from Alexander Nesterovskiy ---
I've made a quite small reproducer:
---
$ cat reproducer.c
#include <stdio.h>
#include <string.h>

#define SIZE 400

int foo[SIZE];
char bar[SIZE];

void __attribute__ ((noinline)) foo_func(void)
{
    int i;
    for (i = 1; i < SIZE; i++)
        if (bar[i])
            foo[i] = 1;
}

int main()
{
    memset(bar, 1, sizeof(bar));
    foo_func();
    return 0;
}

$ gcc -O3 -march=skylake-avx512 -g reproducer.c
$ ./a.out
Segmentation fault (core dumped)
[Bug target/82862] [8 Regression] SPEC CPU2006 465.tonto performance regression with r253975 (up to 40% drop for particular loop)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

--- Comment #8 from Alexander Nesterovskiy ---
I'd say that it's not just fixed but improved, with an impressive gain.
For the 465.tonto SPEC rate it is about +4% on HSW AVX2 and about +8% on SKX
AVX512 after r257734 (compared to r257732).
Compared to the reference r253973 it is about +2% on HSW AVX2 and +18% on SKX
AVX512 (AVX512 was greatly improved in the last 3 months).
[Bug tree-optimization/84419] [8 Regression] SPEC CPU2017/CPU2006 521/621, 527/627, 554/654, 445, 454, 481, 416 runfails after r256628 with march=skylake-avx512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84419

--- Comment #5 from Alexander Nesterovskiy ---
Yes, it looks like the problem is with unaligned access (there is no fail in
the reproducer when the loop starts with i=0).
Your patch seems to work - there are no runfails for the reproducer, 445, 521,
527, 554 (tested on the SPEC train workload). I'll report back once the other
benchmarks finish.
[Bug tree-optimization/84419] [8 Regression] SPEC CPU2017/CPU2006 521/621, 527/627, 554/654, 445, 454, 481, 416 runfails after r256628 with march=skylake-avx512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84419

--- Comment #6 from Alexander Nesterovskiy ---
All the mentioned benchmarks (SPEC CPU2017/CPU2006 521/621, 527/627, 554/654,
445, 454, 481, 416) have finished successfully. The patch was applied to
r257732.
[Bug middle-end/82344] [8 Regression] SPEC CPU2006 435.gromacs ~10% performance regression with trunk@250855
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82344

--- Comment #7 from Alexander Nesterovskiy ---
Yes, I've checked it - current performance is back at about the previous
level, and execution of this piece of code takes the same amount of time.
[Bug tree-optimization/82220] New: [8 Regression] SPEC CPU2006 482.sphinx3 ~10% performance regression with trunk@250416
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82220

            Bug ID: 82220
           Summary: [8 Regression] SPEC CPU2006 482.sphinx3 ~10% performance
                    regression with trunk@250416
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

The calculation of the min_profitable_iters threshold was changed in r250416.
In 482.sphinx3 a wrong path is chosen in the hottest loop, which leads to a
~45% performance drop for that particular loop and a ~10% performance drop for
the whole test.

According to the perf report:
---
Overhead      Samples   Symbol

trunk@250416 (and up-to-date trunk@252756)
  36.91% -->  527019    vector_gautbl_eval_logs3
  28.03%      400316    mgau_eval
   8.90%      127071    subvq_mgau_shortlist

trunk@250415
  31.72%      402114    mgau_eval
  29.36% -->  372126    vector_gautbl_eval_logs3
   9.85%      124815    subvq_mgau_shortlist
---

Compiled with: "-Ofast -funroll-loops -march=core-avx2 -mfpmath=sse"
[Bug tree-optimization/82220] [8 Regression] SPEC CPU2006 482.sphinx3 ~10% performance regression with trunk@250416
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82220

--- Comment #2 from Alexander Nesterovskiy ---
Yes, I've applied the patch and it looks like it helped:
---
Overhead      Samples   Symbol

trunk@252796 + patch
  31.57%      412037    mgau_eval
  30.54% -->  397608    vector_gautbl_eval_logs3
   9.78%      127468    subvq_mgau_shortlist
---
And I see in perf that the actual execution goes exactly the same way as it
does for r250415 (the corresponding parts of the function bodies are actually
being executed).
[Bug rtl-optimization/82344] New: [8 Regression] SPEC CPU2006 435.gromacs ~10% performance regression with trunk@250855
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82344

            Bug ID: 82344
           Summary: [8 Regression] SPEC CPU2006 435.gromacs ~10% performance
                    regression with trunk@250855
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Created attachment 42246
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42246&action=edit
r250854 vs r250855 generated code comparison

Compilation options that affect the regression:
"-Ofast -march=core-avx2 -mfpmath=sse"

The regression happened after r250855, though it looks like this commit is not
guilty by itself but reveals something in other stages: the changes in the
123t.reassoc1 stage lead to slightly different code generation in the stages
that follow it.

The place of interest is in the "inl1130" subroutine (file "innerf.f") - it's
part of a big loop containing 9 similar expressions on 4-byte float variables:
---
      y1 = 1.0/sqrt(x1)
      y2 = 1.0/sqrt(x2)
      y3 = 1.0/sqrt(x3)
      y4 = 1.0/sqrt(x4)
      y5 = 1.0/sqrt(x5)
      y6 = 1.0/sqrt(x6)
      y7 = 1.0/sqrt(x7)
      y8 = 1.0/sqrt(x8)
      y9 = 1.0/sqrt(x9)
---
When compiled with "-ffast-math", 1/sqrt is calculated with a "vrsqrtss"
instruction followed by a Newton-Raphson step with four "vmulss", one "vaddss"
and two constants. Like here (part of the r250854 code):
---
        vrsqrtss  xmm12, xmm12, xmm7
        vmulss    xmm7, xmm12, xmm7
        vmulss    xmm0, xmm12, DWORD PTR .LC2[rip]
        vmulss    xmm8, xmm7, xmm12
        vaddss    xmm5, xmm8, DWORD PTR .LC1[rip]
        vmulss    xmm1, xmm5, xmm0
---
The input values (x1-x9) are mostly in xmm registers (x2 and x7 in memory) and
the output values (y1-y9) are in xmm registers. After r250855 the .LC2
constant goes into xmm7 and x7 also goes into an xmm register. This leads to a
lack of temporary registers and, as a result, worse instruction interleaving.
See the attached picture with parts of the assembly listings where the
corresponding y=1/sqrt parts are painted the same color. In the end these 9
lines of code execute about twice as slowly, which leads to a ~10% performance
regression for the whole test.

I've made two independent attempts to change the code in order to verify the
above.

1. To be sure that we lose performance exactly in this part of the loop, I
just pasted ~60 assembly instructions from the previous revision into the new
one (after proper renaming, of course). This helped to restore performance.

2. To be sure that the problem is due to a lack of temporary registers, I
moved the calculation of 1/sqrt for the last line into a function call. Like
here:
---
      ... in another module, to disable inlining:
      function myrsqrt(x)
      implicit none
      real*4 x
      real*4 myrsqrt
      myrsqrt = 1.0/sqrt(x);
      return
      end function myrsqrt
      ...
      y1 = 1.0/sqrt(x1)
      y2 = 1.0/sqrt(x2)
      y3 = 1.0/sqrt(x3)
      y4 = 1.0/sqrt(x4)
      y5 = 1.0/sqrt(x5)
      y6 = 1.0/sqrt(x6)
      y7 = 1.0/sqrt(x7)
      y8 = 1.0/sqrt(x8)
      y9 = myrsqrt(x9)
---
Even with the call/ret overhead this also helped to restore performance, since
it freed some registers.
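For reference, the vrsqrtss sequence above implements one Newton-Raphson
refinement of the hardware reciprocal square root estimate. A minimal C sketch
of the same computation (assuming the usual refinement constants behind
.LC1/.LC2, possibly in negated form; not the compiler's own expansion):

/* Given the hardware estimate r ~ 1/sqrt(x), one Newton-Raphson step is
 *     y = 0.5f * r * (3.0f - x * r * r)
 * which is what vrsqrtss + four vmulss + one vaddss compute. */
#include <stdio.h>
#include <immintrin.h>

static inline float refined_rsqrt(float x)
{
    float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));  /* vrsqrtss estimate */
    return 0.5f * r * (3.0f - x * r * r);                  /* one N-R step */
}

int main(void)
{
    printf("%f\n", refined_rsqrt(2.0f));   /* ~0.7071 */
    return 0;
}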
[Bug fortran/82362] New: [8 Regression] SPEC CPU2006 436.cactusADM ~7% performance deviation with trunk@251713
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82362

            Bug ID: 82362
           Summary: [8 Regression] SPEC CPU2006 436.cactusADM ~7% performance
                    deviation with trunk@251713
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

r251713 brings a reasonable improvement to alloca. However, there is a side
effect of this patch - 436.cactusADM performance became unstable when compiled
with:
-Ofast -march=core-avx2 -mfpmath=sse -funroll-loops
The impact is more noticeable when compiled with auto-parallelization,
-ftree-parallelize-loops=N.

Comparing performance over particular sets of 7 runs (relative to the median
performance of r251711):
r251711: 92,8%  92,9%  93,0%  106,7% 107,0% 107,0% 107,2%
r251713: 99,5%  99,6%  99,8%  100,0% 100,3% 100,6% 100,6%
r251711 is pretty stable, while r251713 is +7% faster on some runs and -7%
slower on others.

There are a few dynamic arrays in the body of the Bench_StaggeredLeapfrog2
subroutine in StaggeredLeapfrog2.fppized.f. When compiled with
"-fstack-arrays" (the default for "-Ofast"), the arrays are allocated by
alloca. The allocated memory size is rounded up to 16 bytes in r251713 with
code like "size = (size + 15) & -16". In prior revisions it differed in just
one byte: "size = (size + 22) & -16", which may merely waste an extra 16 bytes
per array, depending on the initial "size" value.

Actual r251713 code, built with
gfortran -S -masm=intel -o StaggeredLeapfrog2.fppized_r251713.s -O3
-fstack-arrays -march=core-avx2 -mfpmath=sse -funroll-loops
-ftree-parallelize-loops=8 StaggeredLeapfrog2.fppized.f

        lea     rax, [15+r13*8]           ; size = <...> + 15
        shr     rax, 4                    ; zero out
        sal     rax, 4                    ; lower 4 bits
        sub     rsp, rax
        mov     QWORD PTR [rbp-4984], rsp ; Array 1
        sub     rsp, rax
        mov     QWORD PTR [rbp-4448], rsp ; Array 2
        sub     rsp, rax
        mov     QWORD PTR [rbp-4784], rsp ; Array 3
        ... and so on

Aligning rsp to the cache line size (on each allocation, or even once at the
beginning) brings performance to stable high values:

        lea     rax, [15+r13*8]
        shr     rax, 4
        sal     rax, 4
        shr     rsp, 6                    ; Align rsp to
        shl     rsp, 6                    ; 64-byte border
        sub     rsp, rax
        mov     QWORD PTR [rbp-4984], rsp
        sub     rsp, rax
        mov     QWORD PTR [rbp-4448], rsp
        sub     rsp, rax
        mov     QWORD PTR [rbp-4784], rsp

The 64-byte aligned version's performance, compared to the same median of
r251711:
106,7% 107,0% 107,0% 107,1% 107,1% 107,2% 107,4%

Maybe what is necessary here is some kind of option to force array alignment
for gfortran (like "-align array64byte" for ifort)?
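A minimal sketch of the 64-byte alignment idea, expressed in C rather than in
the gfortran/alloca expansion (names are illustrative, not a proposed compiler
change): over-allocate and round the pointer up to the cache-line boundary.
GCC's __builtin_alloca_with_align could serve a similar purpose.

/* Illustrative only: align a stack-allocated block to a 64-byte cache line
 * by over-allocating and rounding the pointer up. */
#include <alloca.h>
#include <stdint.h>

#define CACHE_LINE 64

static inline void *align_to_cacheline(void *raw)
{
    return (void *)(((uintptr_t)raw + CACHE_LINE - 1)
                    & ~(uintptr_t)(CACHE_LINE - 1));
}

/* Usage (alloca must be called in the frame that owns the array):
 *   double *a = align_to_cacheline(alloca(n * sizeof(double) + CACHE_LINE - 1));
 */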
[Bug fortran/82362] [8 Regression] SPEC CPU2006 436.cactusADM ~7% performance deviation with trunk@251713
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82362

--- Comment #4 from Alexander Nesterovskiy ---
(In reply to Richard Biener from comment #2)
> I suppose you swapped the revs here
Yep, sorry. It's supposed to be:
r251711: 99,5%  99,6%  99,8%  100,0% 100,3% 100,6% 100,6%
r251713: 92,8%  92,9%  93,0%  106,7% 107,0% 107,0% 107,2%
[Bug tree-optimization/82604] New: [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

            Bug ID: 82604
           Summary: [8 Regression] SPEC CPU2006 410.bwaves ~50% performance
                    regression with trunk@253679 when
                    ftree-parallelize-loops is used
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Minimal options to reproduce the regression (4 threads is just an example,
there can be more):
-Ofast -funroll-loops -flto -ftree-parallelize-loops=4

Auto-parallelization became mostly useless for 410.bwaves after r253679.
CPU time is distributed like this:

          Thread0  Thread1  Thread2  Thread3
r253679:    ~91%     ~3%      ~3%      ~3%
r253678:    ~34%     ~22%     ~22%     ~22%

Linking with "-fopt-info-loop-optimized" shows that half as many loops are
parallelized:
---
gfortran -Ofast -funroll-loops -flto -ftree-parallelize-loops=4 -g -fopt-info-loop-optimized=loop.optimized *.o
grep parallelizing loop.optimized -c
---
r253679: 19
r253678: 38

The most valuable missed parallelization is "block_solver.f:170:0: note:
parallelizing outer loop 2" in the hottest function, "mat_times_vec".
[Bug tree-optimization/82862] New: [8 Regression] SPEC CPU2006 465.tonto performance regression with trunk@253975 (up to 40% drop for particular loop)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

            Bug ID: 82862
           Summary: [8 Regression] SPEC CPU2006 465.tonto performance
                    regression with trunk@253975 (up to 40% drop for
                    particular loop)
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Created attachment 42552
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42552&action=edit
reproducer

The regression is well noticeable when 465.tonto is compiled with:
-Ofast -march=core-avx2 -mfpmath=sse -funroll-loops

Changes in the cost model lead to changes in the unrolling and vectorizing of
a few loops, increasing their execution time by up to 60%. The regression of
the whole 465.tonto benchmark is not that big, about 2-4%, simply because the
affected loops are less than 10% of the whole workload.

Compiling with "-fopt-info-all-optall=all.optimized" and grepping for a
particular line:

r253973:
shell1quartet.fppized.f90:4086:0: note: loop unrolled 7 times
shell1quartet.fppized.f90:4086:0: note: loop unrolled 7 times

r253975:
shell1quartet.fppized.f90:4086:0: note: loop vectorized
shell1quartet.fppized.f90:4086:0: note: loop vectorized
shell1quartet.fppized.f90:4086:0: note: loop with 6 iterations completely unrolled
shell1quartet.fppized.f90:4086:0: note: loop with 6 iterations completely unrolled
shell1quartet.fppized.f90:4086:0: note: loop unrolled 3 times
shell1quartet.fppized.f90:4086:0: note: loop unrolled 1 times

There was a change introduced by r254012:
shell1quartet.fppized.f90:4086:0: note: loop vectorized
shell1quartet.fppized.f90:4086:0: note: loop vectorized
shell1quartet.fppized.f90:4086:0: note: loop with 3 iterations completely unrolled
shell1quartet.fppized.f90:4086:0: note: loop with 3 iterations completely unrolled
shell1quartet.fppized.f90:4086:0: note: loop unrolled 3 times
shell1quartet.fppized.f90:4086:0: note: loop unrolled 1 times

But there is still a degradation of these particular loops of up to 40%.
The reproducer is attached.
[Bug tree-optimization/83326] New: [8 Regression] SPEC CPU2017 648.exchange2_s ~6% performance regression with r255267 (reproducer attached)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83326

            Bug ID: 83326
           Summary: [8 Regression] SPEC CPU2017 648.exchange2_s ~6%
                    performance regression with r255267 (reproducer attached)
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Created attachment 42815
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42815&action=edit
exchange2_reproducer

The benchmark regression is well noticeable on Broadwell/Haswell with:
-m64 -Ofast -funroll-loops -march=core-avx2 -mfpmath=sse -flto -fopenmp

The attached reproducer demonstrates ~27% longer execution with:
-m64 -O[3|fast] -funroll-loops

There are 18 similar lines in the 648.exchange2_s source code whose execution
time changed noticeably after r255267. They look like this:
---
some_int_array(index1+1:index1+2, 1:3, index2) =
    some_int_array(index1+1:index1+2, 1:3, index2) - 10
---
"-fopt-info-loop-optimized" shows that with r255266 each of these lines is
unrolled with 2 iterations and with 3 iterations. This seems reasonable, since
we can see in the source that two rows and three columns are being modified.
For a particular line, r255266:
---
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled
(header execution count 22444250)
exchange2.fppized.f90:1135:0: note: loop with 3 iterations completely unrolled
(header execution count 16831504)
---
For r255267 it goes another way:
---
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled
(header execution count 14963581)
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled
(header execution count 11221564)
---
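To see why unrolling by 2 and by 3 iterations is the expected shape here, a
rough C analogue of the array-section statement (array name, layout and bounds
are illustrative, not taken from the benchmark) is:

/* The statement updates a 2x3 slab of the array, so the natural lowering is
 * a 2-iteration loop nested in a 3-iteration loop (or vice versa). */
#define N 16

static int some_int_array[N][3][N];   /* index order reversed vs. Fortran; illustrative */

void update_slab(int index1, int index2)
{
    for (int col = 0; col < 3; col++)       /* second Fortran index, 1:3    -> 3 iterations */
        for (int row = 0; row < 2; row++)   /* first Fortran index, +1..+2  -> 2 iterations */
            some_int_array[index2][col][index1 + row] -= 10;
}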