Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: alexander.nesterovskiy at intel dot com
Target Milestone: ---
There are runfails for the following benchmarks since r263772:
SPEC2017 520
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86702
--- Comment #4 from Alexander Nesterovskiy ---
I've noticed performance regressions on different targets and with different
compilation options, not only with highly optimized ones like "-march=skylake-avx512
-Ofast -flto -funroll-loops" but also with "-O2"
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: alexander.nesterovskiy at intel dot com
Target Milestone: ---
Created attachment 44453
--> https://gcc.gnu.org/bugzi
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86054
--- Comment #2 from Alexander Nesterovskiy ---
Thanks, "-fno-aggressive-loop-optimizations" helps.
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: alexander.nesterovskiy at intel dot com
Target Milestone: ---
Created attachment 44237
--> https://gcc.gnu.org/bugzilla/attachment.cgi
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82344
--- Comment #7 from Alexander Nesterovskiy ---
Yes, I've checked it: current performance is about the previous level, and
execution of this piece of code takes the same amount of time.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84419
--- Comment #6 from Alexander Nesterovskiy ---
All the mentioned SPEC CPU2017/CPU2006 benchmarks (521/621, 527/627, 554/654,
445, 454, 481, 416) have finished successfully. The patch was applied to r257732.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84419
--- Comment #5 from Alexander Nesterovskiy ---
Yes, it looks like the problem is with unaligned access (the reproducer does
not fail when the loop starts with i=0).
It seems that your patch works: there are no runfails for the reproducer, 445,
521, 5
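To illustrate why the starting index matters for unaligned access, here is a minimal, hypothetical C sketch. It assumes 256-bit (32-byte, AVX2-style) vectors and 4-byte `int`, and it forces the array to be vector-aligned with `_Alignas`, which the original reproducer does not necessarily do. The names `VEC_BYTES`, `misalignment` and `peel_count` are made up for illustration; `peel_count` shows how many scalar iterations a vectorizer could peel off before the accesses become aligned (peeling for alignment is one common strategy, not necessarily what the patch in this bug does).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIZE 400      /* same array size as in the reproducer */
#define VEC_BYTES 32  /* assumption: 256-bit (AVX2) vectors */

/* Hypothetical: force the array itself onto a vector boundary so that
   only the loop's starting index decides the misalignment. */
_Alignas(VEC_BYTES) int foo[SIZE];

/* Byte offset of &foo[start] past the previous VEC_BYTES boundary. */
size_t misalignment(size_t start)
{
    assert(start < SIZE);
    return (size_t)((uintptr_t)&foo[start] % VEC_BYTES);
}

/* Number of scalar iterations needed, starting at index `start`,
   until &foo[i] lands on a VEC_BYTES boundary (0 if already aligned). */
size_t peel_count(size_t start)
{
    size_t mis = misalignment(start);
    return mis == 0 ? 0 : (VEC_BYTES - mis) / sizeof(int);
}
```

With these assumptions, a loop starting at i=0 touches an aligned address immediately, while a loop starting at i=1 is 4 bytes off and needs 7 peeled iterations before its vector accesses are aligned.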
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862
--- Comment #8 from Alexander Nesterovskiy ---
I'd say that it's not just fixed but improved, with an impressive gain.
It is about +4% on HSW AVX2 and about +8% on SKX AVX512 after r257734 (compared
to r257732) for the 465.tonto SPEC rate.
Comparin
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84419
--- Comment #2 from Alexander Nesterovskiy ---
I've made quite a small reproducer:
---
$ cat reproducer.c
#include
#include
#define SIZE 400
int foo[SIZE];
char bar[SIZE];
void __attribute__ ((noinline)) foo_func(void)
{
int i;
for (i =
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: alexander.nesterovskiy at intel dot com
Target Milestone: ---
There are runfails for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #27 from Alexander Nesterovskiy ---
The place of interest here is a loop in the mat_times_vec function.
For r253678, autopar creates mat_times_vec.constprop._loopfn.0.
For r256990, mat_times_vec is inlined into bi_cgstab_block and
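For readers unfamiliar with autopar (`-ftree-parallelize-loops`), the following hypothetical C sketch shows the general idea behind a function like mat_times_vec.constprop._loopfn.0: the loop body is outlined into a separate function that receives shared data through a struct and processes one chunk of the iteration space. The names `N`, `NTHREADS`, `loop_data` and `mat_times_vec_loopfn` are made up for illustration. The real 410.bwaves code is Fortran, and in real autopar output the chunks run on libgomp worker threads; this sketch runs them one after another so it stays self-contained.

```c
#include <assert.h>
#include <stddef.h>

#define N 4         /* hypothetical matrix dimension */
#define NTHREADS 2  /* hypothetical number of threads/chunks */

/* Shared variables the outlined body needs, passed by pointer. */
struct loop_data {
    double (*a)[N];
    const double *x;
    double *y;
};

/* Outlined loop body: computes rows [lo, hi) of y = A * x.
   Stands in for a compiler-generated *._loopfn.0 function. */
static void mat_times_vec_loopfn(struct loop_data *d, size_t lo, size_t hi)
{
    for (size_t i = lo; i < hi; i++) {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            s += d->a[i][j] * d->x[j];
        d->y[i] = s;
    }
}

/* Splits the rows into NTHREADS chunks and runs the outlined body on each.
   Real autopar would hand each chunk to a separate thread. */
void mat_times_vec(double a[N][N], const double x[N], double y[N])
{
    struct loop_data d = { a, x, y };
    size_t chunk = (N + NTHREADS - 1) / NTHREADS;
    for (size_t t = 0; t < NTHREADS; t++) {
        size_t lo = t * chunk;
        size_t hi = lo + chunk < N ? lo + chunk : N;
        if (lo < hi)
            mat_times_vec_loopfn(&d, lo, hi);
    }
}
```

Once mat_times_vec is inlined into its caller, as described for r256990, the compiler may no longer outline and parallelize this loop in the same way, which is one plausible reason the thread-level profile changes.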
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #26 from Alexander Nesterovskiy ---
Created attachment 43361
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43361&action=edit
r253678 vs r256990_work_spin
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #24 from Alexander Nesterovskiy ---
Yes, it looks like more time is being spent in synchronization.
r256990 really changes the way autopar works:
for r253679...r256989, most of the work was done in the main thread, thread0
(thread0 ~91%, threads1-
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #23 from Alexander Nesterovskiy ---
Created attachment 43326
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43326&action=edit
r253678 vs r256990
Severity: normal
Priority: P3
Component: ipa
Assignee: unassigned at gcc dot gnu.org
Reporter: alexander.nesterovskiy at intel dot com
CC: marxin at gcc dot gnu.org
Target Milestone: ---
Minimal options to reproduce regression (x86, 64-bit):
-O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
--- Comment #20 from Alexander Nesterovskiy ---
I've made test runs on Broadwell and Skylake, RHEL 7.3.
410.bwaves became faster after r256990 but not as fast as it was on r253678.
Comparing 410.bwaves performance, "-Ofast -funroll-loops -flto
-
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83326
--- Comment #6 from Alexander Nesterovskiy ---
Thanks! I see a performance gain on 648.exchange2_s (~6% on Broadwell and ~3%
on Skylake-X), restoring performance to the r255266 level (the Skylake-X
regression was ~3%).
And loops unrolled with 2 and 3 iterat
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: alexander.nesterovskiy at intel dot com
Target Milestone: ---
Created attachment 42815
--> https://gcc.gnu.
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: alexander.nesterovskiy at intel dot com
Target Milestone: ---
Created attachment 42552
--> ht
Version: 8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: alexander.nesterovskiy at intel dot com
Target Milestone: ---
Minimal options to reproduce
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82362
--- Comment #4 from Alexander Nesterovskiy ---
(In reply to Richard Biener from comment #2)
> I suppose you swapped the revs here
Yep, sorry. It's supposed to be:
r251711: 99,5% 99,6% 99,8% 100,0% 100,3% 100,6% 100,6%
r251713: 92,8% 92
Severity: normal
Priority: P3
Component: fortran
Assignee: unassigned at gcc dot gnu.org
Reporter: alexander.nesterovskiy at intel dot com
Target Milestone: ---
r251713 brings a reasonable improvement to alloca. However, there is a side
effect of this patch: 436
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: alexander.nesterovskiy at intel dot com
Target Milestone: ---
Created attachment 42246
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42246&
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82220
--- Comment #2 from Alexander Nesterovskiy ---
Yes, I've applied the patch and it looks like it helped:
---
Overhead    Samples    Symbol
trunk@252796 + patch
31.57%      412037     mgau_eval
30.54% -->  397608     vector_gautbl_eval_logs3
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: alexander.nesterovskiy at intel dot com
Target Milestone: ---
The calculation of the min_profitable_iters threshold was changed in r250416.
In 482.sphinx3 a
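As background for the min_profitable_iters discussion: roughly, the vectorizer's cost model estimates a trip count below which the vectorized loop is not expected to pay off, and for loops whose trip count is unknown at compile time it guards the vector loop with a runtime check against that threshold, falling back to the scalar loop otherwise. The hypothetical C sketch below hand-writes that shape. `MIN_PROFITABLE_ITERS`, `VF`, `scalar_sum`, `vector_style_sum` and `guarded_sum` are illustrative names and values only; GCC computes the real threshold per loop and emits real SIMD instructions rather than the lane loop shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values: GCC derives the real threshold per loop from
   its cost model; these constants only stand in for it. */
#define MIN_PROFITABLE_ITERS 16
#define VF 4 /* assumed vectorization factor (elements per vector) */

/* Original scalar loop. */
static long scalar_sum(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Shape of a vectorized loop: a main body handling VF elements per
   iteration into VF partial sums, then a scalar epilogue. */
static long vector_style_sum(const int *a, size_t n)
{
    long acc[VF] = {0};
    size_t i = 0;
    for (; i + VF <= n; i += VF)
        for (size_t lane = 0; lane < VF; lane++)
            acc[lane] += a[i + lane];
    long s = 0;
    for (size_t lane = 0; lane < VF; lane++)
        s += acc[lane];
    for (; i < n; i++) /* epilogue for the remaining n % VF elements */
        s += a[i];
    return s;
}

/* Runtime guard: take the vector path only when the trip count reaches
   the profitability threshold; otherwise run the scalar loop. */
long guarded_sum(const int *a, size_t n, int *used_vector)
{
    assert(a != NULL || n == 0);
    *used_vector = n >= MIN_PROFITABLE_ITERS;
    return *used_vector ? vector_style_sum(a, n) : scalar_sum(a, n);
}
```

Both paths compute the same result; changing how the threshold is calculated only moves the trip count at which the vector path is taken, which can affect performance when a hot loop's typical trip count sits near the threshold.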