https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604

            Bug ID: 82604
           Summary: [8 Regression] SPEC CPU2006 410.bwaves ~50%
                    performance regression with trunk@253679 when
                    ftree-parallelize-loops is used
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Minimal options to reproduce the regression (4 threads is just an example; more
can be used):
-Ofast -funroll-loops -flto -ftree-parallelize-loops=4

After r253679, auto-parallelization became mostly ineffective for 410.bwaves.
CPU time is distributed across threads as follows:
Revision Thread0 Thread1 Thread2 Thread3
r253679: ~91%    ~3%     ~3%     ~3%
r253678: ~34%    ~22%    ~22%    ~22%

Linking with "-fopt-info-loop-optimized" shows that only half as many loops are
parallelized:
---
gfortran -Ofast -funroll-loops -flto -ftree-parallelize-loops=4 -g
-fopt-info-loop-optimized=loop.optimized *.o
grep parallelizing loop.optimized -c
---
r253679: 19 parallelized loops
r253678: 38 parallelized loops

The most significant missed parallelization is
"block_solver.f:170:0: note: parallelizing outer loop 2"
in the hottest function, "mat_times_vec".
