https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83326

            Bug ID: 83326
           Summary: [8 Regression] SPEC CPU2017 648.exchange2_s ~6%
                    performance regression with r255267 (reproducer
                    attached)
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Created attachment 42815
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42815&action=edit
exchange2_reproducer

The benchmark regression is clearly noticeable on Broadwell/Haswell with:
-m64 -Ofast -funroll-loops -march=core-avx2 -mfpmath=sse -flto -fopenmp

The attached reproducer runs ~27% longer with:
-m64 -O[3|fast] -funroll-loops

There are 18 similar lines in the 648.exchange2_s source code whose execution
time changed noticeably after r255267.
They look like:
---
    some_int_array(index1+1:index1+2, 1:3, index2) = some_int_array(index1+1:index1+2, 1:3, index2) - 10
---

"-fopt-info-loop-optimized" shows that each of these lines is unrolled with 2
iterations and with 3 iterations at r255266 (two completely unrolled loops).
This seems reasonable, since the source shows that two rows and three
columns are being modified.
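
To make the "two rows and three columns" reading concrete, here is a minimal
sketch of the statement next to a hand-written loop nest with the same effect.
The module name, routine names and the assumed-shape dummy array are
illustrative assumptions, not taken from the 648.exchange2_s source, and the
loop order gfortran actually generates for the array section may differ.

```fortran
! Illustrative sketch only: names and array shape are assumptions,
! not code from 648.exchange2_s.
module section_demo
  implicit none
contains

  ! Array-section form, mirroring the statement quoted in this report.
  subroutine subtract_section(a, index1, index2)
    integer, intent(inout) :: a(:, :, :)
    integer, intent(in)    :: index1, index2
    a(index1+1:index1+2, 1:3, index2) = a(index1+1:index1+2, 1:3, index2) - 10
  end subroutine subtract_section

  ! Hand-written equivalent: a 3-iteration loop over columns around a
  ! 2-iteration loop over rows, i.e. the two trip counts seen in the
  ! r255266 unrolling notes.
  subroutine subtract_loops(a, index1, index2)
    integer, intent(inout) :: a(:, :, :)
    integer, intent(in)    :: index1, index2
    integer :: i, j
    do j = 1, 3
      do i = index1 + 1, index1 + 2
        a(i, j, index2) = a(i, j, index2) - 10
      end do
    end do
  end subroutine subtract_loops

end module section_demo
```

Both routines update the same six elements, so a small driver can call each one
on a copy of the same array and compare the results.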
For a particular line, r255266:
---
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled (header execution count 22444250)
exchange2.fppized.f90:1135:0: note: loop with 3 iterations completely unrolled (header execution count 16831504)
---

With r255267 the output differs, and both unrolled loops are reported with 2 iterations:
---
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled (header execution count 14963581)
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled (header execution count 11221564)
---
