https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83326
            Bug ID: 83326
           Summary: [8 Regression] SPEC CPU2017 648.exchange2_s ~6% performance regression with r255267 (reproducer attached)
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

Created attachment 42815
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42815&action=edit
exchange2_reproducer

The benchmark regression is clearly noticeable on Broadwell/Haswell with:
-m64 -Ofast -funroll-loops -march=core-avx2 -mfpmath=sse -flto -fopenmp

The attached reproducer demonstrates ~27% longer execution time with:
-m64 -O[3|fast] -funroll-loops

There are 18 similar lines in the 648.exchange2_s source code whose execution time changed noticeably after r255267.
Each of them looks like:
---
some_int_array(index1+1:index1+2, 1:3, index2) = some_int_array(index1+1:index1+2, 1:3, index2) - 10
---

With r255266, "-fopt-info-loop-optimized" shows that each of these lines is completely unrolled as one loop with 2 iterations and one loop with 3 iterations.
This seems reasonable, since the source shows that two rows and three columns are being modified.

For a particular line, r255266:
---
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled (header execution count 22444250)
exchange2.fppized.f90:1135:0: note: loop with 3 iterations completely unrolled (header execution count 16831504)
---

With r255267 the result is different, and both loops are reported as having 2 iterations:
---
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled (header execution count 14963581)
exchange2.fppized.f90:1135:0: note: loop with 2 iterations completely unrolled (header execution count 11221564)
---
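
For illustration only, the statement pattern quoted above can be put into a small standalone program. This is a hypothetical sketch, not the attached reproducer: the program name, the 9x9x9 array shape and the values of index1 and index2 are assumptions. It shows the array-section form next to an explicit 2x3 loop nest that updates the same six elements, which is roughly where the 2-iteration and 3-iteration loops in the -fopt-info output come from.

```fortran
! Hypothetical sketch, NOT the attached exchange2_reproducer.
! Array shape and index values are assumptions chosen for illustration.
program section_update
  implicit none
  integer :: some_int_array(9, 9, 9)
  integer :: index1, index2, i, j

  some_int_array = 0
  index1 = 3
  index2 = 5

  ! Array-section form, as quoted in the report:
  ! two rows (index1+1 .. index1+2) by three columns (1 .. 3).
  some_int_array(index1+1:index1+2, 1:3, index2) = &
      some_int_array(index1+1:index1+2, 1:3, index2) - 10

  ! Explicit loop nest touching the same six elements:
  ! an outer loop with 3 iterations and an inner loop with 2 iterations.
  do j = 1, 3
    do i = index1 + 1, index1 + 2
      some_int_array(i, j, index2) = some_int_array(i, j, index2) - 10
    end do
  end do

  ! Six elements were decremented by 10 twice, so the sum is -120.
  print *, sum(some_int_array)
end program section_update
```

Compiling such a file with, for example, "gfortran -O3 -funroll-loops -fopt-info-loop-optimized" prints the unrolling notes for inspection. Whether this small sketch shows the same difference between r255266 and r255267 has not been checked; the attached reproducer remains the reference.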