http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59967
Bug ID: 59967
Summary: Performance regression from 4.7.x to 4.8.x (loop not
unrolled)
Product: gcc
Version: 4.8.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: chbreitkopf at gmail dot com
Created attachment 31967
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31967&action=edit
preprocessed source of ray/src/rt/ambient.c
gcc 4.8.x generates 10-15% slower code compared to 4.7.x for Mark Stock's
radiance benchmark (http://markjstock.org/pages/rad_bench.html).
I observed this regression on Linux x86_64, and with different CPUs (Ivy
Bridge, Haswell, AMD Phenom, Kaveri). I had suspected the new register
allocator, but the actual cause is a difference in loop unrolling.
The hotspot is the nested loops with the recursive call at the end of the
sumambient() function. When using -Ofast, gcc 4.7.x will unroll the outer loop,
which results in some optimization possibilities in the inner loop. gcc 4.8.x
does not unroll the outer loop. -funroll-loops does not change the behavior.