[Bug tree-optimization/59967] [4.8/4.9 Regression] Performance regression from 4.7.x to 4.8.x (loop not unrolled)

2014-04-02 Thread chbreitkopf at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59967

--- Comment #3 from Christoph Breitkopf ---
It's this conditional in the inner loop. The expression becomes constant only
if both loops are unrolled (i and j are the loop counters):

   if (1<

[Bug c/59967] New: Performance regression from 4.7.x to 4.8.x (loop not unrolled)

2014-01-28 Thread chbreitkopf at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59967

            Bug ID: 59967
           Summary: Performance regression from 4.7.x to 4.8.x (loop not unrolled)
           Product: gcc
           Version: 4.8.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: chbreitkopf at gmail dot com

Created attachment 31967
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31967&action=edit
preprocessed source of ray/src/rt/ambient.c

gcc 4.8.x generates code that is 10-15% slower than the code from 4.7.x for
Mark Stock's radiance benchmark (http://markjstock.org/pages/rad_bench.html).

I observed this regression on Linux x86_64, and with different CPUs (Ivy
Bridge, Haswell, AMD Phenom, Kaveri). I had suspected the new register
allocator, but the actual cause is a difference in loop unrolling.

The hotspot is the pair of nested loops with the recursive call at the end of
the sumambient() function. With -Ofast, gcc 4.7.x unrolls the outer loop, which
enables further optimizations in the inner loop. gcc 4.8.x does not unroll the
outer loop, and adding -funroll-loops does not change that.
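
To illustrate why unrolling matters for this kind of loop nest, here is a
minimal sketch. It is not the code from ambient.c: the function names, the
octant-test logic and the bit test (i >> j) & 1 are assumptions made up for
illustration. The rolled version has a condition that depends on both loop
counters, so it can only be folded to a constant once both loops are unrolled.
The second function shows roughly what a single fully unrolled iteration
(i == 5) reduces to.

```c
/* Hypothetical stand-in for the hot loop, NOT the Radiance source.
 * Counts how many of the 8 child octants of a cube contain point p.
 * The test (i >> j) & 1 depends on both loop counters, so it becomes a
 * compile-time constant only after both loops are unrolled. */
int octants_hit_rolled(const double base[3], double size, const double p[3])
{
    int hits = 0;
    for (int i = 0; i < 8; i++) {          /* outer loop: one pass per octant */
        int j;
        for (j = 0; j < 3; j++) {          /* inner loop: one pass per axis */
            double lo = base[j];
            if ((i >> j) & 1)              /* constant only when i, j are known */
                lo += size;
            if (p[j] < lo || p[j] >= lo + size)
                break;                     /* p is outside this octant */
        }
        if (j == 3)                        /* all three axes passed */
            hits++;
    }
    return hits;
}

/* Roughly what full unrolling leaves for the iteration i == 5 (binary 101):
 * the bit tests are gone and each lower bound is a plain expression. */
int octant5_hit_unrolled(const double base[3], double size, const double p[3])
{
    double lo0 = base[0] + size;           /* bit 0 of 5 is set */
    double lo1 = base[1];                  /* bit 1 of 5 is clear */
    double lo2 = base[2] + size;           /* bit 2 of 5 is set */
    return p[0] >= lo0 && p[0] < lo0 + size
        && p[1] >= lo1 && p[1] < lo1 + size
        && p[2] >= lo2 && p[2] < lo2 + size;
}
```

Comparing the assembly for the rolled function under the two compiler versions
(for example with -S) should show whether the bit test survives in the inner
loop.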