Testcase:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define SIZE 256*1024*1024

float *data;

static inline double one()
{
  int i;
  double sum;
  sum = 0;
  for (i=0; i<SIZE; i++)
    sum += data[i];
  return sum;
}

int f(double);

int main(int argc,char** argv)
{
  struct timeval tv0,tv1;
  double s0,s1;
  int i;
  data = malloc(SIZE*sizeof(float));
  for (i=0; i<SIZE; i++)
    data[i] = 1;
  s0 = 0;
  for (i=0; i<SIZE; i++)
    s0 += data[i];
  printf("%f\n", s0);
  s1 = one();
  printf("%f\n", s1);
  free(data);
  return 0;
}

--------------
At -O2 -static on powerpc-darwin, we get an inner loop with:

L6:
        lfsx f0,r2,r9
        addi r2,r2,4
        lfd f13,56(r1)
        fadd f13,f13,f0
        stfd f13,56(r1)
        bdnz L6

That is, the accumulator is loaded from and stored back to the stack on
every iteration. With -O2 -fno-split-wide-types, we get the correct thing:

L6:
        lfsx f0,r2,r9
        addi r2,r2,4
        fadd f13,f13,f0
        bdnz L6

In a way this is caused by how we expand the var-args function call:

(insn 90 89 91 7 (set (subreg:DF (reg:DI 183) 0)
        (reg/v:DF 158 [ s0 ])) -1 (nil) (nil))
(insn 91 90 92 7 (set (reg:DI 184)
        (reg:DI 183)) -1 (nil) (nil))
(insn 92 91 93 7 (set (reg:DI 185)
        (reg:DI 183)) -1 (nil) (nil))
(insn 93 92 94 7 (set (reg:DF 186)
        (subreg:DF (reg:DI 184) 0)) -1 (nil) (nil))
(insn 94 93 95 7 (set (reg:DF 187)
        (subreg:DF (reg:DI 185) 0)) -1 (nil) (nil))
(insn 96 95 97 7 (set (reg:DF 4 r4)
        (reg:DF 186)) -1 (nil) (nil))
(insn 97 96 98 7 (set (reg:DF 33 f1)
        (reg:DF 187)) -1 (nil) (nil))

and by how the lower-subreg pass then comes along and splits up the
subregs of the DI registers.

--
           Summary: [4.3 Regression] lower subreg causes a performance
                    regression in the inner loop sometimes
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Keywords: missed-optimization, ra
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: pinskia at gcc dot gnu dot org
GCC target triplet: powerpc-apple-darwin

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31455
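
For readers less familiar with the RTL above, here is a rough C analogy of what the DF-in-DI subregs express and what splitting the DI value into words means on a 32-bit target. It is only an illustration, not what GCC does internally: the helper names are hypothetical, and it assumes a 64-bit IEEE 754 double.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: view the bits of a double as one 64-bit integer,
   loosely analogous to (subreg:DF (reg:DI 183) 0) in the RTL dump. */
static uint64_t double_bits(double d)
{
    uint64_t bits;
    assert(sizeof(double) == sizeof(uint64_t)); /* assumption: 64-bit double */
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Hypothetical helper: split the 64-bit value into two 32-bit words,
   loosely analogous to a DI register being handled as two SI words. */
static void split_words(uint64_t bits, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t)(bits >> 32);
    *lo = (uint32_t)bits;
}

/* Hypothetical helper: rebuild the double from the two words. */
static double join_words(uint32_t hi, uint32_t lo)
{
    uint64_t bits = ((uint64_t)hi << 32) | lo;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Calling split_words(double_bits(s0), &hi, &lo) followed by join_words(hi, lo) gives back the same value. The point of the analogy is that once a floating-point value is treated as a pair of integer words, it no longer looks like something that can simply stay in one floating-point register.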