In the attached test case, GCC 3.4.* generates better code than 4.0 (or 4.1) because it moves more loop-invariant code out of the inner loop of P7Viterbi. The problem seems to be in the alias analysis that determines what can be moved out of that loop. If you change the field M, which is unused, from int to float, then GCC 4.* generates better code. I tried the structure-alias branch to see if that helped, and it did not. See the email thread starting at http://gcc.gnu.org/ml/gcc/2005-03/msg00835.html for more information.
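To make concrete what "moving loop-invariant code out of the inner loop" means here, the following is a rough hand-hoisted sketch of the P7Viterbi loop, not the code either compiler actually emits. The names viterbi_plain, viterbi_hoisted, make_matrix and variants_agree are made up for illustration, and the matrix sizes are assumptions chosen so that every access stays in bounds. The hoisting is only valid if the int stores in the inner loop cannot overwrite mmx[i], mmx[i-1] or hmm->tsc[0], which is what the alias analysis has to establish.

```c
#include <assert.h>
#include <stdlib.h>

struct plan7_s {
  int M;        /* unused, as in the test case */
  int **tsc;
};

/* Inner loop as written in the test case: mmx[i], mmx[i-1] and
   hmm->tsc[0] are re-read on every k iteration unless the compiler
   can show that the store to mmx[i][k] does not modify them. */
static void viterbi_plain(int L, int M, struct plan7_s *hmm, int **mmx)
{
  int i, k;
  for (i = 1; i <= L; i++)
    for (k = 1; k <= M; k++)
      mmx[i][k] = mmx[i-1][k-1] + hmm->tsc[0][k-1];
}

/* Hand-hoisted variant: the loads that do not depend on k are done
   once per outer iteration. */
static void viterbi_hoisted(int L, int M, struct plan7_s *hmm, int **mmx)
{
  int i, k;
  for (i = 1; i <= L; i++) {
    int *cur  = mmx[i];
    int *prev = mmx[i-1];
    int *tsc0 = hmm->tsc[0];
    for (k = 1; k <= M; k++)
      cur[k] = prev[k-1] + tsc0[k-1];
  }
}

/* Hypothetical helper: (L+1) zeroed rows of (M+1) ints, row 0 seeded. */
static int **make_matrix(int L, int M)
{
  int i, k;
  int **m = malloc((L + 1) * sizeof(int *));
  assert(m != NULL);
  for (i = 0; i <= L; i++) {
    m[i] = calloc(M + 1, sizeof(int));
    assert(m[i] != NULL);
  }
  for (k = 0; k <= M; k++)
    m[0][k] = k;
  return m;
}

/* Hypothetical check: returns 1 if both variants fill the matrix
   with the same values, 0 otherwise. */
int variants_agree(int L, int M)
{
  struct plan7_s hmm;
  int *row = malloc(M * sizeof(int));
  int **a = make_matrix(L, M);
  int **b = make_matrix(L, M);
  int i, k, same = 1;

  assert(row != NULL);
  for (k = 0; k < M; k++)
    row[k] = 2 * k + 1;
  hmm.M = M;
  hmm.tsc = &row;               /* tsc[0] is the single score row */

  viterbi_plain(L, M, &hmm, a);
  viterbi_hoisted(L, M, &hmm, b);

  for (i = 0; i <= L; i++) {
    for (k = 0; k <= M; k++)
      if (a[i][k] != b[i][k])
        same = 0;
    free(a[i]);
    free(b[i]);
  }
  free(a);
  free(b);
  free(row);
  return same;
}
```

Calling variants_agree(500, 10) runs both forms with the L and M values used in the test case and compares the results; the two should agree whenever the stored ints do not overlap the pointer rows.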
Test case:

#define L_CONST 500

void *malloc(long size);

struct plan7_s {
  int M;
  int **tsc;   /* transition scores [0.6][1.M-1] */
};

struct dpmatrix_s {
  int **mmx;
};

struct dpmatrix_s *mx;

void AllocPlan7Body(struct plan7_s *hmm, int M)
{
  int i;

  hmm->tsc = malloc (7 * sizeof(int *));
  hmm->tsc[0] = malloc ((M+16) * sizeof(int));
  mx->mmx = (int **) malloc(sizeof(int *) * (L_CONST+1));
  for (i = 0; i <= L_CONST; i++) {
    mx->mmx[i] = malloc (M+2+16);
  }
  return;
}

void P7Viterbi(int L, int M, struct plan7_s *hmm, int **mmx)
{
  int i,k;

  for (i = 1; i <= L; i++) {
    for (k = 1; k <= M; k++) {
      mmx[i][k] = mmx[i-1][k-1] + hmm->tsc[0][k-1];
    }
  }
}

main ()
{
  struct plan7_s *hmm;
  char dsq[L_CONST];
  int i;

  hmm = (struct plan7_s *) malloc (sizeof (struct plan7_s));
  mx = (struct dpmatrix_s *) malloc (sizeof (struct dpmatrix_s));
  AllocPlan7Body(hmm, 10);
  for (i = 0; i < 600000; i++) {
    P7Viterbi(500, 10, hmm, mx->mmx);
  }
}

--
Summary: Tree loop optimizer does worse job than RTL loop optimizer
Product: gcc
Version: 4.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: sje at cup dot hp dot com
CC: gcc-bugs at gcc dot gnu dot org
GCC build triplet: ia64-*-*
GCC host triplet: ia64-*-*
GCC target triplet: ia64-*-*

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=20643