http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57204
--- Comment #2 from Sasanka Nagavalli <snagavallis at outlook dot com> 2013-05-09 00:19:50 UTC --- Created attachment 30067 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30067 Test case 2 for issue 57204 The original test and description are slightly misleading because they could be interpreted to suggest that the compiler should generate the same optimized code for the good case and bad case. This was not the intent of providing those two tests. The two tests may give the same result algorithmically if all elements of d are non-negative, but the actual operations can be different. Consider the following: Good: k=0: ... k=1: i=0: d_ik = d[1] j=0: ... j=1: t = d_ik + d[n+1] d[1] = MIN(d[1],t) j=2: t = d_ik + d[n+2] <--- This is different d[2] = MIN(d[2],t) ... Bad: k=0: ... k=1: i=0: j=0: ... j=1: t = d[1] + d[n+1] d[1] = MIN(d[1],t) j=2: t = d[1] + d[n+2] <--- This is different d[2] = MIN(d[2],t) ... Attached a second set of tests to more clearly demonstrate the issue. The following cases result in the same operations, but the compiler still generates worse code for the bad case than the good case. In the bad case, it once again generates slower code with auto-vectorization than without auto-vectorization. New Good: void foo(float * d, int n) { int i, j, k; for (k=0; k<n; ++k) { for (i=0; i<n; ++i) { for (j=0; j<k; ++j) { float t = d[i*n+k] + d[k*n+j]; d[i*n+j] = (d[i*n+j] < t) ? d[i*n+j] : t; } for (j=k; j<n; ++j) { float t = d[i*n+k] + d[k*n+j]; d[i*n+j] = (d[i*n+j] < t) ? d[i*n+j] : t; } } } }