http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179

Uros Bizjak <ubizjak at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011-11-22
                 CC|                            |irar at il dot ibm.com
          Component|target                      |tree-optimization
     Ever Confirmed|0                           |1

--- Comment #2 from Uros Bizjak <ubizjak at gmail dot com> 2011-11-22 11:33:24 UTC ---
We can start here with something that hopefully resembles your original fortran
code:

--cut here--
double C[10][4], B[10][10], A[10][4];

void
test (void)
{
  int i = 0, j = 0, l = 0;

  //for (; j < 10; j += 2)
  //  for (; l < 10; l++)
      for (; i < 4; i++)
        {
          C[j+0][i] = C[j+0][i] + A[l][i] * B[j+0][l];
          C[j+1][i] = C[j+1][i] + A[l][i] * B[j+1][l];
        }
}
--cut here--

gcc -O3 -ffast-math -mfma4 -mavx:

test:
        vmovapd         A(%rip), %ymm0
        vbroadcastsd    B(%rip), %ymm1
        vfmaddpd        C(%rip), %ymm1, %ymm0, %ymm1
        vmovapd         %ymm1, C(%rip)
        vbroadcastsd    B+80(%rip), %ymm1
        vfmaddpd        C+32(%rip), %ymm1, %ymm0, %ymm0
        vmovapd         %ymm0, C+32(%rip)
        vzeroupper
        ret

Nice.

Now uncomment the second loop ("l" index) and this kernel will break:

< ... lots of code deleted ... >

.L3:
        vmovupd         (%r8,%rax), %xmm1
        addl            $1, %esi
        vinsertf128     $0x1, 16(%r8,%rax), %ymm1, %ymm1
        vfmaddpd        %ymm0, %ymm5, %ymm1, %ymm0
        vmovapd         %ymm0, (%rbx,%rax)
        vmovupd         (%rcx,%rax), %xmm0
        vinsertf128     $0x1, 16(%rcx,%rax), %ymm0, %ymm0
        vfmaddpd        %ymm0, %ymm4, %ymm1, %ymm0
        vmovupd         %xmm0, (%rcx,%rax)
        vextractf128    $0x1, %ymm0, 16(%rcx,%rax)
        addq            $32, %rax
        cmpl            %r10d, %esi
        jb              .L3

< ... lots of code deleted ... >

This already happens in the tree optimizers (vectorizer), RTL is just following
this trail.

Confirmed as a vectorizer problem.
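
For anyone who wants to reproduce the second case, a minimal sketch of the kernel with the "l" loop enabled might look like the following. It assumes that i and l are meant to restart at 0 in their own loop headers, which is a guess at the intended kernel and not what the snippet in the comment literally does. The names test_l and check_kernel and the input values are hypothetical and only serve to check the result against a scalar reference.

```c
#include <stdio.h>

/* Array shapes taken from the test case in the comment. */
double C[10][4], B[10][10], A[10][4];

/* The kernel with the "l" loop enabled and j held at 0.
   Assumption: i and l restart at 0 in their own loop headers. */
void
test_l (void)
{
  int j = 0;

  for (int l = 0; l < 10; l++)
    for (int i = 0; i < 4; i++)
      {
        C[j+0][i] = C[j+0][i] + A[l][i] * B[j+0][l];
        C[j+1][i] = C[j+1][i] + A[l][i] * B[j+1][l];
      }
}

/* Hypothetical helper: fills the inputs with small integers (exact in
   double, so -ffast-math reassociation cannot change the result), runs
   the kernel, and returns the number of elements of rows 0 and 1 of C
   that differ from a plain scalar dot product.  */
int
check_kernel (void)
{
  int bad = 0;

  for (int l = 0; l < 10; l++)
    for (int i = 0; i < 4; i++)
      A[l][i] = l + i;
  for (int j = 0; j < 10; j++)
    for (int l = 0; l < 10; l++)
      B[j][l] = j + l;
  for (int j = 0; j < 10; j++)
    for (int i = 0; i < 4; i++)
      C[j][i] = 0.0;

  test_l ();

  for (int j = 0; j < 2; j++)
    for (int i = 0; i < 4; i++)
      {
        double ref = 0.0;

        for (int l = 0; l < 10; l++)
          ref += A[l][i] * B[j][l];
        if (C[j][i] != ref)
          {
            printf ("mismatch at C[%d][%d]: %g vs %g\n", j, i, C[j][i], ref);
            bad++;
          }
      }
  return bad;
}
```

Calling check_kernel () from a small main and compiling with the flags quoted in the comment (adding -S to keep the assembly) should make it possible to compare the generated loop body against the .L3 excerpt. Whether the same unaligned vmovupd/vinsertf128 sequence shows up will depend on the GCC version.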