------- Comment #8 from ubizjak at gmail dot com 2008-04-25 09:55 -------
The problem is indeed in trisolve:

    subroutine trisolve(x,i1,i2)
      integer :: i1 , i2
      real(dpkind),dimension(i2)::x
      integer :: i
      x(i1) = gi(i1)* x(i1)
      do i = i1+1 , i2
        x(i) = gi(i)*(x(i)-au1(i-1)*x(i-1))
      enddo
      do i = i2-1 , i1 , -1
        x(i) = x(i) - gi(i)*au1(i)*x(i+1)
      enddo
    end subroutine trisolve

Please note two very tight loops that calculate x[n] from the value x[n-1],
where x[n-1] is the result of the previous step. The .127t.optimized tree dump
for the first loop (the second loop is the same, only going from the last to
the first element) in the non-regressed case shows:

<bb 4>:
  MEM[base: ivtmp.297] = MEM[base: ivtmp.295] * ((MEM[base: ivtmp.297] - MEM[base: ivtmp.300] * MEM[base: ivtmp.297, offset: 0x0fffffffffffffff8]));
  ivtmp.295 = ivtmp.295 + D.3347;
  ivtmp.297 = ivtmp.297 + 8;
  ivtmp.300 = ivtmp.300 + 8;
  ivtmp.304 = ivtmp.304 + 1;
  if ((integer(kind=4)) ivtmp.304 == D.1652) goto <bb 5>; else goto <bb 4>;

This code results in:

.L3:
        movsd   (%r9), %xmm10
        addl    $4, %edx
        movsd   (%rcx), %xmm9
X+>     mulsd   -8(%rcx), %xmm10
        movsd   8(%rcx), %xmm7
        movsd   16(%rcx), %xmm5
        movsd   24(%rcx), %xmm3
        subsd   %xmm10, %xmm9
        mulsd   (%rax), %xmm9
        addq    %r10, %rax
1->     movsd   %xmm9, (%rcx)
        movsd   8(%r9), %xmm8
1+>     mulsd   %xmm9, %xmm8
        subsd   %xmm8, %xmm7
        mulsd   (%rax), %xmm7
        addq    %r10, %rax
2->     movsd   %xmm7, 8(%rcx)
        movsd   16(%r9), %xmm6
2+>     mulsd   %xmm7, %xmm6
        subsd   %xmm6, %xmm5
        mulsd   (%rax), %xmm5
        addq    %r10, %rax
3->     movsd   %xmm5, 16(%rcx)
        movsd   24(%r9), %xmm4
        addq    $32, %r9
3+>     mulsd   %xmm5, %xmm4
        subsd   %xmm4, %xmm3
        mulsd   (%rax), %xmm3
        addq    %r10, %rax
X->     movsd   %xmm3, 24(%rcx)
        addq    $32, %rcx
        cmpl    %ebp, %edx
        jne     .L3

In the code above, it can be seen how the unrolled iterations are linked
together. The result of the previous iteration (marked with N->) enters the
next iteration (marked with N+>) directly in a register.

BTW: The optimizer could also link X-> and X+> across the loop back edge, but
this is probably too much...

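To make the N-> / N+> linking concrete at source level, here is a minimal sketch of the first loop in C. It is not part of the testcase: the function names forward_reload and forward_carry, the 0-based indexing, and the unit stride for gi are assumptions made for illustration, and the arrays are assumed not to overlap. The first form reads x[i-1] back from memory in every iteration. The second keeps the value it has just stored in a local variable, which is roughly what the linked assembly above does with %xmm9, %xmm7 and %xmm5.

```c
#include <stddef.h>

/* Hypothetical sketch: reads x[i-1] back from memory in every
   iteration, so each step depends on the store of the previous one. */
static void forward_reload(double *x, const double *gi,
                           const double *au1, size_t n)
{
    size_t i;

    for (i = 1; i < n; i++)
        x[i] = gi[i] * (x[i] - au1[i - 1] * x[i - 1]);
}

/* Hypothetical sketch: carries the previous result in a local
   ("prev"), so the next step can consume it without a load. */
static void forward_carry(double *x, const double *gi,
                          const double *au1, size_t n)
{
    size_t i;
    double prev;

    if (n == 0)
        return;

    prev = x[0];
    for (i = 1; i < n; i++) {
        prev = gi[i] * (x[i] - au1[i - 1] * prev);
        x[i] = prev;   /* the store still happens, as in the N-> lines */
    }
}
```

Both functions compute the same values for non-overlapping arrays. A compiler is free to turn either form into the other, so the sketch only illustrates the data flow and does not predict what a given gcc version will emit.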
Patched gcc regressed in this area:

<bb 4>:
  MEM[base: ivtmp.297] = MEM[base: ivtmp.295] * ((MEM[base: ivtmp.297] - MEM[base: ivtmp.300] * MEM[base: ivtmp.302]));
  ivtmp.295 = ivtmp.295 + D.3349;
  ivtmp.297 = ivtmp.297 + 8;
  ivtmp.300 = ivtmp.300 + 8;
  ivtmp.302 = ivtmp.302 + 8;
  ivtmp.304 = ivtmp.304 + 1;
  if ((integer(kind=4)) ivtmp.304 == D.1652) goto <bb 5>; else goto <bb 4>;

This code results in:

.L3:
        movsd   (%r9), %xmm10
        addl    $4, %edx
        movsd   (%rcx), %xmm9
X+>     mulsd   (%r8), %xmm10
        movsd   8(%rcx), %xmm7
        movsd   16(%rcx), %xmm5
        movsd   24(%rcx), %xmm3
        subsd   %xmm10, %xmm9
        mulsd   (%rax), %xmm9
        addq    %rbx, %rax
1->     movsd   %xmm9, (%rcx)
        movsd   8(%r9), %xmm8
1+>     mulsd   8(%r8), %xmm8
        subsd   %xmm8, %xmm7
        mulsd   (%rax), %xmm7
        addq    %rbx, %rax
2->     movsd   %xmm7, 8(%rcx)
        movsd   16(%r9), %xmm6
2+>     mulsd   16(%r8), %xmm6
        subsd   %xmm6, %xmm5
        mulsd   (%rax), %xmm5
        addq    %rbx, %rax
3->     movsd   %xmm5, 16(%rcx)
        movsd   24(%r9), %xmm4
        addq    $32, %r9
3+>     mulsd   24(%r8), %xmm4
        addq    $32, %r8
        subsd   %xmm4, %xmm3
        mulsd   (%rax), %xmm3
        addq    %rbx, %rax
X->     movsd   %xmm3, 24(%rcx)
        addq    $32, %rcx
        cmpl    %r12d, %edx
        jne     .L3

In the code above, the links are broken. In each ".+>" case, gcc reloads from
memory (through the extra induction variable ivtmp.302, held in %r8) the same
value that is still available in the register used by the matching ".->" store.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34163