[Bug target/34163] 10% performance regression since Nov 1 on Polyhedron's "NF" on AMD64

ubizjak at gmail dot com Fri, 25 Apr 2008 02:56:15 -0700


------- Comment #8 from ubizjak at gmail dot com  2008-04-25 09:55 -------
The problem is indeed in trisolve:


subroutine trisolve(x,i1,i2)
integer :: i1 , i2
real(dpkind),dimension(i2)::x
integer :: i
x(i1) = gi(i1)* x(i1)
do i = i1+1 , i2
   x(i) = gi(i)*(x(i)-au1(i-1)*x(i-1))
enddo
do i = i2-1 , i1 , -1
   x(i) = x(i) - gi(i)*au1(i)*x(i+1)
enddo
end subroutine trisolve

Please note the two very tight loops: each calculates x[n] from a neighbouring
element (x[n-1] in the first loop, x[n+1] in the second) that is itself the
result of the previous iteration.
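
For readers more used to C, the recurrence can be written out as a minimal
sketch. This is only a transliteration under assumptions: the name trisolve_c
is made up, gi and au1 are passed as parameters (the Fortran takes them from
the enclosing scope), indices are 0-based, and gi is assumed to have unit
stride.

```c
#include <stddef.h>

/* Hypothetical C transliteration of trisolve. i1..i2 is an inclusive,
   0-based index range; gi and au1 are ordinary unit-stride arrays here. */
void trisolve_c(double *x, const double *gi, const double *au1,
                ptrdiff_t i1, ptrdiff_t i2)
{
    ptrdiff_t i;

    x[i1] = gi[i1] * x[i1];

    /* Forward sweep: x[i] needs the x[i-1] stored one iteration earlier. */
    for (i = i1 + 1; i <= i2; i++)
        x[i] = gi[i] * (x[i] - au1[i - 1] * x[i - 1]);

    /* Backward sweep: x[i] needs the x[i+1] stored one iteration earlier. */
    for (i = i2 - 1; i >= i1; i--)
        x[i] = x[i] - gi[i] * au1[i] * x[i + 1];
}
```

In both loops the value read on the right-hand side is exactly the value
stored by the previous iteration, so every iteration sits on one serial
dependence chain.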

The .127t.optimized tree dump for the first loop (the second loop is the same,
only going from the last to the first element) in the non-regressed case shows:

<bb 4>:
  MEM[base: ivtmp.297] = MEM[base: ivtmp.295] * ((MEM[base: ivtmp.297] -
MEM[base: ivtmp.300] * MEM[base: ivtmp.297, offset: 0x0fffffffffffffff8]));
  ivtmp.295 = ivtmp.295 + D.3347;
  ivtmp.297 = ivtmp.297 + 8;
  ivtmp.300 = ivtmp.300 + 8;
  ivtmp.304 = ivtmp.304 + 1;
  if ((integer(kind=4)) ivtmp.304 == D.1652)
    goto <bb 5>;
  else
    goto <bb 4>;

This code results in:

.L3:
        movsd   (%r9), %xmm10
        addl    $4, %edx
        movsd   (%rcx), %xmm9
X+>     mulsd   -8(%rcx), %xmm10
        movsd   8(%rcx), %xmm7
        movsd   16(%rcx), %xmm5
        movsd   24(%rcx), %xmm3
        subsd   %xmm10, %xmm9
        mulsd   (%rax), %xmm9
        addq    %r10, %rax
1->     movsd   %xmm9, (%rcx)
        movsd   8(%r9), %xmm8
1+>     mulsd   %xmm9, %xmm8
        subsd   %xmm8, %xmm7
        mulsd   (%rax), %xmm7
        addq    %r10, %rax
2->     movsd   %xmm7, 8(%rcx)
        movsd   16(%r9), %xmm6
2+>     mulsd   %xmm7, %xmm6
        subsd   %xmm6, %xmm5
        mulsd   (%rax), %xmm5
        addq    %r10, %rax
3->     movsd   %xmm5, 16(%rcx)
        movsd   24(%r9), %xmm4
        addq    $32, %r9
3+>     mulsd   %xmm5, %xmm4
        subsd   %xmm4, %xmm3
        mulsd   (%rax), %xmm3
        addq    %r10, %rax
X->     movsd   %xmm3, 24(%rcx)
        addq    $32, %rcx
        cmpl    %ebp, %edx
        jne     .L3

The code above shows how the unrolled iterations are linked together: the
result of the previous iteration (stored at N->) enters the next iteration
directly from its register (at N+>).

BTW: The optimizer could also link X-> and X+> (across the loop back-edge), but
this is probably too much to ask...
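
The same linking can be spelled out by hand. This is only a sketch of the
first loop in C, not what gcc emits: the function name forward_sweep_linked
and the local prev are made up, and gi is assumed to have unit stride. Unlike
the listing above, prev is also carried across the loop back-edge, which
corresponds to the X-> / X+> link.

```c
#include <stddef.h>

/* Hypothetical helper: forward sweep of trisolve, unrolled by four, with the
   freshly computed element carried in a local instead of re-read from x[]. */
void forward_sweep_linked(double *x, const double *gi, const double *au1,
                          ptrdiff_t i1, ptrdiff_t i2)
{
    ptrdiff_t i = i1 + 1;
    double prev;

    x[i1] = gi[i1] * x[i1];
    prev = x[i1];                       /* value entering the first iteration */

    /* Unrolled body: each statement pair consumes prev and stores the new
       value, like the N-> / N+> pairs in the listing. */
    for (; i + 3 <= i2; i += 4) {
        prev = gi[i]     * (x[i]     - au1[i - 1] * prev);  x[i]     = prev;
        prev = gi[i + 1] * (x[i + 1] - au1[i]     * prev);  x[i + 1] = prev;
        prev = gi[i + 2] * (x[i + 2] - au1[i + 1] * prev);  x[i + 2] = prev;
        prev = gi[i + 3] * (x[i + 3] - au1[i + 2] * prev);  x[i + 3] = prev;
    }

    /* Remainder iterations when the trip count is not a multiple of four. */
    for (; i <= i2; i++) {
        prev = gi[i] * (x[i] - au1[i - 1] * prev);
        x[i] = prev;
    }
}
```

No iteration reads x[i-1] back from memory here; the only loads from x are the
not-yet-updated x[i] values.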

Patched gcc regressed in this area:

<bb 4>:
  MEM[base: ivtmp.297] = MEM[base: ivtmp.295] * ((MEM[base: ivtmp.297] -
MEM[base: ivtmp.300] * MEM[base: ivtmp.302]));
  ivtmp.295 = ivtmp.295 + D.3349;
  ivtmp.297 = ivtmp.297 + 8;
  ivtmp.300 = ivtmp.300 + 8;
  ivtmp.302 = ivtmp.302 + 8;
  ivtmp.304 = ivtmp.304 + 1;
  if ((integer(kind=4)) ivtmp.304 == D.1652)
    goto <bb 5>;
  else
    goto <bb 4>;

This code results in:

.L3:
        movsd   (%r9), %xmm10
        addl    $4, %edx
        movsd   (%rcx), %xmm9
X+>     mulsd   (%r8), %xmm10
        movsd   8(%rcx), %xmm7
        movsd   16(%rcx), %xmm5
        movsd   24(%rcx), %xmm3
        subsd   %xmm10, %xmm9
        mulsd   (%rax), %xmm9
        addq    %rbx, %rax
1->     movsd   %xmm9, (%rcx)
        movsd   8(%r9), %xmm8
1+>     mulsd   8(%r8), %xmm8
        subsd   %xmm8, %xmm7
        mulsd   (%rax), %xmm7
        addq    %rbx, %rax
2->     movsd   %xmm7, 8(%rcx)
        movsd   16(%r9), %xmm6
2+>     mulsd   16(%r8), %xmm6
        subsd   %xmm6, %xmm5
        mulsd   (%rax), %xmm5
        addq    %rbx, %rax
3->     movsd   %xmm5, 16(%rcx)
        movsd   24(%r9), %xmm4
        addq    $32, %r9
3+>     mulsd   24(%r8), %xmm4
        addq    $32, %r8
        subsd   %xmm4, %xmm3
        mulsd   (%rax), %xmm3
        addq    %rbx, %rax
X->     movsd   %xmm3, 24(%rcx)
        addq    $32, %rcx
        cmpl    %r12d, %edx
        jne     .L3

In the code above, the links are broken. At each "N+>" instruction, gcc reloads
from memory (through %r8) the same value that is still available in the
register just stored by the matching "N->" instruction.
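
The difference between the two tree dumps can be mirrored in C, as a rough
sketch. The function and pointer names below are made up; the comments map
them to the ivtmp variables of the dumps. The stride of gi is given in
elements rather than bytes, and the functions cover only the loop, so x[0] is
expected to be scaled already. Whether a given compiler keeps the stored value
in a register for either form is not something this sketch demonstrates; it
only reproduces the addressing difference.

```c
#include <stddef.h>

/* Shape of the non-regressed dump: x[i-1] is addressed as p_x[-1], that is,
   off the same pointer that is used for the store to x[i]. */
void sweep_shared_base(double *x, const double *gi, const double *au1,
                       ptrdiff_t gi_stride, ptrdiff_t n)
{
    double *p_x = x + 1;                /* ivtmp.297 */
    const double *p_gi = gi;            /* ivtmp.295, advanced before use */
    const double *p_au = au1;           /* ivtmp.300 */
    ptrdiff_t k;

    for (k = 1; k < n; k++) {
        p_gi += gi_stride;
        *p_x = *p_gi * (*p_x - *p_au * p_x[-1]);
        p_x++;
        p_au++;
    }
}

/* Shape of the regressed dump: x[i-1] is read through its own induction
   pointer, which hides the fact that it points at the element just stored. */
void sweep_separate_base(double *x, const double *gi, const double *au1,
                         ptrdiff_t gi_stride, ptrdiff_t n)
{
    double *p_x = x + 1;                /* ivtmp.297 */
    const double *p_prev = x;           /* ivtmp.302 */
    const double *p_gi = gi;            /* ivtmp.295, advanced before use */
    const double *p_au = au1;           /* ivtmp.300 */
    ptrdiff_t k;

    for (k = 1; k < n; k++) {
        p_gi += gi_stride;
        *p_x = *p_gi * (*p_x - *p_au * *p_prev);
        p_x++;
        p_prev++;
        p_au++;
    }
}
```

Both functions compute the same result; they differ only in how the address of
x[i-1] is formed, which is the difference between the two dumps.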


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34163
