------- Comment #1 from luisgpm at linux dot vnet dot ibm dot com  2009-05-11 18:04 -------
Good asm code for a hot loop in swim's "calc1" function

10001e10:       lfd     f12,-10672(r11)
10001e14:       lfd     f9,-10672(r9)
10001e18:       addi    r21,r21,16
10001e1c:       lfd     f7,-10680(r11)
10001e20:       lfd     f6,-10672(r6)
10001e24:       fmul    f3,f9,f9
10001e28:       cmpw    r21,r0
10001e2c:       fadd    f4,f7,f12
10001e30:       lfd     f22,-10680(r9)
10001e34:       lfd     f10,-10664(r9)
10001e38:       addi    r9,r9,16
10001e3c:       lfd     f23,-10672(r5)
10001e40:       lfd     f13,-10664(r5)
10001e44:       addi    r5,r5,16
10001e48:       lfd     f5,-10664(r11)
10001e4c:       fsub    f28,f23,f9
10001e50:       fsub    f25,f13,f10
10001e54:       lfd     f13,-10672(r4)
10001e58:       addi    r11,r11,16
10001e5c:       fadd    f5,f12,f5
10001e60:       fsub    f20,f13,f0
10001e64:       fmul    f9,f11,f9
10001e68:       fmadd   f27,f22,f22,f3
10001e6c:       fmadd   f30,f10,f10,f3
10001e70:       lfd     f3,-10680(r8)
10001e74:       fadd    f26,f4,f6
10001e78:       fmul    f10,f11,f10
10001e7c:       fmul    f24,f28,f2
10001e80:       fmul    f21,f25,f2
10001e84:       fmul    f4,f9,f4
10001e88:       fmadd   f22,f0,f0,f27
10001e8c:       fadd    f27,f8,f7
10001e90:       fadd    f23,f26,f8
10001e94:       fmul    f26,f0,f11
10001e98:       lfd     f8,-10664(r6)
10001e9c:       lfd     f0,-10664(r4)
10001ea0:       addi    r6,r6,16
10001ea4:       fadd    f29,f5,f8
10001ea8:       fsub    f25,f0,f13
10001eac:       addi    r4,r4,16
10001eb0:       fmsub   f28,f20,f1,f24
10001eb4:       lfd     f20,-10672(r8)
10001eb8:       fmul    f5,f10,f5
10001ebc:       addi    r8,r8,16
10001ec0:       stfd    f4,-10672(r22)
10001ec4:       stfd    f5,-10664(r22)
10001ec8:       addi    r22,r22,16
10001ecc:       fmul    f27,f26,f27
10001ed0:       fadd    f24,f6,f29
10001ed4:       fmsub   f29,f25,f1,f21
10001ed8:       fdiv    f28,f28,f23
10001edc:       fmadd   f25,f13,f13,f30
10001ee0:       fadd    f6,f6,f12
10001ee4:       fmadd   f30,f3,f3,f22
10001ee8:       stfd    f27,-10680(r3)
10001eec:       fdiv    f29,f29,f24
10001ef0:       fmadd   f3,f20,f20,f25
10001ef4:       fmul    f20,f13,f11
10001ef8:       fmadd   f7,f30,f31,f7
10001efc:       stfd    f7,-10680(r10)
10001f00:       fmadd   f12,f3,f31,f12
10001f04:       fmul    f13,f20,f6
10001f08:       stfd    f12,-10672(r10)
10001f0c:       stfd    f13,-10672(r3)
10001f10:       addi    r10,r10,16
10001f14:       addi    r3,r3,16
10001f18:       stfd    f28,-10672(r7)
10001f1c:       stfd    f29,-10664(r7)
10001f20:       addi    r7,r7,16
10001f24:       bne     10001e10 <calc1_+0x1b0>

----------
Bad asm code for the same loop

10001a60:       addis   r27,r9,-435
10001a64:       addis   r12,r11,-2176
10001a68:       lfd     f13,-7440(r27)
10001a6c:       lfd     f10,28344(r12)
10001a70:       addis   r8,r11,-1958
10001a74:       addis   r10,r11,-1740
10001a78:       fsub    f7,f10,f13
10001a7c:       lfd     f8,-704(r8)
10001a80:       lfd     f10,0(r9)
10001a84:       addis   r7,r9,-218
10001a88:       addis   r28,r9,1523
10001a8c:       lfd     f9,-29752(r10)
10001a90:       fadd    f6,f12,f10
10001a94:       fsub    f2,f8,f0
10001a98:       addis   r12,r11,218
10001a9c:       addis   r27,r9,2176
10001aa0:       fadd    f5,f11,f9
10001aa4:       fadd    f11,f11,f12
10001aa8:       addi    r9,r9,8
10001aac:       cmpw    r6,r9
10001ab0:       fmul    f1,f7,f30
10001ab4:       fmul    f7,f13,f13
10001ab8:       fmul    f13,f13,f3
10001abc:       fadd    f31,f5,f6
10001ac0:       lfd     f5,29040(r7)
10001ac4:       fmsub   f2,f2,f29,f1
10001ac8:       fmadd   f1,f0,f0,f7
10001acc:       fmul    f0,f0,f3
10001ad0:       fmul    f6,f13,f6
10001ad4:       stfd    f6,-6728(r28)
10001ad8:       fdiv    f2,f2,f31
10001adc:       fmadd   f5,f5,f5,f1
10001ae0:       fmul    f31,f0,f11
10001ae4:       fmr     f0,f8
10001ae8:       stfd    f31,0(r11)
10001aec:       fmr     f11,f9
10001af0:       addi    r11,r11,8
10001af4:       fadd    f1,f5,f4
10001af8:       fmr     f4,f7
10001afc:       fmadd   f5,f1,f28,f12
10001b00:       fmr     f12,f10
10001b04:       stfd    f5,-28344(r27)
10001b08:       stfd    f2,-29040(r12)
10001b0c:       bne+    10001a60 <calc1_+0xe0>

----------

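Looking into the differences between the two cases, the good code seems to
traverse the arrays differently from the bad one. It keeps a separate pointer
register per array (r3-r11 and r22), bumps each by 16 per iteration, and
reaches every element with a displacement between -10680 and -10664. The bad
code advances only r9 and r11, by 8 per iteration, and rebuilds the other
array addresses with eight addis instructions on every pass, so its
displacements range from -29752 to 29040.

Also, it looks like the good case is unrolled by a factor of two and the bad
case is not unrolled: the good loop has 14 loads, 8 stores and 2 fdivs in 70
instructions, against 6 loads, 4 stores and 1 fdiv in 44 instructions for the
bad loop. The bad loop also spends four fmr instructions per iteration
carrying values over to the next iteration.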
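
To make the two traversal styles concrete, here is a minimal C sketch. It is
not the swim source: the kernel, the function names sum_scaled_ptr and
sum_scaled_idx, and the STRIDE constant are all made up for illustration. The
first function keeps one pointer per array and handles two elements per
iteration, so every access is a small constant offset from its own pointer.
The second uses a single running index into one block, so each array sits a
large constant offset away. lfd/stfd take a 16-bit signed displacement, so an
offset of 32768 bytes or more cannot be encoded directly, which is the kind of
situation where addis shows up. Whether a given compiler emits either form for
this source depends on the version and options.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel, for illustration only: out[i] = (a[i] + b[i]) * k. */

/* Style 1: one pointer per array, bumped once per iteration, unrolled by 2.
   Each access is a small constant offset (0 or 8 bytes) from its pointer. */
void sum_scaled_ptr(double *out, const double *a, const double *b,
                    double k, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        out[0] = (a[0] + b[0]) * k;
        out[1] = (a[1] + b[1]) * k;
        out += 2;
        a += 2;
        b += 2;
    }
    if (i < n)                      /* odd n: one element left over */
        out[0] = (a[0] + b[0]) * k;
}

/* Style 2: three arrays packed into one block, one running index.
   STRIDE doubles = 32768 bytes, just past the 16-bit signed displacement
   range, so the second and third arrays need more than a plain offset. */
enum { STRIDE = 4096 };

void sum_scaled_idx(double *block, double k, size_t n)
{
    assert(n <= STRIDE);            /* arrays must not overlap */
    for (size_t i = 0; i < n; i++)
        block[2 * STRIDE + i] = (block[i] + block[STRIDE + i]) * k;
}
```

Both functions compute the same values; only the address arithmetic differs,
which is the difference the two listings above show.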


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40029
