------- Comment #1 from luisgpm at linux dot vnet dot ibm dot com 2009-05-11 18:04 -------

Good asm code for a hot loop in swim's "calc1" function

10001e10: lfd f12,-10672(r11)
10001e14: lfd f9,-10672(r9)
10001e18: addi r21,r21,16
10001e1c: lfd f7,-10680(r11)
10001e20: lfd f6,-10672(r6)
10001e24: fmul f3,f9,f9
10001e28: cmpw r21,r0
10001e2c: fadd f4,f7,f12
10001e30: lfd f22,-10680(r9)
10001e34: lfd f10,-10664(r9)
10001e38: addi r9,r9,16
10001e3c: lfd f23,-10672(r5)
10001e40: lfd f13,-10664(r5)
10001e44: addi r5,r5,16
10001e48: lfd f5,-10664(r11)
10001e4c: fsub f28,f23,f9
10001e50: fsub f25,f13,f10
10001e54: lfd f13,-10672(r4)
10001e58: addi r11,r11,16
10001e5c: fadd f5,f12,f5
10001e60: fsub f20,f13,f0
10001e64: fmul f9,f11,f9
10001e68: fmadd f27,f22,f22,f3
10001e6c: fmadd f30,f10,f10,f3
10001e70: lfd f3,-10680(r8)
10001e74: fadd f26,f4,f6
10001e78: fmul f10,f11,f10
10001e7c: fmul f24,f28,f2
10001e80: fmul f21,f25,f2
10001e84: fmul f4,f9,f4
10001e88: fmadd f22,f0,f0,f27
10001e8c: fadd f27,f8,f7
10001e90: fadd f23,f26,f8
10001e94: fmul f26,f0,f11
10001e98: lfd f8,-10664(r6)
10001e9c: lfd f0,-10664(r4)
10001ea0: addi r6,r6,16
10001ea4: fadd f29,f5,f8
10001ea8: fsub f25,f0,f13
10001eac: addi r4,r4,16
10001eb0: fmsub f28,f20,f1,f24
10001eb4: lfd f20,-10672(r8)
10001eb8: fmul f5,f10,f5
10001ebc: addi r8,r8,16
10001ec0: stfd f4,-10672(r22)
10001ec4: stfd f5,-10664(r22)
10001ec8: addi r22,r22,16
10001ecc: fmul f27,f26,f27
10001ed0: fadd f24,f6,f29
10001ed4: fmsub f29,f25,f1,f21
10001ed8: fdiv f28,f28,f23
10001edc: fmadd f25,f13,f13,f30
10001ee0: fadd f6,f6,f12
10001ee4: fmadd f30,f3,f3,f22
10001ee8: stfd f27,-10680(r3)
10001eec: fdiv f29,f29,f24
10001ef0: fmadd f3,f20,f20,f25
10001ef4: fmul f20,f13,f11
10001ef8: fmadd f7,f30,f31,f7
10001efc: stfd f7,-10680(r10)
10001f00: fmadd f12,f3,f31,f12
10001f04: fmul f13,f20,f6
10001f08: stfd f12,-10672(r10)
10001f0c: stfd f13,-10672(r3)
10001f10: addi r10,r10,16
10001f14: addi r3,r3,16
10001f18: stfd f28,-10672(r7)
10001f1c: stfd f29,-10664(r7)
10001f20: addi r7,r7,16
10001f24: bne 10001e10 <calc1_+0x1b0>

----------

Bad asm code for the same loop

10001a60: addis r27,r9,-435
10001a64: addis r12,r11,-2176
10001a68: lfd f13,-7440(r27)
10001a6c: lfd f10,28344(r12)
10001a70: addis r8,r11,-1958
10001a74: addis r10,r11,-1740
10001a78: fsub f7,f10,f13
10001a7c: lfd f8,-704(r8)
10001a80: lfd f10,0(r9)
10001a84: addis r7,r9,-218
10001a88: addis r28,r9,1523
10001a8c: lfd f9,-29752(r10)
10001a90: fadd f6,f12,f10
10001a94: fsub f2,f8,f0
10001a98: addis r12,r11,218
10001a9c: addis r27,r9,2176
10001aa0: fadd f5,f11,f9
10001aa4: fadd f11,f11,f12
10001aa8: addi r9,r9,8
10001aac: cmpw r6,r9
10001ab0: fmul f1,f7,f30
10001ab4: fmul f7,f13,f13
10001ab8: fmul f13,f13,f3
10001abc: fadd f31,f5,f6
10001ac0: lfd f5,29040(r7)
10001ac4: fmsub f2,f2,f29,f1
10001ac8: fmadd f1,f0,f0,f7
10001acc: fmul f0,f0,f3
10001ad0: fmul f6,f13,f6
10001ad4: stfd f6,-6728(r28)
10001ad8: fdiv f2,f2,f31
10001adc: fmadd f5,f5,f5,f1
10001ae0: fmul f31,f0,f11
10001ae4: fmr f0,f8
10001ae8: stfd f31,0(r11)
10001aec: fmr f11,f9
10001af0: addi r11,r11,8
10001af4: fadd f1,f5,f4
10001af8: fmr f4,f7
10001afc: fmadd f5,f1,f28,f12
10001b00: fmr f12,f10
10001b04: stfd f5,-28344(r27)
10001b08: stfd f2,-29040(r12)
10001b0c: bne+ 10001a60 <calc1_+0xe0>

----------

Comparing the two cases, the good code seems to traverse the loop differently
from the bad one: every load/store uses a small displacement, whereas the bad
case uses much bigger displacements. The good case also appears to have a
bigger unrolling factor (longer code, more loads) than the bad case.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40029
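For readers who prefer C to disassembly, here is a rough sketch of the two loop shapes described in the comment. It is NOT swim's calc1 body: the kernel, the function names (kernel_rolled, kernel_unrolled2) and the parameters are all hypothetical and chosen only for illustration. What it does mirror from the listings is the stride. The good loop bumps each pointer by 16 bytes per iteration (two doubles, with two fdiv and eight stfd), while the bad loop bumps by 8 (one double, with one fdiv and four stfd).

```c
#include <stddef.h>

/* Hypothetical kernel, not taken from swim:
 *     out[i] = k * (p[i] + p[i + 1]) * u[i]
 * p must hold n + 1 elements; out and u must hold n. */

/* Shape of the "bad" loop: one element per iteration, every array
 * addressed from a shared induction variable. */
void kernel_rolled(double *out, const double *p, const double *u,
                   double k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = k * (p[i] + p[i + 1]) * u[i];
}

/* Shape of the "good" loop: two elements per iteration, each array
 * walked by its own pointer, so every access is pointer + small
 * constant offset (0, 8 or 16 bytes here). */
void kernel_unrolled2(double *out, const double *p, const double *u,
                      double k, size_t n)
{
    size_t pairs = n / 2;

    while (pairs--) {
        /* p[1] is loaded once and reused by both results. */
        double p0 = p[0], p1 = p[1], p2 = p[2];

        out[0] = k * (p0 + p1) * u[0];
        out[1] = k * (p1 + p2) * u[1];

        /* Advance by two doubles, i.e. 16 bytes per pointer. */
        out += 2;
        p += 2;
        u += 2;
    }

    /* Leftover element when n is odd. */
    if (n & 1)
        out[0] = k * (p[0] + p[1]) * u[0];
}
```

Both functions compute the same values; which shape a compiler actually emits depends on its unrolling and induction-variable choices, so treat this only as a picture of the difference and not as a reproduction of what GCC did in either build.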