------- Comment #21 from hubicka at gcc dot gnu dot org  2008-02-05 23:54 -------
Looking at the -O2 and -O2 -fno-inline-small-functions outputs, I believe the
last remaining problem is our inability to hoist the load of the constant out
of the loop:
The fill loop without inlining takes the value as an argument:

.L7:
	fstl	(%eax)
	addl	$8, %eax
	cmpl	%eax, %edx
	jne	.L7
	fstp	%st(0)

With inlining, however, we constant propagate, which later results in a
load/store pair:

.L8:
	flds	.LC0
	fstpl	(%eax)
	addl	$8, %eax
	cmpl	%eax, %edx
	jne	.L8

This is done intentionally:

  /* Hoisting constant pool constants into stack regs may cost more than
     just single register.  On x87, the balance is affected both by the
     small number of FP registers, and by its register stack organization,
     that forces us to add compensation code in and around the loop to
     shuffle the operands to the top of stack before use, and pop them
     from the stack after the loop finishes.

     To model this effect, we increase the number of registers needed for
     stack registers by two: one register push, and one register pop.
     This usually has the effect that FP constant loads from the constant
     pool are not moved out of the loop.

     Note that this also means that dependent invariants can not be
     moved.  However, the primary purpose of this pass is to move loop
     invariant address arithmetic out of loops, and address arithmetic
     that depends on floating point constants is unlikely to ever
     occur.  */

Obviously this heuristic misbehaves in simple cases like this one, where no
other registers are carried over the loop.  Another obvious problem is that
it is in effect for SSE codegen too.  I am testing the following patch,
which solves the second problem:

Index: loop-invariant.c
===================================================================
*** loop-invariant.c	(revision 131965)
--- loop-invariant.c	(working copy)
*************** get_inv_cost (struct invariant *inv, int
*** 1012,1017 ****
--- 1012,1018 ----
        rtx set = single_set (inv->insn);
        if (set
  	  && IS_STACK_MODE (GET_MODE (SET_SRC (set)))
+ 	  && (!TARGET_SSE_MATH || !SSE_FLOAT_MODE_P (GET_MODE (SET_SRC (set))))
  	  && constant_pool_constant_p (SET_SRC (set)))
  	(*regs_needed) += 2;
      }

This cures the problem for -mfpmath=sse at least.
On 64bit targets this now does a good job.  For 32bit, however, we get
another transformation:

.L8:
	movl	$0, (%eax)
	movl	$1074266112, 4(%eax)
	addl	$8, %eax
	cmpl	%eax, %edx
	jne	.L8
.L2:

At least on Athlon this is slower due to a partial memory stall.  This can
be fixed by the following:

Index: config/i386/i386.md
===================================================================
*** config/i386/i386.md	(revision 131965)
--- config/i386/i386.md	(working copy)
***************
*** 2690,2704 ****
    [(set (match_operand:DF 0 "nonimmediate_operand"
  			"=f,m,f,*r  ,o  ,Y2*x,Y2*x,Y2*x ,m  ")
  	(match_operand:DF 1 "general_operand"
! 			"fm,f,G,*roF,F*r,C   ,Y2*x,mY2*x,Y2*x"))]
    "!(MEM_P (operands[0]) && MEM_P (operands[1]))
     && ((optimize_size || !TARGET_INTEGER_DFMODE_MOVES) && !TARGET_64BIT)
     && (reload_in_progress || reload_completed
         || (ix86_cmodel == CM_MEDIUM || ix86_cmodel == CM_LARGE)
         || (!(TARGET_SSE2 && TARGET_SSE_MATH) && optimize_size
  	   && standard_80387_constant_p (operands[1]))
         || GET_CODE (operands[1]) != CONST_DOUBLE
!        || memory_operand (operands[0], DFmode))"
  {
    switch (which_alternative)
      {
--- 2690,2708 ----
    [(set (match_operand:DF 0 "nonimmediate_operand"
  			"=f,m,f,*r  ,o  ,Y2*x,Y2*x,Y2*x ,m  ")
  	(match_operand:DF 1 "general_operand"
! 			"fm,f,G,*roF,*Fr,C   ,Y2*x,mY2*x,Y2*x"))]
    "!(MEM_P (operands[0]) && MEM_P (operands[1]))
     && ((optimize_size || !TARGET_INTEGER_DFMODE_MOVES) && !TARGET_64BIT)
     && (reload_in_progress || reload_completed
         || (ix86_cmodel == CM_MEDIUM || ix86_cmodel == CM_LARGE)
         || (!(TARGET_SSE2 && TARGET_SSE_MATH) && optimize_size
+ 	   && !memory_operand (operands[0], DFmode)
  	   && standard_80387_constant_p (operands[1]))
         || GET_CODE (operands[1]) != CONST_DOUBLE
!        || ((optimize_size
! 	    || !TARGET_MEMORY_MISMATCH_STALL
! 	    || reload_in_progress || reload_completed)
! 	   && memory_operand (operands[0], DFmode)))"
  {
    switch (which_alternative)
      {

Now, with SSE codegen, or with the STACK_REGS heuristic commented out, we
get a better score with -O2 than with -O2 -fno-inline-small-functions.
I guess the heuristic can be made more selective; currently, I think, it
simply disables all such hoists, which is just wrong.

Honza

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23322