------- Comment #21 from hubicka at gcc dot gnu dot org  2008-02-05 23:54 -------
Looking at the -O2 and -O2 -fno-inline-small-functions outputs, I believe the
last remaining problem is our inability to hoist the load of the constant out
of the loop:
The fill loop without inlining takes the value as an argument:

.L7:
	fstl	(%eax)
	addl	$8, %eax
	cmpl	%eax, %edx
	jne	.L7
	fstp	%st(0)

With inlining, however, we constant propagate, which later results in a
load/store pair:

.L8:
	flds	.LC0
	fstpl	(%eax)
	addl	$8, %eax
	cmpl	%eax, %edx
	jne	.L8

This is done intentionally:

  /* Hoisting constant pool constants into stack regs may cost more than
     just single register.  On x87, the balance is affected both by the
     small number of FP registers, and by its register stack organization,
     that forces us to add compensation code in and around the loop to
     shuffle the operands to the top of stack before use, and pop them
     from the stack after the loop finishes.

     To model this effect, we increase the number of registers needed for
     stack registers by two: one register push, and one register pop.
     This usually has the effect that FP constant loads from the constant
     pool are not moved out of the loop.

     Note that this also means that dependent invariants can not be
     moved.  However, the primary purpose of this pass is to move loop
     invariant address arithmetic out of loops, and address arithmetic
     that depends on floating point constants is unlikely to ever
     occur.  */

Obviously this heuristic misbehaves in simple cases like this one, where no
other registers are carried over the loop.  Another obvious problem is that
it is in effect for SSE codegen too.  I am testing the following patch,
which solves the second problem:

Index: loop-invariant.c
===================================================================
*** loop-invariant.c	(revision 131965)
--- loop-invariant.c	(working copy)
*************** get_inv_cost (struct invariant *inv, int
*** 1012,1017 ****
--- 1012,1018 ----
        rtx set = single_set (inv->insn);
        if (set
  	  && IS_STACK_MODE (GET_MODE (SET_SRC (set)))
+ 	  && (!TARGET_SSE_MATH || !SSE_FLOAT_MODE_P (GET_MODE (SET_SRC (set))))
  	  && constant_pool_constant_p (SET_SRC (set)))
  	(*regs_needed) += 2;
      }

This cures the problem for -mfpmath=sse at least.
On 64bit targets this now does a good job.  For 32bit, however, we get
another transformation:

.L8:
	movl	$0, (%eax)
	movl	$1074266112, 4(%eax)
	addl	$8, %eax
	cmpl	%eax, %edx
	jne	.L8
.L2:

At least on Athlon this is slower due to a partial memory stall.  This can
be fixed by the following:

Index: config/i386/i386.md
===================================================================
*** config/i386/i386.md	(revision 131965)
--- config/i386/i386.md	(working copy)
***************
*** 2690,2704 ****
    [(set (match_operand:DF 0 "nonimmediate_operand"
  			"=f,m,f,*r  ,o  ,Y2*x,Y2*x,Y2*x ,m  ")
  	(match_operand:DF 1 "general_operand"
! 			"fm,f,G,*roF,F*r,C   ,Y2*x,mY2*x,Y2*x"))]
    "!(MEM_P (operands[0]) && MEM_P (operands[1]))
     && ((optimize_size || !TARGET_INTEGER_DFMODE_MOVES) && !TARGET_64BIT)
     && (reload_in_progress || reload_completed
         || (ix86_cmodel == CM_MEDIUM || ix86_cmodel == CM_LARGE)
         || (!(TARGET_SSE2 && TARGET_SSE_MATH) && optimize_size
  	   && standard_80387_constant_p (operands[1]))
         || GET_CODE (operands[1]) != CONST_DOUBLE
!        || memory_operand (operands[0], DFmode))"
  {
    switch (which_alternative)
      {
--- 2690,2708 ----
    [(set (match_operand:DF 0 "nonimmediate_operand"
  			"=f,m,f,*r  ,o  ,Y2*x,Y2*x,Y2*x ,m  ")
  	(match_operand:DF 1 "general_operand"
! 			"fm,f,G,*roF,*Fr,C   ,Y2*x,mY2*x,Y2*x"))]
    "!(MEM_P (operands[0]) && MEM_P (operands[1]))
     && ((optimize_size || !TARGET_INTEGER_DFMODE_MOVES) && !TARGET_64BIT)
     && (reload_in_progress || reload_completed
         || (ix86_cmodel == CM_MEDIUM || ix86_cmodel == CM_LARGE)
         || (!(TARGET_SSE2 && TARGET_SSE_MATH) && optimize_size
+ 	   && !memory_operand (operands[0], DFmode)
  	   && standard_80387_constant_p (operands[1]))
         || GET_CODE (operands[1]) != CONST_DOUBLE
!        || ((optimize_size
! 	    || !TARGET_MEMORY_MISMATCH_STALL
! 	    || reload_in_progress || reload_completed)
! 	   && memory_operand (operands[0], DFmode)))"
  {
    switch (which_alternative)
      {

Now, with SSE codegen, or with the STACK_REGS heuristic commented out, we
get a better score with -O2 than with -O2 -fno-inline-small-functions.
I guess the heuristic can be made more selective; currently, I think, it
simply disables all such hoists, which is just wrong.

Honza

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23322