[Bug target/29256] [4.8/4.9/5/6 regression] loop performance regression

amker at gcc dot gnu.org Tue, 19 May 2015 19:21:55 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256


amker at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #52 from amker at gcc dot gnu.org ---
I don't understand powerpc assembly well, but this looks like the same problem
we see on aarch64/arm.  Ah, and we are even looking at the same function...

I think this is a general issue caused by an inconsistency between the tree
level ivopt and the rtl level loop unroller.  To be specific, it is about how
we handle induction variable registers after unrolling.

The core loop on aarch64, compiled with options "-O3 -funroll-all-loops
-mcpu=cortex-a57", gives the output below:

.L3:
        add     x2, x0, 16
        ldr     q16, [x17, x0]
        add     x10, x0, 32
        add     x9, x0, 48
        add     x8, x0, 64
        ldr     q17, [x17, x2]
        add     x3, x0, 80
        add     x6, x0, 96
        add     x5, x0, 112
        add     w1, w1, 8
        ldr     q19, [x17, x10]
        cmp     w1, w14
        ldr     q18, [x17, x9]
        ldr     q20, [x17, x8]
        ldr     q21, [x17, x3]
        ldr     q22, [x17, x6]
        ldr     q23, [x17, x5]
        str     q16, [x18, x0]
        add     x0, x0, 128
        str     q17, [x18, x2]
        str     q19, [x18, x10]
        str     q18, [x18, x9]
        str     q20, [x18, x8]
        str     q21, [x18, x3]
        str     q22, [x18, x6]
        str     q23, [x18, x5]
        bcc     .L3 

The tree ivopt dump is quite neat:

  <bb 6>:
  # ivtmp.16_28 = PHI <ivtmp.16_25(9), 0(5)>
  # ivtmp.19_42 = PHI <ivtmp.19_41(9), 0(5)>
  vect__4.13_62 = MEM[base: vectp_a.12_58, index: ivtmp.19_42, offset: 0B];
  MEM[base: vectp_c.15_63, index: ivtmp.19_42, offset: 0B] = vect__4.13_62;
  ivtmp.16_25 = ivtmp.16_28 + 1;
  ivtmp.19_41 = ivtmp.19_42 + 16;
  if (ivtmp.16_25 < bnd.7_36)
    goto <bb 9>;
  else
    goto <bb 7>;

  ...

  <bb 9>:
  goto <bb 6>;
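
To make the dump easier to read, here is a rough C model of the loop it
describes.  This is only a sketch: the name copy_two_ivs and the memcpy that
stands in for the 16-byte vector load/store are my assumptions, not something
taken from the test case.  The point is that ivopts keeps exactly two ivs, a
counter stepping by 1 and a byte offset stepping by 16.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of <bb 6>: "count" plays the role of ivtmp.16 and
   "off" plays the role of ivtmp.19.  */
void copy_two_ivs(const unsigned char *a, unsigned char *c, size_t bnd)
{
    size_t count = 0;   /* ivtmp.16: iteration counter, step 1 */
    size_t off = 0;     /* ivtmp.19: byte offset, step 16 */

    assert(bnd >= 1);   /* the dump tests the exit condition at the bottom */
    do {
        /* MEM[base: vectp_a, index: off] -> MEM[base: vectp_c, index: off] */
        memcpy(c + off, a + off, 16);
        count += 1;
        off += 16;
    } while (count < bnd);
}
```

Both memory references share the single offset iv, so the loop body needs only
two iv registers besides the two invariant bases.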

But at the rtl level, we have options like "-fsplit-ivs-in-unroller" and
"-fweb".  These two options try to split the long live range of induction
variables into separate ones.  Eventually, with the following fwprop and IRA,
we have multiple ivs for each original iv, which is why the assembly above
keeps x2, x10, x9, x8, x3, x6 and x5 alive alongside x0.
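
As an illustration of what "multiple ivs for each original iv" means, here is a
C model of a copy loop unrolled by 4 with the iv split per copy.  It is a
sketch under my own assumptions: the name copy_split_ivs, the int element type
and the unroll factor of 4 are hypothetical, and a C compiler is free to
optimize this differently, so read it as a picture of register usage rather
than as generated code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: every unrolled copy gets its own index variable,
   like x0, x2, x10 and x9 in the aarch64 loop.  */
void copy_split_ivs(const int *a, int *c, size_t n)
{
    assert(n % 4 == 0);             /* sketch only: no remainder loop */
    for (size_t i0 = 0; i0 < n; i0 += 4) {
        size_t i1 = i0 + 1;         /* split iv for copy 1 */
        size_t i2 = i0 + 2;         /* split iv for copy 2 */
        size_t i3 = i0 + 3;         /* split iv for copy 3 */

        /* All loads first, then all stores, so every split iv stays
           live across the whole body.  */
        int t0 = a[i0];
        int t1 = a[i1];
        int t2 = a[i2];
        int t3 = a[i3];
        c[i0] = t0;
        c[i1] = t1;
        c[i2] = t2;
        c[i3] = t3;
    }
}
```

Each extra index costs one add and one live register per unrolled copy, which
is the overhead visible in the assembly.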

I see two possible fixes here.  One is to implement a tree level unroller
before IVOPT and remove the rtl one.  The rtl one is rather aggressive, to the
point that we don't enable it by default at "-O3".
The other is to change how we handle unrolled ivs in the rtl unroller.  It
splits the unrolled iv to avoid a pseudo register with a long live range, since
that may affect rtl optimizers.  This assumption may have held before, but it
doesn't seem true to me nowadays, especially for induction variables.  On tree
level ivopts, we already assume that each iv occupies a register; ivs are also
intensively used and thus should live in one single hard register.  For this
specific case, we can refactor [base+index] out of the memory reference and use
[new_base], [new_base+4], [new_base+8], ... etc. in unrolling.  If tree ivopts
chooses the [reg+offset] addressing mode, we only need to generate an
instruction sequence like "[reg+offset], [reg+(offset+4)], [reg+(offset+8)]...
reg = reg + unrolled_times*step".
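
A C model of the second fix could look like the sketch below.  The name
copy_single_iv, the int element type and the unroll factor of 4 are my
assumptions for illustration only.  Each base is advanced once per unrolled
iteration and every copy uses a constant offset from it, which is the
"[reg+offset], [reg+(offset+4)], ..." shape described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: one pointer per array, constant offsets per copy,
   and a single update of unrolled_times*step at the end of the body.  */
void copy_single_iv(const int *a, int *c, size_t n)
{
    assert(n % 4 == 0);             /* sketch only: no remainder loop */
    const int *pa = a;              /* new_base for the loads */
    int *pc = c;                    /* new_base for the stores */
    const int *end = a + n;

    while (pa != end) {
        int t0 = pa[0];             /* [reg] */
        int t1 = pa[1];             /* [reg + 4] in bytes */
        int t2 = pa[2];             /* [reg + 8] */
        int t3 = pa[3];             /* [reg + 12] */
        pc[0] = t0;
        pc[1] = t1;
        pc[2] = t2;
        pc[3] = t3;
        pa += 4;                    /* reg = reg + unrolled_times*step */
        pc += 4;
    }
}
```

No per-copy index registers are needed here; the offsets fold into the
addressing mode, so the iv count stays the same as before unrolling.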

Thanks,
bin

[Bug target/29256] [4.8/4.9/5/6 regression] loop performance regression
