https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173

--- Comment #16 from Jiong Wang <jiwang at gcc dot gnu.org> ---
After some work on this issue, things have gone beyond my expectations. To
summarize my understanding of this issue and what I have got:

Reason of regression
======
  the testcase contains a loop which is hotcode, and there is one critical 
loop invariant, actually base address of local array, if it can't be ivopted
then there will be big performance regression on this testcase.

  because of one gcc tree level ivopt issue, this variable will not be 
ivopted on 64bit machine, mips, powerpc, aarch64 etc.

  mostly because there some type precision issues in gcc tree level code that 
gcc is doing too conservative decision on overflow checking and thus abort the
ivopt in early stage. For details, please see above discussion.

  Then why there is regression on AArch64 since

     r213488 "[AArch64] Improve TARGET_LEGITIMIZE_ADDRESS_P hook" ?

  it's because except tree level ivopt pass, gcc also do extra ivopt on
rtl level by the rtl pre pass. previously, we have inefficient implementation
on aarch64 TARGET_LEGITIMIZE_ADDRESS_P while this inefficient implementation
happen to generated some instruction pattern that rtl level ivopt could catch 
and that critical variable finally ivopted by rtl pass.

  r213488 fixed the inefficiency of TARGET_LEGITIMIZE_ADDRESS_P, and as a
side-effect, we will not generate that specific instruction pattern again, thus
no ivopt on rtl pre, and thus regression happen.

  before r213488, ivopt happen on AArch64 only, while after r213488, AArch64
act the same as all other 64bit targets, mips64, powerpc64, sparc64 etc.

To Fix
======
  * the idea way is fix this issue on tree-ssa-loop-ivopt pass.
    We need to correct type precision & overflow checking issue, then gcc could
    ivopt the critical variable as early as in tree level for all 64bit
targets.

  * gcc rtl level ivopt could be improved also. we could re-shuffle rtl
    pattern, then rtl pre pass could catch more ivopt opportunities and
    act as a supplement to tree ivopt.

What I have tried
=================
  I have tried to fix this issue on rtl level. We need two patches.

    * one patch to correct aarch64 TARGET_LEGITIMIZE_ADDRESS hook to generate
      more efficient rtl pattern for array indexed address.

    * one patch to re-shuffle rtl pattern to let rtl pre pass catch more ivopt
      opportunities.

  For targets except AArch64, patch 2 alone is enough to let ivopt happen in
rtl
pre pass.

  While for AArch64, we need to correct TARGET_LEGITIMIZE_ADDRESS first, to   
don't split valid "offset" into "magic-base + offset" which make the rtl
sequences unnecessarily complexer and cause troubles for later rtl pass.

  but this fix on AArch64 TARGET_LEGITIMIZE_ADDRESS hook then exposed another
issue on tree-ssa-loop-ivopt. When running benchmarks, I found some post-index
addressing opportunities are missing after we correct
TARGET_LEGITIMIZE_ADDRESS.
The correct of TARGET_LEGITIMIZE_ADDRESS will actually correct address cost
which tree-ssa-loop-ivopt will query, and looks like there are glitches in
tree-ssa-loop-ivopt cost model that gcc ivopt some simple post-index
addressable
variable, and thus generated sub-optimal code.

  And the patch for re-shuffle rtl pattern alone,
https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00355.html, has critical problem.
It's unsafe to re-use those REG_NOTE info in those places. So the patch needs
re-work, or we need to
investigate a better solution to fix this on rtl level.

Future work
===========
  I hope someone could have a looks at the tree level fix which is the ideal 
fix.

  For RTL level fix, now the issue dependent chain for my solution becomes:

    fix-on-rtl-level
      \
       v
      re-shuffle insn for rtl pre
        \
         v
        fix AArch64 TARGET_LEGITIMIZE_ADDRESS (B)
          \
           v
          fix tree-ssa-loop-ivopt cost model to
          make better decision between ivopt
          and post-index. (A)

I think task A and B are important as they are quite general improvements to
AArch64 and gcc generic code besides this testcase issue.

I'd prepare a testcase and create a seperate ticket for task A first.

And, I don't think we can fix this issue in 5.0 release, so move the target to
the next release?

Thanks.

Reply via email to