[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

rguenth at gcc dot gnu.org Wed, 28 Feb 2018 02:19:39 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org

--- Comment #21 from Richard Biener <rguenth at gcc dot gnu.org> ---
One major flaw here is that IVOPTs does nothing on this loop because it doesn't
find any induction variables.  So basically this is highlighting the fact that
we leave addressing mode computation to the RTL combiner.  SLSR was supposed
to be driving the way to do addressing mode computation on non-loopy code,
and what it does is make using autoinc/dec easier (but as you saw, later
forwprop passes somewhat wreck this).  It would be nice if IVOPTs
could consider 'ind' as "induction variable" and be able to do addressing
mode selection on scalar code.  Bin, what would be required to do this?
We'd basically consider all SSA names used in addresses as IVs (or maybe
only multi-uses?).
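
To make the multi-use case concrete, here is a hypothetical sketch (not the
testcase from this PR; the function and parameter names are made up for
illustration) of straight-line code where a single index feeds several
addresses.  There is no loop, so there is nothing for IVOPTs to pick up as an
IV, yet 'ind' plays much the same role for addressing as an IV would:

```c
#include <stddef.h>

/* Hypothetical example: 'ind' is not an induction variable (there is
   no loop), but it is used in three address computations.  */
double
sum3 (const double *base, size_t ind)
{
  /* With 8-byte doubles each access is base + ind*8 + {0, 8, 16}, so
     a pass doing addressing mode selection on scalar code could
     choose one shared base/index form for all three loads.  */
  return base[ind] + base[ind + 1] + base[ind + 2];
}
```

Treating only multi-use names like 'ind' here as candidates would presumably
keep the candidate set small, but that is a guess rather than something
measured.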

So my suggestion would be to see if you can make SLSR generate TARGET_MEM_REFs
based on some common infrastructure with IVOPTs.

I also wonder why we have

        leaq    0(,%rdx,8), %rax
        movsd   (%rbx,%rdx,8), %xmm1
        addsd   8(%rbx,%rax), %xmm1
...

and not

       leaq    0(,%rdx,8), %rax
       movsd   (%rbx,%rax), %xmm1
       addsd   8(%rbx,%rax), %xmm1

possibly RTL fwprop at work.  And then the question is whether using
8(%rbx,%rdx,8) would really be better -- IIRC the most complex addressing
modes need more uops.  Of course CSEing 0(%rbx,%rdx,8) would have enabled
using (%rax), 8(%rax), etc., as Jeff says.
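
At the source level the two shapes look roughly like the sketch below.  This
is only an illustration, assuming x86-64 and 8-byte double; the function
names are hypothetical, and which addressing modes the compiler actually
emits for either form depends on the GCC version and flags:

```c
#include <stddef.h>

/* Form A: each access is expressed as base + idx*8 + offset, which
   corresponds to scaled-index modes such as (%rbx,%rdx,8) and
   8(%rbx,%rdx,8).  */
double
add_pair_indexed (const double *base, size_t idx)
{
  return base[idx] + base[idx + 1];
}

/* Form B: the common address base + idx*8 is computed once, so the
   two accesses only need the simple modes (%rax) and 8(%rax).  */
double
add_pair_cse (const double *base, size_t idx)
{
  const double *p = base + idx;
  return p[0] + p[1];
}
```

Both functions compute the same value; the only difference is whether the
shared address is exposed as a single subexpression.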

So for the reassoc pass, its major issue is that it works locally (per
single-use chain) rather than globally in conjunction with CSE.  The
only way it enables CSE is by sorting the chain against common criteria,
but those criteria are similarly "local".  But I'm not aware of any
global CSE + reassoc implementations / papers.
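
A small hypothetical example of the local-vs-global problem (names are made
up, and this makes no claim about what the reassoc pass does with this
exact input).  Unsigned arithmetic is used so that reordering the additions
is always valid:

```c
/* As written: the chains are (a + b) + c and (b + c) + d.  The first
   computes a + b and the second b + c, so no subexpression is shared
   and CSE has nothing to work with.  */
void
chains_as_written (unsigned a, unsigned b, unsigned c, unsigned d,
                   unsigned *x, unsigned *y)
{
  *x = (a + b) + c;
  *y = (b + c) + d;
}

/* What a global view would want to reach: both chains are ordered so
   that b + c is computed once and reused.  */
void
chains_shared (unsigned a, unsigned b, unsigned c, unsigned d,
               unsigned *x, unsigned *y)
{
  unsigned t = b + c;   /* common subexpression */
  *x = a + t;
  *y = t + d;
}
```

Ordering each chain on its own only produces the second form if the local
ordering happens to place b and c next to each other in both chains.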

Yes, I still want to explore "lowering" GIMPLE to -fwrapv (as RTL is) late
in the pipeline (after loop and late VRP, some pass shuffling is necessary).

[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower
