> On Mon, Oct 8, 2012 at 11:04 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
> >> On Mon, Oct 8, 2012 at 4:50 AM, Dehao Chen <de...@google.com> wrote:
> >> > Attached is the updated patch. Yes, if we add a VRP pass before
> >> > profile pass, this patch would be unnecessary. Should we add a VRP
> >> > pass?
> >>
> >> No, we don't want VRP in early optimizations.
> >
> > I am not quite sure about that.  VRP
> >  1) makes branch prediction work better by doing jump threading early
> 
> Well ... but jump threading may need basic-block duplication which may
> increase code size.  Also VRP and FRE have pass ordering issues.
> 
> >  2) is, after FRE, the most effective tree pass at removing code, by my
> >     profile statistics.
> 
> We also don't have DSE in early opts.  I don't want to end up with the
> situation that we do everything in early opts ... we should do _less_ there
> (but eventually iterate properly when processing cycles).

Yep, I am not quite sure what the sanest variant is.  Missing simple jump
threading in early opts definitely confuses both the profile estimates and the
inline size estimates.  But I am also not thrilled by adding more passes to
early opts at all.  Also, the last time I looked into this, CCP missed a lot of
opportunities, which made VRP look artificially more useful than it is.
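
To give a concrete (made-up, not from any benchmark) example of the kind of
opportunity CCP misses but VRP catches: CCP only tracks constants, so it
cannot prove the inner test redundant, while VRP knows the range of i on that
path and can fold the test or thread the jump:

  int f (int i)
  {
    if (i > 10)
      {
        /* Always true when i > 10; VRP's range information lets it fold
           or thread this test, while CCP sees no constant and leaves it.  */
        if (i != 0)
          return i * 2;
      }
    return i;
  }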

I have a patch that somewhat improves profile updating after jump threading
(i.e. it re-does the profile for the simple cases), but jump threading is
still the most common reason for the profile becoming inconsistent after
expand.

On a related note, with -fprofile-report I can easily track how much code
each pass in the queue removed.  I was thinking about running this on Mozilla
at -O1 and removing those passes that do almost nothing.  Those are mostly
re-run passes, both at the GIMPLE and RTL levels.  Our pass manager is not
terribly friendly for controlling passes per repetition.
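
(Just to be explicit about the setup: the report is gathered simply by
compiling with the flag added, e.g.

  gcc -O3 -fprofile-report -c combine.c

The exact output layout depends on the GCC revision.)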

With the introduction of the -Og pass queue, do you think introducing an -O1
pass queue for the late tree passes (it would be quite short) is sane?  What
about the RTL level?  I guess we can split the queues for the RTL
optimizations, too.  All optimization passes prior to register allocation are
sort of optional, and I guess they are also -Og candidates.

I do, however, find having the queues duplicated three times a bit uncool,
too, but I guess it is the approach most compatible with the pass manager
organization.

At -O3 the most effective passes on combine.c
are:

cfg (because of cfg cleanup) -1.5474%
Early inlining -0.4991%
FRE -7.9369%
VRP -0.9321% (if run early), ccp does -0.2273%
tailr -0.5305%

After IPA
copyrename -2.2850% (it packs cleanups after inlining)
forwprop -0.5432%
VRP -0.9700% (if rerun after early passes, otherwise it is about 2%)
PRE -2.4123%
DOM -0.5182%

RTL passes
into_cfglayout -3.1400% (i.e. first cleanup_cfg)
fwprop1 -3.0467%
cprop -2.7786%
combine -3.3346%
IRA -3.4912% (i.e. the cost model prefers hard regs)
bbro -0.9765%

The numbers for tramp3d and the LTO-built cc1 binary are not that different.
Honza
