> In an internal application I noticed that the ipa-inliner is quite
> sensitive to AFDO counts and that seems to make the performance worse.
> Did you notice this? This was before some of your changes. I will try
> again.

The cases I looked into were a mixture of late inlining and ipa-cp
cloning being disabled by the 0 counts, as well as loop optimizations.
Since the hotness of a loop is determined by the frequency of its header
block, when we end up with a fake 0 there, we disable the code-expanding
optimizations.
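Roughly, the logic is along these lines (a simplified sketch only; the
types and the threshold are made up, not the actual GCC predicates):

  /* Sketch: hotness of a loop is derived from the count of its header
     block, so a fake 0 coming from AFDO makes the loop look cold and
     code-expanding transforms get skipped.  Not GCC's real types or
     thresholds.  */
  struct sketch_bb { long count; };
  struct sketch_loop { sketch_bb *header; };

  const long hot_count_threshold = 1000;   /* made-up value */

  bool
  sketch_loop_hot_p (const sketch_loop *l)
  {
    return l->header->count >= hot_count_threshold;
  }

  bool
  sketch_unroll_loop_p (const sketch_loop *l)
  {
    /* Unrolling, vectorization and similar code-expanding transforms
       are restricted to hot loops.  */
    return sketch_loop_hot_p (l);
  }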

The code is obviously structured around the assumption that it will be
able to redo all relevant inlines which happened when compiling the
train run before the AFDO instrumentation phase.  If less or more
inlining happens, the profile is lost.

The current code has problems with this - the original Google setup
iterated the early inliner many times and also ran within LIPO, which
was able to early-inline cross-module by simply parsing extra sources
when needed.  The current code runs the early inliner once without VPT
and then iterates the early inliner late within the AFDO pass.
This is not working very well, since the resulting CFG is unoptimized
and thus has many basic blocks that will later disappear and that AFDO
again has no data for.

I have a WIP patch (where I need to debug an ICE) to also do VPT during
early opts so we can do all non-cross-module inlining at that stage.
Even this is not quite safe, since inlining is non-transitive: the fact
that we inlined foo into the offline copy of bar does not mean that
every inline copy of bar should have foo inlined as well.

The way AFDO inlining works, we will first inline bar into foo and then
we will consider inlining foo further.  This means that transitively bar
will be inlined bypassing any code size checks.  Each time this happens,
the annotation pass will not have data for the inlined copy of bar.
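To make this concrete (a made-up example, not one of the testcases I
looked at):

  /* bar is small enough that the AFDO inliner happily puts it into foo;
     the AFDO profile has counts for that inlined copy.  */
  static int bar (int x) { return x * x; }

  static int foo (int x)
  {
    return bar (x) + 1;   /* bar inlined here, counts available */
  }

  int baz (int x)
  {
    /* If the AFDO inliner later inlines foo here, the already inlined
       body of bar comes along with it, bypassing the size checks that
       would apply to inlining bar into baz directly, and the annotation
       pass has no profile for this second copy of bar.  */
    return foo (x) + 2;
  }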

An option is to make the AFDO inliner an IPA pass and re-run early
optimizations on the functions that changed.  I don't think this is
necessarily a bad solution, but it needs to be evaluated once things
work better.
I don't remember Google having a solution for this - I think with their
large app that trained just a very minor percentage of the binary this
was lost in noise.
> 
> > This is a bit wild, but I hope things will settle down once we chase out
> > obvious problems (such as losing the profile of functions that have not
> > been inlined).
> 
> AFAIU, the scaling of local_profile below is to get the local count
> comparable to the AFDO count.  However, could we also extend propagation
> of the AFDO profile along BBs and along the CFG such that we minimise our
> reliance on local counts?

There is propagation of AFDO data using Kirchhoff's law, but it gets
stuck when some edge counts cannot be fully determined from the profile
data.  The original AFDO implementation had a max-cut based solver,
which is theoretically a better solution and came from the original
paper AFDO was based on.  They decided to go for the easier approach,
since auto-profile data has inconsistencies that you do not want to
propagate very globally.
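The propagation is essentially flow conservation: a block's count must
equal the sum of its incoming edge counts and also the sum of its
outgoing ones, so whenever only one quantity in such an equation is
unknown it can be solved for.  A hedged sketch of the idea (my own
simplified data structures, not the actual GCC code):

  #include <vector>
  #include <deque>

  /* Hypothetical, simplified CFG representation; not GCC's.  */
  struct cfg_edge { int src, dst; long count; bool known; };
  struct cfg_block { long count; bool known;
                     std::vector<int> in_edges, out_edges; };

  /* Kirchhoff-style propagation: a block count equals the sum of its
     incoming edge counts and also the sum of its outgoing edge counts.
     Whenever only one quantity in such an equation is unknown, solve
     for it and re-queue the neighbours.  Propagation gets stuck once
     every remaining equation has two or more unknowns.  */
  void
  propagate_counts (std::vector<cfg_block> &bbs, std::vector<cfg_edge> &edges)
  {
    std::deque<int> worklist;
    for (int b = 0; b < (int) bbs.size (); b++)
      worklist.push_back (b);

    while (!worklist.empty ())
      {
        int b = worklist.front ();
        worklist.pop_front ();
        cfg_block &bb = bbs[b];

        for (const std::vector<int> *side : { &bb.in_edges, &bb.out_edges })
          {
            if (side->empty ())
              continue;        /* entry/exit; real CFGs use fake edges */

            long sum = 0;
            int unknown = -1, n_unknown = 0;
            for (int e : *side)
              {
                if (edges[e].known)
                  sum += edges[e].count;
                else
                  {
                    unknown = e;
                    n_unknown++;
                  }
              }

            if (bb.known && n_unknown == 1)
              {
                /* One unknown edge: it must carry the difference.  AFDO
                   data is inconsistent, so cap the result at 0.  */
                edges[unknown].count = bb.count > sum ? bb.count - sum : 0;
                edges[unknown].known = true;
                worklist.push_back (edges[unknown].src);
                worklist.push_back (edges[unknown].dst);
              }
            else if (!bb.known && n_unknown == 0)
              {
                /* All edges on one side known: they determine the block.  */
                bb.count = sum;
                bb.known = true;
                worklist.push_back (b);
              }
          }
      }
  }

The real code additionally has to decide what to do when the two sides
of a block disagree, which is where the inconsistencies above bite.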

I think we can modify -fprofile-use so it can run after AFDO and provide
data comparing the AFDO and real profiles.

In addition to working with you on the issues of the profile being lost
with LTO, cloning and other cases, my plan is to
 1) finish the VPT reorganization
 2) make the AFDO reader scale up the profile, since at least in data
 from SPEC or profiledbootstrap the counters are quite small integers,
 which makes further scaling produce 0s that break various heuristics
 (see the example after this list)
 3) implement local profiles for functions with a global AFDO count of 0
 so we get hot/cold functions identified correctly again
 4) see how much the AFDO propagation can be improved.  There are quite
 obvious limitations in the current code.  It is also slow since it does
 iteration instead of using a worklist.
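For 2), the problem is just integer arithmetic on tiny counts; a
made-up example (the constants are not from real data):

  #include <cstdio>

  /* Scaling a count by a ratio the way profile code effectively does
     it, in integer arithmetic.  */
  static long
  scale_count (long count, long num, long den)
  {
    return count * num / den;
  }

  int
  main ()
  {
    /* 3 samples scaled by a call-site ratio of 1/7 rounds down to 0, so
       the block suddenly looks never executed and hot/cold heuristics
       misfire.  */
    printf ("%ld\n", scale_count (3, 1, 7));         /* prints 0 */

    /* Multiplying the whole profile up by a constant first (say 1000)
       keeps the ratios but avoids the fake 0s.  */
    printf ("%ld\n", scale_count (3 * 1000, 1, 7));  /* prints 428 */
    return 0;
  }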

Hopefully after this stage AFDO will more or less work and we can look
into performance issues...

https://lnt.opensuse.org/db_default/v4/SPEC/67738?compare_to=67761
compares afdo -Ofast -flto to -Ofast -flto with no feedback
https://lnt.opensuse.org/db_default/v4/SPEC/67738?compare_to=67753
compares afdo -Ofast -flto to real FDO -Ofast -flto

So the last runs are closer to having no feedback.  Many regressions are
gone, but there are still some serious ones to look at:

SPEC/SPEC2017/INT/520.omnetpp_r         20.62%
SPEC/SPEC2017/FP/549.fotonik3d_r        19.16%
SPEC/SPEC2017/FP/527.cam4_r             14.31%
SPEC/SPEC2017/FP/510.parest_r           14.19%
SPEC/SPEC2017/INT/500.perlbench_r       13.01%
SPEC/SPEC2017/FP/511.povray_r           12.68%
SPEC/SPEC2017/FP/503.bwaves_r           7.81% 
SPEC/SPEC2017/INT/505.mcf_r             7.29% 
SPEC/SPEC2017/FP/507.cactuBSSN_r        6.69% 
SPEC/SPEC2017/INT/502.gcc_r             6.15% 

I think we will want to improve the profiling setup by running the train
tasks multiple times, since we gather too little data.  In my benchmarks
I simply use the ref runs as train runs, which solves some of the
regressions seen above (omnetpp and perlbench work well for me).

Honza
