> In an internal application I noticed that the ipa-inliner is quite
> sensitive to AFDO counts and that seems to make the performance worse.
> Did you notice this? This was before some of your changes. I will try
> again.
The cases I looked into were a mixture of late inlining and ipa-cp cloning being disabled by the 0 counts, as well as loop optimizations. Since the hotness of a loop is determined by the frequency of its header block, when we end up with a fake 0 there, we disable the code-expanding optimizations.

The code is obviously structured around the assumption that, before the AFDO instrumentation phase, it will be able to redo all the relevant inlining that happened when compiling the train run. If less or more inlining happens, the profile is lost. The current code has problems with this - the original Google setup iterated the early inliner many times and also ran within LIPO, which was able to early-inline cross-module by simply parsing extra sources when needed. The current code runs the early inliner once without VPT and then iterates the early inliner late, within the AFDO pass. This does not work very well, since the resulting CFG is unoptimized and thus has many basic blocks that will later disappear and for which AFDO again has no data. I have a WIP patch (where I still need to debug an ICE) to also do VPT during early opts, so we can do all non-cross-module inlining at that stage.

Even this is not quite safe, since inlining is non-transitive. If we inlined bar into the offline copy of foo, it does not mean that every inline copy of foo should have bar inlined as well. The way AFDO inlining works, we will first inline bar into foo and then consider inlining foo further. This means that bar will transitively be inlined, bypassing any code-size checks, and each time this happens the annotation pass will not have data for the inlined copy of bar. (I sketch a tiny example of this below, after the list of planned items.) An option is to make the AFDO inliner an IPA pass and re-run early optimizations on the functions that changed. I don't think this is necessarily that bad a solution, but it needs to be evaluated once things work better. I don't remember Google having a solution for this - I think with their large application, which trained just a very minor percentage of the binary, this was lost in the noise.

>
> > This is a bit wild, but hope things will settle down once we chase out
> > obvious problems (such as losing the profile of functions that have not
> > been inlined).
>
> AFAIU, the scaling of local_profile below is to get the local count
> comparable to the AFDO count. However, could we also extend propagation of
> the AFDO profile along BBs and along the CFG such that we minimise our
> reliance on local count?

There is propagation of AFDO data using Kirchhoff's law, but it gets stuck when some edge counts cannot be fully determined from the profile data. The original AFDO implementation had a max-cut based solver, which is theoretically the better solution and came from the original paper AFDO was based on. They decided to go for the easier approach, since auto-profile data has inconsistencies you do not want to propagate too globally. I think we can modify -fprofile-use so it can run after AFDO and provide data comparing the AFDO and real profiles.

In addition to working with you on the issues of the profile being lost with LTO, cloning and other cases, my plan is to
 1) finish the VPT reorganization
 2) make the AFDO reader scale up the profile, since at least in the data from
    SPEC or profiledbootstrap the counters are quite small integers, which
    makes further scaling produce 0s that break various heuristics
 3) implement local profiles for functions with a global AFDO count of 0, so
    we get hot/cold functions identified correctly again
 4) see how much the AFDO propagation can be improved; there are quite obvious
    limitations in the current code.
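To make the non-transitivity problem above concrete, here is a tiny made-up
example (foo, bar and baz are purely illustrative names, not code from any
real testcase):

/* At train time the early inliner put bar into the offline copy of foo,
   so the AFDO profile records bar as an inline instance within foo.  */
static int bar (int x) { return x * x; }
static int foo (int x) { return bar (x) + 1; }

int
baz (int x)
{
  /* When the AFDO inliner later inlines foo here, bar comes along
     transitively, bypassing the usual code-size checks, and the
     annotation pass has no samples for this new inline copy of bar.  */
  return foo (x);
}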
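For the propagation itself, the basic idea is flow conservation: for every
basic block the sum of the incoming edge counts must equal the block count,
which must equal the sum of the outgoing edge counts, so whenever all but one
of these quantities around a block is known, the missing one can be derived.
Below is a minimal worklist-driven sketch of that solving step, just to
illustrate the idea - the data structures are made up and this is not the
actual auto-profile code:

/* Hypothetical sketch of Kirchhoff-style count propagation driven by a
   worklist; not the actual GCC auto-profile implementation.  A real CFG
   would also model the fake entry/exit edges, omitted here.  */

#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

struct edge_info
{
  int src, dest;                    /* basic block indices  */
  std::optional<int64_t> count;     /* unknown until sampled or derived  */
};

struct block_info
{
  std::optional<int64_t> count;     /* from AFDO samples, or unknown  */
  std::vector<int> preds, succs;    /* indices into the edge array  */
};

/* Try to derive the single missing count among BB.count and the edges in
   EDGE_IDS (one side of the block).  Returns true on progress.  */
static bool
solve_one_side (block_info &bb, std::vector<edge_info> &edges,
                const std::vector<int> &edge_ids)
{
  int64_t known_sum = 0;
  int unknown = -1, n_unknown = 0;
  for (int e : edge_ids)
    {
      if (edges[e].count)
        known_sum += *edges[e].count;
      else
        {
          n_unknown++;
          unknown = e;
        }
    }
  if (!bb.count && n_unknown == 0)
    {
      bb.count = known_sum;
      return true;
    }
  if (bb.count && n_unknown == 1)
    {
      int64_t rest = *bb.count - known_sum;
      edges[unknown].count = rest > 0 ? rest : 0;
      return true;
    }
  return false;
}

/* Propagate until nothing more can be determined.  Blocks or edges whose
   counts are still unknown afterwards are the places where the propagation
   gets stuck and we have to guess.  */
void
propagate_counts (std::vector<block_info> &blocks,
                  std::vector<edge_info> &edges)
{
  std::deque<int> worklist;
  std::vector<bool> queued (blocks.size (), true);
  for (int i = 0; i < (int) blocks.size (); i++)
    worklist.push_back (i);

  while (!worklist.empty ())
    {
      int b = worklist.front ();
      worklist.pop_front ();
      queued[b] = false;

      bool progress = false;
      while (solve_one_side (blocks[b], edges, blocks[b].preds)
             || solve_one_side (blocks[b], edges, blocks[b].succs))
        progress = true;
      if (!progress)
        continue;

      /* Only the neighbours of B can have become solvable; re-queue them
         instead of rescanning the whole function.  */
      for (int e : blocks[b].preds)
        if (!queued[edges[e].src])
          {
            queued[edges[e].src] = true;
            worklist.push_back (edges[e].src);
          }
      for (int e : blocks[b].succs)
        if (!queued[edges[e].dest])
          {
            queued[edges[e].dest] = true;
            worklist.push_back (edges[e].dest);
          }
    }
}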
The current propagation is also slow, since instead of a worklist it simply iterates.

Hopefully after this stage AFDO will more or less work and we can look into performance issues...

https://lnt.opensuse.org/db_default/v4/SPEC/67738?compare_to=67761
compares afdo -Ofast -flto to -Ofast -flto with no feedback

https://lnt.opensuse.org/db_default/v4/SPEC/67738?compare_to=67753
compares afdo -Ofast -flto to real FDO -Ofast -flto

So the last runs are still closer to having no feedback. Many regressions are gone, but there are still some serious ones to look at:

  SPEC/SPEC2017/INT/520.omnetpp_r    20.62%
  SPEC/SPEC2017/FP/549.fotonik3d_r   19.16%
  SPEC/SPEC2017/FP/527.cam4_r        14.31%
  SPEC/SPEC2017/FP/510.parest_r      14.19%
  SPEC/SPEC2017/INT/500.perlbench_r  13.01%
  SPEC/SPEC2017/FP/511.povray_r      12.68%
  SPEC/SPEC2017/FP/503.bwaves_r       7.81%
  SPEC/SPEC2017/INT/505.mcf_r         7.29%
  SPEC/SPEC2017/FP/507.cactuBSSN_r    6.69%
  SPEC/SPEC2017/INT/502.gcc_r         6.15%

I think we will want to improve the profiling setup by running the train tasks multiple times, since we gather too little data. In my benchmarks I simply use the ref runs as train runs, which solves some of the regressions seen above (omnetpp and perlbench work well for me).

Honza