> On 13/06/25 14:51, Jan Hubicka wrote:
> > > From: Dhruv Chawla <dhr...@nvidia.com>
> > Hi,
> > > 
> > > For reasons explained in the patch, this patch prevents the loss of
> > > profile information when inlining occurs in the profiled binary but
> > > not in the auto-profile pass as a decision. As an example, for this
> > > code:
> > 
> > I was wondering about this problem too
> > > - Annotation, merging and inlining form a messy set of dependencies in
> > >    the auto-profile pass. The order that functions get annotated in
> > >    affects the decisions that the inliner makes, but the order of
> > >    visiting them is effectively random due to the use of
> > >    FOR_EACH_FUNCTION.
> > > 
> > > - The main issue is that annotation is performed after inlining. This is
> > >    meant to more accurately mirror the hot path in the profiled binary,
> > >    however there is no guarantee of this because of the randomness in the
> > >    order of visitation.
> > I thought the extra early inlining invocation just queries the AFDO
> > data, not the annotated function body (i.e. it is done
> > inter-procedurally, all before the annotation starts).
> > 
> > I.e. we do
> > 1) read afdo gcov file
> > 2) do regular early optimizations
> > 3) do the extra early-inliner invocation of afdo pass
> > 4) annotate CFG
> 
> Unfortunately not:
> 
> auto_profile (void)
> {
>   <...>
>   FOR_EACH_FUNCTION (node)
>   {
>     <...>
>     unsigned int todo = 0;
>     for (int i = 0; i < 10; i++)
>       {
>       if (!flag_value_profile_transformations
>           || !autofdo::afdo_vpt_for_early_inline (&promoted_stmts))
>         break;
>       todo |= early_inline ();
>       }
> 
>     todo |= early_inline ();
>     autofdo::afdo_annotate_cfg (promoted_stmts);
>     compute_function_frequency ();
> 
> The early inliner is invoked on each function before it is annotated. It
> also looks like the pass aggressively tries to do VPT before annotation.

Yes, I think we first load auto-fdo data (at invocation of the compiler)
and then during normal early optimization we early inline direct calls
and already use afdo data.  The code above is for inlining indirect
calls (which is why the pass does VPT), a less common scenario.  I agree
that it would make sense to split this into full IPA passes, i.e.:

pass 1:

  FOR_EACH_FUNCTION
    VPT + early inlining of indirect calls + possibly re-do early
    optimizations when that happened.

pass 2:

  Look up all inlined instances in the AFDO profile and, if inlining did
  not happen, merge them into the corresponding offline copies (as you
  do) and possibly release them.

pass 3:
  FOR_EACH_FUNCTION
    autofdo::afdo_annotate_cfg (promoted_stmts);
    compute_function_frequency ();
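
To make the split concrete, here is a rough toy model of the three
passes.  It is only a sketch under my own assumptions, not GCC code:
ToyProfile, InlinedInstance, CallEdge and the pass1/pass2/pass3
functions are made-up names, and `can_inline` stands in for whatever the
early inliner decides.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model; none of these types exist in GCC.
struct InlinedInstance {
  std::string caller;  // function whose profile contains the instance
  std::string callee;  // function that was inlined in the train run
  long samples;        // samples attributed to the inlined copy
};

struct ToyProfile {
  std::map<std::string, long> offline;   // samples of standalone copies
  std::vector<InlinedInstance> inlined;  // instances inlined in train run
};

using CallEdge = std::pair<std::string, std::string>;

// Pass 1: try to early inline every call site that the profile recorded
// as inlined.  Returns the edges that really got inlined.
std::set<CallEdge> pass1_early_inline(const ToyProfile &p,
                                      const std::set<CallEdge> &can_inline) {
  std::set<CallEdge> done;
  for (const InlinedInstance &i : p.inlined) {
    CallEdge e(i.caller, i.callee);
    if (can_inline.count(e))
      done.insert(e);
  }
  return done;
}

// Pass 2: instances that were not inlined are merged into the offline
// copy of the callee and released from the caller's profile.
void pass2_merge(ToyProfile &p, const std::set<CallEdge> &done) {
  std::vector<InlinedInstance> kept;
  for (const InlinedInstance &i : p.inlined) {
    if (done.count(CallEdge(i.caller, i.callee)))
      kept.push_back(i);
    else
      p.offline[i.callee] += i.samples;
  }
  p.inlined = std::move(kept);
}

// Pass 3: "annotate".  Every function now sees final, merged counts,
// independent of the order in which functions are visited.
std::map<std::string, long> pass3_annotate(const ToyProfile &p) {
  std::map<std::string, long> counts = p.offline;
  for (const InlinedInstance &i : p.inlined)
    counts[i.caller] += i.samples;
  return counts;
}
```

The point of the sketch is only the ordering: since pass 2 finishes for
the whole unit before pass 3 starts, no function is annotated from a
profile that is still missing merged samples.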

In the longer term, it would be nice to be able to load FDO and auto-FDO
at LTO link-time.  That would make usage easier (and faster), since one
can just re-link with LTO instead of rebuilding the whole tree.  It would
also give us a chance to solve problems we have with cross-module
inlining.  Those exist both in -fprofile-use and auto-profile paths,
since with -fprofile-use we may mix up comdats.

To implement that we could stream the CFG separately from the rest of
the function body and be able to load it at WPA time.  It would probably
need restructuring the CFG representation into a leaner base that is
used for this and on which we can do some basic algorithms, like
dominance computation.
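
Just to illustrate how little such a lean base needs: below is a
minimal sketch of the textbook iterative dominator computation on a CFG
that is nothing but a block count and an edge list.  LeanCfg and
compute_dominators are hypothetical names of mine, block 0 is assumed to
be the entry, and every block is assumed to be reachable from it.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical lean CFG: blocks are 0..num_blocks-1, block 0 is the
// entry, and edges are (source, destination) pairs.
struct LeanCfg {
  int num_blocks;
  std::vector<std::pair<int, int>> edges;
};

// Returns dom where dom[b][d] is true iff block d dominates block b.
// Iterative data-flow formulation:
//   dom(entry) = {entry}
//   dom(b)     = {b} + intersection of dom(p) over all predecessors p
std::vector<std::vector<bool>> compute_dominators(const LeanCfg &cfg) {
  const int n = cfg.num_blocks;
  std::vector<std::vector<int>> preds(n);
  for (const auto &e : cfg.edges)
    preds[e.second].push_back(e.first);

  // Start from "everything dominates everything" except for the entry.
  std::vector<std::vector<bool>> dom(n, std::vector<bool>(n, true));
  dom[0].assign(n, false);
  dom[0][0] = true;

  bool changed = true;
  while (changed) {
    changed = false;
    for (int b = 1; b < n; ++b) {
      std::vector<bool> next(n, true);
      for (int p : preds[b])
        for (int d = 0; d < n; ++d)
          next[d] = next[d] && dom[p][d];
      next[b] = true;  // a block always dominates itself
      if (next != dom[b]) {
        dom[b] = next;
        changed = true;
      }
    }
  }
  return dom;
}
```

This is quadratic in the number of blocks and only meant to show that
nothing beyond the edge list is required; a real implementation would
presumably use a faster algorithm.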

> 
> Because the early inliner is invoked while annotation is being done,
> it's possible that all known information has not been propagated to the
> total_count of the function_instance when the early inliner is invoked.
> 
> Another problem here is that get_inline_stack returns an empty stack if
> no inlining occurred in the corresponding GIMPLE statement. So if an
Hmm, so we have a bug here?
afdo_callsite_hot_enough_for_early_inline should return true if called
on a non-inlined call edge that is inlined in the train run and has
enough samples in it to be considered hot.

Note that in deeper inline chains
  foo->bar->bar2->bar3
we may need to iterate the early inliner, since afdo inlining is
top-down.  It will not inline bar2->bar3, since if bar2 is fully inlined
it will not even have an afdo profile.  It will also not inline
bar->bar2 for the same reason.  When processing foo->bar it should
notice the profile, see that the chain was inlined in the train run and
is hot, and inline foo->bar.  We should then proceed by checking
foo->bar2, but that needs iteration.
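
A toy illustration of why iteration is needed (a sketch only;
ProfileNode and inline_top_down are made-up names, and the profile is
modelled as a plain tree of nested inlined instances).  Each iteration
stands for one early inliner run, which only sees the call sites that
the previous run exposed in foo's body:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a nested AFDO instance: the samples of one
// inlined copy plus the instances that were inlined into it in turn.
struct ProfileNode {
  std::string name;
  long samples;
  std::vector<ProfileNode> inlined;
};

// Returns the callees inlined into `root`, in the order they get
// inlined.  A call site is only visible once its caller's body is part
// of `root`, so a chain of depth N needs N iterations.
std::vector<std::string> inline_top_down(const ProfileNode &root,
                                         long hot_threshold,
                                         int max_iterations) {
  std::vector<std::string> inlined;
  std::vector<const ProfileNode *> frontier;
  frontier.push_back(&root);
  for (int iter = 0; iter < max_iterations && !frontier.empty(); ++iter) {
    std::vector<const ProfileNode *> next;
    for (const ProfileNode *ctx : frontier)
      for (const ProfileNode &callee : ctx->inlined)
        if (callee.samples >= hot_threshold) {
          inlined.push_back(callee.name);
          next.push_back(&callee);  // its call sites show up next round
        }
    frontier = std::move(next);
  }
  return inlined;
}
```

With the foo->bar->bar2->bar3 chain and a single iteration, only bar
gets inlined; bar2 and bar3 are picked up in the second and third
iterations.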

> inline callsite does exist in the profile at the current GIMPLE
> statement but no inlining actually occurs during auto-profile, the
> information is just dropped.

Yep, we want to merge in that case, but only if we do not early inline
for whatever reason.

I would still keep the logic that attempts to make inlining + early
optimization happen before annotation.  It is useful - updating the
profile after inlining is inherently imprecise, since code is
specialized for a given context and the combined profile is no longer
valid.

One feature I was thinking of was to implement context dependency in
-fprofile-use as well to solve this problem.  With all the C++
abstraction it is very common for non-trivial functions (i.e. not small
enough to early inline) to behave very differently in different contexts
(such as small versus big std::vectors etc.)

Afdo gives us some context sensitivity for free and we ought to make use
of it.

Honza
