Hi Honza,
Thanks for the fixes.

> On 18 Jun 2025, at 1:48 am, Jan Hubicka <hubi...@ucw.cz> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> Hi,
> this patch makes afdo_adjust_guessed_profile more agressive on finding scales
> on the boundaries of connected components with no annotation.  Originaly I
> looked for edges into or out of the component with known AFDO counts and I 
> also
> haled edges from basic block with known AFDO count and known static 
> probability
> estimate.
> 
> Common problem is with components not containing any in edges, but only out
> edges (i.e.  those with ENTRY_BLOCK).  In this case I added logic that looks
> for edges out of the component to BBs with known AFDO count.  If all flow to
> the BB is either from the component or has AFDO count, we can deterine scale
> precisely.  It may happen that there are edges from other components. In this
> case we know upper bound and use it, since it is better than nothing.
> 
> I also noticed that some components have 0 count in all profile and then 
> scaling
> gives up, which is fixed.  I also optimized the code a bit by replacing
> map holding current component with an array holding component ID and broke out
> saling logic into separate functions.
> 
> The patch fixes perl regression I introduced in last change.
> according to
> https://lnt.opensuse.org/db_default/v4/SPEC/67674
> there were improvements (percentage is runtime change):
> 
> 538.imagick_r   -32.52%
> 549.fotonik3d_r -22.68%
> 520.omnetpp_r   -12.37%
> 503.bwaves_r    -8.71%
> 508.namd_r      -5.10%
> 526.blender_r   -2.11%
> 
> and regressions:
> 
> 554.roms_r      45.95%
> 527.cam4_r      21.69%
> 511.povray_r    13.59%
> 500.perlbench_r 10.19%
> 507.cactuBSSN_r 9.81%
> 510.parest_r    9.69%
> 548.exchange2_r 8.42%
> 502.gcc_r       5.10%
> 544.nab_r       3.76%
> 519.lbm_r       2.34%
> 541.leela_r     2.16%
> 525.x264_r      2.14%
> 
In an internal application I noticed that the ipa-inliner is quite sensitive to 
AFDO counts and that seems to make the performance worse. Did you notice this? 
This was before some of your changes. I will try again.

> This is a bit wild, but hope things will settle donw once we chase out
> obvious problems (such as losing the profile of functions that has not been
> inlined).

AFIU, the scalling of local_profile below is to get the local-count comparable 
to AFDO count. However, could we also extend propagation of AFDO profile along  
BB and along CFG such that we minimise our reliance on local count?. 

> 
> autofdobootstrapped/regtested x86_64-linux
> 
> gcc/ChangeLog:
> 
>        * auto-profile.cc (afdo_indirect_call): Compute speculative edge
>        probability.
>        (add_scale): Break out from ...
>        (scale_bbs): Break out from ...
>        (afdo_adjust_guessed_profile): ... here; use componet array instead of
>        current_component hash_map; handle components with only 0 profile;
>        be more agressive on finding scales along the boundary.
> 
> diff --git a/gcc/auto-profile.cc b/gcc/auto-profile.cc
> index 8918d1e9f09..10b4ec822f9 100644
> --- a/gcc/auto-profile.cc
> +++ b/gcc/auto-profile.cc
> @@ -1076,10 +1076,10 @@ afdo_indirect_call (gimple_stmt_iterator *gsi, const 
> icall_target_map &map,
>       fprintf (dump_file, "\n");
>     }
> 
> -  /* FIXME: Count should be initialized.  */
>   struct cgraph_edge *new_edge
> -      = indirect_edge->make_speculative (direct_call,
> -                                        profile_count::uninitialized ());
> +      = indirect_edge->make_speculative
> +               (direct_call,
> +                indirect_edge->count.apply_scale (99, 100));

Why this? Isn’t it supposed to bee the count for direct_call which should be 
max_iter->second?

Thanks,
Kugan

>   cgraph_edge::redirect_call_stmt_to_callee (new_edge);
>   gimple_remove_histogram_value (cfun, stmt, hist);
>   inline_call (new_edge, true, NULL, NULL, false);
> @@ -1549,6 +1549,60 @@ cmp (const void *a, const void *b)
>   return 0;
> }
> 
> +/* Add scale ORIG/ANNOTATED to SCALES.  */
> +
> +static void
> +add_scale (vec <sreal> *scales, profile_count annotated, profile_count orig)
> +{
> +  if (dump_file)
> +    {
> +      orig.dump (dump_file);
> +      fprintf (dump_file, " should be ");
> +      annotated.dump (dump_file);
> +      fprintf (dump_file, "\n");
> +    }
> +  if (orig.nonzero_p ())
> +    {
> +      sreal scale
> +       = annotated.guessed_local ()
> +               .to_sreal_scale (orig);
> +      if (dump_file)
> +       fprintf (dump_file, "    adding scale %.16f\n",
> +                scale.to_double ());
> +      scales->safe_push (scale);
> +    }
> +}
> +
> +/* Scale counts of all basic blocks in BBS by SCALE and convert them to
> +   IPA quality.  */
> +
> +static void
> +scale_bbs (const vec <basic_block> &bbs, sreal scale)
> +{
> +  if (dump_file)
> +    fprintf (dump_file, "  Scaling by %.16f\n", scale.to_double ());
> +  for (basic_block b : bbs)
> +    if (!(b->count == profile_count::zero ())
> +       && b->count.initialized_p ())
> +      {
> +       profile_count o = b->count;
> +       b->count = b->count.force_guessed () * scale;
> +
> +       /* If we scaled to 0, make it auto-fdo since that is treated
> +          less agressively.  */
> +       if (!b->count.nonzero_p () && o.nonzero_p ())
> +         b->count = profile_count::zero ().afdo ();
> +       if (dump_file)
> +         {
> +           fprintf (dump_file, "    bb %i count updated ", b->index);
> +           o.dump (dump_file);
> +           fprintf (dump_file, " -> ");
> +           b->count.dump (dump_file);
> +           fprintf (dump_file, "\n");
> +         }
> +      }
> +}
> +
> /* In case given basic block was fully optimized out, AutoFDO
>    will have no data about it.  In this case try to preserve static profile.
>    Identify connected components (in undirected form of CFG) which has
> @@ -1558,26 +1612,33 @@ cmp (const void *a, const void *b)
> void
> afdo_adjust_guessed_profile (bb_set *annotated_bb)
> {
> -  auto_sbitmap visited (last_basic_block_for_fn (cfun));
>   /* Basic blocks of connected component currently processed.  */
> -  auto_vec <basic_block, 20> bbs (n_basic_blocks_for_fn (cfun) + 1);
> +  auto_vec <basic_block, 20> bbs (n_basic_blocks_for_fn (cfun));
>   /* Scale factors found.  */
> -  auto_vec <sreal, 20> scales (n_basic_blocks_for_fn (cfun) + 1);
> -  auto_vec <basic_block, 20> stack (n_basic_blocks_for_fn (cfun) + 1);
> -
> -  bitmap_clear (visited);
> +  auto_vec <sreal, 20> scales;
> +  auto_vec <basic_block, 20> stack (n_basic_blocks_for_fn (cfun));
> 
>   basic_block seed_bb;
> -  FOR_BB_BETWEEN (seed_bb, ENTRY_BLOCK_PTR_FOR_FN (cfun),
> -                 EXIT_BLOCK_PTR_FOR_FN (cfun), next_bb)
> -   if (!is_bb_annotated (seed_bb, *annotated_bb)
> -       && bitmap_set_bit (visited, seed_bb->index))
> +  unsigned int component_id = 1;
> +
> +  /* Map from basic block to its component.
> +     0   is used for univisited BBs,
> +     1   means that BB is annotated,
> +     >=2 is an id of the component BB belongs to.  */
> +  auto_vec <unsigned int, 20> component;
> +  component.safe_grow (last_basic_block_for_fn (cfun));
> +  FOR_ALL_BB_FN (seed_bb, cfun)
> +    component[seed_bb->index]
> +       = is_bb_annotated (seed_bb, *annotated_bb) ? 1 : 0;
> +  FOR_ALL_BB_FN (seed_bb, cfun)
> +   if (!component[seed_bb->index])
>      {
> -       hash_set <basic_block> current_component;
> -
>        stack.quick_push (seed_bb);
> +       component_id++;
>        bbs.truncate (0);
>        scales.truncate (0);
> +       component[seed_bb->index] = component_id;
> +       profile_count max_count = profile_count::zero ();
> 
>        /* Identify connected component starting in BB.  */
>        if (dump_file)
> @@ -1588,19 +1649,33 @@ afdo_adjust_guessed_profile (bb_set *annotated_bb)
>           basic_block b = stack.pop ();
> 
>           bbs.quick_push (b);
> -          current_component.add (b);
> +          max_count = max_count.max (b->count);
> 
>           for (edge e: b->preds)
> -            if (!is_bb_annotated (e->src, *annotated_bb)
> -                && bitmap_set_bit (visited, e->src->index))
> -               stack.quick_push (e->src);
> +            if (!component[e->src->index])
> +              {
> +                 stack.quick_push (e->src);
> +                 component[e->src->index] = component_id;
> +              }
>           for (edge e: b->succs)
> -            if (!is_bb_annotated (e->dest, *annotated_bb)
> -                && bitmap_set_bit (visited, e->dest->index))
> -               stack.quick_push (e->dest);
> +            if (!component[e->dest->index])
> +              {
> +                 stack.quick_push (e->dest);
> +                 component[e->dest->index] = component_id;
> +              }
>         }
>        while (!stack.is_empty ());
> 
> +       /* If all blocks in components has 0 count, we do not need
> +         to scale, only we must convert to IPA quality.  */
> +       if (!max_count.nonzero_p ())
> +        {
> +          if (dump_file)
> +            fprintf (dump_file, "  All counts are 0; scale = 1\n");
> +          scale_bbs (bbs, 1);
> +          continue;
> +        }
> +
>        /* Now visit the component and try to figure out its desired
>          frequency.  */
>        for (basic_block b : bbs)
> @@ -1653,84 +1728,83 @@ afdo_adjust_guessed_profile (bb_set *annotated_bb)
>               }
>             else
>               {
> -                gcc_checking_assert (current_component.contains (e->src));
>                 current_component_count += e->count ();
> +                gcc_checking_assert (component[e->src->index] == 
> component_id);
>               }
>           if (boundary && current_component_count.initialized_p ())
>             {
> -              profile_count in_count = b->count - current_component_count;
>               if (dump_file)
> -                {
> -                  fprintf (dump_file, "    bb %i in count ", b->index);
> -                  in_count.dump (dump_file);
> -                  fprintf (dump_file, " should be ");
> -                  annotated_count.dump (dump_file);
> -                  fprintf (dump_file, "\n");
> -                }
> -              if (in_count.nonzero_p ())
> -                {
> -                  sreal scale
> -                    = annotated_count.guessed_local ()
> -                            .to_sreal_scale (in_count);
> -                  if (dump_file)
> -                    fprintf (dump_file, "    adding scale %.16f\n",
> -                             scale.to_double ());
> -                  scales.safe_push (scale);
> -                }
> +                fprintf (dump_file, "    bb %i in count ", b->index);
> +              add_scale (&scales,
> +                         annotated_count,
> +                         b->count - current_component_count);
>             }
>           for (edge e: b->succs)
>             if (AFDO_EINFO (e)->is_annotated ())
>               {
> -                profile_count out_count = e->count ();
> -                profile_count annotated_count = AFDO_EINFO (e)->get_count ();
>                 if (dump_file)
> -                  {
> -                    fprintf (dump_file, "    edge %i->%i count ",
> -                             b->index, e->dest->index);
> -                    out_count.dump (dump_file);
> -                    fprintf (dump_file, " should be ");
> -                    annotated_count.dump (dump_file);
> -                    fprintf (dump_file, "\n");
> -                  }
> -                if (out_count.nonzero_p ())
> -                  {
> -                    sreal scale
> -                      = annotated_count.guessed_local ()
> -                              .to_sreal_scale (out_count);
> -                    if (dump_file)
> -                      fprintf (dump_file, "    adding scale %.16f\n",
> -                               scale.to_double ());
> -                    scales.safe_push (scale);
> -                  }
> +                  fprintf (dump_file, "    edge %i->%i count ",
> +                           b->index, e->dest->index);
> +                add_scale (&scales, AFDO_EINFO (e)->get_count (), e->count 
> ());
> +              }
> +            else if (is_bb_annotated (e->dest, *annotated_bb))
> +              {
> +                profile_count annotated_count = e->dest->count;
> +                profile_count out_count = profile_count::zero ();
> +                bool ok = true;
> +                for (edge e2: e->dest->preds)
> +                  if (AFDO_EINFO (e2)->is_annotated ())
> +                    annotated_count -= AFDO_EINFO (e2)->get_count ();
> +                  else if (component[e->src->index] == component_id)
> +                    out_count += e->count ();
> +                  else if (e->probability.nonzero_p ())
> +                    {
> +                      ok = false;
> +                      break;
> +                    }
> +                if (!ok)
> +                  continue;
> +                if (dump_file)
> +                  fprintf (dump_file,
> +                           "    edge %i->%i has annotated sucessor; count ",
> +                           b->index, e->dest->index);
> +                add_scale (&scales, annotated_count, e->count ());
>               }
> 
>         }
> +
> +       /* If we failed to find annotated entry or exit edge,
> +         look for exit edges and scale profile so the dest
> +         BB get all flow it needs.  This is inprecise because
> +         the edge is not annotated and thus BB has more than
> +         one such predecessor.  */
> +       if (!scales.length ())
> +        for (basic_block b : bbs)
> +          if (b->count.nonzero_p ())
> +            for (edge e: b->succs)
> +              if (is_bb_annotated (e->dest, *annotated_bb))
> +                {
> +                  profile_count annotated_count = e->dest->count;
> +                  for (edge e2: e->dest->preds)
> +                    if (AFDO_EINFO (e2)->is_annotated ())
> +                      annotated_count -= AFDO_EINFO (e2)->get_count ();
> +                  if (dump_file)
> +                    fprintf (dump_file,
> +                             "    edge %i->%i has annotated sucessor;"
> +                             " upper bound count ",
> +                             b->index, e->dest->index);
> +                  add_scale (&scales, annotated_count, e->count ());
> +                }
>        if (!scales.length ())
> -        continue;
> +        {
> +          if (dump_file)
> +            fprintf (dump_file,
> +                     "  Can not determine count from the boundary; giving 
> up");
> +          continue;
> +        }
> +       gcc_checking_assert (scales.length ());
>        scales.qsort (cmp);
> -       sreal scale = scales[scales.length () / 2];
> -       if (dump_file)
> -        fprintf (dump_file, "  Scaling by %.16f\n", scale.to_double ());
> -       for (basic_block b : bbs)
> -        if (!(b->count == profile_count::zero ())
> -            && b->count.initialized_p ())
> -          {
> -            profile_count o = b->count;
> -            b->count = b->count.force_guessed () * scale;
> -
> -            /* If we scaled to 0, make it auto-fdo since that is treated
> -               less agressively.  */
> -            if (!b->count.nonzero_p () && o.nonzero_p ())
> -              b->count = profile_count::zero ().afdo ();
> -            if (dump_file)
> -              {
> -                fprintf (dump_file, "    bb %i count updated ", b->index);
> -                o.dump (dump_file);
> -                fprintf (dump_file, " -> ");
> -                b->count.dump (dump_file);
> -                fprintf (dump_file, "\n");
> -              }
> -          }
> +       scale_bbs (bbs, scales[scales.length () / 2]);
>      }
> }
> 

Reply via email to