Hi Honza,
On Sun, Jun 29, 2025 at 10:45 PM Jan Hubicka <[email protected]> wrote:
>
> >
> >
> > > On 24 Jun 2025, at 7:43 pm, Jan Hubicka <[email protected]> wrote:
> > >
> > > External email: Use caution opening links or attachments
> > >
> > >
> > > Hi,
> > > this pass removes early-inlining from afdo pass since all inlining
> > > should now happen from early inliner. I tedted this on spec and there
> > > are 3 inlines happening here which are blocked at early-inline time by
> > > hitting large function growth limit. We probably want to bypass that
> > > limit, I will look into that incrementaly.
> >
> > Thanks for doing this. Is the inlining difference here is due to annotation
> > that happens in auto-profile pass in the earlier implementation?
> >
> > One unrelated question about scaling profiles. We seem to scale-up AFDO
> > with and_count_scale and scale down local_profile in some other cases.
> > Should we instead scale up AFDO profile to local_profile scale. Lot of the
> > inlining and other parameters seem to work well with that.
>
> Hi,
> I implemented simple afdo/fdo profile comparator. I think we will need to
> figure out
> what to print and what to look for. It currently records the afdo, fdo count
> pairs
> and then try to scale afdo counts to fdo counts printng differences.
I tried your patch on the current master and when I try running as you
mentioned, it looks like PGO counters are not read. I am seeing
something like (fdo is always zero):
update_stmt.part.0/2212 bb 0 (cold) afdo 3400152 (guessed) scaled 0
(guessed) fdo 0 (precise) diff 0, +0.00%
update_stmt.part.0/2212 bb 4 (cold) afdo 3400152 (auto FDO) scaled 0
(guessed) fdo 0 (precise) diff 0, +0.00%
Also, the patch didn't apply cleanly to the current master. Is the
patch complete?
Thanks,
Kugan
>
> It prints info as follows:
>
> brute/32 bb 60 (cold) afdo 0 (auto FDO) scaled 0 (guessed) fdo 4847 (precise)
> diff -4847, -100.00%
> preds
> succs 59 169
> brute/32 bb 61 (hot) afdo 629997 (auto FDO) scaled 24978219 (guessed) fdo
> 9332718 (precise) diff 15645501, +167.64%
> preds
> succs 59 62
> brute/32 bb 62 (hot) afdo 2753633 (auto FDO) scaled 109176468 (guessed) fdo
> 37330872 (precise) diff 71845596, +192.46%
> preds
> succs 61 68 69 63
> brute/32 bb 63 (hot) afdo 2031304 (auto FDO) scaled 80537456 (guessed) fdo
> 27998154 (precise) diff 52539302, +187.65%
> preds
> succs 62 64
> brute/32 bb 64 (very hot) afdo 2661301 (auto FDO) scaled 105515674 (guessed)
> fdo 111992616 (precise) diff -6476942, -5.78%
> preds
> succs 63 67 68 65
>
> I.e. function name, bb index, hotness according to afdo (I added very hot
> variant as those we want to look at first), afdo count,
> attempt to scale it correctly, fdo count and difference.
>
> So I think very hot with very large negative diff are first to look at.
>
> jh@shroud:~/cpu2017/benchspec/CPU/548.exchange2_r/build/build_peak_autofdo-cmp-m64.0000>
> grep "(very hot).*, -9[0-9]\\." *.profile | sort -t")" -n -k3
> digits_2/30 bb 248 (very hot) afdo 1106295 (auto FDO) scaled 43862556
> (guessed) fdo 2180591954 (precise) diff -2136729398, -97.99%
> digits_2/30 bb 276 (very hot) afdo 7848783 (auto FDO) scaled 311189765
> (guessed) fdo 13764381875 (precise) diff -13453192110, -97.74%
> digits_2/30 bb 308 (very hot) afdo 8952263 (auto FDO) scaled 354940711
> (guessed) fdo 10821023213 (precise) diff -10466082502, -96.72%
> digits_2/30 bb 340 (very hot) afdo 8148862 (auto FDO) scaled 323087343
> (guessed) fdo 7398657585 (precise) diff -7075570242, -95.63%
> digits_2/30 bb 4 (very hot) afdo 9774243 (auto FDO) scaled 387530701
> (guessed) fdo 4039694145 (precise) diff -3652163444, -90.41%
> hidden_triplets/8 bb 380 (very hot) afdo 153532 (guessed) scaled 6087261
> (guessed) fdo 162401310 (precise) diff -156314049, -96.25%
> hidden_triplets/8 bb 381 (very hot) afdo 136644 (guessed) scaled 5417682
> (guessed) fdo 146161179 (precise) diff -140743497, -96.29%
> hidden_triplets/8 bb 383 (very hot) afdo 136644 (guessed) scaled 5417682
> (guessed) fdo 146161179 (precise) diff -140743497, -96.29%
> hidden_triplets/8 bb 387 (very hot) afdo 53337 (guessed) scaled 2114714
> (guessed) fdo 162209420 (precise) diff -160094706, -98.70%
> hidden_triplets/8 bb 388 (very hot) afdo 47470 (guessed) scaled 1882098
> (guessed) fdo 145988478 (precise) diff -144106380, -98.71%
> hidden_triplets/8 bb 390 (very hot) afdo 47470 (guessed) scaled 1882098
> (guessed) fdo 145988478 (precise) diff -144106380, -98.71%
> new_solver/9 bb 139 (very hot) afdo 365547 (guessed) scaled 14493264
> (guessed) fdo 399160962 (precise) diff -384667698, -96.37%
> new_solver/9 bb 140 (very hot) afdo 345441 (guessed) scaled 13696098
> (guessed) fdo 361455947 (precise) diff -347759849, -96.21%
> new_solver/9 bb 142 (very hot) afdo 326442 (guessed) scaled 12942823
> (guessed) fdo 352025127 (precise) diff -339082304, -96.32%
> new_solver/9 bb 160 (very hot) afdo 365547 (guessed) scaled 14493264
> (guessed) fdo 397240050 (precise) diff -382746786, -96.35%
> new_solver/9 bb 161 (very hot) afdo 328992 (guessed) scaled 13043926
> (guessed) fdo 359535035 (precise) diff -346491109, -96.37%
> new_solver/9 bb 163 (very hot) afdo 310897 (guessed) scaled 12326492
> (guessed) fdo 350104215 (precise) diff -337777723, -96.48%
> new_solver/9 bb 218 (very hot) afdo 365547 (guessed) scaled 14493264
> (guessed) fdo 399683609 (precise) diff -385190345, -96.37%
> new_solver/9 bb 219 (very hot) afdo 345442 (guessed) scaled 13696138
> (guessed) fdo 361977164 (precise) diff -348281026, -96.22%
> new_solver/9 bb 221 (very hot) afdo 326442 (guessed) scaled 12942823
> (guessed) fdo 352546094 (precise) diff -339603271, -96.33%
> new_solver/9 bb 239 (very hot) afdo 365547 (guessed) scaled 14493264
> (guessed) fdo 397175144 (precise) diff -382681880, -96.35%
> new_solver/9 bb 240 (very hot) afdo 328992 (guessed) scaled 13043926
> (guessed) fdo 359468699 (precise) diff -346424773, -96.37%
> new_solver/9 bb 242 (very hot) afdo 310898 (guessed) scaled 12326532
> (guessed) fdo 350037629 (precise) diff -337711097, -96.48%
> new_solver/9 bb 303 (very hot) afdo 124141 (guessed) scaled 4921962 (guessed)
> fdo 112036204 (precise) diff -107114242, -95.61%
> new_solver/9 bb 304 (very hot) afdo 117313 (guessed) scaled 4651244 (guessed)
> fdo 112036204 (precise) diff -107384960, -95.85%
> new_solver/9 bb 334 (very hot) afdo 124141 (guessed) scaled 4921962 (guessed)
> fdo 112036204 (precise) diff -107114242, -95.61%
> new_solver/9 bb 335 (very hot) afdo 117313 (guessed) scaled 4651244 (guessed)
> fdo 112036204 (precise) diff -107384960, -95.85%
> new_solver/9 bb 369 (very hot) afdo 124141 (guessed) scaled 4921962 (guessed)
> fdo 113795352 (precise) diff -108873390, -95.67%
> new_solver/9 bb 370 (very hot) afdo 117313 (guessed) scaled 4651244 (guessed)
> fdo 113795352 (precise) diff -109144108, -95.91%
> new_solver/9 bb 400 (very hot) afdo 124141 (guessed) scaled 4921962 (guessed)
> fdo 113795352 (precise) diff -108873390, -95.67%
> new_solver/9 bb 401 (very hot) afdo 117313 (guessed) scaled 4651244 (guessed)
> fdo 113795352 (precise) diff -109144108, -95.91%
>
> One needs to do 3 builds
> 1) build with -g to train auto-fdo + train run
> 2) build with -fauto-profile -fprofile-generate to produce instrumentiton +
> train run
> 3) build with -fauto-profile -fprofile-use -fdump-tree-profile to produce
> stats.
>
> I use:
>
> OPTIMIZE = -Ofast -fopt-info-optimized-optall -march=native -mtune=native
> -Wno-error=implicit-int -Wno-error=implicit-function-declaration
> -Wno-error=declaration-missing-parameter-type -Wno-error=return-mismatch
> -Wno-error=int-conversion #-flto=auto
> #OPTIMIZE = -Ofast -g -fopt-info-optimized-optall -march=native
> -mtune=native --param max-peel-times=128 --param max-peel-branches=256
> --param max-peeled-insns=1024 --param profile-histogram-size-lin=129
>
> fdo_pre0 = rm -rf ${benchmark}.data ${benchmark}.gcov; \\
>
> fdo_run1 = perf record -e ex_ret_brn_tkn:Pu -c 100000 -b -o ${benchmark}.data
> -- ${command}; \\
> create_gcov --binary=${baseexe} --profile=${benchmark}.data
> --gcov=current.gcov -gcov_version=2; \\
> if test -e ${benchmark}.gcov ; then profile_merger current.gcov
> ${benchmark}.gcov --output_file ${benchmark}.gcov ; else mv current.gcov
> ${benchmark}.gcov ; fi \\
>
> PASS1_OPTIMIZE = -g -fno-reorder-blocks-and-partition -fno-ipa-icf -fno-lto
> PASS2_OPTIMIZE = -fauto-profile=${benchmark}.gcov -fprofile-generate -g
> PASS3_OPTIMIZE = -fauto-profile=${benchmark}.gcov
> -fdump-ipa-all-details-blocks -fdump-tree-optimized-details-blocks
> -fdump-tree-einline-details -g -fprofile-use
>
> I am flying from China today so will see how much I can play around with the
> patch.
> Any improvements would be welcome.
>
> gcc/ChangeLog:
>
> * auto-profile.cc
> (autofdo_source_profile::offline_external_functions): Add
> missing newline.
> (afdo_annotate_cfg): Update max bb count.
> * coverage.cc (coverage_init): Add AUTO_PROFILE parameter.
> * coverage.h (coverage_init): Update prototype.
> * passes.cc (finish_optimization_passes): Do not call end_branch_prob.
> (pass_manager::dump_profile_report): Also watch for afdo.
> * profile.cc (struct afdo_fdo_record): New structure.
> (compute_branch_probabilities): Record info about afdo/fdo compare
> (end_branch_prob): Print the records.
> * toplev.cc (do_compile): Initialize both FDO and AFDO if requested.
> * tree-profile.cc (tree_profiling): End profiling
> (pass_ipa_tree_profile::gate): Also run with auto-profile.
>
> gcc/lto/ChangeLog:
>
> * lto-symtab.cc (lto_symtab_merge_symbols_1):
>
> diff --git a/gcc/auto-profile.cc b/gcc/auto-profile.cc
> index 44e7faa8fee..b6dd552acab 100644
> --- a/gcc/auto-profile.cc
> +++ b/gcc/auto-profile.cc
> @@ -1308,7 +1308,7 @@ autofdo_source_profile::offline_external_functions ()
> if (dump_file)
> fprintf (dump_file,
> "string table in auto-profile contains"
> - " duplicated name %s", n1);
> + " duplicated name %s\n", n1);
> to_symbol_name.put (i, index);
> }
> continue;
> @@ -3119,6 +3119,7 @@ afdo_annotate_cfg (void)
> FOR_ALL_BB_FN (bb, cfun)
> if (bb->count.quality () == GUESSED_LOCAL)
> bb->count = bb->count.global0afdo ();
> + update_max_bb_count ();
>
> loop_optimizer_finalize ();
> free_dominance_info (CDI_DOMINATORS);
> diff --git a/gcc/coverage.cc b/gcc/coverage.cc
> index c0ae76a40ef..af9091c4561 100644
> --- a/gcc/coverage.cc
> +++ b/gcc/coverage.cc
> @@ -1251,7 +1251,7 @@ coverage_obj_finish (vec<constructor_elt, va_gc> *ctor,
> of notes file. */
>
> void
> -coverage_init (const char *filename)
> +coverage_init (const char *filename, bool auto_profile)
> {
> /* If we are in LTO, the profile will be read from object files. */
> if (in_lto_p)
> @@ -1315,7 +1315,7 @@ coverage_init (const char *filename)
> strcpy (da_file_name + prefix_len + len, GCOV_DATA_SUFFIX);
>
> bbg_file_stamp = local_tick;
> - if (flag_auto_profile)
> + if (auto_profile)
> read_autofdo_file ();
> else if (flag_branch_probabilities)
> read_counts_file ();
> diff --git a/gcc/coverage.h b/gcc/coverage.h
> index 9afbb605482..1f4b521bdea 100644
> --- a/gcc/coverage.h
> +++ b/gcc/coverage.h
> @@ -22,7 +22,7 @@ along with GCC; see the file COPYING3. If not see
>
> #include "gcov-io.h"
>
> -extern void coverage_init (const char *);
> +extern void coverage_init (const char *, bool);
> extern void coverage_finish (void);
> extern void coverage_remove_note_file (void);
>
> diff --git a/gcc/passes.cc b/gcc/passes.cc
> index 6c67ffe56ba..a33c8d924a5 100644
> --- a/gcc/passes.cc
> +++ b/gcc/passes.cc
> @@ -355,13 +355,6 @@ finish_optimization_passes (void)
> gcc::dump_manager *dumps = m_ctxt->get_dumps ();
>
> timevar_push (TV_DUMP);
> - if (coverage_instrumentation_p () || flag_test_coverage
> - || flag_branch_probabilities)
> - {
> - dumps->dump_start (m_pass_profile_1->static_pass_number, NULL);
> - end_branch_prob ();
> - dumps->dump_finish (m_pass_profile_1->static_pass_number);
> - }
>
> /* Do whatever is necessary to finish printing the graphs. */
> for (i = TDI_end; (dfi = dumps->get_dump_file_info (i)) != NULL; ++i)
> @@ -2036,6 +2029,7 @@ pass_manager::dump_profile_report () const
> fprintf (dump_file, "| %12.0f", profile_record[i].time);
> /* Time units changes with profile estimate and feedback. */
> if (i == m_pass_profile_1->static_pass_number
> + || i == m_pass_ipa_auto_profile_1->static_pass_number
> || i == m_pass_ipa_tree_profile_1->static_pass_number)
> fprintf (dump_file, "-------------");
> else if (rel_time_change)
> diff --git a/gcc/profile.cc b/gcc/profile.cc
> index 6234dd2d4e2..ecfc9bdc254 100644
> --- a/gcc/profile.cc
> +++ b/gcc/profile.cc
> @@ -423,6 +423,20 @@ cmp_stats (const void *ptr1, const void *ptr2)
> return 0;
> }
>
> +struct afdo_fdo_record
> +{
> + cgraph_node *node;
> + struct bb_record
> + {
> + int index;
> + profile_count afdo;
> + profile_count fdo;
> + vec <int> preds;
> + vec <int> succs;
> + };
> + vec <bb_record> bbs;
> +};
> +static vec <afdo_fdo_record> afdo_fdo_records;
>
> /* Compute the branch probabilities for the various branches.
> Annotate them accordingly.
> @@ -472,6 +486,22 @@ compute_branch_probabilities (unsigned cfg_checksum,
> unsigned lineno_checksum)
> BB_INFO (EXIT_BLOCK_PTR_FOR_FN (cfun))->succ_count = 2;
> BB_INFO (ENTRY_BLOCK_PTR_FOR_FN (cfun))->pred_count = 2;
>
> + afdo_fdo_record record = {cgraph_node::get (current_function_decl),
> vNULL};;
> + if (dump_file && flag_auto_profile)
> + {
> + FOR_ALL_BB_FN (bb, cfun)
> + {
> + record.bbs.safe_push ({bb->index, bb->count.ipa (),
> + profile_count::uninitialized (), vNULL,
> vNULL});
> + record.bbs.last ().preds.reserve (EDGE_COUNT (bb->preds));
> + for (auto &e : bb->preds)
> + record.bbs.last ().succs.safe_push (e->src->index);
> + record.bbs.last ().succs.reserve (EDGE_COUNT (bb->succs));
> + for (auto &e : bb->succs)
> + record.bbs.last ().succs.safe_push (e->dest->index);
> + }
> + }
> +
> num_edges = read_profile_edge_counts (exec_counts);
>
> if (dump_file)
> @@ -744,7 +774,6 @@ compute_branch_probabilities (unsigned cfg_checksum,
> unsigned lineno_checksum)
> num_branches++;
> }
> }
> -
> if (exec_counts
> && (bb_gcov_count (ENTRY_BLOCK_PTR_FOR_FN (cfun))
> || !flag_profile_partial_training))
> @@ -812,6 +841,18 @@ compute_branch_probabilities (unsigned cfg_checksum,
> unsigned lineno_checksum)
> delete edge_gcov_counts;
> edge_gcov_counts = NULL;
>
> + if (dump_file && flag_auto_profile)
> + {
> + int i = 0;
> + FOR_ALL_BB_FN (bb, cfun)
> + {
> + gcc_checking_assert (record.bbs[i].index == bb->index);
> + record.bbs[i].fdo = bb->count.ipa();
> + i++;
> + }
> + afdo_fdo_records.safe_push (record);
> + }
> +
> update_max_bb_count ();
>
> if (dump_file)
> @@ -1804,6 +1845,49 @@ end_branch_prob (void)
> }
> fprintf (dump_file, "Total number of conditions: %d\n",
> total_num_conds);
> + if (afdo_fdo_records.length ())
> + {
> + profile_count fdo_sum = profile_count::zero ();
> + profile_count afdo_sum = profile_count::zero ();
> + for (const auto &r : afdo_fdo_records)
> + for (const auto &b : r.bbs)
> + if (b.fdo.initialized_p () && b.afdo.initialized_p ())
> + {
> + fdo_sum += b.fdo;
> + afdo_sum += b.afdo;
> + }
> + for (auto &r : afdo_fdo_records)
> + {
> + for (auto &b : r.bbs)
> + if (b.fdo.initialized_p () && b.afdo.initialized_p ())
> + {
> + profile_count scaled = b.afdo.apply_scale (fdo_sum,
> afdo_sum);
> + fprintf (dump_file, "%s bb %i (%s) afdo ",
> r.node->dump_name (), b.index,
> + maybe_hot_count_p (NULL, b.fdo.apply_scale (1,
> 1000)) ? "very hot"
> + : maybe_hot_count_p (NULL, b.fdo) ? "hot" :
> "cold");
> + b.afdo.dump (dump_file);
> + fprintf (dump_file, " scaled ");
> + scaled.dump (dump_file);
> + fprintf (dump_file, " fdo ");
> + b.fdo.dump (dump_file);
> + fprintf (dump_file, " diff %" PRId64 ", %+2.2f%%\n",
> + scaled.to_gcov_type () - b.fdo.to_gcov_type (),
> + (scaled.to_gcov_type () - b.fdo.to_gcov_type ())
> * 100.0
> + / MAX (b.fdo.to_gcov_type (), 1));
> + fprintf (dump_file, " preds");
> + for (int val : b.preds)
> + fprintf (dump_file, " %i", val);
> + b.preds.release ();
> + fprintf (dump_file, "\n succs");
> + for (int val : b.succs)
> + fprintf (dump_file, " %i", val);
> + b.succs.release ();
> + fprintf (dump_file, "\n");
> + }
> + r.bbs.release ();
> + }
> + }
> + afdo_fdo_records.release ();
> }
> }
>
> diff --git a/gcc/toplev.cc b/gcc/toplev.cc
> index 00a8ccb7a69..e6eba05ab0b 100644
> --- a/gcc/toplev.cc
> +++ b/gcc/toplev.cc
> @@ -2197,7 +2197,9 @@ do_compile ()
>
> symtab->initialize ();
> init_final (main_input_filename);
> - coverage_init (aux_base_name);
> + coverage_init (aux_base_name, false);
> + if (flag_auto_profile)
> + coverage_init (aux_base_name, true);
> statistics_init ();
> debuginfo_init ();
> invoke_plugin_callbacks (PLUGIN_START_UNIT, NULL);
> diff --git a/gcc/tree-profile.cc b/gcc/tree-profile.cc
> index fed218eb60b..fe20e84838d 100644
> --- a/gcc/tree-profile.cc
> +++ b/gcc/tree-profile.cc
> @@ -2031,6 +2031,7 @@ tree_profiling (void)
> handle_missing_profiles ();
>
> del_node_map ();
> + end_branch_prob ();
> return 0;
> }
>
> @@ -2065,10 +2066,8 @@ public:
> bool
> pass_ipa_tree_profile::gate (function *)
> {
> - /* When profile instrumentation, use or test coverage shall be performed.
> - But for AutoFDO, this there is no instrumentation, thus this pass is
> - disabled. */
> - return (!in_lto_p && !flag_auto_profile
> + /* When profile instrumentation, use or test coverage shall be performed.
> */
> + return (!in_lto_p
> && (flag_branch_probabilities || flag_test_coverage
> || coverage_instrumentation_p ())
> && !seen_error ());