On Sat, Aug 17, 2013 at 1:44 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>>
>> I added both of these and ran into issues due to profile maintenance.
>> For example, there were non-zero blocks in the cold section because
>> pro_and_epilogue split a simple return block that was previously reached
>> by both hot and cold paths. The new return block, which was then only
>> reached via the cold path, did not have its count properly updated to
>> reflect this, and since with this patch blocks dominated by cold blocks
>> are re-marked cold, we ended up with a non-zero count block in the cold
>> section. There were also 0-count blocks reached by non-zero edges,
>> because copyprop did not clean up edge weights after removing some
>> branches and blocks; this left non-zero edge weights that had previously
>> targeted a branch that was removed, now targeting a 0-count block that
>> the removed branch always branched around.
>
> I see, can you please send fixes for the problems you identified?
> Thanks for working on this!
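(For reference, the two inconsistencies described above are the kind of thing a
dump-file check could flag instead of an assert. A rough sketch is below; the
helper name is made up and it is not part of the patch. It just walks the CFG
with the usual iteration macros and reports cold-partition blocks with non-zero
counts, and 0-count blocks reached over non-zero-count edges.)

/* Hypothetical dump-file check for partition/profile inconsistencies.  */

static void
dump_partition_profile_inconsistencies (void)
{
  basic_block bb;

  if (!dump_file)
    return;

  FOR_EACH_BB_FN (bb, cfun)
    {
      edge e;
      edge_iterator ei;

      /* A block placed in the cold section should have a zero count.  */
      if (BB_PARTITION (bb) == BB_COLD_PARTITION && bb->count)
        fprintf (dump_file, "cold bb %i has non-zero count\n", bb->index);

      /* A 0-count block should not be reached over a non-zero-count edge.  */
      if (!bb->count)
        FOR_EACH_EDGE (e, ei, bb->preds)
          if (e->count)
            fprintf (dump_file,
                     "0-count bb %i reached via non-zero edge from bb %i\n",
                     bb->index, e->src->index);
    }
}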
I don't have fixes at this point - I just identified the phase and
transformation from looking at the dump. But I'll try to fix them soon
while I'm working on performance tuning for splitting. I have a feeling
there are probably a bunch of places where the profile isn't getting
updated properly, unfortunately.

>>
>> In any case, the good news is that in the cases I looked at, the
>> splitting code is doing the right thing and these blocks that were
>> marked cold really were cold. It would be great to fix the profile
>> maintenance issues, but in the meantime the above sanity checks are
>> too aggressive.
>
> We can keep them and output info into the dump file - that is what most
> of the profile sanity checking does anyway.

Ok, I will add that.

> Did you try to use Martin's linker script to make the text.unlikely
> section non-executable? I think that way we will easily find what really
> causes us to use it during startup of a trained application (just like
> Martin does for gimp).

I haven't - where can I get that script?

>>
>> I think it makes sense to commit the current patch if possible, as it
>> is making the splitting more sane.
>
> My only concern about the patch is that I am not convinced the
> dominator-based code has a chance to work reliably enough that we won't
> see too many accesses into the cold section.

Remember it isn't using dominance anymore. The latest patch instead
ensures that the most frequent paths between hot blocks and the
entry/exit are marked hot. That should be better than the dominance
approach used in the earlier version.

> We can commit it and work on a better solution incrementally, but it
> will probably mean replacing it later. If you think it makes things
> easier to work on it incrementally, I think the patch is OK.

Yes, I think this is a big step forward from what is there now for
splitting, which does the splitting purely based on bb count in
isolation. I don't have a much better solution in mind yet.

>>
>> > - I'll try building and profiling gimp myself to see if I can
>> > reproduce the issue with code executing out of the cold section.
>>
>> I have spent some time this week trying to get the latest gimp Martin
>> pointed me to configured and built, but it took a while to track down
>> and configure/build all of the required versions of dependent
>> packages. I'm still hitting some issues trying to get it compiled, so
>> it may not yet be configured properly. I'll take a look again early
>> next week.
>
> I do not think there is anything special about gimp. You can probably
> take any other bigger app, like GCC itself. With profiledbootstrap
> and the linker script to lock down the unlikely section you should get
> ICEs where we jump into the cold section and should not.

Ok, please point me to the linker script and I will try a gcc
profiledbootstrap as well. I wanted to try gimp if possible as I haven't
seen this much jumping to the cold section in some of the internal apps
I tried.

Thanks,
Teresa

>>
>> Teresa
>>
>> patch for updating counts based on estimated frequencies to address
>> inlined comdats with 0 profile counts:
>>
>> 2013-08-16  Teresa Johnson  <tejohn...@google.com>
>>
>> * tree-inline.c (copy_bb): Compute count based on frequency.
>> (copy_edges_for_bb): Ditto.
>> (copy_cfg_body): Ditto.
>> (copy_body): Pass down frequency.
>> (expand_call_inline): Ditto.
>> (tree_function_versioning): Ditto.
>> * predict.c (init_and_estimate_bb_frequencies): New function.
>> (rebuild_frequencies): Invoke init_and_estimate_bb_frequencies.
>> * predict.h (init_and_estimate_bb_frequencies): Declare.
>> * profile.c (branch_prob): Invoke init_and_estimate_bb_frequencies.
>> * ipa-inline-transform.c (update_noncloned_frequencies): Scale edge
>> counts.
>> (clone_inlined_nodes): Compute edge count scale if needed.
>
> I do not see why the inliner needs to care about scaling more than it does
> right now. So you have init_and_estimate_bb_frequencies that forces profile
> guessing on a given function body. In addition to that I think you need
> something like freqs_to_counts that will compute counts based on freqs with
> a given scale (actually you can do that as part of propagation, before
> frequencies are scaled to the usual 0...FREQ_MAX scale and precision is
> lost).
>
> Because an offline COMDAT function will be produced for every COMDAT used,
> I think it is bad to produce any COMDAT (or any function reachable via
> calls with non-0 count) that has an empty profile (either because it got
> lost by COMDAT merging or because of a reading mismatch).
>
> So I guess you can just check functions with 0 count and non-0 count
> callers and initialize their guessed profile.
> Some capping will probably be needed to not propagate insanely large
> numbers..
>
> Since new direct calls can be discovered later, inline may want to do that
> again each time it inlines a non-0 count call of a COMDAT with 0 count...
>
> Honza
>>
>> Index: tree-inline.c
>> ===================================================================
>> --- tree-inline.c (revision 201644)
>> +++ tree-inline.c (working copy)
>> @@ -1502,7 +1502,7 @@ remap_gimple_stmt (gimple stmt, copy_body_data *id
>>  
>>  static basic_block
>>  copy_bb (copy_body_data *id, basic_block bb, int frequency_scale,
>> -         gcov_type count_scale)
>> +         gcov_type count_scale, gcov_type freq_to_count_scale)
>>  {
>>    gimple_stmt_iterator gsi, copy_gsi, seq_gsi;
>>    basic_block copy_basic_block;
>> @@ -1519,7 +1519,13 @@ copy_bb (copy_body_data *id, basic_block bb, int f
>>       basic_block_info automatically.  */
>>    copy_basic_block = create_basic_block (NULL, (void *) 0,
>>                                           (basic_block) prev->aux);
>> -  copy_basic_block->count = apply_scale (bb->count, count_scale);
>> +  copy_basic_block->count
>> +    = (count_scale
>> +       ? apply_scale (bb->count, count_scale)
>> +       /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> +          that didn't get profile counts) then we compute the new bb counts
>> +          via the statically-estimated frequencies.  */
>> +       : RDIV ((gcov_type)bb->frequency * freq_to_count_scale, BB_FREQ_MAX));
>>  
>>    /* We are going to rebuild frequencies from scratch.  These values
>>       have just small importance to drive canonicalize_loop_headers.  */
>> @@ -1888,7 +1894,8 @@ update_ssa_across_abnormal_edges (basic_block bb,
>>     debug stmts are left after a statement that must end the basic block.  */
>>  
>>  static bool
>> -copy_edges_for_bb (basic_block bb, gcov_type count_scale, basic_block ret_bb,
>> +copy_edges_for_bb (basic_block bb, gcov_type count_scale,
>> +                   basic_block ret_bb,
>>                     bool can_make_abnormal_goto)
>>  {
>>    basic_block new_bb = (basic_block) bb->aux;
>> @@ -1912,7 +1919,14 @@ static bool
>>            && old_edge->dest->aux != EXIT_BLOCK_PTR)
>>          flags |= EDGE_FALLTHRU;
>>        new_edge = make_edge (new_bb, (basic_block) old_edge->dest->aux, flags);
>> -      new_edge->count = apply_scale (old_edge->count, count_scale);
>> +      basic_block new_src_bb = (basic_block) old_edge->src->aux;
>> +      new_edge->count
>> +        = (count_scale
>> +           ? apply_scale (old_edge->count, count_scale)
>> +           // The bb counts have already been scaled with freq_to_count_scale
>> +           // when that is non-zero, so just scale that new bb count by
>> +           // the edge probability.
>> +           : apply_probability (new_src_bb->count, old_edge->probability));
>>        new_edge->probability = old_edge->probability;
>>      }
>>  
>> @@ -2282,7 +2296,8 @@ redirect_all_calls (copy_body_data * id, basic_blo
>>     another function.  Walks FN via CFG, returns new fndecl.  */
>>  
>>  static tree
>> -copy_cfg_body (copy_body_data * id, gcov_type count, int frequency_scale,
>> +copy_cfg_body (copy_body_data * id, gcov_type count,
>> +               int frequency, int frequency_scale,
>>                 basic_block entry_block_map, basic_block exit_block_map,
>>                 bitmap blocks_to_copy, basic_block new_entry)
>>  {
>> @@ -2293,15 +2308,20 @@ static tree
>>    basic_block bb;
>>    tree new_fndecl = NULL;
>>    bool need_debug_cleanup = false;
>> -  gcov_type count_scale;
>> +  gcov_type count_scale = 0;
>> +  gcov_type freq_to_count_scale = 0;
>>    int last;
>>    int incoming_frequency = 0;
>>    gcov_type incoming_count = 0;
>>  
>> -  if (ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count)
>> -    count_scale
>> -      = GCOV_COMPUTE_SCALE (count,
>> -                            ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count);
>> +  basic_block entry_bb = ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun);
>> +  if (entry_bb->count)
>> +    count_scale = GCOV_COMPUTE_SCALE (count, entry_bb->count);
>> +  /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> +     that didn't get profile counts) then we compute the new bb counts
>> +     via the statically-estimated frequencies.  */
>> +  else if (entry_bb->frequency)
>> +    freq_to_count_scale = RDIV (count * frequency, entry_bb->frequency);
>>    else
>>      count_scale = REG_BR_PROB_BASE;
>>  
>> @@ -2323,7 +2343,13 @@ static tree
>>            incoming_frequency += EDGE_FREQUENCY (e);
>>            incoming_count += e->count;
>>          }
>> -      incoming_count = apply_scale (incoming_count, count_scale);
>> +      incoming_count
>> +        = (count_scale
>> +           ? apply_scale (incoming_count, count_scale)
>> +           /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> +              that didn't get profile counts) then we compute the new bb counts
>> +              via the statically-estimated frequencies.  */
>> +           : RDIV (incoming_frequency * freq_to_count_scale, BB_FREQ_MAX));
>>        incoming_frequency
>>          = apply_scale ((gcov_type)incoming_frequency, frequency_scale);
>>        ENTRY_BLOCK_PTR->count = incoming_count;
>> @@ -2350,7 +2376,8 @@ static tree
>>    FOR_EACH_BB_FN (bb, cfun_to_copy)
>>      if (!blocks_to_copy || bitmap_bit_p (blocks_to_copy, bb->index))
>>        {
>> -        basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale);
>> +        basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale,
>> +                                      freq_to_count_scale);
>>          bb->aux = new_bb;
>>          new_bb->aux = bb;
>>          new_bb->loop_father = entry_block_map->loop_father;
>> @@ -2364,7 +2391,8 @@ static tree
>>    FOR_ALL_BB_FN (bb, cfun_to_copy)
>>      if (!blocks_to_copy
>>          || (bb->index > 0 && bitmap_bit_p (blocks_to_copy, bb->index)))
>> -      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale, exit_block_map,
>> +      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale,
>> +                                               exit_block_map,
>>                                                 can_make_abormal_goto);
>>  
>>    if (new_entry)
>> @@ -2562,7 +2590,8 @@ copy_tree_body (copy_body_data *id)
>>     another function.  */
>>  
>>  static tree
>> -copy_body (copy_body_data *id, gcov_type count, int frequency_scale,
>> +copy_body (copy_body_data *id, gcov_type count, int frequency,
>> +           int frequency_scale,
>>             basic_block entry_block_map, basic_block exit_block_map,
>>             bitmap blocks_to_copy, basic_block new_entry)
>>  {
>> @@ -2571,7 +2600,8 @@ static tree
>>  
>>    /* If this body has a CFG, walk CFG and copy.  */
>>    gcc_assert (ENTRY_BLOCK_PTR_FOR_FUNCTION (DECL_STRUCT_FUNCTION (fndecl)));
>> -  body = copy_cfg_body (id, count, frequency_scale, entry_block_map, exit_block_map,
>> +  body = copy_cfg_body (id, count, frequency, frequency_scale,
>> +                        entry_block_map, exit_block_map,
>>                          blocks_to_copy, new_entry);
>>    copy_debug_stmts (id);
>>  
>> @@ -4172,7 +4202,7 @@ expand_call_inline (basic_block bb, gimple stmt, c
>>       function in any way before this point, as this CALL_EXPR may be
>>       a self-referential call; if we're calling ourselves, we need to
>>       duplicate our body before altering anything.  */
>> -  copy_body (id, bb->count,
>> +  copy_body (id, bb->count, bb->frequency,
>>               GCOV_COMPUTE_SCALE (cg_edge->frequency, CGRAPH_FREQ_BASE),
>>               bb, return_block, NULL, NULL);
>>  
>> @@ -5299,8 +5329,9 @@ tree_function_versioning (tree old_decl, tree new_
>>      }
>>  
>>    /* Copy the Function's body.  */
>> -  copy_body (&id, old_entry_block->count, REG_BR_PROB_BASE,
>> -             ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR, blocks_to_copy, new_entry);
>> +  copy_body (&id, old_entry_block->count, old_entry_block->frequency,
>> +             REG_BR_PROB_BASE, ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR,
>> +             blocks_to_copy, new_entry);
>>  
>>    /* Renumber the lexical scoping (non-code) blocks consecutively.  */
>>    number_blocks (new_decl);
>> Index: predict.c
>> ===================================================================
>> --- predict.c (revision 201644)
>> +++ predict.c (working copy)
>> @@ -2976,6 +2976,24 @@ make_pass_strip_predict_hints (gcc::context *ctxt)
>>    return new pass_strip_predict_hints (ctxt);
>>  }
>>  
>> +/* Initialize loop edges and compute estimated bb frequencies when there
>> +   is no profile data available.  */
>> +
>> +void
>> +init_and_estimate_bb_frequencies (void)
>> +{
>> +  if (profile_status == PROFILE_READ && counts_to_freqs ())
>> +    return;
>> +
>> +  loop_optimizer_init (0);
>> +  add_noreturn_fake_exit_edges ();
>> +  mark_irreducible_loops ();
>> +  connect_infinite_loops_to_exit ();
>> +  estimate_bb_frequencies ();
>> +  remove_fake_exit_edges ();
>> +  loop_optimizer_finalize ();
>> +}
>> +
>>  /* Rebuild function frequencies.  Passes are in general expected to
>>     maintain profile by hand, however in some cases this is not possible:
>>     for example when inlining several functions with loops freuqencies might run
>> @@ -2986,15 +3004,7 @@ rebuild_frequencies (void)
>>  {
>>    timevar_push (TV_REBUILD_FREQUENCIES);
>>    if (profile_status == PROFILE_GUESSED)
>> -    {
>> -      loop_optimizer_init (0);
>> -      add_noreturn_fake_exit_edges ();
>> -      mark_irreducible_loops ();
>> -      connect_infinite_loops_to_exit ();
>> -      estimate_bb_frequencies ();
>> -      remove_fake_exit_edges ();
>> -      loop_optimizer_finalize ();
>> -    }
>> +    init_and_estimate_bb_frequencies ();
>>    else if (profile_status == PROFILE_READ)
>>      counts_to_freqs ();
>>    else
>> Index: predict.h
>> ===================================================================
>> --- predict.h (revision 201644)
>> +++ predict.h (working copy)
>> @@ -38,6 +38,7 @@ enum prediction
>>  extern void predict_insn_def (rtx, enum br_predictor, enum prediction);
>>  extern int counts_to_freqs (void);
>>  extern void estimate_bb_frequencies (void);
>> +extern void init_and_estimate_bb_frequencies (void);
>>  extern const char *predictor_name (enum br_predictor);
>>  extern tree build_predict_expr (enum br_predictor, enum prediction);
>>  extern void tree_estimate_probability (void);
>> Index: profile.c
>> ===================================================================
>> --- profile.c (revision 201644)
>> +++ profile.c (working copy)
>> @@ -1305,6 +1305,12 @@ branch_prob (void)
>>  
>>    values.release ();
>>    free_edge_list (el);
>> +
>> +  /* Call after setting profile_status to PROFILE_READ, will then
>> +     invoke counts_to_freqs and if the sum of the counts is zero, will
>> +     estimate the frequencies.  */
>> +  init_and_estimate_bb_frequencies ();
>> +
>>    coverage_end_function (lineno_checksum, cfg_checksum);
>>  }
>>  ^L
>> Index: ipa-inline-transform.c
>> ===================================================================
>> --- ipa-inline-transform.c (revision 201644)
>> +++ ipa-inline-transform.c (working copy)
>> @@ -51,7 +51,7 @@ int nfunctions_inlined;
>>  
>>  static void
>>  update_noncloned_frequencies (struct cgraph_node *node,
>> -                              int freq_scale)
>> +                              gcov_type count_scale, int freq_scale)
>>  {
>>    struct cgraph_edge *e;
>>  
>> @@ -60,14 +60,16 @@ update_noncloned_frequencies (struct cgraph_node *
>>      freq_scale = 1;
>>    for (e = node->callees; e; e = e->next_callee)
>>      {
>> +      e->count = apply_scale (e->count, count_scale);
>>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>>        if (e->frequency > CGRAPH_FREQ_MAX)
>>          e->frequency = CGRAPH_FREQ_MAX;
>>        if (!e->inline_failed)
>> -        update_noncloned_frequencies (e->callee, freq_scale);
>> +        update_noncloned_frequencies (e->callee, count_scale, freq_scale);
>>      }
>>    for (e = node->indirect_calls; e; e = e->next_callee)
>>      {
>> +      e->count = apply_scale (e->count, count_scale);
>>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>>        if (e->frequency > CGRAPH_FREQ_MAX)
>>          e->frequency = CGRAPH_FREQ_MAX;
>> @@ -169,7 +171,13 @@ clone_inlined_nodes (struct cgraph_edge *e, bool d
>>          }
>>        duplicate = false;
>>        e->callee->symbol.externally_visible = false;
>> -      update_noncloned_frequencies (e->callee, e->frequency);
>> +      // In the case of a COMDAT, the callee's count may be from other
>> +      // modules, and we need to scale it for the current module's calls
>> +      // (e.g. e->count may be 0 despite e->callee->count > 0).
>> +      gcov_type count_scale = REG_BR_PROB_BASE;
>> +      if (e->callee->count > e->count)
>> +        count_scale = GCOV_COMPUTE_SCALE (e->count, e->callee->count);
>> +      update_noncloned_frequencies (e->callee, count_scale, e->frequency);
>>      }
>>    else
>>      {

--
Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413
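(For concreteness, the freqs_to_counts-style helper Honza suggests above -
synthesizing counts from the guessed frequencies at a given scale - might look
roughly like the sketch below. It is not part of the posted patch; the name
comes from Honza's comment, and it only uses primitives the patch already
relies on.)

/* Sketch only: fill in counts for FN, whose profile was lost, by scaling
   its statically estimated frequencies so that the entry block receives
   COUNT.  In practice some capping of very large values would be needed,
   as noted above.  */

static void
freqs_to_counts (struct function *fn, gcov_type count)
{
  basic_block bb;
  int entry_freq = ENTRY_BLOCK_PTR_FOR_FUNCTION (fn)->frequency;

  if (!entry_freq)
    entry_freq = 1;

  FOR_ALL_BB_FN (bb, fn)
    {
      edge e;
      edge_iterator ei;

      /* Scale each block's count from its estimated frequency, then derive
         outgoing edge counts from the edge probabilities.  */
      bb->count = RDIV (count * bb->frequency, entry_freq);
      FOR_EACH_EDGE (e, ei, bb->succs)
        e->count = apply_probability (bb->count, e->probability);
    }
}

As suggested above, something like this could be invoked on a 0-count COMDAT
whose callers have non-0 counts, and again by the inliner whenever it inlines
a non-0 count call of such a function.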