> Hi all,
>
> This is a patch submission following up on the RFC at:
> https://gcc.gnu.org/pipermail/gcc/2024-November/245076.html
> The patch is rebased and retested against current trunk, some debugging code
> removed, comments improved and some fixes added as we've done more testing.
>
> ------------------------>8-----------------------------------------------------
> Implement partitioning and cloning in the callgraph to help locality.
> A new -flto-partition=locality flag is used to enable this.
> The optimization has two components:
> * Partitioning the callgraph so as to group callers and callees that
>   frequently call each other in the same partition.
> * Cloning functions that straddle multiple callchains and allowing each clone
>   to be local to the partition of its callchain.
>
> The majority of the logic is in the new IPA pass in ipa-locality-cloning.cc.
> It creates a partitioning plan and does the prerequisite cloning.
> The partitioning is then implemented during the existing LTO partitioning
> pass.
>
> To guide these locality heuristics we use PGO data.
> In the absence of PGO data we use a static heuristic that uses the
> accumulated estimated edge frequencies of the callees for each function to
> guide the reordering.
> We are investigating some more elaborate static heuristics, in particular
> using the demangled C++ names to group template instantiations together.
> This is promising, but we are still working out some kinks in the
> implementation and want to send it out as a follow-up once we're more
> confident in it.
>
> A new bootstrap-lto-locality bootstrap config is added that allows us to test
> this on GCC itself with either static or PGO heuristics.
> GCC bootstraps with both (normal LTO bootstrap and profiledbootstrap).
>
> With this optimization we are seeing good performance gains on some large
> internal workloads that stress the parts of the processor that are sensitive
> to code locality, but we'd appreciate wider performance evaluation.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for mainline?
> Thanks,
> Kyrill
>
> Signed-off-by: Prachi Godbole <pgodb...@nvidia.com>
> Co-authored-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>
> config/ChangeLog:
>         * bootstrap-lto-locality.mk: New file.
>
> gcc/ChangeLog:
>         * Makefile.in (OBJS): Add ipa-locality-cloning.o.
>         (GTFILES): Add ipa-locality-cloning.cc dependency.
>         * common.opt (lto_partition_model): Add locality value.
>         * flag-types.h (lto_partition_model): Add LTO_PARTITION_LOCALITY
>         value.
>         (enum lto_locality_cloning_model): Define.
>         * lto-cgraph.cc (lto_set_symtab_encoder_in_partition): Add dumping
>         of node and index.
>         * params.opt (lto_locality_cloning_model): New enum.
>         (lto-partition-locality-cloning): New param.
>         (lto-partition-locality-frequency-cutoff): Likewise.
>         (lto-partition-locality-size-cutoff): Likewise.
>         (lto-max-locality-partition): Likewise.
>         * passes.def: Add pass_ipa_locality_cloning.
>         * timevar.def (TV_IPA_LC): New timevar.
>         * tree-pass.h (make_pass_ipa_locality_cloning): Declare.
>         * ipa-locality-cloning.cc: New file.
>         * ipa-locality-cloning.h: New file.
>
> gcc/lto/ChangeLog:
>         * lto-partition.cc: Include ipa-locality-cloning.h.
>         (add_node_references_to_partition): Define.
>         (create_partition): Likewise.
>         (lto_locality_map): Likewise.
>         (lto_promote_cross_file_statics): Add extra dumping.
>         * lto-partition.h (lto_locality_map): Declare.
>         * lto.cc (do_whole_program_analysis): Handle
>         LTO_PARTITION_LOCALITY.
>
>
> diff --git a/config/bootstrap-lto-locality.mk b/config/bootstrap-lto-locality.mk
> new file mode 100644
> index 00000000000..7792bbe6823
> --- /dev/null
> +++ b/config/bootstrap-lto-locality.mk
> @@ -0,0 +1,20 @@
> +# This option enables LTO and locality partitioning for stage2 and stage3 in slim mode
> +
> +STAGE2_CFLAGS += -flto=jobserver -frandom-seed=1 -flto-partition=locality
> +STAGE3_CFLAGS += -flto=jobserver -frandom-seed=1 -flto-partition=locality
> +STAGEprofile_CFLAGS += -flto=jobserver -frandom-seed=1 -flto-partition=locality
> +STAGEtrain_CFLAGS += -flto=jobserver -frandom-seed=1 -flto-partition=locality
> +STAGEfeedback_CFLAGS += -flto=jobserver -frandom-seed=1 -flto-partition=locality
> +
> +# assumes the host supports the linker plugin
> +LTO_AR = $$r/$(HOST_SUBDIR)/prev-gcc/gcc-ar$(exeext) -B$$r/$(HOST_SUBDIR)/prev-gcc/
> +LTO_RANLIB = $$r/$(HOST_SUBDIR)/prev-gcc/gcc-ranlib$(exeext) -B$$r/$(HOST_SUBDIR)/prev-gcc/
> +LTO_NM = $$r/$(HOST_SUBDIR)/prev-gcc/gcc-nm$(exeext) -B$$r/$(HOST_SUBDIR)/prev-gcc/
> +
> +LTO_EXPORTS = AR="$(LTO_AR)"; export AR; \
> +              RANLIB="$(LTO_RANLIB)"; export RANLIB; \
> +              NM="$(LTO_NM)"; export NM;
> +LTO_FLAGS_TO_PASS = AR="$(LTO_AR)" RANLIB="$(LTO_RANLIB)" NM="$(LTO_NM)"
> +
> +do-compare = $(SHELL) $(srcdir)/contrib/compare-lto $$f1 $$f2
> +extra-compare = gcc/lto1$(exeext)
With the bootstrap option, do we have some way to measure the impact?  Perhaps
using perf stat for code cache misses, or the number of loaded text segment
pages, while compiling some testcase?

Do you have performance/code size and compile time data on this?  In
particular, how well balanced are the sizes of the individual partitions
created?

> -new partition for every symbol where possible.  Specifying @samp{none}
> -as an algorithm disables partitioning and streaming completely.
> +new partition for every symbol where possible or @samp{locality} to maximize
> +the locality of the functions in the layout of the final binary.  Specifying
> +@samp{locality} can lead to the compiler cloning some functions in order to
> +improve their locality across multiple callchains.
> +Specifying @samp{none} as an algorithm disables partitioning and streaming
> +completely.

The current balanced and 1to1 partitioners partition into linear intervals in
the order given by either node->order or node->tp_first_run (see
tp_first_run_node_cmp), so we may want to untie the code layout from the
partitioning itself.  Martin Liska and I experimented with this in the past,
but since we were not able to get consistent improvements, we eventually gave
up.  If locality partitioning works, I guess I can recover this next stage1.

I am not sure what we will want to do with -flto-partition=locality.  Perhaps
we can have a flag like -freorder-for-locality which would imply enabling the
IPA pass and switching the partitioning algorithm, unless one is explicitly
specified by the user (in which case we can output an error, a sorry, or
something like that).

> +
> +/* Implement cloning required to improve partitioning of the callgraph
> +   for locality considerations.
> +   Locality code placement is done in 2 parts.
> +   1. IPA pass to be executed after ipa-inline and before ipa-pure-const.
> +      The Execute stage prepares the plan to place all nodes into partitions.
> +   2. The WPA Partition stage actually implements the plan.  */

Maybe some overall comment on the algorithm implemented would be nice.

> +
> +
> +/* Add symbol NODE to partition PART.  */
> +
> +static void
> +add_node_to_partition (locality_partition part, cgraph_node *node)
> +{
> +  struct cgraph_edge *e;
> +  if (node_partitioned_p (node))
> +    return;
> +
> +  part->nodes.safe_push (node);
> +  node->aux = (void *) (uintptr_t) (part->part_id);
> +
> +  if (!node->alias && node->get_partitioning_class () == SYMBOL_PARTITION)
> +    part->insns += ipa_size_summaries->get (node)->size;
> +
> +  /* Add all inline clones and callees that are duplicated.  */
> +  for (e = node->callees; e; e = e->next_callee)
> +    if (!e->inline_failed)
> +      add_node_to_partition (part, e->callee);
> +    /* omp declare_variant_alt or transparent_alias with definition or
> +       linker discardable (non-local comdat but not forced and not
> +       used by non-LTO).  */
> +    else if (e->callee->get_partitioning_class () == SYMBOL_DUPLICATE)
> +      add_node_to_partition (part, e->callee);
> +
> +  /* Add all thunks associated with the function.  */
> +  for (e = node->callers; e; e = e->next_caller)
> +    if (e->caller->thunk && !e->caller->inlined_to)
> +      add_node_to_partition (part, e->caller);
> +}
> +
> +/* Return TRUE if NODE is in PARTITION.  */
> +
> +static bool
> +node_in_partition_p (locality_partition partition, cgraph_node *node)
> +{
> +  return (std::find (partition->nodes.begin (), partition->nodes.end (),
> +                     node)
> +          != partition->nodes.end ());
> +}

Won't this cause problems with quadratic time complexity?
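For instance, since add_node_to_partition already stores part_id in node->aux,
a constant-time check along these lines might work (untested sketch; it
assumes aux is NULL for nodes not yet assigned to a partition and that part_id
0 is never used):

/* Untested sketch: O(1) membership test reusing the part_id that
   add_node_to_partition stored in node->aux.  Assumes aux is NULL for
   unpartitioned nodes and that partition ids start from 1.  */

static bool
node_in_partition_p (locality_partition partition, cgraph_node *node)
{
  return node->aux
         && (uintptr_t) node->aux == (uintptr_t) partition->part_id;
}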
> +
> +/* Helper function for qsort; sort nodes by profile count.  */
> +static int
> +compare_edge_profile_counts (const void *pa, const void *pb)
> +{
> +  const locality_order *a = *static_cast<const locality_order *const *> (pa);
> +  const locality_order *b = *static_cast<const locality_order *const *> (pb);
> +
> +  profile_count cnt1 = a->node->count.ipa ();
> +  profile_count cnt2 = b->node->count.ipa ();
> +  if (!cnt1.compatible_p (cnt2))
> +    return static_profile_cmp (pa, pb);
> +
> +  if (cnt1 < cnt2)
> +    return 1;
> +  if (cnt1 > cnt2)
> +    return -1;
> +  return static_profile_cmp (pa, pb);
> +}

I hope that this is transitive :)

> +  cgraph_node *cl_node = NULL;
> +  vec<cgraph_edge *> redirect_callers;
> +  redirect_callers.create (1);

Why is create (1) necessary?

> +  if (cnode->clone_of)
> +    {
> +      if (dump_file)
> +        fprintf (dump_file, "It's a clone %s\n", cnode->dump_asm_name ());
> +      clone_info *info = clone_info::get (cnode);
> +      cl_node
> +        = cnode->create_virtual_clone (redirect_callers,
> +                                       info ? info->tree_map : NULL,
> +                                       info ? info->param_adjustments : NULL,
> +                                       suffix, cl_num++);
> +      cgraph_node *orig_node = cnode->clone_of;
> +
> +      if (cl_node->next_sibling_clone)
> +        cl_node->next_sibling_clone->prev_sibling_clone
> +          = cl_node->prev_sibling_clone;
> +      if (cl_node->prev_sibling_clone)
> +        cl_node->prev_sibling_clone->next_sibling_clone
> +          = cl_node->next_sibling_clone;
> +      else
> +        cnode->clones = cl_node->next_sibling_clone;
> +
> +      cl_node->next_sibling_clone = orig_node->clones;
> +      cl_node->prev_sibling_clone = NULL;
> +      orig_node->clones->prev_sibling_clone = cl_node;
> +      orig_node->clones = cl_node;
> +      cl_node->clone_of = orig_node;

Is this done to avoid clones of clones?  That should be allowed, even though
we probably do not produce them.  Where does it break?  Overall it looks like
something that should be done by create_virtual_clone or a wrapper of it,
rather than being part of an IPA pass.

> +    }
> +  else
> +    {
> +      tree old_decl = cnode->decl;
> +      tree new_decl = copy_node (old_decl);
> +
> +      /* Generate a new name for the new version.  */
> +      const char *name = IDENTIFIER_POINTER (DECL_NAME (old_decl));
> +      DECL_NAME (new_decl) = clone_function_name (name, suffix, cl_num);
> +      SET_DECL_ASSEMBLER_NAME (new_decl,
> +                               clone_function_name (old_decl, suffix, cl_num));
> +      cl_num++;
> +      if (dump_file)
> +        fprintf (dump_file, "\tNew name %s\n",
> +                 IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (new_decl)));
> +
> +      SET_DECL_RTL (new_decl, NULL);
> +      DECL_EXTERNAL (new_decl) = 0;
> +      TREE_PUBLIC (new_decl) = 0;
> +      DECL_COMDAT (new_decl) = 0;
> +      DECL_WEAK (new_decl) = 0;
> +      DECL_VIRTUAL_P (new_decl) = 0;
> +      DECL_STATIC_CONSTRUCTOR (new_decl) = 0;
> +      DECL_STATIC_DESTRUCTOR (new_decl) = 0;
> +      DECL_SET_IS_OPERATOR_NEW (new_decl, 0);
> +      DECL_SET_IS_OPERATOR_DELETE (new_decl, 0);
> +      DECL_IS_REPLACEABLE_OPERATOR (new_decl) = 0;

There is node->make_decl_local that should do all this, but it also should
happen automatically when you call create_clone; set_new_clone_decl_and_node_flags
does that.

> +
> +      cl_node = cnode->create_clone (new_decl, cnode->count /*profile_count*/,

I don't think the cloned node should have cnode->count, since that means that
all execution goes to the new node.  I think you need to compute the sum of
the counts of all callers and use that here, also taking care of
self-recursion.
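Something along these lines is what I mean (just an untested sketch; it
assumes "redirect_callers" ends up holding exactly the edges that are
redirected to the new clone, and self-recursive edges would still need extra
handling):

  /* Untested sketch: give the clone only the profile mass of the calls
     that are actually redirected to it, instead of the whole cnode->count.
     Self-recursive edges are skipped here and would need extra care.  */
  profile_count clone_count = profile_count::zero ();
  for (cgraph_edge *e : redirect_callers)
    if (e->caller != cnode)
      clone_count += e->count;
  /* ... and pass clone_count rather than cnode->count to create_clone.  */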
> +                                     true /*update_original*/, redirect_callers,
> +                                     false /*call_duplication_hook*/,
> +                                     NULL /*new_inlined_to*/,
> +                                     NULL /*param_adjustments*/, suffix);
> +
> +      cl_node->lowered = true;
> +      cl_node->externally_visible = 0;
> +      cl_node->local = 1;
> +      cl_node->semantic_interposition = 0;

This is also something that should be part of the cloning machinery.

> +/* Redirect recursive edges of CLONE to correctly point to CLONE.  As part
> +   of the cloning process, all callee edges of a node are just duplicated
> +   but not redirected.  Therefore, these edges still call the original of
> +   CLONE.
> +
> +   For non-inlined CLONEs, NEW_CALLEE == CLONE and ORIG_CALLEE is CLONE's
> +   original node.
> +
> +   For an inlined node, self-recursion to CLONE's original is handled as in
> +   the non-inlined case; additionally, calls to CLONE->inlined_to are also
> +   recursive:
> +   NEW_CALLEE == CLONE->inlined_to and
> +   ORIG_CALLEE == original node of CLONE->inlined_to.  */
> +
> +static void
> +adjust_recursive_callees (cgraph_node *clone, cgraph_node *new_callee,
> +                          cgraph_node *orig_callee)
> +{
> +  cgraph_node *alias = NULL;
> +  for (cgraph_edge *e = clone->callees; e; e = e->next_callee)
> +    {
> +      if (!e->inline_failed)
> +        continue;
> +
> +      /* Only self-cycles or local aliases are handled.  */
> +      cgraph_node *callee = e->callee;
> +      if (callee == orig_callee)
> +        {
> +          cgraph_node **cl = node_to_clone.get (orig_callee);
> +          gcc_assert (cl && *cl == new_callee);
> +          e->redirect_callee_duplicating_thunks (new_callee);
> +          if (dump_file)
> +            fprintf (dump_file, "recursive call from %s to %s orig %s\n",
> +                     e->caller->dump_asm_name (), e->callee->dump_asm_name (),
> +                     callee->dump_asm_name ());
> +        }
> +      else if (callee->alias
> +               && e->callee->ultimate_alias_target () == orig_callee)
> +        {
> +          if (!alias)
> +            {
> +              alias = dyn_cast<cgraph_node *> (
> +                new_callee->noninterposable_alias ());
> +            }
> +          e->redirect_callee_duplicating_thunks (alias);
> +          if (dump_file)
> +            fprintf (dump_file, "recursive call from %s to %s orig %s\n",
> +                     e->caller->dump_asm_name (), e->callee->dump_asm_name (),
> +                     callee->dump_asm_name ());
> +        }
> +    }
> +  new_callee->expand_all_artificial_thunks ();
> +  if (alias)
> +    alias->expand_all_artificial_thunks ();

Why do you need to expand the thunks here?

> +}
> +
> +/* Create clones for CALLER's inlined callees.  ORIG_INLINED_TO is the
> +   original node from clone_as_needed () such that new_inlined_to is a
> +   clone of it.  */
> +
> +static void
> +inline_clones (cgraph_node *caller, cgraph_node *orig_inlined_to)
> +{
> +  struct cgraph_edge *edge;
> +  for (edge = caller->callees; edge; edge = edge->next_callee)
> +    {
> +      struct cgraph_node *callee = edge->callee;
> +      if (edge->inline_failed)
> +        continue;
> +
> +      if (callee->inlined_to != orig_inlined_to)
> +        continue;
> +
> +      struct cgraph_node *new_inlined_to, *cl;
> +      if (caller->inlined_to)
> +        new_inlined_to = caller->inlined_to;
> +      else
> +        new_inlined_to = caller;
> +
> +      cl = callee->create_clone (callee->decl,
> +                                 callee->count, // profile_count
> +                                 true, // update_original
> +                                 vNULL,
> +                                 false, // call_duplication_hook
> +                                 new_inlined_to, // new_inlined_to
> +                                 NULL, // param_adjustments
> +                                 "locality_clone"); // suffix
> +      edge->redirect_callee (cl);
> +
> +      cl->lowered = true;
> +      cl->externally_visible = 0;
> +      cl->local = 1;
> +      cl->semantic_interposition = 0;

It should not be necessary to tamper with those flags.
> +      node_to_clone.put (callee, cl);
> +      clone_to_node.put (cl, callee);
> +
> +      if (callee->thunk)
> +        {
> +          thunk_info *info = thunk_info::get (callee);
> +          *thunk_info::get_create (cl) = *info;
> +        }
> +
> +      adjust_recursive_callees (cl, new_inlined_to, orig_inlined_to);
> +      adjust_recursive_callees (cl, cl, callee);
> +      if (dump_file)
> +        {
> +          fprintf (dump_file, "Inline cloned\n");
> +          cl->dump (dump_file);
> +        }
> +
> +      /* Recursively inline till the end of this callchain.  */
> +      inline_clones (cl, orig_inlined_to);
> +    }
> +}

Can't you just use clone_inlined_nodes to do this job?  It also takes care of
updating the profile.

> +
> +/* Clone EDGE->CALLEE if it or a clone of it is not already in PARTITION.
> +   Redirect all callers of EDGE->CALLEE that are in PARTITION, not just
> +   EDGE.  If a clone is already present in PARTITION, redirect all edges
> +   from EDGE->CALLER to EDGE->CALLEE.  This is because we only visit one
> +   edge per caller to callee and redirect for all others from there.
> +
> +   If cloning, also recursively clone inlined functions till the end of the
> +   callchain, because inlined clones have a 1-1 exclusive copy and edge from
> +   caller to inlined node.
> +
> +   There are 2 flows possible:
> +   1. Only redirect
> +      1.1. cnode is already in the current partition - cnode mustn't be a
> +           locality_clone -> nothing to do

Probably a profile update is needed here too, right?

> +
> +static cgraph_node *
> +clone_node_as_needed (cgraph_edge *edge, locality_partition partition,
> +                      int &cl_num)
> +{
> +  struct cgraph_node *cnode = edge->callee;

You probably want ultimate_alias_target here, and also watch for
interposability somewhere, since such edges cannot be safely redirected.

> +  struct cgraph_node *caller = edge->caller;
> +
> +/* Accumulate frequency of all edges from EDGE->caller to EDGE->callee.  */
> +
> +static sreal
> +accumulate_incoming_edge_frequency (cgraph_edge *edge)
> +{
> +  sreal count = 0;
> +  struct cgraph_edge *e;
> +  for (e = edge->callee->callers; e; e = e->next_caller)

Probably edge->caller->callees?

> +    {
> +      /* Make a local decision about all edges for EDGE->caller but not the
> +         other nodes already in the partition.  Their edges will be visited
> +         later or may have been visited before and not for the cut-off
> +         criteria.  */
> +      if (e->caller == edge->caller)
> +        {
> +          profile_count caller_count = e->caller->inlined_to
> +                                         ? e->caller->inlined_to->count
> +                                         : e->caller->count;
> +          if (e->count.compatible_p (caller_count))
> +            count += e->sreal_frequency ();
> +        }

Why do you have the compatibility check here?

> +    }
> +  return count;
> +}
> +
> +/* Determine if EDGE->CALLEE is suitable for cloning.  It is assumed that
> +   the callee is not an inlined node.  */
> +
> +static bool
> +suitable_for_locality_cloning_p (cgraph_edge *edge,
> +                                 lto_locality_cloning_model cm)
> +{
> +  cgraph_node *node = edge->callee;
> +  if (!node->versionable)
> +    return false;
> +
> +  if (!node->can_change_signature)
> +    return false;

You are not really changing signatures, so perhaps this check can be dropped?

> +
> +  if (node->clone_of)
> +    return false;

But there is logic for cloning clones above?

> +
> +  if (edge->recursive_p ())
> +    return false;
> +
> +  if (!node->definition)
> +    return false;
> +
> +  if (cm == LTO_LOCALITY_NON_INTERPOSABLE_CLONING
> +      && node->get_availability () == AVAIL_INTERPOSABLE)
> +    return false;

You want to test this on an edge basis (by getting the ultimate alias target).
Interposability changes via aliases and other means.
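I.e. something along these lines (untested sketch, reusing the edge and cm
arguments of suitable_for_locality_cloning_p):

  /* Untested sketch: resolve the callee through aliases and test the
     availability of the real target from the caller's point of view,
     instead of asking the callee node directly.  */
  enum availability avail;
  cgraph_node *target = edge->callee->ultimate_alias_target (&avail,
                                                             edge->caller);
  if (!target->definition || !target->versionable)
    return false;
  if (cm == LTO_LOCALITY_NON_INTERPOSABLE_CLONING
      && avail <= AVAIL_INTERPOSABLE)
    return false;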
> +static void
> +partition_callchain (cgraph_edge *edge, locality_partition partition,
> +                     bool clone_further_p,
> +                     lto_locality_cloning_model cloning_model,
> +                     double freq_cutoff, int size, int &cl_num)
> +{
> +  cgraph_node *node = edge->callee;

Probably ultimate_alias_target here too...

> +
> +/* Determine the order of all external nodes if a PGO profile is available.
> +   Store the order in ORDER.  */
> +
> +static bool
> +locality_determine_ipa_order (auto_vec<locality_order *> *order)
> +{
> +  struct cgraph_node *node;
> +  auto_vec<locality_order *> non_comparable_nodes;
> +  FOR_EACH_DEFINED_FUNCTION (node)
> +    if (node->get_partitioning_class () == SYMBOL_PARTITION)
> +      {
> +        if (node->no_reorder)
> +          {
> +            if (dump_file)
> +              fprintf (dump_file, "no reorder %s\n", node->dump_asm_name ());
> +            return false;
> +          }
> +        else if (!node->callers)

Here you may also need to walk aliases to see if there are callers.

> +/* Add all references of NODE into PARTITION.  */
> +
> +static void
> +add_node_references_to_partition (ltrans_partition partition,
> +                                  symtab_node *node)

We could also refactor this so that it can be used by other partitioners.
Balanced partitioning has similar logic in it, even though it iterates only
over references.

> +{
> +  struct ipa_ref *ref = NULL;
> +  varpool_node *vnode;
> +  for (int j = 0; node->iterate_reference (j, ref); j++)
> +    if (is_a <varpool_node *> (ref->referred))
> +      {
> +        vnode = dyn_cast <varpool_node *> (ref->referred);
> +        if (!symbol_partitioned_p (vnode)
> +            && !vnode->no_reorder
> +            && vnode->get_partitioning_class () == SYMBOL_PARTITION)
> +          {
> +            add_symbol_to_partition (partition, vnode);
> +            if (dump_file)
> +              fprintf (dump_file, "Varpool Node: %s\n",
> +                       vnode->dump_asm_name ());
> +            add_node_references_to_partition (partition, vnode);
> +          }
> +      }
> +
> +  for (int j = 0; node->iterate_referring (j, ref); j++)
> +    if (is_a <varpool_node *> (ref->referring))
> +      {
> +        vnode = dyn_cast <varpool_node *> (ref->referring);
> +        gcc_assert (vnode->definition);
> +        if (!symbol_partitioned_p (vnode)
> +            && !vnode->no_reorder
> +            && !vnode->can_remove_if_no_refs_p ()
> +            && vnode->get_partitioning_class () == SYMBOL_PARTITION)
> +          {
> +            add_symbol_to_partition (partition, vnode);
> +            if (dump_file)
> +              fprintf (dump_file, "Varpool Node: %s\n",
> +                       vnode->dump_asm_name ());
> +            add_node_references_to_partition (partition, vnode);
> +          }
> +      }
> +
> +  if (cgraph_node *cnode = dyn_cast <cgraph_node *> (node))
> +    {
> +      struct cgraph_edge *e;
> +
> +      /* Add all inline clones and callees that are duplicated.  */
> +      for (e = cnode->callees; e; e = e->next_callee)
> +        if (e->callee->get_partitioning_class () == SYMBOL_DUPLICATE)
> +          add_node_references_to_partition (partition, e->callee);
> +
> +      /* Add all thunks associated with the function.  */
> +      for (e = cnode->callers; e; e = e->next_caller)
> +        if (e->caller->thunk && !e->caller->inlined_to)
> +          add_node_references_to_partition (partition, e->caller);
> +    }
> +
> +}

I am sorry for getting to this patch so late.

Honza