Ping.

Thanks,
Kyrill

> On 6 Mar 2025, at 09:25, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>
> Hi all,
>
> Implement partitioning and cloning in the callgraph to help locality.
> A new -fipa-reorder-for-locality flag is used to enable this.
> The optimization has two components:
> * Partitioning the callgraph so as to group callers and callees that
>   frequently call each other in the same partition.
> * Cloning functions that straddle multiple callchains and allowing each
>   clone to be local to the partition of its callchain.
>
> The majority of the logic is in the new IPA pass in ipa-locality-cloning.cc.
> It creates a partitioning plan and does the prerequisite cloning.
> The partitioning is then implemented during the existing LTO partitioning
> pass.
>
> To guide these locality heuristics we use PGO data.
> In the absence of PGO data we use a static heuristic that guides the
> reordering by the accumulated estimated edge frequencies of each
> function's callees.
> We are investigating some more elaborate static heuristics, in particular
> using the demangled C++ names to group template instantiations together.
> This is promising, but we are still working out some kinks in the
> implementation and want to send that out as a follow-up once we're more
> confident in it.
>
> A new bootstrap-lto-locality bootstrap config is added that allows us to
> test this on GCC itself with either static or PGO heuristics.
> GCC bootstraps with both (normal LTO bootstrap and profiledbootstrap).
>
> As this new pass enables a new partitioning scheme, it is incompatible with
> explicit -flto-partition= options, so an error is introduced when the user
> uses both flags explicitly.
>
> With this optimization we are seeing good performance gains on some large
> internal workloads that stress the parts of the processor that are
> sensitive to code locality, but we'd appreciate wider performance
> evaluation.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for mainline?
> Thanks,
> Kyrill
>
> Signed-off-by: Prachi Godbole <pgodb...@nvidia.com>
> Co-authored-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>
> config/ChangeLog:
>
> * bootstrap-lto-locality.mk: New file.
>
> gcc/ChangeLog:
>
> * Makefile.in (OBJS): Add ipa-locality-cloning.o.
> * cgraph.h (set_new_clone_decl_and_node_flags): Declare prototype.
> * cgraphclones.cc (set_new_clone_decl_and_node_flags): Remove static
>   qualifier.
> * common.opt (fipa-reorder-for-locality): New flag.
>   (LTO_PARTITION_DEFAULT): Declare.
>   (flto-partition): Change default to LTO_PARTITION_DEFAULT.
> * doc/invoke.texi: Document -fipa-reorder-for-locality.
> * flag-types.h (enum lto_locality_cloning_model): Declare.
>   (lto_partitioning_model): Add LTO_PARTITION_DEFAULT.
> * lto-cgraph.cc (lto_set_symtab_encoder_in_partition): Add dumping of
>   node and index.
> * opts.cc (validate_ipa_reorder_locality_lto_partition): Define.
>   (finish_options): Handle LTO_PARTITION_DEFAULT.
> * params.opt (lto_locality_cloning_model): New enum.
>   (lto-partition-locality-cloning): New param.
>   (lto-partition-locality-frequency-cutoff): Likewise.
>   (lto-partition-locality-size-cutoff): Likewise.
>   (lto-max-locality-partition): Likewise.
> * passes.def: Register pass_ipa_locality_cloning.
> * timevar.def (TV_IPA_LC): New timevar.
> * tree-pass.h (make_pass_ipa_locality_cloning): Declare.
> * ipa-locality-cloning.cc: New file.
> * ipa-locality-cloning.h: New file.
>
> gcc/lto/ChangeLog:
>
> * lto-partition.cc (add_node_references_to_partition): Define.
>   (create_partition): Likewise.
>   (lto_locality_map): Likewise.
>   (lto_promote_cross_file_statics): Add extra dumping.
> * lto-partition.h (lto_locality_map): Declare prototype.
> * lto.cc (do_whole_program_analysis): Handle
>   flag_ipa_reorder_for_locality.
>
> <0001-Locality-cloning-pass-was-Introduce-flto-partition-l.patch>
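
For anyone skimming the thread, the two components described in the quoted message (grouping hot caller/callee pairs, then cloning functions that straddle callchains) can be modelled roughly as below. This is only an illustrative Python sketch under stated assumptions, not the code in ipa-locality-cloning.cc. The greedy hottest-edge-first merging is a stand-in technique chosen for brevity. The names partition_callgraph, plan_clones, max_size, freq_cutoff and size_cutoff are hypothetical; the two cutoffs merely echo the new params, whose real semantics are not spelled out in the message.

```python
def partition_callgraph(edges, sizes, max_size):
    """Greedily merge the endpoints of the hottest call edges.

    edges: {(caller, callee): frequency}; sizes: {function: size}.
    Returns {function: partition id}, where the id is a representative
    function name (hypothetical scheme, for illustration only).
    """
    parent = {f: f for f in sizes}
    part_size = dict(sizes)

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]  # path halving
            f = parent[f]
        return f

    # Hottest edges first; ties are broken by name for determinism.
    for (caller, callee), _freq in sorted(
        edges.items(), key=lambda kv: (-kv[1], kv[0])
    ):
        a, b = find(caller), find(callee)
        # Merge only while the combined partition stays under the size cap.
        if a != b and part_size[a] + part_size[b] <= max_size:
            parent[b] = a
            part_size[a] += part_size[b]

    return {f: find(f) for f in sizes}


def plan_clones(edges, sizes, partition, freq_cutoff, size_cutoff):
    """List (callee, partition) pairs where a local clone looks worthwhile.

    A callee is a clone candidate for a caller's partition when the call
    crosses partitions, is hot enough, and the callee is small enough.
    """
    clones = set()
    for (caller, callee), freq in edges.items():
        crosses = partition[caller] != partition[callee]
        if crosses and freq >= freq_cutoff and sizes[callee] <= size_cutoff:
            clones.add((callee, partition[caller]))
    return sorted(clones)


# Toy callgraph: frequencies could come from PGO or a static estimate.
edges = {
    ("main", "parse"): 100,
    ("parse", "lex"): 90,
    ("main", "emit"): 80,
    ("emit", "fmt"): 70,
    ("parse", "fmt"): 60,
}
sizes = {"main": 10, "parse": 30, "lex": 20, "emit": 30, "fmt": 5}

partition = partition_callgraph(edges, sizes, max_size=60)
clones = plan_clones(edges, sizes, partition, freq_cutoff=50, size_cutoff=10)
```

In this toy input, main, parse and lex end up in one partition and emit and fmt in another. fmt is small and is also called from parse across the partition boundary, so it is the one function proposed for a local clone. Whether frequencies come from PGO or from the static heuristic only changes the numbers fed in as edges, not the shape of the sketch.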