> validate_ipa_reorder_locality_lto_partition (opts, opts_set);

I know this patch has already been merged into trunk, but I think the code change in opts.cc quoted below is questionable: it would completely override any user-specified partition model. Suppose, for example, that a user wants a traditional all-in-one LTO compilation with "-flto-partition=none" and without "-fipa-reorder-for-locality".
> if (opts_set->x_flag_lto_partition != LTO_PARTITION_DEFAULT)
>   opts_set->x_flag_lto_partition = opts->x_flag_lto_partition =
>     LTO_PARTITION_BALANCED;

Regards,
Feng

________________________________________
From: Kyrylo Tkachov <ktkac...@nvidia.com>
Sent: Saturday, November 16, 2024 1:04 AM
To: GCC Patches
Cc: Jan Hubicka; Martin Jambor; Richard Biener
Subject: [PATCH] Introduce -flto-partition=locality

Hi all,

This is a patch submission following up from the RFC at:
https://gcc.gnu.org/pipermail/gcc/2024-November/245076.html
The patch is rebased and retested against current trunk, with some debugging code removed, comments improved and some fixes added as we've done more testing.

------------------------>8-----------------------------------------------------
Implement partitioning and cloning in the callgraph to help locality.
A new -flto-partition=locality flag is used to enable this.
The optimization has two components:
* Partitioning the callgraph so as to group callers and callees that frequently call each other in the same partition
* Cloning functions that straddle multiple callchains and allowing each clone to be local to the partition of its callchain.

The majority of the logic is in the new IPA pass in ipa-locality-cloning.cc. It creates a partitioning plan and does the prerequisite cloning. The partitioning is then implemented during the existing LTO partitioning pass.

To guide these locality heuristics we use PGO data. In the absence of PGO data we use a static heuristic that uses the accumulated estimated edge frequencies of the callees for each function to guide the reordering.
We are investigating some more elaborate static heuristics, in particular using the demangled C++ names to group template instantiations together. This is promising, but we are currently working out some kinks in the implementation and want to send that out as a follow-up once we're more confident in it.
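
As a rough illustration of the static heuristic described above, here is a minimal standalone sketch. It is not the code from ipa-locality-cloning.cc: the Edge and Function types and both helper functions are hypothetical names invented for this example, and it assumes the heuristic simply sums the estimated frequencies of each function's outgoing call edges and orders functions from the highest sum to the lowest.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy types; these are NOT GCC's cgraph_node / cgraph_edge.
struct Edge {
  int callee;        // index of the called function in the function list
  double frequency;  // estimated calls per invocation of the caller
};

struct Function {
  std::string name;
  std::vector<Edge> callees;
};

// Sum of the estimated frequencies of all outgoing call edges of F.
double accumulated_callee_frequency(const Function &f) {
  double total = 0.0;
  for (const Edge &e : f.callees)
    total += e.frequency;
  return total;
}

// Return function indices ordered from the highest accumulated callee
// frequency to the lowest.  Ties keep their original relative order.
std::vector<int> order_by_accumulated_frequency(const std::vector<Function> &fns) {
  std::vector<int> order(fns.size());
  for (std::size_t i = 0; i < fns.size(); ++i)
    order[i] = static_cast<int>(i);
  std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
    return accumulated_callee_frequency(fns[a]) >
           accumulated_callee_frequency(fns[b]);
  });
  return order;
}
```

The real pass works on GCC's callgraph and also has to take partition size limits and cloning into account; the sketch only shows the "accumulate, then order" idea.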

A new bootstrap-lto-locality bootstrap config is added that allows us to test this on GCC itself with either static or PGO heuristics. GCC bootstraps with both (normal LTO bootstrap and profiledbootstrap).

With this optimization we are seeing good performance gains on some large internal workloads that stress the parts of the processor that are sensitive to code locality, but we'd appreciate wider performance evaluation.

Bootstrapped and tested on aarch64-none-linux-gnu.
Ok for mainline?
Thanks,
Kyrill

Signed-off-by: Prachi Godbole <pgodb...@nvidia.com>
Co-authored-by: Kyrylo Tkachov <ktkac...@nvidia.com>

config/ChangeLog:

        * bootstrap-lto-locality.mk: New file.

gcc/ChangeLog:

        * Makefile.in (OBJS): Add ipa-locality-cloning.o.
        (GTFILES): Add ipa-locality-cloning.cc dependency.
        * common.opt (lto_partition_model): Add locality value.
        * flag-types.h (lto_partition_model): Add LTO_PARTITION_LOCALITY value.
        (enum lto_locality_cloning_model): Define.
        * lto-cgraph.cc (lto_set_symtab_encoder_in_partition): Add dumping of
        node and index.
        * params.opt (lto_locality_cloning_model): New enum.
        (lto-partition-locality-cloning): New param.
        (lto-partition-locality-frequency-cutoff): Likewise.
        (lto-partition-locality-size-cutoff): Likewise.
        (lto-max-locality-partition): Likewise.
        * passes.def: Add pass_ipa_locality_cloning.
        * timevar.def (TV_IPA_LC): New timevar.
        * tree-pass.h (make_pass_ipa_locality_cloning): Declare.
        * ipa-locality-cloning.cc: New file.
        * ipa-locality-cloning.h: New file.

gcc/lto/ChangeLog:

        * lto-partition.cc: Include ipa-locality-cloning.h.
        (add_node_references_to_partition): Define.
        (create_partition): Likewise.
        (lto_locality_map): Likewise.
        (lto_promote_cross_file_statics): Add extra dumping.
        * lto-partition.h (lto_locality_map): Declare.
        * lto.cc (do_whole_program_analysis): Handle LTO_PARTITION_LOCALITY.