Hi Honza, > On 13 Apr 2025, at 23:19, Jan Hubicka <hubi...@ucw.cz> wrote: > >> +@opindex fipa-reorder-for-locality >> +@item -fipa-reorder-for-locality >> +Group call chains close together in the binary layout to improve code code >> +locality. This option is incompatible with an explicit >> +@option{-flto-partition=} option since it enforces a custom partitioning >> +scheme. > > Please also cross-link this with -fprofile-reorder-functions and > -freorder-functions, which does similar thing. > If you see how to clean-up the description of the other two so user is > not confused. > > Perhaps say that -freorder-functions only partitions functions into > never-executed/cold/normal/hot and -fprofile-reroder-functions is aiming > for program startup optimization (it reorders by measured first time the > function is executed. By accident it seems to kind of work for > locality.
Yeah, the option names are quite similar aren't they? I’ve attempted to disambiguate them a bit in their description. I’m attaching a diff from the previous version (as the full updated patch) to make it easier to see what’s adjusted. > >> + >> +/* Helper function of to accumulate call counts. */ >> +static bool >> +accumulate_profile_counts_after_cloning (cgraph_node *node, void *data) >> +{ >> + struct profile_stats *stats = (struct profile_stats *) data; >> + for (cgraph_edge *e = node->callers; e; e = e->next_caller) >> + { >> + if (e->caller == stats->target) >> + { >> + if (stats->rec_count.compatible_p (e->count.ipa ())) >> + stats->rec_count += e->count.ipa (); >> + } >> + else >> + { >> + if (stats->nonrec_count.compatible_p (e->count.ipa ())) >> + stats->nonrec_count += e->count.ipa (); >> + } > In case part of profile is missing (which may happen if one unit has -O0 > or so) , we may have counts to be uninitialized. Uninitialized counts are > compatible with everything, but any arithmetics with it will produce > uninitialized result which will likely confuse code later. So I would > skip edges with uninitialized counts. > > On the other hand ipa counts are always compatible, so compatible_p > should be redundat. Main reaosn for existence of compatible_p is that we > can have local profiles that are 0 or unknown at IPA level. The ipa () > conversion turns all counts into IPA counts and those are compatible > with each other. > > I suppose compatibe_p test is there since the code ICEd in past,but I > think it was because of missing ipa() conversion. > > >> + } >> + return false; >> +} >> + >> +/* NEW_NODE is a previously created clone of ORIG_NODE already present in >> + current partition. EDGES contains newly redirected edges to NEW_NODE. >> + Adjust profile information for both nodes and the edge. */ >> + >> +static void >> +adjust_profile_info_for_non_self_rec_edges (auto_vec<cgraph_edge *> &edges, >> + cgraph_node *new_node, >> + cgraph_node *orig_node) >> +{ >> + profile_count orig_node_count = orig_node->count.ipa (); >> + profile_count edge_count = profile_count::zero (); >> + profile_count final_new_count = profile_count::zero (); >> + profile_count final_orig_count = profile_count::zero (); >> + >> + for (unsigned i = 0; i < edges.length (); ++i) >> + edge_count += edges[i]->count.ipa (); > Here I would again skip uninitialized. It is probably legal for -O0 > function to end up in partition. >> + >> + final_orig_count = orig_node_count - edge_count; >> + >> + /* NEW_NODE->count was adjusted for other callers when the clone was >> + first created. Just add the new edge count. */ >> + if (new_node->count.compatible_p (edge_count)) >> + final_new_count = new_node->count + edge_count; > And here compatible_p should be unnecesary. >> +/* Accumulate frequency of all edges from EDGE->caller to EDGE->callee. */ >> + >> +static sreal >> +accumulate_incoming_edge_frequency (cgraph_edge *edge) >> +{ >> + sreal count = 0; >> + struct cgraph_edge *e; >> + for (e = edge->callee->callers; e; e = e->next_caller) >> + { >> + /* Make a local decision about all edges for EDGE->caller but not the >> + other nodes already in the partition. Their edges will be visited >> + later or may have been visited before and not fit the >> + cut-off criteria. */ >> + if (e->caller == edge->caller) >> + { >> + profile_count caller_count = e->caller->inlined_to >> + ? e->caller->inlined_to->count >> + : e->caller->count; >> + if (e->count.compatible_p (caller_count)) > Here again compatiblity check should not be necessary, since the counts > belong to one function body (after inlining) and should be compatible. > inliner calls e->sreal_frequency all the time withotu further checks. > Yeah, I’ve adjusted these uses and used checks for initialized_p where you highlighted. Indeed it seems to work with a profiled bootstrap and the ICEs we were seeing earlier in development aren’t appearing. > Patch is OK with these changes. I apologize for letting the review slip > for so long. I was setting up Firefox testing and LLVM builds to gather > some data that took longer than I hoped for. On Firefox and LLVM on zen > I can measure some improvements via instruction cache perofmrance > counters, but the overall benefit seems to be close to noise, but this > is likely very CPU specfic. Overall code locality is one of main missing > parts of the LTO framework. As discussed on Cauldron, I think next > stage1 we can untie this from partitining algorithm, but that needs more > work on linker side as well as on gas fragments, so I think it is a good > idea to move with this patch as it is and improve from it. Thank you very much for the evaluation and the feedback! I agree the effect can be very CPU-specific. > > I think the patch is modifying almost no code that is run w/o the > -fipa-reorder-for-locality so I hope it is safe for 15.1, but I would > like to let Richi and Jakub to comment on this. Thanks a lot for your reviews yet again. They were very helpful. I’ve updated the patch as per your suggestion and did a profiled lto bootstrap with the new bootstrap-lto-locality.mk that exercises this. All looks good on aarch64-none-linux-gnu. Richi, Jakub, is it okay for trunk now given Honza’s comments? > > Honza
0001-Locality-cloning-pass-fipa-reorder-for-locality.patch
Description: 0001-Locality-cloning-pass-fipa-reorder-for-locality.patch
locality-cloning-updates.patch
Description: locality-cloning-updates.patch