Hi Honza,

> On 13 Apr 2025, at 23:19, Jan Hubicka <hubi...@ucw.cz> wrote:
> 
>> +@opindex fipa-reorder-for-locality
>> +@item -fipa-reorder-for-locality
>> +Group call chains close together in the binary layout to improve code code
>> +locality.  This option is incompatible with an explicit
>> +@option{-flto-partition=} option since it enforces a custom partitioning
>> +scheme.
> 
> Please also cross-link this with -fprofile-reorder-functions and
> -freorder-functions, which does similar thing.
> If you see how to clean-up the description of the other two so user is
> not confused. 
> 
> Perhaps say that -freorder-functions only partitions functions into
> never-executed/cold/normal/hot and -fprofile-reroder-functions is aiming
> for program startup optimization (it reorders by measured first time the
> function is executed.  By accident it seems to kind of work for
> locality.

Yeah, the option names are quite similar aren't they?
I’ve attempted to disambiguate them a bit in their description.
I’m attaching a diff from the previous version (as the full updated patch) to 
make it easier to see what’s adjusted.


> 
>> +
>> +/* Helper function of to accumulate call counts.  */
>> +static bool
>> +accumulate_profile_counts_after_cloning (cgraph_node *node, void *data)
>> +{
>> +  struct profile_stats *stats = (struct profile_stats *) data;
>> +  for (cgraph_edge *e = node->callers; e; e = e->next_caller)
>> +    {
>> +      if (e->caller == stats->target)
>> + {
>> +  if (stats->rec_count.compatible_p (e->count.ipa ()))
>> +    stats->rec_count += e->count.ipa ();
>> + }
>> +      else
>> + {
>> +  if (stats->nonrec_count.compatible_p (e->count.ipa ()))
>> +    stats->nonrec_count += e->count.ipa ();
>> + }
> In case part of profile is missing (which may happen if one unit has -O0
> or so) , we may have counts to be uninitialized. Uninitialized counts are
> compatible with everything, but any arithmetics with it will produce
> uninitialized result which will likely confuse code later.  So I would
> skip edges with uninitialized counts.
> 
> On the other hand ipa counts are always compatible, so compatible_p
> should be redundat. Main reaosn for existence of compatible_p is that we
> can have local profiles that are 0 or unknown at IPA level.  The ipa ()
> conversion turns all counts into IPA counts and those are compatible
> with each other.
> 
> I suppose compatibe_p test is there since the code ICEd in past,but I
> think it was because of missing ipa() conversion.
> 
> 
>> +    }
>> +  return false;
>> +}
>> +
>> +/* NEW_NODE is a previously created clone of ORIG_NODE already present in
>> +   current partition.  EDGES contains newly redirected edges to NEW_NODE.
>> +   Adjust profile information for both nodes and the edge.  */
>> +
>> +static void
>> +adjust_profile_info_for_non_self_rec_edges (auto_vec<cgraph_edge *> &edges,
>> +    cgraph_node *new_node,
>> +    cgraph_node *orig_node)
>> +{
>> +  profile_count orig_node_count = orig_node->count.ipa ();
>> +  profile_count edge_count = profile_count::zero ();
>> +  profile_count final_new_count = profile_count::zero ();
>> +  profile_count final_orig_count = profile_count::zero ();
>> +
>> +  for (unsigned i = 0; i < edges.length (); ++i)
>> +    edge_count += edges[i]->count.ipa ();
> Here I would again skip uninitialized.  It is probably legal for -O0
> function to end up in partition.
>> +
>> +  final_orig_count = orig_node_count - edge_count;
>> +
>> +  /* NEW_NODE->count was adjusted for other callers when the clone was
>> +     first created.  Just add the new edge count.  */
>> +  if (new_node->count.compatible_p (edge_count))
>> +    final_new_count = new_node->count + edge_count;
> And here compatible_p should be unnecesary.
>> +/* Accumulate frequency of all edges from EDGE->caller to EDGE->callee.  */
>> +
>> +static sreal
>> +accumulate_incoming_edge_frequency (cgraph_edge *edge)
>> +{
>> +  sreal count = 0;
>> +  struct cgraph_edge *e;
>> +  for (e = edge->callee->callers; e; e = e->next_caller)
>> +    {
>> +      /* Make a local decision about all edges for EDGE->caller but not the
>> + other nodes already in the partition.  Their edges will be visited
>> + later or may have been visited before and not fit the
>> + cut-off criteria.  */
>> +      if (e->caller == edge->caller)
>> + {
>> +  profile_count caller_count = e->caller->inlined_to
>> + ? e->caller->inlined_to->count
>> + : e->caller->count;
>> +  if (e->count.compatible_p (caller_count))
> Here again compatiblity check should not be necessary, since the counts
> belong to one function body (after inlining) and should be compatible.
> inliner calls e->sreal_frequency all the time withotu further checks.
> 

Yeah, I’ve adjusted these uses and used checks for initialized_p where you 
highlighted.
Indeed it seems to work with a profiled bootstrap and the ICEs we were seeing 
earlier in development aren’t appearing.

> Patch is OK with these changes. I apologize for letting the review slip
> for so long.  I was setting up Firefox testing and LLVM builds to gather
> some data that took longer than I hoped for.  On Firefox and LLVM on zen
> I can measure some improvements via instruction cache perofmrance
> counters, but the overall benefit seems to be close to noise, but this
> is likely very CPU specfic. Overall code locality is one of main missing
> parts of the LTO framework.  As discussed on Cauldron, I think next
> stage1 we can untie this from partitining algorithm, but that needs more
> work on linker side as well as on gas fragments, so I think it is a good
> idea to move with this patch as it is and improve from it.

Thank you very much for the evaluation and the feedback! I agree the effect can 
be very CPU-specific.

> 
> I think the patch is modifying almost no code that is run w/o the
> -fipa-reorder-for-locality so I hope it is safe for 15.1, but I would
> like to let Richi and Jakub to comment on this.

Thanks a lot for your reviews yet again. They were very helpful.
I’ve updated the patch as per your suggestion and did a profiled lto bootstrap 
with the new bootstrap-lto-locality.mk that exercises this.
All looks good on aarch64-none-linux-gnu.
Richi, Jakub, is it okay for trunk now given Honza’s comments?


> 
> Honza

Attachment: 0001-Locality-cloning-pass-fipa-reorder-for-locality.patch
Description: 0001-Locality-cloning-pass-fipa-reorder-for-locality.patch

Attachment: locality-cloning-updates.patch
Description: locality-cloning-updates.patch

Reply via email to