Hi, the bootstrap-lto-locality build takes much longer than bootstrap-lto and plain bootstrap.

It also seems that stage2 and stage3 only produced 5 partitions in LTO; is that reasonable?

Also, could you tell us what the exact performance gain is?


make bootstrap:                            27m56.054s
make BUILD_CONFIG=bootstrap-lto:           38m25.048s
make BUILD_CONFIG=bootstrap-lto-locality:  71m1.882s


On 2025/4/15 22:38, Kyrylo Tkachov wrote:

On 15 Apr 2025, at 15:42, Richard Biener <richard.guent...@gmail.com> wrote:

On Mon, Apr 14, 2025 at 3:11 PM Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
Hi Honza,

On 13 Apr 2025, at 23:19, Jan Hubicka <hubi...@ucw.cz> wrote:

+@opindex fipa-reorder-for-locality
+@item -fipa-reorder-for-locality
+Group call chains close together in the binary layout to improve code
+locality.  This option is incompatible with an explicit
+@option{-flto-partition=} option since it enforces a custom partitioning
+scheme.
Please also cross-link this with -fprofile-reorder-functions and
-freorder-functions, which do a similar thing.
If you see how to clean up the description of the other two so the user is
not confused, please do so.

Perhaps say that -freorder-functions only partitions functions into
never-executed/cold/normal/hot, while -fprofile-reorder-functions aims at
program startup optimization (it reorders functions by the first time they
are measured to execute).  By accident it seems to kind of work for
locality.
Yeah, the option names are quite similar, aren't they?
I’ve attempted to disambiguate them a bit in their description.
I’m attaching a diff from the previous version (as well as the full updated patch) to 
make it easier to see what’s adjusted.


+
+/* Helper function to accumulate call counts.  */
+static bool
+accumulate_profile_counts_after_cloning (cgraph_node *node, void *data)
+{
+  struct profile_stats *stats = (struct profile_stats *) data;
+  for (cgraph_edge *e = node->callers; e; e = e->next_caller)
+    {
+      if (e->caller == stats->target)
+        {
+          if (stats->rec_count.compatible_p (e->count.ipa ()))
+            stats->rec_count += e->count.ipa ();
+        }
+      else
+        {
+          if (stats->nonrec_count.compatible_p (e->count.ipa ()))
+            stats->nonrec_count += e->count.ipa ();
+        }
In case part of the profile is missing (which may happen if one unit is
built with -O0 or so), we may have uninitialized counts.  Uninitialized
counts are compatible with everything, but any arithmetic with them will
produce an uninitialized result, which will likely confuse code later.
So I would skip edges with uninitialized counts (a sketch follows the
quoted hunk below).

On the other hand IPA counts are always compatible, so compatible_p
should be redundant.  The main reason for the existence of compatible_p
is that we can have local profiles that are 0 or unknown at the IPA
level.  The ipa () conversion turns all counts into IPA counts, and those
are compatible with each other.

I suppose the compatible_p test is there because the code ICEd in the
past, but I think that was because of a missing ipa () conversion.


+    }
+  return false;
+}
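
For illustration, a minimal sketch of what skipping uninitialized edge
counts could look like in this loop.  Names are taken from the quoted
patch, and the fragment assumes GCC's internal cgraph/profile-count APIs
as used there; it is not the committed code.

  for (cgraph_edge *e = node->callers; e; e = e->next_caller)
    {
      /* Skip edges whose IPA count is uninitialized, e.g. when part of
	 the profile is missing because a unit was built with -O0.  */
      profile_count cnt = e->count.ipa ();
      if (!cnt.initialized_p ())
	continue;
      if (e->caller == stats->target)
	stats->rec_count += cnt;
      else
	stats->nonrec_count += cnt;
    }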
+
+/* NEW_NODE is a previously created clone of ORIG_NODE already present in
+   current partition.  EDGES contains newly redirected edges to NEW_NODE.
+   Adjust profile information for both nodes and the edges.  */
+
+static void
+adjust_profile_info_for_non_self_rec_edges (auto_vec<cgraph_edge *> &edges,
+                                             cgraph_node *new_node,
+                                             cgraph_node *orig_node)
+{
+  profile_count orig_node_count = orig_node->count.ipa ();
+  profile_count edge_count = profile_count::zero ();
+  profile_count final_new_count = profile_count::zero ();
+  profile_count final_orig_count = profile_count::zero ();
+
+  for (unsigned i = 0; i < edges.length (); ++i)
+    edge_count += edges[i]->count.ipa ();
Here I would again skip uninitialized counts.  It is probably legal for a
-O0 function to end up in a partition.
+
+  final_orig_count = orig_node_count - edge_count;
+
+  /* NEW_NODE->count was adjusted for other callers when the clone was
+     first created.  Just add the new edge count.  */
+  if (new_node->count.compatible_p (edge_count))
+    final_new_count = new_node->count + edge_count;
And here compatible_p should be unnecessary; a sketch of the adjusted
accumulation follows below.
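
A sketch of how the accumulation could look with the compatible_p guard
dropped and uninitialized counts skipped (again purely illustrative,
reusing the names from the quoted patch, not the committed code):

  profile_count orig_node_count = orig_node->count.ipa ();
  profile_count edge_count = profile_count::zero ();
  for (unsigned i = 0; i < edges.length (); ++i)
    {
      /* Sum only initialized IPA counts; -O0 functions may contribute
	 uninitialized ones.  */
      profile_count cnt = edges[i]->count.ipa ();
      if (cnt.initialized_p ())
	edge_count += cnt;
    }
  profile_count final_orig_count = orig_node_count - edge_count;
  /* All counts are IPA counts after the ipa () conversion, so no
     compatible_p check is needed before adding.  */
  profile_count final_new_count = new_node->count + edge_count;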
+/* Accumulate frequency of all edges from EDGE->caller to EDGE->callee.  */
+
+static sreal
+accumulate_incoming_edge_frequency (cgraph_edge *edge)
+{
+  sreal count = 0;
+  struct cgraph_edge *e;
+  for (e = edge->callee->callers; e; e = e->next_caller)
+    {
+      /* Make a local decision about all edges for EDGE->caller but not the
+         other nodes already in the partition.  Their edges will be visited
+         later or may have been visited before and not fit the
+         cut-off criteria.  */
+      if (e->caller == edge->caller)
+        {
+          profile_count caller_count = e->caller->inlined_to
+                                       ? e->caller->inlined_to->count
+                                       : e->caller->count;
+          if (e->count.compatible_p (caller_count))
Here again the compatibility check should not be necessary, since the
counts belong to one function body (after inlining) and should be
compatible.  The inliner calls e->sreal_frequency all the time without
further checks.
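
One way to follow that observation would be to accumulate via
cgraph_edge::sreal_frequency, which already relates the edge count to the
(possibly inlined-to) caller count without an explicit compatibility
check.  A hedged sketch of the loop body, not the committed code:

  sreal count = 0;
  for (cgraph_edge *e = edge->callee->callers; e; e = e->next_caller)
    /* Only consider edges originating from EDGE->caller, as in the
       quoted patch.  */
    if (e->caller == edge->caller)
      count += e->sreal_frequency ();
  return count;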

Yeah, I’ve adjusted these uses and added initialized_p checks where you 
highlighted.
Indeed it seems to work with a profiled bootstrap, and the ICEs we were seeing 
earlier in development aren’t appearing.

Patch is OK with these changes. I apologize for letting the review slip
for so long.  I was setting up Firefox testing and LLVM builds to gather
some data, and that took longer than I hoped for.  On Firefox and LLVM on
zen I can measure some improvements via instruction-cache performance
counters, but the overall benefit seems to be close to noise; this is
likely very CPU-specific.  Overall, code locality is one of the main
missing parts of the LTO framework.  As discussed at Cauldron, I think
next stage1 we can untie this from the partitioning algorithm, but that
needs more work on the linker side as well as on gas fragments, so I think
it is a good idea to move forward with this patch as it is and improve from it.
Thank you very much for the evaluation and the feedback! I agree the effect can 
be very CPU-specific.

I think the patch modifies almost no code that runs without
-fipa-reorder-for-locality, so I hope it is safe for 15.1, but I would
like to let Richi and Jakub comment on this.
Thanks a lot for your reviews yet again. They were very helpful.
I’ve updated the patch as per your suggestions and did a profiled LTO bootstrap 
with the new bootstrap-lto-locality.mk that exercises this.
All looks good on aarch64-none-linux-gnu.
Richi, Jakub, is it okay for trunk now given Honza’s comments?
OK, but be prepared to back it out if there are any issues.

Quickly skimming over the patch shows

+/* Locality partitions, assigns nodes to partitions.  These are used later in
+   WPA partitioning.  */
+vec<locality_partition> locality_partitions;
+
+/* Map from original node to its latest clone.  Gets overwritten whenever a new
+   clone is created from the same node.  */
+hash_map<cgraph_node *, cgraph_node *> node_to_clone;
+/* Map from clone to its original node.  */
+hash_map<cgraph_node *, cgraph_node *> clone_to_node;

Those global CTORs are frowned upon (why are these not static?); we prefer
them to be pointers.  They are also not marked for garbage collection, but
cgraph_node generally is, so I assume they are only live through the
IPA pass itself.  So ideally they should be created in a more local scope.
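
A possible shape for that, as a sketch only: keep the maps as pointers
that are allocated when the IPA pass runs and freed when it finishes, so
there are no global constructors and the lifetime is clearly bounded by
the pass.  The entry-point name run_locality_partitioning below is
hypothetical, not taken from the patch.

  static hash_map<cgraph_node *, cgraph_node *> *node_to_clone;
  static hash_map<cgraph_node *, cgraph_node *> *clone_to_node;

  static void
  run_locality_partitioning ()   /* hypothetical entry point */
  {
    node_to_clone = new hash_map<cgraph_node *, cgraph_node *>;
    clone_to_node = new hash_map<cgraph_node *, cgraph_node *>;

    /* ... perform the locality partitioning, populating the maps ...  */

    delete node_to_clone;
    delete clone_to_node;
    node_to_clone = NULL;
    clone_to_node = NULL;
  }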

Thanks! As discussed on IRC we’ll work on addressing these. I’ve pushed the 
patch with 6d9fdf4bf57 after one more bootstrap on aarch64-none-linux-gnu.
Kyrill


Richard.


Honza
