From: Matthew Malcomson <[email protected]>

This patch is a separate beneficial step of the work I'm doing towards
PR119588.  In that bug I describe how some internal workloads are
hitting performance problems with the creation of many parallel
regions one after another.

I'm attempting to reduce overhead of parallel region creation when using
a large number of threads.  I've posted two patches to rework the
barrier implementation to something that performs much better in this
particular case:
https://gcc.gnu.org/pipermail/gcc-patches/2025-September/695005.html
https://gcc.gnu.org/pipermail/gcc-patches/2025-October/698257.html

This is a separate patch to address a different area where I've seen a
performance impact.  It is the area I mentioned in the following mail
to the GCC mailing list:
https://gcc.gnu.org/pipermail/gcc/2025-September/246618.html

The fundamental point is that the cost of teardown & setup of a parallel
region for LLVM is very close to the cost of a single barrier.  For us
there is a significant chunk of extra work in the actual teardown and
setup.

When re-using a team with no changes in the affinity ICV etc., it would
seem that the ideal case is similar to LLVM's: parallel region creation
is about 16% slower than a barrier for LLVM, as opposed to the 4.3x
slower that I'm currently seeing for GNU at 144 threads.
This patch gets nowhere near that ideal.  Instead, I observe that the
majority of the performance overhead in this case comes from
re-initialising the existing threads.  While all existing threads are
waiting on the primary thread, that primary thread iterates over each
secondary thread's data structures, setting them up as needed for the
next parallel region.

There are a number of stores that could relatively easily be performed
in each secondary thread after it is released by the primary thread,
removing this work from the serialised region.  This patch proposes
that simpler setup.  The re-initialisation scales with the number of
threads, which is why it has such an impact on the problematic case I
have been looking at.

The only subtlety here is that in order to ensure memory safety we need
to avoid performing this initialisation when `thr->fn == NULL`.  If a
secondary non-nested thread gets "left to exit" when a new team is
started, the primary thread could race ahead, cache the new team, and
free the team that this secondary thread was using, all before the
secondary thread reaches the new initialisation code.  That could lead
to a use-after-free.  Hence we have to exit early when this thread is
not going to perform any more work.
- N.b. When running TSAN before and after I realised that this change
  happens to remove a race between `gomp_finish_task` in the secondary
  thread and `gomp_init_task` in the primary thread (when re-using a
  team and hence `nthr->task` is at the same location as the thread's
  local `task` variable).

All threads that are not going to be used in the next region do
unnecessarily re-initialise their `team` data structure etc., but since
`thr->fn` is NULL they exit the do/while loop before that makes much of
a difference, and the work is spread across threads so it shouldn't
block other useful work.

I imagine there is a good chunk of performance to be gained in adding
more logic here:
1) Something recording whether the affinity setup has changed since the
   last non-nested team and if not removing the calculation of places
   and assignment to `nthr->ts.place_partition_*` variables.
2) I could move more members that are currently assigned once per
   secondary thread onto `team` for each secondary thread to take:
   - `data` pointer.
   - `num_teams`.
   - `team_num`.
   - Honestly, I should have looked into this for this patch; I noticed
     it while writing this cover letter and will look into whether it
     is feasible relatively soon.
3) If we can identify that we're re-using a team from the last parallel
   region (and affinity ICV has not changed) it seems that we could
   avoid re-initialising some of its fields:
   - ordered_release[i] should already point to `nthr`?
   - ts.team_id should already be set.
Overall, if we can identify that we're using the same team as was
cached and we don't need to change anything, we should be able to get
away with drastically less work in the serial part of the call to
GOMP_parallel.

I did run the micro-benchmark I'm looking at (mentioned in PR119588) on
some hacked-up versions of the above and they seemed to provide
measurable gains.  I haven't tried to put them all in this patch due to
lack of time and the relatively higher complexity of the bookkeeping.
I'm hoping this is a straightforward step in the right direction.

-------
Testing done (note: much of the testing was done on top of my other
patches rather than on this patch alone):
- Bootstrap & regtest on x86_64 and aarch64
  - Testsuite ran with OMP_WAIT_POLICY=PASSIVE as well.
  - Also done with `--enable-linux-futex=no` for posix/ target.
  - Also done with `_LIBGOMP_CHECKING_` set for assertions.
- Cross compiler & regtest on arm (qemu).
- TSAN done combining all my other patches upstream.

libgomp/ChangeLog:

        PR libgomp/119588
        * libgomp.h (struct gomp_team): Add fields for communication
        at team start time.
        * team.c (gomp_thread_start): Initialise thread-local data using
        members stored on team.
        (gomp_team_start): Primary thread stores data on team structure
        instead of copying that data to each secondary thread's
        gomp_thread structure.

Signed-off-by: Matthew Malcomson <[email protected]>
---
 libgomp/libgomp.h |  8 ++++++++
 libgomp/team.c    | 36 +++++++++++++++++++++++++-----------
 2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index a43398300a5..0dac5871d29 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -857,6 +857,14 @@ struct gomp_team
   /* Number of tasks waiting for their completion event to be fulfilled.  */
   unsigned int task_detach_count;
 
+  /* Cache to help initialise each secondary thread's `task` when re-using a
+     team very many times.  This allows each secondary thread to perform the
+     re-initialisation instead of having the primary thread need to perform
+     that initialisation.  */
+  struct gomp_task_icv task_icv_init;
+  struct gomp_task *task_parent_init;
+  struct gomp_taskgroup *task_taskgroup;
+
   /* This array contains structures for implicit tasks.  */
   struct gomp_task implicit_task[];
 };
diff --git a/libgomp/team.c b/libgomp/team.c
index cb1d3235312..893f4f9faea 100644
--- a/libgomp/team.c
+++ b/libgomp/team.c
@@ -131,6 +131,22 @@ gomp_thread_start (void *xdata)
          gomp_finish_task (task);
 
          gomp_simple_barrier_wait (&pool->threads_dock);
+         /* Early exit because if `thr->fn == NULL` then we can't guarantee
+            the `thr->ts.team` hasn't been freed by the primary thread racing
+            ahead.  Hence we don't want to write to it.  */
+         if (thr->fn == NULL)
+           break;
+         team = thr->ts.team;
+         thr->ts.work_share = &team->work_shares[0];
+         thr->ts.last_work_share = NULL;
+#ifdef HAVE_SYNC_BUILTINS
+         thr->ts.single_count = 0;
+#endif
+         thr->ts.static_trip = 0;
+         thr->task = &team->implicit_task[thr->ts.team_id];
+         gomp_init_task (thr->task, team->task_parent_init,
+                         &team->task_icv_init);
+         thr->task->taskgroup = team->task_taskgroup;
 
          local_fn = thr->fn;
          local_data = thr->data;
@@ -383,6 +399,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
   thr->task->taskgroup = taskgroup;
   team->implicit_task[0].icv.nthreads_var = nthreads_var;
   team->implicit_task[0].icv.bind_var = bind_var;
+  /* Copy entirely rather than copy pointer so that we have an immutable copy
+     for the secondary threads to take from for the time-frame that said
+     secondary threads are executing on.  Can't copy from
+     `team->implicit_task[0].icv` because primary thread could have adjusted
+     its per-task ICV by the time the secondary thread gets around to
+     initialising things.   */
+  team->task_icv_init = team->implicit_task[0].icv;
+  team->task_parent_init = task;
+  team->task_taskgroup = taskgroup;
 
   if (nthreads == 1)
     return;
@@ -641,26 +666,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
          else
            nthr = pool->threads[i];
          nthr->ts.team = team;
-         nthr->ts.work_share = &team->work_shares[0];
-         nthr->ts.last_work_share = NULL;
          nthr->ts.team_id = i;
          nthr->ts.level = team->prev_ts.level + 1;
          nthr->ts.active_level = thr->ts.active_level;
          nthr->ts.place_partition_off = place_partition_off;
          nthr->ts.place_partition_len = place_partition_len;
          nthr->ts.def_allocator = thr->ts.def_allocator;
-#ifdef HAVE_SYNC_BUILTINS
-         nthr->ts.single_count = 0;
-#endif
-         nthr->ts.static_trip = 0;
          nthr->num_teams = thr->num_teams;
          nthr->team_num = thr->team_num;
-         nthr->task = &team->implicit_task[i];
          nthr->place = place;
-         gomp_init_task (nthr->task, task, icv);
-         team->implicit_task[i].icv.nthreads_var = nthreads_var;
-         team->implicit_task[i].icv.bind_var = bind_var;
-         nthr->task->taskgroup = taskgroup;
          nthr->fn = fn;
          nthr->data = data;
          team->ordered_release[i] = &nthr->release;
-- 
2.43.0
