From: Matthew Malcomson <[email protected]>
I asked about this on IRC earlier, but I believe it was hard to judge from
just a description of the plans. This patch implements the change to
have only one barrier between parallel regions at the top level.
I still need to adjust other targets to ensure they build (most notably
rtems -- I have not yet adjusted that target for the changes I proposed in
the preceding patch either).
As with previous patches I've posted for OpenMP, I'm attempting to reduce
the performance difference between clang and gcc in the high-contention
case I discussed in PR119588:
https://gcc.gnu.org/pipermail/gcc-patches/2025-September/695005.html
https://gcc.gnu.org/pipermail/gcc-patches/2025-September/696085.html
This patch builds on top of the first patch of the two mentioned above.
Outline of the performance improvement in this patch:
- In the current libgomp implementation each non-nested thread in the
thread pool works in a loop along the lines of:
    1:
      perform work
      wait for all threads to have finished (gather all threads)
      free some data structures
      wait for "go" signal (wait on release)
      goto 1
- The "wait for all threads to have finished" and "wait for go signal"
actions are both currently implemented with full barriers.
- The implementation of each barrier itself contains both of those steps
-- "wait for all threads to have finished", and then all threads except
the last "wait for go signal".
- LLVM only has one barrier between parallel regions in the main loop of
a cached thread, as can be seen by running under a debugger or by looking
at the performance numbers from the micro-benchmark I added to PR119588.
In the existing libgomp barrier the "gather" step gives some *unknown*
thread the knowledge that all threads have arrived at the barrier
(whichever is last -- as indicated by BAR_WAS_LAST). Then that unknown
thread sends the "release" signal to all others -- giving them that same
information.
In the first patch of the three I've posted, I adjusted the barrier
implementation. In that implementation, synchronisation always goes
through the "primary" thread. I.e. in the "gather" step it is the
primary thread that ends up with the information that all other threads
have arrived at the barrier. Since it is the primary thread that needs
to perform bookkeeping and run user code while the secondary threads all
wait in their main loop, this maps slightly better onto holding and
releasing the secondary threads.
Hence, if we change the loop to the one below (which I believe is safe
since the data structures being freed are thread-local), all
synchronisation can be done by one barrier instead of two:
    1:
      perform work
      wait for all threads to have finished (gather all threads)
      wait for "go" signal (wait on release)
      free some data structures
      goto 1
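For reference, the team.c changes further down implement this; a condensed
view of the secondary-thread main loop with this patch applied (task and
affinity plumbing elided) looks roughly like:

    do
      {
        struct gomp_team *team = thr->ts.team;

        /* Perform work.  */
        local_fn (local_data);
        /* Gather all threads and run any remaining tasks; on targets where
           the barrier can "hold", also wait here until the primary thread
           calls gomp_team_barrier_done_final on this barrier.  */
        gomp_team_barrier_wait_final (&team->barrier, thr->ts.team_id);
        /* Free thread-local data structures.  */
        gomp_finish_task (task);
        /* Only targets whose barrier cannot hold the secondaries still
           need the second, pool-wide barrier.  */
        if (!gomp_barrier_can_hold (&team->barrier))
          gomp_simple_barrier_wait (&pool->threads_dock);
        local_fn = thr->fn;
        local_data = thr->data;
      }
    while (local_fn);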
--------------
Outline of the changes made:
1) Introduce the target hook `gomp_barrier_can_hold` to let generic code
know whether the "final" barrier holds all secondary threads until the
primary thread calls `gomp_team_barrier_done_final`.
When `gomp_barrier_can_hold` returns false the semantics of the
"final" barrier are essentially the same as those of the "team"
barrier, in that once the primary thread is past it all secondary
threads are known to have been released.
2) When there are threads "held" waiting for new work, these threads are
now waiting on the barrier of their previous team rather than waiting
on the barrier in the "thread_pool".
Hence we introduce a `prev_barrier` field in the thread pool (pool.h)
that records the team barrier of the last team -- this is the barrier
that needs to be "released".
- This is recorded on the thread_pool in `gomp_team_end` and used to
release the secondary threads in `gomp_team_start` (plus
`gomp_free_thread` and `gomp_pause_host`).
3) Change the "simple" barrier used by the thread_pool back to one using
the "centralized" implementation. Since we no longer go through the
"simple" barrier except in the case of a new thread, this barrier is
no longer a performance bottleneck. That means we no longer require
it to use the faster implementation introduced in the first patch
mentioned above.
- This means adding an `awaited` field (used only by this barrier) back
into the `gomp_barrier_t` data structure.
- This also means that the "simple" barrier no longer needs a thread
identification number, so we can remove the `thr->ts.team_id` argument
from each of these functions, which is nice because logically these
functions are not tied to a given team.
4) Reverting to a "centralized" implementation means that the "reinit"
complexity introduced in the first patch for adjusting the size of
the new-style barrier is no longer needed. This means we can remove
that complexity and adjust the `gomp_team_start` function to
something closer to what is currently upstream.
5) The "final" barrier is adjusted so that once all threads have arrived
and all tasks have been performed the primary thread continues while
all secondary threads wait.
- An interface for releasing all secondary threads is introduced; it is
called on the `prev_barrier` stored on the thread pool as mentioned
above.
- This state of "secondaries held but primary to continue" is indicated
by a new flag in the generation number.
6) We also check in `gomp_barrier_handle_tasks` whether the generation
has been incremented since we entered this function. If so, we don't
perform tasks.
- This is discussed a little more below. It is related to PR122314,
which is present on the master branch.
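As a condensed reference for points 1), 2) and 5), the primary-thread side
roughly looks like the following (paraphrased from the description above
and the team.c changes below; assertions and bookkeeping elided):

    /* In gomp_team_end: record the team barrier on which the secondary
       threads are still being held.  */
    pool->prev_barrier = &team->barrier;

    /* In gomp_team_start (and gomp_free_thread / gomp_pause_host):
       release the held secondaries before reusing or tearing down the
       pool.  */
    if (pool->prev_barrier)
      {
        /* Bumps the generation and futex-wakes every secondary still
           waiting in gomp_team_barrier_wait_final.  */
        gomp_team_barrier_done_final (pool->prev_barrier, thr->ts.team_id);
        pool->prev_barrier = NULL;
      }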
This ends up with a new split in implementation between the four kinds
of barrier interface presented to the generic code:
- "simple" barrier is now a centralized barrier.
- "team" barrier is a more scalable, tasking barrier.
This barrier is never called directly after a cancellable barrier.
Hence it does not have to handle the BAR_CANCELLED flag in the
generation.
- "team final" barrier is a scalable tasking barrier that can break
half-way through in order to provide different memory semantics.
- "cancel" barrier is a scalable tasking barrier that can be cancelled.
--------------
There are a few subtleties in this change to be aware of:
1) The memory ordering between secondary and primary threads relies on
some subtle correctness reasoning.
- In the existing behaviour there is an acquire-release ordering
from all threads to each other after having gone through the
first barrier (when the primary thread is executing code in between
parallel regions).
With this change there is an acquire-release ordering from all
secondary threads to the primary thread, but not one in the other
direction until after the primary thread has released all other
threads in the team.
- I believe this is fine -- the only difference in memory ordering
is visible in threads that are not running any code (secondary
threads just waiting on the primary one to release them). As far
as the primary thread is concerned, it knows that any stores made
by other threads must be visible to it.
- The memory model semantics at barrier regions (and specifically at
the entry to a parallel region) are discussed in the OpenMP
standard here:
https://www.openmp.org/spec-html/5.0/openmpsu96.html#x127-4920002.17.8
It explicitly calls out that on entry to a parallel region "the
behavior is as if the release flush performed by the master thread
on entry to the parallel region synchronizes with the acquire flush
performed on entry to each implicit task that is assigned to a
different thread" which is precisely the second "release" half of
the barrier. The exit from a parallel region is an implicit
barrier and so the documentation could be interpreted as requiring
an implicit release/acquire flush both ways -- but it's hard to
decide what a release flush from the primary synchronising with an
acquire flush in a secondary means during the time when the
secondary is not executing any code (as soon as the secondaries do
start to execute code there is indeed a flush from the primary to
them).
2) I had to account for tasks getting performed on an "earlier" barrier.
- In the existing code tasks scheduled after a barrier could be
executed at that barrier point in some threads (see PR122314).
- This change makes it so that there is only a single tasking barrier
between parallel regions. This means that the above behaviour
would now affect consecutive parallel regions instead of only
sets of statements separated by barriers inside a single parallel
block.
- I have accounted for the case I introduced. I expect it would be
best to account for the existing case too (by adjusting the hook I
added to the `linux` target) but would like confirmation on that
before making the change.
3) What I believe to be an existing memory model ordering problem
between performing tasks and releasing the barrier would become more
problematic.
- I suspect there is an existing memory synchronisation problem in the
code -- described in PR122356.
- With this change I believe that problem would be possible in more
cases. Rather than it being possible in user code with a barrier
in the middle of a parallel region it should also be possible
between two parallel regions (because we now only add an
acquire-release ordering from the primary thread to secondary
threads on entry to a parallel region rather than ensuring an
acquire-release ordering between all threads on entry to a parallel
region).
- If it's agreed that this is a problem I would make the
`team->task_count` reads and writes atomic and do the same for
the write to `bar->generation` in `gomp_team_barrier_done`.
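For concreteness, a minimal sketch of what that change might look like
(illustrative only, not part of this patch; the memory orders chosen here
are assumptions and would need review):

    /* Readers such as gomp_team_barrier_wait_end would load atomically
       rather than with a plain read.  */
    unsigned task_count
      = __atomic_load_n (&team->task_count, MEMMODEL_ACQUIRE);

    /* Writers (currently plain updates made under team->task_lock) would
       become atomic read-modify-writes.  */
    __atomic_add_fetch (&team->task_count, 1, MEMMODEL_RELEASE);

    /* And gomp_team_barrier_done would use a release store instead of a
       plain assignment to bar->generation.  */
    __atomic_store_n (&bar->generation, gens + BAR_INCR, MEMMODEL_RELEASE);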
Signed-off-by: Matthew Malcomson <[email protected]>
---
libgomp/config/linux/bar.c | 189 +++++++++++--
libgomp/config/linux/bar.h | 256 +++++++-----------
libgomp/config/linux/simple-bar.h | 69 +++++
libgomp/config/nvptx/bar.c | 4 +-
libgomp/config/nvptx/bar.h | 15 +-
libgomp/config/nvptx/simple-bar.h | 8 +-
libgomp/config/nvptx/team.c | 4 +-
libgomp/config/posix/bar.h | 48 +---
libgomp/config/posix/pool.h | 1 +
libgomp/config/posix/simple-bar.h | 41 +--
libgomp/config/rtems/bar.h | 61 ++++-
libgomp/libgomp.h | 3 +-
libgomp/task.c | 19 +-
libgomp/team.c | 202 ++++++++------
.../testsuite/libgomp.c++/task-reduction-20.C | 126 +++++++++
.../testsuite/libgomp.c++/task-reduction-21.C | 130 +++++++++
16 files changed, 810 insertions(+), 366 deletions(-)
create mode 100644 libgomp/config/linux/simple-bar.h
create mode 100644 libgomp/testsuite/libgomp.c++/task-reduction-20.C
create mode 100644 libgomp/testsuite/libgomp.c++/task-reduction-21.C
diff --git a/libgomp/config/linux/bar.c b/libgomp/config/linux/bar.c
index a072ac5854b..3d803deafb1 100644
--- a/libgomp/config/linux/bar.c
+++ b/libgomp/config/linux/bar.c
@@ -31,6 +31,47 @@
#include "wait.h"
#include "futex_waitv.h"
+void
+gomp_centralized_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t
state)
+{
+ if (__builtin_expect (state & BAR_WAS_LAST, 0))
+ {
+ /* Next time we'll be awaiting TOTAL threads again. */
+ bar->awaited = bar->total;
+ __atomic_store_n (&bar->generation, bar->generation + BAR_INCR,
+ MEMMODEL_RELEASE);
+ futex_wake ((int *) &bar->generation, INT_MAX);
+ }
+ else
+ {
+ do
+ do_wait ((int *) &bar->generation, state);
+ while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE) == state);
+ }
+}
+
+void
+gomp_centralized_barrier_wait (gomp_barrier_t *bar)
+{
+ gomp_barrier_state_t state = gomp_centralized_barrier_wait_start (bar);
+ gomp_centralized_barrier_wait_end (bar, state);
+}
+
+/* Like gomp_centralized_barrier_wait, except that if the encountering thread
+ is not the last one to hit the barrier, it returns immediately. The
+ intended usage is that a thread which intends to gomp_barrier_destroy this
+ barrier calls gomp_centralized_barrier_wait, while all other threads call
+ gomp_centralized_barrier_wait_last. When gomp_centralized_barrier_wait
+ returns, the barrier can be safely destroyed. */
+
+void
+gomp_centralized_barrier_wait_last (gomp_barrier_t *bar)
+{
+ gomp_barrier_state_t state = gomp_centralized_barrier_wait_start (bar);
+ if (state & BAR_WAS_LAST)
+ gomp_centralized_barrier_wait_end (bar, state);
+}
+
void
gomp_barrier_ensure_last (gomp_barrier_t *bar, unsigned id,
gomp_barrier_state_t state)
@@ -368,6 +409,9 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar,
gomp_barrier_state_t state,
unsigned id)
{
unsigned int generation, gen;
+ gomp_assert (!(state & BAR_CANCELLED),
+ "Generation number includes BAR_CANCELLED in "
+ "gomp_team_barrier_wait_end: %u , id: %u", state, id);
if (__builtin_expect (state & BAR_WAS_LAST, 0))
{
@@ -377,12 +421,11 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar,
gomp_barrier_state_t state,
team->work_share_cancelled = 0;
if (__builtin_expect (team->task_count, 0))
{
- gomp_barrier_handle_tasks (state, false);
+ gomp_barrier_handle_tasks (state, BAR_INCR);
state &= ~BAR_WAS_LAST;
}
else
{
- state &= ~BAR_CANCELLED;
state += BAR_INCR - BAR_WAS_LAST;
__atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
futex_wake ((int *) &bar->generation, INT_MAX);
@@ -391,23 +434,12 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar,
gomp_barrier_state_t state,
}
generation = state;
- state &= ~BAR_CANCELLED;
do
{
do_wait ((int *) &bar->generation, generation);
gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
- gomp_assert ((gen == state + BAR_INCR)
- || (gen & BAR_CANCELLED) == (generation & BAR_CANCELLED)
- /* Can cancel a barrier when already gotten into final
- implicit barrier at the end of a parallel loop.
- This happens in `cancel-parallel-3.c`.
- In this case the above assertion does not hold because
- We are waiting on the implicit barrier at the end of a
- parallel region while some other thread is performing
- work in that parallel region, hits a
- `#pragma omp cancel parallel`, and sets said flag. */
- || !(generation & BAR_CANCELLED),
- "Unnecessary looping due to BAR_CANCELLED diff"
+ gomp_assert (!(gen & BAR_CANCELLED),
+ "BAR_CANCELLED set on generation in team barrier"
" gen: %u generation: %u id: %u",
gen, generation, id);
/* TODO I believe this could end up executing tasks from the *next*
@@ -421,7 +453,7 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar,
gomp_barrier_state_t state,
implied about whether the above is a problem or not. */
if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
{
- gomp_barrier_handle_tasks (state, false);
+ gomp_barrier_handle_tasks (state, BAR_INCR);
gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
}
/* These flags will not change until this barrier is completed. Hence
@@ -432,7 +464,6 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar,
gomp_barrier_state_t state,
some thread will have used `gomp_barrier_handle_tasks` to go through
all tasks and drop them. */
generation |= gen & BAR_WAITING_FOR_TASK;
- generation |= gen & BAR_CANCELLED;
/* Other flags that may be set in `bar->generation` are:
1) BAR_SECONDARY_ARRIVED
2) BAR_SECONDARY_CANCELLABLE_ARRIVED
@@ -452,10 +483,128 @@ gomp_team_barrier_wait (gomp_barrier_t *bar, unsigned id)
gomp_team_barrier_wait_end (bar, state, id);
}
+void
+gomp_team_barrier_wait_for_tasks (gomp_barrier_t *bar,
+ gomp_barrier_state_t state, unsigned id)
+{
+ unsigned int generation, gen;
+
+ if (__builtin_expect (state & BAR_WAS_LAST, 0))
+ {
+ gomp_assert (id == 0, "Id %u believes it is last\n", id);
+ struct gomp_thread *thr = gomp_thread ();
+ struct gomp_team *team = thr->ts.team;
+ team->work_share_cancelled = 0;
+ /* N.b. beware with the semantics of this load. We are reading a value
+ set in another thread and using that to infer what has already
+ happened -- but we are not using memory ordering semantics, so we need
+ to be careful about what we infer based on it.
+ The `gomp_barrier_handle_tasks` function does everything under mutexes
+ (which have acquire-release ordering between mutex lock and unlock) so
+ actions inside that function are OK. If adding anything here need to
+ be careful. */
+ if (__builtin_expect (team->task_count, 0))
+ {
+ gomp_barrier_handle_tasks (state, BAR_HOLDING_SECONDARIES);
+ state &= ~BAR_WAS_LAST;
+ }
+ else
+ {
+ /* Believe that we don't need to use an atomic operation because:
+ 1) We know no other thread is writing to this value, since all
+ tasks are finished (hence no user code adding a task) and we
+ never set `BAR_WAITING_FOR_TASK` so no other thread will mark
+ this generation as "done".
+ 2) We don't need to add any memory model semantics here. That is
+ left for the "release" function call later on in the primary
+ thread.
+
+ However, weak memory model means that while whatever secondary
+ thread reduced `team->task_count` to zero did so after having
+ cleared `BAR_TASK_PENDING`. From our perspective we could see
+ this as happening *before* clearing `BAR_TASK_PENDING`.
+
+ Hence we can't do something like the below:
+ bar->generation |= BAR_HOLDING_SECONDARIES;
+
+ Other threads could also see our setting of `bar->generation` as
+ coming before this other secondary thread clearing the
+ BAR_TASK_PENDING bit. However this doesn't matter since the
+ "clear flag" is safe to perform after the operation of "clear all
+ flags except BAR_HOLDING_SECONDARIES". */
+ unsigned gens = state & ~BAR_WAS_LAST;
+ gens |= BAR_HOLDING_SECONDARIES;
+ bar->generation = gens;
+ return;
+ }
+ }
+
+ generation = state;
+ state &= ~BAR_CANCELLED;
+ do
+ {
+ do_wait ((int *) &bar->generation, generation);
+ gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+ gomp_assert ((gen & BAR_CANCELLED) == (generation & BAR_CANCELLED)
+ || (gen > state + BAR_INCR)
+ || ((gen == state + BAR_INCR)
+ && (generation & BAR_CANCELLED))
+ || ((gen == ((state + BAR_INCR) & BAR_TASK_PENDING))
+ && (generation & BAR_CANCELLED))
+ /* Can cancel a barrier when already gotten into final
+ implicit barrier at the end of a parallel loop.
+ This happens in `cancel-parallel-3.c`.
+ In this case the above assertion does not hold because
+ We are waiting on the implicit barrier at the end of a
+ parallel region while some other thread is performing
+ work in that parallel region, hits a
+ `#pragma omp cancel parallel`, and sets said flag. */
+ || !(generation & BAR_CANCELLED),
+ "Unnecessary looping due to BAR_CANCELLED diff"
+ " gen: %u state: %u generation: %u id: %u",
+ gen, state, generation, id);
+ gomp_assert (gen < (state + 2 * BAR_INCR),
+ "Generation has gone ahead of us two times!"
+ " gen = %u, state = %u, generation = %u",
+ gen, state, generation);
+ if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
+ {
+ gomp_barrier_handle_tasks (state, BAR_HOLDING_SECONDARIES);
+ gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+ }
+ generation |= gen & BAR_WAITING_FOR_TASK;
+ generation |= gen & BAR_CANCELLED;
+ /* Situation where we know all tasks are finished, and we were called
+ with instructions to wait for all tasks to finish before letting the
+ primary thread exit and holding secondary threads ready to be woken
+ up. */
+ if (id == 0 && gen & BAR_HOLDING_SECONDARIES)
+ return;
+ generation |= gen & BAR_HOLDING_SECONDARIES;
+ }
+ while (gen < state + BAR_INCR);
+}
+
void
gomp_team_barrier_wait_final (gomp_barrier_t *bar, unsigned id)
{
- return gomp_team_barrier_wait (bar, id);
+ gomp_barrier_state_t state = gomp_barrier_wait_start (bar, id);
+ if (__builtin_expect (state & BAR_WAS_LAST, 0))
+ gomp_team_barrier_ensure_last (bar, id, state);
+ gomp_team_barrier_wait_for_tasks (bar, state, id);
+}
+
+void
+gomp_team_barrier_done_final (gomp_barrier_t *bar, unsigned id)
+{
+ gomp_assert (id == 0, "called with ID = %u", id);
+ unsigned gen = bar->generation;
+ gomp_assert ((gen & BAR_FLAGS_MASK & ~BAR_CANCELLED) ==
BAR_HOLDING_SECONDARIES,
+ "gomp_team_barrier_done_final called with generation: %u", gen);
+ gen &= ~(BAR_HOLDING_SECONDARIES | BAR_CANCELLED);
+ gen += BAR_INCR;
+ __atomic_store_n (&bar->generation, gen, MEMMODEL_RELEASE);
+ futex_wake ((int *) &bar->generation, INT_MAX);
}
void
@@ -608,7 +757,7 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
team->work_share_cancelled = 0;
if (__builtin_expect (team->task_count, 0))
{
- gomp_barrier_handle_tasks (state, true);
+ gomp_barrier_handle_tasks (state, BAR_CANCEL_INCR);
state &= ~BAR_WAS_LAST;
}
else
@@ -723,7 +872,7 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
}
if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
{
- gomp_barrier_handle_tasks (state, true);
+ gomp_barrier_handle_tasks (state, BAR_CANCEL_INCR);
gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
}
generation |= gen & BAR_WAITING_FOR_TASK;
diff --git a/libgomp/config/linux/bar.h b/libgomp/config/linux/bar.h
index c91832828ab..b00cfd00307 100644
--- a/libgomp/config/linux/bar.h
+++ b/libgomp/config/linux/bar.h
@@ -55,6 +55,8 @@ typedef struct
unsigned total __attribute__((aligned (64)));
unsigned allocated;
unsigned generation;
+ /* `awaited` only used for "simple" barrier. */
+ unsigned awaited __attribute__((aligned (64)));
struct thread_lock_data *threadgens;
} gomp_barrier_t;
@@ -74,10 +76,13 @@ typedef unsigned int gomp_barrier_state_t;
be available on all kernels newer than Linux 5.16. */
#define BAR_SECONDARY_ARRIVED 8
#define BAR_SECONDARY_CANCELLABLE_ARRIVED 16
-/* Using bits 5 -> 10 for the generation number of cancellable barriers and
+/* Flag to indicate that primary should be released while others should be
+ left. */
+#define BAR_HOLDING_SECONDARIES 32
+/* Using bits 6 -> 11 for the generation number of cancellable barriers and
remaining bits for the generation number of non-cancellable barriers. */
-#define BAR_CANCEL_INCR 32
-#define BAR_INCR 2048
+#define BAR_CANCEL_INCR 64
+#define BAR_INCR 4096
#define BAR_FLAGS_MASK (~(-BAR_CANCEL_INCR))
#define BAR_GEN_MASK (-BAR_INCR)
#define BAR_BOTH_GENS_MASK (-BAR_CANCEL_INCR)
@@ -109,25 +114,29 @@ gomp_assert_seenflags (gomp_barrier_t *bar, bool
cancellable)
/* Assert that all threads have been seen. */
for (unsigned i = 0; i < bar->total; i++)
{
- gomp_assert (arr[i].gen == (gen & BAR_GEN_MASK) + incr,
+ unsigned g = arr[i].gen;
+ unsigned cg = arr[i].cgen;
+ gomp_assert (g == (gen & BAR_GEN_MASK) + incr,
"Index %u generation is %u (global is %u)\n",
- i, arr[i].gen, gen);
- gomp_assert ((arr[i].cgen & BAR_CANCEL_GEN_MASK)
+ i, g, gen);
+ gomp_assert ((cg & BAR_CANCEL_GEN_MASK)
== ((gen + cancel_incr) & BAR_CANCEL_GEN_MASK),
- "Index %u cancel generation is %u (global is %u)\n", i,
- arr[i].cgen, gen);
+ "Index %u cancel generation is %u (global is %u)\n", i, cg,
+ gen);
}
-
+
/* Assert that generation numbers not corresponding to any thread are
cleared. This helps us test code-paths. */
for (unsigned i = bar->total; i < bar->allocated; i++)
{
- gomp_assert (arr[i].gen == 0,
- "Index %u gen should be 0. Is %u (global gen is %u)\n",
- i, arr[i].gen, gen);
- gomp_assert (arr[i].cgen == 0,
- "Index %u gen should be 0. Is %u (global gen is %u)\n",
- i, arr[i].cgen, gen);
+ unsigned g = arr[i].gen;
+ unsigned cg = arr[i].cgen;
+ gomp_assert (g == 0,
+ "Index %u gen should be 0. Is %u (global gen is %u)\n", i,
+ g, gen);
+ gomp_assert (cg == 0,
+ "Index %u gen should be 0. Is %u (global gen is %u)\n", i,
+ cg, gen);
}
#endif
}
@@ -144,124 +153,18 @@ gomp_barrier_init (gomp_barrier_t *bar, unsigned count)
}
bar->total = count;
bar->allocated = count;
+ bar->awaited = count;
bar->generation = 0;
}
-static inline bool
-gomp_barrier_has_space (gomp_barrier_t *bar, unsigned nthreads)
-{
- return nthreads <= bar->allocated;
-}
-
-static inline void
-gomp_barrier_minimal_reinit (gomp_barrier_t *bar, unsigned nthreads,
- unsigned num_new_threads)
-{
- /* Just increasing number of threads by appending logically used threads at
- "the end" of the team. That essentially means we need more of the
- `bar->threadgens` array to be logically used. We set them all to the
- current `generation` (marking that they are yet to hit this generation).
-
- This function has only been called after we checked there is enough space
- in this barrier for the number of threads we want using it. Hence there's
- no serialisation needed. */
- gomp_assert (nthreads <= bar->allocated,
- "minimal reinit on barrier with not enough space: "
- "%u > %u",
- nthreads, bar->allocated);
- unsigned gen = bar->generation & BAR_GEN_MASK;
- unsigned cancel_gen = bar->generation & BAR_CANCEL_GEN_MASK;
- gomp_assert (bar->total == nthreads - num_new_threads,
- "minimal_reinit called with incorrect state: %u != %u - %u\n",
- bar->total, nthreads, num_new_threads);
- for (unsigned i = bar->total; i < nthreads; i++)
- {
- bar->threadgens[i].gen = gen;
- bar->threadgens[i].cgen = cancel_gen;
- }
- bar->total = nthreads;
-}
-
-/* When re-initialising a barrier we know the following:
- 1) We are waiting on a non-cancellable barrier.
- 2) The cancel generation bits are known consistent (having been tidied up by
- each individual thread if the barrier got cancelled). */
+/* When re-initialising a barrier we know that all threads are serialised on
+   something else (because `gomp_barrier_can_hold` returns true). However we
+ still want to have memory synchronisation between each thread so still use
+ atomic operations. */
static inline void
-gomp_barrier_reinit_1 (gomp_barrier_t *bar, unsigned nthreads,
- unsigned num_new_threads, unsigned long long *new_ids)
+gomp_centralized_barrier_reinit (gomp_barrier_t *bar, unsigned nthreads)
{
-#if _LIBGOMP_CHECKING_
- /* Assertions that this barrier is in a sensible state.
- Everything waiting on the standard barrier.
- Current thread has not registered itself as arrived, but we tweak for the
- current assertions. */
- bar->threadgens[0].gen += BAR_INCR;
- gomp_assert_seenflags (bar, false);
- bar->threadgens[0].gen -= BAR_INCR;
- struct thread_lock_data threadgen_zero = bar->threadgens[0];
-#endif
- if (!gomp_barrier_has_space (bar, nthreads))
- {
- /* Using `realloc` not chosen. Pros/Cons below.
- Pros of using `realloc`:
- - May not actually have to move memory.
- Cons of using `realloc`:
- - If do have to move memory, then *also* copies data, we are going to
- overwrite the data in this function. That copy would be a waste.
- - If do have to move memory then pointer may no longer be aligned.
- Would need bookkeeping for "pointer to free" and "pointer to have
- data".
- Seems like "bad" case of `realloc` is made even worse by what we need
- here. Would have to benchmark to figure out whether using `realloc`
- or not is best. Since we shouldn't be re-allocating very often I'm
- choosing the simplest to code rather than the most optimal.
-
- Does not matter that we have any existing threads waiting on this
- barrier. They are all waiting on bar->generation and their
- thread-local generation will not be looked at. */
- gomp_aligned_free (bar->threadgens);
- bar->threadgens
- = gomp_aligned_alloc (64, sizeof (bar->threadgens[0]) * nthreads);
- bar->allocated = nthreads;
- }
-
- /* Re-initialise the existing values. */
- unsigned curgen = bar->generation & BAR_GEN_MASK;
- unsigned cancel_curgen = bar->generation & BAR_CANCEL_GEN_MASK;
- unsigned iter_len = nthreads;
- unsigned bits_per_ull = sizeof (unsigned long long) * CHAR_BIT;
-#if _LIBGOMP_CHECKING_
- /* If checking, zero out everything that's not going to be used in this team.
- This is only helpful for debugging (other assertions later can ensure that
- we've gone through this path for adjusting the number of threads, and when
- viewing the data structure in GDB can easily identify which generation
- numbers are in use). When not running assertions or running in the
- debugger these extra numbers are simply not used. */
- iter_len = bar->allocated;
- /* In the checking build just unconditionally reinitialise. This handles
- when the memory has moved and is harmless (except in performance which the
- checking build doesn't care about) otherwise. */
- bar->threadgens[0] = threadgen_zero;
-#endif
- for (unsigned i = 1; i < iter_len; i++)
- {
- /* Re-initialisation. Zero out the "remaining" elements in our wake flag
- array when _LIBGOMP_CHECKING_ as a helper for our assertions to check
- validity. Set thread-specific generations to "seen" for `i's
- corresponding to re-used threads, set thread-specific generations to
- "not yet seen" for `i's corresponding to threads about to be
- spawned. */
- unsigned newthr_val = i < nthreads ? curgen : 0;
- unsigned newthr_cancel_val = i < nthreads ? cancel_curgen : 0;
- unsigned index = i / bits_per_ull;
- unsigned long long bitmask = (1ULL << (i % bits_per_ull));
- bool bit_is_set = ((new_ids[index] & bitmask) != 0);
- bar->threadgens[i].gen = bit_is_set ? curgen + BAR_INCR : newthr_val;
- /* This is different because we only ever call this function while
threads
- are waiting on a non-cancellable barrier. Hence "which threads have
- arrived and which will be newly spawned" is not a question. */
- bar->threadgens[i].cgen = newthr_cancel_val;
- }
+ __atomic_add_fetch (&bar->awaited, nthreads - bar->total, MEMMODEL_ACQ_REL);
bar->total = nthreads;
}
@@ -270,21 +173,26 @@ static inline void gomp_barrier_destroy (gomp_barrier_t *bar)
gomp_aligned_free (bar->threadgens);
}
-static inline void
-gomp_barrier_reinit_2 (gomp_barrier_t __attribute__ ((unused)) * bar,
- unsigned __attribute__ ((unused)) nthreads) {};
extern void gomp_barrier_wait (gomp_barrier_t *, unsigned);
extern void gomp_barrier_wait_last (gomp_barrier_t *, unsigned);
extern void gomp_barrier_wait_end (gomp_barrier_t *, gomp_barrier_state_t,
unsigned);
+
+extern void gomp_centralized_barrier_wait (gomp_barrier_t *);
+extern void gomp_centralized_barrier_wait_last (gomp_barrier_t *);
+extern void gomp_centralized_barrier_wait_end (gomp_barrier_t *,
+ gomp_barrier_state_t);
+
extern void gomp_team_barrier_wait (gomp_barrier_t *, unsigned);
extern void gomp_team_barrier_wait_final (gomp_barrier_t *, unsigned);
extern void gomp_team_barrier_wait_end (gomp_barrier_t *,
gomp_barrier_state_t, unsigned);
+
extern bool gomp_team_barrier_wait_cancel (gomp_barrier_t *, unsigned);
extern bool gomp_team_barrier_wait_cancel_end (gomp_barrier_t *,
gomp_barrier_state_t,
unsigned);
+
extern void gomp_team_barrier_wake (gomp_barrier_t *, int);
struct gomp_team;
extern void gomp_team_barrier_cancel (struct gomp_team *);
@@ -299,6 +207,21 @@ extern void gomp_assert_and_increment_flag (gomp_barrier_t
*, unsigned,
extern void gomp_assert_and_increment_cancel_flag (gomp_barrier_t *, unsigned,
unsigned);
+static inline gomp_barrier_state_t
+gomp_centralized_barrier_wait_start (gomp_barrier_t *bar)
+{
+ unsigned int ret = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+ ret &= -BAR_INCR | BAR_CANCELLED;
+ /* A memory barrier is needed before exiting from the various forms
+ of gomp_barrier_wait, to satisfy OpenMP API version 3.1 section
+ 2.8.6 flush Construct, which says there is an implicit flush during
+ a barrier region. This is a convenient place to add the barrier,
+ so we use MEMMODEL_ACQ_REL here rather than MEMMODEL_ACQUIRE. */
+ if (__atomic_add_fetch (&bar->awaited, -1, MEMMODEL_ACQ_REL) == 0)
+ ret |= BAR_WAS_LAST;
+ return ret;
+}
+
static inline gomp_barrier_state_t
gomp_barrier_wait_start (gomp_barrier_t *bar, unsigned id)
{
@@ -407,7 +330,8 @@ gomp_team_barrier_cancelled (gomp_barrier_t *bar)
}
static inline void
-gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t state, bool
use_cancel)
+gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t state,
+ int increment)
{
/* N.b. not using atomic operations here because when performing this
operation we know that all threads have arrived at the barrier and are
@@ -417,38 +341,48 @@ gomp_team_barrier_done (gomp_barrier_t *bar,
gomp_barrier_state_t state, bool us
swap loop anyway. We know that it should always succeed immediately, but
it doesn't leave a footgun for later changes to the code. */
unsigned gens = (state & BAR_BOTH_GENS_MASK);
- bar->generation = use_cancel ? BAR_INCREMENT_CANCEL (gens) : gens + BAR_INCR;
+ switch (increment) {
+ case BAR_CANCEL_INCR:
+ bar->generation = BAR_INCREMENT_CANCEL (gens);
+ return;
+ case BAR_INCR:
+ bar->generation = gens + BAR_INCR;
+ return;
+ case BAR_HOLDING_SECONDARIES:
+ gomp_assert (!(bar->generation & BAR_HOLDING_SECONDARIES),
+ "Setting BAR_HOLDING_SECONDARIES on generation of %u",
+ bar->generation);
+ bar->generation = gens | BAR_HOLDING_SECONDARIES;
+ return;
+ }
}
-static inline void
-gomp_barrier_prepare_reinit (gomp_barrier_t *bar, unsigned id)
+extern void gomp_team_barrier_done_final (gomp_barrier_t *bar, unsigned id);
+
+static inline bool
+gomp_barrier_can_hold (gomp_barrier_t *bar)
{
- gomp_assert (id == 0,
- "gomp_barrier_prepare_reinit called in non-primary thread: %u",
- id);
- /* This use of `gomp_barrier_wait_start` is worth note.
- 1) We're running in `id == 0`, which means that without checking we'll
- essentially just load `bar->generation`.
- 2) In this case there's no need to form any release-acquire ordering. The
- `gomp_barrier_ensure_last` call below will form a release-acquire
- ordering between each secondary thread and this one, and that will be
- from some point after all uses of the barrier that we care about.
- 3) However, in the checking builds, it's very useful to call
- `gomp_assert_and_increment_flag` in order to provide extra guarantees
- about what we're doing. */
- gomp_barrier_state_t state = gomp_barrier_wait_start (bar, id);
- gomp_barrier_ensure_last (bar, id, state);
-#if _LIBGOMP_CHECKING_
- /* When checking, `gomp_assert_and_increment_flag` will have incremented the
- generation flag. However later on down the line we'll be calling the full
- barrier again and we need to decrement that flag ready for that. We still
- *want* the flag to have been incremented above so that the assertions in
- `gomp_barrier_ensure_last` all work.
-
- When not checking, this increment/decrement/increment again cycle is not
- performed. */
- bar->threadgens[0].gen -= BAR_INCR;
-#endif
+ return true;
+}
+
+static inline bool
+gomp_barrier_completed (gomp_barrier_state_t state, unsigned increment,
+ gomp_barrier_t *bar)
+{
+ /* Performing tasks on a barrier earlier in a given tasking region is
+ something that already happens. Though I think it's a problem I haven't
+ yet gotten 100% sure on that (see PR122314). Performing tasks on a
+ *different* given tasking region is definitely a problem -- e.g. omp
+ reduction initialisation may not have been run but reduction tasks may
+ have been scheduled.
+
+ When we're worried about performing tasks on a *different* tasking region
+ we will see the BAR_HOLDING_SECONDARIES increment. If in PR122314 it
+ turns out that said case is also a problem we would just have to adjust
+ this function. */
+ if (increment != BAR_HOLDING_SECONDARIES)
+ return false;
+ return bar->generation >= ((state & BAR_BOTH_GENS_MASK) + BAR_INCR);
}
#endif /* GOMP_BARRIER_H */
diff --git a/libgomp/config/linux/simple-bar.h
b/libgomp/config/linux/simple-bar.h
new file mode 100644
index 00000000000..ba6f90f6c48
--- /dev/null
+++ b/libgomp/config/linux/simple-bar.h
@@ -0,0 +1,69 @@
+/* Copyright (C) 2015-2025 Free Software Foundation, Inc.
+ Contributed by Alexander Monakov <[email protected]>
+
+ This file is part of the GNU Offloading and Multi Processing Library
+ (libgomp).
+
+ Libgomp is free software; you can redistribute it and/or modify it
+ under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 3, or (at your option)
+ any later version.
+
+ Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+ WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+ FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ more details.
+
+ Under Section 7 of GPL version 3, you are granted additional
+ permissions described in the GCC Runtime Library Exception, version
+ 3.1, as published by the Free Software Foundation.
+
+ You should have received a copy of the GNU General Public License and
+ a copy of the GCC Runtime Library Exception along with this program;
+ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
+ <http://www.gnu.org/licenses/>. */
+
+/* This is a simplified barrier that is suitable for thread pool
+   synchronization. Only a subset of full barrier API (bar.h) is exposed. */
+
+#ifndef GOMP_SIMPLE_BARRIER_H
+#define GOMP_SIMPLE_BARRIER_H 1
+
+#include "bar.h"
+
+typedef struct
+{
+ gomp_barrier_t bar;
+} gomp_simple_barrier_t;
+
+static inline void
+gomp_simple_barrier_init (gomp_simple_barrier_t *bar, unsigned count)
+{
+ gomp_barrier_init (&bar->bar, count);
+}
+
+static inline void
+gomp_simple_barrier_reinit (gomp_simple_barrier_t *sbar, unsigned nthreads)
+{
+ gomp_centralized_barrier_reinit (&sbar->bar, nthreads);
+}
+
+static inline void
+gomp_simple_barrier_destroy (gomp_simple_barrier_t *bar)
+{
+ gomp_barrier_destroy (&bar->bar);
+}
+
+static inline void
+gomp_simple_barrier_wait (gomp_simple_barrier_t *bar)
+{
+ gomp_centralized_barrier_wait (&bar->bar);
+}
+
+static inline void
+gomp_simple_barrier_wait_last (gomp_simple_barrier_t *bar)
+{
+ gomp_centralized_barrier_wait_last (&bar->bar);
+}
+
+#endif /* GOMP_SIMPLE_BARRIER_H */
diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c
index 1c0ea183dc2..72b75fa7cc1 100644
--- a/libgomp/config/nvptx/bar.c
+++ b/libgomp/config/nvptx/bar.c
@@ -100,7 +100,7 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar,
gomp_barrier_state_t state,
{
while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE)
& BAR_TASK_PENDING)
- gomp_barrier_handle_tasks (state, false);
+ gomp_barrier_handle_tasks (state, 0);
if (bar->total > 1)
asm volatile ("bar.sync 1, %0;" : : "r" (32 * bar->total));
@@ -155,7 +155,7 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
{
while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE)
& BAR_TASK_PENDING)
- gomp_barrier_handle_tasks (state, true);
+ gomp_barrier_handle_tasks (state, 0);
if (bar->total > 1)
asm volatile ("bar.sync 1, %0;" : : "r" (32 * bar->total));
diff --git a/libgomp/config/nvptx/bar.h b/libgomp/config/nvptx/bar.h
index 2c45901cadf..71a39f0bde2 100644
--- a/libgomp/config/nvptx/bar.h
+++ b/libgomp/config/nvptx/bar.h
@@ -59,13 +59,11 @@ static inline void gomp_barrier_init (gomp_barrier_t *bar,
unsigned count)
bar->generation = 0;
}
-/*
static inline void gomp_barrier_reinit (gomp_barrier_t *bar, unsigned count)
{
__atomic_add_fetch (&bar->awaited, count - bar->total, MEMMODEL_ACQ_REL);
bar->total = count;
}
-*/
static inline void gomp_barrier_destroy (gomp_barrier_t *bar)
{
@@ -169,11 +167,18 @@ gomp_team_barrier_cancelled (gomp_barrier_t *bar)
static inline void
gomp_team_barrier_done (gomp_barrier_t *bar, gomp_barrier_state_t state,
- bool use_cancellable __attribute__ ((unused)))
+ unsigned increment __attribute__ ((unused)))
{
bar->generation = (state & -BAR_INCR) + BAR_INCR;
}
+static inline bool
+gomp_barrier_can_hold (gomp_barrier_t *bar)
+{
+ return false;
+}
+
+/* Functions dummied out for this implementation. */
static inline void
gomp_team_barrier_ensure_last (gomp_barrier_t *bar __attribute__ ((unused)),
unsigned id __attribute__ ((unused)),
@@ -202,4 +207,8 @@ static inline void
gomp_reset_cancellable_primary_threadgen (gomp_barrier_t *bar, unsigned id)
{}
+static inline void
+gomp_team_barrier_done_final (gomp_barrier_t *bar, unsigned id)
+{}
+
#endif /* GOMP_BARRIER_H */
diff --git a/libgomp/config/nvptx/simple-bar.h
b/libgomp/config/nvptx/simple-bar.h
index d13d804f3c8..1098b831d5b 100644
--- a/libgomp/config/nvptx/simple-bar.h
+++ b/libgomp/config/nvptx/simple-bar.h
@@ -56,15 +56,13 @@ gomp_simple_barrier_destroy (gomp_simple_barrier_t *bar)
}
static inline void
-gomp_simple_barrier_wait (gomp_simple_barrier_t *bar,
- unsigned id __attribute__ ((unused)))
+gomp_simple_barrier_wait (gomp_simple_barrier_t *bar)
{
- asm volatile ("bar.sync 0, %0;" : : "r"(bar->count) : "memory");
+ asm volatile ("bar.sync 0, %0;" : : "r" (bar->count) : "memory");
}
static inline void
-gomp_simple_barrier_wait_last (gomp_simple_barrier_t *bar,
- unsigned id __attribute__ ((unused)))
+gomp_simple_barrier_wait_last (gomp_simple_barrier_t *bar)
{
asm volatile ("bar.arrive 0, %0;" : : "r" (bar->count) : "memory");
}
diff --git a/libgomp/config/nvptx/team.c b/libgomp/config/nvptx/team.c
index 8eee855a5d6..f05e1ee0f91 100644
--- a/libgomp/config/nvptx/team.c
+++ b/libgomp/config/nvptx/team.c
@@ -129,7 +129,7 @@ gomp_thread_start (struct gomp_thread_pool *pool)
do
{
- gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
+ gomp_simple_barrier_wait (&pool->threads_dock);
if (!thr->fn)
continue;
thr->fn (thr->data);
@@ -207,7 +207,7 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned
nthreads,
team->ordered_release[i] = &nthr->release;
}
- gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
+ gomp_simple_barrier_wait (&pool->threads_dock);
}
int
diff --git a/libgomp/config/posix/bar.h b/libgomp/config/posix/bar.h
index 3c87baf6692..7a707b33c3f 100644
--- a/libgomp/config/posix/bar.h
+++ b/libgomp/config/posix/bar.h
@@ -158,47 +158,13 @@ gomp_team_barrier_done (gomp_barrier_t *bar,
gomp_barrier_state_t state,
bar->generation = (state & -BAR_INCR) + BAR_INCR;
}
-/* Functions dummied out for this implementation. */
-static inline void
-gomp_barrier_prepare_reinit (gomp_barrier_t * bar __attribute__ ((unused)),
- unsigned id __attribute__ ((unused)))
-{}
-
-static inline void
-gomp_barrier_minimal_reinit (gomp_barrier_t *bar, unsigned nthreads,
- unsigned num_new_threads __attribute__ ((unused)))
-{
- gomp_barrier_reinit (bar, nthreads);
-}
-
-static inline void
-gomp_barrier_reinit_1 (gomp_barrier_t *bar, unsigned nthreads,
- unsigned num_new_threads,
- unsigned long long *new_ids __attribute__ ((unused)))
-{
- if (num_new_threads)
- {
- gomp_mutex_lock (&bar->mutex1);
- bar->total += num_new_threads;
- gomp_mutex_unlock (&bar->mutex1);
- }
-}
-
-static inline void
-gomp_barrier_reinit_2 (gomp_barrier_t *bar, unsigned nthreads)
-{
- gomp_barrier_reinit (bar, nthreads);
-}
-
static inline bool
-gomp_barrier_has_space (gomp_barrier_t *bar __attribute__ ((unused)),
- unsigned nthreads __attribute__ ((unused)))
+gomp_barrier_can_hold (gomp_barrier_t *bar)
{
- /* Space to handle `nthreads`. Only thing that we need is to set bar->total
- to `nthreads`. Can always do that. */
- return true;
+ return false;
}
+/* Functions dummied out for this implementation. */
static inline void
gomp_team_barrier_ensure_last (gomp_barrier_t *bar __attribute__ ((unused)),
unsigned id __attribute__ ((unused)),
@@ -224,7 +190,13 @@ gomp_team_barrier_ensure_cancel_last (gomp_barrier_t *bar
}
static inline void
-gomp_reset_cancellable_primary_threadgen (gomp_barrier_t *bar, unsigned id)
+gomp_reset_cancellable_primary_threadgen (gomp_barrier_t *bar,
+ unsigned id __attribute__ ((unused)))
+{}
+
+static inline void
+gomp_team_barrier_done_final (gomp_barrier_t *bar,
+ unsigned id __attribute__ ((unused)))
{}
#endif /* GOMP_BARRIER_H */
diff --git a/libgomp/config/posix/pool.h b/libgomp/config/posix/pool.h
index d250d327df0..4415db3b51e 100644
--- a/libgomp/config/posix/pool.h
+++ b/libgomp/config/posix/pool.h
@@ -44,6 +44,7 @@ gomp_get_thread_pool (struct gomp_thread *thr, unsigned
nthreads)
pool->threads_size = 0;
pool->threads_used = 0;
pool->last_team = NULL;
+ pool->prev_barrier = NULL;
pool->threads_busy = nthreads;
thr->thread_pool = pool;
pthread_setspecific (gomp_thread_destructor, thr);
diff --git a/libgomp/config/posix/simple-bar.h
b/libgomp/config/posix/simple-bar.h
index 7a3a38f2b63..74c40c10398 100644
--- a/libgomp/config/posix/simple-bar.h
+++ b/libgomp/config/posix/simple-bar.h
@@ -43,18 +43,9 @@ gomp_simple_barrier_init (gomp_simple_barrier_t *bar,
unsigned count)
}
static inline void
-gomp_simple_barrier_minimal_reinit (gomp_simple_barrier_t *bar, unsigned
nthreads,
- unsigned num_new_threads)
+gomp_simple_barrier_reinit (gomp_simple_barrier_t *sbar, unsigned nthreads)
{
- gomp_barrier_minimal_reinit (&bar->bar, nthreads, num_new_threads);
-}
-
-static inline void
-gomp_simple_barrier_reinit_1 (gomp_simple_barrier_t *bar, unsigned nthreads,
- unsigned num_new_threads,
- unsigned long long *new_ids)
-{
- gomp_barrier_reinit_1 (&bar->bar, nthreads, num_new_threads, new_ids);
+ gomp_barrier_reinit (&sbar->bar, nthreads);
}
static inline void
@@ -64,33 +55,17 @@ gomp_simple_barrier_destroy (gomp_simple_barrier_t *bar)
}
static inline void
-gomp_simple_barrier_wait (gomp_simple_barrier_t *bar, unsigned id)
-{
- gomp_barrier_wait (&bar->bar, id);
-}
-
-static inline void
-gomp_simple_barrier_wait_last (gomp_simple_barrier_t *bar, unsigned id)
+gomp_simple_barrier_wait (gomp_simple_barrier_t *bar)
{
- gomp_barrier_wait_last (&bar->bar, id);
+ /* In default implementation the barrier ID argument is unused. */
+ gomp_barrier_wait (&bar->bar, 0);
}
static inline void
-gomp_simple_barrier_prepare_reinit (gomp_simple_barrier_t *sbar, unsigned id)
-{
- gomp_barrier_prepare_reinit (&sbar->bar, id);
-}
-
-static inline void
-gomp_simple_barrier_reinit_2 (gomp_simple_barrier_t *sbar, unsigned nthreads)
-{
- gomp_barrier_reinit_2 (&sbar->bar, nthreads);
-}
-
-static inline bool
-gomp_simple_barrier_has_space (gomp_simple_barrier_t *sbar, unsigned nthreads)
+gomp_simple_barrier_wait_last (gomp_simple_barrier_t *bar)
{
- return gomp_barrier_has_space (&sbar->bar, nthreads);
+ /* In default implementation the barrier ID argument is unused. */
+ gomp_barrier_wait_last (&bar->bar, 0);
}
#endif /* GOMP_SIMPLE_BARRIER_H */
diff --git a/libgomp/config/rtems/bar.h b/libgomp/config/rtems/bar.h
index 0d10efa28b1..cf4ca958362 100644
--- a/libgomp/config/rtems/bar.h
+++ b/libgomp/config/rtems/bar.h
@@ -73,16 +73,18 @@ static inline void gomp_barrier_destroy (gomp_barrier_t
*bar)
{
}
-extern void gomp_barrier_wait (gomp_barrier_t *);
-extern void gomp_barrier_wait_last (gomp_barrier_t *);
-extern void gomp_barrier_wait_end (gomp_barrier_t *, gomp_barrier_state_t);
-extern void gomp_team_barrier_wait (gomp_barrier_t *);
-extern void gomp_team_barrier_wait_final (gomp_barrier_t *);
-extern void gomp_team_barrier_wait_end (gomp_barrier_t *,
- gomp_barrier_state_t);
-extern bool gomp_team_barrier_wait_cancel (gomp_barrier_t *);
+extern void gomp_barrier_wait (gomp_barrier_t *, unsigned);
+extern void gomp_barrier_wait_last (gomp_barrier_t *, unsigned);
+extern void gomp_barrier_wait_end (gomp_barrier_t *, gomp_barrier_state_t,
+ unsigned);
+
+extern void gomp_team_barrier_wait (gomp_barrier_t *, unsigned);
+extern void gomp_team_barrier_wait_final (gomp_barrier_t *, unsigned);
+extern void gomp_team_barrier_wait_end (gomp_barrier_t *, gomp_barrier_state_t,
+ unsigned);
+extern bool gomp_team_barrier_wait_cancel (gomp_barrier_t *, unsigned);
extern bool gomp_team_barrier_wait_cancel_end (gomp_barrier_t *,
- gomp_barrier_state_t);
+ gomp_barrier_state_t, unsigned);
extern void gomp_team_barrier_wake (gomp_barrier_t *, int);
struct gomp_team;
extern void gomp_team_barrier_cancel (struct gomp_team *);
@@ -167,4 +169,45 @@ gomp_team_barrier_done (gomp_barrier_t *bar,
gomp_barrier_state_t state)
bar->generation = (state & -BAR_INCR) + BAR_INCR;
}
+static inline bool
+gomp_barrier_can_hold (gomp_barrier_t *bar)
+{
+ return false;
+}
+
+/* Functions dummied out for this implementation. */
+static inline void
+gomp_team_barrier_ensure_last (gomp_barrier_t *bar __attribute__ ((unused)),
+ unsigned id __attribute__ ((unused)),
+ gomp_barrier_state_t state
+ __attribute__ ((unused)))
+{}
+
+static inline bool
+gomp_team_barrier_ensure_cancel_last (gomp_barrier_t *bar
+ __attribute__ ((unused)),
+ unsigned id __attribute__ ((unused)),
+ gomp_barrier_state_t state
+ __attribute__ ((unused)))
+{
+ /* After returning BAR_WAS_LAST, actually ensure that this thread is last.
+ Return `true` if this thread is known last into the barrier return `false`
+ if the barrier got cancelled such that not all threads entered the
barrier.
+
+ Since BAR_WAS_LAST is only set for a thread when that thread decremented
+ the `awaited` counter to zero we know that all threads must have entered
+ the barrier. Hence always return `true`. */
+ return true;
+}
+
+static inline void
+gomp_reset_cancellable_primary_threadgen (gomp_barrier_t *bar,
+ unsigned id __attribute__ ((unused)))
+{}
+
+static inline void
+gomp_team_barrier_done_final (gomp_barrier_t *bar,
+ unsigned id __attribute__ ((unused)))
+{}
+
#endif /* GOMP_BARRIER_H */
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index de62a205e2d..f68b8baa8b9 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -935,6 +935,7 @@ struct gomp_thread_pool
/* This barrier holds and releases threads waiting in thread pools. */
gomp_simple_barrier_t threads_dock;
+ gomp_barrier_t *prev_barrier;
};
enum gomp_cancel_kind
@@ -1108,7 +1109,7 @@ extern unsigned gomp_dynamic_max_threads (void);
extern void gomp_init_task (struct gomp_task *, struct gomp_task *,
struct gomp_task_icv *);
extern void gomp_end_task (void);
-extern void gomp_barrier_handle_tasks (gomp_barrier_state_t, bool);
+extern void gomp_barrier_handle_tasks (gomp_barrier_state_t, unsigned);
extern void gomp_task_maybe_wait_for_dependencies (void **);
extern bool gomp_create_target_task (struct gomp_device_descr *,
void (*) (void *), size_t, void **,
diff --git a/libgomp/task.c b/libgomp/task.c
index 0cda2190698..e4520a78413 100644
--- a/libgomp/task.c
+++ b/libgomp/task.c
@@ -1549,7 +1549,7 @@ gomp_task_run_post_remove_taskgroup (struct gomp_task
*child_task)
}
void
-gomp_barrier_handle_tasks (gomp_barrier_state_t state, bool use_cancel)
+gomp_barrier_handle_tasks (gomp_barrier_state_t state, unsigned increment)
{
struct gomp_thread *thr = gomp_thread ();
struct gomp_team *team = thr->ts.team;
@@ -1559,11 +1559,24 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state,
bool use_cancel)
int do_wake = 0;
gomp_mutex_lock (&team->task_lock);
+ /* N.b. Need to have this when `gomp_barrier_can_hold` because then there is
+ only one barrier in between parallel regions. When only one barrier
+ between parallel regions we need to avoid running tasks from "the next"
+ region because they might be reduction tasks that we have not done the
+ initialisation for.
+ If PR122314 is agreed to be a problem then we would always need this if
+ clause whether or not `gomp_barrier_can_hold`. */
+ if (gomp_barrier_completed (state, increment, &team->barrier))
+ {
+ gomp_mutex_unlock (&team->task_lock);
+ return;
+ }
+
if (gomp_barrier_last_thread (state))
{
if (team->task_count == 0)
{
- gomp_team_barrier_done (&team->barrier, state, use_cancel);
+ gomp_team_barrier_done (&team->barrier, state, increment);
gomp_mutex_unlock (&team->task_lock);
gomp_team_barrier_wake (&team->barrier, 0);
return;
@@ -1600,7 +1613,7 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state,
bool use_cancel)
else if (team->task_count == 0
&& gomp_team_barrier_waiting_for_tasks (&team->barrier))
{
- gomp_team_barrier_done (&team->barrier, state, use_cancel);
+ gomp_team_barrier_done (&team->barrier, state, increment);
gomp_mutex_unlock (&team->task_lock);
gomp_team_barrier_wake (&team->barrier, 0);
if (to_free)
diff --git a/libgomp/team.c b/libgomp/team.c
index 822ec240971..cac0cd8eb6a 100644
--- a/libgomp/team.c
+++ b/libgomp/team.c
@@ -62,6 +62,23 @@ struct gomp_thread_start_data
pthread_t handle;
};
+static void
+gomp_release_held_threads (struct gomp_thread_pool *pool,
+ struct gomp_team *team, unsigned team_id)
+{
+ gomp_assert (team_id == 0, "Releasing threads from non-primary thread %u",
+ team_id);
+ if (pool->prev_barrier)
+ {
+ struct gomp_team *saved_team __attribute__ ((unused))
+ = pool->last_team ? pool->last_team : team;
+ gomp_assert (pool->prev_barrier == &saved_team->barrier,
+ "prev_barrier not within cached team: %p != %p",
+ pool->prev_barrier, &saved_team->barrier);
+ gomp_team_barrier_done_final (pool->prev_barrier, team_id);
+ pool->prev_barrier = NULL;
+ }
+}
/* This function is a pthread_create entry point. This contains the idle
loop in which a thread waits to be called up to become part of a team. */
@@ -120,7 +137,7 @@ gomp_thread_start (void *xdata)
{
pool->threads[thr->ts.team_id] = thr;
- gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
+ gomp_simple_barrier_wait (&pool->threads_dock);
do
{
struct gomp_team *team = thr->ts.team;
@@ -129,8 +146,8 @@ gomp_thread_start (void *xdata)
local_fn (local_data);
gomp_team_barrier_wait_final (&team->barrier, thr->ts.team_id);
gomp_finish_task (task);
-
- gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
+ if (!gomp_barrier_can_hold (&team->barrier))
+ gomp_simple_barrier_wait (&pool->threads_dock);
local_fn = thr->fn;
local_data = thr->data;
@@ -153,11 +170,18 @@ get_last_team (unsigned nthreads)
struct gomp_thread *thr = gomp_thread ();
if (thr->ts.team == NULL)
{
+ gomp_assert (thr->ts.level == 0,
+ "Looking for cached team in nested region %u thr->ts.level",
+ thr->ts.level);
struct gomp_thread_pool *pool = gomp_get_thread_pool (thr, nthreads);
struct gomp_team *last_team = pool->last_team;
if (last_team != NULL && last_team->nthreads == nthreads)
{
- pool->last_team = NULL;
+ gomp_assert (!pool->prev_barrier
+ || pool->prev_barrier == &pool->last_team->barrier,
+ "prev_barrier not within cached team: %p != %p",
+ pool->prev_barrier, &pool->last_team->barrier);
+ pool->last_team = NULL;
return last_team;
}
}
@@ -243,7 +267,7 @@ gomp_free_pool_helper (void *thread_pool)
struct gomp_thread *thr = gomp_thread ();
struct gomp_thread_pool *pool
= (struct gomp_thread_pool *) thread_pool;
- gomp_simple_barrier_wait_last (&pool->threads_dock, thr->ts.team_id);
+ gomp_simple_barrier_wait_last (&pool->threads_dock);
gomp_sem_destroy (&thr->release);
thr->thread_pool = NULL;
thr->task = NULL;
@@ -278,10 +302,12 @@ gomp_free_thread (void *arg __attribute__((unused)))
nthr->data = pool;
}
/* This barrier undocks threads docked on pool->threads_dock. */
- gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
+ gomp_release_held_threads (pool, NULL, thr->ts.team_id);
+ if (!gomp_barrier_can_hold (pool->prev_barrier))
+ gomp_simple_barrier_wait (&pool->threads_dock);
/* And this waits till all threads have called gomp_barrier_wait_last
in gomp_free_pool_helper. */
- gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
+ gomp_simple_barrier_wait (&pool->threads_dock);
/* Now it is safe to destroy the barrier and free the pool. */
gomp_simple_barrier_destroy (&pool->threads_dock);
@@ -327,6 +353,8 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
bool nested;
struct gomp_thread_pool *pool;
unsigned i, n, old_threads_used = 0;
+ unsigned simple_barrier_n = 0;
+ unsigned num_new_threads = 0;
pthread_attr_t thread_attr, *attr;
unsigned long nthreads_var;
char bind, bind_var;
@@ -457,13 +485,6 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned
nthreads,
else
bind = omp_proc_bind_false;
- unsigned bits_per_ull = sizeof (unsigned long long) * CHAR_BIT;
- int id_arr_len = ((nthreads + pool->threads_used) / bits_per_ull) + 1;
- unsigned long long new_ids[id_arr_len];
- for (int j = 0; j < id_arr_len; j++) {
- new_ids[j] = 0;
- }
-
/* We only allow the reuse of idle threads for non-nested PARALLEL
regions. This appears to be implied by the semantics of
threadprivate variables, but perhaps that's reading too much into
@@ -471,8 +492,6 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned
nthreads,
only the initial program thread will modify gomp_threads. */
if (!nested)
{
- /* This current thread is always re-used in next team. */
- unsigned total_reused = 1;
gomp_assert (team->prev_ts.team_id == 0,
"Starting a team from thread with id %u in previous team\n",
team->prev_ts.team_id);
@@ -483,11 +502,19 @@ gomp_team_start (void (*fn) (void *), void *data,
unsigned nthreads,
else if (old_threads_used == 0)
{
n = 0;
+ simple_barrier_n = nthreads;
gomp_simple_barrier_init (&pool->threads_dock, nthreads);
}
else
- n = old_threads_used;
+ {
+ n = old_threads_used;
+ simple_barrier_n = gomp_barrier_can_hold (pool->prev_barrier)
+ ? nthreads - (n - 1) : nthreads;
+ /* Increase the barrier threshold to make sure all new
+ threads arrive before the team is released. */
+ gomp_simple_barrier_reinit (&pool->threads_dock, simple_barrier_n);
+ }
/* Not true yet, but soon will be. We're going to release all
threads from the dock, and those that aren't part of the
@@ -509,7 +536,6 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned
nthreads,
}
/* Release existing idle threads. */
- bool have_prepared = false;
for (; i < n; ++i)
{
unsigned int place_partition_off = thr->ts.place_partition_off;
@@ -651,21 +677,7 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned
nthreads,
nthr->ts.team = team;
nthr->ts.work_share = &team->work_shares[0];
nthr->ts.last_work_share = NULL;
- /* If we're changing any threads team_id then we need to wait for all
- other threads to have reached the barrier. */
- if (nthr->ts.team_id != i && !have_prepared) {
- gomp_simple_barrier_prepare_reinit(&pool->threads_dock, thr->ts.team_id);
- have_prepared = true;
- }
nthr->ts.team_id = i;
- {
- unsigned idx = (i / bits_per_ull);
- gomp_assert (!(new_ids[idx] & (1ULL << (i % bits_per_ull))),
- "new_ids[%u] == %llu (for `i` %u)",
- idx, new_ids[idx], i);
- new_ids[idx] |= (1ULL << (i % bits_per_ull));
- }
- total_reused += 1;
nthr->ts.level = team->prev_ts.level + 1;
nthr->ts.active_level = thr->ts.active_level;
nthr->ts.place_partition_off = place_partition_off;
@@ -736,59 +748,30 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
}
break;
}
- }
- }
- /* If we are changing the number of threads *or* if we are starting new
- threads for any reason. Then update the barrier accordingly.
-
- The handling of the barrier here is different for the different
- designs of barrier.
-
- The `posix/bar.h` design needs to "grow" to accomodate the extra
- threads that we'll wait on, then "shrink" to the size we want
- eventually.
-
- The `linux/bar.h` design needs to assign positions for each thread.
- Some of the threads getting started will want the position of a thread
- that is currently running. Hence we need to (1) serialise existing
- threads then (2) set up barierr state for the incoming new threads.
- Once this is done we don't need any equivalent of the "shrink" step
- later. This does result in a longer period of serialisation than
- the posix/bar.h design, but it seems that this is a fair trade-off to
- make for the design that is faster under contention. */
- if (old_threads_used != 0
- && (nthreads != pool->threads_dock.bar.total || i < nthreads))
- {
- /* If all we've done is increase the number of threads that we want,
- don't need to serialise anything (wake flags don't need to be
- adjusted). */
- if (nthreads > old_threads_used && affinity_count == 0
- && total_reused == old_threads_used
- /* `have_prepared` can be used to detect whether we re-shuffled
- any threads around. */
- && !have_prepared
- && gomp_simple_barrier_has_space (&pool->threads_dock, nthreads))
- gomp_simple_barrier_minimal_reinit (&pool->threads_dock, nthreads,
- nthreads - old_threads_used);
- else
- {
- /* Otherwise, we need to ensure that we've paused all existing
- threads (waiting on us to restart them) before adjusting their
- wake flags. */
- if (!have_prepared)
- gomp_simple_barrier_prepare_reinit (&pool->threads_dock,
- thr->ts.team_id);
- gomp_simple_barrier_reinit_1 (&pool->threads_dock, nthreads,
- nthreads <= total_reused
- ? 0
- : nthreads - total_reused,
- new_ids);
+ /* Increase the barrier threshold to make sure all new
+ threads and all the threads we're going to let die
+ arrive before the team is released. */
+ if (affinity_count)
+ {
+ unsigned num_reused_threads = old_threads_used - affinity_count;
+ /* Add one for the primary thread. */
+ simple_barrier_n = gomp_barrier_can_hold (pool->prev_barrier)
+ ? (nthreads - num_reused_threads) + 1
+ : nthreads + affinity_count;
+ gomp_simple_barrier_reinit (&pool->threads_dock,
+ simple_barrier_n);
+ }
}
}
if (i == nthreads)
- goto do_release;
+ {
+ gomp_assert (simple_barrier_n == 0,
+ "Have calculated need simple_barrier_n == %u"
+ " but not starting any threads", simple_barrier_n);
+ goto do_release;
+ }
}
@@ -921,10 +904,23 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
attr = gomp_adjust_thread_attr (attr, &thread_attr);
err = pthread_create (&start_data->handle, attr, gomp_thread_start,
start_data);
+ num_new_threads += 1;
start_data++;
if (err != 0)
gomp_fatal ("Thread creation failed: %s", strerror (err));
}
+ unsigned adjustment __attribute__ ((unused))
+ = (old_threads_used == 0 || gomp_barrier_can_hold (pool->prev_barrier))
+ ? 1
+ : old_threads_used;
+ gomp_assert (num_new_threads == simple_barrier_n - adjustment || nested,
+ "simple barrier calculation incorrect for %s team: "
+ "simple_barrier_n: %u"
+ ", nthreads: %u"
+ ", num_new_threads: %u"
+ ", old_threads_used: %u",
+ nested ? "nested" : "top-level", simple_barrier_n, nthreads,
+ num_new_threads, old_threads_used);
if (__builtin_expect (attr == &thread_attr, 0))
pthread_attr_destroy (&thread_attr);
@@ -933,7 +929,11 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
if (nested)
gomp_barrier_wait (&team->barrier, thr->ts.team_id);
else
- gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
+ {
+ gomp_release_held_threads (pool, team, thr->ts.team_id);
+ if (num_new_threads || !gomp_barrier_can_hold (pool->prev_barrier))
+ gomp_simple_barrier_wait (&pool->threads_dock);
+ }
/* Decrease the barrier threshold to match the number of threads
that should arrive back at the end of this team. The extra
@@ -951,7 +951,8 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
if (affinity_count)
diff = -affinity_count;
- gomp_simple_barrier_reinit_2 (&pool->threads_dock, nthreads);
+ gomp_simple_barrier_reinit (&pool->threads_dock, nthreads);
+
#ifdef HAVE_SYNC_BUILTINS
__sync_fetch_and_add (&gomp_managed_threads, diff);
#else
@@ -1017,7 +1018,9 @@ gomp_team_end (void)
As #pragma omp cancel parallel might get awaited count in
team->barrier in a inconsistent state, we need to use a different
counter here. */
- gomp_team_barrier_wait_final (&team->barrier, thr->ts.team_id);
+ gomp_team_barrier_wait_final (&team->barrier, team_id);
+ /* After this point all tasking has finished. Just haven't told all
+ secondary threads that they're good to continue. */
if (__builtin_expect (team->team_cancelled, 0))
{
struct gomp_work_share *ws = team->work_shares_to_free;
@@ -1037,8 +1040,11 @@ gomp_team_end (void)
gomp_end_task ();
thr->ts = team->prev_ts;
+ bool finished_threads __attribute__ ((unused)) = false;
if (__builtin_expect (thr->ts.level != 0, 0))
{
+ gomp_team_barrier_done_final (&team->barrier, team_id);
+ finished_threads = true;
#ifdef HAVE_SYNC_BUILTINS
__sync_fetch_and_add (&gomp_managed_threads, 1L - team->nthreads);
#else
@@ -1064,15 +1070,31 @@ gomp_team_end (void)
}
gomp_sem_destroy (&team->master_release);
- if (__builtin_expect (thr->ts.team != NULL, 0)
+ if (__builtin_expect (thr->ts.team != NULL
+ && (thr->ts.team->nthreads != 1
+ || thr->ts.level != 0), 0)
|| __builtin_expect (team->nthreads == 1, 0))
- free_team (team);
+ {
+ gomp_assert (finished_threads || team->nthreads == 1,
+ "Freeing team while threads may be waiting on it."
+ " team = %p, level = %u", thr->ts.team, thr->ts.level);
+ free_team (team);
+ }
else
{
struct gomp_thread_pool *pool = thr->thread_pool;
if (pool->last_team)
free_team (pool->last_team);
pool->last_team = team;
+ gomp_assert (!finished_threads,
+ "Have let threads go while still recording prev_barrier"
+ " level = %u, team = %p, nthreads = %u",
+ thr->ts.level, thr->ts.team, team->nthreads);
+ gomp_assert (!pool->prev_barrier,
+ "Previous barrier not released before it gets overridden"
+ " level = %u, team = %p, nthreads = %u",
+ thr->ts.level, thr->ts.team, team->nthreads);
+ pool->prev_barrier = &team->barrier;
gomp_release_thread_pool (pool);
}
}
@@ -1112,7 +1134,7 @@ gomp_pause_pool_helper (void *thread_pool)
struct gomp_thread *thr = gomp_thread ();
struct gomp_thread_pool *pool
= (struct gomp_thread_pool *) thread_pool;
- gomp_simple_barrier_wait_last (&pool->threads_dock, thr->ts.team_id);
+ gomp_simple_barrier_wait_last (&pool->threads_dock);
gomp_sem_destroy (&thr->release);
thr->thread_pool = NULL;
thr->task = NULL;
@@ -1144,10 +1166,12 @@ gomp_pause_host (void)
thrs[i] = gomp_thread_to_pthread_t (nthr);
}
/* This barrier undocks threads docked on pool->threads_dock. */
- gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
- /* And this waits till all threads have called gomp_barrier_wait_last
+ gomp_release_held_threads (pool, NULL, thr->ts.team_id);
+ if (!gomp_barrier_can_hold (pool->prev_barrier))
+ gomp_simple_barrier_wait (&pool->threads_dock);
+ /* And this waits till all threads have called gomp_simple_barrier_wait_last
in gomp_pause_pool_helper. */
- gomp_simple_barrier_wait (&pool->threads_dock, thr->ts.team_id);
+ gomp_simple_barrier_wait (&pool->threads_dock);
/* Now it is safe to destroy the barrier and free the pool. */
gomp_simple_barrier_destroy (&pool->threads_dock);
diff --git a/libgomp/testsuite/libgomp.c++/task-reduction-20.C b/libgomp/testsuite/libgomp.c++/task-reduction-20.C
new file mode 100644
index 00000000000..1a413d282a0
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c++/task-reduction-20.C
@@ -0,0 +1,126 @@
+extern "C" void abort ();
+
+struct S { S (); S (long long int, int); ~S (); static int cnt1, cnt2, cnt3; long long int s; int t; };
+
+int S::cnt1;
+int S::cnt2;
+int S::cnt3;
+
+S::S ()
+{
+ #pragma omp atomic
+ cnt1++;
+}
+
+S::S (long long int x, int y) : s (x), t (y)
+{
+ #pragma omp atomic update
+ ++cnt2;
+}
+
+S::~S ()
+{
+ #pragma omp atomic
+ cnt3 = cnt3 + 1;
+ if (t < 3 || t > 9 || (t & 1) == 0)
+ abort ();
+}
+
+void
+bar (S *p, S *o)
+{
+ p->s = 1;
+ if (o->t != 5)
+ abort ();
+ p->t = 9;
+}
+
+static inline void
+baz (S *o, S *i)
+{
+ if (o->t != 5 || i->t != 9)
+ abort ();
+ o->s *= i->s;
+}
+
+#pragma omp declare reduction (+: S : omp_out.s += omp_in.s) initializer (omp_priv (0, 3))
+#pragma omp declare reduction (*: S : baz (&omp_out, &omp_in)) initializer (bar (&omp_priv, &omp_orig))
+
+S as = { 0LL, 7 };
+S &a = as;
+S bs (1LL, 5);
+S &b = bs;
+
+void
+foo (S &c, S &d)
+{
+ int i;
+ for (i = 0; i < 2; i++)
+ #pragma omp task in_reduction (+: c) in_reduction (*: b, d) in_reduction (+: a)
+ {
+ a.s += 7;
+ b.s *= 2;
+ c.s += 9;
+ d.s *= 3;
+ if ((a.t != 7 && a.t != 3) || (b.t != 5 && b.t != 9)
+ || (c.t != 7 && c.t != 3) || (d.t != 5 && d.t != 9))
+ abort ();
+ }
+}
+
+void
+test ()
+{
+ S cs = { 0LL, 7 };
+ S &c = cs;
+ S ds (1LL, 5);
+ S &d = ds;
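+  /* Initial empty parallel region: presumably here so the thread pool already
+     exists and the task-reduction region below re-uses held pooled threads.  */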
+#pragma omp parallel
+ {
+ asm volatile("" ::: "memory");
+ }
+#pragma omp parallel reduction (task, +: a, c) reduction (task, *: b, d)
+ {
+#pragma omp for
+ for (int i = 0; i < 4; i++)
+#pragma omp task in_reduction (*: b, d) in_reduction (+: a, c)
+ {
+ int j;
+ a.s += 7;
+ b.s *= 2;
+ for (j = 0; j < 2; j++)
+#pragma omp task in_reduction (+: a) in_reduction (*: b) \
+ in_reduction (+: c) in_reduction (*: d)
+ {
+ a.s += 7;
+ b.s *= 2;
+ c.s += 9;
+ d.s *= 3;
+ foo (c, d);
+ if ((a.t != 7 && a.t != 3) || (b.t != 5 && b.t != 9)
+ || (c.t != 7 && c.t != 3) || (d.t != 5 && d.t != 9))
+ abort ();
+ }
+ c.s += 9;
+ d.s *= 3;
+ if ((a.t != 7 && a.t != 3) || (b.t != 5 && b.t != 9)
+ || (c.t != 7 && c.t != 3) || (d.t != 5 && d.t != 9))
+ abort ();
+ }
+ }
+#define THREEP7 (3LL * 3LL * 3LL * 3LL * 3LL * 3LL * 3LL)
+ if (d.s != (THREEP7 * THREEP7 * THREEP7 * THREEP7) || d.t != 5)
+ abort ();
+ if (a.s != 28 * 7 || a.t != 7 || b.s != (1L << 28) || b.t != 5
+ || c.s != 28 * 9 || c.t != 7)
+ abort ();
+}
+
+int
+main ()
+{
+ int c1 = S::cnt1, c2 = S::cnt2, c3 = S::cnt3;
+ test ();
+ if (S::cnt1 + S::cnt2 - c1 - c2 != S::cnt3 - c3)
+ abort ();
+}
diff --git a/libgomp/testsuite/libgomp.c++/task-reduction-21.C b/libgomp/testsuite/libgomp.c++/task-reduction-21.C
new file mode 100644
index 00000000000..539f6f93ffd
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c++/task-reduction-21.C
@@ -0,0 +1,130 @@
+extern "C" void abort ();
+
+struct S { S (); S (long long int, int); ~S (); static int cnt1, cnt2, cnt3; long long int s; int t; };
+
+int S::cnt1;
+int S::cnt2;
+int S::cnt3;
+
+S::S ()
+{
+ #pragma omp atomic
+ cnt1++;
+}
+
+S::S (long long int x, int y) : s (x), t (y)
+{
+ #pragma omp atomic update
+ ++cnt2;
+}
+
+S::~S ()
+{
+ #pragma omp atomic
+ cnt3 = cnt3 + 1;
+ if (t < 3 || t > 9 || (t & 1) == 0)
+ abort ();
+}
+
+void
+bar (S *p, S *o)
+{
+ p->s = 1;
+ if (o->t != 5)
+ abort ();
+ p->t = 9;
+}
+
+static inline void
+baz (S *o, S *i)
+{
+ if (o->t != 5 || i->t != 9)
+ abort ();
+ o->s *= i->s;
+}
+
+#pragma omp declare reduction (+: S : omp_out.s += omp_in.s) initializer (omp_priv (0, 3))
+#pragma omp declare reduction (*: S : baz (&omp_out, &omp_in)) initializer (bar (&omp_priv, &omp_orig))
+
+S as = { 0LL, 7 };
+S &a = as;
+S bs (1LL, 5);
+S &b = bs;
+
+void
+foo (S &c, S &d)
+{
+ int i;
+ for (i = 0; i < 2; i++)
+ #pragma omp task in_reduction (+: c) in_reduction (*: b, d) in_reduction (+: a)
+ {
+ a.s += 7;
+ b.s *= 2;
+ c.s += 9;
+ d.s *= 3;
+ if ((a.t != 7 && a.t != 3) || (b.t != 5 && b.t != 9)
+ || (c.t != 7 && c.t != 3) || (d.t != 5 && d.t != 9))
+ abort ();
+ }
+}
+
+void
+test ()
+{
+ S cs = { 0LL, 7 };
+ S &c = cs;
+ S ds (1LL, 5);
+ S &d = ds;
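+  /* Initial parallel region that runs a few empty tasks: presumably to
+     exercise re-use of pooled threads after a team whose barrier has also
+     handled tasking.  */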
+#pragma omp parallel
+ {
+ for (int i = 0; i < 4; i++)
+#pragma omp task
+ {
+ asm volatile("" ::: "memory");
+ }
+ }
+#pragma omp parallel reduction (task, +: a, c) reduction (task, *: b, d)
+ {
+#pragma omp for
+ for (int i = 0; i < 4; i++)
+#pragma omp task in_reduction (*: b, d) in_reduction (+: a, c)
+ {
+ int j;
+ a.s += 7;
+ b.s *= 2;
+ for (j = 0; j < 2; j++)
+#pragma omp task in_reduction (+: a) in_reduction (*: b) \
+ in_reduction (+: c) in_reduction (*: d)
+ {
+ a.s += 7;
+ b.s *= 2;
+ c.s += 9;
+ d.s *= 3;
+ foo (c, d);
+ if ((a.t != 7 && a.t != 3) || (b.t != 5 && b.t != 9)
+ || (c.t != 7 && c.t != 3) || (d.t != 5 && d.t != 9))
+ abort ();
+ }
+ c.s += 9;
+ d.s *= 3;
+ if ((a.t != 7 && a.t != 3) || (b.t != 5 && b.t != 9)
+ || (c.t != 7 && c.t != 3) || (d.t != 5 && d.t != 9))
+ abort ();
+ }
+ }
+#define THREEP7 (3LL * 3LL * 3LL * 3LL * 3LL * 3LL * 3LL)
+ if (d.s != (THREEP7 * THREEP7 * THREEP7 * THREEP7) || d.t != 5)
+ abort ();
+ if (a.s != 28 * 7 || a.t != 7 || b.s != (1L << 28) || b.t != 5
+ || c.s != 28 * 9 || c.t != 7)
+ abort ();
+}
+
+int
+main ()
+{
+ int c1 = S::cnt1, c2 = S::cnt2, c3 = S::cnt3;
+ test ();
+ if (S::cnt1 + S::cnt2 - c1 - c2 != S::cnt3 - c3)
+ abort ();
+}