> -----Original Message-----
> From: [email protected] <[email protected]>
> Sent: 26 November 2025 18:42
> To: [email protected]; Jakub Jelinek <[email protected]>;
> Tobias Burnus <[email protected]>
> Cc: Julian Brown <[email protected]>; Thomas Schwinge
> <[email protected]>; Andrew Stubbs <[email protected]>;
> Tom de Vries <[email protected]>; Sebastian Huber
> <sebastian.huber@embedded-brains.de>; Matthew Malcomson
> <[email protected]>
> Subject: [PATCH 0/5] OpenMP Barrier perf improvements
>
> From: Matthew Malcomson <[email protected]>
>
> I'd previously split these patches up into logically independent
> changes, but since the patches were written on top of one another,
> that has just made maintainers' jobs more difficult.
>   - Sebastian just pointed out that I'd included the wrong link in my
>     latest email so the combination was incorrect, so I'll go the way
>     less likely to make mistakes and send everything as a single
>     patch series.
>
> Hence sending up the patch series as a complete patchset ordered on
> top of each other.  In order to do that I rebased the "Move thread
> task re-initialisation into threads" patch on top of the others (some
> order had to be chosen since neither order cleanly applies).  This is
> the order in which I did most of my testing (and TSAN testing was done
> on the combination of all patches in the order sent here).
>
> Apologies for the noise & extra back-and-forth around attempting to
> apply patches.
>
> Including original cover letter for the patchset below here:
> ------------------------------
>
> Cc'ing in maintainers of nvptx, gcn, and rtems ports for target
> specific changes (especially with request for runtime testing).
>
> After having updated the target code, looked into various TODO's and
> ran more testing I've combined my previous patches into one patchset.
> This patchset drastically improves the performance on the
> micro-benchmark 119588.
>
> This micro-benchmark represents a significant slowdown in some OMP
> uses in NVPL BLAS running GEMM routines on small matrices with a high
> level of parallelism.  (High level of parallelism due to other
> routines in the code benefiting from many threads, and there being no
> low-overhead way to change the level of parallelism between routines).
>
> This patchset has 5 commits:
> 1) Is a fix for PR122314.  It ensures that GOMP tasks are executed
>    logically in the region where they are scheduled.
> 2) Is a fix for PR122356.  It ensures there is a memory
>    synchronisation point between tasks being run in the barrier and
>    the barrier continuing.
> 3) Changes the linux/ barrier implementation from the "centralized"
>    method currently used to a combination of a "linear" barrier gather
>    and "centralized" barrier release.
>    - I see this gives about a 3x improvement on time through a highly
>      contended barrier on a 144 core machine.
> 4) Follows the LLVM example and "wait" between parallel regions
>    *inside* the barrier rather than between two barriers.  This halves
>    the overhead from barriers on many consecutive parallel regions.
> 5) Reduces the data structure initialisation overhead when starting a
>    new parallel region.  Rather than have the primary thread
>    initialise each thread's data while each secondary thread is
>    waiting, the primary thread stores common data and lets each
>    secondary thread initialise its thread-specific data from that
>    shared information.
>
> Patches (1), (2), and (5) could all be made independent (with some
> adjustment for patch context).  Patch (4) requires patch (3) and patch
> (3) introduces some less-pleasant code structure that the changes in
> patch (4) help fix.
>
> In order to use the feature introduced in patch (3) we have to change
> the barrier API to pass an ID.  For patch (3) alone we also need to
> introduce some relatively awkward interfaces for adjusting the size of
> the barrier.
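
For anyone skimming the thread, here is a minimal sketch of the scheme
described in (3): each secondary thread publishes its arrival in its own
slot, the primary gathers those slots linearly, and a single shared word
releases everyone.  This is not the code from the patch.  The names
(flat_barrier, flat_barrier_wait, flat_barrier_demo), the 64-byte padding
and the sched_yield() spinning are assumptions made for illustration; the
linux/ implementation would presumably sleep on a futex instead.  It does
show why the wait call needs a thread ID.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_THREADS 64

/* Hypothetical layout: one padded arrival slot per thread, so arrivals do
   not contend on a single cache line, plus one shared release word.  */
struct flat_barrier {
    struct { _Alignas(64) atomic_uint arrived; } slot[MAX_THREADS];
    _Alignas(64) atomic_uint generation;
    unsigned nthreads;
};

static void flat_barrier_init(struct flat_barrier *bar, unsigned nthreads)
{
    for (unsigned i = 0; i < MAX_THREADS; i++)
        atomic_init(&bar->slot[i].arrived, 0);
    atomic_init(&bar->generation, 0);
    bar->nthreads = nthreads;
}

/* The caller passes its thread ID; ID 0 plays the primary thread.  */
static void flat_barrier_wait(struct flat_barrier *bar, unsigned id)
{
    unsigned gen = atomic_load_explicit(&bar->generation, memory_order_relaxed);
    if (id == 0) {
        /* Linear gather: poll each secondary's own slot in turn.  */
        for (unsigned i = 1; i < bar->nthreads; i++)
            while (atomic_load_explicit(&bar->slot[i].arrived,
                                        memory_order_acquire) != gen + 1)
                sched_yield();
        /* Centralized release: one store that every waiter observes.  */
        atomic_store_explicit(&bar->generation, gen + 1, memory_order_release);
    } else {
        atomic_store_explicit(&bar->slot[id].arrived, gen + 1,
                              memory_order_release);
        while (atomic_load_explicit(&bar->generation,
                                    memory_order_acquire) == gen)
            sched_yield();
    }
}

struct worker_arg {
    struct flat_barrier *bar;
    unsigned id;
    int rounds;
    int *phase;          /* phase[i] is written only by thread i */
    atomic_int *errors;
};

static void *worker(void *p)
{
    struct worker_arg *a = p;
    for (int r = 1; r <= a->rounds; r++) {
        a->phase[a->id] = r;
        flat_barrier_wait(a->bar, a->id);
        /* After the barrier every thread's write for round r is visible.  */
        for (unsigned i = 0; i < a->bar->nthreads; i++)
            if (a->phase[i] != r)
                atomic_fetch_add(a->errors, 1);
        /* Second barrier keeps the next round's writes out of this check.  */
        flat_barrier_wait(a->bar, a->id);
    }
    return NULL;
}

/* Runs the check with the calling thread as primary; returns the number of
   stale values observed (0 if the barrier behaved).  */
int flat_barrier_demo(unsigned nthreads, int rounds)
{
    struct flat_barrier bar;
    pthread_t tid[MAX_THREADS];
    struct worker_arg arg[MAX_THREADS];
    int phase[MAX_THREADS] = { 0 };
    atomic_int errors;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    flat_barrier_init(&bar, nthreads);
    atomic_init(&errors, 0);
    for (unsigned i = 0; i < nthreads; i++) {
        arg[i] = (struct worker_arg){ &bar, i, rounds, phase, &errors };
        if (i > 0 && pthread_create(&tid[i], NULL, worker, &arg[i]) != 0)
            abort();
    }
    worker(&arg[0]);
    for (unsigned i = 1; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return atomic_load(&errors);
}
```

Calling flat_barrier_demo(4, 1000) from a small main() should return 0; a
non-zero count would mean a thread got through the barrier before seeing
another thread's pre-barrier write.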
>
> Patch (4) removes that need for the new awkward interface (the only
> barrier that needs size adjustment is now no longer in the fast path).
>
> Since I hope to have both patches in I have only made changes for
> other targets to build on top of patch (4).  This is in order to avoid
> writing the implementation for this awkward interface that I intend to
> never be actually used.
>
> N.b. when I did bootstrap & regtest on the posix/ target I saw flaky
> tests before and after.  I believe they are the same flaky tests.

Hi,
I am pinging this patch series on behalf of Matthew, while he's currently
on leave.
[1/5] Enforce tasks executed lexically after scheduled:
https://gcc.gnu.org/pipermail/gcc-patches/2025-November/702029.html
[2/5] Ensure memory sync after performing tasks:
https://gcc.gnu.org/pipermail/gcc-patches/2025-November/702028.html
[3/5] Implement "flat" barrier for linux/ target:
https://gcc.gnu.org/pipermail/gcc-patches/2025-November/702031.html
[4/5] Removing one barrier in non-nested thread loop:
https://gcc.gnu.org/pipermail/gcc-patches/2025-November/702032.html
[5/5] Move thread task re-initialisation into threads:
https://gcc.gnu.org/pipermail/gcc-patches/2025-November/702030.html

Out of these patches, 1/5, 2/5 and 5/5 are relatively localized changes
in libgomp and independent of the changes to the barrier implementation.

Thanks,
Prathamesh
>
> Matthew Malcomson (5):
>   libgomp: Enforce tasks executed lexically after scheduled
>   libgomp: Ensure memory sync after performing tasks
>   libgomp: Implement "flat" barrier for linux/ target
>   libgomp: Removing one barrier in non-nested thread loop
>   libgomp: Move thread task re-initialisation into threads
>
>  libgomp/barrier.c                             |   4 +-
>  libgomp/config/gcn/bar.c                      |  53 +-
>  libgomp/config/gcn/bar.h                      |  98 ++-
>  libgomp/config/gcn/team.c                     |   2 +-
>  libgomp/config/linux/bar.c                    | 798 ++++++++++++++++--
>  libgomp/config/linux/bar.h                    | 331 +++++++-
>  libgomp/config/linux/futex_waitv.h            | 129 +++
>  libgomp/config/linux/simple-bar.h             |  66 ++
>  libgomp/config/linux/wait.h                   |  15 +-
>  libgomp/config/nvptx/bar.c                    |  36 +-
>  libgomp/config/nvptx/bar.h                    |  89 +-
>  libgomp/config/nvptx/team.c                   |   2 +-
>  libgomp/config/posix/bar.c                    |  41 +-
>  libgomp/config/posix/bar.h                    |  93 +-
>  libgomp/config/posix/pool.h                   |   1 +
>  libgomp/config/posix/simple-bar.h             |  10 +-
>  libgomp/config/rtems/bar.c                    | 185 +++-
>  libgomp/config/rtems/bar.h                    |  97 ++-
>  libgomp/libgomp.h                             |  21 +-
>  libgomp/single.c                              |   4 +-
>  libgomp/task.c                                |  66 +-
>  libgomp/team.c                                | 292 ++++++-
>  .../testsuite/libgomp.c++/task-reduction-20.C | 136 +++
>  .../testsuite/libgomp.c++/task-reduction-21.C | 140 +++
>  libgomp/testsuite/libgomp.c/pr122314.c        |  36 +
>  libgomp/testsuite/libgomp.c/pr122356.c        |  33 +
>  .../libgomp.c/primary-thread-tasking.c        |  80 ++
>  libgomp/work.c                                |  26 +-
>  28 files changed, 2614 insertions(+), 270 deletions(-)
>  create mode 100644 libgomp/config/linux/futex_waitv.h
>  create mode 100644 libgomp/config/linux/simple-bar.h
>  create mode 100644 libgomp/testsuite/libgomp.c++/task-reduction-20.C
>  create mode 100644 libgomp/testsuite/libgomp.c++/task-reduction-21.C
>  create mode 100644 libgomp/testsuite/libgomp.c/pr122314.c
>  create mode 100644 libgomp/testsuite/libgomp.c/pr122356.c
>  create mode 100644 libgomp/testsuite/libgomp.c/primary-thread-tasking.c
>
> --
> 2.43.0
