https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119588

--- Comment #2 from Matthew Malcomson <matmal01 at gcc dot gnu.org> ---
Created attachment 61938
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61938&action=edit
WIP patch adding a faster barrier

Hello,

I'm posting my current status and some questions.  There is much more detail
below, but the summary is:
1) I have an RFC patch for a faster barrier.  There is work left to do but
   it is performing well against LLVM on benchmarks isolating the barrier
   performance.
2) I have realised that libgomp has extra overhead, on top of the barrier
   implementation itself, that LLVM's libomp does not have.
   - Partly in initialising teams (when re-using the last team).
   - Partly in going through *two* barriers per parallel region teardown and
     re-creation, where LLVM goes through one.
3) Questions I would like to ask:
   1) Does the RFC patch look like it's going in an OK direction?
      Especially w.r.t. the need for tweaks to the API through which target
      code defines how a barrier works.
   2) Are we OK going through only one barrier per `omp parallel` region?
      - A strict reading of the OpenMP 4.5 "flush" construct seems to
        imply this is not allowed.
      - The OpenMP 5.1 "flush" construct seems to have been adjusted
        specifically to allow this.
      - I don't know of any user-observable behaviour that would differ
        between this and the strict reading of OpenMP 4.5.
   3) I would appreciate any pointers on reducing the overhead of
      re-initialising thread data structures.  Has this been looked at
      before?
      - At the moment I'm leaning towards something small like having each
        non-nested thread "clear" its `thr->task` data structure after
        the team has finished and before waiting at the
        `pool->threads_dock`.  Then `gomp_team_start` can avoid
        re-initialising it without needing to change things too much.
      - It would be very nice if there were an existing plan for more
        extensive caching of these data structures.

------------------------------
I am attempting to improve the case where a lot of threads contend on a
barrier; the case above includes a reproducer.  N.b. I've only been
adjusting the `linux` target barrier implementation.

I have tested a bunch of barrier approaches in isolation (i.e. outside of
libgomp).  A summary of what I found:
1) All threads waiting on a single "wake flag" -- i.e. the bar->generation
   increment -- seems to be clearly the fastest method of waking up
   multiple threads.
2) I tried various different approaches for the "gather" (i.e. getting to
   the point where there is one thread which knows all other threads have
   arrived).  This seems to be where LLVM has a significant advantage
   over GCC.  I tried the current LLVM approach and various other tweaks.
   I found that both of the approaches below perform generally well across
   multiple x86 and AArch64 server machines:
   1) A "flat" barrier, where the primary thread checks/waits on each of the
      other threads flags.  When they have all been set the primary thread
      knows that all follower threads have entered the barrier.
   2) A "hypercube embedded tree" barrier (the "gather" phase taken from LLVM
      that performs quite well -- also known as a "tournament").  This chooses
a
      group-size, separates all threads into groups of that size, and the
      "winner" of each group waits on the flags for the others.  Once this
      "round" has finished then the "winners" of this round are grouped
      according to the same size and pre-assigned winners of the next round
wait
      on the others.  This continues until only one thread is left.
   Each of these performed noticeably better than the current libgomp approach
   on average, and in general perform even better at a higher number of
threads.
   In this patch I've implemented approach (1) for the "gather" phase (a
   stand-alone sketch of this shape follows after this list).  I mostly chose
   this since it performed well against approach (2) and was drastically
   simpler.
   - As yet I've not tried the "dist" barrier that LLVM has as an option.
     This barrier does something very similar to (2) except that it
     determines the "group" size according to system topology.  AIUI it is
     disabled by default due to the high setup-costs and the fact that this
     cost must be paid each time a team with a different size is made.
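
To make the shape of approach (1) concrete, below is a minimal stand-alone
sketch (outside libgomp, so names such as `flat_bar_t` are mine rather than
the patch's): each thread bumps its own padded per-thread flag on arrival, the
primary spins over all of them, and the release is a single increment of the
global generation that every secondary waits on.
```
#include <stdatomic.h>

typedef struct
{
  unsigned nthreads;
  atomic_uint generation;   /* Single "wake flag" for the release phase.  */
  struct { atomic_uint gen; char pad[60]; } *threadgens; /* One per thread.  */
} flat_bar_t;

/* Every thread calls this with its pre-assigned id; id 0 is the primary.
   All counters start at 0 and stay in lock-step, one increment per barrier.  */
static void
flat_bar_wait (flat_bar_t *bar, unsigned id)
{
  unsigned gen = atomic_load_explicit (&bar->generation, memory_order_relaxed);

  /* "Gather": announce arrival on our own flag.  */
  atomic_fetch_add_explicit (&bar->threadgens[id].gen, 1, memory_order_release);

  if (id == 0)
    {
      /* Primary waits for every other thread's flag to move past GEN.  */
      for (unsigned i = 1; i < bar->nthreads; i++)
        while (atomic_load_explicit (&bar->threadgens[i].gen,
                                     memory_order_acquire) == gen)
          ;  /* A real implementation would spin politely / futex_wait.  */

      /* "Release": one store wakes everybody.  */
      atomic_store_explicit (&bar->generation, gen + 1, memory_order_release);
    }
  else
    while (atomic_load_explicit (&bar->generation, memory_order_acquire) == gen)
      ;  /* Wait for the primary to bump the global generation.  */
}
```
The padding on the per-thread flags is only there to keep each flag on its own
cache line so arriving threads do not false-share; the real patch handles this
(and the futex-based sleeping) rather differently.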

All of the faster "gather" approaches require an allocated team_id per thread
synchronising at a barrier.  This means that each of the threads waiting on a
barrier needs to have a known ID.  I have used the `team->team_id` for this
purpose.
- This means that the internal API to barriers needs to change.  It should not
  be too difficult to adjust the other target-specific barriers to also use
  this interface.
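
As a rough illustration of what this interface change might look like (the
signatures below are illustrative only; the existing entry points take just
the barrier pointer):
```
/* Current linux-target entry points (simplified):
     extern void gomp_barrier_wait (gomp_barrier_t *);
     extern void gomp_team_barrier_wait (gomp_barrier_t *);
   With a "flat" gather phase every waiter needs a stable index into the
   per-thread flags, so the entry points would grow something along the
   lines of an explicit id argument (taken from `team->team_id`):  */
extern void gomp_barrier_wait (gomp_barrier_t *, unsigned team_id);
extern void gomp_team_barrier_wait (gomp_barrier_t *, unsigned team_id);
```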

In the current libgomp code there is one place where the `team_id` of a given
thread may change while it is waiting at a barrier.  This is hence the only
place where passing a `team_id` requires any significant adjustment to the
code.  It occurs in `gomp_team_start`, and only happens when a thread
from the previous top-level team is being re-used in the new top-level team
but with a different `team_id` assigned, because the affinity requirements
have changed and this thread now needs to be at a different place in the
array of threads.
In order to account for this I have introduced a new interface
`gomp_barrier_ensure_last`, and have adjusted the place and interface for
reinitialising the barrier.
- This looks like a more major change to the internal API for barriers,
  however I could simply do one thing or the other based on the result of a
  function defined differently in different targets (similar to a target
  hook).

------------------------------
I have not yet handled tasking in the team barrier optimally.  As the code
stands, the primary thread will simply wait for all other threads to arrive
(compared to the existing codebase, where the thread with BAR_WAS_LAST is
not pre-determined but is simply whichever thread enters the barrier last).

I believe I have an approach to handle this -- though I've not yet implemented
it (honestly the problem occurred to me late).  The problem is the "wait".
I've sketched out an approach that I plan to try out (where the overhead
on the fast-path is (1) that each secondary thread tests whether a given bit is
set in its thread-local generation number, and (2) that the primary thread
spins checking both the thread-local generation number and the
global generation number).  Hopefully that will work and the performance will
still be good.
- Obviously I will have to benchmark this to see whether performance drops due
  to the change in approach.

The "wait" that the primary thread would then be doing (in order to avoid
continually spinning) would be a little tricky.  We want to futex_wake on two
variables at the same time.
- I guess we could use `futex_waitv` on newer Linux kernels (waiting on both
  the secondary next in line and the generation number).
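
For what it's worth, here is a minimal sketch of waiting on two 32-bit futex
words at once via `futex_waitv` (Linux 5.16+; there is no glibc wrapper, so
this goes through syscall(2), and the two addresses stand in for the
secondary's thread-local generation and the global generation number):
```
#define _GNU_SOURCE
#include <linux/futex.h>   /* struct futex_waitv, FUTEX_32 */
#include <sys/syscall.h>   /* SYS_futex_waitv (needs recent kernel headers) */
#include <unistd.h>
#include <stdint.h>
#include <time.h>

/* Sleep until *addr_a != val_a, *addr_b != val_b, or a futex_wake on either
   word.  Returns the index of the word that woke us, or -1 on error (e.g.
   EAGAIN if one of the values had already changed before we slept).  */
static long
wait_on_two_words (uint32_t *addr_a, uint32_t val_a,
                   uint32_t *addr_b, uint32_t val_b)
{
  struct futex_waitv waiters[2] = {
    { .val = val_a, .uaddr = (uintptr_t) addr_a, .flags = FUTEX_32 },
    { .val = val_b, .uaddr = (uintptr_t) addr_b, .flags = FUTEX_32 },
  };
  return syscall (SYS_futex_waitv, waiters, 2, 0, NULL, CLOCK_MONOTONIC);
}
```
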
On older kernels (as a fallback implementation) we might be able to do
something like:
- Primary fetch_add's a new PRIMARY_WAITING_TG bit (which can overlap with any
  of the BAR_* flags because they are not used in the thread-local generation
  numbers) to the thread-local generation number of the secondary it's waiting
  on.
  - If the primary sees that BAR_INCR was incremented then the secondary
    reached the barrier as we decided to go to sleep waiting for it; clear
    the flag and continue.
  - Otherwise primary futex_wait's on the generation number.
    - Can get woken up by a new task being added (BAR_TASK_PENDING).
    - Can get woken up by (yet another) bit flag in the global generation
      number saying "secondary you were waiting on has arrived".
      Call it BAR_SECONDARY_ARRIVED.
      - In this case it clears that bit and returns to looking to see if all
        secondary threads have arrived.
Meanwhile the secondary thread:
- Checks the value it gets back in `gomp_assert_and_increment_flag`.
  - If it has the PRIMARY_WAITING_TG flag then set BAR_SECONDARY_ARRIVED on
    the global generation number and futex_wake it.
Each secondary thread has to ignore that bit in its own wait loop (it will
still get woken up by it, and just continues around the loop until the bit is
cleared).

------------------------------
I found handling cancellable barriers to be one of the trickiest parts of
adding a new kind of barrier.  I'm posting this WIP patch with the
cancellable barrier not yet using the faster approach -- however I suspect
it is possible to use the faster barrier implementation there too (I'm just
attempting to get some directional feedback on the patch as a whole before
doing all the polishing).

As it stands, libgomp simply has two counters implementing barriers in
one `gomp_barrier_t`.  We use `awaited` the majority of the time, then use
`awaited_final` to synchronise threads when they may have cancelled a
barrier and left `awaited` in an unknown state.
For this patch I also have two different barrier data structures but the
split is different.  I have one `cancel_awaited` counter which is used to
implement a barrier in the same way as the current libgomp barrier, and I
have one `threadgens` used to implement the new kind of barrier.  In this
split `threadgens` is used for all barriers except those known to be
cancellable, and `cancel_awaited` is used for cancellable barriers.
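
Roughly, the split I'm describing corresponds to a layout along these lines
(field names follow my description above, `struct gomp_thread_gen` is a
made-up name, and the patch's actual layout may differ):
```
typedef struct
{
  unsigned total;               /* Number of threads in the team.  */

  /* Counter-based barrier in the style of the current implementation,
     now only used for barriers known to be cancellable.  */
  unsigned cancel_awaited;

  /* New-style barrier: one barrier-global generation that everyone waits
     on in the "release" phase, plus per-thread generations (indexed by
     team_id) that the primary checks in the "gather" phase.  */
  unsigned generation;
  struct gomp_thread_gen *threadgens;   /* Hypothetical per-thread flag type.  */
} gomp_barrier_t;
```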

The main problem with making cancellable threads use this faster barrier is
that if the primary thread has arrived at a barrier and is waiting for
others to arrive, then it is not waiting on the barrier-global generation
flag.  Rather it is waiting on all of the thread-local flags for the
secondary threads.
This is the same problem as we have for getting the primary thread to handle
tasks as they come in.  I would expect that any solution to the task-handling
problem would also be a solution to the cancellation problem.

Another problem is how to "increment" the global generation number.  At the
moment I use the actual values of the generation number to add assertions
(included at compile time under the `_LIBGOMP_CHECKING_` macro) on the
invariant that we have incremented our thread-local generation numbers the
same number of times as we have incremented our barrier-global generation
number.  The barrier generation number is the one place that holds
information about what is happening with the barrier.  Having this separate
for the cancellable barriers and the non-cancellable barriers seems quite
tricky because some functions like `GOMP_task` send a message to the current
team's barrier about there being a pending task without knowing whether the
barrier they're communicating with is cancellable or not.

In order to use the same generation number in memory for cancellable
barriers and non-cancellable barriers while also maintaining an invariant
that the generation number is incremented the same number of times as the
thread-local generation numbers that are only used for non-cancellable
barriers, I've encoded two different generation numbers in the one integer.
At the moment I have one bit for the cancellable barrier and "the rest" for
the non-cancellable barrier -- but I could assign a few extra bits to the
cancellable barrier.
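
As a concrete (made-up) example of such an encoding -- the low bits left for
the existing BAR_* flags, one bit for the cancellable generation, and
everything above that for the non-cancellable generation:
```
/* Illustrative only: assume the low 4 bits carry the BAR_* flags, the next
   bit the cancellable-barrier generation, and everything above that the
   non-cancellable generation.  The patch's actual bit assignment may differ.  */
#define BAR_FLAG_BITS           4
#define BAR_CANCEL_GEN_BIT      (1u << BAR_FLAG_BITS)
#define BAR_GEN_INCR            (BAR_CANCEL_GEN_BIT << 1)

/* Non-cancellable barriers bump the upper field; with _LIBGOMP_CHECKING_
   this can be asserted against the number of thread-local generation bumps.  */
static inline unsigned
bar_gen_bump (unsigned gen)
{
  return gen + BAR_GEN_INCR;
}

/* A one-bit cancellable generation can only toggle.  */
static inline unsigned
bar_cancel_gen_bump (unsigned gen)
{
  return gen ^ BAR_CANCEL_GEN_BIT;
}
```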

------------------------------
Other points:
LLVM still performs quite well relative to GNU across many iterations of
creating a parallel region, despite this new barrier performing quite well
when benchmarking just the time taken across an `omp barrier`.  It is
particularly interesting that the time taken for LLVM to create a new
region is approximately the same as the time taken for LLVM to go through a
barrier (see the attached reproducer, which measures both).  This
indicates two other areas for GNU to improve on workloads like those that
triggered this PR.

Also, it's worth mentioning that the clang runs I presented above were with
`OMP_PROC_BIND=true`, which seems to have chosen `OMP_PROC_BIND=spread` and
given much worse results at 72 cores.

Using the below bash functions for testing:
```
    vshcmd: > clang-measurements () {
    vshcmd: >   for i in 1 18 72 144; do
    vshcmd: >     echo -e "##  $i"
    vshcmd: >     OMP_NUM_THREADS=$i \
    vshcmd: >     OMP_WAIT_POLICY=active OMP_PROC_BIND=close \
    vshcmd: >     LD_LIBRARY_PATH=${clang_install_path}/lib \
    vshcmd: >         ./clang.bench.x
    vshcmd: >   done
    vshcmd: > }
    vshcmd: > do-gcc-measurements () {
    vshcmd: >   infix="${1}${1:+-}"
    vshcmd: >   for i in 1 18 72 144; do
    vshcmd: >     echo -e "##  $i"
    vshcmd: >     OMP_NUM_THREADS=$i \
    vshcmd: >     OMP_WAIT_POLICY=active OMP_PROC_BIND=close \
    vshcmd: >     LD_LIBRARY_PATH=/local$HOME/gcc-${infix}install/lib64 \
    vshcmd: >         ./gcc.bench.x
    vshcmd: >   done
    vshcmd: > }
    vshcmd: > original-gcc-measurements () {
    vshcmd: >   do-gcc-measurements
    vshcmd: > }
    vshcmd: > better-barrier-gcc-measurements () {
    vshcmd: >   do-gcc-measurements better-barrier
    vshcmd: > }
    vshcmd: > hack-no-reinit-gcc-measurements () {
    vshcmd: >   do-gcc-measurements hack-no-reinit
    vshcmd: > }
    vshcmd: > hack-half-barrier-gcc-measurements () {
    vshcmd: >   do-gcc-measurements hack-half-barrier
    vshcmd: > }
    vshcmd: > clang-dist-measurements () {
    vshcmd: >   KMP_PLAIN_BARRIER_PATTERN="dist,dist" \
    vshcmd: >   KMP_FORKJOIN_BARRIER_PATTERN="dist,dist" \
    vshcmd: >   clang-measurements
    vshcmd: > }
```

So default clang results measuring both a plain barrier and the creation of
a parallel region are:
```
    vshcmd: > clang-measurements
    ##  1
    creation maxthr:1 nthr:1
        min_time:0.075 us
        max_time:0.084 us
        avg_time:0.076 us
    barrier maxthr:1 nthr:1
        min_time:0.006 us
        max_time:0.006 us
        avg_time:0.006 us
    ##  18
    creation maxthr:18 nthr:18
        min_time:2.340 us
        max_time:2.436 us
        avg_time:2.391 us
    barrier maxthr:18 nthr:18
        min_time:1.930 us
        max_time:1.980 us
        avg_time:1.962 us
    ##  72
    creation maxthr:72 nthr:72
        min_time:3.267 us
        max_time:3.687 us
        avg_time:3.345 us
    barrier maxthr:72 nthr:72
        min_time:2.880 us
        max_time:3.158 us
        avg_time:2.953 us
    ##  144
    creation maxthr:144 nthr:144
        min_time:9.426 us
        max_time:10.007 us
        avg_time:9.637 us
    barrier maxthr:144 nthr:144
        min_time:7.984 us
        max_time:8.629 us
        avg_time:8.227 us
    lego-c2-qs-56:openmp-parallel-gomp-slow [08:11:57] $ 
```

Using the "dist" barrier clang results are:
```
    vshcmd: > clang-dist-measurements
    ##  1
    creation maxthr:1 nthr:1
        min_time:0.075 us
        max_time:0.075 us
        avg_time:0.075 us
    barrier maxthr:1 nthr:1
        min_time:0.006 us
        max_time:0.006 us
        avg_time:0.006 us
    ##  18
    creation maxthr:18 nthr:18
        min_time:1.057 us
        max_time:1.075 us
        avg_time:1.061 us
    barrier maxthr:18 nthr:18
        min_time:0.569 us
        max_time:0.576 us
        avg_time:0.570 us
    ##  72
    creation maxthr:72 nthr:72
        min_time:1.363 us
        max_time:1.586 us
        avg_time:1.411 us
    barrier maxthr:72 nthr:72
        min_time:0.705 us
        max_time:0.867 us
        avg_time:0.728 us
    ##  144
    creation maxthr:144 nthr:144
        min_time:3.793 us
        max_time:4.222 us
        avg_time:3.945 us
    barrier maxthr:144 nthr:144
        min_time:2.418 us
        max_time:2.763 us
        avg_time:2.552 us
    lego-c2-qs-56:openmp-parallel-gomp-slow [08:11:55] $ 
```

Current GCC OpenMP results (including benchmarking only a barrier this
time) are:
```
    vshcmd: > original-gcc-measurements
    ##  1
    creation maxthr:1 nthr:1
        min_time:0.241 us
        max_time:0.253 us
        avg_time:0.243 us
    barrier maxthr:1 nthr:1
        min_time:0.116 us
        max_time:0.117 us
        avg_time:0.116 us
    ##  18
    creation maxthr:18 nthr:18
        min_time:3.034 us
        max_time:3.065 us
        avg_time:3.048 us
    barrier maxthr:18 nthr:18
        min_time:1.446 us
        max_time:1.464 us
        avg_time:1.457 us
    ##  72
    creation maxthr:72 nthr:72
        min_time:9.889 us
        max_time:10.329 us
        avg_time:10.063 us
    barrier maxthr:72 nthr:72
        min_time:4.788 us
        max_time:5.018 us
        avg_time:4.879 us
    ##  144
    creation maxthr:144 nthr:144
        min_time:31.021 us
        max_time:31.452 us
        avg_time:31.292 us
    barrier maxthr:144 nthr:144
        min_time:13.833 us
        max_time:14.287 us
        avg_time:14.089 us
    lego-c2-qs-56:openmp-parallel-gomp-slow [08:13:10] $ 
```

Results with my new barrier implementation (patch attached above) are:
```
    vshcmd: > better-barrier-gcc-measurements
    ##  1
    creation maxthr:1 nthr:1
        min_time:0.283 us
        max_time:0.303 us
        avg_time:0.285 us
    barrier maxthr:1 nthr:1
        min_time:0.112 us
        max_time:0.122 us
        avg_time:0.113 us
    ##  18
    creation maxthr:18 nthr:18
        min_time:1.990 us
        max_time:2.134 us
        avg_time:2.075 us
    barrier maxthr:18 nthr:18
        min_time:0.588 us
        max_time:0.599 us
        avg_time:0.592 us
    ##  72
    creation maxthr:72 nthr:72
        min_time:5.142 us
        max_time:5.376 us
        avg_time:5.267 us
    barrier maxthr:72 nthr:72
        min_time:0.750 us
        max_time:0.807 us
        avg_time:0.770 us
    ##  144
    creation maxthr:144 nthr:144
        min_time:19.988 us
        max_time:21.928 us
        avg_time:21.183 us
    barrier maxthr:144 nthr:144
        min_time:2.212 us
        max_time:2.526 us
        avg_time:2.336 us
    lego-c2-qs-56:openmp-parallel-gomp-slow [08:13:16] $ 
```

We can see that the barrier times are now pretty good w.r.t. clang, but the
creation of a parallel region is not that great.  I've noticed two
contributions to this extra overhead of creating a parallel region compared
with a single barrier (overhead that LLVM does not have).
1) There is a re-initialisation step in `gomp_team_start` that takes a
   noticeable amount of time, especially the part where the primary thread
   re-initialises the `gomp_thread` data structure corresponding to each of
   the existing threads.  When we have 144 threads and we are re-using them
   almost as-is from the previous team, this seems like something we could
   avoid.
   - I have a patch, written just for benchmarking, that avoids
     re-initialising the `nthr->task` data structure to get an idea of the
     overheads involved.
     It is not functional except for the very specific case of this
     benchmark, but it does give an idea of how much overhead we can lose
     just by focussing on that part of the re-initialisation.
```
    vshcmd: > hack-no-reinit-gcc-measurements
    ##  1
    creation maxthr:1 nthr:1
        min_time:0.285 us
        max_time:0.323 us
        avg_time:0.291 us
    barrier maxthr:1 nthr:1
        min_time:0.112 us
        max_time:0.113 us
        avg_time:0.113 us
    ##  18
    creation maxthr:18 nthr:18
        min_time:1.510 us
        max_time:1.575 us
        avg_time:1.540 us
    barrier maxthr:18 nthr:18
        min_time:0.483 us
        max_time:0.490 us
        avg_time:0.485 us
    ##  72
    creation maxthr:72 nthr:72
        min_time:3.535 us
        max_time:3.802 us
        avg_time:3.609 us
    barrier maxthr:72 nthr:72
        min_time:0.793 us
        max_time:1.066 us
        avg_time:0.852 us
    ##  144
    creation maxthr:144 nthr:144
        min_time:13.290 us
        max_time:14.463 us
        avg_time:14.091 us
    barrier maxthr:144 nthr:144
        min_time:2.526 us
        max_time:2.778 us
        avg_time:2.625 us
    lego-c2-qs-56:openmp-parallel-gomp-slow [08:13:19] $ 
```

2) Half-barriers
   The loop that top-level threads currently run can be seen in
   `gomp_thread_start`.  For each `pragma omp parallel` this goes through two
   barriers: one to ensure that all tasks have completed and another to
   hold threads until there is new work to be performed.
   LLVM only has one barrier in the loop (see the loop-shape sketch after
   this list).  That barrier is split into fork and join halves (i.e. the
   "gather" and "release" halves that I reference above).
   This difference introduces some performance overhead between the libgomp
   implementation and the LLVM OpenMP implementation on this particular
   benchmark (predictably, overhead of about the amount it takes to go through
   one barrier).
   I believe we could take the LLVM approach, though I haven't yet sorted out
   exactly how that would work out.
   - I believe that having this optimisation is compatible with the OpenMP 5
     standard requirements for an implicit release flush on the primary thread
     synchronising with an implicit acquire flush on all other threads on entry
     to a parallel region.
     https://www.openmp.org/spec-html/5.1/openmpsu106.html#x139-1490002.19.8
   - I believe that it is not compatible with a strict reading of the
     OpenMP 4.5 standard requirements for an implicit flush barrier at
     entry to and exit from a parallel region.
     https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf P165.  I
     have the impression from the value of the `_OPENMP` macro
     that we advertise that this is the standard we currently implement
     -- is that correct?
   - I don't actually know of any observable behaviour that would differ
     between the OpenMP 4.5 standard and a half-barrier approach.
     If the "gather" phase has acquire-release synchronisation from the
     secondary threads to the master thread on exit from a parallel region,
     then everything observable in those secondary threads will be
     observable in the master thread, and the secondary threads are not
     running so there is no problem.  In the "release" phase at entry to
     a new parallel region there would be release-acquire
     synchronisation from the master thread to the secondary threads, and
     all changes in the master thread would be visible in the secondary
     threads.
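
To make the loop-shape difference concrete, the first loop below is a heavy
simplification of what `gomp_thread_start` does today, and the second is the
LLVM-style shape; `run_region_work`, `barrier_join_half` and
`barrier_fork_half` are purely illustrative names:
```
/* Today: two full barriers per `#pragma omp parallel` (heavily simplified
   from gomp_thread_start).  */
for (;;)
  {
    run_region_work ();                             /* fn (data) for this region.  */
    gomp_team_barrier_wait_final (&team->barrier);  /* All tasks finished.  */
    gomp_simple_barrier_wait (&pool->threads_dock); /* Wait for the next region.  */
  }

/* LLVM-style alternative: one barrier per region, split into two halves.  */
for (;;)
  {
    run_region_work ();
    barrier_join_half (bar, id);   /* Secondaries release, primary acquires:
                                      the region's results are visible to the
                                      primary before the serial part runs.  */
    barrier_fork_half (bar, id);   /* Primary releases, secondaries acquire:
                                      the next region's setup is visible to
                                      every secondary before it starts.  */
  }
```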
