https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119588
Matthew Malcomson <matmal01 at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matmal01 at gcc dot gnu.org

--- Comment #2 from Matthew Malcomson <matmal01 at gcc dot gnu.org> ---
Created attachment 61938
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=61938&action=edit
WIP patch adding a faster barrier

Hello, I'm posting my current status and some questions.

Much more detail below, but the summary is:
1) I have an RFC patch for a faster barrier.  There is work left to do, but it
   is performing well against LLVM on benchmarks that isolate the barrier
   performance.
2) I have realised that libgomp has extra overhead on top of the barrier
   implementation that LLVM's libomp does not:
   - Part in initialising teams (when re-using the last team).
   - Part in going through *two* barriers per parallel region teardown and
     re-creation, while LLVM goes through one.
3) Questions I would like to ask:
   1) Does the RFC patch look like it's going in an OK direction?  Especially
      w.r.t. the need for tweaks in the API to target code defining how a
      barrier works.
   2) Are we OK going through only one barrier per `omp parallel` region?
      - A strict reading of the OpenMP 4.5 "flush" construct seems to imply
        this is not allowed.
      - The OpenMP 5.1 "flush" construct seems to have been adjusted
        specifically to allow this.
      - I don't know of any user-observable behaviour that would differ
        between this and the strict reading of OpenMP 4.5.
   3) I would appreciate any pointers on reducing the overhead of
      re-initialising thread data structures.  Has this been looked at before?
      - At the moment I'm leaning towards something small like having each
        non-nested thread "clear" its `thr->task` data structure after the
        team has finished and before waiting at the `pool->threads_dock`.
        Then `gomp_team_start` can avoid re-initialising it, without needing
        to change things too much.
      - It would be very nice if there were already a plan for more extensive
        caching of these data structures.

------------------------------

Attempting to improve the case where there are a lot of threads contending on
a barrier (the case shown in the reproducer above).  N.b. I've only been
adjusting the `linux` target barrier implementation.

I have tested a bunch of barrier approaches in isolation (i.e. outside of
libgomp).  Summary of what I found:
1) All threads waiting on a single "wake flag" -- i.e. the bar->generation
   increment -- seems to be clearly the fastest method of waking up multiple
   threads.
2) I tried various approaches for the "gather" phase (i.e. getting to the
   point where one thread knows that all other threads have arrived).  This
   seems to be where LLVM has a significant advantage over GCC.  I tried the
   current LLVM approach and various other tweaks, and found that both of the
   below perform generally well across multiple x86 and AArch64 server
   machines (a minimal sketch of the first follows this list):
   1) A "flat" barrier, where the primary thread checks/waits on each of the
      other threads' flags.  When they have all been set, the primary thread
      knows that all follower threads have entered the barrier.
   2) A "hypercube-embedded tree" barrier (the "gather" phase taken from LLVM
      that performs quite well -- also known as a "tournament" barrier).
      This chooses a group size, separates all threads into groups of that
      size, and the "winner" of each group waits on the flags of the others.
      Once this "round" has finished, the "winners" of this round are grouped
      according to the same size and pre-assigned winners of the next round
      wait on the others.  This continues until only one thread is left.
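To make that first scheme concrete, here is a minimal stand-alone sketch of a
flat "gather" plus single-flag "release".  This is not the code from the
attached patch: the names (`flat_bar_t`, `flat_bar_init`, `flat_bar_wait`,
`threadgen`) are made up for illustration, it spins instead of using futexes,
and tasking/cancellation are ignored entirely.

```
#include <stdatomic.h>
#include <stdlib.h>

typedef struct
{
  unsigned nthreads;
  /* One generation flag per thread, written only by its owner.  A real
     implementation would pad these onto separate cache lines.  */
  _Atomic unsigned *threadgen;
  /* The single barrier-global generation everyone waits on for "release"
     (point (1) above: one wake flag is the fastest way to wake everyone).  */
  _Atomic unsigned gen;
} flat_bar_t;

static void
flat_bar_init (flat_bar_t *bar, unsigned nthreads)
{
  bar->nthreads = nthreads;
  /* Zero-initialised flags; generation counting starts at 0.  */
  bar->threadgen = calloc (nthreads, sizeof *bar->threadgen);
  atomic_init (&bar->gen, 0);
}

/* ID 0 plays the role of the primary thread (the patch uses team_id).  */
static void
flat_bar_wait (flat_bar_t *bar, unsigned id)
{
  unsigned gen = atomic_load_explicit (&bar->gen, memory_order_relaxed);

  /* Gather: announce our arrival on our own flag.  */
  atomic_store_explicit (&bar->threadgen[id], gen + 1, memory_order_release);

  if (id == 0)
    {
      /* Primary checks/waits on each of the other threads' flags...  */
      for (unsigned i = 1; i < bar->nthreads; i++)
        while (atomic_load_explicit (&bar->threadgen[i], memory_order_acquire)
               != gen + 1)
          ; /* spin; a real implementation would eventually futex_wait */
      /* ...then releases everyone by bumping the global generation.  */
      atomic_store_explicit (&bar->gen, gen + 1, memory_order_release);
    }
  else
    /* Release: secondaries all wait on the one global generation.  */
    while (atomic_load_explicit (&bar->gen, memory_order_acquire) != gen + 1)
      ; /* spin / futex_wait on &bar->gen in a real implementation */
}
```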
Both of those approaches performed noticeably better than the current libgomp
approach on average, and in general the advantage grows with the number of
threads.

In this patch I've implemented approach (1) for the "gather" phase.  I mostly
chose it because it performed well against approach (2) and is drastically
simpler.
- As yet I've not tried the "dist" barrier that LLVM has as an option.  That
  barrier does something very similar to (2) except that it determines the
  "group" size according to system topology.  AIUI it is disabled by default
  due to the high setup cost and the fact that this cost must be paid each
  time a team with a different size is made.

All of the faster "gather" approaches require an allocated ID per thread
synchronising at a barrier, i.e. each of the threads waiting on a barrier
needs a known ID.  I have used the `team->team_id` for this purpose.
- This means that the internal API to barriers needs to change.  It should not
  be too difficult to adjust the other target-specific barriers to also use
  this interface.

In the current libgomp code there is one place where the `team_id` of a given
thread may change while it is waiting at a barrier, and it is hence the only
place where passing a `team_id` requires any significant adjustment to the
code.  It occurs in `gomp_team_start`, and only when a thread from the past
top-level team is being re-used in the new top-level team but with a different
`team_id`, because the affinity requirements have changed and this thread now
needs to be at a different place in the array of threads.  To account for this
I have introduced a new interface, `gomp_barrier_ensure_last`, and have
adjusted the place where (and the interface through which) the barrier is
reinitialised.
- This looks like a more major change to the internal API for barriers;
  however, I could simply do one thing or the other based on the result of a
  function defined differently in different targets (similar to a target
  hook).

------------------------------

I have not yet handled tasking in the team barrier optimally.  As the code
stands, the primary thread will simply wait for all other threads to arrive
(compared to the existing codebase, where the thread with BAR_WAS_LAST is not
pre-determined but is simply whichever thread enters the barrier last).

I believe I have an approach to handle this -- though I've not yet implemented
it (honestly, the problem occurred to me late).  The problem is the "wait".
I've sketched out an approach that I plan to try, where the overhead on the
fast path is (1) each secondary thread testing whether a given bit is set in
its thread-local generation number, and (2) the primary thread spinning on
both the thread-local generation number and the global generation number.
Hopefully that will work and the performance will still be good.
- Obviously I will have to benchmark this to see whether performance drops due
  to the change in approach.

The "wait" that the primary thread would then be doing (in order to avoid
continually spinning) is a little tricky: we want to futex_wait on two
variables at the same time.
- I guess we could use `futex_waitv` on newer Linux kernels (waiting on both
  the next secondary in line and the generation number).
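To make that option concrete, a rough sketch of waiting on two words at once.
This assumes a Linux 5.16+ kernel and goes through syscall(2) since, as far as
I know, there is no libc wrapper; the function and parameter names are
placeholders rather than anything in libgomp, and the per-waiter flags reflect
my understanding of the uAPI.

```
#include <linux/futex.h>      /* struct futex_waitv, FUTEX_32 */
#include <sys/syscall.h>      /* SYS_futex_waitv (Linux >= 5.16) */
#include <unistd.h>
#include <stdint.h>
#include <stddef.h>
#include <time.h>

/* Sleep until either *secondary_gen or *global_gen no longer holds the value
   we last observed.  Should return the index of the woken entry (0 or 1), or
   -1 with errno set (EAGAIN if one of the words had already changed).  */
static long
wait_on_two_words (uint32_t *secondary_gen, uint32_t seen_secondary,
                   uint32_t *global_gen, uint32_t seen_global)
{
  struct futex_waitv waiters[2] = {
    { .val = seen_secondary, .uaddr = (uintptr_t) secondary_gen,
      .flags = FUTEX_32 | FUTEX_PRIVATE_FLAG },
    { .val = seen_global, .uaddr = (uintptr_t) global_gen,
      .flags = FUTEX_32 | FUTEX_PRIVATE_FLAG },
  };
  /* No timeout; the clockid argument is ignored when timeout is NULL.  */
  return syscall (SYS_futex_waitv, waiters, 2, 0, NULL, CLOCK_MONOTONIC);
}
```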
On older kernels (as a fallback implementation) we might be able to do
something like the following (a rough code sketch of this dance follows the
list):
- The primary fetch_add's a new PRIMARY_WAITING_TG bit (which can overlap with
  any of the BAR_* flags, because those are not used in the thread-local
  generation numbers) onto the thread-local generation number of the secondary
  it's waiting on.
- If the primary sees that BAR_INCR was incremented, then the secondary
  reached the barrier just as we decided to go to sleep waiting for it: clear
  the flag and continue.
- Otherwise the primary futex_wait's on the generation number.
  - It can get woken up by a new task being added (BAR_TASK_PENDING).
  - It can get woken up by (yet another) bit flag in the global generation
    number saying "the secondary you were waiting on has arrived".  Call it
    BAR_SECONDARY_ARRIVED.
    - In this case it clears that bit and returns to checking whether all
      secondary threads have arrived.
- Meanwhile the secondary thread:
  - Checks the value it gets back in `gomp_assert_and_increment_flag`.
  - If it has the PRIMARY_WAITING_TG flag, sets BAR_SECONDARY_ARRIVED on the
    global generation number and futex_wake's it.
- Each secondary thread has to ignore the BAR_SECONDARY_ARRIVED bit in its
  wait loop (it will still get woken up by it, and just goes around the loop
  again until the bit is cleared).
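Purely to make that dance easier to follow, a rough sketch of the two sides.
None of this is implemented anywhere: PRIMARY_WAITING_TG and
BAR_SECONDARY_ARRIVED exist only in the sketch above, the flag values below
are placeholders (not libgomp's real BAR_* values), futex_wait/futex_wake
stand in for libgomp's wait.h helpers, and task execution, cancellation and
clean-up of a leftover PRIMARY_WAITING_TG bit are all omitted.

```
#include <limits.h>

#define BAR_TASK_PENDING       1u   /* placeholder values only              */
#define BAR_INCR               8u
#define BAR_SECONDARY_ARRIVED  16u  /* hypothetical, global generation      */
#define PRIMARY_WAITING_TG     1u   /* hypothetical, thread-local gens; may
                                       reuse a BAR_* value since those never
                                       appear in the thread-local numbers   */

extern void futex_wait (unsigned *addr, unsigned val);  /* stand-ins for     */
extern void futex_wake (unsigned *addr, int count);     /* wait.h helpers    */

/* Primary side: secondary `i' has not arrived yet and we want to stop
   spinning.  arrival_gen is the value threadgen_i will hold once that
   secondary has arrived (its current value plus BAR_INCR).  */
static void
primary_sleep_until_event (unsigned *threadgen_i, unsigned arrival_gen,
                           unsigned *global_gen)
{
  /* Advertise that we are going to sleep on this secondary.  */
  unsigned old = __atomic_fetch_add (threadgen_i, PRIMARY_WAITING_TG,
                                     __ATOMIC_ACQ_REL);
  if (old >= arrival_gen)
    {
      /* The secondary arrived just as we decided to sleep: take the flag
         back out and carry on scanning.  */
      __atomic_fetch_sub (threadgen_i, PRIMARY_WAITING_TG, __ATOMIC_RELAXED);
      return;
    }

  for (;;)
    {
      unsigned gen = __atomic_load_n (global_gen, __ATOMIC_ACQUIRE);
      if (gen & BAR_SECONDARY_ARRIVED)
        {
          /* Woken by the secondary: clear the bit and resume scanning.  */
          __atomic_fetch_and (global_gen, ~BAR_SECONDARY_ARRIVED,
                              __ATOMIC_RELAXED);
          return;
        }
      if (gen & BAR_TASK_PENDING)
        return;                 /* woken because GOMP_task added work */
      futex_wait (global_gen, gen);
    }
}

/* Secondary side, on arrival (the check would live in something like
   gomp_assert_and_increment_flag): if the primary said it went to sleep on
   us, poke the global generation number and wake it.  Other secondaries
   simply ignore BAR_SECONDARY_ARRIVED in their own wait loops.  */
static void
secondary_note_arrival (unsigned *threadgen_i, unsigned *global_gen)
{
  unsigned old = __atomic_fetch_add (threadgen_i, BAR_INCR, __ATOMIC_ACQ_REL);
  if (old & PRIMARY_WAITING_TG)
    {
      __atomic_fetch_or (global_gen, BAR_SECONDARY_ARRIVED, __ATOMIC_RELEASE);
      futex_wake (global_gen, INT_MAX);
    }
}
```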
------------------------------

I found handling cancellable barriers to be one of the trickiest parts of
adding a new kind of barrier.  I'm posting this WIP patch without a
cancellable barrier using the faster barrier approach -- however, I suspect
the cancellable barrier could use that faster implementation too (I'm just
attempting to get some directional feedback on the patch as a whole before
doing all the polishing).

As it stands, libgomp simply has two counters implementing barriers in one
`gomp_barrier_t`.  We use `awaited` the majority of the time, then use
`awaited_final` to synchronise threads when they may have cancelled a barrier
and left `awaited` in an unknown state.

For this patch I also have two different barrier data structures, but the
split is different.  I have one `cancel_awaited` counter, used to implement a
barrier in the same way as the current libgomp barrier, and one `threadgens`,
used to implement the new kind of barrier.  In this split, `threadgens` is
used for all barriers except those known to be cancellable, and
`cancel_awaited` is used for cancellable barriers.

The main problem with making cancellable barriers use the faster approach is
that when the primary thread has arrived at a barrier and is waiting for the
others to arrive, it is not waiting on the barrier-global generation flag --
rather it is waiting on all of the thread-local flags of the secondary
threads.  This is the same problem as getting the primary thread to handle
tasks as they come in, so I would expect any solution to the task-handling
problem to also be a solution to the cancellation problem.

Another problem is how to "increment" the global generation number.  At the
moment I use the actual values of the generation number to add assertions
(included at compile time via the `_LIBGOMP_CHECKING_` macro) on the invariant
that we have incremented our thread-local generation numbers the same number
of times as we have incremented our barrier-global generation number.  The
barrier generation number is the one place that holds information about what
is happening with the barrier.  Keeping it separate for cancellable and
non-cancellable barriers seems quite tricky, because some functions like
`GOMP_task` send a message to the current team's barrier about there being a
pending task without knowing whether the barrier they're communicating with is
cancellable or not.

In order to use the same generation number in memory for both cancellable and
non-cancellable barriers, while also maintaining the invariant that the
generation number is incremented the same number of times as the thread-local
generation numbers (which are only used for non-cancellable barriers), I've
encoded two different generation numbers in the one integer.  At the moment I
have one bit for the cancellable barrier and "the rest" for the
non-cancellable barrier -- but I could assign a few extra bits to the
cancellable barrier.
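As an illustration of that encoding (the actual bit assignment and flag layout
in the patch may well differ -- the masks below are placeholders, and
wrap-around of the non-cancellable part is ignored): the split lets the
cancellable barrier step its own generation without disturbing the count that
the `_LIBGOMP_CHECKING_` assertions on the non-cancellable barrier rely on.

```
/* Illustrative packing of two generation numbers into one word.  */
#define CANCEL_GEN_BIT  (1u << 31)
#define PLAIN_GEN_MASK  (~CANCEL_GEN_BIT)

/* Non-cancellable barriers step their part of the generation by `incr'
   (e.g. a BAR_INCR-style step), leaving the cancellable bit untouched...  */
static inline unsigned
bump_plain_generation (unsigned *gen, unsigned incr)
{
  return __atomic_add_fetch (gen, incr, __ATOMIC_ACQ_REL) & PLAIN_GEN_MASK;
}

/* ...while cancellable barriers just toggle their single bit, so the
   invariant "plain generation == sum of thread-local increments" still
   holds for the non-cancellable side.  */
static inline void
toggle_cancel_generation (unsigned *gen)
{
  __atomic_xor_fetch (gen, CANCEL_GEN_BIT, __ATOMIC_ACQ_REL);
}
```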
------------------------------

Other points:

LLVM still performs quite well w.r.t. GNU across many iterations of creating a
parallel region, despite this new barrier performing quite well when
benchmarking just the time taken across an `omp barrier`.  It is particularly
interesting that the time LLVM takes to create a new region is approximately
the same as the time LLVM takes to go through a barrier (see the attached
reproducer with both things measured).  This indicates two other areas for GNU
to improve on workloads like those that triggered this PR.

Also worth mentioning: the clang runs I presented above were with
`OMP_PROC_BIND=true`, which seems to have been choosing `OMP_PROC_BIND=spread`
and giving much worse results at 72 cores.

Using the below bash functions for testing:
```
vshcmd: > clang-measurements () {
vshcmd: > for i in 1 18 72 144; do
vshcmd: > echo -e "## $i"
vshcmd: > OMP_NUM_THREADS=$i \
vshcmd: > OMP_WAIT_POLICY=active OMP_PROC_BIND=close \
vshcmd: > LD_LIBRARY_PATH=${clang_install_path}/lib \
vshcmd: > ./clang.bench.x
vshcmd: > done
vshcmd: > }
vshcmd: > do-gcc-measurements () {
vshcmd: > infix="${1}${1:+-}"
vshcmd: > for i in 1 18 72 144; do
vshcmd: > echo -e "## $i"
vshcmd: > OMP_NUM_THREADS=$i \
vshcmd: > OMP_WAIT_POLICY=active OMP_PROC_BIND=close \
vshcmd: > LD_LIBRARY_PATH=/local$HOME/gcc-${infix}install/lib64 \
vshcmd: > ./gcc.bench.x
vshcmd: > done
vshcmd: > }
vshcmd: > original-gcc-measurements () {
vshcmd: > do-gcc-measurements
vshcmd: > }
vshcmd: > better-barrier-gcc-measurements () {
vshcmd: > do-gcc-measurements better-barrier
vshcmd: > }
vshcmd: > hack-no-reinit-gcc-measurements () {
vshcmd: > do-gcc-measurements hack-no-reinit
vshcmd: > }
vshcmd: > hack-half-barrier-gcc-measurements () {
vshcmd: > do-gcc-measurements hack-half-barrier
vshcmd: > }
vshcmd: > clang-dist-measurements () {
vshcmd: > KMP_PLAIN_BARRIER_PATTERN="dist,dist" \
vshcmd: > KMP_FORKJOIN_BARRIER_PATTERN="dist,dist" \
vshcmd: > clang-measurements
vshcmd: > }
```

So default clang results, measuring both a plain barrier and the creation of a
parallel region, are:
```
vshcmd: > clang-measurements
## 1
creation maxthr:1 nthr:1 min_time:0.075 us max_time:0.084 us avg_time:0.076 us
barrier maxthr:1 nthr:1 min_time:0.006 us max_time:0.006 us avg_time:0.006 us
## 18
creation maxthr:18 nthr:18 min_time:2.340 us max_time:2.436 us avg_time:2.391 us
barrier maxthr:18 nthr:18 min_time:1.930 us max_time:1.980 us avg_time:1.962 us
## 72
creation maxthr:72 nthr:72 min_time:3.267 us max_time:3.687 us avg_time:3.345 us
barrier maxthr:72 nthr:72 min_time:2.880 us max_time:3.158 us avg_time:2.953 us
## 144
creation maxthr:144 nthr:144 min_time:9.426 us max_time:10.007 us avg_time:9.637 us
barrier maxthr:144 nthr:144 min_time:7.984 us max_time:8.629 us avg_time:8.227 us
lego-c2-qs-56:openmp-parallel-gomp-slow [08:11:57] $
```

Using the "dist" barrier, the clang results are:
```
vshcmd: > clang-dist-measurements
## 1
creation maxthr:1 nthr:1 min_time:0.075 us max_time:0.075 us avg_time:0.075 us
barrier maxthr:1 nthr:1 min_time:0.006 us max_time:0.006 us avg_time:0.006 us
## 18
creation maxthr:18 nthr:18 min_time:1.057 us max_time:1.075 us avg_time:1.061 us
barrier maxthr:18 nthr:18 min_time:0.569 us max_time:0.576 us avg_time:0.570 us
## 72
creation maxthr:72 nthr:72 min_time:1.363 us max_time:1.586 us avg_time:1.411 us
barrier maxthr:72 nthr:72 min_time:0.705 us max_time:0.867 us avg_time:0.728 us
## 144
creation maxthr:144 nthr:144 min_time:3.793 us max_time:4.222 us avg_time:3.945 us
barrier maxthr:144 nthr:144 min_time:2.418 us max_time:2.763 us avg_time:2.552 us
lego-c2-qs-56:openmp-parallel-gomp-slow [08:11:55] $
```

Current GCC OpenMP results (this time also including a benchmark of just a
barrier) are:
```
vshcmd: > original-gcc-measurements
## 1
creation maxthr:1 nthr:1 min_time:0.241 us max_time:0.253 us avg_time:0.243 us
barrier maxthr:1 nthr:1 min_time:0.116 us max_time:0.117 us avg_time:0.116 us
## 18
creation maxthr:18 nthr:18 min_time:3.034 us max_time:3.065 us avg_time:3.048 us
barrier maxthr:18 nthr:18 min_time:1.446 us max_time:1.464 us avg_time:1.457 us
## 72
creation maxthr:72 nthr:72 min_time:9.889 us max_time:10.329 us avg_time:10.063 us
barrier maxthr:72 nthr:72 min_time:4.788 us max_time:5.018 us avg_time:4.879 us
## 144
creation maxthr:144 nthr:144 min_time:31.021 us max_time:31.452 us avg_time:31.292 us
barrier maxthr:144 nthr:144 min_time:13.833 us max_time:14.287 us avg_time:14.089 us
lego-c2-qs-56:openmp-parallel-gomp-slow [08:13:10] $
```

Results with my new barrier implementation (patch attached above) are:
```
vshcmd: > better-barrier-gcc-measurements
## 1
creation maxthr:1 nthr:1 min_time:0.283 us max_time:0.303 us avg_time:0.285 us
barrier maxthr:1 nthr:1 min_time:0.112 us max_time:0.122 us avg_time:0.113 us
## 18
creation maxthr:18 nthr:18 min_time:1.990 us max_time:2.134 us avg_time:2.075 us
barrier maxthr:18 nthr:18 min_time:0.588 us max_time:0.599 us avg_time:0.592 us
## 72
creation maxthr:72 nthr:72 min_time:5.142 us max_time:5.376 us avg_time:5.267 us
barrier maxthr:72 nthr:72 min_time:0.750 us max_time:0.807 us avg_time:0.770 us
## 144
creation maxthr:144 nthr:144 min_time:19.988 us max_time:21.928 us avg_time:21.183 us
barrier maxthr:144 nthr:144 min_time:2.212 us max_time:2.526 us avg_time:2.336 us
lego-c2-qs-56:openmp-parallel-gomp-slow [08:13:16] $
```

We can see that the barrier times are now pretty good w.r.t. clang; however,
the creation of a parallel region is still not that great.  I've noticed two
contributions to this extra overhead of creating a parallel region compared to
a single barrier (overhead that LLVM does not have):

1) There is a re-initialisation step in `gomp_team_start` that takes a
   reasonable amount of time -- especially the part where the primary thread
   re-initialises the `gomp_thread` data structure corresponding to each of
   the existing threads.  When we have 144 threads and we are re-using them
   almost as-is from the previous team, this seems like something we could
   avoid.
   - I have a patch, written just for benchmarking, that avoids
     re-initialising the `nthr->task` data structure, in order to get an idea
     of the overheads involved.  It is not functional except for the very
     specific case of this benchmark, but it does give an idea of how much
     overhead we can lose just by focusing on that part of the
     re-initialisation.
```
vshcmd: > hack-no-reinit-gcc-measurements
## 1
creation maxthr:1 nthr:1 min_time:0.285 us max_time:0.323 us avg_time:0.291 us
barrier maxthr:1 nthr:1 min_time:0.112 us max_time:0.113 us avg_time:0.113 us
## 18
creation maxthr:18 nthr:18 min_time:1.510 us max_time:1.575 us avg_time:1.540 us
barrier maxthr:18 nthr:18 min_time:0.483 us max_time:0.490 us avg_time:0.485 us
## 72
creation maxthr:72 nthr:72 min_time:3.535 us max_time:3.802 us avg_time:3.609 us
barrier maxthr:72 nthr:72 min_time:0.793 us max_time:1.066 us avg_time:0.852 us
## 144
creation maxthr:144 nthr:144 min_time:13.290 us max_time:14.463 us avg_time:14.091 us
barrier maxthr:144 nthr:144 min_time:2.526 us max_time:2.778 us avg_time:2.625 us
lego-c2-qs-56:openmp-parallel-gomp-slow [08:13:19] $
```
2) Half-barriers.
   The loop that top-level threads run can be seen in `gomp_thread_start`.
   For each `#pragma omp parallel` this goes through two barriers: one to
   ensure that all tasks have completed, and another to hold threads until
   there is new work to be performed.  LLVM only has one barrier in its loop;
   that barrier is split into fork and join halves (i.e. the "release" and
   "gather" halves that I reference above).  This difference introduces some
   performance overhead between the libgomp implementation and the LLVM
   OpenMP implementation on this particular benchmark (predictably, overhead
   of about the amount it takes to go through one barrier).  I believe we
   could take the LLVM approach, though I haven't yet worked out exactly how
   it would fit in (a stand-alone sketch of the shape is at the end of this
   comment).
   - I believe this optimisation is compatible with the OpenMP 5 standard's
     requirement of an implicit release flush on the primary thread
     synchronising with an implicit acquire flush on all other threads on
     entry to a parallel region.
     https://www.openmp.org/spec-html/5.1/openmpsu106.html#x139-1490002.19.8
   - I believe it is not compatible with a strict reading of the OpenMP 4.5
     standard's requirement of an implicit flush barrier on entry to and exit
     from a parallel region
     (https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf P165).  I have
     the impression from the value of the `_OPENMP` macro that this is the
     standard we advertise we currently implement -- is that correct?
   - I don't actually know of any observable behaviour that would differ
     between the strict OpenMP 4.5 reading and a half-barrier approach.  If
     the "gather" phase has acquire-release synchronisation from the
     secondary threads to the primary thread on exit from a parallel region,
     then everything observable in those secondary threads will be observable
     in the primary thread, and the secondary threads are not running, so
     there's no problem.  In the "release" phase on entry to a new parallel
     region there would be release-acquire synchronisation from the primary
     thread to the secondary threads, and all changes in the primary thread
     would be visible in the secondary threads.
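To illustrate the shape (not the eventual patch): a stand-alone sketch of a
worker loop that goes through one split barrier per region -- the join
("gather") half when leaving a region, then straight into the fork ("release")
half while waiting for the next one -- instead of the two full barriers
currently in `gomp_thread_start`.  All names here are made up, the barrier is
assumed zero-initialised, it spins instead of sleeping, and tasking, affinity
changes and team teardown are ignored.

```
#include <stdatomic.h>

struct split_bar
{
  unsigned nthreads;
  _Atomic unsigned arrived;  /* join half: how many workers finished the region */
  _Atomic unsigned region;   /* fork half: bumped by the primary to start one   */
};

/* Worker loop: one fork half plus one join half per parallel region.  */
static void
worker_loop (struct split_bar *bar, void (*run_region) (void))
{
  unsigned seen = atomic_load_explicit (&bar->region, memory_order_relaxed);
  for (;;)
    {
      /* Fork/"release" half: wait for the primary to publish new work.  */
      while (atomic_load_explicit (&bar->region, memory_order_acquire) == seen)
        ;                       /* spin; futex_wait in a real runtime */
      seen++;
      run_region ();
      /* Join/"gather" half: announce completion and go straight back to
         waiting for the next region -- no second full barrier.  */
      atomic_fetch_add_explicit (&bar->arrived, 1, memory_order_release);
    }
}

/* Primary side of one region: fork, do its share, then only the join half.  */
static void
primary_region (struct split_bar *bar, void (*run_region) (void))
{
  atomic_fetch_add_explicit (&bar->region, 1, memory_order_release);
  run_region ();
  while (atomic_load_explicit (&bar->arrived, memory_order_acquire)
         != bar->nthreads - 1)
    ;                           /* spin; the flat gather above would fit here */
  atomic_store_explicit (&bar->arrived, 0, memory_order_relaxed);
}
```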