https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122356
Bug ID: 122356
Summary: Suspected memory flush problem between task execution
and barrier release
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: matmal01 at gcc dot gnu.org
CC: jakub at gcc dot gnu.org
Target Milestone: ---
I believe the current libgomp barrier implementation has a theoretical
memory-ordering problem, which the code below could in principle expose.
I have not been able to trigger a crash due to this. I am filing this report
partly so that I can refer to it in the description of a separate libgomp
patch I hope to contribute.
Code:
```
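#include <omp.h>
#include <stdlib.h>

#define NUM_THREADS 8
static unsigned full_data[NUM_THREADS];

int
main (void)
{
#pragma omp parallel num_threads(NUM_THREADS)
  {
#pragma omp for
    for (int i = 0; i < 10; i++)
#pragma omp task
      {
        /* A task may run on a thread that is already waiting in a barrier.  */
        full_data[omp_get_thread_num ()] += 1;
      }
#pragma omp barrier
    /* Every task store should be visible to every thread from here on.  */
    unsigned total = 0;
    for (int i = 0; i < NUM_THREADS; i++)
      total += full_data[i];
    if (total != 10)
      abort ();
  }
  return 0;
}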
```
The problem I suspect is that user code (in a task) may run *after* the point
in `gomp_barrier_wait_start` that implements the memory flush.
Imagine the following for the linux target:
1) Some thread `A` enters the barrier and decrements `awaited`.
2) That thread then executes a task (updating some user data after the store
that will form an acquire-release ordering).
3) The last thread `L` enters the barrier and decrements `awaited`. This forms
an acquire-release ordering between thread `A` and thread `L`, ensuring
synchronisation of the user code that ran before `A` entered the barrier, but
ensuring nothing about the stores in tasks that thread `A` performed while
waiting for other threads.
4) The last thread reads `task_count` non-atomically (and it was written
non-atomically). It sees that it is zero, then writes to `bar->generation`
with a RELEASE memory order.
5) Thread `A` eventually reads `bar->generation` with an ACQUIRE memory order.
This creates an acquire-release ordering from thread `L` to thread `A`, but
*not* the other way around.
In that scenario I don't believe there is any memory synchronisation ensuring
stores in the task that thread `A` ran are visible after the barrier in thread
`L`.
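The scenario above can be modelled outside libgomp with C11 atomics. The
sketch below is only an illustration, not libgomp code: `awaited` and
`generation` borrow their names from the report, while `user_data`,
`seen_by_l`, `thread_a`, `thread_l` and `run_barrier_model` are hypothetical.
The `task_count` check from step 4 is omitted, and the task's store is a
relaxed atomic so that the model itself has no data race.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for the barrier fields named in the report.  */
static atomic_uint awaited;     /* threads still to arrive */
static atomic_uint generation;  /* bumped by the last thread */
/* Models a store made by a task.  Relaxed atomic only so that the model
   has no data race; real user code would use a plain store.  */
static atomic_uint user_data;
static unsigned seen_by_l;      /* what L observed after the barrier */

static void *
thread_a (void *arg)
{
  (void) arg;
  /* Step 1: A arrives and decrements `awaited`.  */
  atomic_fetch_sub_explicit (&awaited, 1, memory_order_acq_rel);
  /* Step 2: A runs a task *after* that decrement.  */
  atomic_store_explicit (&user_data, 1, memory_order_relaxed);
  /* Step 5: A acquires the generation that L released.  */
  while (atomic_load_explicit (&generation, memory_order_acquire) == 0)
    ;
  return NULL;
}

static void *
thread_l (void *arg)
{
  (void) arg;
  /* Force L to be the last arriver in this model.  */
  while (atomic_load_explicit (&awaited, memory_order_relaxed) != 1)
    ;
  /* Step 3: synchronises with A's decrement, but A's task store is
     sequenced after that decrement, so it is not covered.  */
  atomic_fetch_sub_explicit (&awaited, 1, memory_order_acq_rel);
  /* Step 4: release the barrier.  */
  atomic_store_explicit (&generation, 1, memory_order_release);
  /* After the barrier: nothing orders A's task store before this load,
     so the memory model permits either 0 or 1 here.  */
  seen_by_l = atomic_load_explicit (&user_data, memory_order_relaxed);
  return NULL;
}

/* Runs the model once and returns the value L observed.  */
unsigned
run_barrier_model (void)
{
  pthread_t a, l;
  atomic_store (&awaited, 2);
  atomic_store (&generation, 0);
  atomic_store (&user_data, 0);
  pthread_create (&a, NULL, thread_a, NULL);
  pthread_create (&l, NULL, thread_l, NULL);
  pthread_join (a, NULL);
  pthread_join (l, NULL);
  return seen_by_l;
}
```

Either return value from `run_barrier_model ()` is allowed by the C11 memory
model. Observing 1 on a given machine does not show that the ordering is
guaranteed, which is consistent with the failure being hard to trigger.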
Similar reasoning applies between threads in the task scheduling loop in
`gomp_barrier_wait_end`. There are many paths through that loop that do not
enter `gomp_barrier_handle_tasks`, and without synchronising on the mutexes in
that function there is nowhere that threads can create a memory ordering after
the user code in tasks has been run.
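For contrast, the following minimal sketch shows the kind of ordering a mutex
provides when both sides do pass through it. It assumes a plain pthread mutex
as a stand-in for the lock taken in the task-handling path. `task_lock`,
`task_result`, `worker` and `wait_for_tasks` are hypothetical names, and only
`task_count` is borrowed from the report.

```c
#include <pthread.h>

/* Hypothetical stand-ins; this is not libgomp code.  */
static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned task_count;   /* outstanding tasks, guarded by task_lock */
static unsigned task_result;  /* plain (non-atomic) user data */

static void *
worker (void *arg)
{
  (void) arg;
  task_result = 42;           /* user code in the task */
  pthread_mutex_lock (&task_lock);
  task_count--;               /* publish completion under the lock */
  pthread_mutex_unlock (&task_lock);
  return NULL;
}

/* Waits until no tasks remain, then reads the task's result.  */
unsigned
wait_for_tasks (void)
{
  pthread_t t;
  task_count = 1;
  task_result = 0;
  pthread_create (&t, NULL, worker, NULL);
  for (;;)
    {
      pthread_mutex_lock (&task_lock);
      unsigned remaining = task_count;
      pthread_mutex_unlock (&task_lock);
      if (remaining == 0)
        break;
    }
  /* The worker's unlock happened before the lock that observed zero,
     so its earlier plain store to task_result is visible here.  */
  unsigned result = task_result;
  pthread_join (t, NULL);
  return result;
}
```

Here the unlock in `worker` and the later lock in `wait_for_tasks` order the
task's store before the read. A thread that leaves the barrier without taking
the lock gets no such edge, which is the gap described above.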