https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122356

            Bug ID: 122356
           Summary: Suspected memory flush problem between task execution
                    and barrier release
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: matmal01 at gcc dot gnu.org
                CC: jakub at gcc dot gnu.org
  Target Milestone: ---

In the current libgomp barrier implementation I believe the below code has a
theoretical problem around memory ordering.

I have not been able to trigger a crash due to this.  I am writing this
partially in order to refer to it in descriptions of a separate patch for
libgomp I hope to contribute.

Code:

```
#pragma omp parallel num_threads(8)
  {
#pragma omp for
    for (int i = 0; i < 10; i++)
#pragma omp task
      {
        full_data[omp_get_thread_num ()] += 1;
      }
#pragma omp barrier

  unsigned total = 0;
  for (int i = 0; i < NUM_THREADS; i++)
    total += full_data[i];

  if (total != 10)
    abort ();
  }
```

What I suspect is a problem is that user code may run *after* the
implementation of the memory flush in `gomp_barrier_wait_start`.

Imagine the following for the linux target:
1) Some thread `A` enters the barrier and decrements `awaited`.
2) That thread then executes a task (updating some user data after the store
that will form an acquire-release ordering).
3) The last thread `L` enters the barrier and decrements `awaited`
(acquire-release formed between thread `A` and itself `L` ensuring
syncronisation of user code before the barrier was entered -- but not ensuring
anything about the stores in the user code in tasks that thread `A` performed
while waiting for other threads).
4) The last thread reads `task_count` non-atomically (and it was written
non-atomically).  Sees that it is zero, then writes to `bar->generation` with a
RELEASE memory order.
5) Thread `A` eventually reads `bar->generation` with an ACQUIRE memory order. 
Creating an acquire-release ordering from the thread `L` to thread `A` -- but
*not* the other way around.

In that scenario I don't believe there is any memory synchronisation ensuring
stores in the task that thread `A` ran are visible after the barrier in thread
`L`.

There is a similar reasoning between threads in the task scheduling loop in
`gomp_barrier_wait_end`.  There are many cases by which tasks may not enter
`gomp_barrier_handle_tasks` and without synchronising on the mutexes in that
function there is nowhere that threads can create a memory ordering after the
user code in tasks has been run.

Reply via email to