https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121927

            Bug ID: 121927
           Summary: [gcn][OpenMP] "Asynchronous queue error" with 'omp
                    parallel if(0) ordered' + 'omp ordered'
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Keywords: openmp, wrong-code
          Severity: normal
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: burnus at gcc dot gnu.org
                CC: ams at gcc dot gnu.org, jakub at gcc dot gnu.org
  Target Milestone: ---
            Target: gcn

Created attachment 62374
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62374&action=edit
Reduced testcase - compile with -fopenmp

This occurs with OMPTests, https://github.com/doru1004/omptests, which
is probably old (not verified).

The tests t-parallel-for, t-tdpf-nested-parallel, t-tp-nested-parallel, and
t-ttdpf-nested-parallel all fail at runtime with:

  libgomp: GCN fatal error: Asynchronous queue error
  Runtime message: HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION:
     The agent attempted to access memory beyond the largest legal address.

* * *

Running t-parallel-for on gfx90a in the debugger shows the failure in:

Thread 5 "a.out" received signal SIGBUS, Bus error.

[Switching to thread 5, lane 0 (AMDGPU Lane 1:1:1:1/0 (0,0,0)[0,0,0])]
0x00007ffff48bbac8 in gomp_ptrlock_get (ptrlock=0x7ffff4209d44) at
repos/gcc/libgomp/config/accel/ptrlock.h:54

54            && __atomic_compare_exchange_n (ptrlock, &v, 1, false,

(gdb) bt
#0  0x00007ffff48bbac8 in gomp_ptrlock_get (ptrlock=0x7ffff4209d44) at
repos/gcc/libgomp/work.c:209
#2  0x00007ffff486f35c in gomp_loop_ordered_static_start (start=0,
end=<optimized out>, incr=<optimized out>, chunk_size=<optimized out>,
istart=<optimized out>, iend=<optimized out>) at repos/gcc/libgomp/loop.c:325
#3  0x00007ffff4859bb0 in main._omp_fn () at t-parallel-for/test.c:384

* * *

Note that the issue only shows up for the penultimate test case ('K')
and only for 't = 0'; 't = 224' seems to work fine. Namely, ...

* * *

Reduced testcase attached, i.e. the code block that causes the issue. No
attempt has been made to reduce the code block itself.

However, just enabling the first two 'omp parallel' constructs is enough to
trigger the failure. Note that for 't = 0', threads[0] == 0 → 'if(0)'

  for (int t = 0; t <= 224; t += 224)
    {
      int threads[1];
      threads[0] = t;
      #pragma omp target teams num_teams(1) thread_limit(1024)
      {
        S[0] = 0;
        #pragma omp parallel for if(threads[0] > 1) \
                                 num_threads(threads[0]) ordered
        for (int i = 0; i < (1024 * 3); i++)
          #pragma omp ordered
          S[0] += C[i] + D[i];

        #pragma omp parallel for schedule(auto) if(threads[0] > 1) \
                                 num_threads(threads[0]) ordered
        for (int i = 0; i < (1024 * 3); i++)
          #pragma omp ordered
          S[0] += C[i] + D[i];
        // commented code: many more 'parallel for' loops
      }
    }
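
For readers without the attachment at hand, the excerpt above can be wrapped
into a standalone function roughly as follows. This is only a sketch and not
the attached testcase: the helper name run_case, the macro N, the element type
double, the initial values of C and D, and the explicit map clauses are all
assumptions made for illustration.

```c
#include <stdio.h>

#define N (1024 * 3)   /* hypothetical name for the loop bound 1024 * 3 */

/* Assumption: the arrays are double; the report does not show their type.  */
static double C[N], D[N], S[1];

/* Hypothetical helper: runs the first two ordered loops for one value of t
   and returns the accumulated sum.  */
double
run_case (int t)
{
  int threads[1];
  threads[0] = t;

  /* Assumed initial values, chosen so that the sum is easy to check.  */
  for (int i = 0; i < N; i++)
    {
      C[i] = 1.0;
      D[i] = i;
    }

  /* Assumption: explicit map clauses; the report does not show how the
     arrays reach the device.  */
  #pragma omp target teams num_teams(1) thread_limit(1024) \
              map(to: C, D) map(tofrom: S, threads)
  {
    S[0] = 0;
    #pragma omp parallel for if(threads[0] > 1) \
                num_threads(threads[0]) ordered
    for (int i = 0; i < N; i++)
      {
        #pragma omp ordered
        S[0] += C[i] + D[i];
      }

    #pragma omp parallel for schedule(auto) if(threads[0] > 1) \
                num_threads(threads[0]) ordered
    for (int i = 0; i < N; i++)
      {
        #pragma omp ordered
        S[0] += C[i] + D[i];
      }
  }
  return S[0];
}
```

With these assumed inputs each loop adds 4720128, so run_case (0) should
return 9440256 wherever the region executes correctly; on the affected gcn
setup the 't = 0' call would presumably hit the runtime error instead.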

The failure is in libgomp/work.c:209 (side note: GOMP_USE_ALIGNED_WORK_SHARES
is unset for __AMDGCN__; the #if etc. is not shown in the following):

gomp_work_share_start (size_t ordered)
{
  struct gomp_thread *thr = gomp_thread ();
  struct gomp_team *team = thr->ts.team;
  struct gomp_work_share *ws;

  /* Work sharing constructs can be orphaned.  */
  if (team == NULL)
    {
      ws = gomp_malloc (sizeof (*ws));
      gomp_init_work_share (ws, ordered, 1);
      thr->ts.work_share = ws;
      return true;
    }

  ws = thr->ts.work_share;
  thr->ts.last_work_share = ws;
  ws = gomp_ptrlock_get (&ws->next_ws);

where the call in the last line is replaced by the following and, according
to the debugger, the failure is at the compare-exchange line, i.e. in
libgomp/config/accel/ptrlock.h:54:


static inline void *gomp_ptrlock_get (gomp_ptrlock_t *ptrlock)
{
  uintptr_t v = (uintptr_t) __atomic_load_n (ptrlock, MEMMODEL_ACQUIRE);
  if (v > 2)
    return (void *) v;

  if (v == 0
      && __atomic_compare_exchange_n (ptrlock, &v, 1, false,
                                      MEMMODEL_ACQUIRE, MEMMODEL_ACQUIRE))
    return NULL;

  while (v == 1)
    v = (uintptr_t) __atomic_load_n (ptrlock, MEMMODEL_ACQUIRE);

  return (void *) v;
}
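
To make the protocol of the function above easier to follow: 0 means "not yet
set", 1 means "another thread is setting it", and anything larger is the
stored pointer. The following host-only sketch mirrors that logic. It is not
libgomp code; the names sketch_ptrlock_t, sketch_ptrlock_get,
sketch_ptrlock_set and sketch_demo are hypothetical, and the setter is an
assumed counterpart that simply publishes the pointer with a release store.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the lock word: 0 = unset, 1 = being set,
   larger values = the published pointer.  */
typedef uintptr_t sketch_ptrlock_t;

static inline void *
sketch_ptrlock_get (sketch_ptrlock_t *ptrlock)
{
  uintptr_t v = __atomic_load_n (ptrlock, __ATOMIC_ACQUIRE);
  if (v > 2)
    return (void *) v;          /* Pointer already published.  */

  /* First caller moves 0 -> 1 and gets NULL: it has to set the pointer.  */
  if (v == 0
      && __atomic_compare_exchange_n (ptrlock, &v, 1, false,
                                      __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE))
    return NULL;

  /* Somebody else holds the lock; spin until the pointer shows up.  */
  while (v == 1)
    v = __atomic_load_n (ptrlock, __ATOMIC_ACQUIRE);

  return (void *) v;
}

/* Assumed counterpart: publish the pointer, which also releases the lock.  */
static inline void
sketch_ptrlock_set (sketch_ptrlock_t *ptrlock, void *ptr)
{
  __atomic_store_n (ptrlock, (uintptr_t) ptr, __ATOMIC_RELEASE);
}

/* Single-threaded walk through the three states; returns 1 on success.  */
int
sketch_demo (void)
{
  static int payload;
  sketch_ptrlock_t lock = 0;

  void *first = sketch_ptrlock_get (&lock);   /* NULL, lock is now 1.  */
  int ok = (first == NULL && lock == 1);

  sketch_ptrlock_set (&lock, &payload);
  void *second = sketch_ptrlock_get (&lock);  /* The published pointer.  */

  return ok && second == (void *) &payload;
}
```

In the sketch the first getter is the one that runs the compare-exchange, the
same step the debugger points at in the report.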
