https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121927
Bug ID: 121927
Summary: [gcn][OpenMP] "Asynchronous queue error" with 'omp
parallel if(0) ordered' + 'omp ordered'
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Keywords: openmp, wrong-code
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: burnus at gcc dot gnu.org
CC: ams at gcc dot gnu.org, jakub at gcc dot gnu.org
Target Milestone: ---
Target: gcn
Created attachment 62374
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62374&action=edit
Reduced testcase - compile with -fopenmp
This occurs with OMPTests, https://github.com/doru1004/omptests; the issue
is probably old (not verified).
The tests t-parallel-for, t-tdpf-nested-parallel, t-tp-nested-parallel, and
t-ttdpf-nested-parallel all fail at runtime with:
libgomp: GCN fatal error: Asynchronous queue error
Runtime message: HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION:
The agent attempted to access memory beyond the largest legal address.
* * *
Running t-parallel-for on gfx90a in the debugger shows the fail in:
Thread 5 "a.out" received signal SIGBUS, Bus error.
[Switching to thread 5, lane 0 (AMDGPU Lane 1:1:1:1/0 (0,0,0)[0,0,0])]
0x00007ffff48bbac8 in gomp_ptrlock_get (ptrlock=0x7ffff4209d44) at
repos/gcc/libgomp/config/accel/ptrlock.h:54
54 && __atomic_compare_exchange_n (ptrlock, &v, 1, false,
(gdb) bt
#0 0x00007ffff48bbac8 in gomp_ptrlock_get (ptrlock=0x7ffff4209d44) at
repos/gcc/libgomp/work.c:209
#2 0x00007ffff486f35c in gomp_loop_ordered_static_start (start=0,
end=<optimized out>, incr=<optimized out>, chunk_size=<optimized out>,
istart=<optimized out>, iend=<optimized out>) at repos/gcc/libgomp/loop.c:325
#3 0x00007ffff4859bb0 in main._omp_fn () at t-parallel-for/test.c:384
* * *
Note that the issue only shows up for the penultimate test case ('K')
and only for 't = 0'; 't = 224' seems to work fine. Namely, ...
* * *
Reduced testcase attached, i.e. the code block that causes the issue. No
attempt has been made to reduce the code block itself.
However, just enabling the first two 'omp parallel' loops is enough to
trigger the failure.
Note that threads[0] == 0 → 'if(0)'
for (int t = 0; t <= 224; t += 224)
  {
    int threads[1];
    threads[0] = t;
    #pragma omp target teams num_teams(1) thread_limit(1024)
    {
      S[0] = 0;
      #pragma omp parallel for if(threads[0] > 1) \
                               num_threads(threads[0]) ordered
      for (int i = 0; i < (1024 * 3); i++)
        #pragma omp ordered
        S[0] += C[i] + D[i];
      #pragma omp parallel for schedule(auto) if(threads[0] > 1) \
                               num_threads(threads[0]) ordered
      for (int i = 0; i < (1024 * 3); i++)
        #pragma omp ordered
        S[0] += C[i] + D[i];
      // commented code: many more 'parallel for' loops
    }
  }
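For readers without the attachment at hand, the excerpt above can be wrapped into a standalone function roughly as follows. This is only a sketch and has not been verified to reproduce the failure: the element type and contents of S, C and D, the explicit map clauses, and the function name run_ordered_loops are assumptions made for illustration, not taken from the attachment. The loop bodies are braced so the file also compiles without -fopenmp.

```c
#include <assert.h>

#define N (1024 * 3)

/* Assumed element type and contents; the attached testcase defines its own.  */
static double S[1], C[N], D[N];

/* Hypothetical wrapper: runs the first two ordered loops for one value of t
   and returns the accumulated S[0].  */
static double
run_ordered_loops (int t)
{
  int threads[1];

  assert (t >= 0);
  threads[0] = t;

  /* Small integer values keep the floating-point sums exact.  */
  for (int i = 0; i < N; i++)
    {
      C[i] = i;
      D[i] = 1;
    }

  /* The map clauses are an assumption; the excerpt relies on whatever
     mapping the full testcase sets up.  */
  #pragma omp target teams num_teams(1) thread_limit(1024) \
          map(tofrom: S) map(to: C, D, threads)
  {
    S[0] = 0;
    #pragma omp parallel for if(threads[0] > 1) \
                             num_threads(threads[0]) ordered
    for (int i = 0; i < N; i++)
      {
        #pragma omp ordered
        S[0] += C[i] + D[i];
      }
    #pragma omp parallel for schedule(auto) if(threads[0] > 1) \
                             num_threads(threads[0]) ordered
    for (int i = 0; i < N; i++)
      {
        #pragma omp ordered
        S[0] += C[i] + D[i];
      }
  }
  return S[0];
}
```

Calling run_ordered_loops (0) exercises the 'if(0)' path described above, and run_ordered_loops (224) the variant that seems to work; with the assumed data both calls should return the same sum.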
The failure is in libgomp/work.c:209 (side note: GOMP_USE_ALIGNED_WORK_SHARES
is unset for __AMDGCN__; the #if etc. are not shown in the following):
gomp_work_share_start (size_t ordered)
{
  struct gomp_thread *thr = gomp_thread ();
  struct gomp_team *team = thr->ts.team;
  struct gomp_work_share *ws;

  /* Work sharing constructs can be orphaned. */
  if (team == NULL)
    {
      ws = gomp_malloc (sizeof (*ws));
      gomp_init_work_share (ws, ordered, 1);
      thr->ts.work_share = ws;
      return true;
    }

  ws = thr->ts.work_share;
  thr->ts.last_work_share = ws;
  ws = gomp_ptrlock_get (&ws->next_ws);
where the last line is replaced by the following and, according to the
debugger, the failure is at the compare-exchange line, i.e. in
libgomp/config/accel/ptrlock.h:54:
static inline void *gomp_ptrlock_get (gomp_ptrlock_t *ptrlock)
{
  uintptr_t v = (uintptr_t) __atomic_load_n (ptrlock, MEMMODEL_ACQUIRE);
  if (v > 2)
    return (void *) v;

  if (v == 0
      && __atomic_compare_exchange_n (ptrlock, &v, 1, false,
                                      MEMMODEL_ACQUIRE, MEMMODEL_ACQUIRE))
    return NULL;

  while (v == 1)
    v = (uintptr_t) __atomic_load_n (ptrlock, MEMMODEL_ACQUIRE);

  return (void *) v;
}
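To make the protocol in gomp_ptrlock_get easier to follow, here is a host-side sketch of the same three-state scheme: 0 means unset, 1 means one thread has claimed the slot and will publish a pointer, and any larger value is the published pointer. The names demo_ptrlock_t, demo_ptrlock_get, demo_ptrlock_set and demo_ptrlock_selfcheck are hypothetical; the setter is modelled as a plain release store, which is an assumption about how the slot gets published, not a copy of libgomp's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for gomp_ptrlock_t:
   0 = unset, 1 = claimed, anything greater than 2 = published pointer.  */
typedef uintptr_t demo_ptrlock_t;

static void *
demo_ptrlock_get (demo_ptrlock_t *lock)
{
  uintptr_t v = __atomic_load_n (lock, __ATOMIC_ACQUIRE);
  if (v > 2)
    return (void *) v;          /* Already published.  */

  /* First caller moves 0 -> 1 and gets NULL: it must publish a pointer.
     On failure, v is updated to the value currently in the slot.  */
  if (v == 0
      && __atomic_compare_exchange_n (lock, &v, 1, false,
                                      __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE))
    return NULL;

  /* Everyone else spins until the claimant has published.  */
  while (v == 1)
    v = __atomic_load_n (lock, __ATOMIC_ACQUIRE);

  return (void *) v;
}

/* Assumed publishing side: a release store of the pointer.  */
static void
demo_ptrlock_set (demo_ptrlock_t *lock, void *ptr)
{
  __atomic_store_n (lock, (uintptr_t) ptr, __ATOMIC_RELEASE);
}

/* Single-threaded walk through the states; returns the published value.  */
static int
demo_ptrlock_selfcheck (void)
{
  static int payload = 42;
  demo_ptrlock_t lock = 0;

  void *first = demo_ptrlock_get (&lock);
  assert (first == NULL);       /* The caller claimed the slot...  */
  assert (lock == 1);           /* ...which is now marked as claimed.  */

  demo_ptrlock_set (&lock, &payload);

  void *second = demo_ptrlock_get (&lock);
  assert (second == &payload);  /* Later callers see the pointer.  */
  return *(int *) second;
}
```

Both the load and the compare-exchange dereference the address passed in, which in gomp_work_share_start is &ws->next_ws, so the sketch only behaves as described when that address points to valid, initialized storage.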