https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88941
--- Comment #1 from Tom de Vries <vries at gcc dot gnu.org> ---
A patch like this waits for the kernel to finish, and then forces processing
of the event, so it fixes the failing test-case:
...
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index dd2bcf3083f..56334a401dc 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -478,6 +478,8 @@ init_streams_for_device (struct ptx_device *ptx_dev, int concurrency)
   return true;
 }

+static void event_gc (bool);
+
 static bool
 fini_streams_for_device (struct ptx_device *ptx_dev)
 {
@@ -489,6 +491,8 @@ fini_streams_for_device (struct ptx_device *ptx_dev)
       struct ptx_stream *s = ptx_dev->active_streams;
       ptx_dev->active_streams = ptx_dev->active_streams->next;

+      CUDA_CALL_ASSERT (cuStreamSynchronize, s->stream);
+      event_gc (false);
       ret &= map_fini (s);

       CUresult r = CUDA_CALL_NOCHECK (cuStreamDestroy, s->stream);
...
but it causes a regression for lib-82.c.

There's a segfault in the event_gc call we just added, because nvthd is NULL,
and we dereference it here:
...
1029          if (e->ord != nvthd->ptx_dev->ord)
1030            continue;
(gdb) p nvthd
$9 = (struct nvptx_thread *) 0x0
...

The nvthd is NULL because, at that point, the acc_shutdown (in the lib-82.c
source) has already resulted in a call to
GOMP_OFFLOAD_openacc_destroy_thread_data.

In the acc_shutdown documentation, we read:
...
- This routine may not be called during execution of an accelerator compute
  region.
- If the program attempts to execute a compute region or access any device
  data on such a device, the behavior is undefined.
...

The lib-82.c testcase launches kernels asynchronously (not by using parallel
async, but by using cuLaunchKernel). The documentation of acc_shutdown seems
to imply that we need to wait for all those streams to finish before calling
acc_shutdown.

There is a wait call before acc_shutdown:
...
  acc_wait_all_async (0);
...
but the semantics for that one are:
...
The acc_wait_all_async routine enqueues wait operations on one async queue for
the operations previously enqueued on all other async queues.
...

ISTM that this can't guarantee that all queues have finished.

OTOH, using acc_wait_all, with semantics:
...
The acc_wait_all routine waits for completion of all asynchronous operations.
...
seems to guarantee that, and indeed using this call instead fixed the lib-82.c
failure.
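
To make the difference between the two wait calls concrete, here is a toy
model in plain C. It is only a sketch under stated assumptions: every toy_*
name and the per-queue pending counter are invented for illustration, and none
of this is libgomp's or the CUDA driver's actual implementation. It models
only the quoted semantics: the async variant enqueues a wait and returns to
the host immediately, while the blocking variant returns only once everything
has completed.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NQUEUES 3

/* Hypothetical toy device: each async queue is reduced to a count of
   operations that have been enqueued but have not completed yet.  */
struct toy_device
{
  int pending[TOY_NQUEUES];
};

/* Models an asynchronous kernel launch on queue Q.  */
static void
toy_launch_async (struct toy_device *d, int q)
{
  assert (q >= 0 && q < TOY_NQUEUES);
  d->pending[q]++;
}

/* Models the quoted acc_wait_all_async semantics: a wait operation is
   enqueued on queue Q and the host returns immediately.  Nothing has
   completed when this returns; queue Q has one more pending operation.  */
static void
toy_wait_all_async (struct toy_device *d, int q)
{
  assert (q >= 0 && q < TOY_NQUEUES);
  d->pending[q]++;
}

/* Models the quoted acc_wait_all semantics: the host blocks until all
   asynchronous operations have completed, so every queue is drained.  */
static void
toy_wait_all (struct toy_device *d)
{
  for (int i = 0; i < TOY_NQUEUES; i++)
    d->pending[i] = 0;
}

/* In this model, shutting down is safe only if no queue has pending work.  */
static bool
toy_shutdown_is_safe (const struct toy_device *d)
{
  for (int i = 0; i < TOY_NQUEUES; i++)
    if (d->pending[i] != 0)
      return false;
  return true;
}

/* Replays the lib-82.c pattern in the toy model: launch work on two
   queues, wait with one of the two calls, then ask whether a shutdown
   would be safe at that point.  */
static bool
toy_safe_after (bool use_blocking_wait)
{
  struct toy_device d = { { 0 } };

  toy_launch_async (&d, 1);
  toy_launch_async (&d, 2);

  if (use_blocking_wait)
    toy_wait_all (&d);
  else
    toy_wait_all_async (&d, 0);

  return toy_shutdown_is_safe (&d);
}
```

In this model, toy_safe_after (false) reports that work is still pending at
the point where acc_shutdown would run, and toy_safe_after (true) reports that
all queues are drained, which matches the observation that switching to
acc_wait_all fixed lib-82.c.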