https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88941

--- Comment #1 from Tom de Vries <vries at gcc dot gnu.org> ---
A patch like this waits for the kernel to finish, and then forces processing
the event, so it fixes the failing test-case:
...
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index dd2bcf3083f..56334a401dc 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -478,6 +478,8 @@ init_streams_for_device (struct ptx_device *ptx_dev, int concurrency)
   return true;
 }

+static void event_gc (bool);
+
 static bool
 fini_streams_for_device (struct ptx_device *ptx_dev)
 {
@@ -489,6 +491,8 @@ fini_streams_for_device (struct ptx_device *ptx_dev)
       struct ptx_stream *s = ptx_dev->active_streams;
       ptx_dev->active_streams = ptx_dev->active_streams->next;

+      CUDA_CALL_ASSERT (cuStreamSynchronize, s->stream);
+      event_gc (false);
       ret &= map_fini (s);

       CUresult r = CUDA_CALL_NOCHECK (cuStreamDestroy, s->stream);
...

but it causes a regression for lib-82.c. There's a segfault in the event_gc
call we just added, because nvthd is NULL and we dereference it here:
...
1029          if (e->ord != nvthd->ptx_dev->ord)
1030            continue;
(gdb) p nvthd
$9 = (struct nvptx_thread *) 0x0
...

nvthd is NULL because, at that point, the acc_shutdown call (in the lib-82.c
source) has already resulted in a call to
GOMP_OFFLOAD_openacc_destroy_thread_data.

In the acc_shutdown documentation, we read:
...
- This routine may not be called during execution of an accelerator compute
  region.
- If the program attempts to execute a compute region or access any device
  data on such a device, the behavior is undefined.
...

The lib-82.c testcase launches kernels asynchronously (not by using parallel
async, but by using cuLaunchKernel).

The documentation of acc_shutdown seems to imply that we need to wait for all
those streams to finish before calling acc_shutdown.

There is a wait call before acc_shutdown:
...
  acc_wait_all_async (0);
...
but the semantics of that one are:
...
The acc_wait_all_async routine enqueues wait operations on one async queue
for the operations previously enqueued on all other async queues.
...

ISTM that this can't guarantee that all queues have finished.

OTOH, using acc_wait_all, with semantics:
...
The acc_wait_all routine waits for completion of all asynchronous operations.
...
seems to guarantee that, and indeed using this call instead fixed the lib-82.c
failure.
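
To make the difference between the two wait calls concrete, here is a toy
host-side model in C. It is only a sketch of the reasoning above, not libgomp
or OpenACC code: struct queues, model_wait_all_async, model_wait_all and
safe_to_shutdown are hypothetical names invented for this illustration, and a
queue is reduced to a count of operations that have not completed yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NQUEUES 3

/* Toy model (hypothetical, not libgomp): pending[q] counts the operations
   enqueued on async queue q that have not completed yet.  */
struct queues
{
  int pending[NQUEUES];
};

/* Models the documented behaviour of acc_wait_all_async (q): the host only
   enqueues a wait operation on queue q and returns immediately.  Nothing is
   known to have completed when this returns.  */
static void
model_wait_all_async (struct queues *qs, size_t q)
{
  assert (q < NQUEUES);
  qs->pending[q] += 1; /* The wait itself is one more pending operation.  */
}

/* Models the documented behaviour of acc_wait_all (): the host blocks until
   every asynchronous operation on every queue has completed.  */
static void
model_wait_all (struct queues *qs)
{
  for (size_t i = 0; i < NQUEUES; i++)
    qs->pending[i] = 0;
}

/* In this model, shutting down is only safe once no queue has pending
   work.  */
static bool
safe_to_shutdown (const struct queues *qs)
{
  for (size_t i = 0; i < NQUEUES; i++)
    if (qs->pending[i] != 0)
      return false;
  return true;
}
```

In this model, calling model_wait_all_async (&qs, 0) on a state with
outstanding work leaves safe_to_shutdown (&qs) false, while model_wait_all
(&qs) makes it true, which mirrors why replacing acc_wait_all_async (0) with
acc_wait_all () in lib-82.c avoids shutting down while kernels may still be
running.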
