The approach I've taken in libgomp/nvptx is to have a single entry point,
gomp_nvptx_main, that takes care of initial allocation, transferring control
to the target region function, and finalization.

At the moment it has the prototype:
void gomp_nvptx_main(void (*fn)(void*), void *fndata);

but it's plausible that down the road we'll need other arguments for passing
data allocated by the plugin.
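
To make the discussion below a bit more concrete, the shape I have in mind is
roughly the following (a sketch only; the init/fini helpers are made-up names,
not actual libgomp functions):

/* Sketch only: helper names are invented for illustration.  */
void
gomp_nvptx_main (void (*fn)(void *), void *fndata)
{
  /* Initial allocation: set up per-team state (thread pool, stacks, ...).  */
  gomp_nvptx_init ();
  /* Transfer control to the target region.  */
  fn (fndata);
  /* Finalization: tear down state, flush any buffered output, etc.  */
  gomp_nvptx_fini ();
}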

I see two possible ways to arrange that.

1.  Make gomp_nvptx_main a .kernel function.  This is what this patch assumes.
This requires emitting pointers to target region functions from the compiler,
and looking them up via cuModuleGetGlobal/cuMemcpyDtoH in the plugin.
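
For illustration, the emitted pointer for a target entry point amounts to
something like the following (written as plain C; the name is an example, the
real symbols use the mangled foo$_omp_fn$0 form):

/* Illustration only: a compiler-emitted pointer to a target region function,
   which the plugin reads back via cuModuleGetGlobal + cuMemcpyDtoH.  */
extern void foo_omp_fn_0 (void *);
void (*__ptr_foo_omp_fn_0) (void *) = foo_omp_fn_0;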

2.  Make gomp_nvptx_main a device (.func) function.  To make that work, we'd
additionally need to emit a "trampoline" of sorts in the NVPTX backend.  For
each OpenMP target entry point foo$_omp_fn$0, we'd have to emit

__global__ void foo$_omp_fn$0$entry(void *args)
{
   gomp_nvptx_main(foo$_omp_fn$0, args);
}

(or perhaps better, rename the original function, and emit the trampoline
under the original name)

In approach 1, the prototype of gomp_nvptx_main is the internal business of
libgomp.  We are free to add arguments to it as needed.  The ABI between
libgomp and the nvptx backend is the name 'gomp_nvptx_main' and the '__ptr_'
prefix for exported function pointers.

In approach 2, the prototype of gomp_nvptx_main becomes an ABI detail between
libgomp and the nvptx backend.  Adding more arguments to gomp_nvptx_main then
gets a bit harder.  On the positive side, we wouldn't need to export function
pointers anymore, so this plugin change wouldn't be needed.

In both cases the ABI of gomp_nvptx_main matters to the libgomp nvptx plugin.
Perhaps we should freeze it right from the beginning, like this:
void gomp_nvptx_main(void (*fn)(void*), void *fndata, void *auxdata, size_t size);
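
With such a frozen signature, the plugin would pass all four arguments through
cuLaunchKernel's kernel-parameter array, roughly like this (a sketch; the
variable names just mirror the prototype's parameters, and mainfunc/stream are
as in the patch below):

/* Sketch: launching gomp_nvptx_main under the four-argument ABI above.  */
void *kargs[] = { &fn, &fndata, &auxdata, &size };
r = cuLaunchKernel (mainfunc, 1, 1, 1, 32, 8, 1,
                    0, ptx_dev->null_stream->stream, kargs, 0);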

I think I like approach 2 more (it'll need time to materialize due to required
legwork in the nvptx backend).  Thoughts?

(admittedly this patch is rather crude: storing a CUdeviceptr in 'function' is
an aliasing violation, and the repeated lookup of gomp_nvptx_main in
GOMP_OFFLOAD_run is pointless; it will all go away if we ultimately go with
approach 2)
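
A slightly cleaner interim variant would read the exported pointer into its
own, correctly typed object instead of punning it through the CUfunction
handle, along these lines (sketch only):

/* Sketch: copy the device function pointer into a plain pointer object.  */
void *target_fn;
r = cuMemcpyDtoH (&target_fn, dptr, sizeof (target_fn));

Again, though, this all becomes moot if we end up going with approach 2.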

The plugin launches 1 team of 8 warps.  The number 8 is not an ABI matter:
gomp_nvptx_main can look up the team size it is launched with.  The static
choice of 8 warps, while unfortunate, shouldn't be a show-stopper: the
implementation can spawn as many threads as it wishes, and if fewer threads
are requested via the num_threads clause, we'll have idle warps.
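
For reference, device code can read the launch geometry from PTX special
registers, so the warp count never needs to become part of the ABI.  A minimal
sketch (using inline asm rather than any particular builtin):

/* Sketch: with the 32x8x1 launch below, %ntid.y is the number of warps (8)
   and %ntid.x is the warp size (32).  */
static inline int
nvptx_team_warps (void)
{
  int ntid_y;
  asm ("mov.u32 %0, %%ntid.y;" : "=r" (ntid_y));
  return ntid_y;
}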

        * plugin/plugin-nvptx.c (GOMP_OFFLOAD_load_image): Try loading
        OpenMP-specific function pointer __ptr_NAME first.
        (GOMP_OFFLOAD_run): Launch gomp_nvptx_main.
---
 libgomp/plugin/plugin-nvptx.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 47ed074..4e9c054 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1566,8 +1566,15 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version,
   for (i = 0; i < fn_entries; i++, targ_fns++, targ_tbl++)
     {
       CUfunction function;
+      CUdeviceptr dptr;
+      char buf[sizeof("__ptr_") + strlen(fn_descs[i].fn)];
 
-      r = cuModuleGetFunction (&function, module, fn_descs[i].fn);
+      strcat(strcpy(buf, "__ptr_"), fn_descs[i].fn);
+      r = cuModuleGetGlobal (&dptr, NULL, module, buf);
+      if (r == CUDA_SUCCESS)
+       cuMemcpyDtoH (&function, dptr, sizeof (void*));
+      else
+       r = cuModuleGetFunction (&function, module, fn_descs[i].fn);
       if (r != CUDA_SUCCESS)
        GOMP_PLUGIN_fatal ("cuModuleGetFunction error: %s", cuda_error (r));
 
@@ -1793,12 +1800,18 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars)
   CUresult r;
   struct ptx_device *ptx_dev = ptx_devices[ord];
   const char *maybe_abort_msg = "(perhaps abort was called)";
-  void *args = &tgt_vars;
+  void *args[] = {&function, &tgt_vars};
 
-  r = cuLaunchKernel (function,
-                     1, 1, 1,
+  CUfunction mainfunc;
+
+  r = cuModuleGetFunction (&mainfunc, ptx_dev->images->module, "gomp_nvptx_main");
+  if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuModuleGetFunction error: %s", cuda_error (r));
+
+  r = cuLaunchKernel (mainfunc,
                      1, 1, 1,
-                     0, ptx_dev->null_stream->stream, &args, 0);
+                     32, 8, 1,
+                     0, ptx_dev->null_stream->stream, args, 0);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
 
