https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108098

--- Comment #3 from Tobias Burnus <burnus at gcc dot gnu.org> ---
The problem - at least when testing on a system with:
  NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2

seems to be that libgomp/plugin/plugin-nvptx.c's GOMP_OFFLOAD_load_image has:

   fn_entries == 3 - and rev_fn_table != NULL (i.e. expect offloading)

and then runs:

      r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
                             "$offload_func_table");
      if (r != CUDA_SUCCESS)
        GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
      assert (bytes == sizeof (uint64_t) * fn_entries);
      *rev_fn_table = GOMP_PLUGIN_malloc (sizeof (uint64_t) * fn_entries);
      r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
      if (r != CUDA_SUCCESS)
        GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));

So far so good - but all entries are NULL. This then disables the checking for
reverse offload on the host side. (It is not quite clear to me why it doesn't
run into an endless loop on the device side.)

The generated PTX code for reverse-offload-1{,-aux}.c is for that offload
table:

        ".version 6.0"
        ".target sm_35"
        ".file 1 \"<dummy>\""
        ".extern .func tg_fn$_omp_fn$0$nohost$0 (.param .u64 %in_ar0);"
        ".extern .func main$_omp_fn$2$nohost$1 (.param .u64 %in_ar0);"
        ".visible .global .align 8 .u64 $offload_func_table[] = {"
                "tg_fn$_omp_fn$0$nohost$0,"
                "main$_omp_fn$2$nohost$1,"
                "0,"
                "0};\n";

which seems to be OK – and works with CUDA 11.  It looks as if the '>= sm_35'
is only one required criterion but that there are additional ones.

 * * *

I am relatively sure that it did work before, but it could well be that I only
checked that the device->host notification worked w/o trying any actual offload
(and before adding all NULL -> no reverse offload). And later when doing the
actual offload tests, I might have missed that machine. — Or I did something
different back then, but I don't know what.

 * * *

In patch "nvptx: Support global constructors/destructors via 'collect2'",
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/607749.html , Thomas
uses a dummy entry - possibly that would be also a solution:

+/* For example with old Nvidia Tesla K20c, Driver Version: 361.93.02, the
+   function pointers stored in the '__CTOR_LIST__', '__DTOR_LIST__' arrays
+   evidently evaluate to NULL in JIT compilation.  Avoiding the use of
+   assembler names ('write_list_with_asm') doesn't help, but defining a dummy
+   function next to the arrays apparently does work around this issue...

Reply via email to