I found out that with the current Debian gcc-14-offload-nvptx and
gcc-13-offload-nvptx, if I compile code that requires
unified_shared_memory and uses OpenMP to offload to the GPU, the code
never runs on the GPU. The offload code does compile, but it is never
executed on the GPU.
That's to be expected for (at least) GCC 13 and GCC 14.
The OpenMP spec states that 'available devices' must be
'accessible' and 'supported'. And the latter is defined
(glossary, here from 6.0):
"supported device - The host device or any non-host device supported
by the implementation, including any device-related requirements
specified by the requires directive."
Thus, if you specify
omp requires unified_shared_memory
and either the device or the implementation does not support
unified shared memory, all unsupported devices are removed such
that only the host is left (host fallback).
In some old GCC versions, '#pragma omp requires' was simply
ignored (warning with -Wunknown-pragmas, implied by -Wall).
For some versions, requiring USM would give an error.
I think since GCC 13, the host-fallback mechanism is at work,
printing a warning at runtime with GOMP_DEBUG=1 if a device
cannot fulfill the requirement.
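You can observe this host fallback at runtime; here is a minimal,
untested sketch (assuming GCC with nvptx offloading, compiled with,
e.g., -fopenmp):

#include <stdio.h>
#include <omp.h>

#pragma omp requires unified_shared_memory

int main (void)
{
  int on_host = 1;

  /* With the USM requirement, devices that do not support it are
     removed from the list of available devices; the target region
     then falls back to the host.  */
  #pragma omp target map(tofrom: on_host)
    on_host = omp_is_initial_device ();

  printf ("non-host devices: %d, target region ran on: %s\n",
          omp_get_num_devices (),
          on_host ? "host (fallback)" : "device");
  return 0;
}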
* * *
Since GCC 15, USM is supported under the following conditions:
https://gcc.gnu.org/onlinedocs/libgomp/Offload-Target-Specifics.html
As this is about an Nvidia GPU:
"* OpenMP code that has a requires directive with self_maps or
unified_shared_memory runs on nvptx devices if and only if all
of those support the pageableMemoryAccess property;⁵ otherwise,
all nvptx devices are removed from the list of available devices
(“host fallback”)."
(5)
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
Which is fulfilled by [cf. (5)]:
"Linux HMM requires Linux kernel version 6.1.24+, 6.2.11+ or 6.3+,
devices with compute capability 7.5 or higher and a CUDA driver version
535+ installed with Open Kernel Modules."

The pageableMemoryAccess property is true on, e.g., the Frontier
supercomputer but also on my laptop (compute capability 8.6, Ampere,
by now a 6.15 kernel). Admittedly, we had some issues with Debian 12
Bookworm and an Ada (8.9) card with the current 6.1.140 kernel
(>= 6.1.24) and a recent open-kernel driver, even though it should
have worked according to the spec.
You can check this by something like:
#include <cuda.h>  /* CUDA driver API; link with -lcuda.  */

int main (void)
{
  CUresult res;
  int n;

  res = cuInit (0);
  res = cuDeviceGetCount (&n);
  for (int dev = 0; dev < n; ++dev)
    {
      int val;
      __builtin_printf ("============== Device %d =================\n", dev);
      res = cuDeviceGetAttribute (&val,
                                  CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS,
                                  dev);
      __builtin_printf ("Device %d: pageableMemoryAccess: %d\n", dev, val);
    }
  (void) res;  /* Error handling omitted for brevity.  */
  return 0;
}
* * *
SOLUTION:
* For full USM support, your system needs to support
pageableMemoryAccess (at least effectively for the devices involved).
If it does, you have two choices:
* Using GCC 15 from experimental, which supports USM, cf.
https://gcc.gnu.org/gcc-15/changes.html and
https://gcc.gnu.org/projects/gomp/#omp5.0 and
https://tracker.debian.org/pkg/gcc-15
* Using (any) older GCC, but avoiding
omp requires unified_shared_memory.
The difference between the two solutions:
- With the requirement, all maps are 'self maps', i.e.
no data is actually copied.
- Without the requirement, data is copied, but as, e.g.,
pointer members of structs still point to the host memory,
accessing those from the device will still work (see the sketch
below).
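A minimal sketch of the struct-with-pointer-member case (untested;
it assumes a system where the device can access pageable host memory):

#include <stdio.h>
#include <stdlib.h>

/* Uncomment to make all maps 'self maps' (no copies):
   #pragma omp requires unified_shared_memory  */

struct vec { int n; double *data; };

int main (void)
{
  struct vec v;
  v.n = 8;
  v.data = malloc (v.n * sizeof (double));
  for (int i = 0; i < v.n; ++i)
    v.data[i] = i;

  double sum = 0.0;
  /* Without the requirement, 'v' is copied to the device, but
     'v.data' still points to host memory; dereferencing it on the
     device relies on the GPU being able to access pageable host
     memory (HMM).  */
  #pragma omp target map(to: v) map(tofrom: sum)
  for (int i = 0; i < v.n; ++i)
    sum += v.data[i];

  printf ("sum = %f\n", sum);
  free (v.data);
  return 0;
}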
The USM support (HMM) works as follows:
If you access memory on the GPU that is not directly accessible
(= most host memory, unless you have, e.g., a Grace Hopper system), a
memory-page fault is triggered and the Linux kernel (+ Nvidia
kernel driver) migrates the page to device-accessible memory.
Likewise on the way back from the device to host-accessible
memory.
* * *
If the system does not support pageableMemoryAccess but at least
managedMemory, you can access such memory (only) from the device.
If you are careful, this will work - but, obviously, the compiler
cannot regard such a system as supporting USM. Such memory
can be obtained using the CUDA-runtime routines for pinned and
managed memory.
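A rough sketch (untested, error handling omitted) of obtaining such
memory via the CUDA runtime; link against the runtime, e.g. -lcudart:

#include <stdio.h>
#include <cuda_runtime.h>

int main (void)
{
  size_t bytes = 100 * sizeof (double);
  double *managed, *pinned;

  /* Managed memory: accessible from the device even without
     pageableMemoryAccess.  */
  cudaMallocManaged ((void **) &managed, bytes, cudaMemAttachGlobal);

  /* Pinned (page-locked) host memory: also device accessible.  */
  cudaMallocHost ((void **) &pinned, bytes);

  printf ("managed = %p, pinned = %p\n", (void *) managed, (void *) pinned);

  cudaFree (managed);
  cudaFreeHost (pinned);
  return 0;
}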
I hope it helps.
Tobias
PS: GCC 16 will support some more memory-handling tweaks and other
improvements. The release notes are still out of date and very much
work in progress:
https://gcc.gnu.org/gcc-16/changes.html