Andrew Stubbs wrote:
This adds support for using Cuda Managed Memory with omp_alloc.  AMD support
will be added in a future patch.

There is one new predefined allocator, "ompx_gnu_managed_mem_alloc", plus a
corresponding memory space, which can be used to allocate memory in the
"managed" space.

Background – as generic information for patch readers.

Managed memory – at least as implemented by Nvidia – migrates
to the device on the first access from the device side, and back
to the host at the end. On some older systems, accessing the
memory on the host while a kernel was still running on the
device would segfault; on newer systems, it just migrates back
[in my understanding].

With newer cards (post Volta) and the open kernel driver, the
behavior is similar without managed memory: if a page fault occurs,
the memory migrates between host and device.

Still, there can be differences: On Grace-Hopper, it seems as
if managed memory just migrates on the first device memory access,
while 'malloc'ed memory stays in place and only migrates after its
memory page has been accessed 256 times (a configurable value).
With GH, the device can read the host memory and vice versa, but
both still have their own memory controller, such that one memory
is closer.

* * *

The nvptx plugin is modified to make the necessary Cuda calls, via two new
(optional) plugin interfaces.

Compared to the previous version posted (Summer 2024), this renames
"unified shared memory" to use "managed memory", which more closely
describes what this really is, and removes all the elements that
attempted to use managed memory to implement USM.  I've also added
Fortran and C++ testcases, and documentation.

* * *

gcc/fortran/ChangeLog:

        * openmp.cc (is_predefined_allocator): Use GOMP_OMP_PREDEF_ALLOC_MAX
        and GOMP_OMPX_PREDEF_ALLOC_MIN/MAX instead of hardcoded values.

Actually, this is only a comment change. I think it would be
useful if that were clear from the ChangeLog wording.

* * *
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
+  CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY = 83,

In CUDA there is:

CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY = 83,
CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS = 89,

My old laptop had only the former, which behaves as described:
when a kernel runs and one tries to concurrently access the
memory on the host, there is a fatal page fault on the host
side and the program fails.

Newer systems have both set. Looking through the
CU_DEVICE_ATTR list of our GPUs, I found only one that supports
the former but not the latter; namely, a
GK210GL [Tesla K80] (rev a1) (Kepler, sm_37).

In principle, we should state that caveat in the documentation;
however, the number of such old Nvidia cards is meanwhile very
low, and as this is a generic issue of managed memory on those
cards, it probably makes sense to just sweep those details under
the carpet.

Still, you should consider either adding this enum value
as well (for completeness) – or just leaving those attributes out.
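
For reference, a minimal sketch of how the plugin could check both
attributes via the standard CUDA driver API (the second enum value
would first need to be added to include/cuda/cuda.h):

#include <stdbool.h>
#include <cuda.h>

/* Sketch only: without CONCURRENT_MANAGED_ACCESS (e.g. the Kepler
   sm_37 case above), host accesses during kernel execution are
   fatal.  */
static bool
device_has_safe_managed_memory (CUdevice dev)
{
  int managed = 0, concurrent = 0;
  if (cuDeviceGetAttribute (&managed,
                            CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY, dev)
      != CUDA_SUCCESS)
    return false;
  cuDeviceGetAttribute (&concurrent,
                        CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS,
                        dev);
  return managed && concurrent;
}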

* * *
+/* Predefined memspace value ranges.  */
+#define GOMP_OMP_PREDEF_MEMSPACE_MAX   4
+#define GOMP_OMPX_PREDEF_MEMSPACE_MIN  200
+#define GOMP_OMPX_PREDEF_MEMSPACE_MAX  200

You need to add the new allocator (and also
ompx_gnu_pinned_mem_alloc) and the new
memspace to libgomp/env.c's parse_allocator

Plus: add some static asserts there to ensure
that we won't miss the update in the future.

Background for this: You can create a default
allocator via the OMP_ALLOCATOR environment variable:
https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fALLOCATOR.html
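
Something like the following is what I have in mind for env.c – a
sketch only; that the new memspace equals the current _MAX value is
my assumption, not necessarily how the patch defines it:

/* Fail the build if a predefined memspace is added without
   updating parse_allocator (which must also accept, e.g.,
   OMP_ALLOCATOR=ompx_gnu_managed_mem_alloc).  */
_Static_assert (ompx_gnu_managed_mem_space == GOMP_OMPX_PREDEF_MEMSPACE_MAX,
                "parse_allocator: handle all predefined memspaces");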

* * *

  linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
                        size_t oldsize, size_t size, int oldpin, int pin)
  {

Don't we need to update this interface to handle both the new and the old
memspace?

At least I fail to see where we handle 'free' vs. 'cuMemFree' if the user
does:

void *ptr = omp_alloc (sizeof(int)*10, omp_default_mem_alloc);
void *ptr2 = omp_alloc (sizeof(int)*10, ompx_gnu_pinned_mem_alloc);
...
ptr = omp_realloc (ptr, sizeof(int)*20, ompx_gnu_managed_mem_alloc,
                   omp_null_allocator);
ptr2 = omp_realloc (ptr2, sizeof(int)*20, ompx_gnu_managed_mem_alloc,
                    omp_null_allocator);
...
omp_free (ptr, omp_null_allocator);
omp_free (ptr2, omp_null_allocator);
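
For illustration, handling mixed memspaces would need roughly the
following shape – a pure sketch; the two-memspace signature and the
helper names are hypothetical, not the patch's actual code:

#include <omp.h>
#include <string.h>

/* Hypothetical per-memspace hooks, resolving to free vs. cuMemFree
   and malloc vs. cuMemAllocManaged as appropriate.  */
extern void *memspace_alloc_sketch (omp_memspace_handle_t, size_t);
extern void memspace_free_sketch (omp_memspace_handle_t, void *, size_t);

static void *
memspace_realloc_sketch (omp_memspace_handle_t old_memspace,
                         omp_memspace_handle_t memspace, void *addr,
                         size_t oldsize, size_t size)
{
  /* Allocate in the new memspace, copy, then free with the routine
     matching the OLD memspace.  */
  void *ret = memspace_alloc_sketch (memspace, size);
  if (ret == NULL)
    return NULL;
  memcpy (ret, addr, oldsize < size ? oldsize : size);
  memspace_free_sketch (old_memspace, addr, oldsize);
  return ret;
}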

* * *

-  if (oldpin && pin)
+  if (memspace == ompx_gnu_managed_mem_space)
+    /* Realloc is not implemented for device Managed Memory.  */
+    ;
+  else if (oldpin && pin)

* * *

It seems as if we should return omp_null_allocator when mixing
that memory space with pinned? Cf. https://gcc.gnu.org/PR122590

One way would be to add a note there and handle it as part of
fixing that PR.

* * *

--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi

> For the memory spaces, the following applies:

+@item @code{ompx_gnu_managed_mem_space} is a GNU extension that provides
+      managed memory accessible by both host and device; it is only available
+      on supported offload targets (see @ref{Offload-Target Specifics}).
+      This memory is accessible by both the host and the device at the same
+      address, but it need not be mapped with @code{map} clauses.  Instead,
+      use the @code{is_device_ptr} clause or @code{has_device_addr} clause
+      to indicate that the pointer is already accessible on the device.
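
For illustration, usage might look as follows – a sketch based on
the documentation above, assuming a supported Nvidia device is the
default device:

#include <omp.h>

void
example (void)
{
  int *a = omp_alloc (100 * sizeof (int), ompx_gnu_managed_mem_alloc);
  /* No 'map' clause needed; 'is_device_ptr' states that the pointer
     is already device-accessible.  */
  #pragma omp target is_device_ptr(a)
  for (int i = 0; i < 100; i++)
    a[i] = i;
  omp_free (a, ompx_gnu_managed_mem_alloc);
}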


The current implementation does:

* If numa/memkind is active, those are used.

* If the default device is the host, gomp_managed_alloc returns
  NULL, invoking the fallback behavior (the default is the default
  mem allocator, i.e. 'malloc' – ignoring all traits, including
  the alignment trait).

* If USM is supported by the default device, it uses 'malloc' –
  honoring at least the alignment trait.

* Otherwise, if the default device's plugin supports managed memory,
  it uses managed memory.

* If it doesn't, it returns NULL – triggering the fallback behavior.
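
In pseudo-C, that cascade is roughly the following; the helper
names are illustrative, not libgomp's actual internals:

#include <omp.h>
#include <stdlib.h>
#include <stdbool.h>

extern bool device_has_usm (int dev);  /* GOMP_OFFLOAD_CAP_SHARED_MEM.  */
extern bool device_has_managed_alloc (int dev);
extern void *device_managed_alloc (int dev, size_t size);

static void *
managed_alloc_sketch (size_t size, size_t align)
{
  /* (numa/memkind handling omitted.)  */
  int dev = omp_get_default_device ();
  if (dev == omp_get_initial_device ())
    return NULL;  /* Host: trigger the fallback behavior.  */
  if (device_has_usm (dev))
    {
      /* Plain host allocation, honoring at least the alignment trait.  */
      void *ptr = NULL;
      return posix_memalign (&ptr, align, size) ? NULL : ptr;
    }
  if (device_has_managed_alloc (dev))
    return device_managed_alloc (dev, size);  /* cuMemAllocManaged.  */
  return NULL;  /* No support: trigger the fallback behavior.  */
}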


* * *

Remarks:

(A) The following seems to have a memory leak:

// omp_set_default_device (0);
// assume that's an Nvidia GPU without GOMP_OFFLOAD_CAP_SHARED_MEM
void *ptr = omp_alloc (1024, ompx_gnu_managed_mem_alloc);
// ptr = cuMemAllocManaged

omp_set_default_device (omp_initial_device);
omp_free (ptr, omp_null_allocator); // ignored


(B) Likewise:

Assuming an Nvidia GPU and an AMD GPU and
no GOMP_OFFLOAD_CAP_SHARED_MEM:

omp_set_default_device (my_Nvidia_GPU);
void *ptr = omp_alloc (1024, ompx_gnu_managed_mem_alloc);
// ptr = cuMemAllocManaged

omp_set_default_device (my_AMD_GPU);
omp_free (ptr, omp_null_allocator); // ignored


(C) Assuming one of them has GOMP_OFFLOAD_CAP_SHARED_MEM
and the other not, we could construct call mismatches like

malloc + cuMemFree  or  cuMemAllocManaged + free

but currently, that's not (yet) possible.
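
One conceivable fix for (A) and (B), as a sketch: record the
allocating device in the allocation's metadata, so that omp_free
dispatches to the matching free routine regardless of the default
device at free time. All names here are hypothetical:

#include <stdlib.h>

/* Free hook of the device whose plugin allocated the memory.  */
extern void device_managed_free (int dev, void *ptr);

/* Metadata prepended to each managed allocation.  */
struct managed_hdr
{
  int device;  /* Allocating device, or -1 for plain host 'malloc'.  */
};

static void
managed_free_sketch (void *ptr)
{
  struct managed_hdr *hdr = (struct managed_hdr *) ptr - 1;
  if (hdr->device < 0)
    free (hdr);
  else
    device_managed_free (hdr->device, hdr);  /* e.g. cuMemFree.  */
}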

* * *

(D) As remarked, for Grace-Hopper there is a difference between
'malloc'/system memory and 'cudaMallocManaged'/managed memory.

For managed memory, a memory page that resides in host memory
migrates to the memory close to the GPU on the first access.

For system memory, the memory remains in host memory – however,
after the memory has been accessed a couple of times,* it moves
to the device.
[* = "By default, N_THRESHOLD = 256, for each 2MB region
(sysadmin configuration)"]

Thus, the question is whether GCC really should ignore
ompx_gnu_managed_mem_space if USM is available.
This at least needs to be documented in the description.

Cf. https://www.fz-juelich.de/en/ias/jsc/news/events/2024/harnessing-integrated-cpu-gpu-system-memory-for-hpc-a-first-look-into-grace-hopper/20241007_presentation_gh_jsc.pdf
Especially page 22 (labelled "23") – but in the end, the whole
document is about this topic.

* * *

Coming back to the device thing:

* Currently, there is nothing device specific in terms of
  allocating device memory. Prefetching would be possible, or
  cuMemAdvise, or … (see the sketch after this list).
  But currently it is just memory reserved or allocated on the host.

* The way it can be allocated depends on the device type, not
  on the device.

* Whether the device type host vs. nvptx vs gcn is used
  depends on omp_get_default_device() at the time of the
  allocation/deallocation.

* OpenMP 6 adds to the traits:
  - access = 'all'  (OpenMP 5.x's 'all' is now 'device')
  - preferred_device = <device num>

  and, for the memory spaces, it has routines like
  omp_get_devices_memspace that return a memory space
  accessible from a set of devices.
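
For Nvidia, the device-specific tuning mentioned above could look
roughly like this – standard CUDA driver API, but the hook itself
is hypothetical:

#include <cuda.h>

static void
managed_hints_sketch (CUdeviceptr ptr, size_t size, CUdevice dev,
                      CUstream stream)
{
  /* Prefer keeping the pages resident near the device ...  */
  cuMemAdvise (ptr, size, CU_MEM_ADVISE_SET_PREFERRED_LOCATION, dev);
  /* ... and migrate them there before the first kernel access.  */
  cuMemPrefetchAsync (ptr, size, dev, stream);
}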


The question is how to handle the host vs. nvidia vs. gcn
thing (including a later per-device setting of the USM
capability):
(i) for user-defined allocators
(ii) for ompx_gnu_managed_mem_space

For user-defined allocators, I would be inclined to save
the current default device in the descriptor – as we
presumably need it later anyway (default device), albeit
we might later want to use it for prefetching as well?
Taking the value at the time the allocator is constructed
is also, kind of, sensible.

However, for ompx_gnu_managed_mem_space, we don't really have
a place to store the default device number. We could handle it
similarly to pinned – where the offload devices are walked and
the first one supporting it is picked. This currently works
fine (only Nvidia is supported), but as soon as AMD GPU support
for managed memory is added, it will fail.

Additionally, it will then depend on the order of device
walking, which is in theory not defined. (It is a configure-time
decision, and I think all distros build nvptx before gcn for
historic reasons. – Additionally, mixed AMD/Nvidia systems are
somewhat rare.)

* * *

In any case, we need to somehow sensibly solve the
device/device-type picking issue.

And, in light of systems like Grace-Hopper, we have to (re)consider
whether on USM systems, malloc or cuMemAllocManaged will be called
for the new managed memspace/allocator + document what happens.

* * *

[AMD GPUs]> +@item Managed memory allocated using @code{omp_alloc} with the
+      @code{ompx_gnu_managed_mem_alloc} allocator is not currently supported
+      on AMD GPU devices.

[This implies a fallback is used – unless USM is available, in
which case malloc is used. (Just an observation)]

The next comment also applies here.

[Nvidia GPUs]> +@item Managed memory allocated using @code{omp_alloc}

Can we change this to
"Memory allocated with the OpenMP @code{ompx_gnu_managed_mem_alloc}
allocator or in the @code{ompx_gnu_managed_mem_space} ..."?

Reason:
- The memspace should be mentioned as well.
- It is clearly not only 'omp_alloc' but also 'omp_calloc'
  or the 'allocate' clause - be it on 'parallel' or, in particular,
  on the 'allocators' directive.
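
For example – both of the following should end up in managed
memory, not only plain omp_alloc:

#include <omp.h>
#include <stdlib.h>

void
example (void)
{
  /* Managed memory via omp_calloc ...  */
  int *a = omp_calloc (10, sizeof (int), ompx_gnu_managed_mem_alloc);

  /* ... and via the OpenMP 5.2 'allocators' directive.  */
  int *b;
  #pragma omp allocators allocate(allocator(ompx_gnu_managed_mem_alloc): b)
  b = malloc (10 * sizeof (int));

  omp_free (a, omp_null_allocator);
  omp_free (b, omp_null_allocator);  /* Allocated via the directive,
                                        hence freed with omp_free.  */
}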

+      memory in the CUDA Managed Memory space using @code{cuMemAllocManaged}.
+      This memory is accessible by both the host and the device at the same
+      address, but it need not be mapped with @code{map} clauses.  Instead,

I don't understand the ", but" - I think " and" makes more sense.

(Aside: There is also nothing wrong, in principle, with mapping
this data – except that there is no need for it, and it is
pointless to use managed memory and still copy it around. Thus,
the rest can remain there, why not.)

* * *

-void *
-GOMP_OFFLOAD_alloc (int ord, size_t size)
+static void *
+GOMP_OFFLOAD_alloc_1 (int ord, size_t size, bool managed)
  {

IMHO, for consistency, this should better be named, e.g.,
nvptx_alloc_1, leaving the GOMP_OFFLOAD_ prefix identifier
space to the actually exported functions. (But one can also
argue otherwise.)

Thanks for the patch!

Tobias
