@Thomas, there are questions for you below....
On 22/11/2023 17:07, Tobias Burnus wrote:
A note up front: starting with TR11, alias OpenMP 6.0, OpenMP supports handling
multiple devices for allocation. It seems that after:
my_memspace = omp_get_device_and_host_memspace (5, omp_default_mem_space);
my_alloc = omp_init_allocator (my_memspace, ntraits, my_traits_with_pinning);
pinning via device '5' should be possible.
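(As an aside, a minimal sketch of what such usage could look like under TR11,
assuming the names above and a single 'pinned' trait; the exact API may still
change, and none of this is implemented in libgomp yet:

  #include <stddef.h>
  #include <omp.h>

  /* Hypothetical TR11 usage, sketch only.  */
  void *
  alloc_pinned_for_device_5 (size_t n)
  {
    omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
    /* Memory space reachable from both the host and device 5.  */
    omp_memspace_handle_t ms
      = omp_get_device_and_host_memspace (5, omp_default_mem_space);
    omp_allocator_handle_t al = omp_init_allocator (ms, 1, traits);
    return omp_alloc (n, al);
  }
)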
This patch is intended to get us to 5.0. These are future features for
future patches.
However, I believe that it shouldn't really matter for now, given that CUDA
has no special handling of the NUMA hierarchy, neither on the host nor for
specific devices, and GCN has none at all.
It only becomes interesting if mmap/mlock memory is (measurably) faster than
CUDA-allocated memory when accessed from the host or, for USM, from GCN.
I don't believe there's any issue here (yet).
Let's start with the patch itself:
--- a/libgomp/target.c
+++ b/libgomp/target.c
...
+static struct gomp_device_descr *
+get_device_for_page_locked (void)
+{
+ gomp_debug (0, "%s\n",
+ __FUNCTION__);
+
+ struct gomp_device_descr *device;
+#ifdef HAVE_SYNC_BUILTINS
+ device
+ = __atomic_load_n (&device_for_page_locked, MEMMODEL_RELAXED);
+ if (device == (void *) -1)
+ {
+ gomp_debug (0, " init\n");
+
+ gomp_init_targets_once ();
+
+ device = NULL;
+ for (int i = 0; i < num_devices; ++i)
Given that this function just sets a single variable based on whether the
page_locked_host_alloc_func function pointer exists, wouldn't it be much
simpler to do all this handling in gomp_target_init?
@Thomas, care to comment on this?
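(For concreteness, a sketch of how that might look, reusing the patch's names
inside the existing gomp_target_init; hypothetical, not tested:

  /* In gomp_target_init, after the plugins have been loaded:  */
  for (int i = 0; i < num_devices; ++i)
    if (devices[i].target_id == 0
        && devices[i].page_locked_host_alloc_func)
      {
        device_for_page_locked = &devices[i];
        break;
      }

That would avoid the lazy-initialization atomics in
get_device_for_page_locked entirely.)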
+ for (int i = 0; i < num_devices; ++i)
...
+ /* We consider only the first device of potentially several of the
+ same type as this functionality is not specific to an individual
+ offloading device, but instead relates to the host-side
+ implementation of the respective offloading implementation. */
+ if (devices[i].target_id != 0)
+ continue;
+
+ if (!devices[i].page_locked_host_alloc_func)
+ continue;
...
+ if (device)
+ gomp_fatal ("Unclear how %s and %s libgomp plugins may"
+ " simultaneously provide functionality"
+ " for page-locked memory",
+ device->name, devices[i].name);
+ else
+ device = &devices[i];
I find this a bit inconsistent: if, let's say, GCN does not provide its own
pinning, the code assumes that CUDA pinning is just fine. However, if both
support it, CUDA pinning suddenly is not fine for GCN.
I think it means that we need to revisit this code if that situation
ever occurs. Again, @Thomas?
Additionally, the wording suggests that, for CUDA, it does not matter for
which device's access we want to optimize the pinning. But the code above
also fails if I have a system with two Nvidia cards. From the wording, it
sounds as if just checking whether device->type differs would do.
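(That is, something along these lines, again reusing the patch's names;
sketch only:

  if (device && device->type != devices[i].type)
    gomp_fatal ("Unclear how %s and %s libgomp plugins may"
                " simultaneously provide functionality"
                " for page-locked memory",
                device->name, devices[i].name);
  else if (!device)
    device = &devices[i];

i.e. only genuinely different device types conflict; several devices handled
by the same plugin share one host-side implementation anyway.)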
But all in all, I wonder whether it wouldn't be much simpler to state
something like the following (where applicable):
The first device that provides pinning support is used; the assumption is
that all other devices and the host can access this memory without a
measurable performance penalty compared to a normal page lock, and that
neither multiple device types nor host/device NUMA-aware pinning support in
the plugin is available.
NOTE: For OpenMP 6.0's OMP_AVAILABLE_DEVICES environment variable and
device-set memory spaces, this might need to be revisited.
This seems reasonable to me, at least until the user can specify otherwise.
(I'm going to go look at the other review points now....)
Andrew