[@Thomas, there's a question for you below.]

On 17/10/2025 16:18, Tobias Burnus wrote:
Andrew Stubbs wrote:
Use Cuda to pin memory, instead of Linux mlock, when available.

There are two advantages: firstly, this gives a significant speed boost for
NVPTX offloading, and secondly, it side-steps the usual OS ulimit/rlimit
setting.

Slightly offtopic remark:
I think it 'only' helps with performance when
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES
is false - and the same effect can also be achieved by using
cuMemHostRegister as an alternative. I wrote 'only' because the only
two systems I know of that have this attribute set are a Grace-Hopper
GH100 system and ORNL's former Summit system.

I do not recall on which system we observed the speed-up (it was a couple of years ago), but it cannot have been either of those, as we did not have access to them at the time. We did observe a real-world speed-up on one of the x86_64 test machines we had then (the PowerPC machine I tested, however, did not demonstrate the advantage).
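For reference, the attribute check Tobias mentions, combined with pinning an existing buffer via cuMemHostRegister, would look roughly like this with the CUDA driver API (an untested sketch only - in libgomp the calls would of course go through the plugin's CUDA function table, and pin_host_buffer is just an illustrative name):

  #include <stddef.h>
  #include <cuda.h>

  /* Page-lock an existing host buffer, unless the device already accesses
     pageable memory through the host page tables (e.g. Grace-Hopper), in
     which case pinning brings no transfer speed-up.  Returns nonzero on
     success or when pinning is unnecessary.  Sketch only.  */
  static int
  pin_host_buffer (CUdevice dev, void *ptr, size_t size)
  {
    int uses_host_page_tables = 0;
    cuDeviceGetAttribute (&uses_host_page_tables,
			  CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES,
			  dev);
    if (uses_host_page_tables)
      return 1;  /* Nothing to gain from pinning on such systems.  */

    /* Pin the existing allocation in place; no separate cuMemHostAlloc
       buffer is needed.  */
    return cuMemHostRegister (ptr, size, 0) == CUDA_SUCCESS;
  }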

* * *

The design adds a device independent plugin API for allocating pinned memory, and then implements it for NVPTX.  At present, the other supported devices do
not have equivalent capabilities (or requirements).

Once they do, the current scheme will have to be fixed, because it
fails hard if two plugins both provide that function.

    * libgomp_g.h (GOMP_enable_pinned_mode): New prototype.

This seems to be a leftover and is currently unused.

Oops.

    * testsuite/libgomp.c/alloc-pinned-1.c: Change expectations for NVPTX
    devices.

For completeness: The tests assume that either no Nvidia GPU
is available or the default device is an Nvidia GPU.
Otherwise, offload_device_nvptx will be false but
cuMemHostAlloc is still called.

However, that shouldn't be an issue most of the time as
libgomp first checks for CUDA devices - and the default
device becomes the first device found.

In that case, I think we might have bigger problems trying to test on systems that have both flavours of GPU. I have not tested that configuration, but I have run the tests on both AMD and NVPTX independently.

* * *

@@ -121,18 +199,19 @@ linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,

+
+manual_realloc:;
+  void *newaddr = linux_memspace_alloc (memspace, size, pin, false);
+  if (newaddr)
+    {
+      memcpy (newaddr, addr, oldsize < size ? oldsize : size);
+      linux_memspace_free (memspace, addr, oldsize, oldpin);
+    }
OpenMP states:
"If 'size' is zero, a memory-reallocating routine returns NULL
and the old allocation is deallocated. If size is not zero,
the old allocation will be deallocated if and only if the
routine returns a non-null value."

My impression is that the code above does not handle this correctly.

I have actually not checked whether 2/2 handles this correctly,
either - thus, please re-check that part as well.

OK, I will review realloc in both patches.
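Presumably the manual path needs an explicit size == 0 case ahead of it; something along these lines, perhaps (untested sketch - the exact placement inside linux_memspace_realloc still needs checking):

  if (size == 0)
    {
      /* Zero-size realloc: deallocate the old allocation and return NULL,
         as the OpenMP spec requires.  */
      linux_memspace_free (memspace, addr, oldsize, oldpin);
      return NULL;
    }

  void *newaddr = linux_memspace_alloc (memspace, size, pin, false);
  if (newaddr)
    {
      memcpy (newaddr, addr, oldsize < size ? oldsize : size);
      /* Deallocate the old allocation only because the new one succeeded,
         satisfying the "if and only if" rule.  */
      linux_memspace_free (memspace, addr, oldsize, oldpin);
    }
  return newaddr;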

* * *

Question: How many gomp_debug calls exist for a successful
linux_memspace_calloc that uses cuMemHostRegister?

Answer: 11 - if I have not miscounted. I think that is
excessive, and a large part of the debugging output is
utterly useless unless one is implementing the feature of
this patch - and testing it.

The most absurd is the following one, which is always
shown:

+         gomp_debug (0, "  using_device=%d, using_device_old=%d\n",
+                     using_device, using_device_old);
+         assert (using_device_old == -1
+                 /* We shouldn't have concurrently changed our mind.  */
+                 || using_device_old == using_device);

But I also find

+      for (int i = 0; i < num_devices; ++i)
+       {
+         gomp_debug (0, "  i=%d, target_id=%d\n",
+                     i, devices[i].target_id);

rather useless.

Thomas, do you have an opinion about these? I believe you added them?

And when calling gomp_page_locked_host_alloc, there are 3
gomp_debug calls in this function alone, and calling
   get_device_for_page_locked ()
adds another 3 (!) - and that is for an already-initialized
device; on first use you will get even more.

OK, that's 6 - you get another 2 in the plugin, another 3 in
linux_memspace_alloc, and linux_memspace_calloc adds one more -
making it 12, not 11 as I claimed above.

Can this be trimmed to something sensible for users?

Especially with NVPTX, we already output much more than is
usually sensible, and I see no value in adding to the clutter;
saving a few conditional jumps does not hurt either, not that
it amounts to much next to all the locking and other overhead
in this code path.
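If we do decide to trim them, one option for the loop quoted above would be to drop the per-iteration line and emit a single summary once the device has been chosen, roughly like this (untested; variable names taken from the quoted hunks):

      gomp_debug (0, "  scanned %d devices; using device %d for page-locked"
                  " memory\n", num_devices, using_device);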

* * *

Otherwise, it seems to be fine.

Tobias

I should be able to repost on Monday, or so.

Andrew
