Andrew Stubbs wrote:
Use Cuda to pin memory, instead of Linux mlock, when available.
There are two advantages: firstly, this gives a significant speed boost for
NVPTX offloading, and secondly, it side-steps the usual OS ulimit/rlimit
setting.
Slightly offtopic remark:
I think it 'only' helps with performance when
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES
is false - and the same effect can also be achieved by using
cuMemHostRegister as an alternative. I wrote 'only' because the
only two systems I know of that have this attribute set are a
Grace-Hopper GH100 system and ORNL's former Summit system.
* * *
The design adds a device independent plugin API for allocating pinned memory,
and then implements it for NVPTX. At present, the other supported devices do
not have equivalent capabilities (or requirements).
Once they do, the current scheme will have to be fixed, because it
fails hard if two plugins both provide that function.
* libgomp_g.h (GOMP_enable_pinned_mode): New prototype.
This seems to be a leftover and is currently unused.
* testsuite/libgomp.c/alloc-pinned-1.c: Change expectations for NVPTX
devices.
For completeness: the tests assume that either no Nvidia GPU
is available or that the default device is an Nvidia GPU.
Otherwise, offload_device_nvptx will be false but
cuMemHostAlloc will still be called.
However, that shouldn't be an issue most of the time, as
libgomp checks for CUDA devices first - and the first device
found becomes the default device.
* * *
@@ -121,18 +199,19 @@ linux_memspace_realloc (omp_memspace_handle_t memspace,
void *addr,
+
+manual_realloc:;
+ void *newaddr = linux_memspace_alloc (memspace, size, pin, false);
+ if (newaddr)
+ {
+ memcpy (newaddr, addr, oldsize < size ? oldsize : size);
+ linux_memspace_free (memspace, addr, oldsize, oldpin);
+ }
OpenMP states:
"If 'size' is zero, a memory-reallocating routine returns NULL
and the old allocation is deallocated. If size is not zero,
the old allocation will be deallocated if and only if the
routine returns a non-null value."
My impression is that the code above does not handle this correctly.
I have not actually checked whether 2/2 handles this correctly
either - thus, please re-check that part as well.
* * *
Question: How many gomp_debug calls exist for a successful
linux_memspace_calloc that uses cuMemHostRegister?
Answer: 11 - if I have not miscounted. I think that's
excessive, and much of the debugging output is utterly
useless unless one is implementing the feature of this
patch - and testing it.
The most absurd is the following one, which is always
shown:
+ gomp_debug (0, " using_device=%d, using_device_old=%d\n",
+ using_device, using_device_old);
+ assert (using_device_old == -1
+ /* We shouldn't have concurrently changed our mind. */
+ || using_device_old == using_device);
But I also find
+ for (int i = 0; i < num_devices; ++i)
+ {
+ gomp_debug (0, " i=%d, target_id=%d\n",
+ i, devices[i].target_id);
rather useless.
And when calling gomp_page_locked_host_alloc, there are 3
gomp_debug calls in this function alone, and calling
get_device_for_page_locked ()
adds another 3 (!) - and that's for an already-initialized
device; for the first use, you will get even more.
OK, that's 6 - you get another 2 in the plugin and
another three in linux_memspace_alloc, while
linux_memspace_calloc adds one more - making it 12,
not 11 as I claimed above.
Can this be trimmed to something sensible for users?
Especially with nvptx, we already output much more than is
usually sensible - and I think there is no value in
increasing the clutter. Saving some conditional jumps also
does no harm, not that it will matter much next to all the
locking and other overhead in this code path.
* * *
Otherwise, it seems to be fine.
Tobias