@Thomas, there are questions for you below....
On 22/11/2023 17:07, Tobias Burnus wrote:
A note up front: starting with TR11, alias OpenMP 6.0, OpenMP supports handling
multiple devices for allocation. It seems that after:
my_memspace = omp_get_device_and_host_memspace (5, omp_default_mem_space);
my_alloc = omp_init_allocator (my_memspace, ntraits, my_traits_with_pinning);
pinning via device '5' should be possible.
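(As an aside, a minimal sketch of what such usage could look like under TR11,
assuming the names above and a single 'pinned' trait; the exact API may still
change, and none of this is implemented in libgomp yet:

  #include <stddef.h>
  #include <omp.h>

  /* Hypothetical TR11 usage, sketch only.  */
  void *
  alloc_pinned_for_device_5 (size_t n)
  {
    omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
    /* Memory space reachable from both the host and device 5.  */
    omp_memspace_handle_t ms
      = omp_get_device_and_host_memspace (5, omp_default_mem_space);
    omp_allocator_handle_t al = omp_init_allocator (ms, 1, traits);
    return omp_alloc (n, al);
  }
)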
This patch is intended to get us to 5.0. These are future features for
future patches.
However, I believe that it shouldn't really matter for now, given that CUDA
has no special handling of the NUMA hierarchy, neither on the host nor for
specific devices, and GCN has none at all.
It only becomes interesting if mmap/mlock memory is (measurably) faster than
CUDA-allocated memory when accessed from the host or, for USM, from GCN.
I don't believe there's any issue here (yet).
Let's start with the patch itself:
--- a/libgomp/target.c
+++ b/libgomp/target.c
...
+static struct gomp_device_descr *
+get_device_for_page_locked (void)
+{
+ gomp_debug (0, "%s\n",
+ __FUNCTION__);
+
+ struct gomp_device_descr *device;
+#ifdef HAVE_SYNC_BUILTINS
+ device
+ = __atomic_load_n (&device_for_page_locked, MEMMODEL_RELAXED);
+ if (device == (void *) -1)
+ {
+ gomp_debug (0, " init\n");
+
+ gomp_init_targets_once ();
+
+ device = NULL;
+ for (int i = 0; i < num_devices; ++i)
Given that this function just sets a single variable based on whether the
page_locked_host_alloc_func function pointer exists, wouldn't it be much
simpler to do all this handling in gomp_target_init?
@Thomas, care to comment on this?
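(For concreteness, a sketch of how that might look, reusing the patch's names
inside the existing gomp_target_init; hypothetical, not tested:

  /* In gomp_target_init, after the plugins have been loaded:  */
  for (int i = 0; i < num_devices; ++i)
    if (devices[i].target_id == 0
        && devices[i].page_locked_host_alloc_func)
      {
        device_for_page_locked = &devices[i];
        break;
      }

That would avoid the lazy-initialization atomics in
get_device_for_page_locked entirely.)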
+ for (int i = 0; i < num_devices; ++i)
...
+ /* We consider only the first device of potentially several of the
+ same type as this functionality is not specific to an individual
+ offloading device, but instead relates to the host-side
+ implementation of the respective offloading implementation. */
+ if (devices[i].target_id != 0)
+ continue;
+
+ if (!devices[i].page_locked_host_alloc_func)
+ continue;
...
+ if (device)
+ gomp_fatal ("Unclear how %s and %s libgomp plugins may"
+ " simultaneously provide functionality"
+ " for page-locked memory",
+ device->name, devices[i].name);
+ else
+ device = &devices[i];
I find this a bit inconsistent: if, let's say, GCN does not provide its own
pinning, the code assumes that CUDA pinning is just fine. However, if both
support it, CUDA pinning suddenly is not fine for GCN.
I think it means that we need to revisit this code if that situation
ever occurs. Again, @Thomas?
Additionally, the wording suggests that, for CUDA, it does not matter for
which device's access we want to optimize the pinning. But the code above
also fails if I have a system with two Nvidia cards. From the wording, it
sounds as if just checking whether device->type differs would do.
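(That is, something along these lines, again reusing the patch's names;
sketch only:

  if (device && device->type != devices[i].type)
    gomp_fatal ("Unclear how %s and %s libgomp plugins may"
                " simultaneously provide functionality"
                " for page-locked memory",
                device->name, devices[i].name);
  else if (!device)
    device = &devices[i];

i.e. only genuinely different device types conflict; several devices handled
by the same plugin share one host-side implementation anyway.)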
But all in all, I wonder whether it wouldn't be much simpler to state
something like the following (where applicable):
The first device that provides pinning support is used; the assumption is
that all other devices and the host can access this memory without a
measurable performance penalty compared to a normal page lock, and that
neither multiple device types nor host/device NUMA-aware pinning support in
the plugin is available.
NOTE: For OpenMP 6.0's OMP_AVAILABLE_DEVICES environment variable and
device-set memory spaces, this might need to be revisited.
This seems reasonable to me, at least until the user can specify otherwise.
(I'm going to go look at the other review points now....)
Andrew