Andrew Stubbs wrote:
On 05/12/2025 13:50, Tobias Burnus wrote:
Andrew Stubbs wrote:
On 28/11/2025 12:16, Andrew Stubbs wrote:
This patch extends omp_target_is_accessible to check the actual device
status for the memory region, on amdgcn and nvptx devices (rather than
just checking if shared memory is enabled).
I have to admit that I am not completely happy
with this patch, but to move forward:
I think we all agree to defer the following:
* Permitting a call to omp_target_is_accessible on the device
[OpenMP 6 feature]
* Assuming that memory might not be host accessible, which
affects the result when checking for the host, or when the
devices could only access the memory thanks to USM.
The question is only whether to also defer the following:
* Handling checks for device-only memory.
That's:
void *ptr = omp_target_alloc (size, dev_num);
This is device-only memory that is accessible by a specific
device. On one hand, the device itself should know that it
owns this memory - on the other hand, it violates the
"is accessible by the host" assumption.
The current patch handles this properly with Nvidia devices
but for AMD it returns false (unless USM is used).
* * *
I think it makes more sense to remove the check here - and
handle it in the plugin. (More to this later.)
We still need it for the case where the plugin does not provide the
"is_shared_ptr" API (maybe there's one out there?), so I've simply
swapped them around so that the plugin call comes first, and made it so
that "false" from the plugin does not mean "check for other opinions".
Makes sense.
* * *
I have now implemented this so that positive values are "accessible"
(possibly with extra information), and zero or negative values are
"inaccessible" (possibly with a reason or caveat given).
So, 0 and 1 are clear,
Possible negative responses:
* "accessible only if extra checks pass", but assume inaccessible
otherwise. (Your "2" above).
* "partially, but not wholly accessible".
Possible extra positive responses:
* "accessible to device, but only device", which is ignorable
information.
* "accessible but slow/not recommended"
This seems more future proof than your scheme above?
I think it is effectively the same, as I had not really specified
what it means; I just tried to illustrate how it could be used.
The only question is whether we need a boolean argument to discriminate
between checking whether the specified device can access the
pointer vs. whether that device knows that the host cannot access
that memory.
* * *
Regarding 'omp_target_alloc', here is a full testcase. As mentioned,
on AMD GCN, it fails (unless USM) even though it is device memory.
On Nvidia, it works.
//-------------------
#include <omp.h>

int main()
{
  void *ptr = omp_target_alloc (sizeof(int) * 10, omp_default_device);
  if (!ptr)
    __builtin_abort ();  // valid but unexpected
  int accessible = omp_target_is_accessible (ptr, sizeof(int) * 10,
                                             omp_default_device);
  __builtin_printf ("-> %d\n", accessible);
  if (!accessible)
    __builtin_abort ();  // should be accessible
}
//-------------------
Similar, but even less common, would be code like:
//-------------------
int *ptr;
#pragma omp declare target local(ptr)  // or: enter(ptr)

int *my_ptr = nullptr;
#pragma omp target map(from: my_ptr)
my_ptr = malloc (...);
//...
if (omp_target_is_accessible (my_ptr, sizeof (int), omp_default_device))
  // ...
//-------------------
A long version of that testcase is:
-----------------------------------
#include <omp.h>

void check (int dev)
{
  constexpr int N = 10;
  constexpr int size = N * sizeof(int);
  int A[N] = {};
  void *ptr = omp_target_alloc (size, dev);
  if (ptr == nullptr || !omp_target_is_accessible (ptr, size, dev))
    __builtin_abort ();
  #pragma omp target device(dev) firstprivate(ptr)
  for (int i = 0; i < N; i++)
    ((int *) ptr)[i] = i + 1;
  if (omp_target_memcpy (A, ptr, size, 0, 0, omp_initial_device, dev) != 0)
    __builtin_abort ();
  for (int i = 0; i < N; i++)
    if (A[i] != i + 1)
      __builtin_abort ();
  omp_target_free (ptr, dev);
}

int main ()
{
  check (omp_default_device);
  // Device number omp_get_num_devices() is the host (initial) device.
  for (int dev = 0; dev <= omp_get_num_devices (); dev++)
    check (dev);
}
-----------------------------------
* * *
That one is fixed by:
+++ b/libgomp/plugin/plugin-gcn.c
@@ -238,2 +238,5 @@ struct hsa_runtime_fn_info
size_t attribute_count);
+ hsa_status_t (*hsa_amd_pointer_info_fn)
+ (const void *, hsa_amd_pointer_info_t *, void *(*)(size_t),
+ uint32_t *, hsa_agent_t **);
};
@@ -1500,2 +1503,3 @@ init_hsa_runtime_functions (void)
DLSYM_OPT_FN (hsa_amd_svm_attributes_get)
+ DLSYM_OPT_FN (hsa_amd_pointer_info)
return true;
@@ -3505,3 +3509,3 @@ gcn_exec (struct kernel_info *kernel,
-#if 0 /* TODO: Use to enable self-mapping/USM automatically. */
+
/* FIXME: The auto-self-map feature depends on still mapping 'declare target'
@@ -3558,3 +3562,2 @@ is_integrated_apu (struct agent_info *agent, bool check_xnack)
}
-#endif
@@ -5269,3 +5272,4 @@ GOMP_OFFLOAD_is_shared_ptr (int device, const void *ptr, size_t size)
|| device < 0 || device > hsa_context.agent_count
- || !hsa_fns.hsa_amd_svm_attributes_get_fn)
+ || (!hsa_fns.hsa_amd_svm_attributes_get_fn
+ && !hsa_fns.hsa_amd_pointer_info_fn))
return false;
@@ -5274,2 +5278,37 @@ GOMP_OFFLOAD_is_shared_ptr (int device, const void *ptr, size_t size)
+  if (hsa_fns.hsa_amd_pointer_info_fn)
+    {
+      hsa_amd_pointer_info_t info;
+      uint32_t nagents;
+      hsa_agent_t *agents;
+      info.size = sizeof (hsa_amd_pointer_info_t);
+
+      hsa_status_t status = hsa_fns.hsa_amd_pointer_info_fn (ptr, &info, NULL,
+                                                             &nagents, &agents);
+      if (status == HSA_STATUS_SUCCESS
+          && info.type != HSA_EXT_POINTER_TYPE_UNKNOWN)
+        {
+          /* Owns the pointer; can be true even for nagents == 0.  */
+          if (agent->id.handle == info.agentOwner.handle)
+            return info.sizeInBytes >= size;
+          for (unsigned i = 0; i < nagents; i++)
+            if (agent->id.handle == agents[i].handle)
+              return info.sizeInBytes >= size;
+          if (info.type != HSA_EXT_POINTER_TYPE_LOCKED)
+            return false;  /* Not host memory and belonging to other agents.  */
+        }
+
+      /* Assume memory is host accessible.  */
+      bool svm_accessible;
+      hsa_system_info_t type = HSA_AMD_SYSTEM_INFO_SVM_ACCESSIBLE_BY_DEFAULT;
+      hsa_status_t status2 = hsa_fns.hsa_system_get_info_fn (type,
+                                                             &svm_accessible);
+      if (status2 == HSA_STATUS_SUCCESS && svm_accessible)
+        return true;
+      if (is_integrated_apu (agent, /* xnack */ true
+                             /* FIXME: pass !(HSA_AMD_SYSTEM_INFO_XNACK_ENABLED)
+                                status here?  */))
+        return true;
+
+      /* FIXME: Will the following provide additional 'true' cases or not?  */
+      if (!hsa_fns.hsa_amd_svm_attributes_get_fn)
+        return false;
+
 /* The HSA API doesn't seem to report for the whole range given, so we call
---------------------------------------------------
I think we're running into feature creep here.
Sorry. That's a generic issue with OpenMP, but this feature
has been present from the beginning. Checking for access to
managed/pinned or USM memory is probably more useful, but this
might be useful at times as well.
* * *
* * *
For Nvidia, while it somehow works:
(A) I think we should run it on the right device,
i.e.
CUcontext old_ctx;
CUDA_CALL_ERET (false, cuCtxPushCurrent, ptx_dev->ctx);
....
CUDA_CALL_ASSERT (cuCtxPopCurrent, &old_ctx);
(B) I wonder whether it shouldn't be instead:
CU_POINTER_ATTRIBUTE_DEVICE_POINTER
I assume that this will also deal with USM, but I have not
checked whether an additional USM check would make sense
(similar to the AMD part above) or whether that's already
covered this way.
Likewise.
Likewise what?
Your check currently uses the "current" device, which might
be the wrong one if there are multiple Nvidia devices.
Admittedly, for USM and managed memory, it likely does not
matter, but for device-allocated memory it is surely wrong.
Thus, I think this is more a correctness topic than feature
creep.
* * *
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
...
@@ -353,6 +353,8 @@ struct ptx_device
static struct ptx_device **ptx_devices;
+static bool using_usm = false;
...
@@ -1906,6 +1913,45 @@ GOMP_OFFLOAD_managed_free (int ord, void *ptr)
+int
+GOMP_OFFLOAD_is_accessible_ptr (int device __attribute__((unused)),
+ const void *ptr, size_t size)
+{
+ /* USM implies access. */
+ if (using_usm)
+ return 1;
I wonder whether this shouldn't check
+ r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+ CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS,
+ dev);
for the specified device. This would permit the check to work
also without 'requires unified_shared_memory/self_maps', and
also when not all devices use USM.
* * *
+++ b/libgomp/testsuite/lib/libgomp.exp
+# return 1 if OpenMP Unified Shared Memory is supported by offload devices
+
+proc check_effective_target_omp_usm { } {
A newer version that fixes the nvptx handling was also part of the
gfx908 cleanup patch, which has been committed to mainline as:
r16-5986-g1cf9fda4936de5
amdgcn: Adjust failure mode for gfx908 USM
* * *
Thus, we are left with:
* Two deferred items (unchanged, presumably we want to update the PR).
* What to do about 'omp_target_alloc' and similarly allocated memory
* RFC whether to pass a boolean to the plugin function to
be future proof
* Checking PAGEABLE_MEMORY_ACCESS for the passed device num
plus using the right device (→ device context) for the check.
(And, obviously, exclude the libgomp.exp change that has already landed.)
Otherwise, LGTM.
Tobias