On 06/11/2025 23:10, Tobias Burnus wrote:
Andrew Stubbs wrote:
This adds support for using Cuda Managed Memory with omp_alloc. AMD support
will be added in a future patch.
There is one new predefined allocator, "ompx_gnu_managed_mem_alloc", plus a
corresponding memory space, which can be used to allocate memory in the
"managed" space.
Background – as generic information for patch readers.
Managed memory – at least as implemented by Nvidia – migrates to the
device on the first access from the device side, and back at the end.
With some older systems, accessing the memory on the host while the
kernel was still running on the device would segfault; with newer
systems, it would just migrate back [in my understanding].
With newer cards (post Volta) and the open kernel driver, the
behavior is similar without managed memory: if a page fault occurs,
the memory migrates between host and device.
Still, there can be differences: On Grace-Hopper, it seems as
if managed memory just migrates on the first device memory access,
but 'malloc'ed memory stays in place and only migrates after its
memory page has been accessed 256 times (a configurable value). With GH,
the device can read the host memory and vice versa, but both still
have their own memory controller, so one memory is closer.
* * *
The nvptx plugin is modified to make the necessary Cuda calls, via two new
(optional) plugin interfaces.
Compared to the previous version posted (Summer 2024), this renames
"unified shared memory" to use "managed memory", which more closely
describes what this really is, and removes all the elements that
attempted to use managed memory to implement USM. I've also added
Fortran and C++ testcases, and documentation.
* * *
gcc/fortran/ChangeLog:
* openmp.cc (is_predefined_allocator): Use GOMP_OMP_PREDEF_ALLOC_MAX
and GOMP_OMPX_PREDEF_ALLOC_MIN/MAX instead of hardcoded values.
Actually, this is only a comment change; I think it would be useful
if the ChangeLog wording made that clear.
Done.
* * *
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
+ CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY = 83,
In CUDA there is:
CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY = 83,
CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS = 89,
My old laptop had only the former, which behaves as described:
when a kernel runs and one tries to access the memory on the host
concurrently, there is a fatal page fault on the host side and the
program fails. Newer systems have both set. Looking through the
CU_DEVICE_ATTR list of our GPUs, I only found one that supports
the former but not the latter; namely, a
GK210GL [Tesla K80] (rev a1) (Kepler, sm_37).
In principle, we should state that caveat in the documentation;
however, the number of such old Nvidia cards is meanwhile very low,
and as it is a generic issue with managed memory on those cards,
it probably makes sense to just sweep those details under the carpet.
Still, you should consider either adding this enum value
as well (for completeness) – or just leaving those attributes out.
I (actually Thomas, I think) added the values we need, and no more. I
can't work out what you're asking for here?
* * *
+/* Predefined memspace value ranges. */
+#define GOMP_OMP_PREDEF_MEMSPACE_MAX 4
+#define GOMP_OMPX_PREDEF_MEMSPACE_MIN 200
+#define GOMP_OMPX_PREDEF_MEMSPACE_MAX 200
You need to add the new allocator (and also
ompx_gnu_pinned_mem_alloc) and the new
memspace to libgomp/env.c's parse_allocator
Done.
Plus: add some static asserts there to ensure
that we won't miss the update in the future.
Done.
Background for this: You can create a default
allocator via the OMP_ALLOCATOR environment variable:
https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fALLOCATOR.html
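For example, with this patch one can then write (following the documented
OMP_ALLOCATOR syntax; illustrative values):
  OMP_ALLOCATOR=ompx_gnu_managed_mem_alloc
or
  OMP_ALLOCATOR=ompx_gnu_managed_mem_space:alignment=64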
* * *
linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
size_t oldsize, size_t size, int oldpin, int pin)
{
Don't we need to update this interface to handle both the new and the
old memspace?
We don't support realloc for managed memory, so this is moot. I would
expect omp_realloc to handle this case at that level (via the fallback
alloc-and-copy code); the code is rather hard to read, but you're
probably right that there's a pre-existing bug there (for another day).
At least I fail to see where we handle 'free' vs. 'cuFree' if the user
does:
void *ptr = omp_alloc (sizeof(int)*10, omp_default_mem_alloc);
void *ptr2 = omp_alloc (sizeof(int)*10, ompx_gnu_pinned_mem_alloc);
...
ptr = omp_realloc (ptr, sizeof(int)*20, ompx_gnu_managed_mem_alloc,
omp_null_allocator);
ptr2 = omp_realloc (ptr2, sizeof(int)*20, ompx_gnu_managed_mem_alloc,
omp_null_allocator);
...
omp_free (ptr, omp_null_allocator);
omp_free (ptr2, omp_null_allocator);
Passing a different allocator to realloc than the one used to allocate
is surely either an error defined in the standard, or UB?
Regardless, omp_realloc ignores the parameter and uses the allocator
recorded at allocation time. Same for omp_free.
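(For reference: libgomp stores the allocator in a small header just in
front of the pointer returned to the user – roughly, simplified from
libgomp/allocator.c:

  struct omp_mem_header
  {
    void *ptr;                        /* Start of the real allocation.  */
    size_t size;
    omp_allocator_handle_t allocator; /* Allocator used at alloc time.  */
    void *pad;
  };

omp_free and the free side of omp_realloc read
((struct omp_mem_header *) ptr)[-1].allocator rather than trusting the
allocator argument.)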
* * *
- if (oldpin && pin)
+ if (memspace == ompx_gnu_managed_mem_space)
+ /* Realloc is not implemented for device Managed Memory. */
+ ;
+ else if (oldpin && pin)
* * *
It seems as if we should return omp_null_allocator when mixing
that memory space with pinned? Cf. https://gcc.gnu.org/PR122590
One way would be to add a note there and handle it as part of
fixing that PR.
I think managed memory is probably effectively pinned (as in, it's not
going to get swapped).
* * *
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
> For the memory spaces, the following applies:
+@item @code{ompx_gnu_managed_mem_space} is a GNU extension that provides
+ managed memory accessible by both host and device; it is only available
+ on supported offload targets (see @ref{Offload-Target Specifics}).
+ This memory is accessible by both the host and the device at the same
+ address, but it need not be mapped with @code{map} clauses. Instead,
+ use the @code{is_device_ptr} clause or @code{has_device_addr} clause
+ to indicate that the pointer is already accessible on the device.
The current implementation does:
* If numa/memkind is active, those are used
* If the default device is the host, gomp_managed_alloc returns
NULL, invoking the fallback behavior (default would be the default
mem allocator, i.e. 'malloc' - ignoring all traits, including
the alignment trait)
* If USM is supported by the default device, it uses 'malloc' -
honoring at least the alignment trait.
* Otherwise, if the default device's plugin supports managed memory,
it uses managed memory.
* If it doesn't, it returns NULL - triggering the fallback behavior.
Yes, I think this is accurate.
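(If a program wants to notice that fallback rather than silently getting
'malloc' memory, a sketch using only standard allocator traits would be:

  omp_alloctrait_t traits[] = { { omp_atk_fallback, omp_atv_null_fb } };
  omp_allocator_handle_t al
    = omp_init_allocator (ompx_gnu_managed_mem_space, 1, traits);
  void *p = omp_alloc (1024, al);  /* NULL when managed memory is
                                      unavailable, instead of malloc.  */
  ...
  omp_destroy_allocator (al);
)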
* * *
Remarks:
(A) The following seems to have a memory leak:
// omp_set_default_device (0);
// assume that's an Nvidia GPU without GOMP_OFFLOAD_CAP_SHARED_MEM
void *ptr = omp_alloc (1024, ompx_gnu_managed_mem_space);
// ptr = cuMemAllocManaged
omp_set_default_device (omp_initial_device);
omp_free (ptr); // ignored
omp_free uses the allocator recorded at allocation time, not the value
passed in, so it should Just Work. Did I miss something?
(B) Likewise:
Assuming an Nvidia GPU and an AMD GPU and
no GOMP_OFFLOAD_CAP_SHARED_MEM
omp_set_default_device (my_Nvidia_GPU);
void *ptr = omp_alloc (1024, ompx_gnu_managed_mem_space);
// ptr = cuMemAllocManaged
omp_set_default_device (my_AMD_GPU);
omp_free (ptr); // ignored
Again, I don't see a problem here.
(C) Assuming one of them has GOMP_OFFLOAD_CAP_SHARED_MEM
and the other does not, we could construct call mismatches like
malloc + cuFree or cuMemAllocManaged + free,
but currently that's not (yet) possible.
And, again, omp_free wouldn't fall for it anyway.
* * *
(D) As remarked, for Grace-Hopper there is a difference between
'malloc'/System memory and 'cudaMallocManaged'/Managed memory.
For managed memory, a memory page that resides in host memory
migrates to memory close to the GPU on the first access.
For system memory, the memory remains in host memory – however,
after the memory has been accessed a couple of times,* it moves to the
device.
[* = "By default, N_THRESHOLD = 256, for each 2MB region
(sysadmin configuration)"]
Thus, the question is whether GCC really should ignore
ompx_gnu_managed_mem_space if USM is available.
This at least needs to be documented in the description.
We need to document that it does what it says, unconditionally?
Cf. https://www.fz-juelich.de/en/ias/jsc/news/events/2024/harnessing-integrated-cpu-gpu-system-memory-for-hpc-a-first-look-into-grace-hopper/20241007_presentation_gh_jsc.pdf
Especially page 22 (labelled "23"), but in the end the whole document
is about this topic.
* * *
Coming back to the device thing:
* Currently, there is nothing device specific in terms of
allocating device memory. Prefetching would be possible, or
cuMemAdvise, or …
But currently it is just memory reserved or allocated on the host.
* The way it can be allocated depends on the device type, not
on the device.
* Whether the device type host vs. nvptx vs gcn is used
depends on omp_get_default_device() at the time of the
allocation/deallocation.
Agreed.
* OpenMP 6 adds to the traits:
- access = 'all' (OpenMP 5.x's 'all' is now 'device')
- preferred_device = <device num>
and, for the memspace, it has some routines like
omp_get_devices_memspace that return a memory space
accessible from a set of devices.
I probably should have said that this was written for the current OpenMP
5.1-ish-level support. Moving to 6.0 will be another project I'm not
attempting here.
The question is how to handle the host vs. nvidia vs. gcn
thing (including a later per-device setting of the USM
capability)
(i) For user-defined allocators
(ii) For ompx_gnu_managed_mem_space
For user-defined allocators, I would be inclined to save
the current default device in the descriptor – as we
presumably need it later anyway (default device), albeit
we might later want to use it for prefetching as well?
Taking the value at the time the allocator is constructed
is also, kind of, sensible.
The nature of Cuda Managed Memory is that it is not device specific, as
I understand it; if you have multiple NVidia devices then the page will
migrate to whichever needs it, on the fly.
However, for ompx_gnu_managed_mem_space, we don't really have
a place to store the default device number. We could handle it
similarly to pinned memory – where the offload devices are walked
and the first one supporting it is picked. This currently
would work fine (only Nvidia is supported), but as soon as
AMD GPU support for managed memory is added, it will fail.
Yes, heterogeneous systems are potentially a problem (and not just for
this feature). I don't think I've made any (new) design decisions here
that would prevent someone implementing magic multi-GPU support in
future, if anyone ever has time/money/interest in doing so. In the
meantime I think we should recommend a) Don't Do That, and b) if you do,
do so carefully.
Additionally, it will then depend on the order of device
walking which is in theory not defined. (It is a configure-time
decision, and I think all distros build nvptx then gcn for historic
reasons. – Additionally, mixed AMD/Nvidia systems are somewhat rare)
Thankfully, rare.
* * *
In any case, we need to somehow sensibly solve the device/device-type
picking issue.
And, in light of systems like Grace-Hopper, we have to (re)consider
whether on USM systems, malloc or cuMemAllocManaged will be called
for the new managed memspace/allocator + document what happens.
IMO, if the program asks for managed memory then it should get managed
memory. We can choose to have managed memory "not supported" on devices
where it's not profitable, in which case the fallback allocator would be
used. This is for a future patch though.
* * *
[AMD GPUs]> +@item Managed memory allocated using @code{omp_alloc} with the
+ @code{ompx_gnu_managed_mem_alloc} allocator is not currently supported
+ on AMD GPU devices.
[This implies a fallback is used, unless USM is available, in which case
malloc. (Just an observation)]
Yes.
The next comment also applied here.
[Nvidia GPUs]> +@item Managed memory allocated using @code{omp_alloc}
Can we change this to
"Memory allocated with the OpenMP @code{ompx_gnu_managed_mem_alloc}
allocator or in the @code{ompx_gnu_managed_mem_space} ..."
Reason:
- The memspace should be mentioned as well.
- It is clearly not only 'omp_alloc' but also 'omp_calloc'
or the 'allocate' clause - be it on 'parallel' or, in particular,
on the 'allocators' directive.
Done.
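For instance, with the allocate clause (illustrative; assuming the clause
is supported on the construct), no explicit omp_alloc call is involved:

  int x[256];
  #pragma omp parallel firstprivate(x) allocate(ompx_gnu_managed_mem_alloc: x)
  {
    /* Each thread's copy of x is allocated via the managed allocator.  */
  }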
+ memory in the CUDA Managed Memory space using @code{cuMemAllocManaged}.
+ This memory is accessible by both the host and the device at the same
+ address, but it need not be mapped with @code{map} clauses. Instead,
I don't understand the ", but" - I think " and" makes more sense.
I've gone with ", so".
(Aside: There is also nothing wrong, in principle, with mapping
this data – except that there is no need for it – and it is pointless
to use managed memory and still copy it around. Thus the rest can
remain there, why not.)
* * *
-void *
-GOMP_OFFLOAD_alloc (int ord, size_t size)
+static void *
+GOMP_OFFLOAD_alloc_1 (int ord, size_t size, bool managed)
{
IMHO, for consistency, this would better be named, e.g.,
nvptx_alloc_1, leaving the GOMP_OFFLOAD_ prefix identifier
space to functions that are actually exported. (But one can also
argue otherwise.)
We already have an nvptx_alloc function, and this one is a helper
function for GOMP_OFFLOAD_alloc, not the other one, so I think the name
is logical. However, I agree that the "namespace" is not ideal.
Thanks for the patch!
Tobias
My v2 patch is attached, with the env.c, documentation, and ChangeLog
changes requested. I also spotted I had "is_device_addr" instead of
"has_device_addr".
Is it OK now?
Andrew
From 4abfaedf3e9e33ce7931d56968c003b2879c3130 Mon Sep 17 00:00:00 2001
From: Andrew Stubbs <[email protected]>
Date: Fri, 28 Jun 2024 10:24:43 +0000
Subject: [PATCH v2] openmp, nvptx: ompx_gnu_managed_mem_alloc
This adds support for using Cuda Managed Memory with omp_alloc. AMD support
will be added in a future patch.
There is one new predefined allocator, "ompx_gnu_managed_mem_alloc", plus a
corresponding memory space, which can be used to allocate memory in the
"managed" space.
The nvptx plugin is modified to make the necessary Cuda calls, via two new
(optional) plugin interfaces.
gcc/fortran/ChangeLog:
* openmp.cc (is_predefined_allocator): Use GOMP_OMP_PREDEF_ALLOC_MAX
and GOMP_OMPX_PREDEF_ALLOC_MIN/MAX instead of hardcoded values in the
comment.
include/ChangeLog:
* cuda/cuda.h (cuMemAllocManaged): Add declaration and related
CU_MEM_ATTACH_GLOBAL flag and device attributes.
* gomp-constants.h (GOMP_OMPX_PREDEF_ALLOC_MAX): Update to 201.
(GOMP_OMP_PREDEF_MEMSPACE_MAX): New constant.
(GOMP_OMPX_PREDEF_MEMSPACE_MIN): New constant.
(GOMP_OMPX_PREDEF_MEMSPACE_MAX): New constant.
libgomp/ChangeLog:
* allocator.c (ompx_gnu_max_predefined_alloc): Update to
ompx_gnu_managed_mem_alloc.
(_Static_assert): Fix assertion messages.
(omp_max_predefined_mem_space): New define.
(ompx_gnu_min_predefined_mem_space): New define.
(ompx_gnu_max_predefined_mem_space): New define.
(predefined_ompx_gnu_alloc_mapping): Add ompx_gnu_managed_mem_space.
(omp_init_allocator): Add ompx_gnu_managed_mem_space validation.
* config/linux/allocator.c (linux_memspace_alloc): Add managed
memory space handling.
(linux_memspace_calloc): Likewise.
(linux_memspace_free): Likewise.
(linux_memspace_realloc): Likewise (returns NULL for fallback).
* env.c (parse_allocator): Add ompx_gnu_managed_mem_alloc and some
static asserts so I don't forget it again.
* libgomp-plugin.h (GOMP_OFFLOAD_managed_alloc): New declaration.
(GOMP_OFFLOAD_managed_free): New declaration.
* libgomp.h (gomp_managed_alloc): New declaration.
(gomp_managed_free): New declaration.
(struct gomp_device_descr): Add managed_alloc_func and
managed_free_func fields.
* libgomp.texi: Document ompx_gnu_managed_mem_alloc and
ompx_gnu_managed_mem_space, add C++ template documentation, and
describe NVPTX and AMD support.
* omp.h.in: Add ompx_gnu_managed_mem_space and
ompx_gnu_managed_mem_alloc enumerators, and gnu_managed_mem C++
allocator template.
* omp_lib.f90.in: Add Fortran bindings for new allocator and
memory space.
* omp_lib.h.in: Likewise.
* plugin/cuda-lib.def: Add cuMemAllocManaged.
* plugin/plugin-nvptx.c (nvptx_alloc): Add managed parameter to
support cuMemAllocManaged.
(GOMP_OFFLOAD_alloc): Move contents to ...
(GOMP_OFFLOAD_alloc_1): ... this new function, and add managed support.
(GOMP_OFFLOAD_managed_alloc): New function.
(GOMP_OFFLOAD_managed_free): New function.
* target.c (gomp_managed_alloc): New function.
(gomp_managed_free): New function.
(gomp_load_plugin_for_device): Load optional managed_alloc
and managed_free plugin APIs.
* testsuite/lib/libgomp.exp: Add check_effective_target_omp_managedmem.
* testsuite/libgomp.c++/alloc-managed-1.C: New test.
* testsuite/libgomp.c/alloc-managed-1.c: New test.
* testsuite/libgomp.c/alloc-managed-2.c: New test.
* testsuite/libgomp.c/alloc-managed-3.c: New test.
* testsuite/libgomp.fortran/alloc-managed-1.f90: New test.
Co-authored-by: Kwok Cheung Yeung <[email protected]>
Co-authored-by: Thomas Schwinge <[email protected]>
---
gcc/fortran/openmp.cc | 6 +--
include/cuda/cuda.h | 8 +++
include/gomp-constants.h | 7 ++-
libgomp/allocator.c | 22 ++++++--
libgomp/config/linux/allocator.c | 23 ++++++--
libgomp/env.c | 5 ++
libgomp/libgomp-plugin.h | 2 +
libgomp/libgomp.h | 4 ++
libgomp/libgomp.texi | 22 ++++++++
libgomp/omp.h.in | 6 +++
libgomp/omp_lib.f90.in | 4 ++
libgomp/omp_lib.h.in | 4 ++
libgomp/plugin/cuda-lib.def | 1 +
libgomp/plugin/plugin-nvptx.c | 32 +++++++++---
libgomp/target.c | 52 +++++++++++++++++++
libgomp/testsuite/lib/libgomp.exp | 9 ++++
.../testsuite/libgomp.c++/alloc-managed-1.C | 35 +++++++++++++
libgomp/testsuite/libgomp.c/alloc-managed-1.c | 28 ++++++++++
libgomp/testsuite/libgomp.c/alloc-managed-2.c | 38 ++++++++++++++
libgomp/testsuite/libgomp.c/alloc-managed-3.c | 44 ++++++++++++++++
.../libgomp.fortran/alloc-managed-1.f90 | 29 +++++++++++
21 files changed, 363 insertions(+), 18 deletions(-)
create mode 100644 libgomp/testsuite/libgomp.c++/alloc-managed-1.C
create mode 100644 libgomp/testsuite/libgomp.c/alloc-managed-1.c
create mode 100644 libgomp/testsuite/libgomp.c/alloc-managed-2.c
create mode 100644 libgomp/testsuite/libgomp.c/alloc-managed-3.c
create mode 100644 libgomp/testsuite/libgomp.fortran/alloc-managed-1.f90
diff --git a/gcc/fortran/openmp.cc b/gcc/fortran/openmp.cc
index f5db9a81ea6..bae4e479cf4 100644
--- a/gcc/fortran/openmp.cc
+++ b/gcc/fortran/openmp.cc
@@ -8359,9 +8359,9 @@ resolve_omp_udr_clause (gfc_omp_namelist *n, gfc_namespace *ns,
}
/* Assume that a constant expression in the range 1 (omp_default_mem_alloc)
- to 8 (omp_thread_mem_alloc) range, or 200 (ompx_gnu_pinned_mem_alloc) is
- fine. The original symbol name is already lost during matching via
- gfc_match_expr. */
+ to GOMP_OMP_PREDEF_ALLOC_MAX, or GOMP_OMPX_PREDEF_ALLOC_MIN to
+ GOMP_OMPX_PREDEF_ALLOC_MAX is fine. The original symbol name is already
+ lost during matching via gfc_match_expr. */
static bool
is_predefined_allocator (gfc_expr *expr)
{
diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 6be1ac0ab43..b0c075c7889 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -83,10 +83,17 @@ typedef enum {
CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41,
+ CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = 75,
+ CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = 76,
CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82,
+ CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY = 83,
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS = 88
} CUdevice_attribute;
+typedef enum {
+ CU_MEM_ATTACH_GLOBAL = 0x1
+} CUmemAttach_flags;
+
enum {
CU_EVENT_DEFAULT = 0,
CU_EVENT_DISABLE_TIMING = 2
@@ -254,6 +261,7 @@ CUresult cuMemAlloc (CUdeviceptr *, size_t);
#define cuMemAllocHost cuMemAllocHost_v2
CUresult cuMemAllocHost (void **, size_t);
CUresult cuMemHostAlloc (void **, size_t, unsigned int);
+CUresult cuMemAllocManaged (CUdeviceptr *, size_t, unsigned int);
CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
CUresult cuMemcpyPeer (CUdeviceptr, CUcontext, CUdeviceptr, CUcontext, size_t);
CUresult cuMemcpyPeerAsync (CUdeviceptr, CUcontext, CUdeviceptr, CUcontext, size_t, CUstream);
diff --git a/include/gomp-constants.h b/include/gomp-constants.h
index 963436aee56..db55b260053 100644
--- a/include/gomp-constants.h
+++ b/include/gomp-constants.h
@@ -395,7 +395,12 @@ enum gomp_map_kind
/* Predefined allocator value ranges. */
#define GOMP_OMP_PREDEF_ALLOC_MAX 8
#define GOMP_OMPX_PREDEF_ALLOC_MIN 200
-#define GOMP_OMPX_PREDEF_ALLOC_MAX 200
+#define GOMP_OMPX_PREDEF_ALLOC_MAX 201
+
+/* Predefined memspace value ranges. */
+#define GOMP_OMP_PREDEF_MEMSPACE_MAX 4
+#define GOMP_OMPX_PREDEF_MEMSPACE_MIN 200
+#define GOMP_OMPX_PREDEF_MEMSPACE_MAX 200
/* Predefined allocator with access == thread. */
#define GOMP_OMP_PREDEF_ALLOC_THREADS 8
diff --git a/libgomp/allocator.c b/libgomp/allocator.c
index 4a683d90bba..9f28c11df8b 100644
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -100,17 +100,28 @@ GOMP_is_alloc (void *ptr)
#define omp_max_predefined_alloc omp_thread_mem_alloc
#define ompx_gnu_min_predefined_alloc ompx_gnu_pinned_mem_alloc
-#define ompx_gnu_max_predefined_alloc ompx_gnu_pinned_mem_alloc
+#define ompx_gnu_max_predefined_alloc ompx_gnu_managed_mem_alloc
_Static_assert (GOMP_OMP_PREDEF_ALLOC_MAX == omp_thread_mem_alloc,
"GOMP_OMP_PREDEF_ALLOC_MAX == omp_thread_mem_alloc");
_Static_assert (GOMP_OMPX_PREDEF_ALLOC_MIN == ompx_gnu_min_predefined_alloc,
- "GOMP_OMP_PREDEF_ALLOC_MAX == omp_thread_mem_alloc");
+ "GOMP_OMPX_PREDEF_ALLOC_MIN == ompx_gnu_min_predefined_alloc");
_Static_assert (GOMP_OMPX_PREDEF_ALLOC_MAX == ompx_gnu_max_predefined_alloc,
- "GOMP_OMP_PREDEF_ALLOC_MAX == omp_thread_mem_alloc");
+ "GOMP_OMPX_PREDEF_ALLOC_MAX == ompx_gnu_max_predefined_alloc");
_Static_assert (GOMP_OMP_PREDEF_ALLOC_THREADS == omp_thread_mem_alloc,
"GOMP_OMP_PREDEF_ALLOC_THREADS == omp_thread_mem_alloc");
+#define omp_max_predefined_mem_space omp_low_lat_mem_space
+#define ompx_gnu_min_predefined_mem_space ompx_gnu_managed_mem_space
+#define ompx_gnu_max_predefined_mem_space ompx_gnu_managed_mem_space
+
+_Static_assert (GOMP_OMP_PREDEF_MEMSPACE_MAX == omp_max_predefined_mem_space,
+ "GOMP_OMP_PREDEF_MEMSPACE_MAX == omp_max_predefined_mem_space");
+_Static_assert (GOMP_OMPX_PREDEF_MEMSPACE_MIN == ompx_gnu_min_predefined_mem_space,
+ "GOMP_OMPX_PREDEF_MEMSPACE_MIN == ompx_gnu_min_predefined_mem_space");
+_Static_assert (GOMP_OMPX_PREDEF_MEMSPACE_MAX == ompx_gnu_max_predefined_mem_space,
+ "GOMP_OMPX_PREDEF_MEMSPACE_MAX == ompx_gnu_max_predefined_mem_space");
+
/* These macros may be overridden in config/<target>/allocator.c.
The defaults (no override) are to return NULL for pinned memory requests
and pass through to the regular OS calls otherwise.
@@ -155,6 +166,7 @@ static const omp_memspace_handle_t predefined_omp_alloc_mapping[] = {
};
static const omp_memspace_handle_t predefined_ompx_gnu_alloc_mapping[] = {
omp_default_mem_space, /* ompx_gnu_pinned_mem_alloc. */
+ ompx_gnu_managed_mem_space, /* ompx_gnu_managed_mem_alloc. */
};
#define ARRAY_SIZE(A) (sizeof (A) / sizeof ((A)[0]))
@@ -389,7 +401,9 @@ omp_init_allocator (omp_memspace_handle_t memspace, int ntraits,
struct omp_allocator_data *ret;
int i;
- if (memspace > omp_low_lat_mem_space)
+ if (memspace > omp_max_predefined_mem_space
+ && (memspace < ompx_gnu_min_predefined_mem_space
+ || memspace > ompx_gnu_max_predefined_mem_space))
return omp_null_allocator;
for (i = 0; i < ntraits; i++)
switch (traits[i].key)
diff --git a/libgomp/config/linux/allocator.c b/libgomp/config/linux/allocator.c
index f957bb3421a..26ca6316bdd 100644
--- a/libgomp/config/linux/allocator.c
+++ b/libgomp/config/linux/allocator.c
@@ -80,7 +80,9 @@ linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin,
{
void *addr = NULL;
- if (pin)
+ if (memspace == ompx_gnu_managed_mem_space)
+ addr = gomp_managed_alloc (size);
+ else if (pin)
{
int using_device = __atomic_load_n (&using_device_for_page_locked,
MEMMODEL_RELAXED);
@@ -155,7 +157,15 @@ linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin,
static void *
linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin)
{
- if (pin)
+ if (memspace == ompx_gnu_managed_mem_space)
+ {
+ void *ret = gomp_managed_alloc (size);
+ if (!ret)
+ return NULL;
+ memset (ret, 0, size);
+ return ret;
+ }
+ else if (pin)
return linux_memspace_alloc (memspace, size, pin, true);
else
return calloc (1, size);
@@ -165,7 +175,9 @@ static void
linux_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size,
int pin)
{
- if (pin)
+ if (memspace == ompx_gnu_managed_mem_space)
+ gomp_managed_free (addr);
+ else if (pin)
{
int using_device
= __atomic_load_n (&using_device_for_page_locked,
@@ -186,7 +198,10 @@ static void *
linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
size_t oldsize, size_t size, int oldpin, int pin)
{
- if (oldpin && pin)
+ if (memspace == ompx_gnu_managed_mem_space)
+ /* Realloc is not implemented for device Managed Memory. */
+ ;
+ else if (oldpin && pin)
{
int using_device
= __atomic_load_n (&using_device_for_page_locked,
diff --git a/libgomp/env.c b/libgomp/env.c
index f63a36afdd2..7efd707fc00 100644
--- a/libgomp/env.c
+++ b/libgomp/env.c
@@ -1231,6 +1231,10 @@ parse_affinity (bool ignore)
return false;
}
+/* These are reminders to add new allocators to parse_allocator. */
+_Static_assert (GOMP_OMP_PREDEF_ALLOC_MAX == omp_thread_mem_alloc);
+_Static_assert (GOMP_OMPX_PREDEF_ALLOC_MAX == ompx_gnu_managed_mem_alloc);
+
/* Parse the OMP_ALLOCATOR environment variable and return the value. */
static bool
parse_allocator (const char *env, const char *val, void *const params[])
@@ -1265,6 +1269,7 @@ parse_allocator (const char *env, const char *val, void *const params[])
C (omp_pteam_mem_alloc, false)
C (omp_thread_mem_alloc, false)
C (ompx_gnu_pinned_mem_alloc, false)
+ C (ompx_gnu_managed_mem_alloc, false)
C (omp_default_mem_space, true)
C (omp_large_cap_mem_space, true)
C (omp_const_mem_space, true)
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index f2baed9bad9..5b4704484dd 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -171,6 +171,8 @@ extern int GOMP_OFFLOAD_load_image (int, unsigned, const void *,
extern bool GOMP_OFFLOAD_unload_image (int, unsigned, const void *);
extern void *GOMP_OFFLOAD_alloc (int, size_t);
extern bool GOMP_OFFLOAD_free (int, void *);
+extern void *GOMP_OFFLOAD_managed_alloc (int, size_t);
+extern bool GOMP_OFFLOAD_managed_free (int, void *);
extern bool GOMP_OFFLOAD_page_locked_host_alloc (void **, size_t);
extern bool GOMP_OFFLOAD_page_locked_host_free (void *);
extern bool GOMP_OFFLOAD_dev2host (int, void *, const void *, size_t);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 3d406be175e..ff445d1e90c 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1135,6 +1135,8 @@ extern int gomp_get_num_devices (void);
extern bool gomp_target_task_fn (void *);
extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
int, struct goacc_asyncqueue *);
+extern void *gomp_managed_alloc (size_t size);
+extern void gomp_managed_free (void *device_ptr);
extern bool gomp_page_locked_host_alloc (void **, size_t);
extern void gomp_page_locked_host_free (void *);
@@ -1421,6 +1423,8 @@ struct gomp_device_descr
__typeof (GOMP_OFFLOAD_unload_image) *unload_image_func;
__typeof (GOMP_OFFLOAD_alloc) *alloc_func;
__typeof (GOMP_OFFLOAD_free) *free_func;
+ __typeof (GOMP_OFFLOAD_managed_alloc) *managed_alloc_func;
+ __typeof (GOMP_OFFLOAD_managed_free) *managed_free_func;
__typeof (GOMP_OFFLOAD_page_locked_host_alloc) *page_locked_host_alloc_func;
__typeof (GOMP_OFFLOAD_page_locked_host_free) *page_locked_host_free_func;
__typeof (GOMP_OFFLOAD_dev2host) *dev2host_func;
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 76a0162f814..d31ca52f76d 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -6890,6 +6890,7 @@ GCC supports the following predefined allocators and predefined memory spaces:
@item omp_pteam_mem_alloc @tab omp_low_lat_mem_space (implementation defined)
@item omp_thread_mem_alloc @tab omp_low_lat_mem_space (implementation defined)
@item ompx_gnu_pinned_mem_alloc @tab omp_default_mem_space (GNU extension)
+@item ompx_gnu_managed_mem_alloc @tab ompx_gnu_managed_mem_space (GNU extension)
@end multitable
Each predefined allocator, including @code{omp_null_allocator}, has a corresponding
@@ -6917,6 +6918,7 @@ The following allocator templates are supported:
@item omp_pteam_mem_alloc @tab omp::allocator::pteam_mem
@item omp_thread_mem_alloc @tab omp::allocator::thread_mem
@item ompx_gnu_pinned_mem_alloc @tab ompx::allocator::gnu_pinned_mem
+@item ompx_gnu_managed_mem_alloc @tab ompx::allocator::gnu_managed_mem
@end multitable
The following traits are available when constructing a new allocator;
@@ -6976,6 +6978,13 @@ For the memory spaces, the following applies:
unless the memkind library is available
@item @code{omp_high_bw_mem_space} maps to @code{omp_default_mem_space},
unless the memkind library is available
+@item @code{ompx_gnu_managed_mem_space} is a GNU extension that provides
+ managed memory accessible by both host and device; it is only available
+ on supported offload targets (see @ref{Offload-Target Specifics}).
+ This memory is accessible by both the host and the device at the same
+ address, so it need not be mapped with @code{map} clauses. Instead,
+ use the @code{is_device_ptr} clause or @code{has_device_addr} clause
+ to indicate that the pointer is already accessible on the device.
@end itemize
On Linux systems, where the @uref{https://github.com/memkind/memkind, memkind
@@ -7135,6 +7144,10 @@ The implementation remark:
a performance boost for NVPTX offload code and also allows unlimited use
of pinned memory regardless of the OS @code{ulimit}/@code{rlimit}
settings.
+@item Managed memory allocated with the OpenMP
+ @code{ompx_gnu_managed_mem_alloc} allocator or in the
+ @code{ompx_gnu_managed_mem_space} is not currently supported on AMD GPU
+ devices.
@item The OpenMP routines @code{omp_target_memcpy_rect} and
@code{omp_target_memcpy_rect_async} and the @code{target update}
directive for non-contiguous list items use the 3D memory-copy function
@@ -7297,6 +7310,15 @@ The implementation remark:
@code{omp_thread_mem_alloc}, all use low-latency memory as first
preference, and fall back to main graphics memory when the low-latency
pool is exhausted.
+@item Managed memory allocated with the @code{ompx_gnu_managed_mem_alloc}
+ allocator or in the @code{ompx_gnu_managed_mem_space} (both GNU
+ extensions) allocate memory in the CUDA Managed Memory space using
+ @code{cuMemAllocManaged}. This memory is accessible by both the host and
+ the device at the same address, so it need not be mapped with @code{map}
+ clauses. Instead, use the @code{is_device_ptr} clause or
+ @code{has_device_addr} clause to indicate that the pointer is already
+ accessible on the device. The CUDA runtime will automatically handle
+ data migration between host and device as needed.
@item The OpenMP routines @code{omp_target_memcpy_rect} and
@code{omp_target_memcpy_rect_async} and the @code{target update}
directive for non-contiguous list items use the 2D and 3D memory-copy
diff --git a/libgomp/omp.h.in b/libgomp/omp.h.in
index 60cb2b21be7..74c074c31c8 100644
--- a/libgomp/omp.h.in
+++ b/libgomp/omp.h.in
@@ -121,6 +121,7 @@ typedef enum omp_memspace_handle_t __GOMP_UINTPTR_T_ENUM
omp_const_mem_space = 2,
omp_high_bw_mem_space = 3,
omp_low_lat_mem_space = 4,
+ ompx_gnu_managed_mem_space = 200,
__omp_memspace_handle_t_max__ = __UINTPTR_MAX__
} omp_memspace_handle_t;
@@ -136,6 +137,7 @@ typedef enum omp_allocator_handle_t __GOMP_UINTPTR_T_ENUM
omp_pteam_mem_alloc = 7,
omp_thread_mem_alloc = 8,
ompx_gnu_pinned_mem_alloc = 200,
+ ompx_gnu_managed_mem_alloc = 201,
__omp_allocator_handle_t_max__ = __UINTPTR_MAX__
} omp_allocator_handle_t;
@@ -562,6 +564,10 @@ template <typename __T>
struct gnu_pinned_mem
: omp::allocator::__detail::__allocator_templ <__T,
ompx_gnu_pinned_mem_alloc> {};
+template <typename __T>
+struct gnu_managed_mem
+ : omp::allocator::__detail::__allocator_templ <__T,
+ ompx_gnu_managed_mem_alloc> {};
} /* namespace allocator */
diff --git a/libgomp/omp_lib.f90.in b/libgomp/omp_lib.f90.in
index ce866c00121..1d8cbafaa42 100644
--- a/libgomp/omp_lib.f90.in
+++ b/libgomp/omp_lib.f90.in
@@ -164,6 +164,8 @@
parameter :: omp_thread_mem_alloc = 8
integer (kind=omp_allocator_handle_kind), &
parameter :: ompx_gnu_pinned_mem_alloc = 200
+ integer (kind=omp_allocator_handle_kind), &
+ parameter :: ompx_gnu_managed_mem_alloc = 201
integer (omp_memspace_handle_kind), &
parameter :: omp_default_mem_space = 0
integer (omp_memspace_handle_kind), &
@@ -174,6 +176,8 @@
parameter :: omp_high_bw_mem_space = 3
integer (omp_memspace_handle_kind), &
parameter :: omp_low_lat_mem_space = 4
+ integer (omp_memspace_handle_kind), &
+ parameter :: ompx_gnu_managed_mem_space = 200
integer, parameter :: omp_initial_device = -1
integer, parameter :: omp_invalid_device = -4
integer (omp_interop_kind), &
diff --git a/libgomp/omp_lib.h.in b/libgomp/omp_lib.h.in
index 9047095c5e0..7c158de5667 100644
--- a/libgomp/omp_lib.h.in
+++ b/libgomp/omp_lib.h.in
@@ -162,6 +162,7 @@
integer (omp_allocator_handle_kind) omp_pteam_mem_alloc
integer (omp_allocator_handle_kind) omp_thread_mem_alloc
integer (omp_allocator_handle_kind) ompx_gnu_pinned_mem_alloc
+ integer (omp_allocator_handle_kind) ompx_gnu_managed_mem_alloc
parameter (omp_null_allocator = 0)
parameter (omp_default_mem_alloc = 1)
parameter (omp_large_cap_mem_alloc = 2)
@@ -172,16 +173,19 @@
parameter (omp_pteam_mem_alloc = 7)
parameter (omp_thread_mem_alloc = 8)
parameter (ompx_gnu_pinned_mem_alloc = 200)
+ parameter (ompx_gnu_managed_mem_alloc = 201)
integer (omp_memspace_handle_kind) omp_default_mem_space
integer (omp_memspace_handle_kind) omp_large_cap_mem_space
integer (omp_memspace_handle_kind) omp_const_mem_space
integer (omp_memspace_handle_kind) omp_high_bw_mem_space
integer (omp_memspace_handle_kind) omp_low_lat_mem_space
+ integer (omp_memspace_handle_kind) ompx_gnu_managed_mem_space
parameter (omp_default_mem_space = 0)
parameter (omp_large_cap_mem_space = 1)
parameter (omp_const_mem_space = 2)
parameter (omp_high_bw_mem_space = 3)
parameter (omp_low_lat_mem_space = 4)
+ parameter (ompx_gnu_managed_mem_space = 200)
integer omp_initial_device, omp_invalid_device
parameter (omp_initial_device = -1)
parameter (omp_invalid_device = -4)
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index 7f4ddcc6bd1..67c783d8566 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -33,6 +33,7 @@ CUDA_ONE_CALL (cuLinkDestroy)
CUDA_ONE_CALL (cuMemAlloc)
CUDA_ONE_CALL (cuMemAllocHost)
CUDA_ONE_CALL (cuMemHostAlloc)
+CUDA_ONE_CALL (cuMemAllocManaged)
CUDA_ONE_CALL (cuMemcpy)
CUDA_ONE_CALL (cuMemcpyDtoDAsync)
CUDA_ONE_CALL (cuMemcpyDtoH)
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 5ad66688e7e..6f9fb758850 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1133,11 +1133,13 @@ nvptx_stacks_free (struct ptx_device *ptx_dev, bool force)
}
static void *
-nvptx_alloc (size_t s, bool suppress_errors)
+nvptx_alloc (size_t s, bool suppress_errors, bool managed)
{
CUdeviceptr d;
- CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &d, s);
+ CUresult r = (managed ? CUDA_CALL_NOCHECK (cuMemAllocManaged, &d, s,
+ CU_MEM_ATTACH_GLOBAL)
+ : CUDA_CALL_NOCHECK (cuMemAlloc, &d, s));
if (suppress_errors && r == CUDA_ERROR_OUT_OF_MEMORY)
return NULL;
else if (r != CUDA_SUCCESS)
@@ -1843,8 +1845,8 @@ GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
return ret;
}
-void *
-GOMP_OFFLOAD_alloc (int ord, size_t size)
+static void *
+GOMP_OFFLOAD_alloc_1 (int ord, size_t size, bool managed)
{
if (!nvptx_attach_host_thread_to_device (ord))
return NULL;
@@ -1867,7 +1869,7 @@ GOMP_OFFLOAD_alloc (int ord, size_t size)
blocks = tmp;
}
- void *d = nvptx_alloc (size, true);
+ void *d = nvptx_alloc (size, true, managed);
if (d)
return d;
else
@@ -1875,10 +1877,22 @@ GOMP_OFFLOAD_alloc (int ord, size_t size)
/* Memory allocation failed. Try freeing the stacks block, and
retrying. */
nvptx_stacks_free (ptx_dev, true);
- return nvptx_alloc (size, false);
+ return nvptx_alloc (size, false, managed);
}
}
+void *
+GOMP_OFFLOAD_alloc (int ord, size_t size)
+{
+ return GOMP_OFFLOAD_alloc_1 (ord, size, false);
+}
+
+void *
+GOMP_OFFLOAD_managed_alloc (int ord, size_t size)
+{
+ return GOMP_OFFLOAD_alloc_1 (ord, size, true);
+}
+
bool
GOMP_OFFLOAD_free (int ord, void *ptr)
{
@@ -1886,6 +1900,12 @@ GOMP_OFFLOAD_free (int ord, void *ptr)
&& nvptx_free (ptr, ptx_devices[ord]));
}
+bool
+GOMP_OFFLOAD_managed_free (int ord, void *ptr)
+{
+ return GOMP_OFFLOAD_free (ord, ptr);
+}
+
bool
GOMP_OFFLOAD_page_locked_host_alloc (void **ptr, size_t size)
{
diff --git a/libgomp/target.c b/libgomp/target.c
index ac5b4b0b720..405798c6536 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -4697,6 +4697,56 @@ omp_target_free (void *device_ptr, int device_num)
gomp_mutex_unlock (&devicep->lock);
}
+void *
+gomp_managed_alloc (size_t size)
+{
+ struct gomp_task_icv *icv = gomp_icv (false);
+ struct gomp_device_descr *devicep = resolve_device (icv->default_device_var,
+ false);
+ if (devicep == NULL)
+ return NULL;
+
+ if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+ || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+ return malloc (size);
+
+ void *ret = NULL;
+ gomp_mutex_lock (&devicep->lock);
+ if (devicep->managed_alloc_func)
+ ret = devicep->managed_alloc_func (devicep->target_id, size);
+ gomp_mutex_unlock (&devicep->lock);
+ return ret;
+}
+
+void
+gomp_managed_free (void *device_ptr)
+{
+ if (device_ptr == NULL)
+ return;
+
+ struct gomp_task_icv *icv = gomp_icv (false);
+ struct gomp_device_descr *devicep = resolve_device (icv->default_device_var,
+ false);
+ if (devicep == NULL)
+ return;
+
+ if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+ || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+ {
+ free (device_ptr);
+ return;
+ }
+
+ gomp_mutex_lock (&devicep->lock);
+ if (devicep->managed_free_func
+ && !devicep->managed_free_func (devicep->target_id, device_ptr))
+ {
+ gomp_mutex_unlock (&devicep->lock);
+ gomp_fatal ("error in freeing device memory block at %p", device_ptr);
+ }
+ gomp_mutex_unlock (&devicep->lock);
+}
+
/* Device (really: libgomp plugin) to use for paged-locked memory. We
assume there is either none or exactly one such device for the lifetime of
the process. */
@@ -5929,6 +5979,8 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
DLSYM (unload_image);
DLSYM (alloc);
DLSYM (free);
+ DLSYM_OPT (managed_alloc, managed_alloc);
+ DLSYM_OPT (managed_free, managed_free);
DLSYM_OPT (page_locked_host_alloc, page_locked_host_alloc);
DLSYM_OPT (page_locked_host_free, page_locked_host_free);
DLSYM (dev2host);
diff --git a/libgomp/testsuite/lib/libgomp.exp b/libgomp/testsuite/lib/libgomp.exp
index fd475ac3fe6..b0b05277706 100644
--- a/libgomp/testsuite/lib/libgomp.exp
+++ b/libgomp/testsuite/lib/libgomp.exp
@@ -722,3 +722,12 @@ int main() {
return 0;
} } "-lhipblas" ]
}
+
+# return 1 if OpenMP Device Managed Memory is supported
+
+proc check_effective_target_omp_managedmem { } {
+ if { [libgomp_check_effective_target_offload_target "nvptx"] } {
+ return 1
+ }
+ return 0
+}
diff --git a/libgomp/testsuite/libgomp.c++/alloc-managed-1.C b/libgomp/testsuite/libgomp.c++/alloc-managed-1.C
new file mode 100644
index 00000000000..afd7fd648c6
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c++/alloc-managed-1.C
@@ -0,0 +1,35 @@
+// { dg-do run }
+// { dg-require-effective-target omp_managedmem }
+
+// Check that the ompx::allocator::gnu_managed_mem allocator can allocate
+// Managed Memory, and that host and target can see the data, at the same
+// address, without a mapping.
+
+#include <omp.h>
+#include <cstdint>
+#include <memory>
+
+int
+main ()
+{
+ using Allocator = ompx::allocator::gnu_managed_mem<int>;
+ using Traits = std::allocator_traits<Allocator>;
+
+ Allocator alloc;
+ int *a = Traits::allocate (alloc, 1);
+ if (!a)
+ __builtin_abort ();
+
+ Traits::construct (alloc, a, 42);
+ std::uintptr_t a_p = reinterpret_cast<std::uintptr_t>(a);
+
+ #pragma omp target is_device_ptr(a)
+ {
+ if (*a != 42 || a_p != reinterpret_cast<std::uintptr_t>(a))
+ __builtin_abort ();
+ }
+
+ Traits::destroy (alloc, a);
+ Traits::deallocate (alloc, a, 1);
+ return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/alloc-managed-1.c b/libgomp/testsuite/libgomp.c/alloc-managed-1.c
new file mode 100644
index 00000000000..31b252fc0ae
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/alloc-managed-1.c
@@ -0,0 +1,28 @@
+/* { dg-do run } */
+/* { dg-require-effective-target omp_managedmem } */
+
+/* Check that omp_alloc can allocate Managed Memory, and that host and target
+ can see the data, at the same address, without a mapping. */
+
+#include <omp.h>
+#include <stdint.h>
+
+int
+main ()
+{
+ int *a = (int *) omp_alloc(sizeof(int), ompx_gnu_managed_mem_alloc);
+ if (!a)
+ __builtin_abort ();
+
+ *a = 42;
+ uintptr_t a_p = (uintptr_t)a;
+
+ #pragma omp target is_device_ptr(a)
+ {
+ if (*a != 42 || a_p != (uintptr_t)a)
+ __builtin_abort ();
+ }
+
+ omp_free(a, ompx_gnu_managed_mem_alloc);
+ return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/alloc-managed-2.c b/libgomp/testsuite/libgomp.c/alloc-managed-2.c
new file mode 100644
index 00000000000..f7fd30a4f67
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/alloc-managed-2.c
@@ -0,0 +1,38 @@
+/* { dg-do run } */
+/* { dg-require-effective-target omp_managedmem } */
+
+/* Check that omp_calloc can allocate Managed Memory, and that host and target
+ can see the data, at the same address, without a mapping. */
+
+#include <omp.h>
+#include <stdint.h>
+
+int
+main ()
+{
+ int *a = (int *) omp_calloc(5, sizeof(int), ompx_gnu_managed_mem_alloc);
+ if (!a)
+ __builtin_abort ();
+
+ /* Check that memory is zero-initialized */
+ for (int i = 0; i < 5; i++)
+ if (a[i] != 0)
+ __builtin_abort ();
+
+ a[0] = 42;
+ a[4] = 99;
+ uintptr_t a_p = (uintptr_t)a;
+
+ #pragma omp target is_device_ptr(a)
+ {
+ if (a[0] != 42 || a[4] != 99 || a_p != (uintptr_t)a)
+ __builtin_abort ();
+ /* Check zero-initialization on device side */
+ for (int i = 1; i < 4; i++)
+ if (a[i] != 0)
+ __builtin_abort ();
+ }
+
+ omp_free(a, ompx_gnu_managed_mem_alloc);
+ return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/alloc-managed-3.c b/libgomp/testsuite/libgomp.c/alloc-managed-3.c
new file mode 100644
index 00000000000..17828b76962
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/alloc-managed-3.c
@@ -0,0 +1,44 @@
+/* { dg-do run } */
+/* { dg-require-effective-target omp_managedmem } */
+
+/* Check that omp_realloc can allocate Managed Memory, and that host and target
+ can see the data, at the same address, without a mapping. */
+
+#include <omp.h>
+#include <stdint.h>
+
+int
+main ()
+{
+ int *a = (int *) omp_alloc(2 * sizeof(int), ompx_gnu_managed_mem_alloc);
+ if (!a)
+ __builtin_abort ();
+
+ a[0] = 42;
+ a[1] = 43;
+
+ /* Reallocate to larger size */
+ int *b = (int *) omp_realloc(a, 5 * sizeof(int), ompx_gnu_managed_mem_alloc,
+ ompx_gnu_managed_mem_alloc);
+ if (!b)
+ __builtin_abort ();
+
+ /* Check that original data is preserved */
+ if (b[0] != 42 || b[1] != 43)
+ __builtin_abort ();
+
+ b[2] = 44;
+ b[3] = 45;
+ b[4] = 46;
+ uintptr_t b_p = (uintptr_t)b;
+
+ #pragma omp target is_device_ptr(b)
+ {
+ if (b[0] != 42 || b[1] != 43 || b[2] != 44 || b[3] != 45 || b[4] != 46
+ || b_p != (uintptr_t)b)
+ __builtin_abort ();
+ }
+
+ omp_free(b, ompx_gnu_managed_mem_alloc);
+ return 0;
+}
diff --git a/libgomp/testsuite/libgomp.fortran/alloc-managed-1.f90 b/libgomp/testsuite/libgomp.fortran/alloc-managed-1.f90
new file mode 100644
index 00000000000..685aeef7dae
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/alloc-managed-1.f90
@@ -0,0 +1,29 @@
+! { dg-do run }
+! { dg-require-effective-target omp_managedmem }
+
+! Check that omp_alloc can allocate Managed Memory, and that host and target
+! can see the data, at the same address, without a mapping.
+
+program main
+ use omp_lib
+ use iso_c_binding
+ implicit none
+
+ type(c_ptr) :: cptr
+ integer, pointer :: a
+ integer(c_intptr_t) :: a_p, a_p2
+
+ cptr = omp_alloc(c_sizeof(a), ompx_gnu_managed_mem_alloc)
+ if (.not. c_associated(cptr)) stop 1
+
+ call c_f_pointer(cptr, a)
+ a = 42
+ a_p = transfer(c_loc(a), a_p)
+
+ !$omp target is_device_ptr(a)
+ a_p2 = transfer(c_loc(a), a_p2)
+ if (a /= 42 .or. a_p /= a_p2) stop 2
+ !$omp end target
+
+ call omp_free(cptr, ompx_gnu_managed_mem_alloc)
+end program main
--
2.51.0