https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120682
--- Comment #9 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---

Hi @all, thank you for your interesting replies.

I guess the problem for the mapper and templates is that there are no pragma templates in C++. So this would probably indeed mean that one would have to change the C++ standard, which asks a bit too much...

As for the mapper and these map pragmas, there are indeed possible improvements in the OpenMP standard which I think are more important, but they are not really for the gcc bugzilla. For example, according to https://www.openmp.org/spec-html/5.0/openmpsu109.html, memory allocators cannot appear in a map clause (https://www.openmp.org/spec-html/5.0/openmpsu53.html), i.e. I cannot use the mapping pragmas to say that something should be placed into the fast omp_low_lat_mem_space, or into omp_large_cap_mem_space for large data. That does not seem to change in OpenMP 6.0.

For a struct, this means one has to manually copy each member variable and member array with omp_target_memcpy and then associate the pointers with the host variables via omp_target_associate_ptr if one wants control over the allocation, which is rather tedious for a large struct (see the first sketch below).

The various memory allocators (omp_large_cap_mem_alloc / omp_large_cap_mem_space, omp_const_mem_alloc / omp_const_mem_space, omp_high_bw_mem_alloc / omp_high_bw_mem_space, omp_low_lat_mem_alloc / omp_low_lat_mem_space), the different addressing modes (unified shared memory, unified_address) and the copying methods (synchronous or asynchronous copies) are unfortunately important for performance optimization.

On my new GPU, I found that adding two large vectors with #pragma omp requires unified_shared_memory (which gcc seems to use by default on my system) takes 409.019 milliseconds. Without unified shared memory, but with explicit mapping pragmas before the loop, adding the vectors takes 26.5208 milliseconds (wherever on the GPU it places them). Maybe it would be even faster if I had copied into low-latency memory instead of using the mapping pragmas? Without the mapping, i.e. just the addition itself, it takes 1.00863 milliseconds. The two variants are shown in the second sketch below. Sadly, using asynchronous copies before the loop (third sketch below) did not help to improve the mapping speed.

I could imagine that, especially for larger applications, the optimizer of a compiler could automatically decide which memory something is mapped to, and whether it should run on the GPU at all... The slow copying into the various kinds of memory certainly creates an additional headache for code optimization.

So when it comes to memory mapping, I think there may be room for improvement in the OpenMP standard, as well as in the gcc compiler, as this is the bottleneck that can make operations slow on the GPU.
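To illustrate the tedium, here is a minimal sketch of the manual per-member route, assuming a hypothetical struct Grid with two double arrays of length n (the struct and its member names are made up for illustration; the routines are the standard OpenMP 5.0 device-memory API). Note that omp_target_alloc takes no allocator argument either, so even on this route the device memory space cannot be chosen:

#include <omp.h>
#include <cstddef>

struct Grid {                     // hypothetical example struct
    double *a;
    double *b;
    std::size_t n;
};

void map_grid_manually(Grid &g)
{
    const int dev  = omp_get_default_device();
    const int host = omp_get_initial_device();
    const std::size_t bytes = g.n * sizeof(double);

    // One alloc + copy + associate per pointer member; this triple
    // repeats for every member of a large struct.
    double *d_a = static_cast<double *>(omp_target_alloc(bytes, dev));
    double *d_b = static_cast<double *>(omp_target_alloc(bytes, dev));
    omp_target_memcpy(d_a, g.a, bytes, 0, 0, dev, host);
    omp_target_memcpy(d_b, g.b, bytes, 0, 0, dev, host);
    omp_target_associate_ptr(g.a, d_a, bytes, 0, dev);
    omp_target_associate_ptr(g.b, d_b, bytes, 0, dev);

    // The data is now present on the device, so a target region can use
    // (local copies of) the host pointers without map clauses.
    double *a = g.a, *b = g.b;
    std::size_t n = g.n;
    #pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < n; ++i)
        a[i] += b[i];

    // Copy the result back, undo the association, free the device buffers.
    omp_target_memcpy(g.a, d_a, bytes, 0, 0, host, dev);
    omp_target_disassociate_ptr(g.a, dev);
    omp_target_disassociate_ptr(g.b, dev);
    omp_target_free(d_a, dev);
    omp_target_free(d_b, dev);
}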
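The two timed variants look roughly like this (a sketch; the vector length and the omp_get_wtime timing are my illustration, not the exact benchmark). Variant 1, unified shared memory, would instead have "#pragma omp requires unified_shared_memory" at file scope and no map clauses at all; variant 2, with the explicit mapping before the loop, is what is shown:

#include <omp.h>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    const std::size_t n = 10'000'000;        // length is an assumption
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
    double *pa = a.data(), *pb = b.data(), *pc = c.data();

    double t0 = omp_get_wtime();
    // Explicit mapping before the loop (the ~26.5 ms part).
    #pragma omp target enter data map(to: pa[0:n], pb[0:n]) map(alloc: pc[0:n])
    double t1 = omp_get_wtime();

    // The addition alone (the ~1 ms part).
    #pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < n; ++i)
        pc[i] = pa[i] + pb[i];
    double t2 = omp_get_wtime();

    #pragma omp target exit data map(from: pc[0:n]) map(release: pa[0:n], pb[0:n])

    std::printf("mapping: %g ms, addition: %g ms\n",
                (t1 - t0) * 1e3, (t2 - t1) * 1e3);
}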
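And the asynchronous-copy attempt looked roughly like this (again a sketch; nowait plus depend on target enter data is standard OpenMP 5.0, but whether the transfers actually overlap is up to the implementation and the device):

#include <omp.h>
#include <cstddef>

void add_async(double *a, double *b, double *c, std::size_t n)
{
    // Start the three transfers as deferred target tasks.
    #pragma omp target enter data map(to: a[0:n])    depend(out: a[0]) nowait
    #pragma omp target enter data map(to: b[0:n])    depend(out: b[0]) nowait
    #pragma omp target enter data map(alloc: c[0:n]) depend(out: c[0]) nowait

    // The kernel waits on all three transfers via its depend clauses.
    #pragma omp target teams distribute parallel for \
        depend(in: a[0], b[0], c[0]) nowait
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    // Copy the result back and wait for everything to finish.
    #pragma omp target exit data map(from: c[0:n]) depend(in: c[0]) nowait
    #pragma omp taskwait
}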