https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120682

--- Comment #9 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Hi @all, thank you for your interesting replies.


I guess the problem with the mapper and templates is that there is no such
thing as a templated pragma in C++. So this would probably indeed mean that one
would have to change the C++ standard itself, which asks a bit too much...

As for the mapper and these map pragmas, there are indeed possible improvements
in the OpenMP standard which I think are more important, but they are not
really for the gcc bugzilla. 

For example, according to https://www.openmp.org/spec-html/5.0/openmpsu109.html
the memory allocators cannot appear in a map clause:
https://www.openmp.org/spec-html/5.0/openmpsu53.html

i.e. I cannot use the mapping pragmas to say that something should be placed
into the fast omp_low_lat_mem_space, or into omp_large_cap_mem_space for large
data.

And that does not seem to change in OpenMP 6.0.
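
Just to illustrate what I mean, here is a minimal sketch (assuming the usual
OpenMP 5.x API): on the host I can at least pick a predefined allocator with
omp_alloc, but the map clause itself only takes a map-type, so there is no
place to name an allocator or memory space for the device copy.

#include <omp.h>
#include <cstddef>

int main()
{
    const std::size_t n = 1 << 20;

    // On the host, a predefined allocator can be chosen explicitly ...
    double *a = static_cast<double *>(
        omp_alloc(n * sizeof(double), omp_low_lat_mem_alloc));

    // ... but map() only takes a map-type (to/from/tofrom/alloc/...);
    // there is no modifier saying "put the device copy into
    // omp_low_lat_mem_space":
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (std::size_t i = 0; i < n; ++i)
        a[i] = 2.0 * i;

    omp_free(a, omp_low_lat_mem_alloc);
    return 0;
}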


For a struct, that would then mean that, if one wants to specify the allocator,
one has to manually copy each member variable and member array with
omp_target_memcpy and then associate the device pointers with the host
variables via omp_target_associate_ptr, which is rather tedious for a large
struct.
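
Roughly like this (just a sketch with made-up struct and function names, to
show the amount of boilerplate needed per member):

#include <omp.h>
#include <cstddef>

struct particles {        // hypothetical struct with two member arrays
    double *x;
    double *v;
    std::size_t n;
};

// Manually place one member array on the device and associate it with the
// host pointer, so that later target regions reuse that allocation.
void map_member(double *host, std::size_t n, int dev)
{
    std::size_t bytes = n * sizeof(double);
    void *devptr = omp_target_alloc(bytes, dev);   // no say in *which* memory
    omp_target_memcpy(devptr, host, bytes, 0, 0, dev,
                      omp_get_initial_device());
    omp_target_associate_ptr(host, devptr, bytes, 0, dev);
}

void upload(particles &p)
{
    int dev = omp_get_default_device();
    map_member(p.x, p.n, dev);   // ... and the same again for every member,
    map_member(p.v, p.n, dev);   // which is what makes this tedious
}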

The various memory allocators and memory spaces, like

omp_large_cap_mem_alloc, omp_large_cap_mem_space,
omp_const_mem_alloc, omp_const_mem_space,
omp_high_bw_mem_alloc, omp_high_bw_mem_space,
omp_low_lat_mem_alloc,

the different addressing methods, like unified shared memory or
unified_address,

and the copying methods (synchronous or asynchronous copies),

are unfortunately important for performance optimization.
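
For completeness, this is how one can at least request a particular memory
space on the host side with a custom allocator (a sketch; whether the space is
actually backed by distinct memory is implementation-defined):

#include <omp.h>
#include <cstddef>

int main()
{
    // Request high-bandwidth memory with 64-byte alignment; if the
    // implementation has no such memory it may fall back to the default
    // space (depending on the fallback trait).
    omp_alloctrait_t traits[] = { { omp_atk_alignment, 64 } };
    omp_allocator_handle_t hbw =
        omp_init_allocator(omp_high_bw_mem_space, 1, traits);

    double *buf = static_cast<double *>(
        omp_alloc(1024 * sizeof(double), hbw));
    // ... use buf ...
    omp_free(buf, hbw);
    omp_destroy_allocator(hbw);
    return 0;
}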

On my new GPU, I found that adding two large vectors with #pragma omp requires
unified_shared_memory (which gcc seems to use by default on my system) takes
409.019 milliseconds.

Without unified shared memory, but with explicit mapping pragmas before the
loop, adding the vectors takes 26.5208 milliseconds (wherever on the GPU it
places them). Maybe it would be faster if I had copied to low_lat_mem instead
of using the mapping pragmas?

Without taking the mapping into account, i.e. timing just the addition itself,
it takes 1.00863 milliseconds.
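
For reference, the explicitly mapped version I timed looks roughly like this (a
sketch, not the actual benchmark; the unified-shared-memory version is the same
loop with "#pragma omp requires unified_shared_memory" at file scope and no map
clauses):

#include <omp.h>
#include <cstddef>
#include <vector>

// Map the inputs and the result once around the kernel; the ~1 ms number is
// only the offloaded loop inside the data region, the ~26 ms number includes
// the mapping.
void add(const std::vector<double> &a, const std::vector<double> &b,
         std::vector<double> &c)
{
    const double *pa = a.data();
    const double *pb = b.data();
    double *pc = c.data();
    const std::size_t n = c.size();

    #pragma omp target data map(to: pa[0:n], pb[0:n]) map(from: pc[0:n])
    {
        #pragma omp target teams distribute parallel for
        for (std::size_t i = 0; i < n; ++i)
            pc[i] = pa[i] + pb[i];
    }
}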

Sadly, using asynchronous copies before the loop did not help to improve the
mapping speed.
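
What I tried looks roughly like this (a sketch of asynchronous transfers via
nowait/depend; the function name and the surrounding code are made up):

#include <omp.h>
#include <cstddef>

// Start the uploads asynchronously, let other host work overlap with the
// copies, and only then run the kernel, tied together with dependences.
void add_async(const double *a, const double *b, double *c, std::size_t n)
{
    #pragma omp target enter data map(to: a[0:n]) nowait depend(out: a[0:n])
    #pragma omp target enter data map(to: b[0:n]) nowait depend(out: b[0:n])

    // ... other host work could run here while the copies are in flight ...

    #pragma omp target teams distribute parallel for \
            map(from: c[0:n]) depend(in: a[0:n], b[0:n]) nowait
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    #pragma omp taskwait

    #pragma omp target exit data map(release: a[0:n], b[0:n])
}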

I could imagine that, especially for larger applications, the optimizer of a
compiler could decide automatically to which memory it maps data, and whether a
loop should be run on the GPU at all... The slow speed of the copying process
to the various kinds of memory certainly creates an additional headache for
code optimization.

So when it comes to memory mapping, I think there may be room for improvement
in the OpenMP standard, as well as in the gcc compiler, as this is the
bottleneck that can make operations slow on the GPU.
