Hi Thomas,

Maybe I misunderstood your suggestion, but "Wait on a memory location"
assumes that the location will eventually change. If a target region
happens to contain no reverse offload, the memory location never changes,
yet the target region still has to return to the host.

What we would need: wait on a memory location – and return if either the
kernel has stopped *or* the memory location has changed.
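
In pseudo-code, something like the following hypothetical helper (all
names below are made up; as far as I can see, no single CUDA driver call
provides this combination):

       /* Block until either the memory at 'addr' satisfies the condition
          described by 'cond'/'value', or all work on 'kernel_stream' has
          completed - and report which of the two happened.  */
       enum wake_reason { WAKE_VALUE, WAKE_KERNEL_DONE };
       enum wake_reason
       wait_value_or_stream_done (CUdeviceptr addr, uint64_t value,
                                  unsigned int cond, CUstream kernel_stream);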

My impression is that "return if the kernel stopped" is not really
guaranteed. Or did I miss some fine print?

Tobias

On 04.04.23 16:40, Thomas Schwinge wrote:
Hi!

During GCC/OpenMP/nvptx reverse offload investigations, about how to
replace the problematic global 'GOMP_REV_OFFLOAD_VAR', I may have found
something re:

On 2022-08-26T11:07:28+0200, Tobias Burnus <tob...@codesourcery.com> wrote:
Better suggestions are welcome for the busy loop in
libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
its value.
On the host side, the last address is checked - if fn_addr != NULL,
it passes all arguments on to the generic (target.c) gomp_target_rev
to do the actual offloading.

CUDA does lockup when trying to copy data from the currently running
stream; hence, a new stream is generated to do the memory copying.
Future work for nvptx:
* Adjust 'sleep', possibly [...]
   to do shorter sleeps than usleep(1)?
... this busy loop.
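
(As an aside, a minimal sketch of what a shorter sleep than 'usleep (1)'
could look like, using plain POSIX 'nanosleep' from <time.h>; just an
illustration, not what the plugin currently does:)

       /* Sleep for ~100 ns between polls; the OS may well round this up
          to its timer/scheduling granularity.  */
       struct timespec ts = { 0, 100 };
       nanosleep (&ts, NULL);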

Current 'libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_run':

     [...]
       if (reverse_offload)
         CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
       r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
                              32, threads, 1, 0, NULL, NULL, config);
       if (r != CUDA_SUCCESS)
         GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
       if (reverse_offload)
         while (true)
           {
             r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
             if (r == CUDA_SUCCESS)
               break;
             if (r == CUDA_ERROR_LAUNCH_FAILED)
               GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error 
(r),
                                  maybe_abort_msg);
             else if (r != CUDA_ERROR_NOT_READY)
               GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));

             if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 
0)
               {
                 struct rev_offload *rev_data = ptx_dev->rev_data;
                 GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
                                         rev_data->addrs, rev_data->sizes,
                                         rev_data->kinds, rev_data->dev_num,
                                         rev_off_dev_to_host_cpy,
                                         rev_off_host_to_dev_cpy, copy_stream);
                 CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
                 __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
               }
             usleep (1);
           }
       else
         r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
       if (reverse_offload)
         CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
     [...]

Instead of this 'while (true)', 'usleep (1)' loop, shouldn't we be able
to use "Stream Memory Operations",
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEMOP.html>,
which allow us to "Wait on a memory location" "until the given condition
on the memory is satisfied"?
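
Roughly, a sketch (assuming the device supports these memory operations,
presumably to be checked via cuDeviceGetAttribute with
CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS; 'fn_dev_addr' is a made-up
name for a device-visible address of 'rev_data->fn'; and "GEQ 1" merely
approximates "became non-zero"):

       /* Enqueue a wait that is released once the memory condition is
          satisfied, then block the host on that stream; afterwards hand
          'rev_data->*' to 'GOMP_PLUGIN_target_rev' as before.  */
       CUDA_CALL_ASSERT (cuStreamWaitValue64, copy_stream, fn_dev_addr,
                         1, CU_STREAM_WAIT_VALUE_GEQ);
       CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);

(That would replace the 'usleep (1)' polling for the reverse-offload
request itself; it still needs some way to also return when the kernel
finishes without ever requesting a reverse offload.)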

For reference, current 'libgomp/config/nvptx/target.c:GOMP_target_ext':

     [...]
       GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
       GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
       GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
       GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
       GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;

       /* Set 'fn' to trigger processing on the host; wait for completion,
          which is flagged by setting 'fn' back to 0 on the host.  */
       uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
     #if __PTX_SM__ >= 700
       asm volatile ("st.global.release.sys.u64 [%0], %1;"
                     : : "r"(addr_struct_fn), "r" (fn) : "memory");
     #else
       __sync_synchronize ();  /* membar.sys */
       asm volatile ("st.volatile.global.u64 [%0], %1;"
                     : : "r"(addr_struct_fn), "r" (fn) : "memory");
     #endif

     #if __PTX_SM__ >= 700
       uint64_t fn2;
       do
         {
           asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
                         : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
         }
       while (fn2 != 0);
     #else
       /* ld.global.u64 %r64,[__gomp_rev_offload_var];
          ld.u64 %r36,[%r64];
          membar.sys;  */
       while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 
0)
         ;  /* spin  */
     #endif
     [...]


Regards
  Thomas
