Hi Tobias!
On 2023-04-28T10:28:22+0200, Tobias Burnus <[email protected]> wrote:
> maybe I misunderstood your suggestion, but
First, note that those CUDA "Stream Memory Operations" are something that
I found by chance, and don't have any actual experience with. I can't
seem to find much documentation or real-world usage of this API.
By the way, a similar thing also exists for AMD GPUs:
'hipStreamWaitValue32', etc.
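For reference, a minimal sketch of the HIP side (the signatures and flag
names below are from my reading of the ROCm HIP runtime API and are
assumptions on my part; untested):

```c
/* Hypothetical HIP analogue of the CUDA stream memory operations.
   Error checking elided; requires a device/driver that supports
   stream memory operations.  */
#include <hip/hip_runtime.h>
#include <stdint.h>

void
hip_stream_memop_sketch (hipStream_t stream, uint32_t *flag)
{
  /* Enqueue a wait until *flag >= 1 (the mask is applied before
     comparing) ...  */
  hipStreamWaitValue32 (stream, flag, 1, hipStreamWaitValueGte,
			0xFFFFFFFFu);
  /* ... and then a write resetting it to 0.  */
  hipStreamWriteValue32 (stream, flag, 0, 0);
}
```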
> "Wait on a memory location"
> assumes that there will be a change – but if a target region happens to
> have no reverse offload, the memory location will never change, but
> still the target region should return to the host.
Oh indeed. ;-) Details...
> What we would need: Wait on memory location – and return if either the
> kernel stopped *or* the memory location changed.
Or, could we have a way to "cancel" the 'cuStreamWaitValue32'/
'cuStreamWaitValue64' from the host, once the actual 'target' kernel has
completed?
> My impression is that "return if the kernel stopped" is not really
> guaranteed. Or did I miss some fine print?
No, you're right. I suppose this is as designed: for example, generally,
there may be additional kernel launches, and the "wait" will then
eventually trigger.
Could we, after the actual 'target' kernel completed, issue a host-side
"write" ('cuStreamWriteValue32', 'cuStreamWriteValue64') to that memory
location, to signal end of processing for reverse offloads?
That is:
- enqueue 'cuLaunchKernel'
- enqueue 'cuStreamWriteValue' (to signal end of processing for reverse
offloads)
- loop on 'cuStreamWaitValue' (until end of processing for reverse offloads)
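In 'GOMP_OFFLOAD_run' terms, that sequence might look roughly like the
following. This is only a sketch: the 'DONE_SENTINEL' value, the
'rev_fn_devptr'/'rev_fn_host' parameters, and the CU_STREAM_WAIT_VALUE_GEQ
trick (there is no "not equal" wait flag) are my assumptions, not tested
code:

```c
/* 'rev_data->fn' is 0 initially, a host function address while a reverse
   offload is pending, and DONE_SENTINEL once the kernel has finished.
   Error checking elided; stream memory operations also need driver and
   device support.  */
#include <cuda.h>
#include <stdint.h>

#define DONE_SENTINEL 1  /* Small positive value; never a valid fn address. */

static void
run_sketch (CUfunction function, void **config, int teams, int threads,
	    CUstream wait_stream, CUdeviceptr rev_fn_devptr,
	    volatile uint64_t *rev_fn_host)
{
  /* 1. Enqueue the actual 'target' kernel on the default stream.  */
  cuLaunchKernel (function, teams, 1, 1, threads, 1, 1, 0, NULL, NULL,
		  config);
  /* 2. Enqueue a write that only executes after the kernel completed,
     to signal "end of processing for reverse offloads".  */
  cuStreamWriteValue64 (NULL, rev_fn_devptr, DONE_SENTINEL,
			CU_STREAM_WRITE_VALUE_DEFAULT);
  /* 3. Instead of the usleep(1) polling, block until the location becomes
     non-zero.  GEQ 1 stands in for "!= 0"; this assumes both host fn
     addresses and the sentinel compare >= 1 under GEQ's cyclic
     comparison.  */
  for (;;)
    {
      cuStreamWaitValue64 (wait_stream, rev_fn_devptr, 1,
			   CU_STREAM_WAIT_VALUE_GEQ);
      cuStreamSynchronize (wait_stream);  /* Block the host here.  */
      uint64_t fn = __atomic_load_n (rev_fn_host, __ATOMIC_ACQUIRE);
      if (fn == DONE_SENTINEL)
	break;  /* Kernel finished; no more reverse offloads.  */
      /* Perform the reverse offload on the host, e.g.
	 GOMP_PLUGIN_target_rev (fn, ...), then reset 'fn' so the
	 device-side spin in GOMP_target_ext continues.  */
      __atomic_store_n (rev_fn_host, 0, __ATOMIC_RELEASE);
    }
}
```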
Grüße
Thomas
> On 04.04.23 16:40, Thomas Schwinge wrote:
>> Hi!
>>
>> During GCC/OpenMP/nvptx reverse offload investigations, about how to
>> replace the problematic global 'GOMP_REV_OFFLOAD_VAR', I may have found
>> something re:
>>
>> On 2022-08-26T11:07:28+0200, Tobias Burnus <[email protected]> wrote:
>>> Better suggestions are welcome for the busy loop in
>>> libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
>>> its value.
>>> On the host side, the last address is checked - if fn_addr != NULL,
>>> it passes all arguments on to the generic (target.c) gomp_target_rev
>>> to do the actual offloading.
>>>
>>> CUDA does lockup when trying to copy data from the currently running
>>> stream; hence, a new stream is generated to do the memory copying.
>>> Future work for nvptx:
>>> * Adjust 'sleep', possibly [...]
>>> to do shorter sleeps than usleep(1)?
>> ... this busy loop.
>>
>> Current 'libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_run':
>>
>> [...]
>> if (reverse_offload)
>> CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream,
>> CU_STREAM_NON_BLOCKING);
>> r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
>> 32, threads, 1, 0, NULL, NULL, config);
>> if (r != CUDA_SUCCESS)
>> GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
>> if (reverse_offload)
>> while (true)
>> {
>> r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
>> if (r == CUDA_SUCCESS)
>> break;
>> if (r == CUDA_ERROR_LAUNCH_FAILED)
>> GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error
>> (r),
>> maybe_abort_msg);
>> else if (r != CUDA_ERROR_NOT_READY)
>> GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
>>
>> if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE)
>> != 0)
>> {
>> struct rev_offload *rev_data = ptx_dev->rev_data;
>> GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
>> rev_data->addrs, rev_data->sizes,
>> rev_data->kinds, rev_data->dev_num,
>> rev_off_dev_to_host_cpy,
>> rev_off_host_to_dev_cpy,
>> copy_stream);
>> CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
>> __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
>> }
>> usleep (1);
>> }
>> else
>> r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
>> if (reverse_offload)
>> CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
>> [...]
>>
>> Instead of this 'while (true)', 'usleep (1)' loop, shouldn't we be able
>> to use "Stream Memory Operations",
>> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEMOP.html>,
>> which allow one to "Wait on a memory location" "until the given condition on
>> the memory is satisfied"?
>>
>> For reference, current 'libgomp/config/nvptx/target.c:GOMP_target_ext':
>>
>> [...]
>> GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
>> GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
>> GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
>> GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
>> GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
>>
>> /* Set 'fn' to trigger processing on the host; wait for completion,
>> which is flagged by setting 'fn' back to 0 on the host. */
>> uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
>> #if __PTX_SM__ >= 700
>> asm volatile ("st.global.release.sys.u64 [%0], %1;"
>> : : "r"(addr_struct_fn), "r" (fn) : "memory");
>> #else
>> __sync_synchronize (); /* membar.sys */
>> asm volatile ("st.volatile.global.u64 [%0], %1;"
>> : : "r"(addr_struct_fn), "r" (fn) : "memory");
>> #endif
>>
>> #if __PTX_SM__ >= 700
>> uint64_t fn2;
>> do
>> {
>> asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
>> : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
>> }
>> while (fn2 != 0);
>> #else
>> /* ld.global.u64 %r64,[__gomp_rev_offload_var];
>> ld.u64 %r36,[%r64];
>> membar.sys; */
>> while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE)
>> != 0)
>> ; /* spin */
>> #endif
>> [...]
>>
>>
>> Grüße
>> Thomas
>> -----------------
>> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201,
>> 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer:
>> Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München;
>> Registergericht München, HRB 106955