Hi Tobias!

On 2023-04-28T10:28:22+0200, Tobias Burnus <tob...@codesourcery.com> wrote:
> maybe I misunderstood your suggestion, but
First, note that those CUDA "Stream Memory Operations" are something that
I found by chance, and don't have any actual experience with.  I can't
seem to find much documentation or real-world usage of this API?  By the
way, a similar thing also exists for AMD GPUs: 'hipStreamWaitValue32', etc.

> "Wait on a memory location" assumes that there will be a change – but
> if a target region happens to have no reverse offload, the memory
> location will never change, but still the target region should return
> to the host.

Oh indeed.  ;-) Details...

> What we would need: Wait on memory location – and return if either the
> kernel stopped *or* the memory location changed.

Or, have a way to "cancel", from the host, the 'cuStreamWaitValue32',
'cuStreamWaitValue64', after the actual 'target' kernel completed?

> My impression is that "return if the kernel stopped" is not really
> guaranteed.  Or did I miss some fine print?

No, you're right.  I suppose this is as designed: for example, generally,
there may be additional kernel launches, and the "wait" will then
eventually trigger.

Could we, after the actual 'target' kernel completed, issue a host-side
"write" ('cuStreamWriteValue32', 'cuStreamWriteValue64') to that memory
location, to signal end of processing for reverse offloads?  That is:

  - enqueue 'cuLaunchKernel'
  - enqueue 'cuStreamWriteValue' (to signal end of processing for
    reverse offloads)
  - loop on 'cuStreamWaitValue' (until end of processing for reverse
    offloads)

Regards
 Thomas

> On 04.04.23 16:40, Thomas Schwinge wrote:
>> Hi!
>>
>> During GCC/OpenMP/nvptx reverse offload investigations, about how to
>> replace the problematic global 'GOMP_REV_OFFLOAD_VAR', I may have found
>> something re:
>>
>> On 2022-08-26T11:07:28+0200, Tobias Burnus <tob...@codesourcery.com> wrote:
>>> Better suggestions are welcome for the busy loop in
>>> libgomp/plugin/plugin-nvptx.c regarding the variable placement and
>>> checking its value.
>>> On the host side, the last address is checked - if fn_addr != NULL,
>>> it passes all arguments on to the generic (target.c) gomp_target_rev
>>> to do the actual offloading.
>>>
>>> CUDA does lock up when trying to copy data from the currently running
>>> stream; hence, a new stream is generated to do the memory copying.
>>> Future work for nvptx:
>>> * Adjust 'sleep', possibly [...]
>>>   to do shorter sleeps than usleep(1)?
>>
>> ... this busy loop.
>>
>> Current 'libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_run':
>>
>>     [...]
>>     if (reverse_offload)
>>       CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream,
>>                         CU_STREAM_NON_BLOCKING);
>>     r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
>>                            32, threads, 1, 0, NULL, NULL, config);
>>     if (r != CUDA_SUCCESS)
>>       GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
>>     if (reverse_offload)
>>       while (true)
>>         {
>>           r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
>>           if (r == CUDA_SUCCESS)
>>             break;
>>           if (r == CUDA_ERROR_LAUNCH_FAILED)
>>             GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n",
>>                                cuda_error (r), maybe_abort_msg);
>>           else if (r != CUDA_ERROR_NOT_READY)
>>             GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
>>
>>           if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE)
>>               != 0)
>>             {
>>               struct rev_offload *rev_data = ptx_dev->rev_data;
>>               GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
>>                                       rev_data->addrs, rev_data->sizes,
>>                                       rev_data->kinds, rev_data->dev_num,
>>                                       rev_off_dev_to_host_cpy,
>>                                       rev_off_host_to_dev_cpy,
>>                                       copy_stream);
>>               CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
>>               __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
>>             }
>>           usleep (1);
>>         }
>>     else
>>       r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
>>     if (reverse_offload)
>>       CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
>>     [...]
>>
>> Instead of this 'while (true)', 'usleep (1)' loop, shouldn't we be able
>> to use "Stream Memory Operations",
>> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEMOP.html>,
>> that allow to "Wait on a memory location", "until the given condition on
>> the memory is satisfied"?
>>
>> For reference, current 'libgomp/config/nvptx/target.c:GOMP_target_ext':
>>
>>     [...]
>>     GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
>>     GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
>>     GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
>>     GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
>>     GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
>>
>>     /* Set 'fn' to trigger processing on the host; wait for completion,
>>        which is flagged by setting 'fn' back to 0 on the host.  */
>>     uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
>> #if __PTX_SM__ >= 700
>>     asm volatile ("st.global.release.sys.u64 [%0], %1;"
>>                   : : "r" (addr_struct_fn), "r" (fn) : "memory");
>> #else
>>     __sync_synchronize ();  /* membar.sys */
>>     asm volatile ("st.volatile.global.u64 [%0], %1;"
>>                   : : "r" (addr_struct_fn), "r" (fn) : "memory");
>> #endif
>>
>> #if __PTX_SM__ >= 700
>>     uint64_t fn2;
>>     do
>>       {
>>         asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
>>                       : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
>>       }
>>     while (fn2 != 0);
>> #else
>>     /* ld.global.u64 %r64,[__gomp_rev_offload_var];
>>        ld.u64 %r36,[%r64];
>>        membar.sys;  */
>>     while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE)
>>            != 0)
>>       ;  /* spin  */
>> #endif
>>     [...]
>>
>>
>> Regards
>>  Thomas
>> -----------------
>> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201,
>> 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer:
>> Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München;
>> Registergericht München, HRB 106955