On Wed, 28 Oct 2020 15:25:56 +0800 Chung-Lin Tang <clt...@codesourcery.com> wrote:
> On 2020/10/27 9:17 PM, Julian Brown wrote:
> >> And, in which context are cuStreamAddCallback registered callbacks
> >> run? E.g. if it is inside of an asynchronous interrupt, using locking
> >> in there might not be the best thing to do.
> >
> > The cuStreamAddCallback API is documented here:
> >
> > https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483
> >
> > We're quite limited in what we can do in the callback function since
> > "Callbacks must not make any CUDA API calls". So what *can* a
> > callback function do? It is mentioned that the callback function's
> > execution will "pause" the stream it is logically running on. So
> > can we get deadlock, e.g. if multiple host threads are launching
> > offload kernels simultaneously? I don't think so, but I don't know
> > how to prove it!
>
> I think it's not deadlock that's a problem here, but that the lock
> acquisition in nvptx_stack_acquire will effectively serialize GPU
> kernel execution to just one host thread (since you're holding it
> till kernel completion). Also in that case, why do you need to use a
> CUDA callback? You can just call the unlock directly afterwards.

IIUC, there's a single GPU queue used for synchronous launches no matter
which host thread initiates the operation, and kernel execution is
serialised anyway, so that shouldn't be a problem. The only way to get
different kernels executing simultaneously is to use different CUDA
streams -- but I think that's still TBD for OpenMP ("TODO: Implement
GOMP_OFFLOAD_async_run").

> I think a better way is to use a list of stack blocks in ptx_dev, and
> quickly retrieve/unlock it in nvptx_stack_acquire, like how we did it
> in GOMP_OFFLOAD_alloc for general device memory allocation.

If it weren't for the serialisation, we could also keep a stack cache
per-host-thread in nvptx_thread. But as it is, I don't think we need the
extra complication.
When we do OpenMP async support, maybe a stack cache can be put
per-stream in goacc_asyncqueue or the OpenMP equivalent.

Thanks,

Julian