On 2020/10/27 9:17 PM, Julian Brown wrote:
>> And, in which context are cuStreamAddCallback registered callbacks
>> run? E.g. if it is inside of an asynchronous interrupt, using locking
>> in there might not be the best thing to do.
> The cuStreamAddCallback API is documented here:
> https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483
> We're quite limited in what we can do in the callback function since
> "Callbacks must not make any CUDA API calls". So what *can* a callback
> function do? It is mentioned that the callback function's execution will
> "pause" the stream it is logically running on. So can we get deadlock,
> e.g. if multiple host threads are launching offload kernels
> simultaneously? I don't think so, but I don't know how to prove it!
I think it's not deadlock that's the problem here, but that the lock
acquisition in nvptx_stack_acquire will effectively serialize GPU kernel
execution down to one host thread at a time (since you're holding the lock
until kernel completion).
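
Schematically, the pattern I mean is something like this (a simplified
sketch with illustrative names, not the actual patch code):

#include <pthread.h>
#include <cuda.h>

static pthread_mutex_t stack_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stream callback: must not make any CUDA API calls, but unlocking a
   host mutex is fine.  It runs once the kernel on STREAM has finished.  */
static void CUDA_CB
stack_release_callback (CUstream stream, CUresult status, void *data)
{
  pthread_mutex_unlock ((pthread_mutex_t *) data);
}

static void
nvptx_stack_acquire_sketch (CUstream stream)
{
  /* The lock is taken here and only released by the callback above,
     i.e. it stays held for the whole lifetime of the kernel, so other
     host threads wanting a stacks block have to wait that long.  */
  pthread_mutex_lock (&stack_lock);
  /* ... set up the stacks block and launch the kernel on STREAM ... */
  cuStreamAddCallback (stream, stack_release_callback, &stack_lock, 0);
}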
Also, in that case, why do you need to use a CUDA callback at all? You
could just do the unlock directly afterwards.
I think a better way is to keep a list of stack blocks in ptx_dev, and
quickly retrieve a block and drop the lock in nvptx_stack_acquire, similar
to how we handle general device memory allocation in GOMP_OFFLOAD_alloc.
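
Something along these lines, as a rough sketch (the names and data layout
here are only illustrative, not the actual plugin code):

#include <pthread.h>
#include <stdlib.h>
#include <cuda.h>

struct stack_block
{
  CUdeviceptr ptr;
  size_t size;
  struct stack_block *next;
};

static pthread_mutex_t stack_lock = PTHREAD_MUTEX_INITIALIZER;
static struct stack_block *free_stacks;  /* Would live in ptx_dev.  */

static struct stack_block *
stack_acquire_sketch (size_t size)
{
  struct stack_block *blk = NULL;

  /* Fast path: grab a cached block; the lock only covers the list
     manipulation, not the kernel execution.  */
  pthread_mutex_lock (&stack_lock);
  if (free_stacks && free_stacks->size >= size)
    {
      blk = free_stacks;
      free_stacks = blk->next;
    }
  pthread_mutex_unlock (&stack_lock);

  /* Slow path: allocate a new block, outside the lock.  */
  if (blk == NULL)
    {
      blk = malloc (sizeof *blk);
      if (blk == NULL)
	return NULL;
      blk->size = size;
      if (cuMemAlloc (&blk->ptr, size) != CUDA_SUCCESS)
	{
	  free (blk);
	  return NULL;
	}
    }
  return blk;
}

/* Put the block back once the kernel using it has completed.  */
static void
stack_release_sketch (struct stack_block *blk)
{
  pthread_mutex_lock (&stack_lock);
  blk->next = free_stacks;
  free_stacks = blk;
  pthread_mutex_unlock (&stack_lock);
}

That way the critical section only covers the list manipulation, so kernels
launched from different host threads can still run concurrently.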
Chung-Lin