https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105042
Thomas Schwinge <tschwinge at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tschwinge at gcc dot gnu.org

--- Comment #7 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
By the way, I'm not reproducing this 'GOMP_NVPTX_JIT=-O0' issue on my current
Nvidia Quadro P1000 GPU system (Driver Version: 450.119.03), but what you've
found sounds plausible.

(In reply to Tom de Vries from comment #5)
> (In reply to Richard Biener from comment #1)
> > Doesn't whatever driver/library API we use from libgomp to invoke workloads
> > report actual errors?  Maybe we need to improve there.
>
> This:
> ...
> libgomp: cuStreamSynchronize error: the launch timed out and was terminated
> ...
> seems to be the string for cudaErrorLaunchTimeout, which AFAICT is dedicated
> to this situation, so we could treat that error code specially in cuda_error
> in plugin-nvptx.c and emit a custom message.
>
> Say:
> ...
> libgomp: cuStreamSynchronize error: the launch timed out and was terminated
> (5 second time-out caused by launching on a device running a display manager)
> ...

Not sure if that's really worth it?  And, "5 second time-out" seems a detail
that we shouldn't rely on.  Is really "display manager" the only way this
timeout may get enabled?

> Alternatively, we could detect cudaDeviceProp::kernelExecTimeoutEnabled and
> emit a warning when initializing or before launching the first kernel.

That sounds noisy to me, given that most of all GPU kernel launches still
finish successfully?  A 'GOMP_debug' note for that sounds fine.

But, well, to be helpful to the user: how about we indeed catch the
'CUDA_ERROR_LAUNCH_TIMEOUT' error case (and, if that makes sense, 'assert'
that 'CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT' is set), and emit an additional
message like "run time limit for kernels executed on the device" (per
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__DEVICE.html#group__CUDA__DEVICE_1g9c3e1414f0ad901d3278a4d6645fc266>,
'CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT')?  That is, like we have
'maybe_abort_msg'.
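
To make the proposal above concrete, here is a minimal sketch of the decision logic only. It is not the actual plugin-nvptx.c code. 'launch_timeout_hint' is a hypothetical helper, and 'SKETCH_CUDA_ERROR_LAUNCH_TIMEOUT' is a local stand-in for 'CUDA_ERROR_LAUNCH_TIMEOUT' (702 in <cuda.h>), so that the snippet builds without the CUDA toolkit. In the plugin, the second argument would presumably come from 'cuDeviceGetAttribute' with 'CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT'. Whether to 'assert' or to stay silent when the attribute is not set is left open in the comment; the sketch stays silent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for CUDA_ERROR_LAUNCH_TIMEOUT from <cuda.h>; defined locally
   (an assumption of this sketch) so that it compiles without CUDA.  */
enum { SKETCH_CUDA_ERROR_LAUNCH_TIMEOUT = 702 };

/* Hypothetical helper: return an additional message to print after the
   regular CUDA error string, or NULL if there is nothing to add.

   RESULT is the error code returned by the failing driver API call.
   KERNEL_EXEC_TIMEOUT is the device's kernel run time limit attribute:
   nonzero if a run time limit applies, zero otherwise.  */
const char *
launch_timeout_hint (int result, int kernel_exec_timeout)
{
  /* Only the launch timeout error gets an extra hint.  */
  if (result != SKETCH_CUDA_ERROR_LAUNCH_TIMEOUT)
    return NULL;

  /* The comment suggests possibly asserting that the attribute is set;
     this sketch just declines to add a hint if it is not.  */
  if (!kernel_exec_timeout)
    return NULL;

  return "run time limit for kernels executed on the device";
}
```

A caller would print the hint, if any, on a line of its own after the existing "cuStreamSynchronize error: ..." message, much like 'maybe_abort_msg' is appended today.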