Hi Tobias!

On 2024-01-23T10:55:16+0100, Tobias Burnus <tbur...@baylibre.com> wrote:
> Slightly changed patch:
>
> nvptx_attach_host_thread_to_device now fails again with an error for
> CUDA_ERROR_DEINITIALIZED, except for GOMP_OFFLOAD_fini_device.
>
> I think it makes more sense that way.
Agreed.

> Tobias Burnus wrote:
>> Testing showed that the libgomp.c/target-52.c test failed with:
>>
>>     libgomp: cuCtxGetDevice error: unknown cuda error
>>
>>     libgomp: device finalization failed
>>
>> This testcase uses OMP_DISPLAY_ENV=true and
>> OMP_TARGET_OFFLOAD=mandatory, and those env vars matter, i.e. it only
>> fails if dg-set-target-env-var is honored.
>>
>> If both env vars are set, the device initialization occurs earlier, as
>> OMP_DEFAULT_DEVICE is shown due to the display-env env var, and its
>> value (when target-offload-var is 'mandatory') might be either
>> 'omp_invalid_device' or '0'.
>>
>> It turned out that this had an effect on device finalization, which
>> caused CUDA to stop earlier than expected. This patch now handles this
>> case gracefully. For details, see the commit log message in the
>> attached patch and/or the PR.

> plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]
>
> The following issue was found when running libgomp.c/target-52.c with
> nvptx offloading when the dg-set-target-env-var was honored.

Curious, I've never seen this failure mode in my several different
configurations. :-|

> The issue
> occurred for both -foffload=disable and with offloading configured when
> an nvidia device is available.
>
> At the end of the program, the offloading parts are shut down via two means:
> The callback registered via 'atexit (gomp_target_fini)' and - via code
> generated in mkoffload - the '__attribute__((destructor)) fini' function
> that calls GOMP_offload_unregister_ver.
>
> In normal processing, first gomp_target_fini is called - which then sets
> GOMP_DEVICE_FINALIZED for the device - and later GOMP_offload_unregister_ver,
> but that's then a no-op because the state is GOMP_DEVICE_FINALIZED.
>
> If both OMP_DISPLAY_ENV=true and OMP_TARGET_OFFLOAD="mandatory" are set,
> the call omp_display_env already invokes gomp_init_targets_once, i.e. it
> occurs earlier than usual and is invoked via __attribute__((constructor))
> initialize_env.
>
> For some unknown reasons, while this does not have an effect on the
> order of the called plugin functions for initialization, it changes the
> order of function calls for shutting down. Namely, when the two environment
> variables are set, GOMP_offload_unregister_ver is called now before
> gomp_target_fini.

Re "unknown reasons", isn't that indeed explained by the different
'atexit' function/'__attribute__((destructor))' sequencing, due to the
different order of 'atexit'/'__attribute__((constructor))' calls?  (See
the small standalone sketch at the end of this mail.)

I think I agree that, defensively, we should behave correctly in libgomp
finalization, no matter in which order these calls occur.

> And it seems as if CUDA regards a call to cuModuleUnload
> (or unloading the last module?) as indication that the device context should
> be destroyed - or, at least, afterwards calling cuCtxGetDevice will return
> CUDA_ERROR_DEINITIALIZED.

However, this I don't understand -- but would like to.
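In case it helps to narrow that down outside of libgomp: a standalone CUDA
Driver API reproducer along the following lines -- just an untested sketch;
the file name, the trivial PTX payload, and the explicit cuCtxCreate are my
assumptions, not anything from your setup -- ought to show whether the Driver
really flips cuCtxGetDevice to CUDA_ERROR_DEINITIALIZED as soon as the last
module has been unloaded:

/* cuctx-after-unload.c -- hypothetical standalone reproducer (untested sketch).
   Build (assuming CUDA Driver API headers/library are installed):
     gcc cuctx-after-unload.c -o cuctx-after-unload -lcuda  */

#include <stdio.h>
#include <cuda.h>

/* Minimal empty PTX module; just something for cuModuleLoadData to accept.  */
static const char ptx[] =
  ".version 6.0\n"
  ".target sm_35\n"
  ".address_size 64\n";

static const char *
err (CUresult r)
{
  const char *s;
  if (cuGetErrorString (r, &s) != CUDA_SUCCESS)
    s = "unknown error";
  return s;
}

int
main (void)
{
  CUdevice dev;
  CUcontext ctx;
  CUmodule mod;
  CUresult r;

  cuInit (0);
  cuDeviceGet (&dev, 0);
  cuCtxCreate (&ctx, 0, dev);

  r = cuModuleLoadData (&mod, ptx);
  printf ("cuModuleLoadData: %s\n", err (r));

  r = cuModuleUnload (mod);
  printf ("cuModuleUnload: %s\n", err (r));

  /* The interesting part: does this now report CUDA_ERROR_DEINITIALIZED,
     given that the last (only) module has been unloaded?  */
  r = cuCtxGetDevice (&dev);
  printf ("cuCtxGetDevice: %s\n", err (r));

  return 0;
}

If something like that reproduces with your Driver but not with mine, we'd at
least know it's Driver behavior rather than anything on the libgomp side.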
Are you saying that for:

--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -1556,8 +1556,16 @@ GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
     if (image->target_data == target_data)
       {
         *prev_p = image->next;
-        if (CUDA_CALL_NOCHECK (cuModuleUnload, image->module) != CUDA_SUCCESS)
+        CUresult r;
+        r = CUDA_CALL_NOCHECK (cuModuleUnload, image->module);
+        GOMP_PLUGIN_debug (0, "%s: cuModuleUnload: %s\n", __FUNCTION__, cuda_error (r));
+        if (r != CUDA_SUCCESS)
           ret = false;
+        CUdevice dev_;
+        r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &dev_);
+        GOMP_PLUGIN_debug (0, "%s: cuCtxGetDevice: %s\n", __FUNCTION__, cuda_error (r));
+        GOMP_PLUGIN_debug (0, "%s: dev_=%d, dev->dev=%d\n", __FUNCTION__, dev_, dev->dev);
+        assert (dev_ == dev->dev);
         free (image->fns);
         free (image);
         break;

..., you're seeing an error for 'libgomp.c/target-52.c' with
'env OMP_TARGET_OFFLOAD=mandatory OMP_DISPLAY_ENV=true'?  I get:

    GOMP_OFFLOAD_unload_image: cuModuleUnload: no error
    GOMP_OFFLOAD_unload_image: cuCtxGetDevice: no error
    GOMP_OFFLOAD_unload_image: dev_=0, dev->dev=0

Or, is something else happening in between the 'cuModuleUnload' and your
reportedly failing 'cuCtxGetDevice'?

Re your PR113513 details, I don't see how your failure mode could be related
to (a) the PTX code ('--with-arch=sm_80'), or (b) the GPU hardware
("NVIDIA RTX A1000 6GB") (..., unless the Nvidia Driver is doing "funny"
things, of course...), so could this possibly be due to a recent change in
the CUDA Driver/Nvidia Driver?  You say "CUDA Version: 12.3", but which
Nvidia Driver version?  The latest I've now tested are:

    Driver Version: 525.147.05    CUDA Version: 12.0
    Driver Version: 535.154.05    CUDA Version: 12.2

I'll re-try with a more recent version.

> As the previous code in nvptx_attach_host_thread_to_device wasn't expecting
> that result, it called
>   GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
> causing a fatal error of the program.
>
> This commit now handles CUDA_ERROR_DEINITIALIZED in a special way such
> that GOMP_OFFLOAD_fini_device just works.

I'd like to please defer that one until we understand the actual origin of
the misbehavior.

> When reading the code, the following was observed in addition:
> When gomp_fini_device is called, it invokes goacc_fini_asyncqueues
> to ensure that the queue is emptied. It seems to make sense to do
> likewise for GOMP_offload_unregister_ver, which this commit does in
> addition.

I don't understand (a) why offload image unregistration should trigger
'goacc_fini_asyncqueues', and (b) how that relates to PR113513?


Grüße
 Thomas
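PS: Re the 'atexit'/'__attribute__((constructor))' sequencing above, here's a
small standalone sketch of what I mean -- a hypothetical example, not libgomp
code, and the teardown order described below is my understanding of glibc's
behavior, not anything guaranteed.  Whether the atexit handler (standing in
for gomp_target_fini) is registered early, from a shared library constructor
(the initialize_env case), or only later during main (the usual lazy
gomp_target_init case) determines whether it runs after or before the
executable's ELF destructor (standing in for the mkoffload-generated 'fini'):

/* lib.c -- build as: gcc -shared -fPIC -o liblib.so lib.c  */
#include <stdlib.h>
#include <stdio.h>

static void
handler (void)                    /* stands in for gomp_target_fini */
{
  puts ("atexit handler");
}

void
register_handler (void)           /* stands in for gomp_target_init's atexit */
{
  atexit (handler);
}

__attribute__((constructor)) static void
ctor (void)                       /* stands in for libgomp's initialize_env */
{
  /* Mimic OMP_DISPLAY_ENV/OMP_TARGET_OFFLOAD triggering early
     gomp_init_targets_once: register the exit handler already here.  */
  if (getenv ("EARLY_INIT"))
    register_handler ();
}

/* main.c -- build as: gcc -o main main.c -L. -llib -Wl,-rpath,'$ORIGIN'  */
#include <stdlib.h>
#include <stdio.h>

extern void register_handler (void);

__attribute__((destructor)) static void
dtor (void)                       /* stands in for mkoffload's 'fini' */
{
  puts ("destructor");
}

int
main (void)
{
  if (!getenv ("EARLY_INIT"))
    register_handler ();          /* the usual lazy registration */
  return 0;
}

With glibc, I'd expect './main' to print "atexit handler" before "destructor",
but 'EARLY_INIT=1 ./main' to print them the other way round: the early
registration (from the DSO constructor, run by ld.so) happens before
__libc_start_main registers the routine that later runs the ELF destructors,
and exit handlers run in reverse order of registration -- which would match
the inverted gomp_target_fini/GOMP_offload_unregister_ver order you're seeing.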