Hi Tobias!

On 2024-01-23T10:55:16+0100, Tobias Burnus <tbur...@baylibre.com> wrote:
> Slightly changed patch:
>
> nvptx_attach_host_thread_to_device now fails again with an error for
> CUDA_ERROR_DEINITIALIZED, except for GOMP_OFFLOAD_fini_device.
>
> I think it makes more sense that way.
Agreed.

> Tobias Burnus wrote:
>> Testing showed that the libgomp.c/target-52.c test failed with:
>>
>>     libgomp: cuCtxGetDevice error: unknown cuda error
>>
>>     libgomp: device finalization failed
>>
>> This testcase uses OMP_DISPLAY_ENV=true and
>> OMP_TARGET_OFFLOAD=mandatory, and those env vars matter, i.e. it only
>> fails if dg-set-target-env-var is honored.
>>
>> If both env vars are set, the device initialization occurs earlier, as
>> OMP_DEFAULT_DEVICE is shown due to the display-env env var, and its
>> value (when target-offload-var is 'mandatory') might be either
>> 'omp_invalid_device' or '0'.
>>
>> It turned out that this had an effect on device finalization, which
>> caused CUDA to stop earlier than expected. This patch now handles this
>> case gracefully. For details, see the commit log message in the
>> attached patch and/or the PR.

> plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]
>
> The following issue was found when running libgomp.c/target-52.c with
> nvptx offloading when the dg-set-target-env-var was honored.

Curious, I've never seen this failure mode in my several different
configurations. :-|

> The issue
> occurred for both -foffload=disable and with offloading configured when
> an nvidia device is available.
>
> At the end of the program, the offloading parts are shut down via two means:
> The callback registered via 'atexit (gomp_target_fini)' and - via code
> generated in mkoffload - the '__attribute__((destructor)) fini' function
> that calls GOMP_offload_unregister_ver.
>
> In normal processing, first gomp_target_fini is called - which then sets
> GOMP_DEVICE_FINALIZED for the device - and later GOMP_offload_unregister_ver,
> but that's then a no-op because the state is GOMP_DEVICE_FINALIZED.
>
> If both OMP_DISPLAY_ENV=true and OMP_TARGET_OFFLOAD="mandatory" are set,
> the call omp_display_env already invokes gomp_init_targets_once, i.e. it
> occurs earlier than usual and is invoked via __attribute__((constructor))
> initialize_env.
>
> For some unknown reasons, while this does not have an effect on the
> order of the called plugin functions for initialization, it changes the
> order of function calls for shutting down. Namely, when the two environment
> variables are set, GOMP_offload_unregister_ver is called now before
> gomp_target_fini.

Re "unknown reasons", isn't that indeed explained by the different
'atexit' function/'__attribute__((destructor))' sequencing, due to the
different order of 'atexit'/'__attribute__((constructor))' calls?  (See
the small standalone sketch at the end of this mail.)

I think I agree that, defensively, we should behave correctly in libgomp
finalization, no matter in which order these calls occur.

> And it seems as if CUDA regards a call to cuModuleUnload
> (or unloading the last module?) as indication that the device context should
> be destroyed - or, at least, afterwards calling cuCtxGetDevice will return
> CUDA_ERROR_DEINITIALIZED.

However, this I don't understand -- but would like to.
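In case it helps to narrow that down outside of libgomp: a standalone CUDA
Driver API reproducer along the following lines -- just an untested sketch;
the file name, the trivial PTX payload, and the explicit cuCtxCreate are my
assumptions, not anything from your setup -- ought to show whether the Driver
really flips cuCtxGetDevice to CUDA_ERROR_DEINITIALIZED as soon as the last
module has been unloaded:

/* cuctx-after-unload.c -- hypothetical standalone reproducer (untested sketch).
   Build (assuming CUDA Driver API headers/library are installed):
     gcc cuctx-after-unload.c -o cuctx-after-unload -lcuda  */

#include <stdio.h>
#include <cuda.h>

/* Minimal empty PTX module; just something for cuModuleLoadData to accept.  */
static const char ptx[] =
  ".version 6.0\n"
  ".target sm_35\n"
  ".address_size 64\n";

static const char *
err (CUresult r)
{
  const char *s;
  if (cuGetErrorString (r, &s) != CUDA_SUCCESS)
    s = "unknown error";
  return s;
}

int
main (void)
{
  CUdevice dev;
  CUcontext ctx;
  CUmodule mod;
  CUresult r;

  cuInit (0);
  cuDeviceGet (&dev, 0);
  cuCtxCreate (&ctx, 0, dev);

  r = cuModuleLoadData (&mod, ptx);
  printf ("cuModuleLoadData: %s\n", err (r));

  r = cuModuleUnload (mod);
  printf ("cuModuleUnload: %s\n", err (r));

  /* The interesting part: does this now report CUDA_ERROR_DEINITIALIZED,
     given that the last (only) module has been unloaded?  */
  r = cuCtxGetDevice (&dev);
  printf ("cuCtxGetDevice: %s\n", err (r));

  return 0;
}

If something like that reproduces with your Driver but not with mine, we'd at
least know it's Driver behavior rather than anything on the libgomp side.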
Are you saying that for:

--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -1556,8 +1556,16 @@ GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
     if (image->target_data == target_data)
       {
         *prev_p = image->next;
-        if (CUDA_CALL_NOCHECK (cuModuleUnload, image->module) != CUDA_SUCCESS)
+        CUresult r;
+        r = CUDA_CALL_NOCHECK (cuModuleUnload, image->module);
+        GOMP_PLUGIN_debug (0, "%s: cuModuleUnload: %s\n", __FUNCTION__, cuda_error (r));
+        if (r != CUDA_SUCCESS)
           ret = false;
+        CUdevice dev_;
+        r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &dev_);
+        GOMP_PLUGIN_debug (0, "%s: cuCtxGetDevice: %s\n", __FUNCTION__, cuda_error (r));
+        GOMP_PLUGIN_debug (0, "%s: dev_=%d, dev->dev=%d\n", __FUNCTION__, dev_, dev->dev);
+        assert (dev_ == dev->dev);
         free (image->fns);
         free (image);
         break;

..., you're seeing an error for 'libgomp.c/target-52.c' with
'env OMP_TARGET_OFFLOAD=mandatory OMP_DISPLAY_ENV=true'?  I get:

    GOMP_OFFLOAD_unload_image: cuModuleUnload: no error
    GOMP_OFFLOAD_unload_image: cuCtxGetDevice: no error
    GOMP_OFFLOAD_unload_image: dev_=0, dev->dev=0

Or, is something else happening in between the 'cuModuleUnload' and your
reportedly failing 'cuCtxGetDevice'?

Re your PR113513 details, I don't see how your failure mode could be related
to (a) the PTX code ('--with-arch=sm_80'), or (b) the GPU hardware
("NVIDIA RTX A1000 6GB") (..., unless the Nvidia Driver is doing "funny"
things, of course...), so could this possibly be due to a recent change in
the CUDA Driver/Nvidia Driver?  You say "CUDA Version: 12.3", but which
Nvidia Driver version?  The latest I've now tested are:

    Driver Version: 525.147.05    CUDA Version: 12.0
    Driver Version: 535.154.05    CUDA Version: 12.2

I'll re-try with a more recent version.

> As the previous code in nvptx_attach_host_thread_to_device wasn't expecting
> that result, it called
>   GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
> causing a fatal error of the program.
>
> This commit now handles CUDA_ERROR_DEINITIALIZED in a special way such
> that GOMP_OFFLOAD_fini_device just works.

I'd like to please defer that one until we understand the actual origin of
the misbehavior.

> When reading the code, the following was observed in addition:
> When gomp_fini_device is called, it invokes goacc_fini_asyncqueues
> to ensure that the queue is emptied. It seems to make sense to do
> likewise for GOMP_offload_unregister_ver, which this commit does in
> addition.

I don't understand (a) why offload image unregistration should trigger
'goacc_fini_asyncqueues', and (b) how that relates to PR113513?


Grüße
 Thomas
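PS: Re the 'atexit'/'__attribute__((constructor))' sequencing above, here's a
small standalone sketch of what I mean -- a hypothetical example, not libgomp
code, and the teardown order described below is my understanding of glibc's
behavior, not anything guaranteed.  Whether the atexit handler (standing in
for gomp_target_fini) is registered early, from a shared library constructor
(the initialize_env case), or only later during main (the usual lazy
gomp_target_init case) determines whether it runs after or before the
executable's ELF destructor (standing in for the mkoffload-generated 'fini'):

/* lib.c -- build as: gcc -shared -fPIC -o liblib.so lib.c  */
#include <stdlib.h>
#include <stdio.h>

static void
handler (void)                    /* stands in for gomp_target_fini */
{
  puts ("atexit handler");
}

void
register_handler (void)           /* stands in for gomp_target_init's atexit */
{
  atexit (handler);
}

__attribute__((constructor)) static void
ctor (void)                       /* stands in for libgomp's initialize_env */
{
  /* Mimic OMP_DISPLAY_ENV/OMP_TARGET_OFFLOAD triggering early
     gomp_init_targets_once: register the exit handler already here.  */
  if (getenv ("EARLY_INIT"))
    register_handler ();
}

/* main.c -- build as: gcc -o main main.c -L. -llib -Wl,-rpath,'$ORIGIN'  */
#include <stdlib.h>
#include <stdio.h>

extern void register_handler (void);

__attribute__((destructor)) static void
dtor (void)                       /* stands in for mkoffload's 'fini' */
{
  puts ("destructor");
}

int
main (void)
{
  if (!getenv ("EARLY_INIT"))
    register_handler ();          /* the usual lazy registration */
  return 0;
}

With glibc, I'd expect './main' to print "atexit handler" before "destructor",
but 'EARLY_INIT=1 ./main' to print them the other way round: the early
registration (from the DSO constructor, run by ld.so) happens before
__libc_start_main registers the routine that later runs the ELF destructors,
and exit handlers run in reverse order of registration -- which would match
the inverted gomp_target_fini/GOMP_offload_unregister_ver order you're seeing.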