On 07.08.25 10:46, Liu01 Tong wrote:
> The early commit b8adc31cc0ca ("drm/amdgpu: Avoid extra evict-restore
> process.") changed amdgpu_vm_wait_idle to use drm_sched_entity_flush
> instead of dma_resv_wait_timeout to avoid KFD eviction fence signaling.
> But this introduces a race condition when processes are killed.
> 
> During process kill, drm_sched_entity_flush() will kill the vm entities.
> Concurrent job submissions of this process will fail.

Clear NAK to that. This is essentially why we call drm_sched_entity_flush() 
here in the first place.

Regards,
Christian.

> 
> Fix by skipping vm entity flushing when the process is being killed.
> 
> Signed-off-by: Liu01 Tong <[email protected]>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 283dd44f04b0..ae43a378f866 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2415,6 +2415,13 @@ void amdgpu_vm_adjust_size(struct amdgpu_device *adev, uint32_t min_vm_size,
>   */
>  long amdgpu_vm_wait_idle(struct amdgpu_vm *vm, long timeout)
>  {
> +     /* If the process is being killed, skip flush VM entities
> +      * as entities of concurrent job submission of this process
> +      * might be in an inconsistent state
> +      */
> +     if (current->flags & PF_EXITING)
> +             return timeout;
> +
>       timeout = drm_sched_entity_flush(&vm->immediate, timeout);
>       if (timeout <= 0)
>               return timeout;
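
For readers following the thread, the behaviour under discussion can be sketched with a toy userspace model. This is not kernel code and not the actual implementation of drm_sched_entity_flush(): every name below (model_entity, model_task, model_entity_flush, model_submit, model_wait_idle) is hypothetical, and the model assumes only what the thread states, namely that flushing from an exiting process stops the entity so that later submissions fail, and that the proposed patch would skip the flush in that case.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model, NOT kernel code. All names are hypothetical. */
struct model_entity {
	bool stopped;     /* set once the entity has been killed */
	int queued_jobs;  /* number of accepted submissions */
};

struct model_task {
	bool exiting;     /* stands in for (current->flags & PF_EXITING) */
};

/*
 * Assumed flush behaviour: when the caller is exiting, the entity is
 * stopped. Returns the remaining timeout unchanged in this model.
 */
long model_entity_flush(struct model_entity *e, const struct model_task *t,
			long timeout)
{
	assert(e && t);
	if (t->exiting)
		e->stopped = true;
	return timeout;
}

/* A submission is rejected (-1) once the entity has been stopped. */
int model_submit(struct model_entity *e)
{
	assert(e);
	if (e->stopped)
		return -1;
	e->queued_jobs++;
	return 0;
}

/*
 * Wait-idle with and without the proposed early return.
 * skip_when_exiting == true models the quoted patch.
 */
long model_wait_idle(struct model_entity *e, const struct model_task *t,
		     long timeout, bool skip_when_exiting)
{
	assert(e && t);
	if (skip_when_exiting && t->exiting)
		return timeout;
	return model_entity_flush(e, t, timeout);
}
```

In this model, calling model_wait_idle() from an exiting task without the early return makes a following model_submit() fail, while with the early return the submission is still accepted. The reply above objects to exactly that second outcome: rejecting further submissions from a killed process is the reason the flush is called in the first place.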
