[AMD Public Use]

Thanks Dennis. Yes, that's a valid case. Skipping the reset and scheduler resume sounds reasonable to me.

The patch is
Reviewed-by: Hawking Zhang <[email protected]>

Regards,
Hawking

-----Original Message-----
From: Li, Dennis <[email protected]>
Sent: Thursday, August 20, 2020 16:40
To: Zhang, Hawking <[email protected]>; [email protected]; Deucher, Alexander <[email protected]>; Kuehling, Felix <[email protected]>; Koenig, Christian <[email protected]>
Subject: RE: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery

[AMD Public Use]

Hi, Hawking,
      When a RAS uncorrectable error happens, the RAS interrupt will trigger a GPU recovery. At the same time, if a GFX or compute job times out, the driver will trigger a new one.

Best Regards
Dennis Li

-----Original Message-----
From: Zhang, Hawking <[email protected]>
Sent: Thursday, August 20, 2020 4:24 PM
To: Li, Dennis <[email protected]>; [email protected]; Deucher, Alexander <[email protected]>; Kuehling, Felix <[email protected]>; Koenig, Christian <[email protected]>
Cc: Li, Dennis <[email protected]>
Subject: RE: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery

[AMD Public Use]

Hi Dennis,

Can you elaborate on the case where the driver re-enters GPU recovery in an sGPU system? I'm wondering whether this is a valid case or whether we should prevent it from the beginning.

Regards,
Hawking

-----Original Message-----
From: Dennis Li <[email protected]>
Sent: Thursday, August 20, 2020 10:21
To: [email protected]; Deucher, Alexander <[email protected]>; Kuehling, Felix <[email protected]>; Zhang, Hawking <[email protected]>; Koenig, Christian <[email protected]>
Cc: Li, Dennis <[email protected]>
Subject: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery

In a single GPU system, if the driver re-enters GPU recovery, amdgpu_device_lock_adev will return false, but hive is nullptr now.

Signed-off-by: Dennis Li <[email protected]>

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 82242e2f5658..81b1d9a1dca0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4371,8 +4371,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		if (!amdgpu_device_lock_adev(tmp_adev)) {
 			DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
 				  job ? job->base.id : -1);
-			mutex_unlock(&hive->hive_lock);
-			return 0;
+			r = 0;
+			goto skip_recovery;
 		}
 
 		/*
@@ -4505,6 +4505,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		amdgpu_device_unlock_adev(tmp_adev);
 	}
 
+skip_recovery:
 	if (hive) {
 		atomic_set(&hive->in_reset, 0);
 		mutex_unlock(&hive->hive_lock);
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
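
For readers following the thread, the control flow the patch switches to can be sketched in standalone C. This is a simplified model under stated assumptions, not the driver code: fake_hive, fake_dev, fake_lock_dev, fake_unlock_dev and fake_gpu_recover are hypothetical stand-ins, a plain counter replaces the hive mutex, and the actual reset and scheduler work is elided. It only illustrates why jumping to a shared label that checks hive is safe when hive is NULL on a single GPU system, where the old early-return path would have unlocked through a NULL pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real amdgpu structures. */
struct fake_hive {
	int lock_depth;		/* models hive_lock: 1 while held */
	int in_reset;		/* models the in_reset flag */
};

struct fake_dev {
	bool in_recovery;	/* models the per-device recovery lock */
};

/* Returns false when a recovery is already running on this device. */
static bool fake_lock_dev(struct fake_dev *dev)
{
	if (dev->in_recovery)
		return false;
	dev->in_recovery = true;
	return true;
}

static void fake_unlock_dev(struct fake_dev *dev)
{
	dev->in_recovery = false;
}

/* hive is NULL on a single GPU system (assumption taken from the thread). */
int fake_gpu_recover(struct fake_dev *dev, struct fake_hive *hive)
{
	int r = 0;

	assert(dev != NULL);

	if (hive) {
		hive->in_reset = 1;
		hive->lock_depth++;
	}

	if (!fake_lock_dev(dev)) {
		/*
		 * An unconditional unlock through hive here would
		 * dereference NULL on a single GPU; jump to the shared
		 * exit path instead, which checks hive first.
		 */
		r = 0;
		goto skip_recovery;
	}

	/* ... reset and scheduler resume would happen here ... */

	fake_unlock_dev(dev);

skip_recovery:
	if (hive) {
		hive->in_reset = 0;
		hive->lock_depth--;
	}
	return r;
}
```

Calling fake_gpu_recover(&dev, NULL) while dev.in_recovery is already true models the re-entry case Dennis describes: the function bails out with 0, touches no hive state, and leaves the first recovery's device lock in place.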
