RE: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery

Zhang, Hawking Thu, 20 Aug 2020 03:06:12 -0700

[AMD Public Use]

Thanks Dennis. Yes, that's a valid case. Skipping the reset and scheduler 
resume sounds reasonable to me.


The patch is

Reviewed-by: Hawking Zhang <[email protected]>

Regards,
Hawking
-----Original Message-----
From: Li, Dennis <[email protected]> 
Sent: Thursday, August 20, 2020 16:40
To: Zhang, Hawking <[email protected]>; [email protected]; 
Deucher, Alexander <[email protected]>; Kuehling, Felix 
<[email protected]>; Koenig, Christian <[email protected]>
Subject: RE: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery

[AMD Public Use]

Hi, Hawking,
      When a RAS uncorrectable error happens, the RAS interrupt triggers a GPU 
recovery. If a GFX or compute job times out at the same time, the driver 
triggers a second recovery while the first one is still in progress.

Best Regards
Dennis Li
-----Original Message-----
From: Zhang, Hawking <[email protected]> 
Sent: Thursday, August 20, 2020 4:24 PM
To: Li, Dennis <[email protected]>; [email protected]; Deucher, 
Alexander <[email protected]>; Kuehling, Felix 
<[email protected]>; Koenig, Christian <[email protected]>
Cc: Li, Dennis <[email protected]>
Subject: RE: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery

[AMD Public Use]

Hi Dennis,

Can you elaborate on the case where the driver re-enters GPU recovery in an 
sGPU (single-GPU) system? I'm wondering whether this is a valid case or whether 
we should prevent it from the beginning.

Regards,
Hawking

-----Original Message-----
From: Dennis Li <[email protected]> 
Sent: Thursday, August 20, 2020 10:21
To: [email protected]; Deucher, Alexander 
<[email protected]>; Kuehling, Felix <[email protected]>; Zhang, 
Hawking <[email protected]>; Koenig, Christian <[email protected]>
Cc: Li, Dennis <[email protected]>
Subject: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery

In a single-GPU system, if the driver re-enters GPU recovery, 
amdgpu_device_lock_adev() returns false while hive is NULL, so the bail-out 
path's mutex_unlock(&hive->hive_lock) dereferences a null pointer.

Signed-off-by: Dennis Li <[email protected]>

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 82242e2f5658..81b1d9a1dca0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4371,8 +4371,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
                if (!amdgpu_device_lock_adev(tmp_adev)) {
                        DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
                                  job ? job->base.id : -1);
-                       mutex_unlock(&hive->hive_lock);
-                       return 0;
+                       r = 0;
+                       goto skip_recovery;
                }
 
                /*
@@ -4505,6 +4505,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
                amdgpu_device_unlock_adev(tmp_adev);
        }
 
+skip_recovery:
        if (hive) {
                atomic_set(&hive->in_reset, 0);
                mutex_unlock(&hive->hive_lock);
--
2.17.1
_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
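
For readers following the thread, the control flow the patch above introduces can be sketched in standalone C. This is only an illustrative model, not driver code: fake_hive, fake_dev, fake_lock_dev, fake_unlock_dev and fake_gpu_recover are hypothetical stand-ins, plain fields replace the kernel mutex and atomic, and the earlier hive checks in the real function are omitted. It shows a failed device lock jumping to the shared, NULL-checked cleanup instead of unlocking the hive directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; not the real amdgpu types. */
struct fake_hive {
    int in_reset;     /* models the hive's in_reset flag */
    bool lock_held;   /* models hive_lock being held */
};

struct fake_dev {
    bool in_recovery; /* true while some recovery owns the device */
    int resets_done;  /* number of full recoveries that actually ran */
};

/* Models amdgpu_device_lock_adev(): fails if a recovery is already running. */
static bool fake_lock_dev(struct fake_dev *dev)
{
    if (dev->in_recovery)
        return false;
    dev->in_recovery = true;
    return true;
}

/* Models amdgpu_device_unlock_adev(). */
static void fake_unlock_dev(struct fake_dev *dev)
{
    assert(dev->in_recovery);
    dev->in_recovery = false;
}

/* hive is NULL on a single-GPU system, so it must never be dereferenced blindly. */
static int fake_gpu_recover(struct fake_dev *dev, struct fake_hive *hive)
{
    int r = 0;

    if (hive) {
        hive->lock_held = true;
        hive->in_reset = 1;
    }

    if (!fake_lock_dev(dev)) {
        /* Another recovery is in flight: skip the reset and scheduler
         * resume, but still go through the common cleanup below. */
        r = 0;
        goto skip_recovery;
    }

    dev->resets_done++; /* the reset and scheduler resume would happen here */
    fake_unlock_dev(dev);

skip_recovery:
    if (hive) { /* the only place the hive is released, guarded against NULL */
        hive->in_reset = 0;
        hive->lock_held = false;
    }
    return r;
}
```

Calling fake_gpu_recover(&dev, NULL) while dev.in_recovery is already set returns 0 without touching the hive pointer, which mirrors the single-GPU re-entry case discussed in this thread.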
