Reviewed-by: Guchun Chen <[email protected]>

Regards,
Guchun

-----Original Message-----
From: Guilherme G. Piccoli <[email protected]> 
Sent: Thursday, February 2, 2023 12:48 AM
To: [email protected]
Cc: [email protected]; Deucher, Alexander 
<[email protected]>; Koenig, Christian <[email protected]>; Pan, 
Xinhui <[email protected]>; [email protected]; [email protected]; 
Guilherme G. Piccoli <[email protected]>; Chen, Guchun <[email protected]>; 
Tuikov, Luben <[email protected]>; Limonciello, Mario 
<[email protected]>
Subject: [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched 
init/fini

Currently amdgpu calls drm_sched_fini() from the fence driver sw fini routine - 
such function is expected to be called only after the respective init function 
- drm_sched_init() - was executed successfully.

Happens that we faced a driver probe failure in the Steam Deck recently, and 
the function drm_sched_fini() was called even without its counter-part had been 
previously called, causing the following oops:

amdgpu: probe of 0000:04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0000000000000090 PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338 
Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] [...] Call Trace:
 <TASK>
 amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
 amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
 amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
 devm_drm_dev_init_release+0x49/0x70
 [...]

To prevent that, check if the drm_sched was properly initialized for a given 
ring before calling its fini counter-part.

Notice ideally we'd use sched.ready for that; such field is set as the latest 
thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such 
field - in the above oops for example, it was a GFX ring causing the crash, and 
the sched.ready field was set to true in the ring init routine, regardless of 
the state of the DRM scheduler. Hence, we ended-up using sched.ops as per 
Christian's suggestion [0].

[0] 
https://lore.kernel.org/amd-gfx/[email protected]/

Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 
test (v2)")
Suggested-by: Christian König <[email protected]>
Cc: Guchun Chen <[email protected]>
Cc: Luben Tuikov <[email protected]>
Cc: Mario Limonciello <[email protected]>
Signed-off-by: Guilherme G. Piccoli <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 00444203220d..3b962cb680a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device 
*adev)
                if (!ring || !ring->fence_drv.initialized)
                        continue;
 
-               if (!ring->no_scheduler)
+               /*
+                * Notice we check for sched.ops since there's some
+                * override on the meaning of sched.ready by amdgpu.
+                * The natural check would be sched.ready, which is
+                * set as drm_sched_init() finishes...
+                */
+               if (!ring->no_scheduler && ring->sched.ops)
                        drm_sched_fini(&ring->sched);
 
                for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
--
2.39.0

Reply via email to