[AMD Official Use Only - Internal Distribution Only] Reviewed-by: Evan Quan <[email protected]>
-----Original Message----- From: Horace Chen <[email protected]> Sent: Thursday, January 14, 2021 9:37 PM To: [email protected] Cc: Grodzovsky, Andrey <[email protected]>; Quan, Evan <[email protected]>; Chen, Horace <[email protected]>; Tuikov, Luben <[email protected]>; Koenig, Christian <[email protected]>; Deucher, Alexander <[email protected]>; Xiao, Jack <[email protected]>; Zhang, Hawking <[email protected]>; Liu, Monk <[email protected]>; Xu, Feifei <[email protected]>; Wang, Kevin(Yang) <[email protected]>; Xiaojie Yuan <[email protected]> Subject: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout Fix a racing issue when jobs on 2 rings timeout simultaneously. If 2 rings timed out at the same time, the amdgpu_device_gpu_recover will be reentered. Then the adev->gmc.xgmi.head will be grabbed by 2 local linked list, which may cause wild pointer issue in iterating. lock the device earily to prevent the node be added to 2 different lists. for xgmi there is a hive lock which can promise there won't have 2 devices on same hive reenter the interation. So xgmi doesn't need to lock the device. Signed-off-by: Horace Chen <[email protected]> --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 4d434803fb49..a28e138ac72c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4591,19 +4591,20 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list); device_list_handle = &hive->device_list; } else { +/* if current dev is already in reset, skip adding list to prevent race issue */ +if (!amdgpu_device_lock_adev(adev, hive)) { +dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress", +job ? job->base.id : -1); +r = 0; +goto skip_recovery; +} + list_add_tail(&adev->gmc.xgmi.head, &device_list); device_list_handle = &device_list; } /* block all schedulers and reset given job's ring */ list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) { -if (!amdgpu_device_lock_adev(tmp_adev, hive)) { -dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress", - job ? job->base.id : -1); -r = 0; -goto skip_recovery; -} - /* * Try to put the audio codec into suspend state * before gpu reset started. -- 2.17.1 _______________________________________________ amd-gfx mailing list [email protected] https://lists.freedesktop.org/mailman/listinfo/amd-gfx
