I agree with dropping the amdgpu_ras_enable=2 condition. My only concern is
that, besides fatal_error, BACO on XGMI may also end in an atombios_init
timeout (I am not sure whether PSP mode1 reset causes this as well).
Assuming there is no amdgpu_ras_enable=2 check, my understanding of the use
cases when PMFW > 40.52 is:
1. sGPU without RAS:
* new: baco
* old: baco
2. sGPU with RAS:
* new: baco
* old: psp mode1 chain reset and legacy fatal_error handling
3. XGMI with RAS: baco
* new: baco
* old: psp mode1 chain reset and legacy fatal_error handling
4. XGMI without RAS: baco
* new: baco
* old: psp mode1 chain reset
That is to say, all use cases take the BACO path when PMFW > 40.52.
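To make the table above concrete, here is a minimal sketch of the "old" and "new" selection, assuming the amdgpu_ras_enable check is dropped and only the PMFW threshold from the patch remains. The enum, the macro, and both function names are made up for illustration and are not the driver's own symbols.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the driver's own enum. */
enum reset_method {
	RESET_METHOD_BACO,
	RESET_METHOD_MODE1,
};

/* Threshold from the patch: BACO is only allowed above this fw_version. */
#define PMFW_BACO_THRESHOLD 0x283400u

/* "old" column: an XGMI hive or RAS support always forces PSP mode1 reset. */
static enum reset_method old_reset_method(bool in_hive, bool ras_supported)
{
	if (in_hive || ras_supported)
		return RESET_METHOD_MODE1;
	return RESET_METHOD_BACO;
}

/*
 * "new" column, assuming no amdgpu_ras_enable=2 check: mode1 is kept only
 * when the PMFW is not newer than the threshold.
 */
static enum reset_method new_reset_method(bool in_hive, bool ras_supported,
					  uint32_t pmfw_version)
{
	if ((in_hive || ras_supported) && pmfw_version <= PMFW_BACO_THRESHOLD)
		return RESET_METHOD_MODE1;
	return RESET_METHOD_BACO;
}
```

With a PMFW version above the threshold, new_reset_method() returns BACO for all four combinations of hive and RAS, which is the point of the summary line.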
Regards,
Ma Le
-----Original Message-----
From: Zhang, Hawking <[email protected]>
Sent: Wednesday, November 27, 2019 7:28 PM
To: Ma, Le <[email protected]>; [email protected]
Cc: Chen, Guchun <[email protected]>; Zhou1, Tao <[email protected]>; Li,
Dennis <[email protected]>; Deucher, Alexander <[email protected]>; Ma,
Le <[email protected]>
Subject: RE: [PATCH 06/10] drm/amdgpu: add condition to enable baco for
xgmi/ras case
[AMD Public Use]
After thinking about it a bit, I think we can rely on the PMFW version alone to
decide between RAS recovery and legacy fatal_error handling on platforms that
support RAS. Using amdgpu_ras_enable as a temporary solution does not seem
necessary. Even if BACO RAS recovery is not stable, the result is the same as
with legacy fatal_error handling: the user has to reboot the node manually.
So the new soc reset use cases are:
* XGMI (without RAS): use PSP mode1 based chain reset
* RAS enabled (with PMFW 40.52 and onwards): use BACO based RAS recovery
* RAS enabled (with PMFW prior to 40.52): use legacy fatal_error handling
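The three cases above could be sketched roughly as follows. This is only an illustration: the enum, the macro, and pick_soc_reset_action() are hypothetical names, the 0x283400 encoding of "40.52" is taken from the patch under discussion, and the fall-through case (no XGMI, no RAS) is not covered by the list, so it is marked as unspecified.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration only. */
enum soc_reset_action {
	SOC_RESET_MODE1_CHAIN,        /* PSP mode1 based chain reset */
	SOC_RESET_BACO_RAS_RECOVERY,  /* BACO based RAS recovery */
	SOC_RESET_LEGACY_FATAL_ERROR, /* legacy fatal_error handling */
	SOC_RESET_UNSPECIFIED,        /* case not listed in this mail */
};

/* Assumed encoding of PMFW 40.52, as used in the patch. */
#define PMFW_40_52 0x283400u

static enum soc_reset_action pick_soc_reset_action(bool in_hive,
						   bool ras_enabled,
						   uint32_t pmfw_version)
{
	/* RAS platforms: the PMFW version alone decides. */
	if (ras_enabled)
		return pmfw_version >= PMFW_40_52 ?
			SOC_RESET_BACO_RAS_RECOVERY :
			SOC_RESET_LEGACY_FATAL_ERROR;

	/* XGMI without RAS keeps the mode1 chain reset. */
	if (in_hive)
		return SOC_RESET_MODE1_CHAIN;

	return SOC_RESET_UNSPECIFIED;
}
```

Note that the sketch follows the "40.52 and onwards" wording and therefore uses >=, whereas the patch compares fw_version <= 0x283400 and so treats exactly that version as too old for BACO.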
Anything else?
Regards,
Hawking
-----Original Message-----
From: Le Ma <[email protected]>
Sent: November 27, 2019 17:15
To: [email protected]
Cc: Zhang, Hawking <[email protected]>; Chen, Guchun <[email protected]>; Zhou1, Tao
<[email protected]>; Li, Dennis <[email protected]>; Deucher, Alexander
<[email protected]>; Ma, Le <[email protected]>
Subject: [PATCH 06/10] drm/amdgpu: add condition to enable baco for xgmi/ras
case
Avoid changing the default reset behavior for production cards by checking
that amdgpu_ras_enable equals 2. Also, only new enough SMU ucode supports BACO
for the xgmi/ras case.
Change-Id: I07c3e6862be03e068745c73db8ea71f428ecba6b
Signed-off-by: Le Ma <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/soc15.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 951327f..6202333 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -577,7 +577,9 @@ soc15_asic_reset_method(struct amdgpu_device *adev)
 			struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, 0);
 			struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
-			if (hive || (ras && ras->supported))
+			if ((hive || (ras && ras->supported)) &&
+			    (amdgpu_ras_enable != 2 ||
+			    adev->pm.fw_version <= 0x283400))
 				baco_reset = false;
 		}
 		break;
--
2.7.4
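
For readers matching the 0x283400 constant in the hunk to the "PMFW 40.52" wording in the replies, here is a small sketch of the decoding. It assumes the version is packed as major in the upper 16 bits, minor in bits 15..8, and a debug/patch number in the low byte; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper type; the field layout is an assumption. */
struct pmfw_version {
	unsigned int major;
	unsigned int minor;
	unsigned int debug;
};

/* Split a packed version word into its assumed components. */
static struct pmfw_version pmfw_decode(uint32_t fw_version)
{
	struct pmfw_version v;

	v.major = (fw_version >> 16) & 0xffff;
	v.minor = (fw_version >> 8) & 0xff;
	v.debug = fw_version & 0xff;
	return v;
}
```

Under that assumption, 0x28 is 40 and 0x34 is 52, so pmfw_decode(0x283400) gives 40.52.0, and the `<= 0x283400` test in the patch keeps the old behavior for 40.52.0 and anything older.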
_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx