amdgpu: break driver init process when it's bad GPU

Chen, Guchun Tue, 28 Jul 2020 07:11:56 -0700

[AMD Public Use]

Hi Dennis,


Please check my response after yours.

Regards,
Guchun

-----Original Message-----
From: Li, Dennis <[email protected]> 
Sent: Tuesday, July 28, 2020 5:43 PM
To: Chen, Guchun <[email protected]>; [email protected]; Deucher, 
Alexander <[email protected]>; Zhang, Hawking <[email protected]>; 
Grodzovsky, Andrey <[email protected]>; Zhou1, Tao <[email protected]>; 
Clements, John <[email protected]>; Lazar, Lijo <[email protected]>; 
Koenig, Christian <[email protected]>; Yang, Stanley 
<[email protected]>
Subject: RE: [PATCH 04/12] drm/amdgpu: break driver init process when it's bad 
GPU

[AMD Official Use Only - Internal Distribution Only]

Hi, Guchun,
      Please see my below comments.

Best Regards
Dennis Li
-----Original Message-----
From: Chen, Guchun <[email protected]> 
Sent: Tuesday, July 28, 2020 3:49 PM
To: [email protected]; Deucher, Alexander 
<[email protected]>; Zhang, Hawking <[email protected]>; Li, Dennis 
<[email protected]>; Grodzovsky, Andrey <[email protected]>; Zhou1, Tao 
<[email protected]>; Clements, John <[email protected]>; Lazar, Lijo 
<[email protected]>; Koenig, Christian <[email protected]>; Yang, 
Stanley <[email protected]>
Cc: Chen, Guchun <[email protected]>
Subject: [PATCH 04/12] drm/amdgpu: break driver init process when it's bad GPU

When retrieving bad gpu tag from eeprom, GPU init should fail as the GPU needs 
to be retired for further check.

v2: Fix spelling typo, correct the condition to detect
    bad gpu tag and refine error message.

v3: Refine function argument name.

Signed-off-by: Guchun Chen <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c     | 12 +++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c        | 18 ++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 10 +++++++++-  
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h |  3 ++-
 4 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2662cd7c8685..30af0dfee1a1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2059,13 +2059,19 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
         * it should be called after amdgpu_device_ip_hw_init_phase2  since
         * for some ASICs the RAS EEPROM code relies on SMU fully functioning
         * for I2C communication which only true at this point.
-        * recovery_init may fail, but it can free all resources allocated by
-        * itself and its failure should not stop amdgpu init process.
+        *
+        * amdgpu_ras_recovery_init may fail, but the upper only cares the
+        * failure from bad gpu situation and stop amdgpu init process
+        * accordingly. For other failed cases, it will still release all
+        * the resource and print error message, rather than returning one
+        * negative value to upper level.
         *
         * Note: theoretically, this should be called before all vram 
allocations
         * to protect retired page from abusing
         */
-       amdgpu_ras_recovery_init(adev);
+       r = amdgpu_ras_recovery_init(adev);
+       if (r)
+               goto init_failed;
 
        if (adev->gmc.xgmi.num_physical_nodes > 1)
                amdgpu_xgmi_add_device(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 3c4c142e9d8a..56e1aeba2d64 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1822,6 +1822,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
        struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
        struct ras_err_handler_data **data;
        uint32_t max_eeprom_records_len = 0;
+       bool exc_err_limit = false;
        int ret;
 
        if (con)
@@ -1843,9 +1844,15 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
        max_eeprom_records_len = amdgpu_ras_eeprom_get_record_max_length();
        amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
 
-       ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
-       if (ret)
+       ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &exc_err_limit);
+       /*
+        * We only fail this calling and halt booting up
+        * when exc_err_limit is true.
+        */
+       if (exc_err_limit) {
+               ret = -EINVAL;
                goto free;
+       }

[Dennis Li] Compared with old codes,  new change miss checking ret.
[Guchun] Yeah, this hits me that another if condition is that ret should be 
checked as well when exc_err_limit is false,
that means there is some problem with eeprom i2c functionality.
It will be addressed in next patch set.
 
        if (con->eeprom_control.num_recs) {
                ret = amdgpu_ras_load_bad_pages(adev); @@ -1868,6 +1875,13 @@ 
int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 out:
        dev_warn(adev->dev, "Failed to initialize ras recovery!\n");
 
+       /*
+        * Except error threshold exceeding case, other failure cases in this
+        * function would not fail amdgpu driver init.
+        */
+       if (!exc_err_limit)
+               ret = 0;
+
        return ret;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 35c0c849d49b..67995b66d7d4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -241,7 +241,8 @@ int amdgpu_ras_eeprom_reset_table(struct 
amdgpu_ras_eeprom_control *control)
 
 }
 
-int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
+int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
+                       bool *exceed_err_limit)

 {
        int ret = 0;
        struct amdgpu_device *adev = to_amdgpu_device(control); @@ -254,6 
+255,8 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
                        .buf    = buff,
        };
 
+       *exceed_err_limit = false;
+
        /* Verify i2c adapter is initialized */
        if (!adev->pm.smu_i2c.algo)
                return -ENOENT;
@@ -282,6 +285,11 @@ int amdgpu_ras_eeprom_init(struct 
amdgpu_ras_eeprom_control *control)
                DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",
                                 control->num_recs);
 
+       } else if ((hdr->header == EEPROM_TABLE_HDR_BAD) &&
+                       (amdgpu_bad_page_threshold != 0)) {
+               *exceed_err_limit = true;
+               DRM_ERROR("Exceeding the bad_page_threshold parameter, "
+                               "disabling the GPU.\n");

[Dennis Li] Why must introduce a new parameter exceed_err_limit?  I think it 
can return -EINVAL directly here.
[Guchun]We need to definitely know what's the error case and decide next step 
concisely. When this variable exceed_err_limit is true, that means
GPU bad tag is detected, and consequently, this scenario will be returned to 
upper layer to halt driver's boot up. For other errors returned by this
function, they may be caused by eeprom i2c functionality, in such case, amdgpu 
driver's probe will not be impacted, as generally, eeprom is
one external device only.

        } else {
                DRM_INFO("Creating new EEPROM table");
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
index b272840cb069..f245b96d9599 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
@@ -77,7 +77,8 @@ struct eeprom_table_record {
        unsigned char mcumc_id;
 }__attribute__((__packed__));
 
-int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control);
+int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
+                       bool *exceed_err_limit);
 int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control *control);
 
 int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
--
2.17.1
_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH 04/12] drm/amdgpu: break driver init process when it's bad GPU

Reply via email to