[AMD Official Use Only - AMD Internal Distribution Only]

It would be better to add a RAS support check as well, since RAS is not supported on every ASIC.

Tao

> -----Original Message-----
> From: Xie, Patrick <[email protected]>
> Sent: Friday, July 11, 2025 3:47 PM
> To: Lazar, Lijo <[email protected]>; [email protected]
> Cc: Zhou1, Tao <[email protected]>
> Subject: RE: [PATCH 1/2] drm/amdgpu: refine eeprom data check
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Thanks, will add this NULL check
>
> -----Original Message-----
> From: Lazar, Lijo <[email protected]>
> Sent: Friday, July 11, 2025 3:17 PM
> To: Xie, Patrick <[email protected]>; [email protected]
> Cc: Zhou1, Tao <[email protected]>
> Subject: Re: [PATCH 1/2] drm/amdgpu: refine eeprom data check
>
>
>
> On 7/11/2025 8:10 AM, ganglxie wrote:
> > add eeprom data checksum check before driver unload. reset eeprom and
> > save correct data to eeprom when check failed
> >
> > Signed-off-by: ganglxie <[email protected]>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  1 +
> >  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 25 +++++++++++++++++++
> >  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h    |  2 ++
> >  3 files changed, 28 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 571b70da4562..1009b60f6ae4 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -2560,6 +2560,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
> >  	struct drm_device *dev = pci_get_drvdata(pdev);
> >  	struct amdgpu_device *adev = drm_to_adev(dev);
> >
> > +	amdgpu_ras_eeprom_check_and_recover(adev);
> >  	amdgpu_xcp_dev_unplug(adev);
> >  	amdgpu_gmc_prepare_nps_mode_change(adev);
> >  	drm_dev_unplug(dev);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 2af14c369bb9..2458c67526c9 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -1522,3 +1522,28 @@ int amdgpu_ras_eeprom_check(struct
> > amdgpu_ras_eeprom_control *control)
> >
> >  	return res < 0 ? res : 0;
> >  }
> > +
> > +void amdgpu_ras_eeprom_check_and_recover(struct amdgpu_device *adev)
> > +{
> > +	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>
> Doesn't this require a NULL check?
>
> Thanks,
> Lijo
>
> > +	struct amdgpu_ras_eeprom_control *control = &ras->eeprom_control;
> > +	int res = 0;
> > +
> > +	if (!control->is_eeprom_valid)
> > +		return;
> > +	res = __verify_ras_table_checksum(control);
> > +	if (res) {
> > +		dev_warn(adev->dev,
> > +			"RAS table incorrect checksum or error:%d, try to
> > recover\n",
> > +			res);
> > +		if (!amdgpu_ras_eeprom_reset_table(control))
> > +			if (!amdgpu_ras_save_bad_pages(adev, NULL))
> > +				if (!__verify_ras_table_checksum(control)) {
> > +					dev_info(adev->dev, "RAS table
> > recovery succeed\n");
> > +					return;
> > +				}
> > +		dev_err(adev->dev, "RAS table recovery failed\n");
> > +		control->is_eeprom_valid = false;
> > +	}
> > +	return;
> > +}
> > \ No newline at end of file
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> > index 35c69ac3dbeb..ebfca4cb5688 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
> > @@ -161,6 +161,8 @@ void amdgpu_ras_debugfs_set_ret_size(struct
> > amdgpu_ras_eeprom_control *control);
> >
> >  int amdgpu_ras_eeprom_check(struct amdgpu_ras_eeprom_control
> > *control);
> >
> > +void amdgpu_ras_eeprom_check_and_recover(struct amdgpu_device *adev);
> > +
> >  extern const struct file_operations
> > amdgpu_ras_debugfs_eeprom_size_ops;
> >  extern const struct file_operations
> > amdgpu_ras_debugfs_eeprom_table_ops;
> >
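
For illustration only, the two review comments in this thread (the NULL check on the RAS context and the RAS support check) could be combined as early returns ahead of the checksum logic. The sketch below is standalone C with mock types, not driver code: `mock_device`, `mock_ras`, `mock_eeprom_control`, the `ras_supported` and `checksum_ok` fields, and `mock_check_and_recover()` are all hypothetical stand-ins, and the recovery step from the patch is reduced to marking the table invalid. How the driver actually tests for RAS support is not stated in the thread, so a plain flag is assumed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock stand-ins for the driver structures; only the fields used below. */
struct mock_eeprom_control {
	bool is_eeprom_valid;
	bool checksum_ok;	/* hypothetical: result the checksum verify would give */
};

struct mock_ras {
	struct mock_eeprom_control eeprom_control;
};

struct mock_device {
	bool ras_supported;	/* hypothetical flag: RAS is available on this ASIC */
	struct mock_ras *ras;	/* NULL when no RAS context exists */
};

enum check_result { CHECK_SKIPPED, CHECK_TABLE_OK, CHECK_TABLE_INVALIDATED };

static enum check_result mock_check_and_recover(struct mock_device *adev)
{
	struct mock_ras *ras;
	struct mock_eeprom_control *control;

	/* RAS support check: nothing to do on ASICs without RAS. */
	if (!adev || !adev->ras_supported)
		return CHECK_SKIPPED;

	/* NULL check: only dereference the context once it is known to exist. */
	ras = adev->ras;
	if (!ras)
		return CHECK_SKIPPED;

	control = &ras->eeprom_control;
	if (!control->is_eeprom_valid)
		return CHECK_SKIPPED;

	if (control->checksum_ok)
		return CHECK_TABLE_OK;

	/* Recovery is omitted in this sketch; treat it as failed and invalidate. */
	control->is_eeprom_valid = false;
	return CHECK_TABLE_INVALIDATED;
}
```

Taking the address of `eeprom_control` only after both guards pass keeps the pointer arithmetic on `ras` from happening when the context is missing.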
