amdgpu: Avoid accessing HW when suspending SW state

Andrey Grodzovsky Thu, 03 Sep 2020 08:25:40 -0700


On 9/2/20 5:56 PM, Bjorn Helgaas wrote:

On Wed, Sep 02, 2020 at 02:42:03PM -0400, Andrey Grodzovsky wrote:

At this point the ASIC is already post reset by the HW/PSP
so the HW not in proper state to be configured for suspension,
some blocks might be even gated and so best is to avoid touching it.


v2: Rename in_dpc to more meaningful name

Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++++++++++++++++++++++++++++++
  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |  6 +++++
  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  6 +++++
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 18 ++++++++------
  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |  3 +++
  6 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c311a3c..b20354f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -992,6 +992,7 @@ struct amdgpu_device {
        atomic_t                        throttling_logging_enabled;
        struct ratelimit_state          throttling_logging_rs;
        uint32_t                        ras_features;
+       bool                            in_pci_err_recovery;
  };

static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 74a1c03..1fbf8a1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, 
uint32_t reg,
  {
        uint32_t ret;

+ if (adev->in_pci_err_recovery)

+               return 0;

I don't know the whole scheme of this, but this looks racy.

It looks like the normal path through this function is the readl(),
which I assume is an MMIO read from the PCI device.  If this is called
after a PCI error occurs, but before amdgpu_pci_slot_reset() sets
adev->in_pci_err_recovery, the readl() will return 0xffffffff.

If this is called after amdgpu_pci_slot_reset() sets
adev->in_pci_err_recovery, it will return 0.  Do you really want those
two different cases?

This is not intended to to avoid hardware accessed by other threads when doingPCI recovery (answers also Denis's question) -in slot reset callback we suspend and then resume the IP blocks so we can bringthe SW and HW back to operational. For thiswe first call IPs suspend and then resume. But the ASIC was already reset by thetime we execute this code and so thereis a misalignment between the HW and the SW states. The HW is in clean 'freshpoweron or init' state while the SW is in running state.So I want to align SW state with the HW state by calling suspend IPs sequencewithout touching the HW and so this flag is used toskip all HW registers r/w while I do it. Regarding other threads which mightaccess the registers this should not happen as I sopped all new internaland external command submissions and took GPU reset device lock so at least weare protected here same as during ordinary GPU reset.Yes, it's not 100% proof and there still might be some accesses from otherthreads during this time and they will return 0xffffffff values.


Andrey

        if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
                return amdgpu_kiq_rreg(adev, reg);
@@ -4773,7 +4809,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev 
*pdev)

pci_restore_state(pdev);+ adev->in_pci_err_recovery = true;

        r = amdgpu_device_ip_suspend(adev);
+       adev->in_pci_err_recovery = false;
        if (r)
                goto out;

_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state

Reply via email to