On 9/2/20 5:56 PM, Bjorn Helgaas wrote:
On Wed, Sep 02, 2020 at 02:42:03PM -0400, Andrey Grodzovsky wrote:
At this point the ASIC is already post reset by the HW/PSP
so the HW not in proper state to be configured for suspension,
some blocks might be even gated and so best is to avoid touching it.

v2: Rename in_dpc to more meaningful name

Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++++++++++++++++++++++++++++++
  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |  6 +++++
  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  6 +++++
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 18 ++++++++------
  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |  3 +++
  6 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c311a3c..b20354f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -992,6 +992,7 @@ struct amdgpu_device {
        atomic_t                        throttling_logging_enabled;
        struct ratelimit_state          throttling_logging_rs;
        uint32_t                        ras_features;
+       bool                            in_pci_err_recovery;
  };
static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 74a1c03..1fbf8a1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, 
uint32_t reg,
  {
        uint32_t ret;
+ if (adev->in_pci_err_recovery)
+               return 0;
I don't know the whole scheme of this, but this looks racy.

It looks like the normal path through this function is the readl(),
which I assume is an MMIO read from the PCI device.  If this is called
after a PCI error occurs, but before amdgpu_pci_slot_reset() sets
adev->in_pci_err_recovery, the readl() will return 0xffffffff.

If this is called after amdgpu_pci_slot_reset() sets
adev->in_pci_err_recovery, it will return 0.  Do you really want those
two different cases?


This is not intended to to avoid hardware accessed by other threads when doing PCI recovery (answers also Denis's question) - in slot reset callback we suspend and then resume the IP blocks so we can bring the SW and HW back to operational. For this we first call IPs suspend and then resume. But the ASIC was already reset by the time we execute this code and so there is a misalignment between the HW and the SW states. The HW is in clean 'fresh poweron or init' state while the SW is in running state. So I want to align SW state with the HW state by calling suspend IPs sequence without touching the HW and so this flag is used to skip all HW registers r/w while I do it. Regarding other threads which might access the registers this should not happen as I sopped all new internal and external command submissions  and took GPU reset device lock so at least we are protected here same as during ordinary GPU reset. Yes, it's not 100% proof and there still might be some accesses from other threads during this time and they will return 0xffffffff values.

Andrey



        if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
                return amdgpu_kiq_rreg(adev, reg);
@@ -4773,7 +4809,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev 
*pdev)
pci_restore_state(pdev); + adev->in_pci_err_recovery = true;
        r = amdgpu_device_ip_suspend(adev);
+       adev->in_pci_err_recovery = false;
        if (r)
                goto out;
_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Reply via email to