On 6/5/2026 12:20 PM, Rob Clark wrote:
> On Thu, Jun 4, 2026 at 1:10 PM Akhil P Oommen <[email protected]> 
> wrote:
>>
>> From: Jie Zhang <[email protected]>
>>
>> Once a hang is triggered by the msm_recovery test, the gpu error irq
>> remains asserted and triggers an interrupt storm. In the worst case,
>> this IRQ storm lands on the CPU core where the hangcheck timer is
>> scheduled, blocking it from running. This eventually leads to CPU
>> watchdog timeouts.
>>
>> To fix this, mask the gpu error irqs during msm_recovery test and
>> enable them back during the recovery.
>>
>> Fixes: 5edf2750d998 ("drm/msm: Add debugfs to disable hw err handling")
>> Signed-off-by: Jie Zhang <[email protected]>
>> Signed-off-by: Akhil P Oommen <[email protected]>
>> ---
>>  drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 5 +++++
>>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 5 ++++-
>>  drivers/gpu/drm/msm/adreno/a8xx_gpu.c | 5 ++++-
>>  drivers/gpu/drm/msm/msm_gpu.c         | 2 ++
>>  4 files changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c 
>> b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
>> index 2c0bbac43c52..f1df2514c613 100644
>> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
>> @@ -1275,6 +1275,11 @@ static irqreturn_t a5xx_irq(struct msm_gpu *gpu)
>>                 status & ~A5XX_RBBM_INT_0_MASK_RBBM_AHB_ERROR);
>>
>>         if (priv->disable_err_irq) {
>> +               /* Turn off interrupts to avoid interrupt storm */
>> +               gpu_write(gpu, REG_A5XX_RBBM_INT_0_MASK,
>> +                              A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
>> +                              A5XX_RBBM_INT_0_MASK_CP_SW);
>> +
>>                 status &= A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
>>                           A5XX_RBBM_INT_0_MASK_CP_SW;
>>         }
>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c 
>> b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> index 8b3bb2fd433b..9a4f9d0e1780 100644
>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> @@ -1911,8 +1911,11 @@ static irqreturn_t a6xx_irq(struct msm_gpu *gpu)
>>
>>         gpu_write(gpu, REG_A6XX_RBBM_INT_CLEAR_CMD, status);
>>
>> -       if (priv->disable_err_irq)
>> +       if (priv->disable_err_irq) {
>> +               /* Turn off interrupts to avoid interrupt storm */
>> +               gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, 
>> A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS);
>>                 status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS;
>> +       }
>>
>>         if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT)
>>                 a6xx_fault_detect_irq(gpu);
>> diff --git a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c 
>> b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
>> index 9e44fd1ae634..0f6fd35bd587 100644
>> --- a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
>> @@ -1211,8 +1211,11 @@ irqreturn_t a8xx_irq(struct msm_gpu *gpu)
>>
>>         gpu_write(gpu, REG_A8XX_RBBM_INT_CLEAR_CMD, status);
>>
>> -       if (priv->disable_err_irq)
>> +       if (priv->disable_err_irq) {
>> +               /* Turn off interrupts to avoid interrupt storm */
>> +               gpu_write(gpu, REG_A8XX_RBBM_INT_0_MASK, 
>> A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS);
>>                 status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS;
>> +       }
>>
>>         if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT)
>>                 a8xx_fault_detect_irq(gpu);
>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
>> index 9ac7740a87f0..48ac51f4119b 100644
>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>> @@ -552,6 +552,8 @@ static void recover_worker(struct kthread_work *work)
>>                 msm_update_fence(ring->fctx, fence);
>>         }
>>
>> +       priv->disable_err_irq = false;
> 
> Ok, so we rely on recovery to re-enable the error irqs..  that is
> probably ok, given the intended purpose of the debugfs file.  And,
> well, it is debugfs.  But why do we clear disable_err_irq here?

Now that we are updating the IRQ mask register which won't reset until
there is a gpu suspend, its side effect will be felt even after
userspace deasserts the debugfs knob, potentially into the next
testcase. This is different from the older behavior. So, I felt it would
be better to reset this flag during the recovery, considering
msm_recovery is the only user of this knob, afaiu.

I should have explicitly called out this new behavior of disable_err_irq
in the commit text, but I forgot.

-Akhil.

> 
> BR,
> -R
> 
>> +
>>         gpu->funcs->recover(gpu);
>>
>>         /* retire completed submits, plus the one that hung: */
>>
>> --
>> 2.51.0
>>

Reply via email to