Thank you for your contribution! Sashiko AI review found 3 potential issue(s) 
to consider:
- [High] Toggling `disable_err_irq` back to false via debugfs leaves hardware 
error interrupts permanently disabled until a GPU reset.
- [High] Writing a restrictive interrupt mask in `a6xx_irq` and `a8xx_irq` 
drops the `CP_SW` interrupt, actively breaking preemption.
- [Medium] The `recover_worker` function unilaterally clears the 
user-controlled `disable_err_irq` debugfs boolean during GPU recovery.
--

commit c227655841882ac95655f120b99ea1eb3c758dc2
Author: Jie Zhang <[email protected]>

drm/msm/a6xx: Fix IRQ storm during msm_recovery test

This commit masks GPU error IRQs during the msm_recovery test to prevent
an interrupt storm when a hang is triggered. It then attempts to re-enable
the interrupts during GPU recovery.

> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c 
> b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> index 2c0bbac43c52b..f1df2514c6132 100644
> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
[ ... ]
> @@ -1275,6 +1275,11 @@ static irqreturn_t a5xx_irq(struct msm_gpu *gpu)
>               status & ~A5XX_RBBM_INT_0_MASK_RBBM_AHB_ERROR);
>  
>       if (priv->disable_err_irq) {
> +             /* Turn off interrupts to avoid interrupt storm */
> +             gpu_write(gpu, REG_A5XX_RBBM_INT_0_MASK,
> +                            A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
> +                            A5XX_RBBM_INT_0_MASK_CP_SW);
> +
>               status &= A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
>                         A5XX_RBBM_INT_0_MASK_CP_SW;
>       }
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c 
> b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 8b3bb2fd433ba..9a4f9d0e17800 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1911,8 +1911,11 @@ static irqreturn_t a6xx_irq(struct msm_gpu *gpu)
>  
>       gpu_write(gpu, REG_A6XX_RBBM_INT_CLEAR_CMD, status);
>  
> -     if (priv->disable_err_irq)
> +     if (priv->disable_err_irq) {
> +             /* Turn off interrupts to avoid interrupt storm */
> +             gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, 
> A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS);

[Severity: High]
If priv->disable_err_irq is managed via debugfs_create_bool(), setting it only
updates the variable in memory.

If a user sets it to true, we mask the hardware here. However, if they
subsequently set it to false, there is no callback or mechanism to restore the
hardware mask back to its default state.

Does this leave hardware error detection (like AHB errors and faults)
permanently crippled until a full GPU reset occurs?

[Severity: High]
Also, writing this mask drops the CP_SW interrupt entirely.

Both a6xx (which handles A7xx) and a8xx rely on CP_SW for preemption completion
events. Physically disabling it in hardware means subsequent preemptions will
never be signaled to the CPU as complete.

Will dropping it cause a spurious GPU hang during the test window, instead
of correctly preserving both CP_CACHE_FLUSH_TS and CP_SW like a5xx_irq() does?

>               status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS;
> +     }
>  
>       if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT)
>               a6xx_fault_detect_irq(gpu);
[ ... ]
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index 9ac7740a87f01..48ac51f4119be 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -552,6 +552,8 @@ static void recover_worker(struct kthread_work *work)
>               msm_update_fence(ring->fctx, fence);
>       }
>  
> +     priv->disable_err_irq = false;

[Severity: Medium]
The disable_err_irq flag is exposed via debugfs for userspace testing tools
(like IGT msm_recovery) to intentionally bypass hardware error interrupts and
validate the software hangcheck fallback.

By forcibly clearing it here upon the first recovery, the kernel silently
overrides the user's explicit configuration.

If a test expects this flag to remain active across multiple subtests or
submissions, won't subsequent hangs unexpectedly fall back to hardware
error detection?

Since hardware interrupts are natively re-enabled by the hardware reset during
recovery, should this debugfs flag remain strictly under user control instead
of being unilaterally cleared?

> +
>       gpu->funcs->recover(gpu);
>  
>       /* retire completed submits, plus the one that hung: */

-- 
Sashiko AI review ยท 
https://sashiko.dev/#/patchset/[email protected]?part=5

Reply via email to