Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider: - [High] Toggling `disable_err_irq` back to false via debugfs leaves hardware error interrupts permanently disabled until a GPU reset. - [High] Writing a restrictive interrupt mask in `a6xx_irq` and `a8xx_irq` drops the `CP_SW` interrupt, actively breaking preemption. - [Medium] The `recover_worker` function unilaterally clears the user-controlled `disable_err_irq` debugfs boolean during GPU recovery. --
commit c227655841882ac95655f120b99ea1eb3c758dc2 Author: Jie Zhang <[email protected]> drm/msm/a6xx: Fix IRQ storm during msm_recovery test This commit masks GPU error IRQs during the msm_recovery test to prevent an interrupt storm when a hang is triggered. It then attempts to re-enable the interrupts during GPU recovery. > diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c > b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c > index 2c0bbac43c52b..f1df2514c6132 100644 > --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c > +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c [ ... ] > @@ -1275,6 +1275,11 @@ static irqreturn_t a5xx_irq(struct msm_gpu *gpu) > status & ~A5XX_RBBM_INT_0_MASK_RBBM_AHB_ERROR); > > if (priv->disable_err_irq) { > + /* Turn off interrupts to avoid interrupt storm */ > + gpu_write(gpu, REG_A5XX_RBBM_INT_0_MASK, > + A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS | > + A5XX_RBBM_INT_0_MASK_CP_SW); > + > status &= A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS | > A5XX_RBBM_INT_0_MASK_CP_SW; > } > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > index 8b3bb2fd433ba..9a4f9d0e17800 100644 > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > @@ -1911,8 +1911,11 @@ static irqreturn_t a6xx_irq(struct msm_gpu *gpu) > > gpu_write(gpu, REG_A6XX_RBBM_INT_CLEAR_CMD, status); > > - if (priv->disable_err_irq) > + if (priv->disable_err_irq) { > + /* Turn off interrupts to avoid interrupt storm */ > + gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, > A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS); [Severity: High] If priv->disable_err_irq is managed via debugfs_create_bool(), setting it only updates the variable in memory. If a user sets it to true, we mask the hardware here. However, if they subsequently set it to false, there is no callback or mechanism to restore the hardware mask back to its default state. Does this leave hardware error detection (like AHB errors and faults) permanently crippled until a full GPU reset occurs? [Severity: High] Also, writing this mask drops the CP_SW interrupt entirely. Both a6xx (which handles A7xx) and a8xx rely on CP_SW for preemption completion events. Physically disabling it in hardware means subsequent preemptions will never be signaled to the CPU as complete. Will dropping it cause a spurious GPU hang during the test window, instead of correctly preserving both CP_CACHE_FLUSH_TS and CP_SW like a5xx_irq() does? > status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS; > + } > > if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT) > a6xx_fault_detect_irq(gpu); [ ... ] > diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c > index 9ac7740a87f01..48ac51f4119be 100644 > --- a/drivers/gpu/drm/msm/msm_gpu.c > +++ b/drivers/gpu/drm/msm/msm_gpu.c > @@ -552,6 +552,8 @@ static void recover_worker(struct kthread_work *work) > msm_update_fence(ring->fctx, fence); > } > > + priv->disable_err_irq = false; [Severity: Medium] The disable_err_irq flag is exposed via debugfs for userspace testing tools (like IGT msm_recovery) to intentionally bypass hardware error interrupts and validate the software hangcheck fallback. By forcibly clearing it here upon the first recovery, the kernel silently overrides the user's explicit configuration. If a test expects this flag to remain active across multiple subtests or submissions, won't subsequent hangs unexpectedly fall back to hardware error detection? Since hardware interrupts are natively re-enabled by the hardware reset during recovery, should this debugfs flag remain strictly under user control instead of being unilaterally cleared? > + > gpu->funcs->recover(gpu); > > /* retire completed submits, plus the one that hung: */ -- Sashiko AI review ยท https://sashiko.dev/#/patchset/[email protected]?part=5
