On 7/25/2025 1:31 AM, Connor Abbott wrote:
> On Thu, Jul 24, 2025 at 3:48 PM Akhil P Oommen <[email protected]>
> wrote:
>>
>> On 7/21/2025 9:02 PM, Rob Clark wrote:
>>> On Fri, Jul 18, 2025 at 6:50 AM Connor Abbott <[email protected]> wrote:
>>>>
>>>> If there is a flood of faults then the MMU can become saturated while it
>>>> waits for the kernel to process the first fault and resume it, so that
>>>> the GMU becomes blocked. This is mainly a problem when the kernel reads
>>>> the state of the GPU for a devcoredump, because this takes a while. If
>>>> we timeout waiting for the GMU, check if this has happened and retry
>>>> after we're finished.
>>>>
>>>> Signed-off-by: Connor Abbott <[email protected]>
>>>> ---
>>>>  drivers/gpu/drm/msm/adreno/a6xx_gmu.c   | 21 ++++++++++++++++++---
>>>>  drivers/gpu/drm/msm/adreno/a6xx_hfi.c   | 21 ++++++++++++++++++---
>>>>  drivers/gpu/drm/msm/adreno/adreno_gpu.c | 11 +++++++++++
>>>>  drivers/gpu/drm/msm/adreno/adreno_gpu.h |  2 ++
>>>>  4 files changed, 49 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>>>> index 28e6705c6da682c7b41c748e375dda59a6551898..6ec396fab22d194481a76d30b2d36ea5fb662241 100644
>>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>>>> @@ -340,6 +340,7 @@ int a6xx_gmu_set_oob(struct a6xx_gmu *gmu, enum a6xx_gmu_oob_state state)
>>>>         int ret;
>>>>         u32 val;
>>>>         int request, ack;
>>>> +       struct a6xx_gpu *a6xx_gpu = container_of(gmu, struct a6xx_gpu, gmu);
>>>>
>>>>         WARN_ON_ONCE(!mutex_is_locked(&gmu->lock));
>>>>
>>>> @@ -363,9 +364,23 @@ int a6xx_gmu_set_oob(struct a6xx_gmu *gmu, enum a6xx_gmu_oob_state state)
>>>>         /* Trigger the equested OOB operation */
>>>>         gmu_write(gmu, REG_A6XX_GMU_HOST2GMU_INTR_SET, 1 << request);
>>>>
>>>> -       /* Wait for the acknowledge interrupt */
>>>> -       ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
>>>> -               val & (1 << ack), 100, 10000);
>>>> +       do {
>>>> +               /* Wait for the acknowledge interrupt */
>>>> +               ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
>>>> +                       val & (1 << ack), 100, 10000);
>>>> +
>>>> +               if (!ret)
>>>> +                       break;
>>>> +
>>>> +               if (completion_done(&a6xx_gpu->base.fault_coredump_done))
>>
>> I didn't get why this is required. Could you please add a comment?
>
> Without this, if the GMU timed out for some other reason not related
> to SMMU then we'd loop infinitely. This gives up if there isn't
> currently a crashstate pending.

Ah! That API doc somehow confused me.

>
>>
>>>> +                       break;
>>>> +
>>>> +               /* We may timeout because the GMU is temporarily wedged from
>>>> +                * pending faults from the GPU and we are taking a devcoredump.
>>>> +                * Wait until the MMU is resumed and try again.
>>>> +                */
>>>> +               wait_for_completion(&a6xx_gpu->base.fault_coredump_done);

Use the interruptible version? We may reach here from a process context.
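Something like the below, perhaps (just an untested sketch; how the
-ERESTARTSYS should be propagated back to the caller is a separate
question):

                ret = wait_for_completion_interruptible(
                                &a6xx_gpu->base.fault_coredump_done);
                /* Bail out of the retry loop if we caught a signal
                 * instead of continuing to block uninterruptibly.
                 */
                if (ret)
                        break;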
>>>> +       } while (true);
>>>
>>> It is a bit sad to duplicate this nearly identical code twice. And I
>>> wonder if other gmu_poll_timeout()'s need similar treatment? Maybe
>>> Akhil has an opinion about whether we should just build this into
>>> gmu_poll_timeout() instead?
>>
>> Yeah. That make sense. A potential issue I see is that we might be
>> holding both gpu and gmu locks here and the crashstate capture in the pf
>> handler tries to lock gpu, which can result in a dead lock.
>
> I think there would already be a deadlock, or at least timeout in that
> situation now. Any task waiting for the GMU to complete while holding
> the GPU lock would block the crashstate capture from completing and
> allowing the GMU to continue.

A timeout is fine because there is eventual progress, but a deadlock is
not acceptable. Also, userspace can easily trigger this deadlock, which
makes it a security issue.

I agree we need to improve the GMU error handling situation overall. I
thought about this a few years ago, actually. At that time, I thought it
would be simpler if we always did coredump/recovery from a single thread.
Not sure if that idea still makes sense.

On a related topic, stall-on-fault cannot be used in production. The GMU
is very critical as it interacts directly with the SoC power management
subsystems, and every year there is an additional responsibility on the
GMU to do a very time-critical mitigation like CLX, thermal, BCL, etc.
These mitigations should be handled within a few microseconds, so the GMU
should never be blocked, even for microseconds. Apart from that, even the
GPU's internal bus can get locked up in rare cases, which can lead to a
fatal system bus/NoC error when the KMD accesses a register in the GPU
space.

But stall-on-fault is useful while debugging, so any improvements in this
area are useful.

-Akhil.
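P.S. FWIW, a rough, untested sketch of what building the retry into a
gmu_poll_timeout()-style helper could look like, in case it helps the
discussion (the helper name and the extra a6xx_gpu parameter are just
placeholders):

/*
 * Poll as usual, but if we time out while a devcoredump is being
 * captured for a pending fault, wait for the MMU to be resumed and
 * poll again. Stop as soon as the poll succeeds or there is no
 * coredump in flight to wait for.
 */
#define gmu_poll_timeout_retry(gmu, a6xx_gpu, addr, val, cond, interval, timeout) \
({ \
	int __ret; \
	do { \
		__ret = gmu_poll_timeout(gmu, addr, val, cond, \
					 interval, timeout); \
		if (!__ret || \
		    completion_done(&(a6xx_gpu)->base.fault_coredump_done)) \
			break; \
		wait_for_completion(&(a6xx_gpu)->base.fault_coredump_done); \
	} while (true); \
	__ret; \
})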
