On 17/11/2025 15:44, Boris Brezillon wrote:
> On Mon, 17 Nov 2025 14:02:35 +0000
> Steven Price <[email protected]> wrote:
>
>> On 13/11/2025 10:39, Boris Brezillon wrote:
>>> We have seen a few cases where the whole memory subsystem is blocked
>>> and flush operations never complete. When that happens, we want to:
>>>
>>> - schedule a reset, so we can recover from this situation
>>> - in the reset path, we need to reset the pending_reqs so we can send
>>>   new commands after the reset
>>> - if more panthor_gpu_flush_caches() operations are queued after
>>>   the timeout, we skip them and return -EIO directly to avoid needless
>>>   waits (the memory block won't miraculously work again)
>>
>> You've removed the WARN from this last case. Is this intentional? I
>> agree the recovery is better, but I don't think we expect this to happen
>> - so it's pointing to something else being broken.
>
> I did because there's a way for the UMD to trigger that (see the link
> to the bug in the cover letter) without any mitigation we can put in
> place kernel side, other than the GPU reset I'm adding here. I
> tend to use WARN_ON()s only for things the kernel code has control on,
> not stuff users can force the kernel driver into. Note that I kept the
> drm_err(), so we still have a trace of such errors in the logs (along
> with some timeouts).
Ah, fair enough - would be good to put that in the commit message ;) The
cover letter for this posting doesn't seem to have the link either.

With an updated commit message:

Reviewed-by: Steven Price <[email protected]>

>>
>>>
>>> v2:
>>> - New patch
>>>
>>> Fixes: 5cd894e258c4 ("drm/panthor: Add the GPU logical block")
>>> Signed-off-by: Boris Brezillon <[email protected]>
>>> ---
>>>  drivers/gpu/drm/panthor/panthor_gpu.c | 19 ++++++++++++-------
>>>  1 file changed, 12 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
>>> index eda670229184..abd2fde04da9 100644
>>> --- a/drivers/gpu/drm/panthor/panthor_gpu.c
>>> +++ b/drivers/gpu/drm/panthor/panthor_gpu.c
>>> @@ -283,38 +283,42 @@ int panthor_gpu_l2_power_on(struct panthor_device *ptdev)
>>>  int panthor_gpu_flush_caches(struct panthor_device *ptdev,
>>>                               u32 l2, u32 lsc, u32 other)
>>>  {
>>> -        bool timedout = false;
>>>          unsigned long flags;
>>> +        int ret = 0;
>>>
>>>          /* Serialize cache flush operations. */
>>>          guard(mutex)(&ptdev->gpu->cache_flush_lock);
>>>
>>>          spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
>>> -        if (!drm_WARN_ON(&ptdev->base,
>>> -                         ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED)) {
>>> +        if (!(ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED)) {
>>>                  ptdev->gpu->pending_reqs |= GPU_IRQ_CLEAN_CACHES_COMPLETED;
>>>                  gpu_write(ptdev, GPU_CMD, GPU_FLUSH_CACHES(l2, lsc, other));
>>> +        } else {
>>> +                ret = -EIO;
>>>          }
>>>          spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
>>>
>>> +        if (ret)
>>> +                return ret;
>>> +
>>>          if (!wait_event_timeout(ptdev->gpu->reqs_acked,
>>>                                  !(ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED),
>>>                                  msecs_to_jiffies(100))) {
>>>                  spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
>>>                  if ((ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED) != 0 &&
>>>                      !(gpu_read(ptdev, GPU_INT_RAWSTAT) & GPU_IRQ_CLEAN_CACHES_COMPLETED))
>>> -                        timedout = true;
>>> +                        ret = -ETIMEDOUT;
>>>                  else
>>>                          ptdev->gpu->pending_reqs &= ~GPU_IRQ_CLEAN_CACHES_COMPLETED;
>>>                  spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
>>>          }
>>>
>>> -        if (timedout) {
>>> +        if (ret) {
>>> +                panthor_device_schedule_reset(ptdev);
>>>                  drm_err(&ptdev->base, "Flush caches timeout");
>>> -                return -ETIMEDOUT;
>>>          }
>>>
>>> -        return 0;
>>> +        return ret;
>>>  }
>>>
>>>  /**
>>> @@ -354,6 +358,7 @@ int panthor_gpu_soft_reset(struct panthor_device *ptdev)
>>>                  return -ETIMEDOUT;
>>>          }
>>>
>>> +        ptdev->gpu->pending_reqs = 0;
>>>          return 0;
>>>  }
>>>
>>
>
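For anyone skimming the thread, here is a small standalone C model of the behaviour the patch is after (toy userspace code with made-up names such as toy_flush_caches; it is not the panthor driver code): a flush that times out schedules a reset and leaves the pending flag set, later flushes fail fast with -EIO instead of waiting again, and the reset path clears the pending state so flushes work after recovery.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the flush/timeout/reset interaction described above. */
struct toy_gpu {
        bool flush_pending;   /* stands in for GPU_IRQ_CLEAN_CACHES_COMPLETED in pending_reqs */
        bool reset_scheduled; /* stands in for a scheduled GPU reset */
        bool hw_stuck;        /* models a memory subsystem that never completes flushes */
};

static int toy_flush_caches(struct toy_gpu *gpu)
{
        if (gpu->flush_pending) {
                /* A previous flush never completed: fail fast instead of waiting again. */
                return -EIO;
        }

        gpu->flush_pending = true;

        if (gpu->hw_stuck) {
                /* Timed out waiting for the completion IRQ: ask for a reset. */
                gpu->reset_scheduled = true;
                return -ETIMEDOUT;
        }

        gpu->flush_pending = false;
        return 0;
}

static void toy_reset(struct toy_gpu *gpu)
{
        /* The reset path clears pending requests so new commands can be issued. */
        gpu->flush_pending = false;
        gpu->reset_scheduled = false;
        gpu->hw_stuck = false;
}

int main(void)
{
        struct toy_gpu gpu = { .hw_stuck = true };

        printf("first flush:  %d\n", toy_flush_caches(&gpu)); /* -ETIMEDOUT, reset scheduled */
        printf("second flush: %d\n", toy_flush_caches(&gpu)); /* -EIO, no extra 100ms wait */
        toy_reset(&gpu);
        printf("after reset:  %d\n", toy_flush_caches(&gpu)); /* 0 */
        return 0;
}

Built with a plain C compiler, this prints -ETIMEDOUT, then -EIO, then 0 (typically -110, -5, 0 on Linux), matching the sequence described in the commit message.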
