On 17/11/2025 15:44, Boris Brezillon wrote:
> On Mon, 17 Nov 2025 14:02:35 +0000
> Steven Price <[email protected]> wrote:
>
>> On 13/11/2025 10:39, Boris Brezillon wrote:
>>> We have seen a few cases where the whole memory subsystem is blocked
>>> and flush operations never complete. When that happens, we want to:
>>>
>>> - schedule a reset, so we can recover from this situation
>>> - in the reset path, we need to reset the pending_reqs so we can send
>>>   new commands after the reset
>>> - if more panthor_gpu_flush_caches() operations are queued after
>>>   the timeout, we skip them and return -EIO directly to avoid needless
>>>   waits (the memory block won't miraculously work again)
>>
>> You've removed the WARN from this last case. Is this intentional? I
>> agree the recovery is better, but I don't think we expect this to happen
>> - so it's pointing to something else being broken.
>
> I did because there's a way for the UMD to trigger that (see the link
> to the bug in the cover letter) without any mitigation we can put in
> place kernel side, other than the GPU reset I'm adding here. I
> tend to use WARN_ON()s only for things the kernel code has control on,
> not stuff users can force the kernel driver into. Note that I kept the
> drm_err(), so we still have a trace of such errors in the logs (along
> with some timeouts).
Ah, fair enough - would be good to put that in the commit message ;) The
cover letter for this posting doesn't seem to have the link either.

With an updated commit message:

Reviewed-by: Steven Price <[email protected]>

>>
>>>
>>> v2:
>>> - New patch
>>>
>>> Fixes: 5cd894e258c4 ("drm/panthor: Add the GPU logical block")
>>> Signed-off-by: Boris Brezillon <[email protected]>
>>> ---
>>>  drivers/gpu/drm/panthor/panthor_gpu.c | 19 ++++++++++++-------
>>>  1 file changed, 12 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
>>> index eda670229184..abd2fde04da9 100644
>>> --- a/drivers/gpu/drm/panthor/panthor_gpu.c
>>> +++ b/drivers/gpu/drm/panthor/panthor_gpu.c
>>> @@ -283,38 +283,42 @@ int panthor_gpu_l2_power_on(struct panthor_device *ptdev)
>>>  int panthor_gpu_flush_caches(struct panthor_device *ptdev,
>>>                               u32 l2, u32 lsc, u32 other)
>>>  {
>>> -        bool timedout = false;
>>>          unsigned long flags;
>>> +        int ret = 0;
>>>
>>>          /* Serialize cache flush operations. */
>>>          guard(mutex)(&ptdev->gpu->cache_flush_lock);
>>>
>>>          spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
>>> -        if (!drm_WARN_ON(&ptdev->base,
>>> -                         ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED)) {
>>> +        if (!(ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED)) {
>>>                  ptdev->gpu->pending_reqs |= GPU_IRQ_CLEAN_CACHES_COMPLETED;
>>>                  gpu_write(ptdev, GPU_CMD, GPU_FLUSH_CACHES(l2, lsc, other));
>>> +        } else {
>>> +                ret = -EIO;
>>>          }
>>>          spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
>>>
>>> +        if (ret)
>>> +                return ret;
>>> +
>>>          if (!wait_event_timeout(ptdev->gpu->reqs_acked,
>>>                                  !(ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED),
>>>                                  msecs_to_jiffies(100))) {
>>>                  spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
>>>                  if ((ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED) != 0 &&
>>>                      !(gpu_read(ptdev, GPU_INT_RAWSTAT) & GPU_IRQ_CLEAN_CACHES_COMPLETED))
>>> -                        timedout = true;
>>> +                        ret = -ETIMEDOUT;
>>>                  else
>>>                          ptdev->gpu->pending_reqs &= ~GPU_IRQ_CLEAN_CACHES_COMPLETED;
>>>                  spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
>>>          }
>>>
>>> -        if (timedout) {
>>> +        if (ret) {
>>> +                panthor_device_schedule_reset(ptdev);
>>>                  drm_err(&ptdev->base, "Flush caches timeout");
>>> -                return -ETIMEDOUT;
>>>          }
>>>
>>> -        return 0;
>>> +        return ret;
>>>  }
>>>
>>>  /**
>>> @@ -354,6 +358,7 @@ int panthor_gpu_soft_reset(struct panthor_device *ptdev)
>>>                  return -ETIMEDOUT;
>>>          }
>>>
>>> +        ptdev->gpu->pending_reqs = 0;
>>>          return 0;
>>>  }
>>>
>>
>
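For anyone skimming the thread, here is a small standalone C model of the behaviour the patch is after (toy userspace code with made-up names such as toy_flush_caches; it is not the panthor driver code): a flush that times out schedules a reset and leaves the pending flag set, later flushes fail fast with -EIO instead of waiting again, and the reset path clears the pending state so flushes work after recovery.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the flush/timeout/reset interaction described above. */
struct toy_gpu {
        bool flush_pending;   /* stands in for GPU_IRQ_CLEAN_CACHES_COMPLETED in pending_reqs */
        bool reset_scheduled; /* stands in for a scheduled GPU reset */
        bool hw_stuck;        /* models a memory subsystem that never completes flushes */
};

static int toy_flush_caches(struct toy_gpu *gpu)
{
        if (gpu->flush_pending) {
                /* A previous flush never completed: fail fast instead of waiting again. */
                return -EIO;
        }

        gpu->flush_pending = true;

        if (gpu->hw_stuck) {
                /* Timed out waiting for the completion IRQ: ask for a reset. */
                gpu->reset_scheduled = true;
                return -ETIMEDOUT;
        }

        gpu->flush_pending = false;
        return 0;
}

static void toy_reset(struct toy_gpu *gpu)
{
        /* The reset path clears pending requests so new commands can be issued. */
        gpu->flush_pending = false;
        gpu->reset_scheduled = false;
        gpu->hw_stuck = false;
}

int main(void)
{
        struct toy_gpu gpu = { .hw_stuck = true };

        printf("first flush:  %d\n", toy_flush_caches(&gpu)); /* -ETIMEDOUT, reset scheduled */
        printf("second flush: %d\n", toy_flush_caches(&gpu)); /* -EIO, no extra 100ms wait */
        toy_reset(&gpu);
        printf("after reset:  %d\n", toy_flush_caches(&gpu)); /* 0 */
        return 0;
}

Built with a plain C compiler, this prints -ETIMEDOUT, then -EIO, then 0 (typically -110, -5, 0 on Linux), matching the sequence described in the commit message.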
