On 11/25/25 18:02, Lucas Stach wrote:
>>> I agree that distinguishing the use case that way is not ideal.
>>> However, who has the knowledge of how the hardware is being used by
>>> customers / users, if not the driver?
>>
>> Well the end user.
>>
>> Maybe we should move the whole timeout topic into the DRM layer or the 
>> scheduler component.
>>
>> Something like 2 seconds default (which BTW is the default on Windows as 
>> well), which can be overridden on a global, per device, per queue name basis.
>>
>> And 10 seconds maximum, with only a warning that a non-default timeout is 
>> used; everything above 10 seconds taints the kernel and should really 
>> only be used for testing/debugging.
> 
> The question really is what you want to do after you hit the (lowered)
> timeout? Users get grumpy if you block things for 10 seconds, but they
> get equally if not more grumpy when you kick out a valid workload that
> just happens to need a lot of GPU time.

Yeah, exactly, that summarizes the problem pretty well.
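
For concreteness, the policy proposed above (2 s default, overridable up to 10 s 
with a warning, taint beyond that) could look roughly like this. All names and 
the struct are invented for illustration; nothing here is existing DRM API:

```c
#include <stdbool.h>

#define SCHED_TIMEOUT_DEFAULT_MS	2000	/* matches the Windows default */
#define SCHED_TIMEOUT_TAINT_MS		10000	/* hard ceiling before tainting */

struct timeout_policy {
	unsigned int timeout_ms;	/* effective timeout */
	bool warn;			/* non-default timeout in use */
	bool taint;			/* above 10 s: testing/debug only */
};

static struct timeout_policy apply_timeout_policy(unsigned int requested_ms)
{
	struct timeout_policy p = { requested_ms, false, false };

	if (!requested_ms)
		p.timeout_ms = SCHED_TIMEOUT_DEFAULT_MS; /* fall back to default */

	if (p.timeout_ms != SCHED_TIMEOUT_DEFAULT_MS)
		p.warn = true;		/* warn that a non-default timeout is used */

	if (p.timeout_ms > SCHED_TIMEOUT_TAINT_MS)
		p.taint = true;		/* taint the kernel */

	return p;
}
```

The override would then be applied globally, per device or per queue name, with 
the same clamping everywhere.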

> Fences are only defined to signal eventually, with no real concept of a
> timeout. IMO all timeouts waiting for fences should be long enough to
> only be considered last resort. You may want to give the user some
> indication of a failed fence wait instead of stalling indefinitely, but
> you really only want to do this after a quite long timeout, not in a
> sense of "Sorry, I ran out of patience after 2 seconds".
> 
> Sure memory management depends on fences making forward progress, but
> mm also depends on scheduled writeback making forward progress. You
> don't kick out writeback requests after an arbitrary timeout just
> because the backing storage happens to be loaded heavily.
> 
> This BTW is also why etnaviv has always had a quite short timeout of
> 500ms, with the option to extend the timeout when the GPU is still
> making progress. We don't ever want to shoot down valid workloads (we
> have some that need a few seconds to upload textures, etc on our wimpy
> GPU), but you also don't want to wait multiple seconds until you detect
> a real GPU hang.

That is a really good point. We considered that as well, but then abandoned the 
idea; see below for the background.

What we could also do is set a flag on the fence when a process is killed and 
then wait for that fence to signal before cleaning up. Going to prototype that.

> So we use the short scheduler timeout to check in on the GPU and see if
> it is still making progress (for graphics workloads by looking at the
> frontend position within the command buffer and current primitive ID).
> If we can deduce that the GPU is stuck we do the usual reset/recovery
> dance within a reasonable reaction time, acceptable to users hitting a
> real GPU hang. But if the GPU is making progress we will give an
> infinite number of timeout extensions with no global timeout at all,
> only fulfilling the eventual signaling guarantee of the fence.
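
As I understand it, the extension scheme boils down to something like this 
rough model; the names are invented and the real code lives in the etnaviv 
driver, this is just to make the discussion concrete:

```c
/* Compare a snapshot of the GPU frontend state against the one taken
 * at the previous timeout to decide whether to extend or reset. */
enum timeout_action {
	TIMEOUT_EXTEND,		/* GPU moved: grant another extension */
	TIMEOUT_RESET,		/* stuck: do the reset/recovery dance */
};

struct fe_snapshot {
	unsigned int fe_addr;	/* frontend position in the command buffer */
	unsigned int prim_id;	/* current primitive ID */
};

static enum timeout_action check_progress(struct fe_snapshot *last,
					  struct fe_snapshot cur)
{
	if (cur.fe_addr != last->fe_addr || cur.prim_id != last->prim_id) {
		*last = cur;		/* remember the new position */
		return TIMEOUT_EXTEND;	/* no global cap on extensions */
	}
	return TIMEOUT_RESET;
}
```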

Well, the question is: how do you *reliably* detect that there is still forward 
progress?

I mean, with the DMA engines we can trivially submit work that copies petabytes 
and needs hours or even a day to complete.

Without a global timeout, that is a really nice denial-of-service attack against 
the system if you don't catch it.

Thanks,
Christian.

> 
> Regards,
> Lucas
