On 11/26/25 13:37, Philipp Stanner wrote:
> On Wed, 2025-11-26 at 13:31 +0100, Christian König wrote:
>> On 11/25/25 18:02, Lucas Stach wrote:
>>>>> I agree that distinguishing the use case that way is not ideal.
>>>>> However, who has the knowledge of how the hardware is being used by
>>>>> customers / users, if not the driver?
>>>>
>>>> Well the end user.
>>>>
>>>> Maybe we should move the whole timeout topic into the DRM layer or the 
>>>> scheduler component.
>>>>
>>>> Something like a 2 second default (which BTW is the default on Windows as 
>>>> well), which can be overridden on a global, per-device, or per-queue-name 
>>>> basis.
>>>>
>>>> And a 10 second maximum, with only a warning that a non-default timeout 
>>>> is used; everything above 10 seconds taints the kernel and should really 
>>>> only be used for testing/debugging.
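
As a rough sketch of what such a policy could look like at the scheduler 
level (nothing below is existing DRM API, all names are made up, and 
TAINT_USER is just one possible taint choice):

#define MY_TIMEOUT_DEFAULT_MS	2000	/* 2s default, as on Windows */
#define MY_TIMEOUT_MAX_MS	10000	/* above this: testing/debugging only */

/* Most specific override wins: queue name > device > global > default. */
static unsigned int my_resolve_timeout_ms(unsigned int global_ms,
					  unsigned int device_ms,
					  unsigned int queue_ms)
{
	unsigned int ms = queue_ms  ? queue_ms  :
			  device_ms ? device_ms :
			  global_ms ? global_ms : MY_TIMEOUT_DEFAULT_MS;

	if (ms != MY_TIMEOUT_DEFAULT_MS)
		pr_warn("GPU job timeout overridden to %u ms\n", ms);

	if (ms > MY_TIMEOUT_MAX_MS)
		add_taint(TAINT_USER, LOCKDEP_STILL_OK);

	return ms;
}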
>>>
>>> The question really is what you want to do after you hit the (lowered)
>>> timeout? Users get grumpy if you block things for 10 seconds, but they
>>> get equally if not more grumpy when you kick out a valid workload that
>>> just happens to need a lot of GPU time.
>>
>> Yeah, exactly that summarizes the problem pretty well.
>>
>>> Fences are only defined to signal eventually, with no real concept of a
>>> timeout. IMO all timeouts waiting for fences should be long enough to
>>> only be considered last resort. You may want to give the user some
>>> indication of a failed fence wait instead of stalling indefinitely, but
>>> you really only want to do this after a quite long timeout, not in a
>>> sense of "Sorry, I ran out of patience after 2 seconds".
>>>
>>> Sure memory management depends on fences making forward progress, but
>>> mm also depends on scheduled writeback making forward progress. You
>>> don't kick out writeback requests after an arbitrary timeout just
>>> because the backing storage happens to be loaded heavily.
>>>
>>> This BTW is also why etnaviv has always had a quite short timeout of
>>> 500ms, with the option to extend the timeout when the GPU is still
>>> making progress. We don't ever want to shoot down valid workloads (we
>>> have some that need a few seconds to upload textures, etc on our wimpy
>>> GPU), but you also don't want to wait multiple seconds until you detect
>>> a real GPU hang.
>>
>> That is a really good point. We considered that as well, but then abandoned 
>> the idea; see below for the background.
>>
>> What we could also do is set a flag on the fence when a process is killed 
>> and then wait for that fence to signal so that the kernel can clean up. 
>> Going to prototype that.
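
Roughly along these lines (all names here are hypothetical, just to 
illustrate the idea):

/* One bit from the driver-owned range of dma_fence flags. */
#define MY_FENCE_BIT_OWNER_GONE	(DMA_FENCE_FLAG_USER_BITS + 0)

/*
 * Called when the submitting process goes away (e.g. from the DRM postclose
 * hook): flag the pending fences, then wait for them to signal before
 * tearing down the per-process state.
 */
static void my_flush_pending_jobs(struct my_file_priv *fpriv)
{
	struct my_job *job;

	list_for_each_entry(job, &fpriv->pending_jobs, list)
		set_bit(MY_FENCE_BIT_OWNER_GONE, &job->hw_fence->flags);

	list_for_each_entry(job, &fpriv->pending_jobs, list)
		dma_fence_wait(job->hw_fence, false);
}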
>>
>>> So we use the short scheduler timeout to check in on the GPU and see if
>>> it is still making progress (for graphics workloads by looking at the
>>> frontend position within the command buffer and current primitive ID).
>>> If we can deduce that the GPU is stuck we do the usual reset/recovery
>>> dance within a reasonable reaction time, acceptable to users hitting a
>>> real GPU hang. But if the GPU is making progress we will give an
>>> infinite number of timeout extensions with no global timeout at all,
>>> only fulfilling the eventual signaling guarantee of the fence.
>>
>> Well, the question is: how do you *reliably* detect that there is still 
>> forward progress?
> 
> My understanding is that that's impossible, since the internals of
> command submissions are only really understood by the userspace that
> submits them.

Right, but we can still try to do our best in the kernel to mitigate the 
situation.

I think for now amdgpu will implement something like checking whether the HW 
still makes progress after a timeout, but with only a limited number of 
retries before we say that's it and reset anyway.
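
Very roughly, in the scheduler's timedout_job callback, similar to the 
etnaviv extension scheme quoted above (an illustration of the idea only: 
the my_* helpers and hw_made_progress() are made up, and re-arming the 
scheduler timeout, via drm_sched_stop()/drm_sched_start() or similar, is 
elided since the details vary between kernel versions):

#define MY_MAX_TIMEOUT_EXTENSIONS	4

static enum drm_gpu_sched_stat
my_timedout_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = to_my_job(sched_job);

	/* Spurious timeout: the job actually completed. */
	if (dma_fence_is_signaled(job->hw_fence))
		return DRM_GPU_SCHED_STAT_NOMINAL;

	/*
	 * The front-end still moved since the last check: assume a long but
	 * valid workload and extend the timeout, bounded by a retry budget
	 * so a cleverly looping job cannot stall us forever.
	 */
	if (hw_made_progress(job) &&
	    job->timeout_extensions++ < MY_MAX_TIMEOUT_EXTENSIONS)
		return DRM_GPU_SCHED_STAT_NOMINAL;

	/*
	 * Out of retries: declare the job hung and do the usual
	 * reset/recovery dance.
	 */
	my_gpu_reset_and_recover(job);
	return DRM_GPU_SCHED_STAT_NOMINAL;
}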

> I think the long-term solution can only be fully fledged GPU scheduling
> with preemption. That's why we don't need such a timeout mechanism for
> userspace processes: the scheduler simply interrupts and lets someone
> else run.

Yeah absolutely. 

> 
> My hope would be that in the mid-term future we'd get firmware rings
> that can be preempted through a firmware call for all major hardware.
> Then a huge share of our problems would disappear.

At least on AMD HW, preemption is actually horribly unreliable as well.

Userspace basically needs to cooperate and provide a buffer into which the 
state is saved on preemption.
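
Purely for illustration (this is not the actual amdgpu UAPI), the contract 
looks roughly like this:

/*
 * Hypothetical submission chunk: userspace hands the kernel a buffer large
 * enough for the engine to dump its context state into if the job gets
 * preempted mid-execution. Without such a buffer, mid-job preemption cannot
 * be requested reliably.
 */
struct my_cs_preempt_chunk {
	__u64 save_area_va;	/* GPU VA of the state save buffer */
	__u64 save_area_size;	/* engine-specific minimum size */
};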

> With the current situation, IDK either. My impression so far is that
> letting the drivers and driver programmers decide is the least bad
> choice.

Yeah, agree. It's the least evil thing we can do.

But I now have a plan for how to proceed :)

Thanks for the input,
Christian.

> 
> 
> P.
