On 11/25/25 18:02, Lucas Stach wrote:
>>> I agree that distinguishing the use case that way is not ideal.
>>> However, who has the knowledge of how the hardware is being used by
>>> customers / users, if not the driver?
>>
>> Well the end user.
>>
>> Maybe we should move the whole timeout topic into the DRM layer or the
>> scheduler component.
>>
>> Something like 2 seconds default (which BTW is the default on Windows as
>> well), which can be overridden on a global, per device, per queue name
>> basis.
>>
>> And 10 seconds maximum with only a warning that a not default timeout is
>> used and everything above 10 seconds taints the kernel and should really
>> only be used for testing/debugging.
>
> The question really is what you want to do after you hit the (lowered)
> timeout? Users get grumpy if you block things for 10 seconds, but they
> get equally if not more grumpy when you kick out a valid workload that
> just happens to need a lot of GPU time.
Yeah, exactly that summarizes the problem pretty well.

> Fences are only defined to signal eventually, with no real concept of a
> timeout. IMO all timeouts waiting for fences should be long enough to
> only be considered last resort. You may want to give the user some
> indication of a failed fence wait instead of stalling indefinitely, but
> you really only want to do this after a quite long timeout, not in a
> sense of "Sorry, I ran out of patience after 2 seconds".
>
> Sure memory management depends on fences making forward progress, but
> mm also depends on scheduled writeback making forward progress. You
> don't kick out writeback requests after an arbitrary timeout just
> because the backing storage happens to be loaded heavily.
>
> This BTW is also why etnaviv has always had a quite short timeout of
> 500ms, with the option to extend the timeout when the GPU is still
> making progress. We don't ever want to shoot down valid workloads (we
> have some that need a few seconds to upload textures, etc on our wimpy
> GPU), but you also don't want to wait multiple seconds until you detect
> a real GPU hang.

That is a really good point. We considered that as well, but then
abandoned the idea, see below for the background.

What we could also do is to set a flag on the fence when a process is
killed and then wait for that fence to signal so that it can clean up.
Going to prototype that.

> So we use the short scheduler timeout to check in on the GPU and see if
> it is still making progress (for graphics workloads by looking at the
> frontend position within the command buffer and current primitive ID).
> If we can deduce that the GPU is stuck we do the usual reset/recovery
> dance within a reasonable reaction time, acceptable to users hitting a
> real GPU hang. But if the GPU is making progress we will give an
> infinite number of timeout extensions with no global timeout at all,
> only fulfilling the eventual signaling guarantee of the fence.

Well, the question is how do you *reliably* detect that there is still
forward progress?

I mean with the DMA engines we can trivially submit work which copies
petabytes and needs hours or even a day to complete. Without a global
timeout that is a really nice denial-of-service attack against the
system if you don't catch that.

Thanks,
Christian.

>
> Regards,
> Lucas
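
P.S.: For reference, a rough sketch of how such a progress-based timeout
extension can look as a drm_sched timedout_job callback, loosely modeled
on the etnaviv approach described above. The foo_* structures,
foo_gpu_read()/FOO_FE_ADDRESS and the hangcheck_fe_addr bookkeeping are
invented names for illustration only, and the exact stop/re-arm
mechanics differ between scheduler versions:

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

static enum drm_gpu_sched_stat
foo_sched_timedout_job(struct drm_sched_job *sched_job)
{
	struct foo_job *job = to_foo_job(sched_job);	/* hypothetical driver job */
	struct foo_gpu *gpu = job->gpu;
	u32 fe_addr;

	/* The job actually finished in the meantime, the timeout is spurious. */
	if (dma_fence_is_signaled(job->out_fence))
		goto out_no_timeout;

	/*
	 * Front end still moving since the last check? Remember the new
	 * position and extend the timeout instead of resetting.
	 */
	fe_addr = foo_gpu_read(gpu, FOO_FE_ADDRESS);	/* invented register read */
	if (fe_addr != gpu->hangcheck_fe_addr) {
		gpu->hangcheck_fe_addr = fe_addr;
		goto out_no_timeout;
	}

	/* Really stuck: the usual stop/reset/resubmit/restart dance. */
	drm_sched_stop(&gpu->sched, sched_job);
	drm_sched_increase_karma(sched_job);
	foo_gpu_recover(gpu);				/* hypothetical reset */
	drm_sched_resubmit_jobs(&gpu->sched);
	drm_sched_start(&gpu->sched, true);

	return DRM_GPU_SCHED_STAT_NOMINAL;

out_no_timeout:
	/*
	 * Put the job back on the pending list so the re-armed timeout
	 * covers it again, which is what etnaviv does today.
	 */
	list_add(&sched_job->list, &sched_job->sched->pending_list);
	return DRM_GPU_SCHED_STAT_NOMINAL;
}

The point of this shape is that the hardware progress check, not an
arbitrary wall-clock budget, decides whether a reset happens; the
scheduler timeout only determines how often the driver looks.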
