+Cc Michel

On Tue, 2025-11-25 at 15:26 +0100, Christian König wrote:
> On 11/25/25 11:56, Philipp Stanner wrote:
> > > > > 
> > > > > The GPU scheduler has a very similar define, 
> > > > > MAX_WAIT_SCHED_ENTITY_Q_EMPTY which is currently just 1 second.
> > > > > 
> > > > > The real question is what is the maximum amount of time we can wait 
> > > > > for the HW before we should trigger a timeout?
> > > > 
> > > > That's a question only the drivers can answer, which is why I like to
> > > > think that setting global constants constraining all parties is not the
> > > > right thing to do.
> > > 
> > > Exactly that's the reason why I bring that up. I think that drivers 
> > > should be in charge of timeouts is the wrong approach.
> > > 
> > > See the reason why we have the timeout (and documented that it is a must 
> > > have) is because we have both core memory management as well a desktop 
> > > responsiveness depend on it.
> > 
> > Good and well, but then patch 4 becomes even more problematic:
> > 
> > So we'd just have drivers fire warnings, and then they would still have
> > the freedom to set timeouts for drm/sched, as long as those timeouts
> > are smaller than your new global constant.
> > 
> > Why then not remove drm/sched's timeout parameter API completely and
> > always use your maximum value internally in drm/sched? Or maybe
> > truncate it with a warning?
> 
> I have considered that as well, but then thought that we should at least give 
> end users the possibility to override the timeout while still tainting the 
> kernel so that we know about this in bug reports, core dumps etc...
> 
> > "Maximum timeout parameter exceeded, truncating to %ld.\n"
> > 
> > I suppose some drivers want even higher responsiveness than those 2
> > seconds.
> 
> As far as I know some medical use cases for example have timeouts like 
> 100-200ms. But again that is the use case and not the driver.
> 
> > I do believe that more of the driver folks should be made aware of this
> > intended change.
> 
> I have no real intention of actually pushing those patches, at least not as 
> they are. I just wanted to kick of some discussion.

Can you then please use --rfc when creating such patches in the future?
That way you won't cause my heart rate to increase, searching for
immediate danger :D

> 
> > > 
> > > > What is even your motivation? What problem does this solve? Is the OOM
> > > > killer currently hanging for anyone? Can you link a bug report?
> > > 
> > > I'm not sure if we have an external bug report (we have an internal one), 
> > > but for amdgpu there were customer complains that 10 seconds is to long.
> > > 
> > > So we changed it to 2 seconds for amdgpu, and now there are complains 
> > > from internal AMD teams that 2 seconds is to short.
> > > 
> > > While working on that I realized that the timeout is actually not driver 
> > > dependent at all.
> > > 
> > > What can maybe argued is that a desktop system should have a shorter 
> > > timeout than some server, but that one driver needs a different timeout 
> > > than another driver doesn't really makes sense to me.
> > > 
> > > I mean what is actually HW dependent on the requirement that I need a 
> > > responsive desktop system?
> > 
> > I suppose some drivers are indeed only used for server hardware. And
> > for compute you might not care about responsiveness as long as your
> > result drops off at some point. But there's cloud gaming, too..
> 
> Good point with the cloud gaming.
> 
> > I agree that distinguishing the use case that way is not ideal.
> > However, who has the knowledge of how the hardware is being used by
> > customers / users, if not the driver?
> 
> Well the end user.
> 
> Maybe we should move the whole timeout topic into the DRM layer or the 
> scheduler component.

Who's the "user"? The entire system? One process sitting on top of its
ioctl and file descriptor?

That question plays into answering how and where timeouts should be
configured.


One might ask himself if then a kernel parameter would be the right way
to configure it. I'm not very experienced with the desires of
userspace.

I sumond Michel Dänzer to share his wisdom!


> 
> Something like 2 seconds default (which BTW is the default on Windows as 
> well), which can be overridden on a global, per device, per queue name basis.

I mean, the drivers can already set it per device. It seems to me that
what you actually want is finer control?

For Nouveau with its firmware scheduler having a timeout at all just
doesn't make much sense anywayys.

 * If a fw ring hangs, it hangs, and a shorter timeout will just have
   your app crash sooner.
 * If it's laggy and slow, it's laggy and slow, but with a high timeout
   at least still usable.
 * And if it's compute and slow, you at least get your results at some
   point.

But having a lower timeout wouldn't really repair anything, or am I
mistaken?

> 
> And 10 seconds maximum with only a warning that a not default timeout is used 
> and everything above 10 seconds taints the kernel and should really only be 
> used for testing/debugging.
> 
> Thoughts?


The most important thing for me regarding your RFC is that we don't add
shiny warnings by declaring driver behavior invalid that was
operational for years.

The most conservative way would be to send patches to the respective
drivers, setting their timeouts to the new desired defaults, and then
adding warnings so that future drivers become aware.


P.

Reply via email to