On Tue, 2025-11-25 at 09:48 +0100, Christian König wrote:
> On 11/25/25 09:13, Philipp Stanner wrote:
> > On Tue, 2025-11-25 at 09:03 +0100, Christian König wrote:
> > > On 11/25/25 08:55, Philipp Stanner wrote:

[…]

> > > > 
> > > > HZ can change depending on the config. Is that really a good choice? I
> > > > could see racy situations arising in some configs vs. others.
> > > 
> > > 2*HZ is always two seconds expressed in number of jiffies; I can use 
> > > msecs_to_jiffies(2000) to make that more obvious.
> > 
> > On AMD64 maybe. What about the other architectures?
> 
> HZ is defined as jiffies per second, so even if it changes to 10, 100 or 1000 
> depending on the architecture, 2*HZ is always two seconds expressed in jiffies.
> 
> The HZ define is actually there to make it architecture independent.

<german English> Again what learned </german English>

Though the documentation for such a central feature is a bit thin.
Anyway, msecs_to_jiffies() is more readable, yes. Many drivers prefer
it, too.
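
For reference, a minimal sketch of the equivalence (illustrative only, not
taken from the series; just the standard <linux/jiffies.h> helpers):

#include <linux/jiffies.h>
#include <linux/bug.h>

/* Both expressions yield the same number of jiffies for a two second
 * timeout, regardless of CONFIG_HZ; the second form just states the
 * intent in milliseconds.
 */
static unsigned long two_second_timeout(void)
{
        unsigned long a = 2 * HZ;                 /* two seconds in jiffies */
        unsigned long b = msecs_to_jiffies(2000); /* same duration */

        WARN_ON(a != b);
        return b;
}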


> 
> > > 
> > > The GPU scheduler has a very similar define, 
> > > MAX_WAIT_SCHED_ENTITY_Q_EMPTY, which is currently just 1 second.
> > > 
> > > The real question is what is the maximum amount of time we can wait for 
> > > the HW before we should trigger a timeout?
> > 
> > That's a question only the drivers can answer, which is why I like to
> > think that setting global constants constraining all parties is not the
> > right thing to do.
> 
> Exactly, that's the reason why I bring this up. I think that having drivers be 
> in charge of timeouts is the wrong approach.
> 
> See, the reason why we have the timeout (and documented that it is a must 
> have) is that both core memory management and desktop responsiveness 
> depend on it.

All well and good, but then patch 4 becomes even more problematic:

So we'd just have drivers fire warnings, and then they would still have
the freedom to set timeouts for drm/sched, as long as those timeouts
are smaller than your new global constant.

Why not then remove drm/sched's timeout parameter from the API completely
and always use your maximum value internally in drm/sched? Or maybe
truncate it with a warning?

"Maximum timeout parameter exceeded, truncating to %ld.\n"

I suppose some drivers want even higher responsiveness than those 2
seconds.

I do believe that more of the driver folks should be made aware of this
intended change.

> 
> > What is even your motivation? What problem does this solve? Is the OOM
> > killer currently hanging for anyone? Can you link a bug report?
> 
> I'm not sure if we have an external bug report (we have an internal one), but 
> for amdgpu there were customer complaints that 10 seconds is too long.
> 
> So we changed it to 2 seconds for amdgpu, and now there are complaints from 
> internal AMD teams that 2 seconds is too short.
> 
> While working on that, I realized that the timeout is actually not driver 
> dependent at all.
> 
> What can maybe be argued is that a desktop system should have a shorter 
> timeout than some server, but that one driver needs a different timeout than 
> another doesn't really make sense to me.
> 
> I mean, what is actually HW dependent about the requirement that I need a 
> responsive desktop system?

I suppose some drivers are indeed only used for server hardware. And
for compute you might not care about responsiveness as long as your
result drops off at some point. But there's cloud gaming, too...

I agree that distinguishing the use case that way is not ideal.
However, who has the knowledge of how the hardware is being used by
customers / users, if not the driver?


P.
