On Wed, 2025-11-26 at 16:03 +0100, Christian König wrote:
> On 11/26/25 13:37, Philipp Stanner wrote:
> > On Wed, 2025-11-26 at 13:31 +0100, Christian König wrote:
> > > […]
> > >
> > > Well the question is how do you detect *reliably* that there is
> > > still forward progress?
> >
> > My understanding is that that's impossible, since the internals of
> > command submissions are only really understood by userspace, which
> > submits them.
>
> Right, but we can still try to do our best in the kernel to mitigate
> the situation.
>
> I think for now amdgpu will implement something like checking whether
> the HW still makes progress after a timeout, but only with a limited
> number of retries until we say that's it and reset anyway.

Oh oh, isn't that our dear hang_limit? :)

We agree that you can never really know whether userspace just
submitted a while(true) job, don't we? Even if some GPU register still
indicates "progress".

> > I think the long-term solution can only be fully fledged GPU
> > scheduling with preemption. That's why we don't need such a timeout
> > mechanism for userspace processes: the scheduler simply interrupts
> > and lets someone else run.
>
> Yeah, absolutely.
>
> > My hope would be that in the mid-term future we'd get firmware
> > rings that can be preempted through a firmware call for all major
> > hardware. Then a huge share of our problems would disappear.
>
> At least on AMD HW, pre-emption is actually horribly unreliable as
> well.

Do you mean new GPUs with firmware scheduling, or what is "HW
pre-emption"?

With firmware interfaces, my hope would be that you could simply tell
the firmware:

stop_running_ring(nr_of_ring)
// time slice for someone else
start_running_ring(nr_of_ring)

Thereby getting real scheduling and all that, and eliminating many
other problems we know well from drm/sched.

> Userspace basically needs to co-operate and provide a buffer where
> the state on a pre-emption is saved into.

That's uncool. With CPU preemption, all that is done automatically via
the process's pages.

P.
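[Editorial note: the bounded-retry timeout idea discussed above could be sketched roughly as below. This is a minimal, hypothetical illustration, not real amdgpu code; read_progress_counter(), trigger_reset() and MAX_RETRIES are made-up names standing in for whatever register read and reset path the driver actually uses.]

```c
#include <stdbool.h>

/* Hypothetical: give a stalled job up to MAX_RETRIES extra time
 * slices as long as some HW progress counter keeps advancing. */
#define MAX_RETRIES 3

static unsigned long fake_counter;   /* stand-in for a HW register */
static bool reset_triggered;

static unsigned long read_progress_counter(void)
{
	return fake_counter;
}

static void trigger_reset(void)
{
	reset_triggered = true;
}

/* Called from the job-timeout handler: if the counter advanced since
 * the last timeout and we still have retries left, extend the timeout;
 * otherwise reset anyway. Note this can still be fooled by a
 * while(true) job that keeps the counter moving, as the mail says. */
static void on_timeout(void)
{
	static unsigned long last;
	static int retries;
	unsigned long cur = read_progress_counter();

	if (cur != last && retries < MAX_RETRIES) {
		last = cur;
		retries++;
		return;	/* grant one more time slice */
	}
	retries = 0;
	trigger_reset();
}
```

Even with apparent progress on every check, the retry budget is exhausted after MAX_RETRIES extensions and the reset fires anyway, which is exactly the bounded behavior (and the hang_limit resemblance) under discussion.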
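[Editorial note: the stop_running_ring()/start_running_ring() idea above amounts to round-robin time slicing over firmware rings. A toy sketch, assuming those two calls existed; the ring state array and tick function are purely illustrative glue, not any real firmware interface.]

```c
#define NUM_RINGS 3

static int running[NUM_RINGS];	/* stand-in for firmware ring state */
static int current_ring;

/* The two hypothetical firmware calls from the mail. */
static void stop_running_ring(int nr)  { running[nr] = 0; }
static void start_running_ring(int nr) { running[nr] = 1; }

/* Called on a timer tick: take the slice away from the current ring
 * and hand it to the next one, round-robin. No userspace cooperation
 * needed, which is the whole appeal compared to save-state buffers. */
static void time_slice_tick(void)
{
	stop_running_ring(current_ring);
	current_ring = (current_ring + 1) % NUM_RINGS;
	start_running_ring(current_ring);
}
```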
