On Wed, 2025-11-26 at 16:03 +0100, Christian König wrote:
> On 11/26/25 13:37, Philipp Stanner wrote:
> > On Wed, 2025-11-26 at 13:31 +0100, Christian König wrote:
> > > […]
> > >
> > > Well the question is how do you detect *reliably* that there is
> > > still forward progress?
> >
> > My understanding is that that's impossible, since the internals of
> > command submissions are only really understood by userspace, which
> > submits them.
>
> Right, but we can still try to do our best in the kernel to mitigate
> the situation.
>
> I think for now amdgpu will implement something like checking whether
> the HW still makes progress after a timeout, but only with a limited
> number of retries until we say that's it and reset anyway.

Oh oh, isn't that our dear hang_limit? :)

We agree that you can never really know whether userspace just
submitted a while(true) job, don't we? Even if some GPU register still
indicates "progress".

> > I think the long-term solution can only be fully fledged GPU
> > scheduling with preemption. That's why we don't need such a timeout
> > mechanism for userspace processes: the scheduler simply interrupts
> > and lets someone else run.
>
> Yeah, absolutely.
>
> > My hope would be that in the mid-term future we'd get firmware
> > rings that can be preempted through a firmware call for all major
> > hardware. Then a huge share of our problems would disappear.
>
> At least on AMD HW, pre-emption is actually horribly unreliable as
> well.

Do you mean new GPUs with firmware scheduling, or what is "HW
pre-emption"?

With firmware interfaces, my hope would be that you could simply tell
the firmware:

stop_running_ring(nr_of_ring)
// time slice for someone else
start_running_ring(nr_of_ring)

Thereby getting real scheduling and all that, and eliminating many
other problems we know well from drm/sched.

> Userspace basically needs to co-operate and provide a buffer where
> the state on a pre-emption is saved into.

That's uncool. With CPU preemption, all that is done automatically via
the process's pages.

P.
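[Editorial note: the bounded-retry timeout idea discussed above could be sketched roughly as below. This is a minimal, hypothetical illustration, not real amdgpu code; read_progress_counter(), trigger_reset() and MAX_RETRIES are made-up names standing in for whatever register read and reset path the driver actually uses.]

```c
#include <stdbool.h>

/* Hypothetical: give a stalled job up to MAX_RETRIES extra time
 * slices as long as some HW progress counter keeps advancing. */
#define MAX_RETRIES 3

static unsigned long fake_counter;   /* stand-in for a HW register */
static bool reset_triggered;

static unsigned long read_progress_counter(void)
{
	return fake_counter;
}

static void trigger_reset(void)
{
	reset_triggered = true;
}

/* Called from the job-timeout handler: if the counter advanced since
 * the last timeout and we still have retries left, extend the timeout;
 * otherwise reset anyway. Note this can still be fooled by a
 * while(true) job that keeps the counter moving, as the mail says. */
static void on_timeout(void)
{
	static unsigned long last;
	static int retries;
	unsigned long cur = read_progress_counter();

	if (cur != last && retries < MAX_RETRIES) {
		last = cur;
		retries++;
		return;	/* grant one more time slice */
	}
	retries = 0;
	trigger_reset();
}
```

Even with apparent progress on every check, the retry budget is exhausted after MAX_RETRIES extensions and the reset fires anyway, which is exactly the bounded behavior (and the hang_limit resemblance) under discussion.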
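[Editorial note: the stop_running_ring()/start_running_ring() idea above amounts to round-robin time slicing over firmware rings. A toy sketch, assuming those two calls existed; the ring state array and tick function are purely illustrative glue, not any real firmware interface.]

```c
#define NUM_RINGS 3

static int running[NUM_RINGS];	/* stand-in for firmware ring state */
static int current_ring;

/* The two hypothetical firmware calls from the mail. */
static void stop_running_ring(int nr)  { running[nr] = 0; }
static void start_running_ring(int nr) { running[nr] = 1; }

/* Called on a timer tick: take the slice away from the current ring
 * and hand it to the next one, round-robin. No userspace cooperation
 * needed, which is the whole appeal compared to save-state buffers. */
static void time_slice_tick(void)
{
	stop_running_ring(current_ring);
	current_ring = (current_ring + 1) % NUM_RINGS;
	start_running_ring(current_ring);
}
```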
