On Thu, 2026-03-05 at 09:27 +0100, Boris Brezillon wrote:
> Hi Matthew,
> 
> On Wed, 4 Mar 2026 18:04:25 -0800
> Matthew Brost <[email protected]> wrote:
> 
> > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> > > Hi,
> > > 
> > > Our system compositor (surfaceflinger on android) submits gpu jobs
> > > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > > to run_job can sometimes cause frame misses. We are seeing this on
> > > panthor and xe, but the issue should be common to all drm_sched users.
> > >   
> > 
> > I'm going to assume that since this is a compositor, you do not pass
> > input dependencies to the page-flip job. Is that correct?
> > 
> > If so, I believe we could fairly easily build an opt-in DRM sched path
> > that directly calls run_job in the exec IOCTL context (I assume this is
> > SCHED_FIFO) if the job has no dependencies.
> 
> I guess by ::run_job() you mean something slightly more involved that
> checks if:
> 
> - other jobs are pending
> - enough credits (AKA ringbuf space) is available
> - and probably other stuff I forgot about
> 
> > 
> > This would likely break some of Xe’s submission-backend assumptions
> > around mutual exclusion and ordering based on the workqueue, but that
> > seems workable. I don’t know how the Panthor code is structured or
> > whether they have similar issues.
> 
> Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> you're describing. There's just so many things we can forget that would
> lead to races/ordering issues that will end up being hard to trigger and
> debug.
> 

+1

I'm not thrilled either. More like the opposite of thrilled actually.

Even if we could get that to work, this is primarily a maintainability
issue.

The scheduler is already full of insane performance hacks for this or
that driver: lockless accesses, a special lockless queue used by only
one party in the kernel (a "lockless" queue which nowadays, after N
reworks, is used with a lock. Ah well).

In past discussions, Danilo and I made it clear that major features in
_new_ patch series aimed at getting merged into drm/sched must be
preceded by cleanup work addressing some of the scheduler's major
problems.

That's especially true for features aimed at performance gains.



P.
