On Thu, 2026-03-05 at 01:10 -0800, Matthew Brost wrote:
> On Thu, Mar 05, 2026 at 09:38:16AM +0100, Philipp Stanner wrote:
> > On Thu, 2026-03-05 at 09:27 +0100, Boris Brezillon wrote:
> > 
> > > 

[…]

> > > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> > > you're describing. There's just so many things we can forget that would
> > > lead to races/ordering issues that will end up being hard to trigger and
> > > debug.
> > > 
> > 
> > +1
> > 
> > I'm not thrilled either. More like the opposite of thrilled actually.
> > 
> > Even if we could get that to work. This is more of a maintainability
> > issue.
> > 
> > The scheduler is full of insane performance hacks for this or that
> > driver. Lockless accesses, a special lockless queue only used by that
> > one party in the kernel (a lockless queue which is nowadays, after N
> > reworks, being used with a lock. Ah well).
> > 
> 
> This is not relevant to this discussion—see below. In general, I agree
> that the lockless tricks in the scheduler are not great, nor is the fact
> that the scheduler became a dumping ground for driver-specific features.
> But again, that is not what we’re talking about here—see below.
> 
> > In the past discussions Danilo and I made it clear that more major
> > features in _new_ patch series aimed at getting merged into drm/sched
> > must be preceded by cleanup work to address some of the scheduler's
> > major problems.
> 
> Ah, we've moved to dictatorship quickly. Noted.

I prefer the term "benevolent presidency" /s

Or even better: s/dictatorship/accountability enforcement.

How come everyone is here and ready so quickly when it
comes to new use cases and features, yet I never saw anyone except for
Tvrtko and Maíra investing even 15 minutes to write a simple patch to
address some of the *various* significant issues in that code base?

You were on CC on all discussions we've had here for the last years
afair, but I rarely saw you participate. And you know what it's like:
who doesn't speak up silently agrees in open source.

But tell me one thing, if you can be so kind:

What is your theory why drm/sched came to be in such horrible shape?
What circumstances, what human behavioral patterns have caused this?

The DRM subsystem has a bad reputation regarding stability among Linux
users, as far as I have sensed. How can we do better?

> 
> > 
> 
> I can't say I agree with either of you here.
> 
> In about an hour, I seemingly have a bypass path working in DRM sched +
> Xe, and my diff is:
> 
> 108 insertions(+), 31 deletions(-)

LOC is a bad metric for complexity.

> 
> About 40 lines of the insertions are kernel-doc, so I'm not buying that
> this is a maintenance issue or a major feature - it is literally a
> single new function.
> 
> I understand a bypass path can create issues—for example, on certain
> queues in Xe I definitely can't use the bypass path, so Xe simply
> wouldn’t use it in those cases. This is the driver's choice to use or
> not. If a driver doesn't know how to use the scheduler, well, that’s on
> the driver. Providing a simple, documented function as a fast path
> really isn't some crazy idea.

We're effectively talking about a deviation from the default submission
mechanism, and all that seems to be desired for a luxury feature.

Then you end up with two submission mechanisms, whose correctness in
the future relies on someone remembering what the background was, why
it was added, and what the rules are.

The current scheduler rules are / were often not even documented, and
sometimes even Christian took a few weeks to remember again why
something had been added – and whether it can now be removed again or
not.

> 
> The alternative—asking for RT workqueues or changing the design to use
> kthread_worker—actually is.
> 
> > That's especially true if it's features aimed at performance buffs.
> > 
> 
> With the above mindset, I'm actually very confused why this series [1]
> would even be considered as this order of magnitude greater in
> complexity than my suggestion here.
> 
> Matt
> 
> [1] https://patchwork.freedesktop.org/series/159025/ 

The discussions about Tvrtko's CFS series were precisely the point
where Danilo brought up that after this can be merged, future rework of
the scheduler must focus on addressing some of the pending fundamental
issues.

The background is that Tvrtko has worked on that series already for
well over a year, it actually simplifies some things in the sense of
removing unused code (obviously it's a complex series, no argument
about that), and we agreed at XDC that this can be merged. So this is a
question of fairness to the contributor.

But at some point you have to finally draw a line. No one will ever
address major scheduler issues unless we demand it. Even very
experienced devs usually prefer to hack around the central design
issues in their drivers instead of fixing the shared infrastructure.


P.
