On 11/10/25 16:55, Philipp Stanner wrote:
> On Mon, 2025-11-10 at 16:14 +0100, Christian König wrote:
>> On 11/10/25 15:20, Philipp Stanner wrote:
>>> On Mon, 2025-11-10 at 15:07 +0100, Christian König wrote:
>>>> On 11/10/25 13:27, Philipp Stanner wrote:
>>>> The problem isn't the burned CPU cycles, but rather the cache lines moved
>>>> between CPUs.
>>>
>>> Which cache lines? The spinlock's?
>>>
>>> The queue data needs to move from one CPU to the other in either case.
>>> It's the same data that is being moved with spinlock protection.
>>>
>>> A spinlock doesn't lead to more cache line moves as long as there's
>>> still just a single consumer / producer.
>>
>> Looking at a couple of examples:
>>
>> 1. spinlock + double linked list (which is what the scheduler used
>> initially).
>>
>> You have to touch 3-4 different cache lines, the lock, the previous, the
>> current and the next element (next and prev are usually the same with the
>> lock).
>
> list when pushing:
>
> Lock + head (same cache line) + head->next
> head->next->next
>
> when popping:
>
> Lock + head + head->previous
> head->previous->previous
>
> I don't see why you need a "current" element when you're always only
> touching head or tail.

The current element is the one you insert or remove.
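To make the counting concrete, here is a minimal, untested sketch of the two
paths under discussion, using a plain list_head protected by a spinlock. The
toy_* names are made up for illustration and this is not drm/sched code; the
comments note which pointers (and therefore cache lines) each side writes:

/*
 * Untested toy sketch: one double linked list protected by a spinlock,
 * to see which cache lines push and pop have to touch.
 */
#include <linux/list.h>
#include <linux/spinlock.h>

struct toy_job {
        struct list_head node;
        /* payload ... */
};

struct toy_queue {
        spinlock_t lock;        /* typically shares a cache line with 'jobs' */
        struct list_head jobs;
};

static void toy_queue_init(struct toy_queue *q)
{
        spin_lock_init(&q->lock);
        INIT_LIST_HEAD(&q->jobs);
}

/*
 * Producer: touches the lock + list head (usually one cache line), the new
 * job's list_head and the old tail element (list_add_tail() writes
 * q->jobs.prev and old_tail->next).
 */
static void toy_queue_push(struct toy_queue *q, struct toy_job *job)
{
        spin_lock(&q->lock);
        list_add_tail(&job->node, &q->jobs);
        spin_unlock(&q->lock);
}

/*
 * Consumer: touches the lock + list head and the first element;
 * list_del_init() additionally writes the second element's prev pointer.
 */
static struct toy_job *toy_queue_pop(struct toy_queue *q)
{
        struct toy_job *job;

        spin_lock(&q->lock);
        job = list_first_entry_or_null(&q->jobs, struct toy_job, node);
        if (job)
                list_del_init(&job->node);
        spin_unlock(&q->lock);

        return job;
}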
>>
>> 2. kfifo (attempt #2):
>>
>> 3 cache lines, one for the array, one for the rptr/wptr and one for the
>> element.
>> Plus the problem that you need to come up with some upper bound for it.
>>
>> 3. spsc (attempt #3)
>>
>> 2-3 cache lines, one for the queue (head/tail), one for the element and
>> one for the previous element (but it is quite likely that this is
>> pre-fetched).
>>
>> Saying this I just realized we could potentially trivially replace the spsc
>> with a single linked list + pointer to the end + spinlock and have the same
>> efficiency. We don't need all the lockless stuff for that at all.
>>
>
> Now we're speaking mostly the same language :]
>
> If you could RB my DRM TODO patches we'd have a section for drm/sched,
> and there we could then soonish add an item for getting rid of spsc.
>
> https://lore.kernel.org/dri-devel/[email protected]/

I can't find that in my inbox anywhere. Can you send it out one more time
with my AMD mail address on explicit CC? Thanks in advance.

>>>> The problem is really to separate the push from the pop side so that as
>>>> few cache lines as possible are transferred from one CPU to another.
>>>
>>> With a doubly linked list you can attach at the front and pull from the
>>> tail. How is that transferring many cache lines?
>>
>> See above.
>>
>> We have some tests for old and trivial use cases (e.g. GLmark2) which by
>> today's standards pretty much only depend on how fast you can push things to
>> the HW.
>>
>> We could just extend the scheduler test cases to see how many submissions
>> per second we can pump through a dummy implementation where both producer
>> and consumer are nailed to separate CPUs.
>>
>
> I disagree. That would be a microbenchmark for a very narrow use case.

That is actually a rather common use case (unfortunately).

> It would only tell us that a specific patch slows things down for the
> microbenchmark, and we could only detect that if a developer runs the
> unit tests with and without his patches.

I could trigger adding that to AMD's CI systems.

>
> The few major reworks that touch such essentials have good realistic
> tests anyways, see Tvrtko's CFS series.
>
>
> Lockless magic should always be justified by real world use cases.
>
> By the way, back when spsc_queue was implemented, how large were the
> real world performance gains you measured by saving that 1 cache line?

That was actually quite a bit.

If you want a real world test case, use glMark2 on any modern HW. And yeah,
I know how ridiculous that is; the problem is that we still have people
using this as an indicator for the command submission overhead.

Regards,
Christian.

>
>
> P.
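For illustration, here is a minimal, untested sketch of the "single linked
list + pointer to the end + spinlock" queue mentioned above. The toy_* names
are invented and this is not meant as the actual replacement, just to show
that no lockless magic is needed for it:

/*
 * Untested toy sketch: single linked list with a tail pointer, protected
 * by a spinlock. Not drm/sched code, names are made up.
 */
#include <linux/spinlock.h>
#include <linux/stddef.h>

struct toy_node {
        struct toy_node *next;
};

struct toy_slq {
        spinlock_t lock;
        struct toy_node *head;          /* consumer pops here */
        struct toy_node **tail;         /* producer appends here */
};

static void toy_slq_init(struct toy_slq *q)
{
        spin_lock_init(&q->lock);
        q->head = NULL;
        q->tail = &q->head;
}

/*
 * Push writes the new node's next pointer, *tail (the old last node's
 * next, or head if the queue was empty) and tail itself.
 */
static void toy_slq_push(struct toy_slq *q, struct toy_node *n)
{
        n->next = NULL;

        spin_lock(&q->lock);
        *q->tail = n;
        q->tail = &n->next;
        spin_unlock(&q->lock);
}

/*
 * Pop writes only the head pointer (and resets tail when the queue
 * becomes empty).
 */
static struct toy_node *toy_slq_pop(struct toy_slq *q)
{
        struct toy_node *n;

        spin_lock(&q->lock);
        n = q->head;
        if (n) {
                q->head = n->next;
                if (!q->head)
                        q->tail = &q->head;
        }
        spin_unlock(&q->lock);

        return n;
}

Whether something like this is really as efficient as the lockless spsc is
exactly what the producer/consumer test case discussed above could show.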
