On Wed, May 01, 2019 at 04:35:02PM +1000, David Gwynne wrote:
> i originally came at this from the other side, where i wanted to run
> kqueue_enqueue and _dequeue without the KERNEL_LOCK, but that implied
> making kqueue_scan use the mutex too, which allowed the syscall to
> become less locked.
> 
> it assumes that the existing locking in kqueue_scan is in the right
> place, it just turns it into a mutex instead of KERNEL_LOCK with
> splhigh. it leaves the kqueue_register code under KERNEL_LOCK, but if
> you're not making changes with kevent then this should be a win.
> 
> there's an extra rwlock around the kqueue_scan call. this protects the
> kq_head list from having multiple marker structs attached to it. that is
> an extremely rare situation, ie, you'd have to have two threads execute
> kevent on the same kq fd concurrently, but that never happens. right?

FWIW, in Linux-land a shared event descriptor with edge-triggered events is
not an uncommon pattern for multithreaded dispatch loops, as it removes the
need for userspace locking. kqueue supports the same pattern, and some
portable event loops likely mix threads and shared event descriptors this
way. epoll and kqueue (and Solaris Ports, for that matter) are similar
enough that a thin wrapper doesn't even need to explicitly support this
pattern for it to be available to an application.
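To illustrate, here's a minimal sketch of the pattern (the pipe and
handle_ready() are stand-ins of mine, not from any particular event loop):
several threads block in kevent() on one shared kq, EV_CLEAR gives the
edge-triggered semantics, and there's no userspace lock around the dispatch.

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

#include <err.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

static int kq;				/* shared event descriptor */

static void
handle_ready(int fd)
{
	char buf[4096];

	/* drain until EAGAIN: an edge-triggered event won't fire
	 * again until new data arrives */
	while (read(fd, buf, sizeof(buf)) > 0)
		;
}

static void *
worker(void *arg)
{
	struct kevent ev;

	for (;;) {
		/* several threads block here on the same kq fd;
		 * no userspace lock around the dispatch */
		if (kevent(kq, NULL, 0, &ev, 1, NULL) == -1)
			err(1, "kevent");
		handle_ready((int)ev.ident);
	}
	return (NULL);
}

int
main(void)
{
	struct kevent ev;
	pthread_t t[4];
	int fds[2], i;

	if ((kq = kqueue()) == -1)
		err(1, "kqueue");

	/* stand-in for a real socket: the read end of a pipe */
	if (pipe(fds) == -1)
		err(1, "pipe");
	fcntl(fds[0], F_SETFL, O_NONBLOCK);

	/* EV_CLEAR == edge-triggered: the event state resets once
	 * it has been reported */
	EV_SET(&ev, fds[0], EVFILT_READ, EV_ADD | EV_CLEAR, 0, 0, NULL);
	if (kevent(kq, &ev, 1, NULL, 0, NULL) == -1)
		err(1, "kevent register");

	for (i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, NULL);

	write(fds[1], "x", 1);		/* wake one worker */
	pause();
	return (0);
}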

I don't see any reason to optimize for it at the moment, though.[1] That lock
doesn't appear to change semantics, it just serializes threads waiting on the
queue, right? Even if the order in which threads are awoken changes, it
shouldn't affect correctness.

[1] Or ever. Edge-triggered events and shared mutable data are individually
brittle, hard-to-maintain design choices; combining them is just asking for
trouble. I've seen projects go down this path and then switch from
edge-triggered to oneshot events to "solve" thread race bugs, which can end
up performing worse than a classic select loop: in low-latency and/or
high-load scenarios, the total number of syscalls spent constantly rearming
a descriptor can exceed what the select loop would make. Linux still doesn't
have batched updates, AFAIK, and they're prohibitively difficult to
implement for a multithreaded, lock-free dispatch loop anyhow, so it's not a
common optimization.
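To put the rearm cost in concrete terms, a rough sketch of the oneshot loop,
assuming a handle_ready() like the one above: with kqueue the rearm can at
least ride along as the changelist of the next kevent() call, so it stays at
one syscall per event, whereas epoll needs a separate epoll_ctl() per rearm
on top of the epoll_wait() -- which is exactly the batching it's missing.

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

#include <err.h>

void	handle_ready(int);	/* assumed, as in the sketch above */

void
oneshot_loop(int kq)
{
	struct kevent change, ev;
	int nchanges = 0;

	for (;;) {
		/* the rearm from the previous iteration is passed as
		 * the changelist of this wait: one syscall total */
		if (kevent(kq, &change, nchanges, &ev, 1, NULL) == -1)
			err(1, "kevent");

		handle_ready((int)ev.ident);

		/* EV_ONESHOT: the event is deleted on delivery and
		 * has to be re-added before it can fire again */
		EV_SET(&change, ev.ident, EVFILT_READ,
		    EV_ADD | EV_ONESHOT, 0, 0, NULL);
		nchanges = 1;
	}
}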
