[RFC PATCH 0/1] gsync blocking the calling thread considered harmful

Sergey Bugaev Thu, 10 Jun 2021 09:47:56 -0700

Hello,

while hacking on rpctrace -p, I once again ran into gsync_wait () calls 
permanently hanging rpctrace.


The reason for this is simple: once rpctrace logs the gsync_wait () call it 
receives from a traced task, it forwards the same gsync_wait () call to the 
actual task port of the traced task, and this causes rpctrace itself to block 
since the gsync_wait () implementation always affects the calling thread. 
Normally, some time later the blocked thread would be woken up by a gsync_wake 
() call done by another thread; but since rpctrace itself is single-threaded, 
other threads in the traced tasks can enqueue gsync_wake_request ()s in vain, 
as they'll never even get received by the hanging rpctrace, nor forwarded to 
the kernel.
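
To make the hang concrete, here is a toy model of a single-threaded forwarder (not rpctrace's actual code; the toy_* names are made up for illustration). It assumes that forwarding a wait puts the forwarder's only thread to sleep, so anything queued behind the wait is never received:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message kinds, for illustration only.  */
enum toy_msg { TOY_WAIT, TOY_WAKE, TOY_OTHER };

/* Model a single-threaded forwarder in which forwarding TOY_WAIT blocks
   the forwarder itself.  Returns how many messages get forwarded before
   the forwarder stops making progress.  */
size_t
toy_forward_blocking (const enum toy_msg *queue, size_t len)
{
  size_t forwarded = 0;

  assert (queue != NULL || len == 0);

  for (size_t i = 0; i < len; i++)
    {
      forwarded++;
      if (queue[i] == TOY_WAIT)
        /* The only thread is now asleep inside the forwarded wait; a
           wake-up sitting later in the queue is never received.  */
        break;
    }

  return forwarded;
}
```

With a queue of { TOY_OTHER, TOY_WAIT, TOY_WAKE }, only the first two messages are forwarded; the TOY_WAKE that would have unblocked the wait stays in the queue forever.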

One way to work around this would be to make rpctrace multithreaded, and I've 
heard there's been some work in that direction.

But to me it sounds like gsync_wait () blocking rpctrace is the part that goes 
wrong. Generally, rpctrace is never supposed to block on an RPC made by a 
traced task: it forwards the request message without blocking for the reply 
(that may come later, or never).

So I've long been thinking that gsync_wait () should do the same: instead of 
actually blocking the thread calling gsync_wait (), gsync_wait_request () 
should return immediately, and the reply message will come once someone calls 
gsync_wake (). This doesn't change anything for the regular callers of 
gsync_wait (), since the call will still appear to block the same way other 
RPCs do, but it will actually block on msg receive, not inside gsync_wait () 
itself.
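
As a rough userspace sketch of these semantics (a toy model, not the patch itself): the wait request only records a waiter together with its reply port and returns, and the wake later "sends" the replies. The toy_* names and the fixed-size table are assumptions for illustration; value checks, timeouts and flags are ignored, and ports are plain integers here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_WAITERS 16

/* One pending waiter: the address waited on and where to send the reply.  */
struct toy_waiter
{
  int addr;
  int reply_port;
};

struct toy_gsync
{
  struct toy_waiter waiters[TOY_MAX_WAITERS];
  size_t nwaiters;
};

/* Record the waiter and return immediately; the caller is never blocked
   here.  Returns 0 on success, -1 if the table is full.  */
int
toy_wait_request (struct toy_gsync *g, int addr, int reply_port)
{
  assert (g != NULL);

  if (g->nwaiters == TOY_MAX_WAITERS)
    return -1;

  g->waiters[g->nwaiters].addr = addr;
  g->waiters[g->nwaiters].reply_port = reply_port;
  g->nwaiters++;
  return 0;
}

/* Wake waiters on ADDR: store their reply ports in REPLIES (a stand-in
   for sending the reply messages) and drop them from the table.
   Returns the number of replies produced.  */
size_t
toy_wake (struct toy_gsync *g, int addr, int *replies, size_t max_replies)
{
  size_t sent = 0, kept = 0;

  assert (g != NULL);

  for (size_t i = 0; i < g->nwaiters; i++)
    {
      if (g->waiters[i].addr == addr && sent < max_replies)
        replies[sent++] = g->waiters[i].reply_port;
      else
        g->waiters[kept++] = g->waiters[i];
    }

  g->nwaiters = kept;
  return sent;
}
```

In this model a forwarder can pass toy_wait_request () along and go straight back to its receive loop; the waiting thread is blocked only in the sense that its reply has not arrived yet.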

This must have been discussed before, and there must be a reason why gsync_wait 
() was made to behave the way that it does and not in the (arguably simpler and 
more consistent) way I'm proposing; but I can't find any relevant discussion.

Anyway, I thought I'd try implementing my idea and seeing what would break. 
Much like with glibc, I'm not very familiar with Mach-the-kernel internals, but 
to my surprise the first version that compiled appears to work just fine, 
booting a full working Hurd system (and rpctracing gsync_wait () totally 
works). Still, I probably messed something up: some locking or reference 
counting or somesuch; so please review :)

The part that I haven't figured out yet is how the kernel can listen for a 
right to become a dead name (like the dead-name notification in userspace). 
Perhaps it amounts to calling ipc_port_dnrequest () and listening for messages 
just like in userspace, but I don't know the details yet. Without this, 
the kernel cannot really know when the reply port is deallocated, either 
explicitly by userspace or because the task died, so the kernel will leak the 
memory allocated for the waiters. Another to-do item is timeout support; 
ideally it should just turn into waittime in the MIG definition, but this again 
requires handling the reply port dying, and would change the message format, 
and also GNU MIG doesn't yet support conditional timeouts (although I might 
have a MIG patch or two pending for this).

So, what do you think?

Sergey

P.S. Feels so good to hack on something that I can just post about in public 
again!
