Hello,

while hacking on rpctrace -p, I once again ran into gsync_wait () calls permanently hanging rpctrace.
The reason for this is simple: once rpctrace logs the gsync_wait () call it receives from a traced task, it forwards the same gsync_wait () call to the actual task port of the traced task, and this causes rpctrace itself to block, since the gsync_wait () implementation always blocks the calling thread. Normally, some time later the blocked thread would be woken up by a gsync_wake () call made by another thread; but since rpctrace itself is single-threaded, other threads in the traced tasks can send gsync_wake () requests in vain, as they'll never even get received by the hanging rpctrace, let alone forwarded to the kernel.

One way to work around this would be to make rpctrace multithreaded, and I've heard there's been some work in that direction. But to me it sounds like gsync_wait () blocking rpctrace is the part that goes wrong. Generally, rpctrace is never supposed to block on an RPC made by a traced task: it forwards the request message without blocking for the reply (which may come later, or never).

So I've long been thinking that gsync_wait () should do the same: instead of actually blocking the thread calling gsync_wait (), gsync_wait_request () should return immediately, and the reply message will come once someone calls gsync_wake (). This doesn't change anything for the regular callers of gsync_wait (), since the call will still appear to block the same way other RPCs do, but it will actually block on msg receive, not inside gsync_wait () itself.

This must have been discussed before, and there must be a reason why gsync_wait () was made to behave the way it does and not in the (arguably simpler and more consistent) way I'm proposing; but I can't find any relevant discussion.

Anyway, I thought I'd try implementing my idea and seeing what would break.

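
To make the idea a bit more concrete, here is a toy model in plain C. It is only a sketch, not the actual patch: every name in it (toy_gsync_wait_request, toy_gsync_wake, the waiter table, the integer "reply port") is made up for illustration, and it ignores locking, port rights and everything else the kernel really has to care about. All it shows is that when the wait request merely records the waiter and returns, a single-threaded forwarder gets to handle the later wake, which is what sends the deferred reply.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names are real GNU Mach interfaces.  */

#define MAX_WAITERS 8

struct waiter
{
  int in_use;
  unsigned long addr;   /* address the thread is waiting on */
  int reply_port;       /* stand-in for the reply port right */
};

static struct waiter waiters[MAX_WAITERS];
static int n_replies;             /* how many deferred replies were sent */
static int last_reply_port = -1;  /* where the most recent reply went */

/* Request half of the wait: record the waiter and return at once,
   without blocking the caller.  Returns 0 on success, -1 if the
   table is full.  */
static int
toy_gsync_wait_request (unsigned long addr, int reply_port)
{
  for (size_t i = 0; i < MAX_WAITERS; i++)
    if (!waiters[i].in_use)
      {
        waiters[i].in_use = 1;
        waiters[i].addr = addr;
        waiters[i].reply_port = reply_port;
        return 0;
      }
  return -1;
}

/* Wake: "send" the deferred reply to every waiter on ADDR and drop
   it from the table.  Returns the number of waiters woken.  */
static int
toy_gsync_wake (unsigned long addr)
{
  int woken = 0;

  for (size_t i = 0; i < MAX_WAITERS; i++)
    if (waiters[i].in_use && waiters[i].addr == addr)
      {
        last_reply_port = waiters[i].reply_port;
        n_replies++;
        waiters[i].in_use = 0;
        woken++;
      }
  return woken;
}

/* A single-threaded forwarder (think rpctrace) handling two messages
   in order: a wait from thread A, then a wake from thread B.  */
static int
toy_forwarder_demo (void)
{
  /* Forwarding the wait request does not block the forwarder...  */
  int err = toy_gsync_wait_request (0x1000, 42);
  assert (err == 0);

  /* ...so it is still around to forward the wake, which triggers
     the reply to thread A's reply port.  */
  return toy_gsync_wake (0x1000);
}
```

With the current blocking behavior, the equivalent of the first call in toy_forwarder_demo () would never return, and the wake would never be forwarded.
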
Much like with glibc, I'm not very familiar with Mach-the-kernel internals, but to my surprise the first version that compiled appears to work just fine, booting a full working Hurd system (and rpctracing gsync_wait () totally works). Still, I probably messed something up: some locking or reference counting or somesuch; so please review :)

The part that I haven't figured out yet is how the kernel can listen for a right to become a dead name (like the dead-name notification in userspace). Perhaps it amounts to calling ipc_port_dnrequest () and listening for messages just like in userspace, but I have not worked out the details. Without this, the kernel cannot really know when the reply port is deallocated, either explicitly by userspace or because the task died, which means the kernel will leak the memory allocated for the waiters.

Another to-do item is timeout support; ideally it should just turn into waittime in the MIG definition, but this again requires handling the reply port dying, and would change the message format, and also GNU MIG doesn't yet support conditional timeouts (although I might have a MIG patch or two pending for this).

So, what do you think?

Sergey

P.S. Feels so good to hack on something that I can just post about in public again!