Hello! Here comes yet another bug description and a change proposal; hopefully not too long this time. (Update after having written like half of it: apparently it *is* going to be long.)
My system regularly experiences network deadlocks: the system itself keeps working if I access it through the console, but all my SSH sessions just hang, as does any other connection. This can happen soon after startup (in maybe 20% of boots), and later during run time at arbitrary moments too; in the latter case it seems related to how much network activity there is. This is one of the reasons preventing me from running the glibc testsuite: it spits out _a lot_ of output and will almost surely trigger a networking deadlock at some point. I don't know why nobody else seems to be complaining about this; based on the analysis below, the root cause should be fairly reproducible.

Shutting down the system normally does not work in this state: the dhcp client attempts to release its lease, but since the network stack is deadlocked, it never manages to. And since (on Debian GNU/Hurd) the shutdown process is apparently synchronous and there are no timeouts and sigkills in place, it just hangs there forever, with no apparent way to even Ctrl-C it (which sort of makes sense, since it's not an interactive shell). If I approach it calmly, I do syncfs --sync / (just in case) and then reboot-hurd, which works. If, however, I have already managed to type in 'reboot', I am left with no choice but to forcefully power off the machine and risk corrupting the file system; the consequences range from minor complaints from fsck on the next boot-up to the installation breaking so badly that it does not boot. So we can agree that this is slightly suboptimal.

Yesterday I got one of these hangs again, and decided to bite the bullet & look into what's going on. There are two things that could be locking up: either pfinet or netdde; quick inspection revealed that pfinet is the guilty party.

There is apparently a single global_lock in pfinet; it makes sense that if something manages to hold it indefinitely, everything else will hang. That something turned out to be the timer thread (see pfinet/timer-emul.c:timer_function). It grabs the lock after sleeping and proceeds to run the handlers, and apparently never exits from that loop, so nothing else can make any progress. But why doesn't it exit the loop? Surely there's only a finite, hopefully small, number of timers overall? Well, yes, but they re-schedule themselves from inside the handlers.

As an aside, whoever wrote add_timer () in the same file obviously didn't care about releasing the mach_thread_self () right; currently a reference is leaked on each invocation (yay). It should either deallocate the right or use hurd_thread_self ().

The offending function that reschedules itself over and over is pfinet/linux-src/net/ipv4/tcp_timer.c:tcp_sltimer_handler. I found out (which wasn't super easy with everything being <optimized out>, in addition to me being unable to use my terminal emulator of choice and having to squint at the Mach console, which is tiny because it's not getting appropriately scaled for my HiDPI screen [0]) that the 'now + next' addition overflows and results in a small timestamp (number of jiffies), which timer-emul then takes to be a timestamp in the past, so it immediately runs the handler again; hence the busy-loop.

[0]: https://gitlab.gnome.org/GNOME/gnome-boxes/-/issues/635

Why does it overflow? Because 'now' is huge. 'now' is set from 'jiffies' at the beginning of tcp_sltimer_handler.
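To make the arithmetic concrete, here is a tiny standalone illustration with made-up numbers (this is just the arithmetic, not the actual pfinet code):

#include <stdio.h>

int main (void)
{
  /* Say pfinet recorded 1000000 jiffies' worth of time at startup, and
     ntpdate has since stepped the realtime clock back by 2 seconds
     (200 jiffies at HZ = 100).  */
  unsigned long root_jiffies = 1000000;
  unsigned long j = 999800;              /* current time-device reading */

  unsigned long now = j - root_jiffies;  /* underflows to just below ULONG_MAX */
  unsigned long next = 500;              /* e.g. a 5-second timer interval */
  unsigned long expires = now + next;    /* wraps around to 300 */

  printf ("now     = %lu\n", now);
  printf ("expires = %lu\n", expires);
  /* 300 is far "in the past" compared to the huge current jiffies
     value, so timer-emul runs the handler again immediately, the
     handler reschedules itself, and round we go, all while holding
     global_lock.  */
  return 0;
}

Now, why is 'jiffies' huge in the first place?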
'jiffies' is #defined to fetch_jiffies () in pfinet/mapped-time.h, which uses maptime_read () to read the current time from the Mach time device and then subtracts 'root_jiffies' from it. 'root_jiffies' is filled in from the Mach time device on pfinet startup. So clearly 'j - root_jiffies' itself overflows, resulting in a huge 'jiffies' value; I added an 'assert (j >= root_jiffies)', and sure enough, it fails at about the same rate as the network locks up.

In conclusion, what happens is: pfinet reads a timestamp from the Mach time device, then reads another timestamp, and the second one happens to be smaller than the first one, which confuses it into busy-looping and deadlocking.

I then grepped gnumach to see how the time device is set up and realized that the time device is supposed to be CLOCK_REALTIME, not something like CLOCK_BOOTTIME or CLOCK_MONOTONIC. So it only makes sense that if *something* adjusts the clock with host_set_time (), the time as seen through the time device will experience a discontinuous jump, possibly into the past. And that *something* appears to be /usr/sbin/ntpdate.

Now, how do we fix this? Clearly pfinet needs to implement 'jiffies' differently. A stopgap solution would be to detect the case of j < root_jiffies and just set root_jiffies = j; this will probably break all sorts of timeouts, but at least it will resolve the busy-loop / deadlock.

But also, Mach needs to expose a CLOCK_BOOTTIME-like clock in addition to the CLOCK_REALTIME one. There is already internal tracking for this, see kern/mach_clock.c:clock_boottime_offset; it only needs to be exposed to userland. Since we have already broken ABI compatibility with the time device on other Mach versions (if it ever existed) by adding the *64 variants to struct mapped_time_value, maybe we could just place clock_boottime_offset into the same struct? Then userspace would be able to pick it up and use it to calculate the boottime-relative time. We could even attempt to use this to implement some support for CLOCK_MONOTONIC in glibc, which I imagine would be a welcome addition. Maybe backcompat is not that much of a concern either: the kernel allocates a whole page for the mapped_time_value and zero-fills it; so if we see that clock_boottime_offset is zero, that clearly means either that the system was booted up on 1 January 1970, or that the Mach version in use doesn't support clock_boottime_offset.

What do you think?

Sergey
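P.S. To make the stopgap concrete, something along these lines in pfinet/mapped-time.h is what I have in mind (written from memory, so the types and the exact jiffies conversion are approximate; not a tested patch):

/* Rough sketch of the stopgap: clamp root_jiffies whenever the time
   device is seen to have jumped backwards, so the subtraction can
   never underflow.  Pending timeouts computed before the jump will
   still be off, but at least we stop busy-looping.
   (mapped_time, root_jiffies and HZ are the existing pfinet globals.)  */
static inline long long
fetch_jiffies (void)
{
  struct timeval tv;
  long long j;

  maptime_read (mapped_time, &tv);
  j = (long long) tv.tv_sec * HZ + ((long long) tv.tv_usec * HZ) / 1000000;

  if (j < root_jiffies)
    /* The realtime clock was stepped backwards (hello, ntpdate);
       re-base instead of underflowing.  */
    root_jiffies = j;

  return j - root_jiffies;
}

Whether it's safe to mutate root_jiffies there without extra locking deserves a second look.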
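P.P.S. And a very rough sketch of the mapped-time idea, mostly to clarify what I mean. The field name 'boottime_offset' and the 'boot-relative = realtime - offset' convention are made up here for illustration; whatever kern/mach_clock.c actually uses internally would of course win.

/* Kernel side (sketch, not a patch): add one member to struct
   mapped_time_value (existing members, including the *64 variants,
   elided):

       time_value64_t boottime_offset;

   keep it in sync with kern/mach_clock.c:clock_boottime_offset, and
   update it under the same check-seconds protocol as the realtime
   fields.  Userspace (e.g. a future CLOCK_MONOTONIC-ish path in
   glibc) could then look roughly like this: */

#include <sys/time.h>
#include <hurd/maptime.h>

static void
read_boottime (volatile struct mapped_time_value *mt, struct timeval *bt)
{
  struct timeval rt;

  maptime_read (mt, &rt);   /* realtime, exactly as today */

  /* (Glossing over the consistency re-read for the offset here.)  */
  if (mt->boottime_offset.seconds == 0
      && mt->boottime_offset.nanoseconds == 0)
    {
      /* Zero-filled page: either the system was booted on
         1 January 1970, or the kernel predates the new field;
         fall back to plain realtime.  */
      *bt = rt;
      return;
    }

  bt->tv_sec = rt.tv_sec - mt->boottime_offset.seconds;
  bt->tv_usec = rt.tv_usec - mt->boottime_offset.nanoseconds / 1000;
  if (bt->tv_usec < 0)
    {
      bt->tv_usec += 1000000;
      bt->tv_sec--;
    }
}

Whether the offset needs its own check field or can just piggyback on the existing one is something I haven't thought through yet.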