Hello,

Just an update: I have fixed other issues with netdde crashing. The latest fixes are available in 2:1.8+git20191029-7, to be built & uploaded soon.
One thing that remains, however, is this: if netdde crashes while an irq is pending, it will not make the hardware drop the irq. The interrupt handler will thus keep getting called until another netdde is started, which resets the board and thus drops the irq. That works fine (just like it used to) when netdde is the only user of this irq, because in that case, when netdde goes away, we don't have interrupt actions any more and thus gnumach disables the interrupt. If the irq is shared, however, the interrupt will be kept enabled and will keep being raised, without any driver being able to make that stop.

To summarize, there is progress: we can share an irq between several netddes/rumps and even gnumach. But if netdde or rump shares an irq and dies, we get a hard hang. So either you want netdde/rump to be able to die, and then you have to manage to get it alone on its irq, or you want to share an irq, and then netdde/rump shouldn't die.

A way to "avoid" this would be not to re-enable the irq if netdde dies. But this means that other drivers using the same irq will not get any interrupts any more and will thus stop working. And when a newer netdde comes, we need to know that it indeed drives the hardware which is keeping the interrupt raised, and thus has reset it, before re-enabling the irq. Worse, if it's the disk driver which shares the same irq as netdde, one might not even be able to reload the netdde binary from the disk.

An ugly way to work around this could be to mitigate the irq frequency: when a userland interrupt listener dies before properly unregistering its interrupt, we could throttle the interrupt delivery to e.g. once per clock tick, which could be enough to reload the netdde binary from the disk in a few seconds. Once netdde re-registers the interrupt, we can re-enable the interrupt normally. We could even automatically detect such a condition: use a counter of how many times the interrupt handler got called again and again without the hardware clock managing to trigger in between.
If that counter reaches something like a million, the irq is stuck and we have to throttle it until a driver for that irq is restarted.

Samuel
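
For illustration, here is a minimal sketch in C of the detection and throttling idea described above. It is not gnumach code: the structure, the function names (irq_handler_called, irq_clock_tick, irq_listener_died, irq_driver_registered) and the exact threshold are all hypothetical, and the real masking/unmasking of the interrupt line is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold: the "something like a million" from the text. */
#define STUCK_IRQ_THRESHOLD 1000000UL

/* Hypothetical per-irq bookkeeping, not an actual gnumach structure. */
struct irq_state {
    unsigned long calls_since_tick; /* handler calls since the last clock tick */
    bool throttled;                 /* deliver at most once per clock tick */
};

/*
 * To be called each time the interrupt handler runs for this irq.
 * Returns true if the line may stay enabled, false if the caller should
 * mask it until the next clock tick.
 */
bool irq_handler_called(struct irq_state *st)
{
    assert(st != NULL);
    if (st->throttled)
        return false;
    if (++st->calls_since_tick >= STUCK_IRQ_THRESHOLD) {
        /* The clock never got a chance to run: consider the irq stuck. */
        st->throttled = true;
        return false;
    }
    return true;
}

/*
 * To be called on each clock tick. The clock managed to trigger, so the
 * counter restarts. Returns true if a masked line should be unmasked for
 * one more delivery, so that other drivers sharing the irq still progress.
 */
bool irq_clock_tick(struct irq_state *st)
{
    assert(st != NULL);
    st->calls_since_tick = 0;
    return st->throttled;
}

/* A userland listener died without unregistering: throttle right away. */
void irq_listener_died(struct irq_state *st)
{
    assert(st != NULL);
    st->throttled = true;
}

/* A driver (re-)registered for this irq: go back to normal delivery. */
void irq_driver_registered(struct irq_state *st)
{
    assert(st != NULL);
    st->throttled = false;
    st->calls_since_tick = 0;
}
```

With this shape, a healthy irq never gets throttled as long as clock ticks keep arriving between bursts, while a stuck shared irq ends up delivered once per tick until a driver registers again. Whether a newly registered driver actually drives the hardware holding the line raised is not checked here; if it does not, the counter would simply hit the threshold again.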