Re: [PATCH]: gnumach - simplify interrupt handling

Samuel Thibault Tue, 12 Nov 2019 15:54:19 -0800

Hello,

Just an update: I have fixed other issues with netdde crashing, the
latest fixes are available in 2:1.8+git20191029-7, to be built &
uploaded soon.


One thing that remains, however, is this: if netdde crashes while an irq
is pending, it will not make the hardware drop the irq. The interrupt
handler will thus keep getting called, until another netdde is started
to reset the board and thus drop the irq. That works fine (just like it
used to) when netdde is alone using this irq, because in that case when
netdde goes away we don't have interrupt actions any more and thus
gnumach disables the interrupt. If the irq is shared, however, the
interrupt will be kept enabled, and will keep firing, without any driver
being able to make that stop.
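
To illustrate the difference between the two cases, here is a minimal
model of the bookkeeping described above. It is only a sketch: the
names (irq_line, irq_register, irq_unregister, irq_is_enabled) are
made up for illustration and are not gnumach's actual structures or
functions.

```c
#include <assert.h>
#include <stdbool.h>

#define NIRQS 16

/* Hypothetical per-irq state, not gnumach's real data structure. */
struct irq_line {
    int  nactions;  /* number of registered interrupt actions */
    bool enabled;   /* whether the line is currently unmasked */
};

static struct irq_line irq_lines[NIRQS];

/* A driver (in-kernel or userland) starts listening on this irq. */
void irq_register(int irq)
{
    assert(irq >= 0 && irq < NIRQS);
    irq_lines[irq].nactions++;
    irq_lines[irq].enabled = true;
}

/* A driver unregisters, or dies and gets cleaned up. */
void irq_unregister(int irq)
{
    assert(irq >= 0 && irq < NIRQS);
    if (irq_lines[irq].nactions > 0)
        irq_lines[irq].nactions--;
    /* Only when no action is left can the line be disabled.  With a
       shared irq, other actions remain and the line stays enabled,
       even if the dead driver's board is the one asserting it. */
    if (irq_lines[irq].nactions == 0)
        irq_lines[irq].enabled = false;
}

bool irq_is_enabled(int irq)
{
    assert(irq >= 0 && irq < NIRQS);
    return irq_lines[irq].enabled;
}
```

With a single listener, irq_unregister() leaves the line disabled, which
is the case that works fine. With two listeners on the same line, the
line stays enabled after one of them goes away, which is the hang case.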

To summarize, there is progress: we can share an irq between several
netddes/rumps and even gnumach. But if a netdde or rump shares an irq
and dies, we get a hard hang. So either you want netdde/rump to be able
to die, in which case you have to manage to get it alone on its irq, or
you want to share an irq, and then netdde/rump must not die.

A way to "avoid" this would be not to re-enable the irq if netdde dies.
But this means that other drivers using the same irq will not get any
interrupts any more and will thus stop working. And when a new netdde
comes, we need to know that it indeed drives the hardware which is
keeping the interrupt raised, and thus has reset it, before re-enabling
the irq. Worse, if it's the disk driver which shares the same irq as
netdde, one might not even be able to reload the netdde binary from the
disk.

An ugly way to work around this could be to limit the irq frequency:
when a userland interrupt listener dies before properly unregistering
its interrupt, we could throttle the interrupt delivery to e.g. once per
clock tick, which could be enough to reload the netdde binary from the
disk in a few seconds. Once netdde re-registers the interrupt, we can
re-enable the interrupt normally.
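
A rough sketch of what such throttling could look like, assuming a
per-irq state that remembers the clock tick of the last delivery. All
names here (throttled_irq, irq_listener_died, irq_listener_registered,
irq_may_unmask) are hypothetical; nothing like this exists in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical throttling state for one irq line. */
struct throttled_irq {
    bool orphaned;            /* a listener died without unregistering */
    bool delivered_once;      /* already delivered since it got orphaned */
    unsigned long last_tick;  /* clock tick of the last throttled delivery */
};

/* A userland listener died before unregistering its interrupt. */
void irq_listener_died(struct throttled_irq *t)
{
    assert(t != NULL);
    t->orphaned = true;
    t->delivered_once = false;
}

/* A driver (re-)registered the interrupt: back to normal delivery. */
void irq_listener_registered(struct throttled_irq *t)
{
    assert(t != NULL);
    t->orphaned = false;
}

/* Decide whether the line may be unmasked again at tick now_tick.
   While orphaned, allow at most one delivery per clock tick. */
bool irq_may_unmask(struct throttled_irq *t, unsigned long now_tick)
{
    assert(t != NULL);
    if (!t->orphaned)
        return true;
    if (t->delivered_once && t->last_tick == now_tick)
        return false;
    t->delivered_once = true;
    t->last_tick = now_tick;
    return true;
}
```

The idea is that the kernel would leave the line masked whenever
irq_may_unmask() returns false and retry from the clock interrupt, so
that the other users of the irq (e.g. the disk driver) still get served
once per tick.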

We could even automatically detect such a condition: use a counter of
how many times the interrupt handler got called again and again without
the hardware clock managing to trigger. If that gets to something like
a million times, the irq is stuck and we have to throttle it until a
driver for that irq is restarted.
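
The detection itself could be as simple as the following sketch. The
names (stuck_detector, stuck_clock_tick, stuck_irq_fired) are made up,
and STUCK_THRESHOLD just takes the "something like a million" figure
above; the right value would have to be tuned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative threshold only, taken from the rough figure above. */
#define STUCK_THRESHOLD 1000000UL

/* Hypothetical per-irq detector state. */
struct stuck_detector {
    unsigned long count;  /* handler calls since the last clock tick */
};

/* Called from the clock interrupt: the clock managed to trigger, so
   the irq is not monopolizing the CPU and the counter starts over. */
void stuck_clock_tick(struct stuck_detector *d)
{
    assert(d != NULL);
    d->count = 0;
}

/* Called each time the irq handler runs.  Returns true once the
   handler has run STUCK_THRESHOLD times in a row without any clock
   tick in between, i.e. the irq should be considered stuck. */
bool stuck_irq_fired(struct stuck_detector *d)
{
    assert(d != NULL);
    if (d->count < STUCK_THRESHOLD)
        d->count++;  /* saturate instead of overflowing */
    return d->count >= STUCK_THRESHOLD;
}
```

Once stuck_irq_fired() returns true, the irq would be throttled until a
driver for that irq registers again.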

Samuel
