Maciej W. Rozycki wrote:
On Wed, 20 May 2020, Jacob Bachmeyer wrote:
So the solution to the problem is twofold. First, pending force-kills
are not cancelled after `wait' if there is more than one PID in the
list passed to `close_wait_program'. This follows from the observation
that if there was only one PID on the list, then the process must have
been created directly by `spawn' rather than by assigning a spawn ID to
a pipeline, and the return from `wait' means the process associated
with that PID has already been cleaned up after; it is only when there
are more PIDs that some of them may have been left behind alive.
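To illustrate the distinction with plain Tcl (not DejaGnu code): a
channel opened onto a command pipeline reports one PID per pipeline
stage, whereas `spawn' tracks exactly one process:

  # Plain Tcl illustration: `pid' on a command pipeline channel returns
  # a list with one entry per pipeline stage.
  set chan [open "| echo hello | tr a-z A-Z" r]
  puts "pipeline PIDs: [pid $chan]"    ;# two PIDs, one per stage
  close $chan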
Why do some cases assign a spawn ID to a pipeline while others use
`spawn' directly?
Have you seen the descriptions around `local_exec'? Or are you after
further explanations?
If I understand correctly, the problems occur when `local_exec' starts
a pipeline using `open' instead of directly running the process using
`spawn'. Could this be fixed at the source by using `spawn' to run
{/bin/sh -c "exec \"$commandline\" $inp $outp 2>&1"} (including the
braces this time) instead of opening a pipeline? The shell would set up
the redirections and then replace itself with the requested process.
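For concreteness, a rough sketch of what I have in mind (hypothetical,
not the current `local_exec' code; it assumes the $commandline, $inp
and $outp values that `local_exec' already computes):

  # Hypothetical sketch: the shell performs the redirections and then
  # replaces itself with the requested command via exec, so Expect ends
  # up tracking a single PID rather than a pipeline.
  set pid [spawn /bin/sh -c "exec $commandline $inp $outp 2>&1"]

The force-kill bookkeeping would then always see a single PID, just as
in the directly-spawned case.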
Second, if a pipeline has been used, then the channel associated with
the pipeline is set to non-blocking mode in case any of the processes
that may have been left alive is stuck in uninterruptible sleep (aka
the D state). Such a process ignores even SIGKILL for as long as it
remains in that state and would cause the wait(2) call made by `close'
to hang, possibly indefinitely, and we want the testsuite to proceed
rather than hang even in bad circumstances.
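This relies on documented Tcl behaviour: closing a command pipeline in
blocking mode makes `close' wait for all the children, while in
non-blocking mode `close' returns without waiting for them. A
standalone illustration (plain Tcl, not DejaGnu code):

  # Plain Tcl illustration: once the channel is non-blocking, `close'
  # no longer waits for the child process to exit.
  set chan [open "| sleep 1000" w]
  fconfigure $chan -blocking false
  close $chan    ;# returns at once instead of waiting ~1000 seconds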
This amounts to leaking PIDs, but that should be tolerable since the
condition causing it is abnormal and external to the testsuite.
Having a process permanently stuck in the D state is exceedingly rare
and usually signifies a hardware failure (oh well) or a software bug
somewhere (but we want to be able to run software verification on buggy
systems, sometimes even on imperfect hardware we are trying to bring
up; how else would we chase those bugs?), and since the process has
already been sent SIGKILL it will go away immediately if it ever leaves
that state.
I have had exactly that happen a long time ago with a bug in the ext3
filesystem driver: the process that triggered the bug died at the oops
(IIRC) but any process that touched the broken mountpoint afterwards
would get stuck in D state. (Really Bad when that mountpoint is
/home!) The only recovery was to use the SysRq reboot trigger because
attempting an orderly shutdown would hang.
Otherwise there is really nothing we can do about the process, as it is
the kernel that holds it. Therefore I wouldn't consider it PID leakage.
It is a leak caused by something we do not control and for which we are
not responsible, but could there be some way to detect repeated
occurrences of this condition and abort the testsuite before effectively
forkbombing the system?
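Even something crude would do; purely as an illustration (the counter
and threshold below are invented, only `perror' is DejaGnu's existing
error-reporting call):

  # Purely illustrative sketch: count pipelines that had to be
  # abandoned via a non-blocking close and give up once it looks
  # pathological.
  global stuck_pipeline_count
  if { [incr stuck_pipeline_count] > 10 } {
      perror "too many processes left stuck in D state; aborting"
      exit 1
  }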
dejagnu-remote-close-wait-unblock.diff
Index: dejagnu/lib/remote.exp
===================================================================
--- dejagnu.orig/lib/remote.exp
+++ dejagnu/lib/remote.exp
@@ -109,10 +109,15 @@ proc close_wait_program { program_id pid
     # Reap it.
     set res [catch "wait -i $program_id" wres]
-    if {$exec_pid != -1} {
+    if { $exec_pid != -1 && [llength $pid] == 1 } {
         # We reaped the process, so cancel the pending force-kills, as
         # otherwise if the PID is reused for some other unrelated
         # process, we'd kill the wrong process.
+        #
+        # Do this if the PID list only has a single entry however, as
+        # otherwise `wait' will have returned once any single process
+        # of the pipeline has exited regardless of whether any other
+        # ones have remained.
         #
         # Use `catch' in case the force-kills have completed, so as not
         # to cause TCL to choke if `kill' returns a failure.
@@ -243,6 +248,12 @@ proc local_exec { commandline inp outp t
     }
     set r2 [close_wait_program $spawn_id $pid wres]
     if { $id > 0 } {
+        if { ! $got_eof } {
+            # If timed-out, don't wait for all the processes associated
+            # with the pipeline to terminate as a stuck one would cause
+            # us to hang.
+            set r2 [catch "fconfigure $id -blocking false" res]
+        }
         set r2 [catch "close $id" res]
     } else {
         verbose "waitres is $wres" 2
The big question I have is whether this patch breaks the GDB testsuite?
The "cancel force-kill" mechanism was added to fix a problem with
testing GDB.
I'm not set up to run remote GDB testing with QEMU at the moment; I'll
see if I can either wire it up or use a different system with real
remote hardware. I think it is not going to matter though, as commit
e7d7a3e0b0cd mentions GDB itself being affected rather than any test
cases run (via `gdbserver') on the remote system, and I'd expect GDB to
be spawned directly rather than run as a pipeline. In that case there
will be a single PID only, the invocation of `wait' will have
wait(2)ed on it, and the force-kills will be cancelled as per the
`[llength $pid] == 1' condition.
In that case, at least check that the patch does not break the native
GDB testsuite.
-- Jacob