Re: [PATCH 2/2] remote: Fix a stuck remote call pipeline causing testing to hang

Jacob Bachmeyer Wed, 20 May 2020 19:15:20 -0700

Maciej W. Rozycki wrote:

[...]
So the solution to the problem is twofold. First pending force-killsare not killed after `wait' if there are more than one PID in the listpassed to `close_wait_program'. This follows the observation that ifthere was only one PID on the list, then the process must have beencreated directly by `spawn' rather than by assigning a spawn ID to apipeline and the return from `wait' would mean the process associatedwith the PID must have already been cleaned up after, so it is only whenthere are more there is a possibility any may have been left behindlive.

Why are some cases assigning a spawn ID to a pipeline and othersdirectly using spawn?

Second if a pipeline has been used, then the channel associated with thepipeline is set to the nonblocking mode in case any of the processesthat may have left live is stuck in the noninterruptible sleep (aka D)state. Such a process would necessarily ignore even SIGKILL so long asit remains in that state and would cause wait(2) called by `close' tohang possibly indefinitely, and we want the testsuite to proceed ratherthan hang even in bad circumstances.

This amounts to leaking PIDs but should be tolerable since the conditioncausing it is abnormal and external to the testsuite.

Finally it appears to be safe to leave pending force-kills to completetheir job after `wait' has been called in `close_wait_program', becausebased on the observation made here the command does not actually callwait(2) if issued on a spawn ID associated with a pipeline created by`open' rather than a process created by `spawn'. Instead the PIDs froma pipeline are supposed to be cleaned up after by calling wait(2) fromthe `close' command call made on the pipeline channel. If on the otherhand the channel is set to the nonblocking mode before `close', theneven that command does not call wait(2) on the associated PIDs.
Therefore the PIDs on the list passed are not subject to PID reuse andthe force-kills won't accidentally kill an unrelated process, as a PIDcannot be allocated by the kernel for a new process until any previousprocess's status has been consumed from its PID by wait(2). And thenPIDs of any children that have actually terminated one way or anotherare wait(2)ed for by Tcl automatically, so no mess is left behind.

The above two paragraphs need to be in the Git commit message if thispatch is merged; they describe assumptions critical to correct operationthat I do not believe are the documented behavior of Tcl.

* lib/remote.exp (close_wait_program): Only kill the pendingforce-kills if the PID list has a single entry.(local_exec): If we didn't see an EOF, then set the channelabout to be closed to the nonblocking mode.
Signed-off-by: Maciej W. Rozycki <[email protected]>
---
Hi,
I have observed the hang in actual testing, where I ran full GCC testing(including all language front-ends and all target libraries) remotely forthe `riscv64-linux-gnu' target with QEMU in the system emulation mode. Ileft it running over the UK long weekend of May 8th-10th and upon returnfound the testsuite stuck for several days. The underlying reason was anunindentified issue with QEMU or the Linux installation within causing theemulator's virtual network interface to go down. However the testsuiteought not to have got stuck indefinitely on the host system. The processstate on the host was as shown above, with `ssh' still live and both `cat'and `sh' in the zombie state.
I have since reproduced the issue with a shell script substituting `ssh'with a command that reliably hangs, which let me track down the cause anddiagnose it as above. This fix then let GCC testing run to completion,despite intermittent issues with QEMU remaining and causing occasionaltime-outs.
 Therefore, please apply.  FAOD this has been formatted for `git am' use.

  Maciej
---
 lib/remote.exp |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

dejagnu-remote-close-wait-unblock.diff
Index: dejagnu/lib/remote.exp
===================================================================
--- dejagnu.orig/lib/remote.exp
+++ dejagnu/lib/remote.exp
@@ -109,10 +109,15 @@ proc close_wait_program { program_id pid
# Reap it.
     set res [catch "wait -i $program_id" wres]
-    if {$exec_pid != -1} {
+    if { $exec_pid != -1 && [llength $pid] == 1 } {
        # We reaped the process, so cancel the pending force-kills, as
        # otherwise if the PID is reused for some other unrelated
        # process, we'd kill the wrong process.
+       #
+       # Do this if the PID list only has a single entry however, as
+       # otherwise `wait' will have returned once any single process
+       # of the pipeline has exited regardless of whether any other
+       # ones have remained.
        #
        # Use `catch' in case the force-kills have completed, so as not
        # to cause TCL to choke if `kill' returns a failure.
@@ -243,6 +248,12 @@ proc local_exec { commandline inp outp t
     }
     set r2 [close_wait_program $spawn_id $pid wres]
     if { $id > 0 } {
+       if { ! $got_eof } {
+           # If timed-out, don't wait for all the processes associated
+           # with the pipeline to terminate as a stuck one would cause
+           # us to hang.
+           set r2 [catch "fconfigure $id -blocking false" res]
+       }
        set r2 [catch "close $id" res]
     } else {
        verbose "waitres is $wres" 2

The big question I have is whether this patch breaks the GDB testsuite?The "cancel force-kill" mechanism was added to fix a problem withtesting GDB.


-- Jacob

Re: [PATCH 2/2] remote: Fix a stuck remote call pipeline causing testing to hang

Reply via email to