Maciej W. Rozycki wrote:
[...]
So the solution to the problem is twofold. First pending force-kills
are not killed after `wait' if there are more than one PID in the list
passed to `close_wait_program'. This follows the observation that if
there was only one PID on the list, then the process must have been
created directly by `spawn' rather than by assigning a spawn ID to a
pipeline and the return from `wait' would mean the process associated
with the PID must have already been cleaned up after, so it is only when
there are more there is a possibility any may have been left behind
live.
Why are some cases assigning a spawn ID to a pipeline and others
directly using spawn?
Second if a pipeline has been used, then the channel associated with the
pipeline is set to the nonblocking mode in case any of the processes
that may have left live is stuck in the noninterruptible sleep (aka D)
state. Such a process would necessarily ignore even SIGKILL so long as
it remains in that state and would cause wait(2) called by `close' to
hang possibly indefinitely, and we want the testsuite to proceed rather
than hang even in bad circumstances.
This amounts to leaking PIDs but should be tolerable since the condition
causing it is abnormal and external to the testsuite.
Finally it appears to be safe to leave pending force-kills to complete
their job after `wait' has been called in `close_wait_program', because
based on the observation made here the command does not actually call
wait(2) if issued on a spawn ID associated with a pipeline created by
`open' rather than a process created by `spawn'. Instead the PIDs from
a pipeline are supposed to be cleaned up after by calling wait(2) from
the `close' command call made on the pipeline channel. If on the other
hand the channel is set to the nonblocking mode before `close', then
even that command does not call wait(2) on the associated PIDs.
Therefore the PIDs on the list passed are not subject to PID reuse and
the force-kills won't accidentally kill an unrelated process, as a PID
cannot be allocated by the kernel for a new process until any previous
process's status has been consumed from its PID by wait(2). And then
PIDs of any children that have actually terminated one way or another
are wait(2)ed for by Tcl automatically, so no mess is left behind.
The above two paragraphs need to be in the Git commit message if this
patch is merged; they describe assumptions critical to correct operation
that I do not believe are the documented behavior of Tcl.
* lib/remote.exp (close_wait_program): Only kill the pending
force-kills if the PID list has a single entry.
(local_exec): If we didn't see an EOF, then set the channel
about to be closed to the nonblocking mode.
Signed-off-by: Maciej W. Rozycki <ma...@wdc.com>
---
Hi,
I have observed the hang in actual testing, where I ran full GCC testing
(including all language front-ends and all target libraries) remotely for
the `riscv64-linux-gnu' target with QEMU in the system emulation mode. I
left it running over the UK long weekend of May 8th-10th and upon return
found the testsuite stuck for several days. The underlying reason was an
unindentified issue with QEMU or the Linux installation within causing the
emulator's virtual network interface to go down. However the testsuite
ought not to have got stuck indefinitely on the host system. The process
state on the host was as shown above, with `ssh' still live and both `cat'
and `sh' in the zombie state.
I have since reproduced the issue with a shell script substituting `ssh'
with a command that reliably hangs, which let me track down the cause and
diagnose it as above. This fix then let GCC testing run to completion,
despite intermittent issues with QEMU remaining and causing occasional
time-outs.
Therefore, please apply. FAOD this has been formatted for `git am' use.
Maciej
---
lib/remote.exp | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
dejagnu-remote-close-wait-unblock.diff
Index: dejagnu/lib/remote.exp
===================================================================
--- dejagnu.orig/lib/remote.exp
+++ dejagnu/lib/remote.exp
@@ -109,10 +109,15 @@ proc close_wait_program { program_id pid
# Reap it.
set res [catch "wait -i $program_id" wres]
- if {$exec_pid != -1} {
+ if { $exec_pid != -1 && [llength $pid] == 1 } {
# We reaped the process, so cancel the pending force-kills, as
# otherwise if the PID is reused for some other unrelated
# process, we'd kill the wrong process.
+ #
+ # Do this if the PID list only has a single entry however, as
+ # otherwise `wait' will have returned once any single process
+ # of the pipeline has exited regardless of whether any other
+ # ones have remained.
#
# Use `catch' in case the force-kills have completed, so as not
# to cause TCL to choke if `kill' returns a failure.
@@ -243,6 +248,12 @@ proc local_exec { commandline inp outp t
}
set r2 [close_wait_program $spawn_id $pid wres]
if { $id > 0 } {
+ if { ! $got_eof } {
+ # If timed-out, don't wait for all the processes associated
+ # with the pipeline to terminate as a stuck one would cause
+ # us to hang.
+ set r2 [catch "fconfigure $id -blocking false" res]
+ }
set r2 [catch "close $id" res]
} else {
verbose "waitres is $wres" 2
The big question I have is whether this patch breaks the GDB testsuite?
The "cancel force-kill" mechanism was added to fix a problem with
testing GDB.
-- Jacob