Maciej W. Rozycki wrote:
On Wed, 20 May 2020, Jacob Bachmeyer wrote:
So the solution to the problem is twofold. First, pending force-kills
are not cancelled after `wait' if there is more than one PID in the
list passed to `close_wait_program'. This follows from the observation
that if there was only one PID on the list, then the process must have
been created directly by `spawn' rather than by assigning a spawn ID to
a pipeline, and the return from `wait' means the process associated
with that PID has already been cleaned up after; it is only when there
are more PIDs that some of them may have been left behind alive.
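To illustrate the distinction with plain Tcl (not DejaGnu code): a
channel opened onto a command pipeline reports one PID per pipeline
stage, whereas `spawn' tracks exactly one process:

  # Plain Tcl illustration: `pid' on a command pipeline channel returns
  # a list with one entry per pipeline stage.
  set chan [open "| echo hello | tr a-z A-Z" r]
  puts "pipeline PIDs: [pid $chan]"    ;# two PIDs, one per stage
  close $chan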
Why do some cases assign a spawn ID to a pipeline while others use
`spawn' directly?
Have you seen the descriptions around `local_exec'? Or are you after
further explanations?
If I understand correctly, the problems occur when `local_exec' starts
a pipeline using `open' instead of directly running the process using
`spawn'. Could this be fixed at the source by using `spawn' to run
{/bin/sh -c "exec \"$commandline\" $inp $outp 2>&1"} (including the
braces this time) instead of opening a pipeline? The shell would set up
the redirections and then replace itself with the requested process.
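For concreteness, a rough sketch of what I have in mind (hypothetical,
not the current `local_exec' code; it assumes the $commandline, $inp
and $outp values that `local_exec' already computes):

  # Hypothetical sketch: the shell performs the redirections and then
  # replaces itself with the requested command via exec, so Expect ends
  # up tracking a single PID rather than a pipeline.
  set pid [spawn /bin/sh -c "exec $commandline $inp $outp 2>&1"]

The force-kill bookkeeping would then always see a single PID, just as
in the directly-spawned case.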
Second, if a pipeline has been used, then the channel associated with
the pipeline is set to non-blocking mode in case any of the processes
that may have been left alive is stuck in uninterruptible sleep (aka
the D state). Such a process ignores even SIGKILL for as long as it
remains in that state and would cause the wait(2) call made by `close'
to hang, possibly indefinitely, and we want the testsuite to proceed
rather than hang even in bad circumstances.
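This relies on documented Tcl behaviour: closing a command pipeline in
blocking mode makes `close' wait for all the children, while in
non-blocking mode `close' returns without waiting for them. A
standalone illustration (plain Tcl, not DejaGnu code):

  # Plain Tcl illustration: once the channel is non-blocking, `close'
  # no longer waits for the child process to exit.
  set chan [open "| sleep 1000" w]
  fconfigure $chan -blocking false
  close $chan    ;# returns at once instead of waiting ~1000 seconds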
This amounts to leaking PIDs, but that should be tolerable since the
condition causing it is abnormal and external to the testsuite.
Having a process permanently stuck in the D state is exceedingly rare
and usually signifies a hardware failure (oh well) or a software bug
somewhere (but we want to be able to run software verification on buggy
systems, sometimes even on imperfect hardware we are trying to bring
up; how else would we chase those bugs?), and since the process has
already been sent SIGKILL it will go away immediately if it ever leaves
that state.
I have had exactly that happen a long time ago with a bug in the ext3
filesystem driver: the process that triggered the bug died at the oops
(IIRC) but any process that touched the broken mountpoint afterwards
would get stuck in D state. (Really Bad when that mountpoint is
/home!) The only recovery was to use the SysRq reboot trigger because
attempting an orderly shutdown would hang.
Otherwise there is really nothing we can do about the process, as it is
the kernel that holds it. Therefore I wouldn't consider it PID leakage.
It is a leak caused by something we do not control and for which we are
not responsible, but could there be some way to detect repeated
occurrences of this condition and abort the testsuite before effectively
forkbombing the system?
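Even something crude would do; purely as an illustration (the counter
and threshold below are invented, only `perror' is DejaGnu's existing
error-reporting call):

  # Purely illustrative sketch: count pipelines that had to be
  # abandoned via a non-blocking close and give up once it looks
  # pathological.
  global stuck_pipeline_count
  if { [incr stuck_pipeline_count] > 10 } {
      perror "too many processes left stuck in D state; aborting"
      exit 1
  }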
dejagnu-remote-close-wait-unblock.diff
Index: dejagnu/lib/remote.exp
===================================================================
--- dejagnu.orig/lib/remote.exp
+++ dejagnu/lib/remote.exp
@@ -109,10 +109,15 @@ proc close_wait_program { program_id pid
     # Reap it.
     set res [catch "wait -i $program_id" wres]
-    if {$exec_pid != -1} {
+    if { $exec_pid != -1 && [llength $pid] == 1 } {
         # We reaped the process, so cancel the pending force-kills, as
         # otherwise if the PID is reused for some other unrelated
         # process, we'd kill the wrong process.
+        #
+        # Do this if the PID list only has a single entry however, as
+        # otherwise `wait' will have returned once any single process
+        # of the pipeline has exited regardless of whether any other
+        # ones have remained.
         #
         # Use `catch' in case the force-kills have completed, so as not
         # to cause TCL to choke if `kill' returns a failure.
@@ -243,6 +248,12 @@ proc local_exec { commandline inp outp t
     }
     set r2 [close_wait_program $spawn_id $pid wres]
     if { $id > 0 } {
+        if { ! $got_eof } {
+            # If timed-out, don't wait for all the processes associated
+            # with the pipeline to terminate as a stuck one would cause
+            # us to hang.
+            set r2 [catch "fconfigure $id -blocking false" res]
+        }
         set r2 [catch "close $id" res]
     } else {
         verbose "waitres is $wres" 2
The big question I have is whether this patch breaks the GDB testsuite?
The "cancel force-kill" mechanism was added to fix a problem with
testing GDB.
I'm not set up to run remote GDB testing with QEMU at the moment; I'll
see if I can either wire it up or use a different system with real
remote hardware. I think it is not going to matter though, as commit
e7d7a3e0b0cd mentions GDB itself being affected rather than any test
cases run (via `gdbserver') on the remote system, and I'd expect GDB to
be spawned directly rather than run as a pipeline. In that case there
will be a single PID only, the invocation of `wait' will have
wait(2)ed on it, and the force-kills will be cancelled as per the
`[llength $pid] == 1' condition.
In that case, at least check that the patch does not break the native
GDB testsuite.
-- Jacob