Re: [PATCH 2/2] remote: Fix a stuck remote call pipeline causing testing to hang

Maciej W. Rozycki Thu, 21 May 2020 17:14:53 -0700

On Wed, 20 May 2020, Jacob Bachmeyer wrote:

> > So the solution to the problem is twofold.  First pending force-kills 
> > are not killed after `wait' if there are more than one PID in the list 
> > passed to `close_wait_program'.  This follows the observation that if 
> > there was only one PID on the list, then the process must have been 
> > created directly by `spawn' rather than by assigning a spawn ID to a 
> > pipeline and the return from `wait' would mean the process associated 
> > with the PID must have already been cleaned up after, so it is only when 
> > there are more there is a possibility any may have been left behind 
> > live.
> >   
> 
> Why are some cases assigning a spawn ID to a pipeline and others 
> directly using spawn?


 Have you seen the descriptions around `local_exec'?  Or are you after 
further explanations?

> > Second if a pipeline has been used, then the channel associated with the 
> > pipeline is set to the nonblocking mode in case any of the processes 
> > that may have left live is stuck in the noninterruptible sleep (aka D) 
> > state.  Such a process would necessarily ignore even SIGKILL so long as 
> > it remains in that state and would cause wait(2) called by `close' to 
> > hang possibly indefinitely, and we want the testsuite to proceed rather 
> > than hang even in bad circumstances.
> >   
> 
> This amounts to leaking PIDs but should be tolerable since the condition 
> causing it is abnormal and external to the testsuite.

 Having a process permanently stuck in the D state is exceedingly rare and 
often signifies a hardware failure (oh well) or a software bug somewhere 
(but we want to be able to run software verification with buggy systems, 
sometimes even with imperfect hardware we want to bring up, or otherwise 
how would we chase those bugs?), and since the process has been treated 
with SIGKILL already it will go away right away if it ever leaves that 
state.  Otherwise there is nothing really we can do about the process as 
it's the kernel that holds it.  Therefore I wouldn't consider it PID 
leakage.
 
> > dejagnu-remote-close-wait-unblock.diff
> > Index: dejagnu/lib/remote.exp
> > ===================================================================
> > --- dejagnu.orig/lib/remote.exp
> > +++ dejagnu/lib/remote.exp
> > @@ -109,10 +109,15 @@ proc close_wait_program { program_id pid
> >  
> >      # Reap it.
> >      set res [catch "wait -i $program_id" wres]
> > -    if {$exec_pid != -1} {
> > +    if { $exec_pid != -1 && [llength $pid] == 1 } {
> >     # We reaped the process, so cancel the pending force-kills, as
> >     # otherwise if the PID is reused for some other unrelated
> >     # process, we'd kill the wrong process.
> > +   #
> > +   # Do this if the PID list only has a single entry however, as
> > +   # otherwise `wait' will have returned once any single process
> > +   # of the pipeline has exited regardless of whether any other
> > +   # ones have remained.
> >     #
> >     # Use `catch' in case the force-kills have completed, so as not
> >     # to cause TCL to choke if `kill' returns a failure.
> > @@ -243,6 +248,12 @@ proc local_exec { commandline inp outp t
> >      }
> >      set r2 [close_wait_program $spawn_id $pid wres]
> >      if { $id > 0 } {
> > +   if { ! $got_eof } {
> > +       # If timed-out, don't wait for all the processes associated
> > +       # with the pipeline to terminate as a stuck one would cause
> > +       # us to hang.
> > +       set r2 [catch "fconfigure $id -blocking false" res]
> > +   }
> >     set r2 [catch "close $id" res]
> >      } else {
> >     verbose "waitres is $wres" 2
> >   
> 
> The big question I have is whether this patch breaks the GDB testsuite?  
> The "cancel force-kill" mechanism was added to fix a problem with 
> testing GDB.

 I'm not set up to run remote GDB testing with QEMU at the moment; I'll 
see if I can either wire it or use a different system with real remote 
hardware.  I think it is not going to matter as commit e7d7a3e0b0cd 
mentions GDB itself being affected rather than any test cases run (via 
`gdbserver') on the remote system, and I'd expect GDB to be directly 
spawned rather than run as a pipeline, in which case we'll have a single 
PID only, the invocation of `wait' will have wait(2)ed on it and the 
force-kills will be cancelled as per the `[llength $pid] == 1' condition.

  Maciej

Re: [PATCH 2/2] remote: Fix a stuck remote call pipeline causing testing to hang

Reply via email to