On Thu, 21 May 2020, Jacob Bachmeyer wrote:

> > Have you seen the descriptions around `local_exec'?  Or are you after
> > further explanations?
>
> If I understand correctly, the problems occur when local_exec starts a
> pipeline using open instead of directly running the process using
> spawn.  Could this be fixed at the source by using spawn to run {/bin/sh
> -c "exec \"$commandline\" $inp $outp 2>&1"} (including the braces this
> time) instead of opening a pipeline?  The shell would set up the
> redirections and then replace itself with the requested process.
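As an aside, the replace-rather-than-fork property that proposal relies
on can be demonstrated with a quick sketch (Python is used here purely
to drive the shell and is not DejaGnu code; the PIDs printed are the
shells' own):

```python
import subprocess

# A shell prints its PID and then exec's another shell that prints its
# PID: `exec' replaces the process image without forking, so both
# report the same PID -- the harness would end up with a single PID to
# manage.  The inner single quotes keep the outer shell from expanding
# the inner $$ prematurely.
out = subprocess.run(
    ["sh", "-c", "echo $$; exec sh -c 'echo $$'"],
    capture_output=True, text=True, check=True,
).stdout.split()

before_exec, after_exec = out
print(before_exec == after_exec)
```

The redirections set up before the `exec' are likewise inherited by the
replacement process, which is what makes the construct equivalent to a
pipeline for I/O purposes.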
I suspect that your proposed alternative will work, however I think it
will indeed require extensive verification.  For one, extra quoting will
likely be required for $commandline, which is externally supplied after
all, as the shell will otherwise do its own
expansions/substitutions/etc. that a Tcl pipeline does not.  There is
also the question of why this wasn't done via `sh' from the beginning.
Therefore I think keeping the fix to the minimum is a good way to ensure
forward progress; whoever wants to can then spend their time working
comfortably on a code rewrite/improvement without the urge to fix an
outright bug.

NB we already have issues in this area, as I can use say:

  set_board_info rsh_prog "ssh -p 12345"

in a GCC board description file and that will do what's intended, that
is the harness will run `ssh' with `-p' and `12345' command-line
arguments; however the same arrangement will fail in a GDB board
description file, as the harness will try to run an `ssh -p 12345'
executable and fail to find it.  Obviously there is an inconsistency in
quoting, and as a matter of interest I have tracked it down to the GDB
testsuite using `remote_spawn' rather than `local_exec' to execute
commands on a remote system.  So we do have to be very careful when
putting a command shell into or removing it from the picture.

> > Otherwise there is nothing really we can do about the process as
> > it's the kernel that holds it.  Therefore I wouldn't consider it PID
> > leakage.
>
> It is a leak caused by something we do not control and for which we
> are not responsible, but could there be some way to detect repeated
> occurrences of this condition and abort the testsuite before
> effectively forkbombing the system?

I think the solution to that problem is called `ulimit'; I see no need
to invent an alternative way to deal with this exceedingly rare
situation.
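For reference, the knob in question is `ulimit -u' in the shell that
invokes `runtest', i.e. RLIMIT_NPROC.  A minimal sketch of imposing the
cap from a wrapper (Python purely for illustration; the figure of 4096
is arbitrary and would need to exceed what the user already has
running):

```python
import resource

# Cap the soft limit on the number of processes for this session and
# everything it forks; a runaway fork cascade then fails with EAGAIN
# instead of rendering the machine unusable.  The hard limit is left
# alone so the cap can be raised again if need be.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
cap = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
resource.setrlimit(resource.RLIMIT_NPROC, (cap, hard))

soft_now, _ = resource.getrlimit(resource.RLIMIT_NPROC)
```

Note that RLIMIT_NPROC counts all processes of the invoking user, not
just descendants of the wrapper, so the cap has to be chosen with the
rest of the user's workload in mind.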
And the timeouts we have that control progress with testing actually
guarantee that any resource exhaustion won't be so quick as to make the
system unresponsive.

> >> The big question I have is whether this patch breaks the GDB
> >> testsuite?  The "cancel force-kill" mechanism was added to fix a
> >> problem with testing GDB.

Please note however that the circumstances do change with my proposed
fix.

Before the original fix was applied, close(n) was called on a pipeline
channel operating in blocking mode.  That made the request issue
wait(2) syscalls on all the PIDs associated with the pipeline, which in
turn opened a PID reuse race with the force-kills, as one or more of
those force-kills could have executed only once the respective PID had
already been wait(2)ed for and handed out to a new process created
meanwhile.  Killing the force-kills before close(n), and consequently
wait(2), had been called was the original fix, and it was correct in
that it eliminated the race; however it introduced the deadlock I have
observed here instead, as a wait(2) call invoked by close(n) on a
process that has no intent to terminate is going to stall forever.

With the switch to nonblocking mode for the channel I have eliminated
the issue of close(n) calling wait(2), and therefore there is no issue
with a PID reuse race anymore, because, as I have observed in the
change description, the PID of a process that has not been wait(2)ed
for remains assigned to the original process even once it's gone, so
the PID cannot be handed out to a new process.  So the signals issued
by the force-kills will reach the original processes, be they live if
stuck or zombified if already gone.

I guess there might still be a question as to what happens if a
pipeline has not become stuck and instead has merely completed "right
after the timeout has triggered, but before all of this close-and-wait
dance."
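The zombie property relied on above is easy to check directly; a
minimal sketch outside DejaGnu (Python again purely for illustration):

```python
import os
import time

# Fork a child that exits immediately, but do not wait(2) for it.
pid = os.fork()
if pid == 0:
    os._exit(0)

time.sleep(0.2)  # the child is gone by now, yet unreaped

# The PID is still reserved by the zombie, so kill(2) can still address
# it -- signal 0 performs the existence/permission check only, and
# delivery to a zombie is a harmless no-op.  A late force-kill
# therefore cannot hit an unrelated, newly created process.
os.kill(pid, 0)
still_reserved = True

# Only once the child has been wait(2)ed for is the PID released for
# reuse; a kill(2) after this point would raise ProcessLookupError.
os.waitpid(pid, 0)
```

(How a pipeline that completed right in that window interacts with the
subsequent close(n) is a separate matter, down to Tcl channel semantics
rather than the kernel.)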
I'm not completely sure here, and the close(n) documentation
cryptically says:

"If the channel is nonblocking and there is unflushed output, the
channel remains open and the command returns immediately; output will
be flushed in the background and the channel will be closed when all
the flushing is complete."

without actually clarifying what "in the background" means here.  There
is a further hint in fconfigure(n):

"For nonblocking mode to work correctly, the application must be using
the Tcl event loop (e.g. by calling Tcl_DoOneEvent or invoking the
vwait command)."

and "the application" here is obviously Expect.  Expect documentation
is rather terse as to its event loop handling, but I have looked
through its sources and `Tcl_DoOneEvent' is only called in two places,
specifically `exp_dsleep' and `exp_get_next_event', which in turn are
the backends to the delay and interactive commands respectively, such
as `sleep' and `expect'.

> > I'm not set up to run remote GDB testing with QEMU at the moment;
> > I'll see if I can either wire it or use a different system with real
> > remote hardware.  I think it is not going to matter as commit
> > e7d7a3e0b0cd mentions GDB itself being affected rather than any test
> > cases run (via `gdbserver') on the remote system, and I'd expect GDB
> > to be directly spawned rather than run as a pipeline, in which case
> > we'll have a single PID only, the invocation of `wait' will have
> > wait(2)ed on it and the force-kills will be cancelled as per the
> > `[llength $pid] == 1' condition.
>
> In that case, at least check that the patch does not break the native
> GDB testsuite.

I have now verified the patch both remotely, with the
`riscv64-linux-gnu' target, and natively, with the `x86_64-linux-gnu'
host in both cases (and with site.exp tweaked in the latter case with
CC_FOR_TARGET and CXX_FOR_TARGET settings so as to avoid the unix.exp
regression discussed in the other thread).
There were no issues, but honestly, for such an intermittent failure as
the "cancel force-kill" mechanism was meant to address, I would not
consider that a good enough sample.  So instead I did some further
manual investigation and examined what kinds of external program
execution patterns there are in the GCC and GDB test suites.

First of all, following the observation above, `local_exec' is not the
only place where pipelines are created; `remote_spawn' is another.  For
`remote_spawn' the pipeline termination sequence is invoked via
`standard_close', so obviously that procedure has to be updated
accordingly (done in v2 now).

Additionally `rsh_download' uses `exec' to run `rm -f' on the remote
system, and then likewise the `$RCP' command; if either gets stuck, no
timeout occurs and the testsuite hangs.  These invocations ought to be
updated to call `local_exec' instead, I suppose, and indeed the newer
`ssh_download' already does that.  I suggest doing that, but won't work
on it myself on this occasion as I need to switch to other commitments
ASAP, and therefore I'm only going to try and tie up the loose ends
rather than start any new work.

Finally `standard_spawn' is also used to run programs on the remote
system, but it always just uses `spawn' with $RSH.

Now the usage is as follows:

1. GCC testing uses `local_exec' to run compilations and uses `spawn'
   there rather than a pipeline as there's no redirection.

   * In native testing it uses `remote_spawn' with the `readonly'
     argument to run the program under test though, so a pipeline does
     get created.

   * In remote testing it uses `local_exec' to run $RSH with both the
     program under test and several other commands like `rm -f' or
     `mkdir', always with input redirected, so a pipeline does get
     created in all these cases.  As noted above, `rsh_download' is
     also used, which calls $RSH with `exec'.

   * In simulator testing it uses `remote_spawn' with no arguments, so
     it uses `spawn' and no pipeline.

2. GDB testing likewise uses `local_exec' to run compilations and uses
   `spawn' there rather than a pipeline as there's no redirection.  It
   uses `remote_spawn' with no arguments to run GDB, so again it uses
   `spawn'.

   * In remote testing it uses `standard_spawn' to run `gdbserver' via
     $RSH on the remote system, so again it uses `spawn'.  It also uses
     `local_exec' to run $RSH with `rm -f' with input redirected, so a
     pipeline does get created in this case.  Finally, as noted above,
     `rsh_download' is also used, which calls $RSH with `exec'.

So it looks to me like there is still some, though low, likelihood for
the PID reuse to happen, although with GDB testing it will only be in
the remote case and only if it's `rm -f' that hangs, i.e. not with the
usual test hangs.  Conversely the likelihood of PID reuse is somewhat
higher with GCC testing, both native and remote, but then timeouts
(hangs) seem much rarer there.

It looks to me like the benefits outweigh the risk, so I've posted v2
for consideration.

  Maciej