On Thu, 21 May 2020, Jacob Bachmeyer wrote:

> > Have you seen the descriptions around `local_exec'?  Or are you after
> > further explanations?
>
> If I understand correctly, the problems occur when local_exec starts a
> pipeline using open instead of directly running the process using
> spawn.  Could this be fixed at the source by using spawn to run {/bin/sh
> -c "exec \"$commandline\" $inp $outp 2>&1"} (including the braces this
> time) instead of opening a pipeline?  The shell would set up the
> redirections and then replace itself with the requested process.
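As an aside, the replace-rather-than-fork property that proposal relies
on can be demonstrated with a quick sketch (Python is used here purely
to drive the shell and is not DejaGnu code; the PIDs printed are the
shells' own):

```python
import subprocess

# A shell prints its PID and then exec's another shell that prints its
# PID: `exec' replaces the process image without forking, so both
# report the same PID -- the harness would end up with a single PID to
# manage.  The inner single quotes keep the outer shell from expanding
# the inner $$ prematurely.
out = subprocess.run(
    ["sh", "-c", "echo $$; exec sh -c 'echo $$'"],
    capture_output=True, text=True, check=True,
).stdout.split()

before_exec, after_exec = out
print(before_exec == after_exec)
```

The redirections set up before the `exec' are likewise inherited by the
replacement process, which is what makes the construct equivalent to a
pipeline for I/O purposes.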
I suspect that your proposed alternative will work, however I think it
will indeed require extensive verification.  For one, extra quoting will
likely be required for $commandline, which is externally supplied after
all, as the shell will otherwise do its own
expansions/substitutions/etc. that a Tcl pipeline does not.  There is
also the question of why this wasn't done via `sh' from the beginning.
Therefore I think keeping the fix to the minimum is a good way to ensure
forward progress; whoever wants to can then spend their time working
comfortably on a code rewrite/improvement without the urge to fix an
outright bug.

NB we already have issues in this area, as I can use say:

  set_board_info rsh_prog "ssh -p 12345"

in a GCC board description file and that will do what's intended, that
is the harness will run `ssh' with `-p' and `12345' command-line
arguments; however the same arrangement will fail in a GDB board
description file, as the harness will try to run an `ssh -p 12345'
executable and fail to find it.  Obviously there is an inconsistency in
quoting, and as a matter of interest I have tracked it down to the GDB
testsuite using `remote_spawn' rather than `local_exec' to execute
commands on a remote system.  So we do have to be very careful when
putting a command shell into or removing it from the picture.

> > Otherwise there is nothing really we can do about the process as
> > it's the kernel that holds it.  Therefore I wouldn't consider it PID
> > leakage.
>
> It is a leak caused by something we do not control and for which we
> are not responsible, but could there be some way to detect repeated
> occurrences of this condition and abort the testsuite before
> effectively forkbombing the system?

I think the solution to that problem is called `ulimit'; I see no need
to invent an alternative way to deal with this exceedingly rare
situation.
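For reference, the knob in question is `ulimit -u' in the shell that
invokes `runtest', i.e. RLIMIT_NPROC.  A minimal sketch of imposing the
cap from a wrapper (Python purely for illustration; the figure of 4096
is arbitrary and would need to exceed what the user already has
running):

```python
import resource

# Cap the soft limit on the number of processes for this session and
# everything it forks; a runaway fork cascade then fails with EAGAIN
# instead of rendering the machine unusable.  The hard limit is left
# alone so the cap can be raised again if need be.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
cap = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
resource.setrlimit(resource.RLIMIT_NPROC, (cap, hard))

soft_now, _ = resource.getrlimit(resource.RLIMIT_NPROC)
```

Note that RLIMIT_NPROC counts all processes of the invoking user, not
just descendants of the wrapper, so the cap has to be chosen with the
rest of the user's workload in mind.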
And the timeouts we have that control progress with testing actually
guarantee that any resource exhaustion won't be so quick as to make the
system unresponsive.

> >> The big question I have is whether this patch breaks the GDB
> >> testsuite?  The "cancel force-kill" mechanism was added to fix a
> >> problem with testing GDB.

Please note however that the circumstances do change with my proposed
fix.

Before the original fix was applied, close(n) was called on a pipeline
channel operating in blocking mode.  That made the request issue
wait(2) syscalls on all the PIDs associated with the pipeline, which in
turn opened a PID reuse race with the force-kills, as one or more of
those force-kills could have executed only once the respective PID had
already been wait(2)ed for and handed out to a new process created
meanwhile.  Killing the force-kills before close(n), and consequently
wait(2), had been called was the original fix, and it was correct in
that it eliminated the race; however it introduced the deadlock I have
observed here instead, as a wait(2) call invoked by close(n) on a
process that has no intent to terminate is going to stall forever.

With the switch to nonblocking mode for the channel I have eliminated
the issue of close(n) calling wait(2), and therefore there is no issue
with a PID reuse race anymore, because, as I have observed in the
change description, the PID of a process that has not been wait(2)ed
for remains assigned to the original process even once it's gone, so
the PID cannot be handed out to a new process.  So the signals issued
by the force-kills will reach the original processes, be they live if
stuck or zombified if already gone.

I guess there might still be a question as to what happens if a
pipeline has not become stuck and instead has merely completed "right
after the timeout has triggered, but before all of this close-and-wait
dance."
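The zombie property relied on above is easy to check directly; a
minimal sketch outside DejaGnu (Python again purely for illustration):

```python
import os
import time

# Fork a child that exits immediately, but do not wait(2) for it.
pid = os.fork()
if pid == 0:
    os._exit(0)

time.sleep(0.2)  # the child is gone by now, yet unreaped

# The PID is still reserved by the zombie, so kill(2) can still address
# it -- signal 0 performs the existence/permission check only, and
# delivery to a zombie is a harmless no-op.  A late force-kill
# therefore cannot hit an unrelated, newly created process.
os.kill(pid, 0)
still_reserved = True

# Only once the child has been wait(2)ed for is the PID released for
# reuse; a kill(2) after this point would raise ProcessLookupError.
os.waitpid(pid, 0)
```

(How a pipeline that completed right in that window interacts with the
subsequent close(n) is a separate matter, down to Tcl channel semantics
rather than the kernel.)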
I'm not completely sure here, and the close(n) documentation
cryptically says:

"If the channel is nonblocking and there is unflushed output, the
channel remains open and the command returns immediately; output will
be flushed in the background and the channel will be closed when all
the flushing is complete."

without actually clarifying what "in the background" means here.  There
is a further hint in fconfigure(n):

"For nonblocking mode to work correctly, the application must be using
the Tcl event loop (e.g. by calling Tcl_DoOneEvent or invoking the
vwait command)."

and "the application" here is obviously Expect.  Expect documentation
is rather terse as to its event loop handling, but I have looked
through its sources and `Tcl_DoOneEvent' is only called in two places,
specifically `exp_dsleep' and `exp_get_next_event', which in turn are
the backends to the delay and interactive commands respectively, such
as `sleep' and `expect'.

> > I'm not set up to run remote GDB testing with QEMU at the moment;
> > I'll see if I can either wire it or use a different system with real
> > remote hardware.  I think it is not going to matter as commit
> > e7d7a3e0b0cd mentions GDB itself being affected rather than any test
> > cases run (via `gdbserver') on the remote system, and I'd expect GDB
> > to be directly spawned rather than run as a pipeline, in which case
> > we'll have a single PID only, the invocation of `wait' will have
> > wait(2)ed on it and the force-kills will be cancelled as per the
> > `[llength $pid] == 1' condition.
>
> In that case, at least check that the patch does not break the native
> GDB testsuite.

I have now verified the patch both remotely, with the
`riscv64-linux-gnu' target, and natively, with the `x86_64-linux-gnu'
host in both cases (and with site.exp tweaked in the latter case with
CC_FOR_TARGET and CXX_FOR_TARGET settings so as to avoid the unix.exp
regression discussed in the other thread).
There were no issues, but honestly, for such an intermittent failure as
the "cancel force-kill" mechanism was meant to address, I would not
consider that a good enough sample.  So instead I did some further
manual investigation and examined what kinds of external program
execution patterns there are in the GCC and GDB test suites.

First of all, following the observation above, `local_exec' is not the
only place where pipelines are created; `remote_spawn' is another.  For
`remote_spawn' the pipeline termination sequence is invoked via
`standard_close', so obviously that procedure has to be updated
accordingly (done in v2 now).

Additionally `rsh_download' uses `exec' to run `rm -f' on the remote
system, and then likewise the `$RCP' command; if either gets stuck, no
timeout occurs and the testsuite hangs.  These invocations ought to be
updated to call `local_exec' instead, I suppose, and indeed the newer
`ssh_download' already does that.  I suggest doing that, but won't work
on it myself on this occasion as I need to switch to other commitments
ASAP, and therefore I'm only going to try and tie up the loose ends
rather than start any new work.

Finally `standard_spawn' is also used to run programs on the remote
system, but it always just uses `spawn' with $RSH.

Now the usage is as follows:

1. GCC testing uses `local_exec' to run compilations and uses `spawn'
   there rather than a pipeline as there's no redirection.

   * In native testing it uses `remote_spawn' with the `readonly'
     argument to run the program under test though, so a pipeline does
     get created.

   * In remote testing it uses `local_exec' to run $RSH with both the
     program under test and several other commands like `rm -f' or
     `mkdir', always with input redirected, so a pipeline does get
     created in all these cases.  As noted above, `rsh_download' is
     also used, which calls $RSH with `exec'.

   * In simulator testing it uses `remote_spawn' with no arguments, so
     it uses `spawn' and no pipeline.

2. GDB testing likewise uses `local_exec' to run compilations and uses
   `spawn' there rather than a pipeline as there's no redirection.  It
   uses `remote_spawn' with no arguments to run GDB, so again it uses
   `spawn'.

   * In remote testing it uses `standard_spawn' to run `gdbserver' via
     $RSH on the remote system, so again it uses `spawn'.  It also uses
     `local_exec' to run $RSH with `rm -f' with input redirected, so a
     pipeline does get created in this case.  Finally, as noted above,
     `rsh_download' is also used, which calls $RSH with `exec'.

So it looks to me like there is still some, though low, likelihood for
the PID reuse to happen, although with GDB testing it will only be in
the remote case and only if it's `rm -f' that hangs, i.e. not with the
usual test hangs.  Conversely the likelihood of PID reuse is somewhat
higher with GCC testing, both native and remote, but then timeouts
(hangs) seem much rarer there.

It looks to me like the benefits outweigh the risk, so I've posted v2
for consideration.

  Maciej