Re: wait -n misses signaled subprocess

Steven Pelley Wed, 24 Jan 2024 09:41:20 -0800

> In the first case, if the subprocess N has terminated, its report is
> still queued and "wait" retrieves it.  In the second case, if the
> subprocess N has terminated, it doesn't exist and as the manual page
> says "If id specifies a non-existent process or job, the return status
> is 127."
>
> What you're pointing out is that that creates a race condition when the
> subprocess ends before the "wait".  And it seems that the kernel has
> enough information to tell "wait -n N", "process N doesn't exist, but
> you do have a queued termination report for it".  But it's not clear
> that there's a way to ask the kernel for that information without
> reading all the queued termination reports (and losing the ability to
> return them for other "wait" calls).


Thanks for the response, but I don't believe this is correct.

Your understanding of the wait syscall is correct except that the exit
code and process information always remain available until the
process is awaited by its parent -- it is the wait syscall itself that
reaps the process and makes it unavailable to later searches by pid.
There is a possibility that the parent (bash in this case) might reap
the process in multiple ways that race with each other (e.g., from
different threads, by setting the SIGCHLD disposition to SIG_IGN, or
by setting the SA_NOCLDWAIT flag for the SIGCHLD handler; the last two
are from the NOTES section of man waitpid on Linux), but the parent is
always given an opportunity to read the exit code and reap the process
unless that is disabled through the SIGCHLD handler configuration.

My understanding of bash is that it internally maintains a queue/list
of finished child jobs and returns entries from it, so that wait -n
mimics aspects of the wait syscall.  The discussion at
https://lists.gnu.org/archive/html/bug-bash/2023-05/msg00063.html
supports the view that bash "silently" reaps child processes and
decouples the wait syscall from the wait command.
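
As a point of comparison, here is a minimal sketch of plain wait with
a pid (no -n) collecting the status of children that have already
terminated, once by a normal exit and once by a signal.  The exit
value 3, the use of SIGTERM, and the variable names are arbitrary
choices for illustration.  The expected values rely on the documented
convention that a command killed by signal n reports 128+n.

```shell
# Child 1: exits normally with status 3 before we wait on it.
{ exit 3; } &
pid_exit=$!

# Child 2: killed by SIGTERM (signal 15) before we wait on it.
sleep 30 &
pid_sig=$!
kill -TERM "$pid_sig"

# Let both children terminate before either wait is issued.
sleep 1

# Plain "wait PID" returns the remembered status of each child.
status_exit=0
wait "$pid_exit" || status_exit=$?
status_sig=0
wait "$pid_sig" || status_sig=$?

echo "normal exit:   wait returned $status_exit"   # 3
echo "signaled exit: wait returned $status_sig"    # 143 = 128 + 15
```

Both statuses are still retrievable here even though the children
ended before the calls to wait.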

I assume it's possible to confirm that bash is awaiting the process
and retrieving the exit code via ptrace/strace but I'm unfamiliar with
these tools or bash logs.

The test below allows the subprocess to complete normally, without
being signaled, and then successfully retrieves its exit code via wait
-n.  This subprocess terminates before the call to wait -n.  I see no
documented reason that a process terminating without signal prior to
wait -n should be returned while a process terminating with signal
prior to wait -n should not.

echo "TEST: EXIT 0 PRIOR TO wait -n @${SECONDS}"
{ sleep 1; echo "child finishing @${SECONDS}"; exit 1; } &
pid=$!
echo "child proc $pid @${SECONDS}"

sleep 2
wait -n $pid
echo "wait -n $pid return code $? @${SECONDS}"


For which I get output:
TEST: EXIT 0 PRIOR TO wait -n @0
child proc 2270 @0
child finishing @1
wait -n 2270 return code 1 @2
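
For comparison, a signaled variant of the same test could be sketched
as below (bash is assumed, since wait -n with a pid is a bash
feature).  This is an illustration only, not output from an actual
run: the choice of SIGTERM and the 30-second sleep are assumptions.
Under the usual 128+n convention one would hope for 143, but whether
wait -n reports that or 127 is exactly the question in this thread,
so no expected output is given.

```shell
# Signaled variant of the test above (illustrative sketch).
echo "TEST: KILLED PRIOR TO wait -n @${SECONDS}"
sleep 30 &
pid=$!
echo "child proc $pid @${SECONDS}"

kill -TERM "$pid"   # SIGTERM chosen arbitrarily for illustration
sleep 2             # child has terminated well before wait -n runs

# Capture the status of wait -n without relying on $? afterwards.
rc=0
wait -n "$pid" || rc=$?
echo "wait -n $pid return code $rc @${SECONDS}"
```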


Steve
