> In the first case, if the subprocess N has terminated, its report is
> still queued and "wait" retrieves it. In the second case, if the
> subprocess N has terminated, it doesn't exist and as the manual page
> says "If id specifies a non-existent process or job, the return status
> is 127."
>
> What you're pointing out is that that creates a race condition when the
> subprocess ends before the "wait". And it seems that the kernel has
> enough information to tell "wait -n N", "process N doesn't exist, but
> you do have a queued termination report for it". But it's not clear
> that there's a way to ask the kernel for that information without
> reading all the queued termination reports (and losing the ability to
> return them for other "wait" calls).

Thanks for the response, but I don't believe this is correct. Your
understanding of the wait syscall is correct, except that the exit code
and process information always remain available until the process is
awaited by its parent. It is the wait syscall itself that reaps the
process and makes it unavailable to later lookups by pid. There is a
possibility that the parent (bash in this case) might reap the process
in multiple ways that race with each other (i.e., from different
threads, by setting the SIGCHLD disposition to SIG_IGN, or by setting
the SA_NOCLDWAIT flag for the SIGCHLD handler; the last two are from
the NOTES section of man waitpid on Linux), but the parent is always
given an opportunity to read the exit code and reap the process unless
that is disabled through the SIGCHLD handler configuration.

My understanding of bash is that it internally maintains a queue/list
of finished child jobs to return, such that wait -n mimics aspects of
the wait syscall. The discussion at
https://lists.gnu.org/archive/html/bug-bash/2023-05/msg00063.html
supports the view that bash "silently" reaps child processes and
decouples the wait syscall from the wait command. I assume it's
possible to confirm via ptrace/strace that bash is awaiting the process
and retrieving the exit code, but I'm unfamiliar with these tools and
with bash logs.

The test below allows the subprocess to complete normally, without
being signaled, and then successfully retrieves its exit code via
wait -n. This subprocess terminates before the call to wait -n. I see
no documented reason that a process terminating without a signal prior
to wait -n should be returned while a process terminating with a signal
prior to wait -n should not.

echo "TEST: EXIT 0 PRIOR TO wait -n @${SECONDS}"
{ sleep 1; echo "child finishing @${SECONDS}"; exit 1; } &
pid=$!
echo "child proc $pid @${SECONDS}"
sleep 2
wait -n $pid
echo "wait -n $pid return code $? @${SECONDS}"

For which I get output:

TEST: EXIT 0 PRIOR TO wait -n @0
child proc 2270 @0
child finishing @1
wait -n 2270 return code 1 @2

Steve
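
P.S. As a companion check for the signaled case, here is a minimal sketch that uses plain wait (no -n) on a child that was killed before the wait call. It is an illustration rather than a definitive test: the use of sleep 30 as the child, SIGTERM as the signal, and the variable name status are my own choices. If the shell keeps the termination status of finished children as described above, wait should still report 128 plus the signal number (143 for SIGTERM) even though the child is long gone.

```shell
# Sketch: a child is terminated by a signal before the parent waits on it.
sleep 30 &
pid=$!
kill -TERM "$pid"   # terminate the child with a signal
sleep 1             # give the shell time to notice and reap the child
wait "$pid"         # plain wait by pid, without -n
status=$?
echo "wait $pid return code $status"
```

Replacing wait "$pid" with wait -n "$pid" in the same sketch gives the signaled counterpart of the test in this message, which makes the two cases directly comparable.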