I'm continuing to debug the problem I reported earlier [1], where some executables called from within shell scripts don't fully terminate. This happens fairly often while sourcing the default /etc/profile, but not always, and not always in the same place. When it happens, Windows tools show that the WinPID process has exited, but that the PID (bash.exe) process has not. The process is still reported by ps, but cannot be killed (kill doesn't think that it exists). The entries under /proc/<pid>/ exist but are not usable: /proc/<pid>/{cwd,root} point to <defunct>, cat /proc/<pid>/status prints only a newline, etc.
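For anyone who wants to check for the same symptoms on their own machine, here is a minimal sketch that tests a /proc/<pid> entry for the two signs described above. The function name looks_stuck and the proc_root parameter are hypothetical, and it assumes that the cwd/root links read back as the literal string "<defunct>" and that an unusable status file contains only whitespace; I have not verified either against the Cygwin source.

```python
import os

def looks_stuck(pid, proc_root="/proc"):
    """Return True if /proc/<pid> shows the symptoms described in the post.

    Assumptions (not verified against Cygwin itself): a stuck entry has a
    cwd or root link that reads back as "<defunct>", or a status file that
    contains nothing but whitespace.
    """
    base = os.path.join(proc_root, str(pid))
    if not os.path.isdir(base):
        return False  # no entry at all: the process is simply gone

    # Symptom 1: cwd or root points to "<defunct>".
    for link in ("cwd", "root"):
        try:
            if os.readlink(os.path.join(base, link)) == "<defunct>":
                return True
        except OSError:
            pass  # missing or not a link; try the next check

    # Symptom 2: status is present but effectively empty.
    try:
        with open(os.path.join(base, "status")) as fh:
            return fh.read().strip() == ""
    except OSError:
        return False
```

Calling looks_stuck(6036) would then report whether that entry matches; proc_root exists only so the check can be exercised against a scratch directory.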
I believe I've isolated approximately where in the code this goes wrong. For the purposes of this example, a /bin/bash instance with PID 3436 spawned another /bin/bash with PID 6036, which spawned /bin/echo with WinPID 1932. This shows up in ps as follows:

    $ ps
      PID  PPID  PGID  WINPID  TTY  UID   STIME  COMMAND
     6036  3436  3436    1932  con  500  Mar 18  /usr/bin/echo

Looking through the strace output, echo finishes its processing:

    ...
    16:45:02 [main] echo 6036 pinfo::exit: Calling ExitProcess n 0x0, exitcode 0x0
    ...
    16:45:02 [main] echo 6036 pinfo::exit: Calling ExitProcess n 0x0, exitcode 0x0
    ...

Later in the strace log, the parent bash process considers whether or not to clean up the process:

    16:45:02 [sig] bash 3436 checkstate: nprocs 2
    16:45:02 [sig] bash 3436 stopped_or_terminated: considering pid 6036
    16:45:02 [sig] bash 3436 stopped_or_terminated: considering pid 5652
    16:45:02 [sig] bash 3436 remove_proc: removing procs[1], pid 5652, nprocs 2
    16:45:02 [main] bash 3436 wait4: 0 = WaitForSingleObject (...)
    16:45:02 [main] bash 3436 wait4: intpid -1, status 0x23BAA8, w->status 0, options 0, res 5652
    16:45:02 [sig] bash 3436 checkstate: returning 1

And again:

    16:45:02 [main] bash 3436 checkstate: nprocs 1
    16:45:02 [main] bash 3436 stopped_or_terminated: considering pid 6036
    16:45:02 [main] bash 3436 checkstate: no matching terminated children found
    16:45:02 [main] bash 3436 checkstate: returning -1

Looking in sigproc.cc, I believe that it passes the first few conditions, since PID 3436 is greater than 0, and is different from the child PID of 6036. Given that remove_proc was never called for 6036, stopped_or_terminated must have returned 0. That means the following test must have succeeded, i.e. the child was neither marked PID_EXITED nor a stopped child being waited for with WUNTRACED:

    if (!((terminated = (child->process_state == PID_EXITED)) ||
          ((w->options & WUNTRACED) && child->stopsig)))
      return 0;

Since the wait4 line above shows options 0, the WUNTRACED branch presumably cannot apply, which would leave process_state never having been set to PID_EXITED for 6036.

Attempts to attach gdb to any of the processes 1932, 6036, or 3436 were all unsuccessful, and pstack does not appear to exist in cygwin.
(And even if it did, I'd be surprised if it could attach to a process that gdb could not.)

Since other users on this machine rely on cygwin to do their work, I'd really rather not recompile the DLL to add more debugging output to determine what is going on here. Could anyone with better knowledge of the source guess at the problem? Is there some other method I could use to get that information out of the released binaries? Any other suggestions for workarounds or alternate approaches?

Thanks in advance for any help!

[1] http://cygwin.com/ml/cygwin/2008-03/msg00322.html

-Sam
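P.S. The check I did by eye on the strace log (a child that is "considered" by stopped_or_terminated but never handed to remove_proc) can be done mechanically. This is only a rough sketch: the function name unreaped_children is made up, and the two patterns are taken from the message formats in the excerpts quoted above, so they may need adjusting for other Cygwin versions.

```python
import re

# Patterns copied from the strace excerpts in this message; treat them as
# assumptions about the log format rather than a documented interface.
CONSIDERED = re.compile(r"stopped_or_terminated: considering pid (\d+)")
REMOVED = re.compile(r"remove_proc: removing procs\[\d+\], pid (\d+)")

def unreaped_children(lines):
    """Return the sorted pids that were considered but never removed."""
    considered, removed = set(), set()
    for line in lines:
        match = CONSIDERED.search(line)
        if match:
            considered.add(int(match.group(1)))
        match = REMOVED.search(line)
        if match:
            removed.add(int(match.group(1)))
    return sorted(considered - removed)
```

Feeding it the lines of a saved strace log (for example, unreaped_children(open(path)) for some log file path) lists the children the parent looked at but never cleaned up.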