Re: AIX and Interix also do early PID recycling.
On 07/25/12 19:06, Chet Ramey wrote:
>> Well, _SC_CHILD_MAX is documented across platforms as:
>> Heck, even POSIX specifies CHILD_MAX as: "Maximum number of
>> simultaneous processes per real user ID."
>> Also, one Linux machine actually shows the _SC_CHILD_MAX value equal
>> to kernel.pid_max (32768 here).
>
> That's interesting, since Posix describes sysconf() as simply a way to
> retrieve values from limits.h or unistd.h that one wishes to get at
> run time rather than compile time.  And interesting that it
> establishes a correspondence between CHILD_MAX and _SC_CHILD_MAX.

There's this one sentence in the sysconf spec:

  "The value returned shall not be more restrictive than the
  corresponding value described to the application when it was compiled
  with the implementation's <limits.h> or <unistd.h>."

So CHILD_MAX is the /minimum/ value sysconf(_SC_CHILD_MAX) may return.

>>> And I suspect that the single change of significance is to not
>>> check against the childmax value when deciding whether or not to
>>> look for and remove this pid from the list of saved termination
>>> status values.
>>
>> Agreed - but is this still different to defining RECYCLES_PIDS then?
>
> It is not.  It is one of the things that happens when you define
> RECYCLES_PIDS.  The question is whether or not that is the single
> thing that makes a difference in this case.  If it is, there is merit
> in removing the check against js.c_childmax entirely or making it
> dependent on something else.

IMO, checking against js.c_childmax (sysconf's value) still makes sense
to keep some upper limit, while being large enough to be useful.
However, defining the "useful" value is up to the kernel, which does
guarantee the static CHILD_MAX (or _POSIX_CHILD_MAX) at least, while
providing more than 100 in practice across various platforms.  But
having that "useful" value unavailable to bash feels like rendering the
RECYCLES_PIDS implementation mandatory for /any/ platform.

/haubi/
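As an aside, the compile-time versus run-time values under discussion can be inspected from the shell with getconf(1), which wraps sysconf(3). A quick sketch; the /proc path is Linux-specific, and CHILD_MAX may print "undefined" where the limit is indeterminate:

```shell
# Run-time limit, i.e. sysconf(_SC_CHILD_MAX):
getconf CHILD_MAX

# Static minimum that POSIX guarantees any conforming system provides:
getconf _POSIX_CHILD_MAX

# On Linux, the kernel's PID wrap point mentioned above (may not exist
# on other platforms, hence the fallback):
cat /proc/sys/kernel/pid_max 2>/dev/null || echo "not available"
```

On the Linux machine mentioned above, the first and third values would coincide at 32768.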
Unexpected return from subscripts
Hi all,

I'm not sure whether I should post this here; sorry for the disturbance.

We hit a very strange problem with bash "version 3.00.15(1)-release".
We are using a hadoop-test script to check whether a file exists on
HDFS.  Several times we observed that the hadoop-test script, which is
a sub-script called from a control-flow script, returned unexpectedly.
It seems the sub-script was put in the background, and the main control
script just went on and got a wrong return value.

We added some debug logging to the hadoop-test script, and it looks
like this:

== sh -x DEBUG LOG ==
+ /home/work/hadoop-client/hadoop/bin/hadoop dfs -test -e xxxFile
+ '[' 0 -ne 0 ']'    # unexpected: the real return value is 1
+ some other things...
+ ...
+ ...
+ test: File does not exists: xxxFile    # unexpected: should have been
                                         # printed before the condition
                                         # statement; looks like the
                                         # test script went to the
                                         # background
+ ...
==

This problem has bothered us for several months, because we have a
large cluster with thousands of nodes running the hadoop-test script,
and we hit this case every month.  Has anybody ever met the same
problem?

Thanks,
Zhong
Re: Unexpected return from subscripts
On Thu, Jul 26, 2012 at 02:48:30AM -0700, wangzhong@gmail.com wrote:
> Hi all,
>
> I'm not sure whether I should post this here. Sorry if disturb.

help-bash would be a better choice than bug-bash.

> We met a very strange problem with bash "version 3.00.15(1)-release".
> We are using a hadoop-test script to test whether a file exists on
> HDFS. [...]

Since you did not show the code that is failing, there's not much we
can do.

> == sh -x DEBUG LOG ==
> + /home/work/hadoop-client/hadoop/bin/hadoop dfs -test -e xxxFile
> + '[' 0 -ne 0 ']'# this is unexpected, the real return value is 1

If you believe that your "hadoop" command should have returned 1, but
bash believes that it returned 0, then perhaps the bug is in this
"hadoop" command.

Of course, since we have no idea what your bash code actually says, we
can't conclude much of *anything* at this point.
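One generic thing worth checking in the (unseen) script: $? reflects only the most recent command, so anything run between the hadoop invocation and the '[' test would overwrite the status. A minimal sketch of the robust pattern, with /bin/false standing in for the real command:

```shell
# /bin/false stands in for the real (unseen) hadoop invocation.
/bin/false
rc=$?    # capture the exit status immediately; any command run in
         # between (even an echo for debugging) would overwrite $?
if [ "$rc" -ne 0 ]; then
    echo "command failed with status $rc"
fi
```

This prints "command failed with status 1", since /bin/false exits with status 1.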
Re: AIX and Interix also do early PID recycling.
On 7/25/12 1:08 PM, Chet Ramey wrote:
> On 7/25/12 11:33 AM, Andreas Schwab wrote:
>
>> I cannot see how CHILD_MAX is related to pid reuse.  CHILD_MAX is a
>> per-user limit, but the pid namespace is global.  If the shell forks a
>> new process, and the pid of it matches one of the previously used pids
>> for asynchronous jobs it can surely discard the remembered status for
>> that job.
>
> Thanks, that's a good restatement of the problem.  Your proposed
> solution is one of the possibles.  The question is whether or not it's
> necessary (apparently on some systems) and sufficient (probably).

OK, we have some data, we have a hypothesis, and we have a way to test
it.  Let's test it.

Michael, please apply the attached patch, disable RECYCLES_PIDS, and
run your tests again.  This makes the check for previously-saved exit
statuses unconditional.  Let's see if this is the one change of
significance.

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/

*** ../bash-4.2-patched/jobs.c	2011-01-07 10:59:29.0 -0500
--- jobs.c	2012-07-26 10:53:53.0 -0400
***************
*** 1900,1903 ****
--- 1902,1908 ----
  	delete_old_job (pid);

+ #if 0
+ /* Perform the check for pid reuse unconditionally.  Some systems reuse
+    PIDs before giving a process CHILD_MAX/_SC_CHILD_MAX unique ones. */
  #if !defined (RECYCLES_PIDS)
        /* Only check for saved status if we've saved more than CHILD_MAX
***************
*** 1905,1908 ****
--- 1910,1914 ----
        if ((js.c_reaped + bgpids.npid) >= js.c_childmax)
  #endif
+ #endif
        bgp_delete (pid);	/* new process, discard any saved status */
Re: AIX and Interix also do early PID recycling.
On 07/26/12 20:29, Chet Ramey wrote:
> OK, we have some data, we have a hypothesis, and we have a way to test
> it.  Let's test it.
>
> Michael, please apply the attached patch, disable RECYCLES_PIDS, and
> run your tests again.  This makes the check for previously-saved exit
> statuses unconditional.  Let's see if this is the one change of
> significance.

Nope, it doesn't fix the problem, even if it might still be necessary
to avoid mixing up stored exit statuses.

Somehow this is related to last_made_pid being preserved across the
children created for { $() } or { `` }.  In execute_command_internal(),
last_made_pid still holds the 128-forks-old (first) PID, causing
wait_for() to not be run when execute_simple_command() gets the same
PID again.

However, I've been able to create a short testcase now:

---
#! /bin/bash

/bin/false          # make first child

for x in {1..127}; do
    x=$( : )        # make CHILD_MAX-1 more children
done

# breaks when first child's PID is recycled here
if /bin/false; then
    echo BOOM
    exit 1
fi
echo GOOD
---

/haubi/
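For anyone trying to reproduce this, the recycling window can be made visible by printing the PIDs the forks actually receive. Note this sketch only instruments the testcase: on a stock Linux system (pid_max 32768) the two PIDs will normally differ, and an actual collision within 128 forks needs a constrained PID range (e.g. an affected platform, or a lowered kernel.pid_max, which this script does not attempt):

```shell
#! /bin/bash
# Observe the PIDs around the testcase above.  "first" and "again" will
# only collide on a system (or PID namespace) with a small enough PID
# range; elsewhere this just shows how far the PID counter advanced.
first=$(sh -c 'echo $$')    # PID of the first child
for x in {1..127}; do
    x=$( : )                # CHILD_MAX-1 more command-substitution children
done
again=$(sh -c 'echo $$')    # PID handed out after the intervening forks
echo "first=$first again=$again"
```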