Re: AIX and Interix also do early PID recycling.

2012-07-26 Thread Michael Haubenwallner


On 07/25/12 19:06, Chet Ramey wrote:

Well, _SC_CHILD_MAX is documented across platforms as:


Heck, even POSIX specifies CHILD_MAX as:
"Maximum number of simultaneous processes per real user ID."


Also, one Linux machine actually shows the _SC_CHILD_MAX value equal to
kernel.pid_max (32768 here),


That's interesting, since Posix describes sysconf() as simply a way to
retrieve values from limits.h or unistd.h that one wishes to get at
run time rather than compile time.  And interesting that it establishes a
correspondence between CHILD_MAX and _SC_CHILD_MAX.


There's this one sentence in the sysconf spec:
  The value returned shall not be more restrictive than the corresponding
  value described to the application when it was compiled with the
  implementation's <limits.h> or <unistd.h>.

So CHILD_MAX is the /minimum/ value sysconf(_SC_CHILD_MAX) may return.


And I suspect that the single change of significance is to not check
against the childmax value when deciding whether or not to look for and
remove this pid from the list of saved termination status values.


Agreed - but is this still different from defining RECYCLES_PIDS, then?


It is not.  It is one of the things that happens when you define
RECYCLES_PIDS.  The question is whether or not that is the single thing
that makes a difference in this case.  If it is, there is merit in
removing the check against js.c_childmax entirely or making it dependent
on something else.


IMO, checking against js.c_childmax (sysconf's value) still makes sense,
to have some upper limit while being large enough to be useful.
However, defining the "useful" value is up to the kernel, which guarantees
at least the static CHILD_MAX (or _POSIX_CHILD_MAX), and in practice
provides more than 100 across various platforms.

However, when that "useful" value is unavailable to bash, the
RECYCLES_PIDS implementation effectively becomes mandatory for /any/
platform.

/haubi/



Unexpected return from subscripts

2012-07-26 Thread wangzhong . neu
Hi all,

I'm not sure whether I should post this here. Sorry for the disturbance.

We met a very strange problem with bash "version 3.00.15(1)-release". We are 
using a hadoop-test script to test whether a file exists on HDFS. But we 
observed several times that the hadoop-test script, which is a subscript of a 
control-flow script, returned unexpectedly. It seems the subscript was put in 
the background, and the main control script just went on and got a wrong return 
value. We added some debug logging to the hadoop-test script, and it looks like 
this:

== sh -x DEBUG LOG ==
+ /home/work/hadoop-client/hadoop/bin/hadoop dfs -test -e xxxFile
+ '[' 0 -ne 0 ']'# this is unexpected, the real return value is 1
+ some other things...
+ ...
+ ...
+ test: File does not exists: xxxFile# this is unexpected, should be printed 
before the condition statement. Looks like the test script went to the background
+ ...
=

This problem has bothered us for several months, because we have a large cluster 
with thousands of nodes running the hadoop-test script, and we hit this case 
every month. Has anybody ever met the same problem?

Thanks,
Zhong




Re: Unexpected return from subscripts

2012-07-26 Thread Greg Wooledge
On Thu, Jul 26, 2012 at 02:48:30AM -0700, wangzhong@gmail.com wrote:
> Hi all,
> 
> I'm not sure whether I should post this here. Sorry for the disturbance.

help-bash would be a better choice than bug-bash.

> We met a very strange problem with bash "version 3.00.15(1)-release". We are
> using a hadoop-test script to test whether a file exists on HDFS. [...]

Since you did not show the code that is failing, there's not much we
can do.

> == sh -x DEBUG LOG ==
> + /home/work/hadoop-client/hadoop/bin/hadoop dfs -test -e xxxFile
> + '[' 0 -ne 0 ']'# this is unexpected, the real return value is 1

If you believe that your "hadoop" command should have returned 1,
but bash believes that it returned 0, then perhaps the bug is in this
"hadoop" command.

Of course, since we have no idea what your bash code actually says,
we can't conclude much of *anything* at this point.



Re: AIX and Interix also do early PID recycling.

2012-07-26 Thread Chet Ramey
On 7/25/12 1:08 PM, Chet Ramey wrote:
> On 7/25/12 11:33 AM, Andreas Schwab wrote:
> 
>> I cannot see how CHILD_MAX is related to pid reuse.  CHILD_MAX is a
>> per-user limit, but the pid namespace is global.  If the shell forks a
>> new process, and the pid of it matches one of the previously used pids
>> for asynchronous jobs it can surely discard the remembered status for
>> that job.
> 
> Thanks, that's a good restatement of the problem.  Your proposed solution
> is one of the possibles.  The question is whether or not it's necessary
> (apparently on some systems) and sufficient (probably).

OK, we have some data, we have a hypothesis, and we have a way to test it.
Let's test it.

Michael, please apply the attached patch, disable RECYCLES_PIDS, and run
your tests again.  This makes the check for previously-saved exit statuses
unconditional.

Let's see if this is the one change of significance.

Chet

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/


*** ../bash-4.2-patched/jobs.c	2011-01-07 10:59:29.0 -0500
--- jobs.c	2012-07-26 10:53:53.0 -0400
***
*** 1900,1903 
--- 1902,1908 
  	delete_old_job (pid);
  
+ #if 0
+   /* Perform the check for pid reuse unconditionally.  Some systems reuse
+  PIDs before giving a process CHILD_MAX/_SC_CHILD_MAX unique ones. */
  #if !defined (RECYCLES_PIDS)
/* Only check for saved status if we've saved more than CHILD_MAX
***
*** 1905,1908 
--- 1910,1914 
if ((js.c_reaped + bgpids.npid) >= js.c_childmax)
  #endif
+ #endif
  	bgp_delete (pid);		/* new process, discard any saved status */
  


Re: AIX and Interix also do early PID recycling.

2012-07-26 Thread Michael Haubenwallner


On 07/26/12 20:29, Chet Ramey wrote:

OK, we have some data, we have a hypothesis, and we have a way to test it.
Let's test it.

Michael, please apply the attached patch, disable RECYCLES_PIDS, and run
your tests again.  This makes the check for previously-saved exit statuses
unconditional.

Let's see if this is the one change of significance.


Nope, it doesn't fix the problem, even though it might still be necessary
in order not to mix up stored exit statuses.

Somehow this is related to last_made_pid being preserved across the
children created for { $() } or { `` }.

In execute_command_internal(), last_made_pid still holds the 128-forks-old
(first) PID, causing wait_for() to not be run when
execute_simple_command() gets the same PID again.

However, I've been able to create a short testcase now:

---
#! /bin/bash

/bin/false # make first child

for x in {1..127}; do
  x=$( : ) # make CHILD_MAX-1 more children
done

# breaks when first child's PID is recycled here
if /bin/false; then
  echo BOOM
  exit 1
fi

echo GOOD
---

/haubi/