On Fri, Feb 19, 2010 at 04:21:10PM -0500, Joey Hess wrote:
> Kurt Roeckx wrote:
> > It seems that parallel can return -1 in most of the cases for no
> > good reason.  All childs returned 0, and there were no other
> > errors.
> 
> Do you have a test case?

We use it as part of the buildd triggers, I guess it's not
easy to run this yourself.  I estimate that it returned 0 in about
5% of the time, so failed in 95% of the cases.

> > The code looks like:
> > int wait_for_child(int options) {
> >         id_t id_ignored = 0;
> >         siginfo_t infop;
> > 
> >         infop.si_pid = 0;
> >         waitid(P_ALL, id_ignored, &infop, WEXITED | options);
> >         if (infop.si_pid == 0)
> >                 return -1; /* Nothing to wait for */
> >         if (infop.si_code == CLD_EXITED)
> >                 return infop.si_status;
> >         return 1;
> > }
> > 
> > And this return value is or'd between all calls.  
> > 
> > The first thing that looks wrong to this is that the return
> > value of waitid() is not checked.  The strace shows that
> > I have atleast 2 calls to it that return -1.  It most
> > likely still contains a pid of 0 in that case.  You really
> > don't want to return -1 for that.
> 
> si_pid will only be nonzero if waitid succeeded.

Like I said it probably will, but I'm not sure there is any
guarantee in that case.  In any case I think it's a bad idea
to not look at the return value.

> If waitid fails, the code returns -1, and this is ORed to the
> returncode. Since the returncode is an int, it then becomes 255,
> so that's what parallel will exit if it fails to wait for a child.

Which is my main problem.  It probably calls waitid() when there
nothing to wait for anymore or something like that.  In any case
that "return -1" is my problem.

> > An other reason that you could get that -1 is that you called
> > it with WNOHANG.  In that case you check for the value
> > being smaller than 0 and don't add it.  But I think values
> > smaller than 0 are valid, and so you can't tell the difference
> > between a child that exited and returning because you don't
> > want to wait.
> 
> AFAIK a child process cannot really exit nonzero. Process exit codes
> are limited to 0..255, and exit(-1) == 255.

You're right, I thought it was different with waitid(), just saw
strace showing it as -1.

> > I think the code will also behave wrong in case the child
> > did not call exit() itself but instead got killed or something.
> > In that case infop.si_code can be != CLD_EXITED, and you'd
> > want to consider that a reason to make parallel return an
> > non-zero exit code.
> 
> Returning a non-zero exit code of 1 is the fall-through case at the end of
> the function, and seems to work ok:

Yes, I figured out that part later.


Kurt




-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Reply via email to