Kurt Roeckx wrote:
> It seems that parallel can return -1 in most of the cases for no
> good reason.  All childs returned 0, and there were no other
> errors.

Do you have a test case?

> The code looks like:
> int wait_for_child(int options) {
>         id_t id_ignored = 0;
>         siginfo_t infop;
> 
>         infop.si_pid = 0;
>         waitid(P_ALL, id_ignored, &infop, WEXITED | options);
>         if (infop.si_pid == 0)
>                 return -1; /* Nothing to wait for */
>         if (infop.si_code == CLD_EXITED)
>                 return infop.si_status;
>         return 1;
> }
> 
> And this return value is or'd between all calls.  
> 
> The first thing that looks wrong to this is that the return
> value of waitid() is not checked.  The strace shows that
> I have atleast 2 calls to it that return -1.  It most
> likely still contains a pid of 0 in that case.  You really
> don't want to return -1 for that.

si_pid will only be nonzero if waitid succeeded.

If waitid fails, the code returns -1, and this is ORed to the
returncode. Since the returncode is an int, it then becomes 255,
so that's what parallel will exit if it fails to wait for a child.

> An other reason that you could get that -1 is that you called
> it with WNOHANG.  In that case you check for the value
> being smaller than 0 and don't add it.  But I think values
> smaller than 0 are valid, and so you can't tell the difference
> between a child that exited and returning because you don't
> want to wait.

AFAIK a child process cannot really exit nonzero. Process exit codes
are limited to 0..255, and exit(-1) == 255.

While the lack of explicit signed ints in the code is a bit confusing,
wait_for_child() can return a -1 that is distinguishable from a 255 by
the code that tests that.

> I think the code will also behave wrong in case the child
> did not call exit() itself but instead got killed or something.
> In that case infop.si_code can be != CLD_EXITED, and you'd
> want to consider that a reason to make parallel return an
> non-zero exit code.

Returning a non-zero exit code of 1 is the fall-through case at the end of
the function, and seems to work ok:

j...@gnu:~>./segfault 
zsh: segmentation fault  ./segfault
j...@gnu:~>parallel segfault -- 1 2 3
zsh: exit 1     parallel segfault -- 1 2 3

-- 
see shy jo

Attachment: signature.asc
Description: Digital signature

Reply via email to