On Fri, Feb 19, 2010 at 04:21:10PM -0500, Joey Hess wrote: > Kurt Roeckx wrote: > > It seems that parallel can return -1 in most of the cases for no > > good reason. All childs returned 0, and there were no other > > errors. > > Do you have a test case?
We use it as part of the buildd triggers, I guess it's not easy to run this yourself. I estimate that it returned 0 in about 5% of the time, so failed in 95% of the cases. > > The code looks like: > > int wait_for_child(int options) { > > id_t id_ignored = 0; > > siginfo_t infop; > > > > infop.si_pid = 0; > > waitid(P_ALL, id_ignored, &infop, WEXITED | options); > > if (infop.si_pid == 0) > > return -1; /* Nothing to wait for */ > > if (infop.si_code == CLD_EXITED) > > return infop.si_status; > > return 1; > > } > > > > And this return value is or'd between all calls. > > > > The first thing that looks wrong to this is that the return > > value of waitid() is not checked. The strace shows that > > I have atleast 2 calls to it that return -1. It most > > likely still contains a pid of 0 in that case. You really > > don't want to return -1 for that. > > si_pid will only be nonzero if waitid succeeded. Like I said it probably will, but I'm not sure there is any guarantee in that case. In any case I think it's a bad idea to not look at the return value. > If waitid fails, the code returns -1, and this is ORed to the > returncode. Since the returncode is an int, it then becomes 255, > so that's what parallel will exit if it fails to wait for a child. Which is my main problem. It probably calls waitid() when there nothing to wait for anymore or something like that. In any case that "return -1" is my problem. > > An other reason that you could get that -1 is that you called > > it with WNOHANG. In that case you check for the value > > being smaller than 0 and don't add it. But I think values > > smaller than 0 are valid, and so you can't tell the difference > > between a child that exited and returning because you don't > > want to wait. > > AFAIK a child process cannot really exit nonzero. Process exit codes > are limited to 0..255, and exit(-1) == 255. You're right, I thought it was different with waitid(), just saw strace showing it as -1. > > I think the code will also behave wrong in case the child > > did not call exit() itself but instead got killed or something. > > In that case infop.si_code can be != CLD_EXITED, and you'd > > want to consider that a reason to make parallel return an > > non-zero exit code. > > Returning a non-zero exit code of 1 is the fall-through case at the end of > the function, and seems to work ok: Yes, I figured out that part later. Kurt -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org