Hi *,

at least one variant of the "MAXFD, check for defunct children" problem
still exists in cfengine 2.2.8.
Here is what I found:

cfengine wants to limit the number of parallel pipes to a hardcoded
number (MAXFD=20). To achieve this, the fileno of every new pipe is
checked, and if it is higher than 20, the error message appears and
the pipe is ignored (cfengine does not close the pipe properly in
that case, which is a bug of its own).
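
A minimal sketch of the check as described above, with the missing
cleanup added. This is not cfengine's actual source: the function name
open_checked_pipe and its maxfd parameter are hypothetical, and the
comparison follows the wording "higher than 20".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>

#define MAXFD 20   /* the hardcoded limit mentioned in this report */

/* Hypothetical helper, not taken from cfengine: open a read pipe to cmd
 * and apply the descriptor-number check. Unlike the behaviour reported
 * above, a rejected pipe is closed instead of being left open. */
FILE *open_checked_pipe(const char *cmd, int maxfd)
{
    assert(cmd != NULL);

    FILE *pp = popen(cmd, "r");
    if (pp == NULL)
        return NULL;

    /* fileno() is the descriptor NUMBER, not the count of open pipes.
     * Descriptors that were already open push this number up. */
    if (fileno(pp) > maxfd) {
        fprintf(stderr, "descriptor %d is higher than limit %d\n",
                fileno(pp), maxfd);
        pclose(pp);   /* close the stream and wait for the child */
        return NULL;
    }
    return pp;
}
```

A caller would use open_checked_pipe(cmd, MAXFD) in place of a bare
popen(). Note that pclose() waits for the child to exit, so the
rejection path can block for a long-running command.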

Now look at this real-life example:
[EMAIL PROTECTED]:~# ls -l /proc/13406/fd
lr-x------ 1 root root 64 2008-08-24 20:39 0 -> /dev/null
l-wx------ 1 root root 64 2008-08-24 20:39 1 -> /var/log/cfengine/cfrun.13400
l-wx------ 1 root root 64 2008-08-24 20:39 2 -> /dev/null
lr-x------ 1 root root 64 2008-08-24 20:39 3 -> /dev/urandom
lr-x------ 1 root root 64 2008-08-24 20:39 4 -> /etc/cfengine/cfagent.conf
lrwx------ 1 root root 64 2008-08-24 20:39 5 -> socket:[18203]
lrwx------ 1 root root 64 2008-08-24 20:39 6 -> socket:[18204]
lr-x------ 1 root root 64 2008-08-24 20:39 7 -> /proc/loadavg
lrwx------ 1 root root 64 2008-08-24 20:39 8 -> socket:[18954]
lrwx------ 1 root root 64 2008-08-24 20:39 9 -> socket:[18207]
lrwx------ 1 root root 64 2008-08-24 20:39 10 -> socket:[18974]
lrwx------ 1 root root 64 2008-08-24 20:39 11 -> socket:[18209]
lrwx------ 1 root root 64 2008-08-24 20:39 12 -> socket:[18215]
lrwx------ 1 root root 64 2008-08-24 20:39 13 -> socket:[18221]
lrwx------ 1 root root 64 2008-08-24 20:39 14 -> socket:[18217]
lrwx------ 1 root root 64 2008-08-24 20:39 15 -> socket:[18227]
lrwx------ 1 root root 64 2008-08-24 20:39 16 -> socket:[18223]
lrwx------ 1 root root 64 2008-08-24 20:39 17 -> socket:[18234]
lrwx------ 1 root root 64 2008-08-24 20:39 18 -> socket:[18229]
lr-x------ 1 root root 64 2008-08-24 20:39 19 -> pipe:[149711]
lrwx------ 1 root root 64 2008-08-24 20:39 20 -> socket:[18236]
lr-x------ 1 root root 64 2008-08-24 20:39 21 -> pipe:[150154]
lr-x------ 1 root root 64 2008-08-24 20:39 22 -> pipe:[150163]
lr-x------ 1 root root 64 2008-08-24 20:39 23 -> pipe:[150173]
lr-x------ 1 root root 64 2008-08-24 20:39 24 -> pipe:[150184]
lr-x------ 1 root root 64 2008-08-24 20:39 25 -> pipe:[150194]
lr-x------ 1 root root 64 2008-08-24 20:39 26 -> pipe:[150205]
lr-x------ 1 root root 64 2008-08-24 20:39 27 -> pipe:[150215]
lr-x------ 1 root root 64 2008-08-24 20:39 28 -> pipe:[150225]
lr-x------ 1 root root 64 2008-08-24 20:39 29 -> pipe:[150236]
lr-x------ 1 root root 64 2008-08-24 20:39 30 -> pipe:[150249]
lr-x------ 1 root root 64 2008-08-24 20:39 31 -> pipe:[150269]
lr-x------ 1 root root 64 2008-08-24 20:39 32 -> pipe:[150283]
lr-x------ 1 root root 64 2008-08-24 20:39 33 -> pipe:[150293]
lr-x------ 1 root root 64 2008-08-24 20:39 34 -> pipe:[150302]
lr-x------ 1 root root 64 2008-08-24 20:39 35 -> pipe:[150396]


As you can see, there are 14 sockets open for cfagent. In this
particular case, these sockets belong to heartbeat, which happens
to have started this instance of cfagent. Maybe not the most
common case, but definitely something cfagent should work with.
Since these sockets all occupy descriptor numbers, there is simply
no fileno below the limit left for popen. Or, even worse, there is
only one fileno left and the bug hits only occasionally, when one
pipe does not return fast enough.
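
To make the difference between "highest descriptor number" and "number
of open pipes" concrete, here is a small sketch that classifies the
open descriptors of the current process with fstat(), roughly what the
/proc listing above shows. The names count_fds and report_fds are
hypothetical and nothing here comes from cfengine.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Scan descriptors 0..limit-1 of the current process. Returns how many
 * are open and stores how many of those are pipes and sockets. */
int count_fds(int limit, int *pipes, int *sockets)
{
    assert(pipes != NULL && sockets != NULL);

    int open_count = 0;
    *pipes = 0;
    *sockets = 0;
    for (int fd = 0; fd < limit; fd++) {
        struct stat st;
        if (fstat(fd, &st) != 0)
            continue;               /* descriptor is not open */
        open_count++;
        if (S_ISFIFO(st.st_mode))
            (*pipes)++;
        else if (S_ISSOCK(st.st_mode))
            (*sockets)++;
    }
    return open_count;
}

/* Print a one-line summary for the first `limit` descriptor numbers. */
void report_fds(int limit)
{
    int pipes, sockets;
    int total = count_fds(limit, &pipes, &sockets);
    printf("open: %d, pipes: %d, sockets: %d\n", total, pipes, sockets);
}
```

For the process shown above, such a scan would report the 14 sockets
separately from the pipes, whereas a check on fileno alone cannot tell
them apart.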

As a workaround, I changed MAXFD to 40. But I think using a
proper counter for open pipes would be more appropriate.
It looks like you have already started to use the CHILD[] array
to keep track of free pipe slots?
If you point me in the direction you want to go, I will happily
help with coding and testing.
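
To illustrate what I mean by a proper counter, here is a rough sketch
under my own assumptions. It wraps popen()/pclose() and counts the
pipes actually opened, so descriptors that were already open do not
matter. The names counted_popen, counted_pclose and MAX_OPEN_PIPES are
hypothetical, and the sketch does not model the CHILD[] array.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>

#define MAX_OPEN_PIPES 20   /* hypothetical limit on pipes, not on fileno */

static int open_pipes = 0;  /* pipes opened through counted_popen() */

/* Open a pipe only if fewer than MAX_OPEN_PIPES are currently open. */
FILE *counted_popen(const char *cmd, const char *mode)
{
    assert(cmd != NULL && mode != NULL);

    if (open_pipes >= MAX_OPEN_PIPES) {
        fprintf(stderr, "too many open pipes (%d)\n", open_pipes);
        return NULL;
    }
    FILE *pp = popen(cmd, mode);
    if (pp != NULL)
        open_pipes++;
    return pp;
}

/* Close a pipe opened with counted_popen() and free its slot. */
int counted_pclose(FILE *pp)
{
    assert(pp != NULL);

    int status = pclose(pp);
    if (open_pipes > 0)
        open_pipes--;
    return status;
}
```

With this approach the limit applies to the pipes themselves, however
many sockets or files the parent process has left open. The counter is
not thread-safe, which I assume is acceptable for a single-threaded
agent.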

Best regards,

  Sebastian Hetze
_______________________________________________
Bug-cfengine mailing list
[email protected]
https://cfengine.org/mailman/listinfo/bug-cfengine
