Hi *,

at least one variant of the "MAXFD, check for defunct children" problem still exists in cfengine 2.2.8. Here is what I found:
cfengine wants to limit the number of parallel pipes to a hardcoded number (MAXFD=20). To achieve this, every new pipe is checked for its fileno, and if that is higher than 20, the error message appears and the pipe is somehow ignored (cfengine does not close the pipe properly in that case, which is a bug of its own).

Now look at this real-life example:

[EMAIL PROTECTED]:~# ls -l /proc/13406/fd
lr-x------ 1 root root 64 2008-08-24 20:39 0 -> /dev/null
l-wx------ 1 root root 64 2008-08-24 20:39 1 -> /var/log/cfengine/cfrun.13400
l-wx------ 1 root root 64 2008-08-24 20:39 2 -> /dev/null
lr-x------ 1 root root 64 2008-08-24 20:39 3 -> /dev/urandom
lr-x------ 1 root root 64 2008-08-24 20:39 4 -> /etc/cfengine/cfagent.conf
lrwx------ 1 root root 64 2008-08-24 20:39 5 -> socket:[18203]
lrwx------ 1 root root 64 2008-08-24 20:39 6 -> socket:[18204]
lr-x------ 1 root root 64 2008-08-24 20:39 7 -> /proc/loadavg
lrwx------ 1 root root 64 2008-08-24 20:39 8 -> socket:[18954]
lrwx------ 1 root root 64 2008-08-24 20:39 9 -> socket:[18207]
lrwx------ 1 root root 64 2008-08-24 20:39 10 -> socket:[18974]
lrwx------ 1 root root 64 2008-08-24 20:39 11 -> socket:[18209]
lrwx------ 1 root root 64 2008-08-24 20:39 12 -> socket:[18215]
lrwx------ 1 root root 64 2008-08-24 20:39 13 -> socket:[18221]
lrwx------ 1 root root 64 2008-08-24 20:39 14 -> socket:[18217]
lrwx------ 1 root root 64 2008-08-24 20:39 15 -> socket:[18227]
lrwx------ 1 root root 64 2008-08-24 20:39 16 -> socket:[18223]
lrwx------ 1 root root 64 2008-08-24 20:39 17 -> socket:[18234]
lrwx------ 1 root root 64 2008-08-24 20:39 18 -> socket:[18229]
lr-x------ 1 root root 64 2008-08-24 20:39 19 -> pipe:[149711]
lrwx------ 1 root root 64 2008-08-24 20:39 20 -> socket:[18236]
lr-x------ 1 root root 64 2008-08-24 20:39 21 -> pipe:[150154]
lr-x------ 1 root root 64 2008-08-24 20:39 22 -> pipe:[150163]
lr-x------ 1 root root 64 2008-08-24 20:39 23 -> pipe:[150173]
lr-x------ 1 root root 64 2008-08-24 20:39 24 -> pipe:[150184]
lr-x------ 1 root root 64 2008-08-24 20:39 25 -> pipe:[150194]
lr-x------ 1 root root 64 2008-08-24 20:39 26 -> pipe:[150205]
lr-x------ 1 root root 64 2008-08-24 20:39 27 -> pipe:[150215]
lr-x------ 1 root root 64 2008-08-24 20:39 28 -> pipe:[150225]
lr-x------ 1 root root 64 2008-08-24 20:39 29 -> pipe:[150236]
lr-x------ 1 root root 64 2008-08-24 20:39 30 -> pipe:[150249]
lr-x------ 1 root root 64 2008-08-24 20:39 31 -> pipe:[150269]
lr-x------ 1 root root 64 2008-08-24 20:39 32 -> pipe:[150283]
lr-x------ 1 root root 64 2008-08-24 20:39 33 -> pipe:[150293]
lr-x------ 1 root root 64 2008-08-24 20:39 34 -> pipe:[150302]
lr-x------ 1 root root 64 2008-08-24 20:39 35 -> pipe:[150396]

As you can see, there are 14 sockets open for cfagent. In this particular case, these sockets belong to heartbeat, which happens to have started this instance of cfagent. Maybe not the most common case, but definitely something cfagent should work with.

Since these sockets all occupy file descriptor numbers, there is simply no fileno below the limit left for popen. Or, even worse, there is only one such fileno left, and the bug hits only occasionally, when one pipe does not return fast enough.

As a workaround, I changed MAXFD to 40. But I think using a proper counter for open pipes would be more appropriate. It looks like you have already started to use the CHILD[] array to keep track of free pipe slots?!

If you point me in the direction you want to go, I will happily help with coding and testing.

Best regards,

Sebastian Hetze

_______________________________________________
Bug-cfengine mailing list
[email protected]
https://cfengine.org/mailman/listinfo/bug-cfengine
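
To illustrate the "proper counter for open pipes" idea from the report, here is a minimal sketch. It is not cfengine's actual code: MAX_PIPES, pipe_slots, open_pipes, counted_popen() and counted_pclose() are hypothetical names chosen for this example, and the slot table only loosely mirrors what a CHILD[]-style array might do. The point is that the limit is enforced by counting pipes the process itself opened, so inherited sockets with low descriptor numbers no longer matter.

```c
#define _POSIX_C_SOURCE 200809L  /* for popen/pclose under strict C modes */
#include <assert.h>
#include <stdio.h>

#define MAX_PIPES 20  /* hypothetical limit on concurrently open pipes */

/* Hypothetical bookkeeping: one slot per pipe we opened ourselves.
 * Descriptor numbers are never inspected. */
static FILE *pipe_slots[MAX_PIPES];
static int open_pipes = 0;

/* Open a pipe only if fewer than MAX_PIPES are currently tracked.
 * Returns NULL if the limit is reached or popen() itself fails. */
FILE *counted_popen(const char *command, const char *mode)
{
    if (open_pipes >= MAX_PIPES) {
        return NULL;
    }

    FILE *pp = popen(command, mode);
    if (pp == NULL) {
        return NULL;
    }

    for (int i = 0; i < MAX_PIPES; i++) {
        if (pipe_slots[i] == NULL) {
            pipe_slots[i] = pp;
            open_pipes++;
            return pp;
        }
    }

    /* Not reachable while open_pipes matches the number of used slots;
     * close the pipe instead of leaking it if the invariant is broken. */
    assert(0 && "slot table out of sync with counter");
    pclose(pp);
    return NULL;
}

/* Close a tracked pipe and free its slot.
 * Returns pclose()'s result, or -1 if the stream was not tracked. */
int counted_pclose(FILE *pp)
{
    for (int i = 0; i < MAX_PIPES; i++) {
        if (pp != NULL && pipe_slots[i] == pp) {
            pipe_slots[i] = NULL;
            open_pipes--;
            return pclose(pp);
        }
    }
    return -1;
}
```

With this scheme, a cfagent started by heartbeat with 14 inherited sockets could still open up to MAX_PIPES pipes, and a refused pipe is never opened in the first place, so there is nothing left unclosed.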
