In the last set of patches we sent to the list, we included a patch to master.c to avoid losing track of child processes after a segfault. That patch has a race condition, which we saw triggered under high load: a child can be reaped before master has processed a MASTER_SERVICE_UNAVAILABLE notification from it. Each time the race occurs, the process count ends up off by one, so the number of processes increases indefinitely.
This patch fixes that problem, and has resulted in a stable number of processes on FastMail.FM for the last few days:

--- master/master.c Thu May 9 19:36:03 2002
+++ master/master.c.new Thu May 9 19:35:21 2002
@@ -814,13 +814,17 @@
     switch (msg->message) {
     case MASTER_SERVICE_AVAILABLE:
-        c->is_available = 1;
-        s->ready_workers++;
+        if (c && c->pid == msg->service_pid) {
+            c->is_available = 1;
+            s->ready_workers++;
+        }
         break;
     case MASTER_SERVICE_UNAVAILABLE:
-        c->is_available = 0;
-        s->ready_workers--;
+        if (c && c->pid == msg->service_pid) {
+            c->is_available = 0;
+            s->ready_workers--;
+        }
         break;
     case MASTER_SERVICE_CONNECTION:
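
For anyone who wants to see the guard in isolation, here is a minimal standalone sketch of the same check. It is not the code in master.c: the sketch_* types, constants and function are hypothetical stand-ins, and it assumes the child record c has already been looked up (and may be NULL or belong to a different pid if the original child was reaped first). Only the condition itself is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-ins for the structures used in master.c. */
enum { SKETCH_AVAILABLE = 1, SKETCH_UNAVAILABLE = 2 };

struct sketch_child   { pid_t pid; int is_available; };
struct sketch_service { int ready_workers; };
struct sketch_msg     { int message; pid_t service_pid; };

/*
 * Apply a status message only if it still refers to a live child record.
 * Returns 1 if the counters were updated, 0 if the message was ignored
 * as stale (child already reaped, or the record belongs to another pid).
 */
int sketch_apply(struct sketch_service *s, struct sketch_child *c,
                 const struct sketch_msg *msg)
{
    assert(s != NULL && msg != NULL);

    /* Same guard as in the patch: no record, or a pid mismatch. */
    if (c == NULL || c->pid != msg->service_pid)
        return 0;

    switch (msg->message) {
    case SKETCH_AVAILABLE:
        c->is_available = 1;
        s->ready_workers++;
        return 1;
    case SKETCH_UNAVAILABLE:
        c->is_available = 0;
        s->ready_workers--;
        return 1;
    default:
        return 0;
    }
}
```

With this guard, a late notification from an already-reaped child leaves ready_workers untouched instead of adjusting it a second time.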