This sounds like what we're running into at the moment, which appears to be the master process ending up with an incorrect count of available workers. The problem occurs when a worker process dies while in the "available" state and doesn't notify the master. Jeremy Howard recently posted a patch that addresses this by decrementing the "available workers" counter on receipt of a SIGCLD, which strikes me as the right way to go. However, his patch is for 2.1.3, and like you, we're using 2.0.16 (the bleeding edge is a bad place to be with 9 postoffices and 40k users). As soon as I find that mythical spare moment, I'm going to look at applying the patch to 2.0.16. I think it could address what's been a nightmare for us.
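To make the bookkeeping problem concrete, here is a toy model in Python. It is not the Cyrus master code and not Jeremy's patch. The class, its method names, and the per-pid state table are all hypothetical, and it assumes the master can tell which state a child was in when it died.

```python
class Master:
    """Toy model of one service's worker bookkeeping (names are hypothetical)."""

    def __init__(self):
        self.ready_workers = 0   # workers the master believes are idle
        self.state = {}          # pid -> "available" or "busy"

    def worker_available(self, pid):
        # The worker reports that it is idle and waiting for a connection.
        if self.state.get(pid) != "available":
            self.state[pid] = "available"
            self.ready_workers += 1

    def worker_busy(self, pid):
        # The worker reports that it has picked up a connection.
        if self.state.get(pid) == "available":
            self.ready_workers -= 1
        self.state[pid] = "busy"

    def child_exited(self, pid, fix=True):
        # Called when the master reaps a dead child after the child-exit signal.
        # With fix=True, a child that died while counted as available is
        # removed from the count. With fix=False the count stays inflated.
        was_available = self.state.pop(pid, None) == "available"
        if was_available and fix:
            self.ready_workers -= 1


def idle_worker_dies(fix):
    """Two idle workers, one dies silently. Returns (believed idle, actually alive)."""
    m = Master()
    m.worker_available(101)
    m.worker_available(102)
    m.child_exited(101, fix=fix)
    return m.ready_workers, len(m.state)
```

Without the fix, `idle_worker_dies(False)` leaves the master believing in two idle workers when only one process is alive. With it, the two numbers agree.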
To put it in a little more detail, what we see is one service, say pop3d or lmtpd, suddenly stop working, even though there may be active processes that are working just fine for that service. At first, you'll see the connection accepted but not handled, as you show here. However, those connections are never cleared from the listen queue properly, so eventually the listen queue fills up, and you get either refused connections or connections that are never accepted. The problem is, as I said above, that the master has an incorrectly inflated count of available workers, and it simply expects them to handle the connections. However, since none of them do, the connections never get handled.

We see this most on our older, more resource-strapped postoffices, often shortly, but not immediately, after a spike in load causes a resource limitation. As long as demand for new connections is decreasing or steady, you'll never notice the problem, because there are sufficient workers available to handle the connections. However, if demand ever increases again, you'll eventually hit a shortage of available workers. The master will think that there are sufficient available workers to handle demand, so it won't bother to spawn any more. The workers aren't there, so they will never report to the master as unavailable, and the counter will never get decremented.

You can trick the master into becoming responsive without a restart by increasing the "prefork" number in the cyrus.conf file and sending a HUP signal to the master process. It's not a very pretty solution, but it's a good one if it's the middle of the day and you don't want to force 700 active IMAP sessions to disconnect and reconnect. If you're really brave, you can also attach to the master process with a debugger, reach down inside the Services structure, decrement the number by hand, and detach. Again, not for the faint of heart, but it does address the core problem pretty directly.
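Here is a rough sketch of why the prefork trick works. It assumes a simplified spawn rule: the master forks a worker when its idle count is below the prefork value, or when the idle count is zero and a connection is waiting. That rule is my reading of the behavior, not a copy of the master's code, and the function and parameter names are made up.

```python
def should_spawn(ready_workers, prefork, connection_pending):
    """Simplified, assumed spawn rule for one service in the master."""
    if ready_workers < prefork:
        # Keep at least `prefork` idle workers around.
        return True
    if ready_workers == 0 and connection_pending:
        # No idle workers at all and a client is waiting.
        return True
    return False


# The master believes 3 workers are idle, but all 3 are long dead.
phantom = 3

# With the original prefork setting the service is stuck.
stuck = should_spawn(phantom, prefork=1, connection_pending=True)

# Raising prefork above the phantom count makes the master fork again.
unstuck = should_spawn(phantom, prefork=4, connection_pending=True)
```

Under this model, `stuck` is False and `unstuck` is True, so the new prefork value has to be larger than the number of phantom workers the master is still counting.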
Granted, this doesn't address the original root cause, which is that something caused a worker process to quit while in the available state, and I suppose that's something to look into. However, core dumps by workers are annoying but not critical service outages. One of your services not answering is a critical service outage.

For what it's worth, we were able to dramatically reduce the circumstances under which we hit these conditions by recompiling with the mailboxes.db file as a flat file rather than a Berkeley database, but we still run into them after resource crunches.

Hope some of this helps,

Michael Bacon
OIT Systems Administration
Duke University

--On Monday, May 13, 2002 3:08 PM -0500 Dustin Puryear <[EMAIL PROTECTED]> wrote:

> We continue to have problems with Cyrus. Another poster mentioned they
> have the same problem, but also didn't get any responses. Would one of
> the developers please investigate if this is a bug? What's going on? This
> is a real show stopper for us, and apparently for others as well.
>
> Okay, we have Cyrus installed on FreeBSD 4.4-RELEASE:
>
> cyrus-imapd-2.0.16_1 The cyrus mail server, supporting POP3 and IMAP4 protocols
> cyrus-imapd-2.0.16_2 The cyrus mail server, supporting POP3 and IMAP4 protocols
> cyrus-sasl-1.5.24_7 RFC 2222 SASL (Simple Authentication and Security Layer)
> cyrus-sasl-1.5.24_8 RFC 2222 SASL (Simple Authentication and Security Layer)
> cyrus-sasl-1.5.27_2 RFC 2222 SASL (Simple Authentication and Security Layer)
>
> Every once in a while Cyrus stops responding to connections. Now, it does
> ACCEPT the connection, but it doesn't seem to send. Okay, so lets say
> that I stop Cyrus and it happens to work:
>
> working..
> mercury# telnet mars 110
> Trying 10.0.0.5...
> Connected to mars.actioncore.com.
> Escape character is '^]'.
> +OK <[EMAIL PROTECTED]> Cyrus POP3 v2.0.16 server ready
>
> I get a new pop3d process:
>
> cyrus 1537 0.0 0.8 18836 2128 p0 S 9:52PM 0:00.03 pop3d: pop3d: mercury.actioncore.com[10.0.0.1] (pop3d)
>
> And a TCP connection:
>
> mars# netstat -f inet -ln | grep 10.0.0.1
> tcp4 0 0 10.0.0.5.110 10.0.0.1.2060 ESTABLISHED
>
> If I wait a few seconds to several minutes, Cyrus stops working:
>
> mercury# telnet mars 110
> Trying 10.0.0.5...
> Connected to mars.actioncore.com.
> Escape character is '^]'
> ^C
>
> And the connection does exist (the connection was made from 10.0.0.1):
>
> mars# netstat -f inet -ln | grep 10.0.0.1
> tcp4 0 0 10.0.0.5.110 10.0.0.1.2057 ESTABLISHED
>
> Something I did notice is that when I run lsof that lsof seems to stall
> after it hits some for the pop3d processes. Not sure if that is important
> or just a fluke.
>
> What can we do to debug this further? What are some possible issues here
> to consider? DNS? Corrupted database files? What?
>
> Regards, Dustin
>
> ---
> Dustin Puryear <[EMAIL PROTECTED]>
> UNIX and Network Consultant
> http://members.telocity.com/~dpuryear
> PGP Key available at http://www.us.pgp.net
> In the beginning the Universe was created.
> This has been widely regarded as a bad move. - Douglas Adams