Re: Cyrus continues to stop working.. no fix available?

Michael Bacon Mon, 13 May 2002 20:10:48 -0700

Sounds like what we're running into at the moment, which appears to be the 
master process ending up with an incorrect count of available workers. 
The problem occurs when a worker process dies while in the "available" 
state, and doesn't notify the master.  Jeremy Howard recently posted a 
patch which addresses this problem, by decrementing the "available workers" 
counter when receiving a SIGCHLD, which strikes me as the right way to go. 
However, his patch is for 2.1.3, and like you, we're using 2.0.16 (the 
bleeding edge is a bad place to be with 9 postoffices and 40k users).  As 
soon as I find that mythical spare moment, I'm going to look at applying 
the patch to 2.0.16.  I think it could address what's been a nightmare for 
us.

To put it in a little more detail, what we see is one service, say, pop3d 
or lmtpd, suddenly stop working, even though there may be active processes 
that are working just fine for that service.  At first, you'll see the 
connection accepted but not handled, as in your example.  However, those 
connections will never be cleared from the listen queue properly, so 
eventually the listen queue will fill up, and you'll get either refused 
connections or connections that are never accepted.

The problem is, as I said above, that the master has an incorrectly 
inflated number of available workers, and it's simply expecting them to 
handle the connections.  However, since none are, the connections never get 
handled.  We see this most on our older, more resource-strapped 
postoffices, often shortly (but not immediately) after a spike in load 
causes a resource shortage.  As long as demand for new connections is 
decreasing or steady, you'll never notice the problem, because there are 
sufficient workers available to handle the connections.  However, if demand 
ever increases again, you'll eventually hit a shortage of available 
workers.  The master will think that there are sufficient available workers 
to handle demand, so it won't bother to spawn any more.  The phantom workers 
aren't there, so they will never report to the master as unavailable, and 
the counter will never get decremented.
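
To make the bookkeeping concrete, here is a toy model in Python. It is not 
Cyrus source: the class, the field names, and the "only spawn when the 
counter reads zero" rule are all assumptions made for illustration.  It 
only shows how one worker dying silently while available leaves the counter 
one too high, and how correcting the counter when the child is reaped (the 
approach of the patch mentioned above) avoids the stall.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Toy stand-in for one service entry in the master (hypothetical names)."""
    believed_ready: int = 0  # the master's "available workers" counter
    actual_ready: int = 0    # workers that really exist and are waiting

    def spawn(self) -> None:
        self.believed_ready += 1
        self.actual_ready += 1

    def worker_dies_while_ready(self, master_reaps: bool) -> None:
        # The worker exits without telling the master anything.
        self.actual_ready -= 1
        if master_reaps:
            # Patched behaviour: fix the counter when the child is reaped.
            self.believed_ready -= 1

    def master_will_spawn(self) -> bool:
        # Assumed rule: spawn only when the master believes nobody is ready.
        return self.believed_ready == 0


def connection_is_served(master_reaps: bool) -> bool:
    """One worker dies while available, then a new connection arrives."""
    svc = Service()
    svc.spawn()
    svc.worker_dies_while_ready(master_reaps)
    if svc.master_will_spawn():
        svc.spawn()
    return svc.actual_ready > 0


print(connection_is_served(master_reaps=False))  # False: nobody picks it up
print(connection_is_served(master_reaps=True))   # True: a fresh worker is spawned
```

In the unpatched case the counter still reads 1, so the master waits forever 
for a worker that no longer exists.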

You can trick the master into becoming responsive without a restart by 
increasing the "prefork" number in the cyrus.conf file and sending a HUP 
signal to the master process.  It's not a very pretty solution, but it's a 
good one if it's the middle of the day, and you don't want to force 700 
active IMAP sessions to disconnect and reconnect.  If you're really brave, 
you can also attach to the master process with a debugger, reach down 
inside the Services structure, decrement the available-worker count for the 
stuck service by hand, and detach. 
Again, not for the faint of heart, but it does address the core problem 
pretty directly.
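
As a rough illustration of why the prefork trick works, the arithmetic can 
be sketched in Python.  This assumes the master simply tops each service up 
to its "prefork" count of available workers when it rereads its 
configuration; the function name and the numbers are hypothetical and not 
taken from Cyrus.

```python
def workers_to_spawn(believed_ready: int, prefork: int) -> int:
    """Assumed rule: top the service up to `prefork` available workers."""
    return max(0, prefork - believed_ready)


phantom = 3                  # counter inflated by workers that died unnoticed
prefork_before = 1           # original setting: the master sees no shortfall
prefork_after = phantom + 2  # raised in cyrus.conf before sending the HUP

print(workers_to_spawn(phantom, prefork_before))  # 0: service stays stuck
print(workers_to_spawn(phantom, prefork_after))   # 2: real workers appear
```

The new prefork value has to exceed the inflated count, otherwise the 
master still believes it has enough workers and spawns nothing.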

Granted, this doesn't address the original root cause, which is that 
something caused a worker process to quit while in the available state, and 
I suppose that's something to look into.  However, core dumps by workers 
are annoying, but not critical service outages.  One of your services not 
answering is a critical service outage.  For what it's worth, we were able 
to dramatically reduce the circumstances under which we hit these 
conditions by re-compiling with the mailboxes.db file as a flat file rather 
than a Berkeley database, but we still run into them after resource 
crunches.

Hope some of this helps,
Michael Bacon
OIT Systems Administration
Duke University

--On Monday, May 13, 2002 3:08 PM -0500 Dustin Puryear <[EMAIL PROTECTED]> 
wrote:

> We continue to have problems with Cyrus. Another poster mentioned they
> have the same problem, but also didn't get any responses. Would one of
> the developers please investigate if this is a bug? What's going on? This
> is a real show stopper for us, and apparently for others as well.
>
> Okay, we have Cyrus installed on FreeBSD 4.4-RELEASE:
>
> cyrus-imapd-2.0.16_1  The cyrus mail server, supporting POP3 and IMAP4 protocols
> cyrus-imapd-2.0.16_2  The cyrus mail server, supporting POP3 and IMAP4 protocols
> cyrus-sasl-1.5.24_7   RFC 2222 SASL (Simple Authentication and Security Layer)
> cyrus-sasl-1.5.24_8   RFC 2222 SASL (Simple Authentication and Security Layer)
> cyrus-sasl-1.5.27_2   RFC 2222 SASL (Simple Authentication and Security Layer)
>
> Every once in a while Cyrus stops responding to connections. Now, it does
> ACCEPT the connection, but it doesn't seem to send anything. Okay, so let's
> say that I restart Cyrus and it happens to work:
>
> working..
> mercury# telnet mars 110
> Trying 10.0.0.5...
> Connected to mars.actioncore.com.
> Escape character is '^]'.
> +OK <[EMAIL PROTECTED]> Cyrus POP3 v2.0.16 server ready
>
> I get a new pop3d process:
>
> cyrus    1537  0.0  0.8 18836 2128  p0  S     9:52PM   0:00.03 pop3d:
> pop3d: mercury.actioncore.com[10.0.0.1]   (pop3d)
>
> And a TCP connection:
>
> mars# netstat -f inet -ln | grep 10.0.0.1
> tcp4       0      0  10.0.0.5.110           10.0.0.1.2060          ESTABLISHED
>
> If I wait a few seconds to several minutes, Cyrus stops working:
>
> mercury# telnet mars 110
> Trying 10.0.0.5...
> Connected to mars.actioncore.com.
> Escape character is '^]'
> ^C
>
> And the connection does exist (the connection was made from 10.0.0.1):
>
> mars# netstat -f inet -ln | grep 10.0.0.1
> tcp4 0 0 10.0.0.5.110 10.0.0.1.2057 ESTABLISHED
>
> Something I did notice is that when I run lsof, it seems to stall after
> it hits some of the pop3d processes. Not sure if that is important or
> just a fluke.
>
> What can we do to debug this further? What are some possible issues here
> to consider? DNS? Corrupted database files? What?
>
> Regards, Dustin
>
> ---
> Dustin Puryear <[EMAIL PROTECTED]>
> UNIX and Network Consultant
> http://members.telocity.com/~dpuryear
> PGP Key available at http://www.us.pgp.net
> In the beginning the Universe was created.
> This has been widely regarded as a bad move. - Douglas Adams
>
>
