We have noticed a serious load-avg issue after upgrading to cyrus-2.0.16.

We have just upgraded to cyrus-2.0.16 over the weekend, we went from 
Simeon/Messaging Direct (cyrus-1.x version) under Solaris 2.6 to 
cyrus-2.0.16 under Solaris 8 (kernel patch 108258-10).  We tested it as 
much as possible for about 2 months before transferring all of our 
mailstore over (around 18000 users).  We were able to test all of our 
e-mail clients except for our webmail client (IMP-2.x).  I must say that 
the new version/system seemed to be great (and other than the following 
issue, I still think it's great, especially the sieve capability).  
anyways, here's our issue:

We brought the system back up Sunday afternoon, and up until Monday 
afternoon everything ran just fine (when we "fixed" IMP). We had an 
issue with IMP not logging in correctly, but fixed that by flushing the 
session info IMP uses from the mysql database it stored it in.  Once IMP 
was up & being used we started to notice the following:

Our normal load average is betweem 0.7 & 2 (around 2 when we get batches 
of email in). We noticed the loadavg rising at a tremendous rate, soon 
it was at 20, 30, 40, all the way up to 65! What was interesting was 
that response time for me while logged in to ksh was fast, as were 
connects to pop & imap. Sendmail, of course turned off receives (set to 
stop at 8). After about an hour we decided to disable IMP (run's on a 
separate box), and the load dropped immediately to under 1.  Here's some 
info I managed to collect during this scary period:

System  Info:
cyrus-2.0.16
cyrus-sasl-1.5.24
db-3.3.11
sendmail-8.12.1

# cat config.status (for cyrus-imapd)
#! /bin/sh
# Generated automatically by configure.
# Run this file to recreate the current configuration.
# This directory was configured as follows,
# on host mail1:
#
# ./configure  --with-dbdir=/usr/local/BerkeleyDB.3.3 --with-
perl=/usr/local/bin/perl --with-openssl=/usr/local/ssl --with-
tcl=/usr/local --with-libwrap=/usr/local --with-sasl=/usr/local/lib 
--with-auth=unix --with-idle=poll
#


Here's some info from the running system during hi-load time:

top showed:
last pid:   933;  load averages: 34.75, 30.96, 
29.70                                                           15:17:59
185 processes: 131 sleeping, 50 running, 4 on cpu
CPU states:  4.7% idle, 83.6% user, 11.2% kernel,  0.5% iowait,  0.0% 
swap
Memory: 4096M real, 3234M free, 180M swap in use, 3974M swap free

    PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
    785 cyrus      1  55    0   24M 3824K run      0:19  1.86% imapd
    734 cyrus      1  56    0   24M 3768K run      0:18  1.84% imapd
(many more imapd's, pop3d's running...):

root@mail1 # truss -p 785
lwp_mutex_lock(0xFF351E68)                      = 0
lwp_mutex_wakeup(0xFF351E68)                    = 0
lwp_mutex_wakeup(0xFF351E68)                    = 0
pread64(10, "\0\0\001\0 z10 D\0\0\0BF".., 8192, 1564672) = 8192
lwp_mutex_wakeup(0xFF351E68)                    = 0
lwp_mutex_lock(0xFF351F38)                      = 0
lwp_mutex_wakeup(0xFF351F38)                    = 0
pread64(10, "\0\0\001\0 z10 D\0\0\0BF".., 8192, 1564672) = 8192
lwp_mutex_lock(0xFF351E68)                      = 0
lwp_mutex_wakeup(0xFF351E68)                    = 0
lwp_mutex_lock(0xFF351E68)                      = 0
lwp_mutex_wakeup(0xFF351E68)                    = 0
lwp_mutex_lock(0xFF351F38)                      = 0
lwp_mutex_wakeup(0xFF351F38)                    = 0
pread64(10, "\0\0\001\0 z10 D\0\0\0BF".., 8192, 1564672) = 8192

This went on "forever", until I stopped "master".

We also noticed the following a little earlier (during the hi-load time) 
from local6.error:
Jan  7 14:54:33 mail1 imapd[6]: [ID 866726 local6.error] DBERROR db3: 74 
lockersJan  7 14:54:34 mail1 imapd[30]: [ID 866726 local6.error] DBERROR 
db3: 76 lockers
Jan  7 14:54:39 mail1 imapd[4]: [ID 866726 local6.error] DBERROR db3: 
Unable to allocate 8287 bytes from mpool shared region: Not enough space
Jan  7 14:54:39 mail1 imapd[4]: [ID 598274 local6.error] DBERROR: error 
advancing: Not enough space
Jan  7 14:54:40 mail1 imapd[29918]: [ID 866726 local6.error] DBERROR 
db3: Unable to allocate 8287 bytes from mpool shared region: Not enough 
space
Jan  7 14:54:40 mail1 imapd[29918]: [ID 598274 local6.error] DBERROR: 
error advancing: Not enough space
Jan  7 14:54:40 mail1 imapd[153]: [ID 866726 local6.error] DBERROR db3: 
Unable to allocate 8287 bytes from mpool shared region: Not enough space
Jan  7 14:54:40 mail1 imapd[153]: [ID 598274 local6.error] DBERROR: 
error advancing: Not enough space
Jan  7 14:54:51 mail1 imapd[151]: [ID 866726 local6.error] DBERROR db3: 
86 lockers

lots of the above "DBERROR db3: xx lockers" lines

Jan  7 15:07:13 mail1 imapd[454]: [ID 898233 local6.error] DBERROR: 
error closing: DB_INCOMPLETE: Cache flush
Jan  7 15:07:13 mail1 imapd[454]: [ID 179994 local6.error] DBERROR: 
error closing mailboxes: cyrusdb error

It seems that IMP was able to tickle something on the system, realize 
that IMP does a login/logout for every IMAP transaction, so it is an 
expensive webmail client. However, it shouldn't be able to put cyrus 
into such a weird state (it is just a client).

Any suggestions would be helpful....

Thanks,
chris


Chris Peck
Manager of Unix Services
Information Technology
The College of William and Mary
Williamsburg, VA 23187

Reply via email to