We have noticed a serious load-avg issue after upgrading to cyrus-2.0.16. We have just upgraded to cyrus-2.0.16 over the weekend, we went from Simeon/Messaging Direct (cyrus-1.x version) under Solaris 2.6 to cyrus-2.0.16 under Solaris 8 (kernel patch 108258-10). We tested it as much as possible for about 2 months before transferring all of our mailstore over (around 18000 users). We were able to test all of our e-mail clients except for our webmail client (IMP-2.x). I must say that the new version/system seemed to be great (and other than the following issue, I still think it's great, especially the sieve capability). anyways, here's our issue:
We brought the system back up Sunday afternoon, and up until Monday afternoon everything ran just fine (when we "fixed" IMP). We had an issue with IMP not logging in correctly, but fixed that by flushing the session info IMP uses from the mysql database it stored it in. Once IMP was up & being used we started to notice the following: Our normal load average is betweem 0.7 & 2 (around 2 when we get batches of email in). We noticed the loadavg rising at a tremendous rate, soon it was at 20, 30, 40, all the way up to 65! What was interesting was that response time for me while logged in to ksh was fast, as were connects to pop & imap. Sendmail, of course turned off receives (set to stop at 8). After about an hour we decided to disable IMP (run's on a separate box), and the load dropped immediately to under 1. Here's some info I managed to collect during this scary period: System Info: cyrus-2.0.16 cyrus-sasl-1.5.24 db-3.3.11 sendmail-8.12.1 # cat config.status (for cyrus-imapd) #! /bin/sh # Generated automatically by configure. # Run this file to recreate the current configuration. # This directory was configured as follows, # on host mail1: # # ./configure --with-dbdir=/usr/local/BerkeleyDB.3.3 --with- perl=/usr/local/bin/perl --with-openssl=/usr/local/ssl --with- tcl=/usr/local --with-libwrap=/usr/local --with-sasl=/usr/local/lib --with-auth=unix --with-idle=poll # Here's some info from the running system during hi-load time: top showed: last pid: 933; load averages: 34.75, 30.96, 29.70 15:17:59 185 processes: 131 sleeping, 50 running, 4 on cpu CPU states: 4.7% idle, 83.6% user, 11.2% kernel, 0.5% iowait, 0.0% swap Memory: 4096M real, 3234M free, 180M swap in use, 3974M swap free PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND 785 cyrus 1 55 0 24M 3824K run 0:19 1.86% imapd 734 cyrus 1 56 0 24M 3768K run 0:18 1.84% imapd (many more imapd's, pop3d's running...): root@mail1 # truss -p 785 lwp_mutex_lock(0xFF351E68) = 0 lwp_mutex_wakeup(0xFF351E68) = 0 lwp_mutex_wakeup(0xFF351E68) = 0 pread64(10, "\0\0\001\0 z10 D\0\0\0BF".., 8192, 1564672) = 8192 lwp_mutex_wakeup(0xFF351E68) = 0 lwp_mutex_lock(0xFF351F38) = 0 lwp_mutex_wakeup(0xFF351F38) = 0 pread64(10, "\0\0\001\0 z10 D\0\0\0BF".., 8192, 1564672) = 8192 lwp_mutex_lock(0xFF351E68) = 0 lwp_mutex_wakeup(0xFF351E68) = 0 lwp_mutex_lock(0xFF351E68) = 0 lwp_mutex_wakeup(0xFF351E68) = 0 lwp_mutex_lock(0xFF351F38) = 0 lwp_mutex_wakeup(0xFF351F38) = 0 pread64(10, "\0\0\001\0 z10 D\0\0\0BF".., 8192, 1564672) = 8192 This went on "forever", until I stopped "master". We also noticed the following a little earlier (during the hi-load time) from local6.error: Jan 7 14:54:33 mail1 imapd[6]: [ID 866726 local6.error] DBERROR db3: 74 lockersJan 7 14:54:34 mail1 imapd[30]: [ID 866726 local6.error] DBERROR db3: 76 lockers Jan 7 14:54:39 mail1 imapd[4]: [ID 866726 local6.error] DBERROR db3: Unable to allocate 8287 bytes from mpool shared region: Not enough space Jan 7 14:54:39 mail1 imapd[4]: [ID 598274 local6.error] DBERROR: error advancing: Not enough space Jan 7 14:54:40 mail1 imapd[29918]: [ID 866726 local6.error] DBERROR db3: Unable to allocate 8287 bytes from mpool shared region: Not enough space Jan 7 14:54:40 mail1 imapd[29918]: [ID 598274 local6.error] DBERROR: error advancing: Not enough space Jan 7 14:54:40 mail1 imapd[153]: [ID 866726 local6.error] DBERROR db3: Unable to allocate 8287 bytes from mpool shared region: Not enough space Jan 7 14:54:40 mail1 imapd[153]: [ID 598274 local6.error] DBERROR: error advancing: Not enough space Jan 7 14:54:51 mail1 imapd[151]: [ID 866726 local6.error] DBERROR db3: 86 lockers lots of the above "DBERROR db3: xx lockers" lines Jan 7 15:07:13 mail1 imapd[454]: [ID 898233 local6.error] DBERROR: error closing: DB_INCOMPLETE: Cache flush Jan 7 15:07:13 mail1 imapd[454]: [ID 179994 local6.error] DBERROR: error closing mailboxes: cyrusdb error It seems that IMP was able to tickle something on the system, realize that IMP does a login/logout for every IMAP transaction, so it is an expensive webmail client. However, it shouldn't be able to put cyrus into such a weird state (it is just a client). Any suggestions would be helpful.... Thanks, chris Chris Peck Manager of Unix Services Information Technology The College of William and Mary Williamsburg, VA 23187