On Mon, 17 May 2004, Scott Adkins wrote:
> I do know that as of 2.2.1, the problem still exists. I find it impossible > to trigger the bug on my own in our test environment, and we won't be doing > an upgrade to 2.2.3 (or whatever happens to be out) until the summer time > frame. > > Currently, we manage the problem by watching the sendmail syslog file for > System I/O errors on the Cyrus mailer (grep for 'mailer=cyrus.*System.I.O') > and when we see some, we run another script that tells us which cyrus proc > has the write lock on their quota file and we kill that process. > > The problem occurs regularly, with some users more reliably triggering the > problem than others. > > For those who are curious, our Cyrus server runs on a cluster of 3 machines > running with Alpha Tru64 5.2. I have not seen this happen with lmtpd, but I do get an occasional such error when a mailer (pine and OE) saves a message to the sent folder via imap protocol: May 17 10:35:56 imaps[9711]: [ID 136705 local6.error] IOERROR: opening /export/cyrus/mail/M/user/igor/sent-mail/cyrus.header: No such file or directory Unfortunately, I cannot reproduce this problem because it occurs very infrequently. This particular server runs cyrus v2.2.2. I will try cvs head. -Igor > Scott > > --On Monday, May 17, 2004 2:47 PM +0200 Sebastian Hagedorn > <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > sorry, this is quite long ... I'm looking for the updated status of the > > problem discussed in the threads "stuck lmtpd processes": > > > > <http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&ms > > g=24927> > > > > and "followup: stuck lmtpd processes": > > > > <http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&se > > archterm=stuck%20lmtpd&msg=24934> > > > > AFAICT Cyrus IMAP 2.1.x hasn't been updated since. Could anyone comment > > on whether this is fixed in 2.2.3 and if we *have* to update in order to > > get that fix? In the last two weeks we've had this (or a very similar) > > problem twice with 2.1.16 und Red Hat AS 2.1. > > > > For some reason in both cases tons of sendmail processes were building > > up. We have configured sendmail to deliver in "interactive" mode, so each > > instance tries to connect to lmtpd. I would've expected those attempts to > > time out relatively quickly. I'm not sure why they don't. Anyway, that's > > somewhat off-topic ... > > > > The first time it happened I was around and collected some info while it > > was happening. I'd noticed that mail was no longer being delivered. We > > have maxprocs set to 30 for lmtpd: > > > > [EMAIL PROTECTED] root]# ps -aef|grep lmtp > > cyrus 1383 2095 0 10:10 ? 00:00:18 lmtpd > > cyrus 1974 2095 0 10:11 ? 00:00:00 lmtpd > > cyrus 10631 2095 0 10:36 ? 00:00:00 lmtpd > > cyrus 17162 2095 0 10:52 ? 00:00:00 lmtpd > > cyrus 17262 2095 0 10:52 ? 00:00:06 lmtpd > > cyrus 25687 2095 0 11:17 ? 00:00:00 lmtpd > > cyrus 28377 2095 0 11:23 ? 00:00:09 lmtpd > > cyrus 31251 2095 0 11:32 ? 00:00:00 lmtpd > > cyrus 3824 2095 0 11:46 ? 00:00:07 lmtpd > > cyrus 7534 2095 0 11:56 ? 00:00:00 lmtpd > > cyrus 8248 2095 0 11:58 ? 00:00:08 lmtpd > > cyrus 22103 2095 0 12:35 ? 00:00:00 lmtpd > > cyrus 23725 2095 0 12:39 ? 00:00:01 lmtpd > > cyrus 25366 2095 0 12:45 ? 00:00:00 lmtpd > > cyrus 6108 2095 0 13:23 ? 00:00:00 lmtpd > > cyrus 10832 2095 0 13:37 ? 00:00:00 lmtpd > > cyrus 12972 2095 0 13:42 ? 00:00:00 lmtpd > > cyrus 16114 2095 0 13:53 ? 00:00:00 lmtpd > > cyrus 21377 2095 0 14:07 ? 00:00:09 lmtpd > > cyrus 24581 2095 0 14:16 ? 00:00:00 lmtpd > > cyrus 26200 2095 0 14:21 ? 00:00:14 lmtpd > > cyrus 26577 2095 0 14:22 ? 00:00:00 lmtpd > > cyrus 31893 2095 0 14:35 ? 00:00:03 lmtpd > > cyrus 32575 2095 0 14:36 ? 00:00:07 lmtpd > > cyrus 32659 2095 0 14:37 ? 00:00:01 lmtpd > > cyrus 14146 2095 0 15:16 ? 00:00:00 lmtpd > > cyrus 21049 2095 0 15:36 ? 00:00:00 lmtpd > > cyrus 26969 2095 0 15:56 ? 00:00:11 lmtpd > > cyrus 13275 2095 0 15:15 ? 00:00:04 lmtpd > > cyrus 17679 2095 0 13:57 ? 00:00:00 lmtpd > > > > Obviously something was wrong. > > > > [EMAIL PROTECTED] root]# lsof -p 26969 > > COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME > > lmtpd 26969 cyrus cwd DIR 9,0 4096 393296 /root > > <snip> > > lmtpd 26969 cyrus mem REG 8,73 3841576 30 > > /var/lib/imap/mailboxes.db > > lmtpd 26969 cyrus mem REG 8,49 194 145326099 > > /var/spool/imap/M/user/therbst/cyrus.header > > <snip> > > lmtpd 26969 cyrus 9u REG 8,49 3130 56013510 > > /var/spool/imap/stage./26969-1083938870 > > lmtpd 26969 cyrus 10u unix 0xef1ce040 614472274 socket > > lmtpd 26969 cyrus 11u REG 8,49 194 145326099 > > /var/spool/imap/M/user/therbst/cyrus.header > > > > I gathered that /var/spool/imap/M/user/therbst/cyrus.header was causing > > the problems.Its node number (is that the inode??) is 145326099: > > > > [EMAIL PROTECTED] root]# grep 145326099 /proc/locks > > 1162: POSIX ADVISORY WRITE 1974 08:31:145326099 0 EOF e9b58d54 e9b587f4 > > cd5a7964 00000000 d1c324c0 > > 1162: -> POSIX ADVISORY WRITE 1383 08:31:145326099 0 EOF d1c324b4 > > c7fd745c f3bd0964 e9b58d54 c7fd7464 > > 1162: -> POSIX ADVISORY WRITE 10631 08:31:145326099 0 EOF c7fd7458 > > d1c3245c d1c324b8 e9b58d54 d1c32464 > > 1162: -> POSIX ADVISORY WRITE 17162 08:31:145326099 0 EOF d1c32458 > > d6989400 c7fd745c e9b58d54 d6989408 > > 1162: -> POSIX ADVISORY WRITE 17262 08:31:145326099 0 EOF d69893fc > > dfb2eca0 d1c3245c e9b58d54 dfb2eca8 > > 1162: -> POSIX ADVISORY WRITE 25687 08:31:145326099 0 EOF dfb2ec9c > > f3972be8 d6989400 e9b58d54 f3972bf0 > > 1162: -> POSIX ADVISORY WRITE 28377 08:31:145326099 0 EOF f3972be4 > > c47d96e0 dfb2eca0 e9b58d54 c47d96e8 > > 1162: -> POSIX ADVISORY WRITE 31251 08:31:145326099 0 EOF c47d96dc > > cd5a7d58 f3972be8 e9b58d54 cd5a7d60 > > 1162: -> POSIX ADVISORY WRITE 7534 08:31:145326099 0 EOF cd5a7d54 > > c47d98ac c47d96e0 e9b58d54 c47d98b4 > > 1162: -> POSIX ADVISORY WRITE 3824 08:31:145326099 0 EOF c47d98a8 > > c7fd7400 cd5a7d58 e9b58d54 c7fd7408 > > 1162: -> POSIX ADVISORY WRITE 8248 08:31:145326099 0 EOF c7fd73fc > > c8733be8 c47d98ac e9b58d54 c8733bf0 > > 1162: -> POSIX ADVISORY WRITE 22103 08:31:145326099 0 EOF c8733be4 > > cd197f24 c7fd7400 e9b58d54 cd197f2c > > 1162: -> POSIX ADVISORY WRITE 25366 08:31:145326099 0 EOF cd197f20 > > d1c32290 c8733be8 e9b58d54 d1c32298 > > 1162: -> POSIX ADVISORY WRITE 23725 08:31:145326099 0 EOF d1c3228c > > c47d9964 cd197f24 e9b58d54 c47d996c > > 1162: -> POSIX ADVISORY WRITE 6108 08:31:145326099 0 EOF c47d9960 > > f3bd0234 d1c32290 e9b58d54 f3bd023c > > 1162: -> POSIX ADVISORY WRITE 10832 08:31:145326099 0 EOF f3bd0230 > > c47d9290 c47d9964 e9b58d54 c47d9298 > > 1162: -> POSIX ADVISORY WRITE 12972 08:31:145326099 0 EOF c47d928c > > d1b450c4 f3bd0234 e9b58d54 d1b450cc > > 1162: -> POSIX ADVISORY WRITE 16114 08:31:145326099 0 EOF d1b450c0 > > dfb2e120 c47d9290 e9b58d54 dfb2e128 > > 1162: -> POSIX ADVISORY WRITE 26577 08:31:145326099 0 EOF dfb2e11c > > f6bfb7f4 d1b450c4 e9b58d54 f6bfb7fc > > 1162: -> POSIX ADVISORY WRITE 24581 08:31:145326099 0 EOF f6bfb7f0 > > f6bfb3a4 dfb2e120 e9b58d54 f6bfb3ac > > 1162: -> POSIX ADVISORY WRITE 21377 08:31:145326099 0 EOF f6bfb3a0 > > c87333a4 f6bfb7f4 e9b58d54 c87333ac > > 1162: -> POSIX ADVISORY WRITE 32575 08:31:145326099 0 EOF c87333a0 > > d1c32628 f6bfb3a4 e9b58d54 d1c32630 > > 1162: -> POSIX ADVISORY WRITE 32659 08:31:145326099 0 EOF d1c32624 > > edb35e10 c87333a4 e9b58d54 edb35e18 > > 1162: -> POSIX ADVISORY WRITE 31893 08:31:145326099 0 EOF edb35e0c > > edb35798 d1c32628 e9b58d54 edb357a0 > > 1162: -> POSIX ADVISORY WRITE 14146 08:31:145326099 0 EOF edb35794 > > cdf5ff24 edb35e10 e9b58d54 cdf5ff2c > > 1162: -> POSIX ADVISORY WRITE 21049 08:31:145326099 0 EOF cdf5ff20 > > e8adbca0 edb35798 e9b58d54 e8adbca8 > > 1162: -> POSIX ADVISORY WRITE 26200 08:31:145326099 0 EOF e8adbc9c > > cd5a7798 cdf5ff24 e9b58d54 cd5a77a0 > > 1162: -> POSIX ADVISORY WRITE 26969 08:31:145326099 0 EOF cd5a7794 > > d1695628 e8adbca0 e9b58d54 d1695630 > > 1162: -> POSIX ADVISORY WRITE 13275 08:31:145326099 0 EOF d1695624 > > d1b45e6c cd5a7798 e9b58d54 d1b45e74 > > 1162: -> POSIX ADVISORY WRITE 17679 08:31:145326099 0 EOF d1b45e68 > > cdf5f5cc d1695628 e9b58d54 e9b58d60 > > > > The PIDs waiting for locks correspond to the PIDs of the stuck lmtpd > > processes. At that point I restarted Cyrus, because I wanted to get the > > system going again. Only later did I read that the actual problem is more > > likely caused by a locked quota file. Unfortunately I did not further > > examine PID 1974, which held the lock. > > > > Anyway, this past Saturday it happened again when I was not around. > > Ultimately the system shut itself down, because the Red Hat cluster > > software wasn't able to update its raw device any longer. So when I > > examined the case, all I could go on were log files. Still I was able to > > piece it together, I think. > > > > It started here: > > > > May 14 09:04:59 lvr1 imapd[14301]: login: > > vpn82-4.vpn.Uni-Koeln.DE[134.95.82.4] s0283928 plaintext+TLS > > May 14 09:04:59 lvr1 imapd[14301]: seen_db: user s0283928 opened > > /var/lib/imap/user/B/s0283928.seen > > May 14 09:04:59 lvr1 imapd[14301]: open: user s0283928 opened Drafts > > > > As you can see a VPN dial-up connection was used. I think it likely that > > the notorious APPEND problem happened. i.e. the connection was dropped by > > the client while the server was doing an APPEND. This is bug 1270 > > <https://bugzilla.andrew.cmu.edu/show_bug.cgi?id=1270> and is marked > > FIXED. For completeness' sake: after the IMAP session above, hte lmtpd > > processes started getting stuck when delivering to this user: > > > > May 14 15:59:44 lvr1 lmtpd[3879]: duplicate_check: > > <[EMAIL PROTECTED]> > > user.s0283928 > > 0 > > May 14 16:42:57 lvr1 lmtpd[26559]: duplicate_check: > > <[EMAIL PROTECTED]> > > user.s0283928 > > 0 > > May 14 17:26:45 lvr1 lmtpd[28635]: duplicate_check: > > <[EMAIL PROTECTED]> > > user.s0283928 > > 0 > > May 14 18:08:01 lvr1 lmtpd[32017]: duplicate_check: > > <[EMAIL PROTECTED]> > > user.s0283928 > > 0 > > May 14 18:43:25 lvr1 lmtpd[6860]: duplicate_check: > > <[EMAIL PROTECTED]> > > user.s0283928 > > 0 > > May 14 19:34:23 lvr1 lmtpd[28641]: duplicate_check: > > <[EMAIL PROTECTED]> > > user.s0283928 > > 0 > > May 14 19:49:37 lvr1 lmtpd[26587]: duplicate_check: > > <[EMAIL PROTECTED]> > > user.s0283928 > > 0 > > May 14 20:44:45 lvr1 lmtpd[2960]: duplicate_check: > > <[EMAIL PROTECTED]> > > user.s0283928 > > 0 > > May 14 21:24:31 lvr1 lmtpd[16921]: duplicate_check: > > <[EMAIL PROTECTED]> > > user.s0283928 > > 0 > > May 14 21:45:35 lvr1 lmtpd[3571]: duplicate_check: > > <[EMAIL PROTECTED]> > > user.s0283928 > > 0 > > May 14 22:25:25 lvr1 lmtpd[31517]: duplicate_check: > > <[EMAIL PROTECTED]> > > user.s0283928 > > 0 > > May 14 22:59:53 lvr1 lmtpd[26060]: duplicate_check: > > <[EMAIL PROTECTED]> > > user.s0283928 > > 0 > > May 14 23:37:39 lvr1 lmtpd[4317]: duplicate_check: > > <[EMAIL PROTECTED]> > > user.s0283928 > > 0 > > May 15 00:08:15 lvr1 lmtpd[930]: duplicate_check: > > <[EMAIL PROTECTED]> > > user.s0283928 > > > > So, is this really fixed in 2.2.3? Has anybody seen something like this > > since? And would it be OK to apply the patch to 2.1.16? Or does it depend > > on other stuff that happened in CVS between 2.1.16 and the time the patch > > was created? > > > > Thanks, Sebastian Hagedorn > > -- > > Sebastian Hagedorn M.A. - RZKR-R1 (Gebäude 52), Zimmer 18 > > Zentrum für angewandte Informatik - Universitätsweiter Service RRZK > > Universität zu Köln / Cologne University - Tel. +49-221-478-5587 > > > > -- Igor --- Cyrus Home Page: http://asg.web.cmu.edu/cyrus Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html