Hi,

sorry, this is quite long ... I'm looking for the updated status of the problem discussed in the threads "stuck lmtpd processes":

<http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&msg=24927>

and "followup: stuck lmtpd processes":

<http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&searchterm=stuck%20lmtpd&msg=24934>

AFAICT Cyrus IMAP 2.1.x hasn't been updated since. Could anyone comment on whether this is fixed in 2.2.3 and if we *have* to update in order to get that fix? In the last two weeks we've had this (or a very similar) problem twice with 2.1.16 und Red Hat AS 2.1.

For some reason in both cases tons of sendmail processes were building up. We have configured sendmail to deliver in "interactive" mode, so each instance tries to connect to lmtpd. I would've expected those attempts to time out relatively quickly. I'm not sure why they don't. Anyway, that's somewhat off-topic ...

The first time it happened I was around and collected some info while it was happening. I'd noticed that mail was no longer being delivered. We have maxprocs set to 30 for lmtpd:

[EMAIL PROTECTED] root]# ps -aef|grep lmtp
cyrus     1383  2095  0 10:10 ?        00:00:18 lmtpd
cyrus     1974  2095  0 10:11 ?        00:00:00 lmtpd
cyrus    10631  2095  0 10:36 ?        00:00:00 lmtpd
cyrus    17162  2095  0 10:52 ?        00:00:00 lmtpd
cyrus    17262  2095  0 10:52 ?        00:00:06 lmtpd
cyrus    25687  2095  0 11:17 ?        00:00:00 lmtpd
cyrus    28377  2095  0 11:23 ?        00:00:09 lmtpd
cyrus    31251  2095  0 11:32 ?        00:00:00 lmtpd
cyrus     3824  2095  0 11:46 ?        00:00:07 lmtpd
cyrus     7534  2095  0 11:56 ?        00:00:00 lmtpd
cyrus     8248  2095  0 11:58 ?        00:00:08 lmtpd
cyrus    22103  2095  0 12:35 ?        00:00:00 lmtpd
cyrus    23725  2095  0 12:39 ?        00:00:01 lmtpd
cyrus    25366  2095  0 12:45 ?        00:00:00 lmtpd
cyrus     6108  2095  0 13:23 ?        00:00:00 lmtpd
cyrus    10832  2095  0 13:37 ?        00:00:00 lmtpd
cyrus    12972  2095  0 13:42 ?        00:00:00 lmtpd
cyrus    16114  2095  0 13:53 ?        00:00:00 lmtpd
cyrus    21377  2095  0 14:07 ?        00:00:09 lmtpd
cyrus    24581  2095  0 14:16 ?        00:00:00 lmtpd
cyrus    26200  2095  0 14:21 ?        00:00:14 lmtpd
cyrus    26577  2095  0 14:22 ?        00:00:00 lmtpd
cyrus    31893  2095  0 14:35 ?        00:00:03 lmtpd
cyrus    32575  2095  0 14:36 ?        00:00:07 lmtpd
cyrus    32659  2095  0 14:37 ?        00:00:01 lmtpd
cyrus    14146  2095  0 15:16 ?        00:00:00 lmtpd
cyrus    21049  2095  0 15:36 ?        00:00:00 lmtpd
cyrus    26969  2095  0 15:56 ?        00:00:11 lmtpd
cyrus    13275  2095  0 15:15 ?        00:00:04 lmtpd
cyrus    17679  2095  0 13:57 ?        00:00:00 lmtpd

Obviously something was wrong.

[EMAIL PROTECTED] root]# lsof -p 26969
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
lmtpd 26969 cyrus cwd DIR 9,0 4096 393296 /root
<snip>
lmtpd 26969 cyrus mem REG 8,73 3841576 30 /var/lib/imap/mailboxes.db
lmtpd 26969 cyrus mem REG 8,49 194 145326099 /var/spool/imap/M/user/therbst/cyrus.header
<snip>
lmtpd 26969 cyrus 9u REG 8,49 3130 56013510 /var/spool/imap/stage./26969-1083938870
lmtpd 26969 cyrus 10u unix 0xef1ce040 614472274 socket
lmtpd 26969 cyrus 11u REG 8,49 194 145326099 /var/spool/imap/M/user/therbst/cyrus.header

I gathered that /var/spool/imap/M/user/therbst/cyrus.header was causing the problems.Its node number (is that the inode??) is 145326099:

[EMAIL PROTECTED] root]# grep 145326099 /proc/locks
1162: POSIX ADVISORY WRITE 1974 08:31:145326099 0 EOF e9b58d54 e9b587f4 cd5a7964 00000000 d1c324c0
1162: -> POSIX ADVISORY WRITE 1383 08:31:145326099 0 EOF d1c324b4 c7fd745c f3bd0964 e9b58d54 c7fd7464
1162: -> POSIX ADVISORY WRITE 10631 08:31:145326099 0 EOF c7fd7458 d1c3245c d1c324b8 e9b58d54 d1c32464
1162: -> POSIX ADVISORY WRITE 17162 08:31:145326099 0 EOF d1c32458 d6989400 c7fd745c e9b58d54 d6989408
1162: -> POSIX ADVISORY WRITE 17262 08:31:145326099 0 EOF d69893fc dfb2eca0 d1c3245c e9b58d54 dfb2eca8
1162: -> POSIX ADVISORY WRITE 25687 08:31:145326099 0 EOF dfb2ec9c f3972be8 d6989400 e9b58d54 f3972bf0
1162: -> POSIX ADVISORY WRITE 28377 08:31:145326099 0 EOF f3972be4 c47d96e0 dfb2eca0 e9b58d54 c47d96e8
1162: -> POSIX ADVISORY WRITE 31251 08:31:145326099 0 EOF c47d96dc cd5a7d58 f3972be8 e9b58d54 cd5a7d60
1162: -> POSIX ADVISORY WRITE 7534 08:31:145326099 0 EOF cd5a7d54 c47d98ac c47d96e0 e9b58d54 c47d98b4
1162: -> POSIX ADVISORY WRITE 3824 08:31:145326099 0 EOF c47d98a8 c7fd7400 cd5a7d58 e9b58d54 c7fd7408
1162: -> POSIX ADVISORY WRITE 8248 08:31:145326099 0 EOF c7fd73fc c8733be8 c47d98ac e9b58d54 c8733bf0
1162: -> POSIX ADVISORY WRITE 22103 08:31:145326099 0 EOF c8733be4 cd197f24 c7fd7400 e9b58d54 cd197f2c
1162: -> POSIX ADVISORY WRITE 25366 08:31:145326099 0 EOF cd197f20 d1c32290 c8733be8 e9b58d54 d1c32298
1162: -> POSIX ADVISORY WRITE 23725 08:31:145326099 0 EOF d1c3228c c47d9964 cd197f24 e9b58d54 c47d996c
1162: -> POSIX ADVISORY WRITE 6108 08:31:145326099 0 EOF c47d9960 f3bd0234 d1c32290 e9b58d54 f3bd023c
1162: -> POSIX ADVISORY WRITE 10832 08:31:145326099 0 EOF f3bd0230 c47d9290 c47d9964 e9b58d54 c47d9298
1162: -> POSIX ADVISORY WRITE 12972 08:31:145326099 0 EOF c47d928c d1b450c4 f3bd0234 e9b58d54 d1b450cc
1162: -> POSIX ADVISORY WRITE 16114 08:31:145326099 0 EOF d1b450c0 dfb2e120 c47d9290 e9b58d54 dfb2e128
1162: -> POSIX ADVISORY WRITE 26577 08:31:145326099 0 EOF dfb2e11c f6bfb7f4 d1b450c4 e9b58d54 f6bfb7fc
1162: -> POSIX ADVISORY WRITE 24581 08:31:145326099 0 EOF f6bfb7f0 f6bfb3a4 dfb2e120 e9b58d54 f6bfb3ac
1162: -> POSIX ADVISORY WRITE 21377 08:31:145326099 0 EOF f6bfb3a0 c87333a4 f6bfb7f4 e9b58d54 c87333ac
1162: -> POSIX ADVISORY WRITE 32575 08:31:145326099 0 EOF c87333a0 d1c32628 f6bfb3a4 e9b58d54 d1c32630
1162: -> POSIX ADVISORY WRITE 32659 08:31:145326099 0 EOF d1c32624 edb35e10 c87333a4 e9b58d54 edb35e18
1162: -> POSIX ADVISORY WRITE 31893 08:31:145326099 0 EOF edb35e0c edb35798 d1c32628 e9b58d54 edb357a0
1162: -> POSIX ADVISORY WRITE 14146 08:31:145326099 0 EOF edb35794 cdf5ff24 edb35e10 e9b58d54 cdf5ff2c
1162: -> POSIX ADVISORY WRITE 21049 08:31:145326099 0 EOF cdf5ff20 e8adbca0 edb35798 e9b58d54 e8adbca8
1162: -> POSIX ADVISORY WRITE 26200 08:31:145326099 0 EOF e8adbc9c cd5a7798 cdf5ff24 e9b58d54 cd5a77a0
1162: -> POSIX ADVISORY WRITE 26969 08:31:145326099 0 EOF cd5a7794 d1695628 e8adbca0 e9b58d54 d1695630
1162: -> POSIX ADVISORY WRITE 13275 08:31:145326099 0 EOF d1695624 d1b45e6c cd5a7798 e9b58d54 d1b45e74
1162: -> POSIX ADVISORY WRITE 17679 08:31:145326099 0 EOF d1b45e68 cdf5f5cc d1695628 e9b58d54 e9b58d60

The PIDs waiting for locks correspond to the PIDs of the stuck lmtpd processes. At that point I restarted Cyrus, because I wanted to get the system going again. Only later did I read that the actual problem is more likely caused by a locked quota file. Unfortunately I did not further examine PID 1974, which held the lock.

Anyway, this past Saturday it happened again when I was not around. Ultimately the system shut itself down, because the Red Hat cluster software wasn't able to update its raw device any longer. So when I examined the case, all I could go on were log files. Still I was able to piece it together, I think.

It started here:

May 14 09:04:59 lvr1 imapd[14301]: login: vpn82-4.vpn.Uni-Koeln.DE[134.95.82.4] s0283928 plaintext+TLS
May 14 09:04:59 lvr1 imapd[14301]: seen_db: user s0283928 opened /var/lib/imap/user/B/s0283928.seen
May 14 09:04:59 lvr1 imapd[14301]: open: user s0283928 opened Drafts

As you can see a VPN dial-up connection was used. I think it likely that the notorious APPEND problem happened. i.e. the connection was dropped by the client while the server was doing an APPEND. This is bug 1270 <https://bugzilla.andrew.cmu.edu/show_bug.cgi?id=1270> and is marked FIXED. For completeness' sake: after the IMAP session above, hte lmtpd processes started getting stuck when delivering to this user:

May 14 15:59:44 lvr1 lmtpd[3879]: duplicate_check: <[EMAIL PROTECTED]> user.s0283928
0
May 14 16:42:57 lvr1 lmtpd[26559]: duplicate_check: <[EMAIL PROTECTED]> user.s0283928
0
May 14 17:26:45 lvr1 lmtpd[28635]: duplicate_check: <[EMAIL PROTECTED]> user.s0283928
0
May 14 18:08:01 lvr1 lmtpd[32017]: duplicate_check: <[EMAIL PROTECTED]> user.s0283928
0
May 14 18:43:25 lvr1 lmtpd[6860]: duplicate_check: <[EMAIL PROTECTED]> user.s0283928
0
May 14 19:34:23 lvr1 lmtpd[28641]: duplicate_check: <[EMAIL PROTECTED]> user.s0283928
0
May 14 19:49:37 lvr1 lmtpd[26587]: duplicate_check: <[EMAIL PROTECTED]> user.s0283928
0
May 14 20:44:45 lvr1 lmtpd[2960]: duplicate_check: <[EMAIL PROTECTED]> user.s0283928
0
May 14 21:24:31 lvr1 lmtpd[16921]: duplicate_check: <[EMAIL PROTECTED]> user.s0283928
0
May 14 21:45:35 lvr1 lmtpd[3571]: duplicate_check: <[EMAIL PROTECTED]> user.s0283928
0
May 14 22:25:25 lvr1 lmtpd[31517]: duplicate_check: <[EMAIL PROTECTED]> user.s0283928
0
May 14 22:59:53 lvr1 lmtpd[26060]: duplicate_check: <[EMAIL PROTECTED]> user.s0283928
0
May 14 23:37:39 lvr1 lmtpd[4317]: duplicate_check: <[EMAIL PROTECTED]> user.s0283928
0
May 15 00:08:15 lvr1 lmtpd[930]: duplicate_check: <[EMAIL PROTECTED]> user.s0283928


So, is this really fixed in 2.2.3? Has anybody seen something like this since? And would it be OK to apply the patch to 2.1.16? Or does it depend on other stuff that happened in CVS between 2.1.16 and the time the patch was created?

Thanks, Sebastian Hagedorn
--
Sebastian Hagedorn M.A. - RZKR-R1 (Gebäude 52), Zimmer 18
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
Universität zu Köln / Cologne University - Tel. +49-221-478-5587

Attachment: pgpYGq03M7kK6.pgp
Description: PGP signature



Reply via email to