Package: exim4-daemon-heavy
Version: 4.50-8sarge2

Under heavy SMTP load, we occasionally observe the exim4 daemon crashing
(with the result that no further connections can be accepted, obviously).
We can reproduce this here with the `postal' SMTP benchmark (package
postal) and the following command-line:

    postal -p 10 -c 10 -m 1 localhost users -

(run on the same host as exim4). The file `users' contains a single line
with an email address; in this case I used a local address aliased to
/dev/null in /etc/aliases.

While running these tests I had the whole of /var/spool/exim4 mounted on
a tmpfs (to simulate a hardware configuration with very fast disk); since
the bug is likely timing-related you may have to do the same to reproduce
it. Here it typically shows up within the first five minutes of postal's
run.

Here's a stack trace from the exim4 daemon when it crashes:

#0  0x00000000 in ?? ()
#1  0x40361825 in __pthread_sighandler () from /lib/libpthread.so.0
#2  <signal handler called>
#3  0x403d05d9 in __libc_sigaction () from /lib/libc.so.6
#4  0x4035e828 in sigaction () from /lib/libpthread.so.0
#5  0x080866c5 in os_non_restarting_signal (sig=17, handler=0x805c930 <main_sigchld_handler>) at os.c:267
#6  0x0805e9f3 in daemon_go () at daemon.c:1842
#7  0x0806e06b in main (argc=3, cargv=0xbfffdbc4) at exim.c:3922

For some reason the distributed binaries are built without debugging
symbols, so I had to rebuild the package to get this trace. Also, there's
some horrific ugliness going on with os.c in the distribution
(``#include "../src/os.c"''!) which confuses gdb; to get a working debug
build I had to concatenate the generated build-exim4-daemon-heavy/os.c
and the one in src/, since otherwise gdb got confused about which os.c
was which. Hence, the line numbers in os.c in the backtrace don't
correspond directly to anything in the exim4 source package.

Looking at the backtrace, it appears that a signal (presumably SIGCHLD;
sig=17 in frame #5) arrived while os_non_restarting_signal was running.
The SIGCHLD handler itself calls os_non_restarting_signal, and a crash
results. I'm not sure why, though -- there's nothing in the code for that
function that's obviously nonreentrant (it only uses automatic variables
and calls sigaction(2), which is async-signal-safe).
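
The call pattern visible in the backtrace can be exercised outside exim
with a small stress harness. What follows is only a sketch, not exim's
code: install_non_restarting, chld_handler and stress_reinstall are
hypothetical names, and having the handler reset SIGCHLD to SIG_DFL is an
assumption about what a handler like main_sigchld_handler might do. The
main loop keeps re-installing the handler while children are exiting, so
that a SIGCHLD has a chance of arriving in the middle of sigaction().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Counts handler invocations; only touched with async-signal-safe ops. */
static volatile sig_atomic_t handler_calls = 0;

/* Stand-in for os_non_restarting_signal(): install a handler without
   SA_RESTART, exactly as the unpatched code in the report does. */
static void install_non_restarting(int sig, void (*handler)(int))
{
    struct sigaction act;
    act.sa_handler = handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    sigaction(sig, &act, NULL);
}

/* ASSUMPTION: the handler calls the installer itself (as the report
   says) and resets the disposition to SIG_DFL. */
static void chld_handler(int sig)
{
    install_non_restarting(sig, SIG_DFL);
    handler_calls++;
}

/* Fork nchildren short-lived children and reap them, re-installing the
   SIGCHLD handler on every pass. Returns the number of children reaped. */
int stress_reinstall(int nchildren)
{
    int started = 0, reaped = 0;

    install_non_restarting(SIGCHLD, chld_handler);
    for (int i = 0; i < nchildren; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);               /* child: exit at once */
        if (pid > 0)
            started++;
    }

    while (reaped < started) {
        pid_t pid;
        /* This call is the window in which SIGCHLD may arrive. */
        install_non_restarting(SIGCHLD, chld_handler);
        while ((pid = waitpid(-1, NULL, WNOHANG)) > 0)
            reaped++;
        if (pid < 0 && errno == ECHILD)
            break;                  /* nothing left to wait for */
    }

    install_non_restarting(SIGCHLD, SIG_DFL);
    return reaped;
}
```

Calling stress_reinstall() repeatedly from a main() built once with and
once without -lpthread would be one way to check whether the threading
library matters. On a system without the problem it just returns the
number of children reaped.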

Note that exim in this case is linked against -lpthread, presumably
because of -lpq. It's possible that linking against libpthread alters the
behaviour of sigaction; I haven't had an opportunity to check whether the
-light version of the daemon has the same problem.

The following patch to src/os.c blocks the signal for which a handler is
being installed for the duration of the call to sigaction. It appears to
fix the problem, which is at least compatible with the above hypothesis,
though it is not a great fix.

--- os.c.orig   2006-07-11 18:02:09.000000000 +0100
+++ os.c        2006-07-11 18:05:15.000000000 +0100
@@ -261,11 +261,20 @@
 
 #ifdef SA_RESTART
 struct sigaction act;
+sigset_t mask, curmask;
+
+sigemptyset(&mask);
+sigprocmask(SIG_BLOCK, &mask, &curmask);
+sigaddset(&mask, sig);
+sigprocmask(SIG_SETMASK, &mask, NULL);
+
 act.sa_handler = handler;
 sigemptyset(&(act.sa_mask));
 act.sa_flags = 0;
 sigaction(sig, &act, NULL);
 
+sigprocmask(SIG_SETMASK, &curmask, NULL);
+
 #ifdef STAND_ALONE
 printf("Used sigaction() with flags = 0\n");
 #endif

-- 
Chris Lightfoot
mySociety
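
One detail of the patch above: it uses SIG_SETMASK with a set containing
only sig, so any other signals that were blocked on entry are briefly
unblocked while sigaction runs. A variant that adds sig to the existing
mask with SIG_BLOCK avoids that. The following is a minimal standalone
sketch of that idea, not a tested change to exim; the function name
install_handler_blocked and its return convention are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Install handler for sig without SA_RESTART, keeping sig blocked for
   the duration of the sigaction() call. Returns 0 on success and -1 on
   failure. Hypothetical helper, not part of exim. */
int install_handler_blocked(int sig, void (*handler)(int))
{
    struct sigaction act;
    sigset_t block, saved;
    int rc;

    /* Add sig to whatever is already blocked; remember the old mask. */
    sigemptyset(&block);
    sigaddset(&block, sig);
    if (sigprocmask(SIG_BLOCK, &block, &saved) != 0)
        return -1;

    act.sa_handler = handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;               /* deliberately no SA_RESTART */
    rc = sigaction(sig, &act, NULL);

    /* Restore the caller's mask. If sig became pending meanwhile and
       was not blocked before, it is delivered here, after sigaction()
       has returned. */
    if (sigprocmask(SIG_SETMASK, &saved, NULL) != 0)
        return -1;

    return rc;
}
```

POSIX leaves the behaviour of sigprocmask unspecified in a process with
more than one thread; a program that really does run several threads
would use pthread_sigmask in its place.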