At 11:35 AM -0400 2005-09-01, Paul Rubin wrote:

>  * Contacting the bug forum on SourceForge and being told that bounce
>    processing is a problem that they hope to have time to fix in 2.1.6
>    (see 1077587)
Note that 2.1.6 has already been shipped, and it did make some notable
improvements in the bounce handling process.  However, I know that it
did not completely resolve bounce handling problems, because we recently
ran into some issues on python.org (running Python 2.4 and Mailman
2.1.6), where we racked up almost 100,000 bounces in the space of a few
days, in many cases getting multiple bounces within a single second.  I
ended up shutting down Mailman, moving the "bounce" subdirectory to
"bounce.old", and creating a new "bounce" directory (with the same
ownership and permissions), and after that I haven't seen these problems
recur.  Unfortunately, we don't yet know what caused our problems on
python.org, but we are still investigating.

> What top often looks like:
>
>  09:39:33  up 36 days, 15:58,  9 users,  load average: 5.23, 5.75, 4.98
>  497 processes: 485 sleeping, 8 running, 4 zombie, 0 stopped
>  CPU states: 67.8% user  16.6% system  0.0% nice  0.0% iowait  15.4% idle
>  Mem:  1029884k av,  973668k used,   56216k free,  0k shrd,  149788k buff
>                      739156k actv,   58428k in_d,  17080k in_c
>  Swap: 2040244k av,  188092k used, 1852152k free             184036k cached
>
>    PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
>  14978 mailman   16   0  324M 318M  2328 R     6.5 31.6   5:46   0
>        /usr/local/bin/python /var/mailman/bin/qrunner --runner=BounceRunner:0:1

Well, BounceRunner is taking up a lot of memory, but I don't see it
using a lot of CPU.  Looking at the system with tools like "iostat" and
"vmstat" is usually useful when you're having problems like this.  It's
especially useful if you have support for the "-x" option to iostat,
because that lets you look at things like the disk I/O wait queue, and
gives you a very good idea of whether you are being limited by
synchronous meta-data updates -- common with excessive lock contention,
or with the creation/use/deletion of lots of temporary files in a short
period of time, etc.
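For reference, a minimal sketch of that kind of iostat/vmstat spot check
(assuming the Linux sysstat and procps versions of the tools; flags
differ on other platforms):

```shell
# Extended per-device I/O statistics: 1-second interval, 2 samples.
# With sysstat's iostat, the avgqu-sz and await columns show request
# queue depth and wait times -- good clues for metadata/lock stalls.
if command -v iostat >/dev/null 2>&1; then
    iostat -x 1 2
else
    echo "iostat not found -- install the sysstat package"
fi

# vmstat's "si"/"so" columns report page-in/page-out rates; sustained
# non-zero "so" indicates real memory pressure.
if command -v vmstat >/dev/null 2>&1; then
    vmstat 1 2
else
    echo "vmstat not found -- install the procps package"
fi
```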
The "vmstat" tool comes in handy when trying to determine whether you're
seeing serious memory pressure, excessive disk cache thrashing, etc.,
during periods of high activity.  If you see high rates of page-ins and
page-outs together, that's a very bad sign.  High rates of page-ins, in
combination with relatively little free memory, are not necessarily a
bad sign -- the system may just be using memory that would otherwise sit
idle as an expanded disk cache.  But high rates of page-outs are bad
news.

You might also want to consider turning on lock debugging, pending
lockdb debugging, etc., in your mm_cfg.py.

> And immediately after a shutdown restart cycle:
>
>  09:42:12  up 36 days, 16:01,  9 users,  load average: 4.21, 5.23, 4.91
>  532 processes: 521 sleeping, 7 running, 4 zombie, 0 stopped
>  CPU states: 82.3% user  6.3% system  0.0% nice  0.0% iowait  11.3% idle
>  Mem:  1029884k av,  643420k used,  386464k free,  0k shrd,  149556k buff
>                      435668k actv,   33664k in_d,  16752k in_c
>  Swap: 2040244k av,  111832k used, 1928412k free             161048k cached
>
>    PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
>  32382 mailman   16   0 24312  23M  2660 R    27.2  2.3   0:03   0
>        /usr/local/bin/python /var/mailman/bin/qrunner --runner=ArchRunner:0:1 -s
>  32385 mailman   15   0 22036  21M  2616 R    23.0  2.1   0:02   0
>        /usr/local/bin/python /var/mailman/bin/qrunner --runner=IncomingRunner:0:
>  32388 mailman   15   0 21816  21M  2584 S    14.1  2.1   0:01   0
>        /usr/local/bin/python /var/mailman/bin/qrunner --runner=VirginRunner:0:1
>  32383 mailman   16   0 20604  20M  2568 R     6.1  2.0   0:03   0
>        /usr/local/bin/python /var/mailman/bin/qrunner --runner=BounceRunner:0:1

Well, you've certainly got a lot more memory showing as free -- memory
which I believe may previously have been allocated to BounceRunner.
This is one potential indication that you are actually suffering from
memory shortages during the periods you're discussing.
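Concretely, the lock-debugging knobs I mentioned go in mm_cfg.py.  The
names below are the ones I believe Mailman 2.1 defines in Defaults.py,
but double-check the Defaults.py shipped with your version before
relying on them:

```python
# mm_cfg.py -- site overrides, loaded after Defaults.py
from Defaults import *

# Verify these names against your Defaults.py; both log to logs/locks.
LIST_LOCK_DEBUGGING = On        # trace MailList lock acquire/release
PENDINGDB_LOCK_DEBUGGING = On   # trace pending-request database locking
```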
However, you'd have to do further investigation with tools like iostat
and vmstat during times of trouble before you'd be likely to have a good
idea of what was really going on.

>  32387 mailman   15   0 15960  15M  2712 S     0.5  1.5   0:02   0
>        /usr/local/bin/python /var/mailman/bin/qrunner --runner=OutgoingRunner:0:
>  32379 mailman   25   0  5900 5900  2596 S     0.0  0.5   0:00   0
>        /usr/bin/python /var/mailman/bin/mailmanctl -s -q start
>  32386 mailman   15   0  5840 5840  2512 S     0.0  0.5   0:00   0
>        /usr/local/bin/python /var/mailman/bin/qrunner --runner=NewsRunner:0:1 -s
>  32384 mailman   15   0  5804 5804  2512 S     0.0  0.5   0:00   0
>        /usr/local/bin/python /var/mailman/bin/qrunner --runner=CommandRunner:0:1
>  32389 mailman   25   0  5792 5792  2512 S     0.0  0.5   0:00   0
>        /usr/local/bin/python /var/mailman/bin/qrunner --runner=RetryRunner:0:1 -
>
> Notice how the other tasks suddenly get busy....

That's perfectly normal when you restart after a period of problems, as
the other parts of the system try to catch up to where they should be.

> Log files were completely flushed at 4PM yesterday.
>
> [EMAIL PROTECTED] ~]$ ls -al logs
> total 16144
> drwxrwsr-x   2 mailman mailman     4096 Aug 31 19:19 .
> drwxrwsr-x  34 mailman mailman     4096 Sep  1 08:40 ..
> -rw-rw-r--   1 mailman mailman 15793138 Sep  1 09:02 bounce
> -rw-rw-r--   1 smmsp   mailman   335919 Sep  1 09:01 error
> -rw-rw-r--   1 mailman mailman     5181 Sep  1 06:25 locks
> -rw-rw-r--   1 mailman mailman    28700 Sep  1 09:01 post
> -rw-rw-r--   1 mailman mailman    35858 Sep  1 08:32 qrunner
> -rw-rw-r--   1 mailman mailman   263590 Sep  1 09:02 smtp
> -rw-rw-r--   1 mailman mailman     1519 Sep  1 06:31 smtp-failure
> -rw-rw-r--   1 mailman mailman     6833 Sep  1 08:59 subscribe
> -rw-rw-r--   1 mailman mailman     1069 Sep  1 08:32 vette

That's a *huge* bounce log, which suggests a correspondingly huge bounce
queue directory.  One problem you can run into is that once you've got
more than roughly 1,000-10,000 files in a single directory, just
accessing that directory can take a very long time, because of the
linear table scan.
Trying to create new files in that directory can take orders of
magnitude longer than it used to, just because the system has to lock
the entire directory and then scan all of it before it can confirm that
there are no other files with the same name; only then can it release
the lock on the directory and return to the caller.

Try moving the current "bounce" directory to something like
"bounce.old", create a new one with the same ownership and permissions,
then stop and restart Mailman.  I'd be willing to bet that directory
lock contention has been a *huge* part of your problem -- I certainly
believe that it was for us on python.org.

-- 
Brad Knowles, <[EMAIL PROTECTED]>

"Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety."

     -- Benjamin Franklin (1706-1790), reply of the Pennsylvania
        Assembly to the Governor, November 11, 1755

  SAGE member since 1995.  See <http://www.sage.org/> for more info.

------------------------------------------------------
Mailman-Users mailing list
Mailman-Users@python.org
http://mail.python.org/mailman/listinfo/mailman-users
Mailman FAQ: http://www.python.org/cgi-bin/faqw-mm.py
Searchable Archives: http://www.mail-archive.com/mailman-users%40python.org/
Unsubscribe: http://mail.python.org/mailman/options/mailman-users/archive%40jab.org
Security Policy: http://www.python.org/cgi-bin/faqw-mm.py?req=show&file=faq01.027.htp
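P.S.  A sketch of the directory rotation described above, for the
record.  The paths are illustrative -- on python.org this was Mailman's
bounce queue directory, and you'd run it as root with Mailman stopped
(bin/mailmanctl stop/start); the `stat -c` usage assumes GNU coreutils:

```shell
# rotate_dir DIR -- move DIR aside to DIR.old and recreate it empty
# with the same mode.  On a real installation, also restore ownership
# (e.g. "chown mailman:mailman") and make sure no qrunner is writing
# into the directory while it moves.
rotate_dir() {
    dir=$1
    mode=$(stat -c %a "$dir")   # GNU stat; on BSD use: stat -f %Lp
    mv "$dir" "$dir.old"
    mkdir "$dir"
    chmod "$mode" "$dir"
}

# Demo on a scratch directory standing in for the real bounce queue:
demo=$(mktemp -d)/bounce
mkdir -p "$demo"
chmod 2770 "$demo"
rotate_dir "$demo"
ls -ld "$demo" "$demo.old"
```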