Gentlefolk,

Does anyone have experience they'd be willing to share in combating deadlocks within a BDB 3.3 duplicate delivery database on a high-traffic Cyrus v2.1.16 (or earlier 2.1.x) server? We're running a Cyrus postoffice with 60,000+ users and 1.2 million messages/day on an 8-way Solaris system, and we've recently started running into increasingly frequent deadlock problems with the duplicate suppression database.

The symptoms we're seeing are probably what you'd expect -- our cyrus.conf is set to allow up to 120 lmtpd children to run simultaneously, and when we hit a deadlock condition in the duplicate suppression database, we find that all 120 of our running lmtpds lock up waiting for write locks in the database. "truss" shows them all stuck in "lwp_sema_wait()" calls. Inspection of the duplicate database after the fact sometimes shows corruption (usually null page pointers reported by db_verify), but sometimes shows nothing -- it's possible that we're seeing two different problems with the same end effect, but I suspect the database corruption is actually a side-effect of the deadlock problem...

We've come up with a work-around that at least allows us to correct the situation without performing a master restart (with 4000+ simultaneous IMAPS connections, a master restart isn't something we can routinely do, unfortunately) -- renaming the duplicate delivery database and its log and __db* files and then kill -15'ing all the running lmtpds seems to get us back to a functional state with a fresh duplicate suppression database. We're up to seeing this happen a bit more than once a day now, though, and it's becoming seriously annoying.
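For anyone who wants to script the same work-around, here is a rough sketch of the two steps (set the database files aside, then SIGTERM the lmtpds). The directory layout (deliver.db in the config directory, with log.* and __db.* under a db/ subdirectory) and the function names are assumptions for illustration, not something taken from our setup verbatim; adjust to your own paths. Note that if other databases share that environment directory, renaming its files affects them as well.

```python
import os
import signal
import time
from pathlib import Path


def set_aside_duplicate_db(config_dir, suffix=None):
    """Rename the duplicate DB and its BDB log/region files out of the way.

    Assumed layout (hypothetical): <config_dir>/deliver.db plus
    <config_dir>/db/log.* and <config_dir>/db/__db.*.
    Returns a list of (old_path, new_path) pairs for the files renamed.
    """
    config_dir = Path(config_dir)
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")

    candidates = [config_dir / "deliver.db"]
    env_dir = config_dir / "db"
    if env_dir.is_dir():
        candidates += sorted(env_dir.glob("log.*"))
        candidates += sorted(env_dir.glob("__db.*"))

    # Skip files that an earlier run already set aside.
    candidates = [p for p in candidates if ".stuck-" not in p.name]

    renamed = []
    for old in candidates:
        if old.is_file():
            new = old.with_name(f"{old.name}.stuck-{suffix}")
            old.rename(new)
            renamed.append((old, new))
    return renamed


def terminate(pids, sig=signal.SIGTERM):
    """Send sig (SIGTERM by default, i.e. kill -15) to each pid.

    Pids that no longer exist are skipped. Returns the pids signalled.
    """
    signalled = []
    for pid in pids:
        try:
            os.kill(pid, sig)
            signalled.append(pid)
        except ProcessLookupError:
            pass
    return signalled
```

The list of lmtpd pids is left to the caller (for example, from ps output), since how you find them is site-specific.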

We're using the db3_nosync mechanism (with BDB version 3.3.11) for our dup suppression database -- one option we're strongly considering is switching to the regular "db3" mechanism (without the nosync option) to try to avoid the deadlocks, but we're a bit concerned about what that may do to lmtp throughput. Turning off duplicate suppression is...politically untenable...at this point...

We've also considered running the BDB "db_deadlock" utility to periodically detect and try to correct deadlock conditions in the duplicate suppression database, but that's also somewhat scary -- it's unclear to us exactly what an lmtpd awaiting a lock in the duplicate suppression database would do when its pending lock request got rejected by the db_deadlock daemon...
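To make the concern concrete: when a deadlock detector rejects a lock request, BDB returns DB_LOCK_DEADLOCK to the caller, and the caller is expected to abort its transaction and try again. Whether lmtpd in 2.1.x actually does this is the part we don't know. The pattern we'd hope to find looks roughly like the sketch below; DeadlockError, run_with_retry, and the callback names are hypothetical stand-ins, not Cyrus or BDB code.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a DB_LOCK_DEADLOCK return from BDB."""


def run_with_retry(txn_body, abort, max_retries=5, base_delay=0.01,
                   sleep=time.sleep):
    """Run txn_body(); on a deadlock rejection, abort and retry.

    txn_body: callable doing the transactional work (e.g. the duplicate
              check and mark); raises DeadlockError if chosen as victim.
    abort:    callable that aborts the transaction, releasing its locks.
    Re-raises DeadlockError once max_retries retries have failed.
    """
    for attempt in range(max_retries + 1):
        try:
            return txn_body()
        except DeadlockError:
            # The victim must release its locks so the other waiters
            # can make progress.
            abort()
            if attempt == max_retries:
                raise
            # Randomised exponential backoff so the same set of
            # processes is less likely to collide again immediately.
            sleep(base_delay * (2 ** attempt) * random.random())
```

If the caller instead treats the rejection as a fatal error, the likely outcome is a failed delivery attempt rather than a hang, which may still be preferable to 120 stuck lmtpds.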

Anyone have any experience or wisdom to share about either possible solution, or about other things that you've seen work in similar situations? At this point, upgrading to 2.2.x is on our radar, but probably not something we can approach before mid-semester (2-3 months out), so suggestions for solutions with Cyrus v2.1.x would be most appreciated...

--Thanx much,
--Rob Carter--
---
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
