Interesting and good information. We happen to use Big Brother/Xymon for monitoring and have multiple scripts to check things like cache, locks, etc. We will get notified when these sense a problem, but at 1 AM on a Saturday, getting notified and then fixing the issue before all those services are impacted is a little scary. That's why we were contemplating whether it would be wise to "hit it with a hammer" (just stop slapd) until we are able to intervene and repair.
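Something along these lines is what we had in mind for the "hammer", run from cron: do a base-scope search against the local slapd and, if it fails, stop the service so clients fail over to the other servers in their ldap.conf. This is only a rough, untested sketch; python-ldap, the probe DN (ou=People,o=example.com) and the stop command are all assumptions that would need adapting to your setup.

#!/usr/bin/env python
# Cron-driven watchdog sketch: probe the local slapd and, if the probe
# fails, stop it so clients fail over to the other servers in ldap.conf.
# Assumptions: python-ldap is installed, the probe DN exists in the
# database, and "service ldap stop" is the right stop command here.
import subprocess
import sys

import ldap

LDAP_URI = 'ldap://localhost'
PROBE_DN = 'ou=People,o=example.com'    # any known-present entry
STOP_CMD = ['service', 'ldap', 'stop']  # adjust for your init system

def probe():
    """Return True if slapd answers a base search on the probe entry."""
    try:
        conn = ldap.initialize(LDAP_URI)
        conn.set_option(ldap.OPT_NETWORK_TIMEOUT, 5)
        conn.set_option(ldap.OPT_TIMEOUT, 5)
        conn.simple_bind_s()  # anonymous bind
        result = conn.search_s(PROBE_DN, ldap.SCOPE_BASE,
                               '(objectClass=*)', ['objectClass'])
        return len(result) == 1
    except ldap.LDAPError:
        return False

if __name__ == '__main__':
    if probe():
        sys.exit(0)
    # Probe failed: hit it with the hammer, then let the existing
    # Xymon/mail alerting page the admin.
    subprocess.call(STOP_CMD)
    sys.exit(1)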
On Wed, 30 Mar 2011 18:43 -0400, "Aaron Richton" <[email protected]> wrote:

> So, the best defense is a good offense in this case, and if you were
> running 2.4.25 with the appropriate BerkeleyDB library you'd likely not
> see an issue of this manner.
>
> With that said, there was a time (with earlier releases of OpenLDAP) when
> we had this issue (one bdb go down, with the service apparently working
> via an overly simple smoke test). Not being fans of being bitten by the
> same failure mode twice, we wrote up a Nagios check that searches a
> known-present-on-disk entry that is in each of our databases. (You can
> either create one, or (ab)use "ou=People" if you're RFC2307, or use
> "cn=Manager" or what have you...) If any database doesn't return a hit,
> time for us to get a call.
>
> As an aside, I find this thoroughly fascinating timing. Not that it'll
> make you feel any better in the present case, but I was just considering
> writing something up for the next LDAPCon on how we do monitoring (there
> are ~10 angles we check from, many of them due to real life situations
> similar to yours). They're all relatively simple ideas like the above, but
> I suppose cleaning up our code to the point where it's world-safe and
> getting something written up on it may be useful. They've proven
> occasionally useful for slapd(8) code issues and also, more frequently,
> useful in the face of human factors.
>
> On Wed, 30 Mar 2011, [email protected] wrote:
>
> > A while ago I posted that we were having what we thought were random bdb
> > backend crashes with the following in our log:
> >
> > bdb(o=example.com): PANIC: fatal region error detected; run recovery.
> >
> > This was on our RH5 openldap servers (2.3.43) that we were rebuilding.
> >
> > It appears that the crashes were caused by a vulnerability scanner that
> > was hitting the server (still testing), even though it was supposed to
> > be safe. We'll have to investigate what is causing it; maybe we will
> > need an ACL to stop whatever the scanner is doing. Once we stopped the
> > automated scan, the servers seem to be running as expected.
> >
> > But this brought up another issue. When the bdb backend failed, the
> > slapd process continued to run and listen on the ldap ports, and clients
> > still tried to connect to the failed server for authentication. The
> > server accepted and established the connection with the client. Of
> > course the client could not authenticate, since the backend db was down.
> > The client will not fail over to the other server that is listed in its
> > ldap.conf file, since it thinks it has a valid connection. If the slapd
> > process is not running, the failover works fine since no ports are there
> > for the client to connect to.
> >
> > I'm thinking that bdb failures will be rare once we solve the scanner
> > issue, but on a network that relies heavily on ldap, a failed bdb
> > backend with a running slapd would cause significant issues.
> >
> > Just trying to restart the slapd service doesn't fix the issue; a manual
> > recovery is required (slapd_db_recover). I was curious if anyone has put
> > something in place to deal with this potential issue? Maybe run
> > slapd_db_status via cron and, if it errors due to bdb corruption, just
> > stop slapd and let the admin know. At least the clients would be able to
> > fail over to the other ldap servers. I guess an automated recovery is
> > possible via a script, but I'm not sure if that's a good idea. Maybe
> > dealing with this type of failure is not really required; I was hoping
> > that some of you that have been doing this for a while would have some
> > insight.
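Thanks for the detail on the known-entry check; that is roughly what we were picturing. In case it helps anyone else following the thread, here is how I imagine it as a minimal Nagios-style plugin: one base search per database, CRITICAL (exit 2) if any of them comes back empty. This is just my own sketch using python-ldap with made-up probe DNs, not Aaron's actual code.

#!/usr/bin/env python
# Nagios-style sketch of the "known entry per database" check: base-search
# one entry that should always exist in each suffix, and go CRITICAL if
# any database fails to return it. Probe DNs below are examples only.
import sys

import ldap

LDAP_URI = 'ldap://localhost'
# One known-present entry per database served by this slapd.
PROBE_DNS = [
    'ou=People,o=example.com',
    'cn=Manager,o=example.com',
]

def entry_present(conn, dn):
    """True if a base-scope search on dn returns exactly one entry."""
    try:
        result = conn.search_s(dn, ldap.SCOPE_BASE,
                               '(objectClass=*)', ['objectClass'])
        return len(result) == 1
    except ldap.LDAPError:
        return False

def main():
    try:
        conn = ldap.initialize(LDAP_URI)
        conn.set_option(ldap.OPT_NETWORK_TIMEOUT, 5)
        conn.simple_bind_s()  # anonymous bind
    except ldap.LDAPError as exc:
        print('CRITICAL: cannot contact slapd: %s' % exc)
        return 2
    missing = [dn for dn in PROBE_DNS if not entry_present(conn, dn)]
    if missing:
        print('CRITICAL: no result for %s' % ', '.join(missing))
        return 2
    print('OK: all %d databases answered' % len(PROBE_DNS))
    return 0

if __name__ == '__main__':
    sys.exit(main())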
