Re: Dealing with BDB Crash

Mark Thu, 31 Mar 2011 04:14:49 -0700

What is 'the appropriate BerkeleyDB library'? Are you referring to a
specific *version*? I have a new OpenLDAP installation I've built with
OpenLDAP 2.4.25 & BDB 4.8.30 and have been testing in a 4-way multi-master
setup. Should I take the plunge to BDB 5.1.25?


On Wed, Mar 30, 2011 at 5:43 PM, Aaron Richton <[email protected]> wrote:

> So, the best defense is a good offense in this case, and if you were
> running 2.4.25 with the appropriate BerkeleyDB library you'd likely not see
> an issue of this manner.
>
> With that said, there was a time (with earlier releases of OpenLDAP) when
> we had this issue (one bdb go down, with the service apparently working via
> an overly simple smoke test). Not being fans of being bitten by the same
> failure mode twice, we wrote up a Nagios check that searches a
> known-present-on-disk entry that is in each of our databases. (You can
> either create one, or (ab)use "ou=People" if you're RFC2307 or use
> "cn=Manager" or what have you...) If any database doesn't return a hit, time
> for us to get a call.
>
> As an aside, I find this thoroughly fascinating timing. Not that it'll make
> you feel any better in the present case, but I was just considering writing
> something up for the next LDAPCon on how we do monitoring (there are ~10
> angles we check from, many of them due to real life situations similar to
> yours). They're all relatively simple ideas like the above, but I suppose
> cleaning up our code to the point where it's world-safe and getting
> something written up on it may be useful. They've proven occasionally useful
> for slapd(8) code issues and also, more frequently, useful in the face of
> human factors.
>
>
> On Wed, 30 Mar 2011, [email protected] wrote:
>
>> A while ago I posted that we were having what we thought were random bdb
>> backend crashes with the following in our log:
>>
>> bdb(o=example.com): PANIC: fatal region error detected; run recovery.
>>
>> This was on our RH5 openldap servers (2.3.43) that we were
>> rebuilding.
>>
>> It appears that the crashes were caused by a vulnerability scanner that
>> was hitting the server (still testing), even though it was supposed to be
>> safe.  We'll have to investigate what is causing it, maybe we will need
>> an acl to stop whatever the scanner is doing.  Once we stopped the
>> automated scan, the servers seem to be running as expected.
>>
>> But, this brought up another issue.  When the bdb backend failed, the
>> slapd process continued to run and listen on the ldap ports and clients
>> still tried to connect to the failed server for authentication.  The
>> server accepted and established the connection with the client.  Of
>> course the client could not authenticate since the backend db was down.
>> The client will not fail over to the other server that is listed in its
>> ldap.conf file since it thinks it has a valid connection.  If the slapd
>> process is not running then failover works fine since no ports are
>> there for the client to connect to.
>>
>> I'm thinking that bdb failures will be rare once we solve the scanner
>> issue, but on a network that relies heavily on ldap, a failed bdb
>> backend with a running slapd would cause significant issues.
>>
>> Just trying to restart the slapd service doesn't fix the issue; a manual
>> recovery is required (slapd_db_recover).  I was curious if anyone has
>> put something in place to deal with this potential issue?  Maybe run
>> slapd_db_status via cron and if it errors due to bdb corruption, just
>> stop slapd and let the admin know.  At least the clients would be able
>> to failover to the other ldap servers.  I guess an automated recovery is
>> possible via a script, but I'm not sure if that's a good idea.  Maybe
>> dealing with this type of failure is not really required, I was hoping
>> that some of you who have been doing this for a while would have some
>> insight.
>>
>>
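
The cron idea in the original post (probe the backend, stop slapd on
failure, tell the admin) could be sketched roughly as below. This is only
an illustration, not anyone's production script. HEALTH_CMD and STOP_CMD
are assumptions: the tool name, its flags, the database path and the init
script name all need checking on the actual host, and Python 3.7+ is
assumed to be installed there. The three steps are passed in as callables
so that any other probe can be swapped in.

```python
import subprocess

# Assumed commands, not taken from the thread; verify both on your system.
HEALTH_CMD = ["slapd_db_stat", "-h", "/var/lib/ldap", "-e"]
STOP_CMD = ["service", "ldap", "stop"]


def run_quiet(cmd, timeout=30):
    """Run cmd with output discarded; True only if it exits with status 0."""
    try:
        proc = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # A missing binary or a hung probe counts as a failure.
        return False
    return proc.returncode == 0


def watchdog(healthy=lambda: run_quiet(HEALTH_CMD),
             stop=lambda: run_quiet(STOP_CMD),
             notify=print):
    """Stop slapd when the probe fails so clients fail over elsewhere."""
    if healthy():
        return "healthy"
    stopped = stop()
    # Under cron, anything printed is mailed to the job's owner.
    notify("slapd backend probe failed; slapd %s; manual recovery needed"
           % ("stopped" if stopped else "could NOT be stopped"))
    return "stopped" if stopped else "stop-failed"
```

Calling watchdog() from a small script run by cron every few minutes would
be one way to use it. It deliberately attempts no automated recovery; it
only takes the broken server out of the way and leaves the recovery
decision to a person. It will also keep notifying on every run until the
probe succeeds again.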

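The Nagios check Aaron describes (search one known-present entry in each
database and alert if any search comes back empty) might look something
like the sketch below. It is not his code. The DN list and server URI are
placeholders, the search shells out to the OpenLDAP ldapsearch client with
an anonymous simple bind (which assumes your ACLs let an anonymous client
read the canary entries), and Python 3.7+ is assumed on the monitoring
host.

```python
import subprocess

# Hypothetical values: list one known-present DN per database.
CANARY_DNS = ["ou=People,o=example.com"]
LDAP_URI = "ldap://localhost"

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def run_ldapsearch(uri, dn, timeout=10):
    """Base-scope search for dn; return ldapsearch's stdout ('' on failure)."""
    cmd = ["ldapsearch", "-x", "-LLL", "-H", uri,
           "-b", dn, "-s", "base", "(objectClass=*)", "1.1"]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return proc.stdout if proc.returncode == 0 else ""


def check(dns, search=run_ldapsearch, uri=LDAP_URI):
    """Return (exit_code, message); any entry without a hit is CRITICAL."""
    missing = []
    for dn in dns:
        lines = search(uri, dn).splitlines()
        # A returned entry always starts with a "dn:" (or "dn::") line.
        if not any(line.startswith("dn:") for line in lines):
            missing.append(dn)
    if missing:
        return CRITICAL, "CRITICAL: no hit for " + "; ".join(missing)
    return OK, "OK: %d database(s) answered" % len(dns)


def main():
    code, message = check(CANARY_DNS)
    print(message)
    return code

# As a plugin, finish the file with: import sys; sys.exit(main())
```

Because the check performs a real search against every database, it
catches the case from the original post where slapd still accepts
connections on its ports but one backend has stopped answering.
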