Date: Tue, 04 Sep 2001 18:31:10 -0400
From: Scott Adkins <[EMAIL PROTECTED]>
We stopped the server and nuked the delivery files altogether. This only
worked for 2 days, however, and now we are back to where we were. We see
a constant stream of duplicate delivery database errors...
Hmmm. Something is definitely crashing. Do you get any core files?
> To run recover, you must kill all lmtpd's (and any other processes that
> might have the database open) and run ctl_deliver -r. Just stopping
> (and waiting) and starting the master process should do this.
I was curious how we could do this without stopping our production server.
So, basically, I need to turn off our cron jobs that process the mail queues
(thus, talking to lmtp via TCP), kill off any sendmail's currently doing
queue processing, then kill off any lmtp daemons. At that point, I can
run the ctl_deliver process.
Yes, this should work.
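
As a rough sketch of that sequence (not part of Cyrus, just an illustration): once the cron jobs and queue runners are stopped, check that no lmtpd is left before invoking `ctl_deliver -r`. The helper names, the `BLOCKERS` tuple and the assumption that `ctl_deliver` is on the PATH are all hypothetical; only `lmtpd` and `ctl_deliver -r` come from the thread.

```python
import subprocess

# Processes that may hold the delivery database open. Only "lmtpd" is named
# in the thread; add whatever else opens the database on your system.
BLOCKERS = ("lmtpd",)


def blocking_processes(process_names, blockers=BLOCKERS):
    """Return the entries of process_names that must exit before recovery."""
    return [name for name in process_names if name in blockers]


def running_process_names():
    """List command names of all running processes using `ps -A -o comm=`."""
    out = subprocess.run(
        ["ps", "-A", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    # Some systems print a full path; keep only the final component.
    return [line.strip().rsplit("/", 1)[-1]
            for line in out.splitlines() if line.strip()]


def recover(ctl_deliver="ctl_deliver", dry_run=True):
    """Run `ctl_deliver -r` only when no blocking process is still alive."""
    left = blocking_processes(running_process_names())
    if left:
        raise RuntimeError("still running: " + ", ".join(sorted(set(left))))
    cmd = [ctl_deliver, "-r"]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

With `dry_run=True` (the default here) the function only returns the command it would run, so the check can be tried on a live system without touching the database.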
So, does ctl_deliver actually clear out all the locks in the database as
part of the recovery operation?
That is my understanding.
> Possibly a system crash or a process crash/being killed at just the
> wrong time, due to the lack of transactions.
I don't know, but this strikes me as being an extremely fragile system.
We seem to have database errors more often than not. We also have caught on
a number of occasions some lmtp processes getting stuck, spinning at 99.9%
CPU in the process table. Throwing a debugger at them shows they are busy
waiting for a lock to become available. We usually have to kill them off.
So, what kind of information should I provide you to help track down the
problem? It sounds like there is either a bug, or something has to be
added to increase the robustness and recovery of the db3 locking mechanism.
We haven't seen this and have many, many lmtpds running in parallel.
Unfortunately, the db3 locking system is this fragile; it's something
that has always made me somewhat hesitant about committing to
Berkeley db.
Generally the important thing to find out in this situation is which
lmtpd is holding the lock or (more likely) where did it crash while it
was holding the lock and why. This is tricky, since the lmtpd that
crashed isn't around anymore showing you what's up.
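
Since the crashed lmtpd is gone, a core file is usually the only record of where it died. One quick thing to rule out is a core size limit of zero, which would explain never seeing core files. A minimal sketch, assuming a Unix system and that it is run in the same environment that starts the master process (the limit is inherited by child processes); the function name is hypothetical, and other system settings can also suppress core files:

```python
import resource


def core_dumps_enabled(soft_limit):
    """True when the soft RLIMIT_CORE value allows a non-empty core file."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit > 0


# Inspect the limits of the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("core dumps enabled:", core_dumps_enabled(soft))
```

If this reports that core dumps are disabled, raising the limit (for example with `ulimit -c unlimited` in the shell that launches the server) before the next crash should leave something for a debugger to inspect.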
Larry