I've got two OpenBSD 3.9 firewalls/routers in a CARP configuration. They are
both IBM NetFinity 40004 servers with dual P3 650MHz chips and 512MB of
memory each. Twice now, the backup firewall has disappeared from my Nagios
monitoring and I've found (through remote serial console) only a ddb{1}>
prompt.
According to man ddb, this can happen when the kernel panics or when a break
signal is sent from the console (and ddb.console is set to 1). In my case,
no one is using the console at these times and ddb.console is set to 0
anyway. However, "show panic" seems to indicate it wasn't a kernel panic
either:
ddb{1}> show panic
the kernel did not panic
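For reference, here is how the two ddb-related knobs would look in
/etc/sysctl.conf. This is only a sketch, not a copy of my actual file:
ddb.console=0 matches the setting I described above, while the ddb.panic
line is an assumption on my part that the machine still has the stock
default, which drops into ddb when the kernel panics.

```
# 0 = do not permit entry into ddb from a console break (as on my systems)
ddb.console=0
# 1 = drop into ddb on a kernel panic (assumed stock default, not verified)
ddb.panic=1
```

If those values are right, neither documented path into ddb explains the
prompt I found.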
I feel like I'm missing something obvious here. Is there some undocumented
condition that can cause a system to drop to ddb, or am I investigating this
the wrong way? I tried using trace and hangman to gather more information, but
hangman just confused the hell out of me and the trace command gave me:
apm_cpu_idle(0,0,0,0,0) at apm_cpu_idle+0x4a
After a few more investigative commands, I started getting only "Faulted in
DDB; continuing..." and tried rebooting. "boot dump" yielded a nonresponsive
system and a trip to the datacenter to cold boot the machine.
Anyone have any ideas? Perhaps I can disable part of APM and avoid this
problem in the future? What other techniques can I use to debug this if it
happens again - is there a good doc out there that is a little more
descriptive than man ddb?
--
Regards,
Neil Schelly
Senior Systems Administrator
W: 978-667-5115 x213
M: 508-410-4776
OASIS Open http://www.oasis-open.org
"Advancing E-Business Standards Since 1993"