Hi everyone, I'm new to this listserv. I run Air Quality and Meteorological
models on an HPC cluster, which is all CentOS except for one storage server
running OpenIndiana (SunOS 5.11 oi_151a9 November 2013). I know, I know, it's a
ridiculously old installation, please don't bug me about that.
I'd like to figure out what happened to that server this past weekend, and
whether there's anything I can do to keep the problem described below from
happening again.
OpenIndiana is running on a Supermicro box with a SAS-attached JBOD: about 85
spinning disks in two ZFS pools, one of SAS disks and the other of SATA disks.
Periodically, when load gets too high, it becomes unresponsive for 5-30
minutes, but if we're patient enough it comes back. The load (as reported by
/usr/bin/top) immediately after such an event is ~200, which rapidly falls back
to a more normal ~0.5.
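Since the hangs recur, it might be worth logging the load continuously so the
next event is captured from start to finish. A minimal POSIX-shell sampler
along these lines could run from cron or a screen session (the log path and
interval here are arbitrary choices, not anything from the server):

```shell
#!/bin/sh
# Take COUNT load-average samples, one every INTERVAL seconds,
# appending each as a timestamped line to LOGFILE.
# Defaults (/var/tmp/loadlog, 60s) are arbitrary picks for this example.
log_load_samples() {
    count=$1
    : "${LOGFILE:=/var/tmp/loadlog}"
    : "${INTERVAL:=60}"
    while [ "$count" -gt 0 ]; do
        printf '%s | %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(uptime)" >> "$LOGFILE"
        count=$((count - 1))
        if [ "$count" -gt 0 ]; then
            sleep "$INTERVAL"
        fi
    done
}
# usage: log_load_samples 1440   # one sample per minute for about a day
```

A gap in the timestamps then marks exactly when the box stopped responding.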
Two days ago, on Sunday evening, it went off into la-la land again, but after a
few hours it hadn't come back. The IPMI interface was also not responding, so I
couldn't reboot it remotely. I went into the office on Monday morning, shut
down the server, and pulled the power cords for 20 seconds. Complete removal
of power often helps in situations like this, I've found.
The server then entered an endless loop: it would try to boot, time out about
six times (~5 minutes per timeout) with the following message, then kernel
panic and reboot.
Warning: /pci@0,0/pci8086,3c04@2/pci100,3020@0 (mpt_sas10):
Disconnected command timeout for Target 156
...repeat...
panic[cpu0]/thread=ffffff01e80cbc40: I/O to pool 'pool0' appears to be hung.
Great! This OS already has so many names for disks, and here's another one:
which disk is Target 156? Sometimes it was Target 75, sometimes Target 150. Or
is that a SAS expander? I couldn't log in to check; it never got that far
before the kernel panic and reboot.
I was able to boot into single-user mode (append -s to the grub line containing
"kernel") and poked around until I found two disks that were reporting errors.
fmdump -eV was useful, though so verbose it took a while to figure out what to
read. The best/clearest method was echo | format, which is not a command I
would have guessed based on decades of experience with Linux ;-). I pulled the
two bad disks and rebooted... and it went right back into the endless
panic-reboot loop.
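For anyone retracing these steps, the digging above amounts to something like
this (typed from memory, so double-check the flags against the man pages on
your release):

```shell
# One line per fault-management error event -- far easier to scan
# than the full verbose dump:
fmdump -e

# Verbose dump, narrowed to the device paths involved (ZFS ereports
# usually carry a vdev_path member; adjust the pattern if yours differ):
fmdump -eV | grep -i vdev_path

# Non-interactive disk list: format normally prompts for a selection,
# but feeding it EOF prints the table of disks and exits.
echo | format
```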
I eventually found this page:
https://docs.oracle.com/cd/E23824_01/html/821-1448/gbbwc.html#scrolltoc and
followed this procedure:
* When the booting gets to the grub stage, press e to edit
* Scroll to the line containing "kernel" and press e again to edit
* At the end of the line, add the text -m milestone=none and press enter
* Press b to boot
* Login as root
* (The root filesystem [mounted at /] was already read-write, not
read-only, for me)
* Rename /etc/zfs/zpool.cache to something else
* Reboot (svcadm milestone all didn't work for me)
* Login as root
* Type zpool import (with no arguments) and verify that all pools are listed as available for import
* Type zpool import -a and suddenly everything was back to normal!
(Yes, I typed all that out so that someone searching can find the step-by-step
recipe when this happens to them; the link above is not great for beginners.)
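For the searchers, here's the same recovery condensed into the commands
actually typed (zpool.cache lives in its standard location; the .bad suffix is
just my choice for the rename):

```shell
# At the grub menu: press e, select the line containing "kernel",
# press e again, append "-m milestone=none", press Enter, then b to boot.

# Logged in as root, with / mounted read-write:
mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bad   # rename, don't delete
reboot

# After the reboot, log in as root again:
zpool import        # no arguments: just lists pools available for import
zpool import -a     # actually imports everything it found
```

Renaming rather than deleting the cache file means you can put it back if the
import goes sideways.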
Any suggestions on what to look for, and where to look (which logs), would be
greatly appreciated. Suggestions about upgrading or migrating to new hardware
are not necessary; I already know. It's all about money, and with the GDP
outlook for 2020 due to COVID-19, it looks like I'll have to keep this server
limping along a while longer.
Thanks,
Bart
_______________________________________________
openindiana-discuss mailing list
[email protected]
https://openindiana.org/mailman/listinfo/openindiana-discuss