Hi everyone, I'm new to this listserv. I run air quality and meteorological 
models on an HPC cluster, which is all CentOS except for one storage server 
running OpenIndiana (SunOS 5.11 oi_151a9, November 2013). I know, I know, it's 
a ridiculously old installation; please don't bug me about that.

I'd like to figure out what happened to that server this past weekend, and 
whether there's anything I can do to keep the problem described below from 
happening again.

OpenIndiana runs on a Supermicro box with a SAS-attached JBOD: about 85 
spinning disks in two ZFS pools, one of SAS disks and the other of SATA. 
Periodically, when load gets too high, it becomes unresponsive for 5-30 
minutes, but if we're patient enough it comes back. The load (as reported by 
/usr/bin/top) immediately after such an event is ~200, which rapidly falls 
back to a more normal range of ~0.5.
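
If it helps anyone picture it, here's the kind of crude watchdog I'm thinking 
of leaving running so the next event at least leaves a trail (the log path 
and interval are arbitrary choices of mine):

  # Log load and pool health once a minute; keep the log on the root
  # pool, not on the data pools that hang.
  while :; do
      echo "=== $(date)" >> /var/tmp/loadwatch.log
      uptime >> /var/tmp/loadwatch.log
      zpool status -x >> /var/tmp/loadwatch.log 2>&1
      sleep 60
  done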

Two days ago, on Sunday evening, it went off into la-la land again, but after 
a few hours it hadn't come back. The IPMI interface was also not responding, 
so I couldn't reboot it remotely. I went into the office on Monday morning, 
shut down the server, then pulled the power cords for 20 seconds. The complete 
removal of power often helps in situations like this, I've found.

The server then entered an endless loop: it would try to boot, time out about 
six times (~5 minutes per timeout) with the following message, then kernel 
panic and reboot.

Warning: /pci@0,0/pci8086,3c04@2/pci100,3020@0 (mpt_sas10):
       Disconnected command timeout for Target 156
...repeat...
panic[cpu0]/thread=ffffff01e80cbc40: I/O to pool 'pool0' appears to be hung.

Great! This OS already has so many names for disks, and here's another one: 
which disk is Target 156? Sometimes it was Target 75, sometimes Target 150. Or 
is that a SAS expander? I couldn't log in to check; it never got that far 
before the kernel panicked and rebooted.
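
My best guess at decoding those target numbers (and it is only a guess): the 
mpt_sas target ID seems to appear in hex in the /devices paths, and the same 
timeout warnings should land in /var/adm/messages next to a driver instance 
name. Once I could log in, something like this might have narrowed it down:

  printf '%x\n' 156                     # 156 decimal is 9c hex
  ls -l /dev/rdsk/*s0 | grep -i 9c      # look for it in the device paths
  grep 'Target 156' /var/adm/messages   # which instance logged the timeouts?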

I was able to boot into single-user mode (append -s to the grub line 
containing "kernel") and poked around until I found two disks that were 
reporting errors. fmdump -eV was useful, though so verbose that it took a 
while to figure out what to read. The clearest method turned out to be 
echo | format, which is not a command I would have guessed at after decades of 
Linux experience ;-). I pulled the two bad disks and rebooted... and it went 
straight back into the endless panic-reboot loop.
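
For anyone following along, the single-user poking boiled down to roughly 
this (iostat -En is one more I'd suggest; I can't swear it's on every 
install):

  fmdump -eV | less   # FMA error telemetry; very verbose, look for the
                      # device paths that keep repeating
  echo | format       # lists every disk; the echo just answers format's
                      # prompt so it prints the list and exits
  iostat -En          # per-device soft/hard/transport error counters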

I eventually found this page: 
https://docs.oracle.com/cd/E23824_01/html/821-1448/gbbwc.html#scrolltoc 
and followed this procedure:

  *   When booting reaches the grub menu, press e to edit
  *   Scroll to the line containing "kernel" and press e again to edit
  *   At the end of the line, add the text -m milestone=none and press enter
  *   Press b to boot
  *   Log in as root
  *   (The root filesystem [mounted at /] was already read-write, not 
read-only, for me)
  *   Rename /etc/zfs/zpool.cache to something else
  *   Reboot (svcadm milestone all didn't work for me)
  *   Log in as root
  *   Type zpool import and verify that all pools can be imported
  *   Type zpool import -a and suddenly everything was back to normal!

(Yes, I typed that all out so someone searching could find the step-by-step 
recipe when it happens to them; the link above is not great for beginners. A 
consolidated transcript follows.)
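
Here it is again as a rough console transcript (the new name for zpool.cache 
is arbitrary):

  # At grub: press e, select the "kernel" line, press e again, append
  # "-m milestone=none", press enter, then b to boot. Log in as root:
  mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bad
  reboot

  # After the reboot, log in as root again:
  zpool import      # lists the pools and whether they can be imported
  zpool import -a   # imports them all - back to normal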

Any suggestions on what to look for, and where to look (which logs), would be 
greatly appreciated. Suggestions about upgrading or migrating to new hardware 
are not necessary; I already know. It's all about money, and with the GDP 
outlook for 2020 due to COVID-19, it looks like I'll be keeping this server 
limping along a while longer.

Thanks,

Bart
_______________________________________________
openindiana-discuss mailing list
[email protected]
https://openindiana.org/mailman/listinfo/openindiana-discuss
