Hi, I've got a server running OpenIndiana 148 on a Supermicro *X8ST3-F* that has been working perfectly for months right up until I added some more storage.
The board has 6 SATA ports and 8 SAS ports. Previously all the drives in my storage pool were attached to the 8 SAS ports and only my rpool drive was using one of the SATA ports. Now that I have added another 4 drives I've had to connect them to the SATA ports, and this is when the system started to become unstable.

I have had periods of very heavy usage that have caused no problems whatsoever (for example, I copied 4 TB of data onto the pool, most of which would have had to go on the new drives, then did several scrubs over the next few days). The system seems perfectly happy to sustain 350 MB/s+ of reads or writes (or a bit of both) for hours on end with no errors at all. Then at other times, typically overnight or early morning when it's just ticking over with < 500 KB/s read/write, it will fall apart.

There are three kinds of failure I'm experiencing, seemingly at random:

1. Errors about failed reads/writes on 2 or 4 SATA drives in /var/adm/messages and system I/O hung. The system has to have the power cut to recover: ssh won't connect and I can't get past the username prompt on the terminal. No ZFS errors are reported.

2. Errors about failed reads/writes, system I/O NOT hung, ZFS reporting faulted drives (2 or 4) and hundreds of thousands of errors. In this scenario the machine can be rebooted cleanly BUT the failed drives don't get detected by the BIOS. Usually a full power down, a 30 second wait, then powering back up will allow the drives to be detected again. When it powers back up ZFS will report lots of errors but sort itself out after a resilver. I haven't actually had any permanent data loss yet; ZFS has always recovered.

3. No errors at all in either /var/adm/messages or zpool status, but hung I/O.

I've swapped the drive connections around to prove it isn't the new disks that are at fault, and this has confirmed that it's whichever devices are connected to the SATA controller that have the problem.

When I rebooted the machine after the latest failure I checked /var/adm/messages and there are thousands of messages (9995 in total, but that may be from several reboots) identical to the following:

"[ID 954099 kern.info] NOTICE: IRQ19 is being shared by drivers with different interrupt levels."

In case it's useful:

cs2dsb@chronos:~$ echo ::interrupts -d | pfexec mdb -k
IRQ  Vect  IPL  Bus  Trg  Type   CPU  Share  APIC/INT#  Driver Name(s)
9    0x80  9    PCI  Lvl  Fixed  1    1      0x0/0x9    acpi_wrapper_isr
11   0xd1  14   PCI  Lvl  Fixed  2    1      0x0/0xb    hpet_isr
16   0x84  9    PCI  Lvl  Fixed  7    1      0x0/0x10   uhci#0
18   0x82  9    PCI  Lvl  Fixed  5    2      0x0/0x12   uhci#5, ehci#0
19   0x86  9    PCI  Lvl  Fixed  3    6      0x0/0x13   uhci#4, uhci#2, pci-ide#0, pci-ide#1, pci-ide#1, pci-ide#0
21   0x85  9    PCI  Lvl  Fixed  0    1      0x0/0x15   uhci#1
23   0x83  9    PCI  Lvl  Fixed  6    2      0x0/0x17   uhci#3, ehci#1
24   0x81  7    PCI  Edg  MSI    4    1      -          pcieb#4
25   0x60  6    PCI  Edg  MSI    1    1      -          e1000g#0
26   0x61  6    PCI  Edg  MSI    2    1      -          e1000g#1
27   0x40  5    PCI  Edg  MSI    3    1      -          mpt#0
32   0x20  2         Edg  IPI    all  1      -          cmi_cmci_trap
160  0xa0  0         Edg  IPI    all  0      -          poke_cpu
208  0xd0  14        Edg  IPI    all  1      -          kcpc_hw_overflow_intr
209  0xd3  14        Edg  IPI    all  1      -          cbe_fire
210  0xd4  14        Edg  IPI    all  1      -          cbe_fire
240  0xe0  15        Edg  IPI    all  1      -          xc_serv
241  0xe1  15        Edg  IPI    all  1      -          apic_error_intr

cs2dsb@chronos:~$ echo ::interrupts | pfexec mdb -k
IRQ  Vect  IPL  Bus  Trg  Type   CPU  Share  APIC/INT#  ISR(s)
9    0x80  9    PCI  Lvl  Fixed  1    1      0x0/0x9    acpi_wrapper_isr
11   0xd1  14   PCI  Lvl  Fixed  2    1      0x0/0xb    hpet_isr
16   0x84  9    PCI  Lvl  Fixed  7    1      0x0/0x10   uhci_intr
18   0x82  9    PCI  Lvl  Fixed  5    2      0x0/0x12   uhci_intr, ehci_intr
19   0x86  9    PCI  Lvl  Fixed  3    6      0x0/0x13   uhci_intr, uhci_intr, ata_intr, ata_intr, ata_intr, ata_intr
21   0x85  9    PCI  Lvl  Fixed  0    1      0x0/0x15   uhci_intr
23   0x83  9    PCI  Lvl  Fixed  6    2      0x0/0x17   uhci_intr, ehci_intr
24   0x81  7    PCI  Edg  MSI    4    1      -          pcieb_intr_handler
25   0x60  6    PCI  Edg  MSI    1    1      -          e1000g_intr_pciexpress
26   0x61  6    PCI  Edg  MSI    2    1      -          e1000g_intr_pciexpress
27   0x40  5    PCI  Edg  MSI    3    1      -          mpt_intr
32   0x20  2         Edg  IPI    all  1      -          cmi_cmci_trap
160  0xa0  0         Edg  IPI    all  0      -          poke_cpu
208  0xd0  14        Edg  IPI    all  1      -          kcpc_hw_overflow_intr
209  0xd3  14        Edg  IPI    all  1      -          cbe_fire
210  0xd4  14        Edg  IPI    all  1      -          cbe_fire
240  0xe0  15        Edg  IPI    all  1      -          xc_serv
241  0xe1  15        Edg  IPI    all  1      -          apic_error_intr

So, basically two questions:

1. How do I fix this IRQ issue so that I don't get those warnings during boot up?
2. Is this likely to be the cause of the drive problems described above?

Any advice would be much appreciated.

Thanks,
Daniel
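
P.S. In case it saves anyone from squinting at the interrupt listings above, here is a rough sketch of a filter that prints only the IRQs whose Share count is greater than 1. The function name shared_irqs is made up for this post (it is not part of mdb), and the filter assumes the Trg column only ever contains Lvl or Edg, as in the output I pasted, with Share being the third field after it. I have not run it against a live mdb session, so treat it as a starting point rather than a finished tool.

```shell
#!/bin/sh
# shared_irqs (hypothetical helper): read `::interrupts` output on stdin and
# print one line per IRQ whose Share count is greater than 1.
# Assumption: the Trg column is always "Lvl" or "Edg". The Bus column is blank
# on IPI rows, so fields are located relative to Trg instead of by fixed index.
shared_irqs() {
    awk '
        $1 == "IRQ" { next }                  # skip the header row
        {
            for (i = 1; i <= NF; i++) {
                if ($i == "Lvl" || $i == "Edg") {
                    share = $(i + 3)          # Trg, Type, CPU, then Share
                    if (share + 0 > 1) {
                        names = $(i + 5)      # first driver/ISR name
                        for (j = i + 6; j <= NF; j++)
                            names = names " " $j
                        printf "IRQ %s shared by %s: %s\n", $1, share, names
                    }
                    break
                }
            }
        }
    '
}
```

It would be used as `echo ::interrupts -d | pfexec mdb -k | shared_irqs`. Fed the listing above, it should pick out IRQ 18, IRQ 19 and IRQ 23, with IRQ 19 being the one that carries both pci-ide instances alongside two uhci controllers.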
