I have 40 identically configured systems that hit the PCIe error below. Roughly every six months, give or take, they go through a cycle in which they generate this error, usually all forty within about three weeks, and then they are good for months. Bad juju.
The systems are Intel SR2625URLXR, 9207-8i, Intel 910, and 9205-8e on L5630 CPUs with 96 GB of RAM.

The result of the failure is that zfs and zpool commands hang on the Intel 910 card. Regular file system disk I/O is okay, but zpool and zfs commands hang. I am looking for a workaround, as the storage continues to work for applications despite the error.

Perhaps the error could be masked before FMD takes action? Maybe ZFS gets internally hosed before FMD takes action, I don't know. The hang-up seems to be in ZFS, where the system thinks the storage is hosed and zfs/zpool commands hang. As I say, regular file system I/Os work just peachy.

Does anyone have any ideas on how to overcome this problem without rebooting? I use clones of file systems to stand up short-lived databases to run long batch queries against, and when this happens I tend to have fairly crappy work day satisfaction.

Perhaps this is related to:

https://www.illumos.org/issues/315
http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/mostViewedDisplay?javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken&javax.portlet.prp_efb5c0793523e51970c8fa22b053ce01=wsrp-navigationalState%3DdocId%253Demr_na-c03652921-1%257CdocLocale%253Den_US&javax.portlet.tpst=efb5c0793523e51970c8fa22b053ce01&sp4ts.oid=4091412&ac.admitted=1389635734908.876444892.492883150

It seems Oracle may have patched similar issues.

thanks,
j.

root@db020:~# fmadm faulty -ai
--------------- ------------------------------------  -------------- ---------
TIME            CACHE-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jan 08 13:47:15 2a74a865-ba4e-c3b0-e437-e0e34ba53623  PCIEX-8000-0A  Critical

Host        : db020
Platform    : S5520UR   Chassis_id  : ............
Product_sn  :

Fault class : fault.io.pciex.device-interr
Affects     : dev:////pci@0,0/pci8086,340c@5/pci111d,806a@0/pci111d,806a@4/pci1000,3020@0
                  faulted and taken out of service
FRU         : "FH PCIE-SLOT2 x8" (hc://:product-id=S5520UR:server-id=db020:chassis-id=............/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=4/pciexdev=0)
                  faulty

Description : A problem was detected for a PCIEX device.
              Refer to http://sun.com/msg/PCIEX-8000-0A for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : Schedule a repair procedure to replace the affected device.
              Use fmadm faulty to identify the device or contact Sun for support.
