[OpenIndiana-discuss] mask pci-e errors

jason matthews Mon, 13 Jan 2014 10:00:44 -0800


I have 40 identically configured systems that hit the PCIe error below. Roughly
every six months they go through a cycle where all forty usually generate this
error within about three weeks of each other, and then they are good for months.
Bad juju.

The systems are Intel SR2625URLXR, 9207-8i, Intel 910, and 9205-8e on L5630
CPUs with 96 GB of RAM. The result of the failure is that zfs and zpool
commands hang on the Intel 910 card. Regular file system disk I/O is okay, but
zpool and zfs commands hang.

I am looking for a workaround, as the storage continues to work for
applications despite the error. Perhaps the error could be masked before FMD
takes action? Maybe ZFS gets internally hosed before FMD takes action; I don't
know. The hang-up seems to be in zfs, where the system thinks the storage is
hosed and zfs/zpool commands hang. As I say, regular file system I/Os work just
peachy. Does anyone have any ideas on how to overcome this problem without
rebooting?

I use clones of file systems to stand up short-lived databases to run long
batch queries against, and when this happens I tend to have fairly crappy
workday satisfaction.

Perhaps this is related to:
https://www.illumos.org/issues/315

http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/mostViewedDisplay?javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken&javax.portlet.prp_efb5c0793523e51970c8fa22b053ce01=wsrp-navigationalState%3DdocId%253Demr_na-c03652921-1%257CdocLocale%253Den_US&javax.portlet.tpst=efb5c0793523e51970c8fa22b053ce01&sp4ts.oid=4091412&ac.admitted=1389635734908.876444892.492883150

It seems Oracle may have patched similar issues.
thanks,
j.

root@db020:~# fmadm faulty -ai
--------------- ------------------------------------ -------------- ---------
TIME CACHE-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jan 08 13:47:15 2a74a865-ba4e-c3b0-e437-e0e34ba53623 PCIEX-8000-0A Critical

Host : db020
Platform : S5520UR Chassis_id : ............
Product_sn :

Fault class : fault.io.pciex.device-interr
Affects :
dev:////pci@0,0/pci8086,340c@5/pci111d,806a@0/pci111d,806a@4/pci1000,3020@0
faulted and taken out of service
FRU : "FH PCIE-SLOT2 x8"
(hc://:product-id=S5520UR:server-id=db020:chassis-id=............/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=4/pciexdev=0)
faulty

Description : A problem was detected for a PCIEX device.
Refer to http://sun.com/msg/PCIEX-8000-0A for more information.

Response : One or more device instances may be disabled

Impact : Loss of services provided by the device instances associated with
this fault

Action : Schedule a repair procedure to replace the affected device. Use
fmadm faulty to identify the device or contact Sun for support.
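
Since the same fault recurs across many hosts, it can help to collect the
`fmadm faulty` output from each machine and pull out the fields that matter
(time, cache ID, message ID, host, fault class, affected device path) so the
cycle can be lined up by date. The following is only a minimal sketch, assuming
the output has been saved as text in the layout shown above; `parse_faults` and
`SAMPLE` are hypothetical names for illustration, not part of any FMA tooling.

```python
import re

# Trimmed copy of the output above, used as example input (illustrative only).
SAMPLE = """\
TIME CACHE-ID MSG-ID SEVERITY
Jan 08 13:47:15 2a74a865-ba4e-c3b0-e437-e0e34ba53623 PCIEX-8000-0A Critical

Host : db020
Fault class : fault.io.pciex.device-interr
Affects :
dev:////pci@0,0/pci8086,340c@5/pci111d,806a@0/pci111d,806a@4/pci1000,3020@0
faulted and taken out of service
"""

# Summary line: timestamp, cache ID (UUID), message ID, severity.
EVENT_RE = re.compile(
    r"^(?P<time>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+"
    r"(?P<msg_id>\S+)\s+(?P<severity>\S+)$"
)


def parse_faults(text):
    """Return one dict per fault summary line found in saved fmadm output."""
    faults = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = EVENT_RE.match(line)
        if match:
            # A new summary line starts a new fault record.
            current = match.groupdict()
            current.update(host=None, fault_class=None, devices=[])
            faults.append(current)
            continue
        if current is None:
            continue  # skip headers and rules before the first fault
        if line.startswith("dev:"):
            current["devices"].append(line)
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Host":
            current["host"] = value.strip()
        elif sep and key.strip() == "Fault class":
            current["fault_class"] = value.strip()
    return faults


if __name__ == "__main__":
    for fault in parse_faults(SAMPLE):
        print(fault["time"], fault["host"], fault["msg_id"], fault["fault_class"])
```

Run against the saved output from all forty hosts, the resulting records could
be sorted by time to see how tightly the failures cluster.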

