I have 40 identically configured systems that hit the PCIe error below. 
Roughly every six months, they go through a cycle where they generate this 
error, usually all forty within about three weeks, and then they are good for 
months. Bad juju.

The systems are Intel SR2625URLXR, 9207-8i, Intel 910, and 9205-8e on L5630 
CPUs with 96 GB of RAM. The result of the failure is that zfs and zpool 
commands hang on the Intel 910 card. Regular file system disk I/O is okay, but 
zpool and zfs commands hang. 

I am looking for a workaround, as the storage continues to work for 
applications despite the error. Perhaps the error could be masked before FMD 
takes action? Maybe ZFS gets internally hosed before FMD takes action; I don't 
know. The hang seems to be in ZFS, where the system thinks the storage is 
hosed and zfs/zpool commands hang. As I say, regular file system I/O works 
just peachy. Does anyone have any ideas on how to overcome this problem 
without rebooting?

I use clones of file systems to stand up short-lived databases to run long 
batch queries against, and when this happens I tend to have fairly crappy 
workday satisfaction.

Perhaps this is related to:
https://www.illumos.org/issues/315

http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/mostViewedDisplay?javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken&javax.portlet.prp_efb5c0793523e51970c8fa22b053ce01=wsrp-navigationalState%3DdocId%253Demr_na-c03652921-1%257CdocLocale%253Den_US&javax.portlet.tpst=efb5c0793523e51970c8fa22b053ce01&sp4ts.oid=4091412&ac.admitted=1389635734908.876444892.492883150

It seems Oracle may have patched similar issues.

Thanks,
j.


root@db020:~# fmadm faulty -ai
--------------- ------------------------------------  -------------- ---------
TIME            CACHE-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jan 08 13:47:15 2a74a865-ba4e-c3b0-e437-e0e34ba53623  PCIEX-8000-0A  Critical  

Host        : db020
Platform    : S5520UR   Chassis_id  : ............
Product_sn  : 

Fault class : fault.io.pciex.device-interr
Affects     : 
dev:////pci@0,0/pci8086,340c@5/pci111d,806a@0/pci111d,806a@4/pci1000,3020@0
                  faulted and taken out of service
FRU         : "FH PCIE-SLOT2 x8" 
(hc://:product-id=S5520UR:server-id=db020:chassis-id=............/motherboard=0/hostbridge=2/pciexrc=2/pciexbus=4/pciexdev=0)
                  faulty

Description : A problem was detected for a PCIEX device.
              Refer to http://sun.com/msg/PCIEX-8000-0A for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : Schedule a repair procedure to replace the affected device.  Use
              fmadm faulty to identify the device or contact Sun for support.

