When running rsnapshot backups from an IBM fibre channel disk system using LVM2 snapshots to a Promise fibre channel disk system, the qla2xxx driver causes a system crash and reboot. I'm running Lenny with kernel 2.6.22--3-vserver-amd64 and stock Debian qla2xxx module. I've already replaced the Qlogic HBA and the Qlogic switch connecting to the storage. Three other servers with similar hardware running the same Debian version don't have this problem. These events were logged with the ql2xextended_error_logging parameter enabled:
Feb 6 13:40:28 hqhost kernel: qla2xxx_eh_abort(0): aborting sp ffff8101d01aa7c0 from RISC. pid=111928. Feb 6 13:40:58 hqhost kernel: qla2x00_mailbox_command(0): timeout calling abort_isp Feb 6 13:40:58 hqhost kernel: qla2x00_mailbox_command(0): timeout calling abort_isp Feb 6 13:40:58 hqhost kernel: qla2xxx 0000:08:01.0: Mailbox command timeout occured. Issuing ISP abort. Feb 6 13:40:58 hqhost kernel: qla2xxx 0000:08:01.0: Performing ISP error recovery - ha= ffff810225a84530. Feb 6 13:40:58 hqhost kernel: scsi(0): **** Load RISC code **** Feb 6 13:40:58 hqhost kernel: scsi(0): Verifying Checksum of loaded RISC code. Feb 6 13:40:58 hqhost kernel: scsi(0): Checksum OK, start firmware. Feb 6 13:40:58 hqhost kernel: scsi(0): Issue init firmware. Feb 6 13:40:59 hqhost kernel: scsi(0): Asynchronous P2P MODE received. Feb 6 13:40:59 hqhost kernel: scsi(0): Asynchronous LOOP UP (4 Gbps). Feb 6 13:40:59 hqhost kernel: qla2xxx 0000:08:01.0: LOOP UP detected (4 Gbps). Feb 6 13:40:59 hqhost kernel: scsi(0): Asynchronous PORT UPDATE. Feb 6 13:40:59 hqhost kernel: scsi(0): Port database changed ffff 0006 0000. Feb 6 13:40:59 hqhost kernel: scsi(0): Asynchronous PORT UPDATE ignored 0000/0004/0600. Feb 6 13:40:59 hqhost kernel: scsi(0): Asynchronous PORT UPDATE ignored 0000/0007/0b00. Feb 6 13:40:59 hqhost kernel: scsi(0): F/W Ready - OK Feb 6 13:40:59 hqhost kernel: scsi(0): fw_state=3 curr time=1001756ca. Feb 6 13:40:59 hqhost kernel: qla2x00_restart_isp(): Start configure loop, status = 0 Feb 6 13:40:59 hqhost kernel: scsi(0): Configure loop -- dpc flags =0x4080048 Feb 6 13:40:59 hqhost kernel: scsi(0): RSCN queue entry[0] = [00/000000]. Feb 6 13:40:59 hqhost kernel: scsi(0): device_resync: rscn overflow. Feb 6 13:40:59 hqhost kernel: scsi(0): RFT_ID failed, completion status (280). Feb 6 13:40:59 hqhost kernel: scsi(0): Register FC-4 TYPE failed. Feb 6 13:40:59 hqhost kernel: scsi(0): RFF_ID failed, completion status (280). Feb 6 13:40:59 hqhost kernel: scsi(0): Register FC-4 Features failed. Feb 6 13:40:59 hqhost kernel: scsi(0): RNN_ID failed, completion status (280). Feb 6 13:40:59 hqhost kernel: scsi(0): Register Node Name failed. Feb 6 13:40:59 hqhost kernel: scsi(0): GID_PT failed, completion status (180). Feb 6 13:40:59 hqhost kernel: scsi(0): GA_NXT failed, rejected request: Feb 6 13:40:59 hqhost kernel: 0 1 2 3 4 5 6 7 8 9 Ah Bh Ch Dh Eh Fh Feb 6 13:40:59 hqhost kernel: -------------------------------------------------------------- Feb 6 13:40:59 hqhost kernel: 14 00 00 00 00 10 97 23 02 00 00 00 10 08 00 00 Feb 6 13:40:59 hqhost kernel: qla2xxx 0000:08:01.0: SNS scan failed -- assuming zero-entry result... Feb 6 13:40:59 hqhost kernel: scsi(0): fcport-0 - port retry count: 29 remaining Feb 6 13:40:59 hqhost kernel: scsi(0): fcport-1 - port retry count: 29 remaining Feb 6 13:40:59 hqhost kernel: scsi(0): fcport-2 - port retry count: 29 remaining Feb 6 13:40:59 hqhost kernel: qla24xx_fabric_logout(0): failed to complete IOCB -- completion status (2) ioparam=0/810031. Feb 6 13:40:59 hqhost kernel: scsi(0): LOOP READY Feb 6 13:40:59 hqhost kernel: qla2x00_restart_isp(): Configure loop done, status = 0x0 Feb 6 13:40:59 hqhost kernel: APIC error on CPU5: 00(40) Feb 6 13:40:59 hqhost kernel: qla2x00_abort_isp(0): exiting. Feb 6 13:40:59 hqhost kernel: qla2x00_mailbox_command(0): finished abort_isp Feb 6 13:40:59 hqhost kernel: qla2x00_mailbox_command(0): finished abort_isp Feb 6 13:40:59 hqhost kernel: qla2x00_mailbox_command(0): **** FAILED. mbx0=54, mbx1=0, mbx2=2397, cmd=54 **** Feb 6 13:40:59 hqhost kernel: qla2x00_issue_iocb(0): failed rval 0x100 Feb 6 13:40:59 hqhost kernel: qla2x00_issue_iocb(0): failed rval 0x100 Feb 6 13:40:59 hqhost kernel: qla24xx_abort_command(0): failed to issue IOCB (100). Feb 6 13:40:59 hqhost kernel: qla2xxx_eh_abort(0): abort_command mbx failed. Feb 6 13:40:59 hqhost kernel: qla2xxx 0000:08:01.0: scsi(0:0:0): Abort command issued -- 0 1b538 2002. Feb 6 13:41:00 hqhost kernel: scsi(0): fcport-0 - port retry count: 28 remaining Feb 6 13:41:00 hqhost kernel: scsi(0): fcport-1 - port retry count: 28 remaining Feb 6 13:41:00 hqhost kernel: scsi(0): fcport-2 - port retry count: 28 remaining Feb 6 13:41:01 hqhost kernel: scsi(0): fcport-0 - port retry count: 27 remaining Feb 6 13:41:01 hqhost kernel: scsi(0): fcport-1 - port retry count: 27 remaining Feb 6 13:41:01 hqhost kernel: scsi(0): fcport-2 - port retry count: 27 remaining Feb 6 13:41:02 hqhost kernel: scsi(0): fcport-0 - port retry count: 26 remaining Feb 6 13:41:02 hqhost kernel: scsi(0): fcport-1 - port retry count: 26 remaining Feb 6 13:41:02 hqhost kernel: scsi(0): fcport-2 - port retry count: 26 remaining ...(25 more port retries)... Feb 6 13:41:33 hqhost kernel: rport-0:0-0: blocked FC remote port time out: removing target and saving binding Feb 6 13:41:33 hqhost kernel: rport-0:0-4: blocked FC remote port time out: removing target and saving binding Feb 6 13:41:33 hqhost kernel: rport-0:0-5: blocked FC remote port time out: removing target and saving binding Feb 6 13:41:33 hqhost kernel: qla2xxx 0000:08:01.0: scsi(0:0:0): DEVICE RESET ISSUED. Feb 6 13:41:33 hqhost kernel: qla2x00_wait_for_hba_online return_status=0 Is this a hardware problem, a kernel problem, or a qlogic driver problem-- or perhaps all three at once? Thanks in advance, -- Daniel Bakken Systems Administrator Economic Modeling Specialists Inc Moscow, Idaho