Hey there,

Our DRBD primary machine experienced a rather spontaneous reboot some time ago.
We were happily starting and stopping KVM virtual machines and syncing a new DRBD resource when this happened:

...
Feb 29 06:53:47 node2 kernel: [217385.578661] ata3.00: disabled
Feb 29 06:53:47 node2 kernel: [217385.578703] sd 2:0:0:0: [sda] Unhandled error code
Feb 29 06:53:47 node2 kernel: [217385.578707] sd 2:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 29 06:53:47 node2 kernel: [217385.578712] sd 2:0:0:0: [sda] CDB: Read(10): 28 00 19 74 18 00 00 01 38 00
Feb 29 06:53:47 node2 kernel: [217385.661238] sd 2:0:0:0: [sda] Stopping disk
Feb 29 06:53:47 node2 kernel: [217385.661977] sd 2:0:0:0: [sda] START_STOP FAILED
Feb 29 06:53:47 node2 kernel: [217385.661981] sd 2:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 29 06:53:47 node2 kernel: [217385.662391] ata4.00: disabled
Feb 29 06:53:47 node2 kernel: [217385.668821] sd 3:0:0:0: [sdb] Stopping disk
Feb 29 06:53:47 node2 kernel: [217385.668864] sd 3:0:0:0: [sdb] START_STOP FAILED
Feb 29 06:53:47 node2 kernel: [217385.668867] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 29 06:53:47 node2 kernel: [217385.669000] ata5.00: disabled
Feb 29 06:53:47 node2 kernel: [217385.686506] md: super_written gets error=-5, uptodate=0
Feb 29 06:53:47 node2 kernel: [217385.755989] md: super_written gets error=-5, uptodate=0
Feb 29 06:53:47 node2 kernel: [217385.756202] sd 4:0:0:0: [sdc] Stopping disk
Feb 29 06:53:47 node2 kernel: [217385.756257] sd 4:0:0:0: [sdc] START_STOP FAILED
Feb 29 06:53:47 node2 kernel: [217385.756260] sd 4:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 29 06:53:47 node2 kernel: [217385.756779] ata6.00: disabled
Feb 29 06:53:47 node2 kernel: [217385.816675] md: super_written gets error=-5, uptodate=0
Feb 29 06:53:47 node2 kernel: [217385.900415] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.900418] --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.900421] disk 0, o:0, dev:sda
Feb 29 06:53:47 node2 kernel: [217385.900424] disk 1, o:0, dev:sdb
Feb 29 06:53:47 node2 kernel: [217385.900426] disk 2, o:0, dev:sdc
Feb 29 06:53:47 node2 kernel: [217385.900429] disk 3, o:0, dev:sdd
Feb 29 06:53:47 node2 kernel: [217385.900771] sd 5:0:0:0: [sdd] Stopping disk
Feb 29 06:53:47 node2 kernel: [217385.901157] sd 5:0:0:0: [sdd] START_STOP FAILED
Feb 29 06:53:47 node2 kernel: [217385.901162] sd 5:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 29 06:53:47 node2 kernel: [217385.901487] ahci 0000:00:11.0: PCI INT A disabled
Feb 29 06:53:47 node2 kernel: [217385.902756] pci-stub 0000:00:11.0: claimed by stub
Feb 29 06:53:47 node2 kernel: [217385.904721] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.904727] --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.904732] disk 1, o:0, dev:sdb
Feb 29 06:53:47 node2 kernel: [217385.904735] disk 2, o:0, dev:sdc
Feb 29 06:53:47 node2 kernel: [217385.904738] disk 3, o:0, dev:sdd
Feb 29 06:53:47 node2 kernel: [217385.904752] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.904755] --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.904757] disk 1, o:0, dev:sdb
Feb 29 06:53:47 node2 kernel: [217385.904759] disk 2, o:0, dev:sdc
Feb 29 06:53:47 node2 kernel: [217385.904762] disk 3, o:0, dev:sdd
Feb 29 06:53:47 node2 kernel: [217385.916029] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.916035] --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.916040] disk 1, o:0, dev:sdb
Feb 29 06:53:47 node2 kernel: [217385.916042] disk 2, o:0, dev:sdc
Feb 29 06:53:47 node2 kernel: [217385.916056] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.916058] --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.916060] disk 1, o:0, dev:sdb
Feb 29 06:53:47 node2 kernel: [217385.916062] disk 2, o:0, dev:sdc
Feb 29 06:53:47 node2 kernel: [217385.932427] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.932432] --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.932437] disk 1, o:0, dev:sdb
Feb 29 06:53:47 node2 kernel: [217385.932450] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.932452] --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.932455] disk 1, o:0, dev:sdb
Feb 29 06:53:47 node2 kernel: [217385.948162] RAID5 conf printout:
Feb 29 06:53:47 node2 kernel: [217385.948168] --- rd:4 wd:0
Feb 29 06:53:47 node2 kernel: [217385.949817] block drbd0: Barriers not supported on meta data device - disabling
Feb 29 06:53:47 node2 kernel: [217385.950177] block drbd0: read: error=-5 s=232535040s
Feb 29 06:53:47 node2 kernel: [217385.950184] block drbd0: Resync aborted.
Feb 29 06:53:47 node2 kernel: [217385.950189] block drbd0: conn( SyncSource -> Connected ) disk( UpToDate -> Failed )
Feb 29 06:53:47 node2 kernel: [217385.981468] block drbd0: read: error=-5 s=232536064s
Feb 29 06:53:47 node2 kernel: [217385.981479] block drbd0: read: error=-5 s=232534016s
Feb 29 06:53:47 node2 kernel: [217385.981648] block drbd0: read: error=-5 s=232535048s
<snip ~ 600 more lines like this...>
Feb 29 06:53:47 node2 kernel: [217385.985444] block drbd0: p write: error=-5
Feb 29 06:53:47 node2 kernel: [217386.016978] block drbd0: p write: error=-5
Feb 29 06:53:47 node2 kernel: [217386.136316] block drbd0: helper command: /sbin/drbdadm pri-on-incon-degr minor-0
Feb 29 06:53:47 node2 kernel: [217386.153546] block drbd0: read: error=-5 s=232539272s
Feb 29 06:53:47 node2 notify-pri-on-incon-degr.sh[25841]: invoked for lv0
Feb 29 06:53:48 node2 kernel: [217386.403458] lost page write due to I/O error on drbd0
Feb 29 06:53:48 node2 kernel: [217386.471193] lost page write due to I/O error on drbd0
Feb 29 06:53:48 node2 kernel: [217386.511306] block drbd1: p write: error=-5
Feb 29 06:53:48 node2 kernel: [217386.526164] block drbd1: disk( UpToDate -> Failed )
Feb 29 06:53:48 node2 kernel: [217386.585614] block drbd1: p write: error=-5
Feb 29 06:53:48 node2 kernel: [217386.624749] block drbd1: disk( Failed -> Diskless )
Feb 29 06:53:48 node2 kernel: [217386.624764] block drbd1: Notified peer that my disk is broken.
Feb 29 06:53:48 node2 kernel: [217386.917071] ahci 0000:00:11.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
Feb 29 06:53:48 node2 kernel: [217386.917872] ahci 0000:00:11.0: AHCI 0001.0200 32 slots 4 ports 3 Gbps 0xf impl SATA mode
Feb 29 06:53:48 node2 kernel: [217386.917879] ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part
Feb 29 06:53:48 node2 kernel: [217386.918363] scsi7 : ahci
Feb 29 06:53:48 node2 kernel: [217386.918492] scsi8 : ahci
Feb 29 06:53:48 node2 kernel: [217386.918571] scsi9 : ahci
Feb 29 06:53:48 node2 kernel: [217386.919291] scsi10 : ahci
Feb 29 06:53:48 node2 kernel: [217386.919361] ata7: SATA max UDMA/133 abar m1024@0xfe4ffc00 port 0xfe4ffd00 irq 30
Feb 29 06:53:48 node2 kernel: [217386.919367] ata8: SATA max UDMA/133 abar m1024@0xfe4ffc00 port 0xfe4ffd80 irq 30
Feb 29 06:53:48 node2 kernel: [217386.919372] ata9: SATA max UDMA/133 abar m1024@0xfe4ffc00 port 0xfe4ffe00 irq 30
Feb 29 06:53:48 node2 kernel: [217386.919377] ata10: SATA max UDMA/133 abar m1024@0xfe4ffc00 port 0xfe4ffe80 irq 30
Feb 29 06:53:49 node2 kernel: [217387.404053] ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 29 06:53:49 node2 kernel: [217387.404091] ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 29 06:53:49 node2 kernel: [217387.404116] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 29 06:53:49 node2 kernel: [217387.404141] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 29 06:53:49 node2 kernel: [217387.409780] ata9.00: ATA-8: SAMSUNG HD103SJ, 1AJ10001, max UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.409786] ata9.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
Feb 29 06:53:49 node2 kernel: [217387.409823] ata8.00: ATA-8: SAMSUNG HD103SJ, 1AJ10001, max UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.409828] ata8.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
Feb 29 06:53:49 node2 kernel: [217387.410142] ata7.00: ATA-8: SAMSUNG HD103SJ, 1AJ10001, max UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.410149] ata7.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
Feb 29 06:53:49 node2 kernel: [217387.410198] ata10.00: ATA-8: SAMSUNG HD103SJ, 1AJ10001, max UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.410203] ata10.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
Feb 29 06:53:49 node2 kernel: [217387.415580] ata9.00: configured for UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.415615] ata8.00: configured for UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.415939] ata7.00: configured for UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.415980] ata10.00: configured for UDMA/133
Feb 29 06:53:49 node2 kernel: [217387.428686] scsi 7:0:0:0: Direct-Access ATA SAMSUNG HD103SJ 1AJ1 PQ: 0 ANSI: 5
Feb 29 06:53:49 node2 kernel: [217387.429015] sd 7:0:0:0: [sdf] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
Feb 29 06:53:49 node2 kernel: [217387.429450] scsi 8:0:0:0: Direct-Access ATA SAMSUNG HD103SJ 1AJ1 PQ: 0 ANSI: 5
Feb 29 06:53:49 node2 kernel: [217387.429756] scsi 9:0:0:0: Direct-Access ATA SAMSUNG HD103SJ 1AJ1 PQ: 0 ANSI: 5
Feb 29 06:53:49 node2 kernel: [217387.430666] sd 9:0:0:0: [sdh] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
Feb 29 06:53:49 node2 kernel: [217387.430741] sd 9:0:0:0: [sdh] Write Protect is off
Feb 29 06:53:49 node2 kernel: [217387.430774] sd 9:0:0:0: [sdh] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Feb 29 06:53:49 node2 kernel: [217387.430974] sdh:
Feb 29 06:53:49 node2 kernel: [217387.431199] sd 8:0:0:0: [sdg] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
Feb 29 06:53:49 node2 kernel: [217387.431278] sd 8:0:0:0: [sdg] Write Protect is off
Feb 29 06:53:49 node2 kernel: [217387.431313] sd 8:0:0:0: [sdg] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Feb 29 06:53:49 node2 kernel: [217387.436193] sdg:
Feb 29 06:53:49 node2 kernel: [217387.436382] scsi 10:0:0:0: Direct-Access ATA SAMSUNG HD103SJ 1AJ1 PQ: 0 ANSI: 5
Feb 29 06:53:49 node2 kernel: [217387.436580] sd 10:0:0:0: [sdi] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
Feb 29 06:53:49 node2 kernel: [217387.436649] sd 10:0:0:0: [sdi] Write Protect is off
Feb 29 06:53:49 node2 kernel: [217387.436682] sd 10:0:0:0: [sdi] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Feb 29 06:53:49 node2 kernel: [217387.436888] sdi:
Feb 29 06:53:49 node2 kernel: [217387.437033] sd 7:0:0:0: [sdf] Write Protect is off
Feb 29 06:53:49 node2 kernel: [217387.437064] sd 7:0:0:0: [sdf] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Feb 29 06:53:49 node2 kernel: [217387.437212] sdf: unknown partition table
Feb 29 06:53:49 node2 kernel: [217387.439934] sd 9:0:0:0: [sdh] Attached SCSI disk
Feb 29 06:53:49 node2 kernel: [217387.445677] unknown partition table
Feb 29 06:53:49 node2 kernel: [217387.446324] sd 10:0:0:0: [sdi] Attached SCSI disk
Feb 29 06:53:49 node2 kernel: [217387.451006] unknown partition table
Feb 29 06:53:49 node2 kernel: [217387.451309] sd 8:0:0:0: [sdg] Attached SCSI disk
Feb 29 06:53:49 node2 kernel: [217387.451325]
Feb 29 06:53:49 node2 kernel: [217387.452053] sd 7:0:0:0: [sdf] Attached SCSI disk
<snip: DRBD probably does the right thing and initiates a reboot>
Feb 29 06:53:50 node2 notify-emergency-reboot.sh[25900]: invoked for lv0

Setup

Both nodes run the stock squeeze 2.6.32-5-amd64 kernel.

node2 (drbd primary)
HP Proliant Micro Server
00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode] (rev 40)
4 sata disks sd[a-d]
1 vg "data" 2.73TB
1 lv "export" 500GB / /dev/drbd1
1 lv "lv0" 500GB / /dev/drbd0

node3 (drbd secondary)
HP Proliant Micro Server
00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode] (rev 40)
4 sata disks sd[a-d]
1 vg "data" 2.73TB
1 lv "export" 500GB / /dev/drbd1
1 lv "lv0" 500GB / /dev/drbd0

drbd resources

resource r0 {
  device /dev/drbd1;
  disk /dev/mapper/data-export;
  meta-disk internal;
  startup { wfc-timeout 90; }
  # net { on-disconnect reconnect; }
  disk { on-io-error detach; }
  on node2 { address 10.1.5.2:7789; }
  on node3 { address 10.1.5.3:7789; }
}

resource lv0 {
  device /dev/drbd0;
  disk /dev/mapper/data-lv0;
  meta-disk internal;
  startup { wfc-timeout 90; }
  # net { on-disconnect reconnect; }
  disk { on-io-error detach; }
  on node2 { address 10.1.5.2:7790; }
  on node3 { address 10.1.5.3:7790; }
}

Switched gigabit ethernet hooks all this together.

I have since been able to reproduce the problem on another HP Micro Server, identical to this one but with slower and bigger disks in it. It has the same SATA controller, though.

Other people might be having issues with this SATA controller too:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/550559

The machines are deployed at the customer's site and currently run mostly stable, as long as we keep the I/O load down...

Any help to get this sorted out is appreciated.

Cheers
Robert
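
P.S. For anyone who wants to condense a log like the one above into a short timeline, here is a minimal Python sketch. It is not part of the setup described in this mail; the function name summarize and the regular expressions are my own assumptions, written against the line formats visible in the excerpt ("ataN.00: disabled" and DRBD state changes such as "disk( UpToDate -> Failed )"). It may well miss other message formats.

```python
import re

# Syslog-style kernel lines, e.g.
# "Feb 29 06:53:47 node2 kernel: [217385.578661] ata3.00: disabled"
LINE_RE = re.compile(r"kernel: \[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")
# An ATA device being dropped, e.g. "ata3.00: disabled"
ATA_RE = re.compile(r"^(ata\d+)\.\d+: disabled$")
# Any DRBD message, e.g. "block drbd0: Resync aborted."
DRBD_RE = re.compile(r"^block (drbd\d+): (.*)$")
# DRBD state changes inside such a message, e.g. "disk( UpToDate -> Failed )"
STATE_RE = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def summarize(lines):
    """Return sorted (timestamp, source, event) tuples for ATA drops
    and DRBD state transitions; every other line is ignored."""
    events = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a kernel line (e.g. helper script output)
        ts = float(m.group("ts"))
        msg = m.group("msg").strip()
        ata = ATA_RE.match(msg)
        if ata:
            events.append((ts, ata.group(1), "disabled"))
            continue
        drbd = DRBD_RE.match(msg)
        if drbd:
            for kind, old, new in STATE_RE.findall(drbd.group(2)):
                events.append((ts, drbd.group(1), f"{kind}: {old} -> {new}"))
    return sorted(events)


# Three lines copied from the log in this mail.
sample = [
    "Feb 29 06:53:47 node2 kernel: [217385.578661] ata3.00: disabled",
    "Feb 29 06:53:47 node2 kernel: [217385.950189] block drbd0: "
    "conn( SyncSource -> Connected ) disk( UpToDate -> Failed )",
    "Feb 29 06:53:48 node2 kernel: [217386.624749] block drbd1: "
    "disk( Failed -> Diskless )",
]

for ts, source, event in summarize(sample):
    print(f"{ts:.6f} {source}: {event}")
```

Feeding it the full syslog instead of the sample list should show the order in which the ATA ports dropped relative to the DRBD disk state changes, which is the part that is hard to see in the raw output.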