Public bug reported:

My system boots from XFS on RAID10 on GPT partitions (no LVM). The RAID10 uses the "far2" layout, and has three component devices. I use grub-pc for non-EFI booting, because this system is old and doesn't support EFI (Intel DG965WH from 2008).

I added a fourth hard drive and shuffled my data around so I could re-partition the existing drives.
(http://unix.stackexchange.com/questions/74924/how-to-safely-replace-a-not-yet-failed-disk-in-a-linux-raid5-array)

A few weeks after the final `mdadm /dev/md0 --replace /dev/sda1 --with /dev/sdd1`, grub failed to boot. Error messages included "invalid arch-independent ELF magic", and `insmod linux` giving "not a regular file".

Booting an Ubuntu live USB showed no problem with the FS, and none of dpkg-reconfigure grub-pc; grub-install /dev/sda; update-grub helped. Before those attempts to fix it, grub was loading a messed-up menu but not quite booting Linux. After re-running grub-install, it stopped at the grub rescue> prompt.

sda is the first BIOS disk, but even having my BIOS boot a different disk didn't help. Presumably that doesn't affect the order GRUB detects them in.

I eventually solved the problem by swapping the SATA cables so that the drive that no longer held a current member of the boot array was no longer the first BIOS drive. Now everything works perfectly.

I think GRUB's md code is including the first N members it sees, whether they're stale or not. Linux's MD code finds all candidates, and then picks N in-sync ones if available.

This was really hard to diagnose, because disk churn hadn't got the data so far out of sync that there were XFS errors. Directory listings of /boot/grub/i386-pc worked from the grub rescue shell, but the actual data in some of the files didn't match. (Some of the inode contents were different too, hence the "not a regular file".)

I think wiping the RAID signature would have solved the problem as well. (mdadm --zero-superblock /dev/sda2, after making sure in the live-USB environment that that was actually the stale device.)

Here's mdadm -E from the stale component (which was sda2 before swapping cables, now it's sdd2). This is what a component looks like after a --replace and --remove have been done with it.
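
To make the suspected difference in member selection concrete, here is a toy model in Python. It is only a sketch of the hypothesis above, not GRUB's or the kernel's actual code: the Member type, the function names and the disk labels are all made up for illustration, and the only values taken from this report are the two event counts (2708 on the stale component, 2740 on the in-sync ones).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str    # device label, illustrative only
    role: int    # "Device Role" slot in the array
    events: int  # "Events" counter from the superblock


def first_seen(members, raid_devices):
    """Hypothesised behaviour: the first member seen for each role wins."""
    chosen = {}
    for m in members:  # members arrive in detection order
        if m.role < raid_devices and m.role not in chosen:
            chosen[m.role] = m
    return [chosen[r].name for r in sorted(chosen)]


def freshest(members, raid_devices):
    """Alternative: per role, keep the member with the highest event count."""
    chosen = {}
    for m in members:
        if m.role >= raid_devices:
            continue
        best = chosen.get(m.role)
        if best is None or m.events > best.events:
            chosen[m.role] = m
    return [chosen[r].name for r in sorted(chosen)]


# Assumed detection order before the cable swap: the stale disk comes first.
scan = [
    Member("stale", role=2, events=2708),
    Member("disk-b", role=0, events=2740),
    Member("disk-c", role=1, events=2740),
    Member("disk-d", role=2, events=2740),
]

print(first_seen(scan, 3))  # ['disk-b', 'disk-c', 'stale']
print(freshest(scan, 3))    # ['disk-b', 'disk-c', 'disk-d']
```

In this model the stale component only gets picked when it is detected before the in-sync member with the same role, which would be consistent with the cable swap making the problem go away.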
After that: mdadm --detail /dev/md/root

peter@tesla:~$ sudo mdadm --examine /dev/sdd2
/dev/sdd2:   #######
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : e0ad8202:4c270099:9f28ddd6:b597231d
           Name : tesla:root (local to host tesla)
  Creation Time : Thu Apr 16 14:26:50 2015    ### note that's 2015, last year.
     Raid Level : raid10
   Raid Devices : 3

 Avail Dev Size : 30703616 (14.64 GiB 15.72 GB)
     Array Size : 23027712 (21.96 GiB 23.58 GB)
    Data Offset : 16384 sectors
   Super Offset : 8 sectors
   Unused Space : before=16296 sectors, after=0 sectors
          State : clean
    Device UUID : 8ae879d7:b5c6b0ad:f2d6c787:49284d4b

    Update Time : Wed Mar 16 02:49:17 2016
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 1c62e134 - correct
         Events : 2708

         Layout : far=2
     Chunk Size : 1024K

    Device Role : Active device 2
    Array State : AAR ('A' == active, '.' == missing, 'R' == replacing)

/dev/sda2:   ##### An in-sync component
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : e0ad8202:4c270099:9f28ddd6:b597231d
           Name : tesla:root (local to host tesla)
  Creation Time : Thu Apr 16 14:26:50 2015
     Raid Level : raid10
   Raid Devices : 3

 Avail Dev Size : 30703616 (14.64 GiB 15.72 GB)
     Array Size : 23027712 (21.96 GiB 23.58 GB)
    Data Offset : 16384 sectors
   Super Offset : 8 sectors
   Unused Space : before=16296 sectors, after=0 sectors
          State : clean
    Device UUID : 5d6bb778:1700264b:bd7aadba:11336f0b

    Update Time : Sat Apr 9 16:48:18 2016
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 4e39b4c0 - correct
         Events : 2740

         Layout : far=2
     Chunk Size : 1024K

    Device Role : Active device 1
    Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)

peter@tesla:~$ sudo mdadm --detail /dev/md/root
/dev/md/root:
        Version : 1.2
  Creation Time : Thu Apr 16 14:26:50 2015
     Raid Level : raid10
     Array Size : 23027712 (21.96 GiB 23.58 GB)
  Used Dev Size : 15351808 (14.64 GiB 15.72 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Sat Apr 9 21:19:32 2016
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : far=2
     Chunk Size : 1024K

           Name : tesla:root (local to host tesla)
           UUID : e0ad8202:4c270099:9f28ddd6:b597231d
         Events : 2740

    Number   Major   Minor   RaidDevice State
       3       8       18        0      active sync   /dev/sdb2
       4       8        2        1      active sync   /dev/sda2
       6       8       34        2      active sync   /dev/sdc2

ProblemType: Bug
DistroRelease: Ubuntu 15.10
Package: grub-pc 2.02~beta2-29ubuntu0.3
ProcVersionSignature: Ubuntu 4.2.0-35.40-generic 4.2.8-ckt5
Uname: Linux 4.2.0-35-generic x86_64
ApportVersion: 2.19.1-0ubuntu5
Architecture: amd64
CurrentDesktop: KDE
Date: Sat Apr 9 20:53:19 2016
SourcePackage: grub2
UpgradeStatus: Upgraded to wily on 2015-11-12 (149 days ago)

** Affects: grub2 (Ubuntu)
     Importance: Undecided
         Status: New

** Tags: amd64 apport-bug wily

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1568426

Title:
  md: detects stale members ahead of in-sync members

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1568426/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs