[Kernel-packages] [Bug 1849682] Re: [REGRESSION] md/raid0: cannot assemble multi-zone RAID0 with default_layout setting

dann frazier Thu, 24 Oct 2019 14:29:00 -0700

** Description changed:

  Users of RAID0 arrays are susceptible to a corruption issue if:
-  - The members of the RAID array are not all the same size[*]
-  - Data has been written to the array while running kernels < 3.14 and >= 3.14.
+  - The members of the RAID array are not all the same size[*]
+  - Data has been written to the array while running kernels < 3.14 and >= 3.14.
  
  Upstream is dealing with this by adding a versioned layout in v5.4, and
  backporting that via stable. Version 1 is the pre-3.14 layout, Version 2
- is post 3.14. However, unless a layout-version-aware kernel *created*
- the array, there's no way for the kernel to know which version was used
- to write the existing data. This undefined mode is considered "Version
- 0", and the kernel will now refuse to start these arrays w/o user
- intervention.
+ is post 3.14. Mixing version 1 & version 2 layouts can cause corruption.
+ However, unless a layout-version-aware kernel *created* the array,
+ there's no way for the kernel to know which version(s) was used to write
+ the existing data. This undefined mode is considered "Version 0", and
+ the kernel will now refuse to start these arrays w/o user intervention.
  
  These changes are now coming into our kernels via stable backports of
  the following commit, which describes the problem in the commit message:
  
  
https://github.com/torvalds/linux/commit/c84a1372df929033cb1a0441fb57bd3932f39ac9
  
  The user experience is pretty awful here. A user upgrades to the next
  SRU and all of a sudden their system stops at an (initramfs) prompt. A
  clueful user can spot something like the following in dmesg:
  
  Here's the message which, as you can see from the log in Comment #1, is
  hidden in a ton of other messages:
  
  [ 72.720232] md/raid0:md0: cannot assemble multi-zone RAID0 with default_layout setting
  [ 72.728149] md/raid0: please set raid.default_layout to 1 or 2
  [ 72.733979] md: pers->run() failed ...
  mdadm: failed to start array /dev/md0: Unknown error 524
  
- 
- What that is trying to say is that you should determine if your data -
- specifically the data toward the end of your array - was most likely
- written with a pre-3.14 or post-3.14 kernel. Based on that, reboot with
- the kernel parameter raid0.default_layout=1 or raid0.default_layout=2 on
- the kernel command line. And note it should be *raid0.default_layout*
- not *raid.default_layout* as the message says - a fix for that message
- is now queued for stable.
+ What that is trying to say is that you should determine if your data -
+ specifically the data toward the end of your array - was most likely
+ written with a pre-3.14 or post-3.14 kernel. Based on that, reboot with
+ the kernel parameter raid0.default_layout=1 or raid0.default_layout=2 on
+ the kernel command line. And note it should be *raid0.default_layout*
+ not *raid.default_layout* as the message says - a fix for that message
+ is now queued for stable:
  
  
https://github.com/torvalds/linux/commit/3874d73e06c9b9dc15de0b7382fc223986d75571
+ 
+ IMHO, we should work with upstream to create a web page that clearly
+ walks the user through this process, and update the error message to
+ point to that page. I'd also like to see if we can detect this problem
+ *before* the user reboots (debconf?) and help the user fix things. e.g.
+ "We detected that you have RAID0 arrays that may be susceptible to a
+ corruption problem", guide the user to choosing a layout, and update the
+ mdadm initramfs hook to poke the answer in via sysfs before starting the
+ array on reboot.
+ 
  
  [*] Which surprisingly is not the case reported in this bug - the user
  here had a raid0 of 8 identically-sized devices. I suspect there's a bug
  in the detection code somewhere.

** Description changed:

Users of RAID0 arrays are susceptible to a corruption issue if:
- The members of the RAID array are not all the same size[*]
- - Data has been written to the array while running kernels < 3.14 and >= 3.14.
+ - Data has been written to the array while running kernels < 3.14 *and* >= 3.14.

Upstream is dealing with this by adding a versioned layout in v5.4, and
backporting that via stable. Version 1 is the pre-3.14 layout, Version 2
is post 3.14. Mixing version 1 & version 2 layouts can cause corruption.
However, unless a layout-version-aware kernel *created* the array,
there's no way for the kernel to know which version(s) was used to write
the existing data. This undefined mode is considered "Version 0", and
the kernel will now refuse to start these arrays w/o user intervention.

These changes are now coming into our kernels via stable backports of
the following commit, which describes the problem in the commit message:

https://github.com/torvalds/linux/commit/c84a1372df929033cb1a0441fb57bd3932f39ac9

The user experience is pretty awful here. A user upgrades to the next
SRU and all of a sudden their system stops at an (initramfs) prompt. A
clueful user can spot something like the following in dmesg:

Here's the message which, as you can see from the log in Comment #1, is
hidden in a ton of other messages:

[ 72.720232] md/raid0:md0: cannot assemble multi-zone RAID0 with default_layout setting
[ 72.728149] md/raid0: please set raid.default_layout to 1 or 2
[ 72.733979] md: pers->run() failed ...
mdadm: failed to start array /dev/md0: Unknown error 524

What that is trying to say is that you should determine if your data -
specifically the data toward the end of your array - was most likely
written with a pre-3.14 or post-3.14 kernel. Based on that, reboot with
the kernel parameter raid0.default_layout=1 or raid0.default_layout=2 on
the kernel command line. And note it should be *raid0.default_layout*
not *raid.default_layout* as the message says - a fix for that message
is now queued for stable:

https://github.com/torvalds/linux/commit/3874d73e06c9b9dc15de0b7382fc223986d75571

IMHO, we should work with upstream to create a web page that clearly
walks the user through this process, and update the error message to
point to that page. I'd also like to see if we can detect this problem
*before* the user reboots (debconf?) and help the user fix things. e.g.
"We detected that you have RAID0 arrays that may be susceptible to a
corruption problem", guide the user to choosing a layout, and update the
mdadm initramfs hook to poke the answer in via sysfs before starting the
array on reboot.

-
[*] Which surprisingly is not the case reported in this bug - the user
here had a raid0 of 8 identically-sized devices. I suspect there's a bug
in the detection code somewhere.

** Description changed:

Users of RAID0 arrays are susceptible to a corruption issue if:
- The members of the RAID array are not all the same size[*]
- Data has been written to the array while running kernels < 3.14 *and* >= 3.14.

These changes are now coming into our kernels via stable backports of
the following commit, which describes the problem in the commit message:

https://github.com/torvalds/linux/commit/c84a1372df929033cb1a0441fb57bd3932f39ac9

Here's the message which, as you can see from the log in Comment #1, is
hidden in a ton of other messages:

https://github.com/torvalds/linux/commit/3874d73e06c9b9dc15de0b7382fc223986d75571

+ References from users of other distros:
+ https://blog.icod.de/2019/10/10/caution-kernel-5-3-4-and-raid0-default_layout/
+ https://www.linuxquestions.org/questions/linux-general-1/raid-arrays-not-assembling-4175662774/
+
[*] Which surprisingly is not the case reported in this bug - the user
here had a raid0 of 8 identically-sized devices. I suspect there's a bug
in the detection code somewhere.

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1849682

Title:
[REGRESSION] md/raid0: cannot assemble multi-zone RAID0 with
default_layout setting

Status in linux package in Ubuntu:
Incomplete
Status in linux source package in Bionic:
Confirmed
Status in linux source package in Disco:
Incomplete
Status in linux source package in Eoan:
Incomplete
Status in linux source package in Focal:
Incomplete

Bug description:
Users of RAID0 arrays are susceptible to a corruption issue if:
- The members of the RAID array are not all the same size[*]
- Data has been written to the array while running kernels < 3.14 *and* >= 3.14.
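
As a rough illustration of the first condition, here is a minimal Python sketch that counts how many zones a RAID0 array would end up with, given its member sizes. It assumes (as a simplification of what md/raid0 does) that each member is rounded down to a whole number of chunks and that every distinct rounded size opens a new zone. The function name and the sample sizes are made up for illustration only.

```python
def count_zones(member_sectors, chunk_sectors):
    """Estimate the number of RAID0 zones from member sizes (in sectors).

    Assumption: each member is rounded down to a multiple of the chunk
    size, and each distinct rounded size contributes one zone.
    """
    if chunk_sectors <= 0:
        raise ValueError("chunk_sectors must be positive")
    rounded = {(s // chunk_sectors) * chunk_sectors for s in member_sectors}
    return len(rounded)


# Hypothetical arrays: eight equal members vs. two unequal members.
equal = count_zones([1000448] * 8, chunk_sectors=1024)
unequal = count_zones([1000448, 2000896], chunk_sectors=1024)
print(equal, unequal)
```

An array for which this returns more than 1 would be the "multi-zone" case that the error message refers to.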

Upstream is dealing with this by adding a versioned layout in v5.4,
and backporting that via stable. Version 1 is the pre-3.14 layout,
Version 2 is post 3.14. Mixing version 1 & version 2 layouts can cause
corruption. However, unless a layout-version-aware kernel *created*
the array, there's no way for the kernel to know which version(s) was
used to write the existing data. This undefined mode is considered
"Version 0", and the kernel will now refuse to start these arrays w/o
user intervention.

These changes are now coming into our kernels via stable backports of
the following commit, which describes the problem in the commit
message:

https://github.com/torvalds/linux/commit/c84a1372df929033cb1a0441fb57bd3932f39ac9

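Here's the message which, as you can see from the log in Comment #1,
is hidden in a ton of other messages:

[ 72.720232] md/raid0:md0: cannot assemble multi-zone RAID0 with default_layout setting
[ 72.728149] md/raid0: please set raid.default_layout to 1 or 2
[ 72.733979] md: pers->run() failed ...
mdadm: failed to start array /dev/md0: Unknown error 524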

What that is trying to say is that you should determine if your data -
specifically the data toward the end of your array - was most likely
written with a pre-3.14 or post-3.14 kernel. Based on that, reboot
with the kernel parameter raid0.default_layout=1 or
raid0.default_layout=2 on the kernel command line. And note it should
be *raid0.default_layout* not *raid.default_layout* as the message
says - a fix for that message is now queued for stable:

https://github.com/torvalds/linux/commit/3874d73e06c9b9dc15de0b7382fc223986d75571
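
Since the parameter name in the kernel message is easy to copy verbatim, a small check of the kernel command line might help. The Python sketch below is only an illustration: it parses a command line string (such as the contents of /proc/cmdline), returns the layout if raid0.default_layout is set to 1 or 2, and warns about the misspelled raid.default_layout. The function name and the example command line are hypothetical.

```python
def find_layout_param(cmdline):
    """Return (layout, warnings) parsed from a kernel command line string."""
    layout = None
    warnings = []
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "raid0.default_layout":
            if value in ("1", "2"):
                layout = int(value)
            else:
                warnings.append(
                    "raid0.default_layout must be 1 or 2, got %r" % value)
        elif key == "raid.default_layout":
            # This is the name printed by the unfixed kernel message.
            warnings.append(
                "raid.default_layout is not the right name; "
                "use raid0.default_layout")
    return layout, warnings


# Hypothetical command line that copies the name from the error message.
example = "BOOT_IMAGE=/vmlinuz root=/dev/md0 ro raid.default_layout=2"
print(find_layout_param(example))
```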

IMHO, we should work with upstream to create a web page that clearly
walks the user through this process, and update the error message to
point to that page. I'd also like to see if we can detect this problem
*before* the user reboots (debconf?) and help the user fix things.
e.g. "We detected that you have RAID0 arrays that may be susceptible to
a corruption problem", guide the user to choosing a layout, and update
the mdadm initramfs hook to poke the answer in via sysfs before
starting the array on reboot.
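
To make the sysfs idea a bit more concrete, here is a minimal sketch of the write such a hook would need to perform. A real initramfs hook would be a shell script; Python is used here only to show the logic. The path /sys/module/raid0/parameters/default_layout is an assumption about where the module parameter is exposed, and the helper name is hypothetical. The demonstration writes to a temporary file rather than the real sysfs entry.

```python
import os
import tempfile

# Assumed location of the raid0 module parameter in sysfs.
DEFAULT_PARAM_PATH = "/sys/module/raid0/parameters/default_layout"


def set_default_layout(layout, param_path=DEFAULT_PARAM_PATH):
    """Write the chosen layout (1 or 2) to the module parameter file."""
    if layout not in (1, 2):
        raise ValueError("layout must be 1 (pre-3.14) or 2 (post-3.14)")
    with open(param_path, "w") as f:
        f.write("%d\n" % layout)


# Demonstrate against a temporary file instead of the real sysfs entry.
with tempfile.TemporaryDirectory() as tmp:
    fake = os.path.join(tmp, "default_layout")
    set_default_layout(2, fake)
    with open(fake) as f:
        print(f.read().strip())
```

The layout value itself would still have to come from the user, since the kernel cannot tell which layout wrote the existing data.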

References from users of other distros:
https://blog.icod.de/2019/10/10/caution-kernel-5-3-4-and-raid0-default_layout/

https://www.linuxquestions.org/questions/linux-general-1/raid-arrays-not-assembling-4175662774/

[*] Which surprisingly is not the case reported in this bug - the user
here had a raid0 of 8 identically-sized devices. I suspect there's a
bug in the detection code somewhere.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1849682/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp
