[Kernel-packages] [Bug 1850540] [NEW] multi-zone raid0 corruption

dann frazier Tue, 29 Oct 2019 12:32:06 -0700

Public bug reported:

Bug 1849682 tracks the temporary revert of the fix for this issue,
while this bug tracks the re-application of that fix once we have a full
solution.


Users of RAID0 arrays are susceptible to a corruption issue if:
 - The members of the RAID array are not all the same size[*]
 - Data has been written to the array while running both a kernel < 3.14
   *and* a kernel >= 3.14.
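
The member-size condition matters because a RAID0 array with unequal members is split into several zones: the first zone stripes across every member up to the size of the smallest one, and each later zone stripes across only the members that still have space. A minimal Python sketch of that split follows. The helper names are hypothetical, sizes are in arbitrary units, and chunk-size rounding is ignored, so treat it as an illustration rather than the kernel's code.

```python
def raid0_zones(member_sizes):
    """Return a list of (members_in_zone, per_member_length) tuples."""
    zones = []
    consumed = 0  # how much of every remaining member is already in a zone
    remaining = sorted(member_sizes)
    while remaining:
        smallest = remaining[0]
        if smallest > consumed:
            # A new zone spans all members that still have free space.
            zones.append((len(remaining), smallest - consumed))
            consumed = smallest
        remaining = remaining[1:]  # the smallest member is now exhausted
    return zones


def is_multi_zone(member_sizes):
    """True when the members differ in size, i.e. more than one zone exists."""
    return len(raid0_zones(member_sizes)) > 1
```

With three equal members there is a single zone, so the first condition above is not met; with sizes such as 100, 200, 200 a second, two-member zone appears.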

This is because of a change in v3.14 that accidentally changed how data was
written - as described in the upstream commit message:
https://github.com/torvalds/linux/commit/c84a1372df929033cb1a0441fb57bd3932f39ac9

That change has been applied to stable, but we reverted it to fix
bug 1849682 until we have a full solution ready.

To summarize, upstream is dealing with this by adding a versioned layout
in v5.4, and that is being backported to stable kernels - which is why
we're now seeing it. Layout version 1 is the pre-3.14 layout, version 2
is post 3.14. Mixing version 1 & version 2 layouts can cause corruption.
However, unless a layout-version-aware kernel *created* the array,
there's no way for the kernel to know which version(s) was used to write
the existing data. This undefined mode is considered "Version 0", and
the kernel will now refuse to start these arrays w/o user intervention.
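
The decision described above can be sketched in Python as follows. This is my reading of the behaviour, not the kernel's code: the function and constant names are hypothetical, and the single-zone shortcut assumes that the layout choice makes no difference when all members are the same size.

```python
ORIG_LAYOUT = 1  # pre-3.14 behaviour
ALT_LAYOUT = 2   # post-3.14 behaviour


def choose_layout(recorded_layout, default_layout, multi_zone):
    """Return the layout to use, or None if the array must not be started.

    recorded_layout: layout stored with the array (0 means "unknown").
    default_layout:  value of raid0.default_layout (0 means "not set").
    multi_zone:      True when the members are not all the same size.
    """
    if not multi_zone:
        # Assumption: with a single zone both layouts place data identically.
        return ORIG_LAYOUT
    if recorded_layout in (ORIG_LAYOUT, ALT_LAYOUT):
        return recorded_layout
    if default_layout in (ORIG_LAYOUT, ALT_LAYOUT):
        return default_layout
    return None  # "Version 0" with no user-supplied default: refuse to start
```

The last branch is the case that leaves users at the (initramfs) prompt: a multi-zone array created by a kernel that was not layout-version-aware, with no default layout set.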

The user experience is pretty awful here. A user upgrades to the next
SRU and all of a sudden their system stops at an (initramfs) prompt. A
clueful user can spot something like the following in dmesg, although, as
you can see from the log in Comment #1, it is hidden in a ton of other
messages:

[ 72.720232] md/raid0:md0: cannot assemble multi-zone RAID0 with default_layout setting
[ 72.728149] md/raid0: please set raid.default_layout to 1 or 2
[ 72.733979] md: pers->run() failed ...
mdadm: failed to start array /dev/md0: Unknown error 524
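
Since the message is easy to miss, here is a small Python sketch that picks the affected array names out of a dmesg dump. It assumes each kernel message sits on a single line and that the wording matches the log above; the function name is hypothetical.

```python
import re

# Matches the refusal message quoted above and captures the array name.
REFUSAL = re.compile(
    r"md/raid0:(?P<array>\w+): cannot assemble multi-zone RAID0 "
    r"with default_layout setting"
)


def arrays_needing_layout(dmesg_text):
    """Return the sorted names of arrays the kernel refused to assemble."""
    return sorted({m.group("array") for m in REFUSAL.finditer(dmesg_text)})
```

Feeding it the captured output of dmesg should list the md devices that need a layout choice before they can be started.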

What that is trying to say is that you should determine if your data -
specifically the data toward the end of your array - was most likely
written with a pre-3.14 or post-3.14 kernel. Based on that, reboot with
the kernel parameter raid0.default_layout=1 or raid0.default_layout=2 on
the kernel command line. And note it should be *raid0.default_layout*
not *raid.default_layout* as the message says - a fix for that message
is now queued for stable:

https://github.com/torvalds/linux/commit/3874d73e06c9b9dc15de0b7382fc223986d75571
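
To check whether a given kernel command line actually carries the correctly spelled parameter, something like the following Python sketch could be used. The function name is hypothetical, and letting the last occurrence win is an assumption on my part.

```python
def default_layout_from_cmdline(cmdline):
    """Return 1 or 2 if raid0.default_layout is set on the command line, else None."""
    layout = None
    for token in cmdline.split():
        key, _, value = token.partition("=")
        # Only the raid0.* spelling counts; raid.default_layout is ignored.
        if key == "raid0.default_layout" and value in ("1", "2"):
            layout = int(value)  # assumption: the last occurrence wins
    return layout
```

On a running system the input would be the contents of /proc/cmdline. Note that a command line using raid.default_layout, as the error message suggests, yields None here.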

IMHO, we should work with upstream to create a web page that clearly
walks the user through this process, and update the error message to
point to that page. I'd also like to see if we can detect this problem
*before* the user reboots (debconf?) and help the user fix things, e.g.
display "We detected that you have RAID0 arrays that may be susceptible to a
corruption problem", guide the user in choosing a layout, and update the
mdadm initramfs hook to poke the answer in via sysfs before starting the
array on reboot.
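
As a rough illustration of the "poke the answer in via sysfs" step, here is a Python sketch. A real initramfs hook would be a shell script; the sysfs path below is an assumption based on raid0 exposing default_layout as a module parameter, and the function name is hypothetical.

```python
from pathlib import Path

# Assumed location of the module parameter; it only exists once the raid0
# module is loaded.
SYSFS_PARAM = Path("/sys/module/raid0/parameters/default_layout")


def set_default_layout(layout, param_path=SYSFS_PARAM):
    """Write the user's chosen layout (1 or 2) to the given parameter file."""
    if layout not in (1, 2):
        raise ValueError("layout must be 1 (pre-3.14) or 2 (post-3.14)")
    Path(param_path).write_text(f"{layout}\n")
```

The hook would have to run this before the array is assembled, and the value would come from whatever the user chose at upgrade time.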

Note that it also seems like we should investigate backporting this to <
3.14 kernels. Imagine a user switching between the trusty HWE kernel and
the GA kernel.

References from users of other distros:
https://blog.icod.de/2019/10/10/caution-kernel-5-3-4-and-raid0-default_layout/
https://www.linuxquestions.org/questions/linux-general-1/raid-arrays-not-assembling-4175662774/

[*] Which surprisingly is not the case reported in this bug - the user
here had a raid0 of 8 identically-sized devices. I suspect there's a bug
in the detection code somewhere.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Affects: mdadm (Ubuntu)
     Importance: Undecided
         Status: New

** Affects: linux (Ubuntu Precise)
     Importance: Undecided
         Status: New

** Affects: mdadm (Ubuntu Precise)
     Importance: Undecided
         Status: New

** Affects: linux (Ubuntu Trusty)
     Importance: Undecided
         Status: New

** Affects: mdadm (Ubuntu Trusty)
     Importance: Undecided
         Status: New

** Affects: linux (Ubuntu Xenial)
     Importance: Undecided
         Status: New

** Affects: mdadm (Ubuntu Xenial)
     Importance: Undecided
         Status: New

** Affects: linux (Ubuntu Bionic)
     Importance: Undecided
         Status: New

** Affects: mdadm (Ubuntu Bionic)
     Importance: Undecided
         Status: New

** Affects: linux (Ubuntu Disco)
     Importance: Undecided
         Status: New

** Affects: mdadm (Ubuntu Disco)
     Importance: Undecided
         Status: New

** Affects: linux (Ubuntu Eoan)
     Importance: Undecided
         Status: New

** Affects: mdadm (Ubuntu Eoan)
     Importance: Undecided
         Status: New

** Affects: linux (Ubuntu Focal)
     Importance: Undecided
         Status: New

** Affects: mdadm (Ubuntu Focal)
     Importance: Undecided
         Status: New

** Also affects: linux (Ubuntu Trusty)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Xenial)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Eoan)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Disco)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Focal)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Precise)
   Importance: Undecided
       Status: New

** Also affects: mdadm (Ubuntu)
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1850540

Title:
  multi-zone raid0 corruption

Status in linux package in Ubuntu:
  New
Status in mdadm package in Ubuntu:
  New
Status in linux source package in Precise:
  New
Status in mdadm source package in Precise:
  New
Status in linux source package in Trusty:
  New
Status in mdadm source package in Trusty:
  New
Status in linux source package in Xenial:
  New
Status in mdadm source package in Xenial:
  New
Status in linux source package in Bionic:
  New
Status in mdadm source package in Bionic:
  New
Status in linux source package in Disco:
  New
Status in mdadm source package in Disco:
  New
Status in linux source package in Eoan:
  New
Status in mdadm source package in Eoan:
  New
Status in linux source package in Focal:
  New
Status in mdadm source package in Focal:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1850540/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
