I can reproduce this on a Google Cloud n1-standard-16 with 2x local
NVMe disks: partition nvme0n1 and nvme0n2 with a single 8GB partition
each, then format directly with ext4 (skipping LVM).

In this setup each 'check' takes under a minute, which speeds up
testing considerably. Instance details below; the preemptible price
works out to about $0.292/hour, i.e. roughly $7/day.

gcloud compute instances create raid10-test --project=juju2-157804 \
        --zone=us-west1-b \
        --machine-type=n1-standard-16 \
        --subnet=default \
        --network-tier=STANDARD \
        --no-restart-on-failure \
        --maintenance-policy=TERMINATE \
        --preemptible \
        --boot-disk-size=32GB \
        --boot-disk-type=pd-ssd \
        --image=ubuntu-1804-bionic-v20201116 --image-project=ubuntu-os-cloud \
        --local-ssd=interface=NVME  --local-ssd=interface=NVME

# apt install linux-image-virtual
# apt-get remove linux-image-gcp linux-image-5.4.0-1029-gcp linux-image-unsigned-5.4.0-1029-gcp --purge
# reboot

sgdisk -n 0:0:+8G /dev/nvme0n1
sgdisk -n 0:0:+8G /dev/nvme0n2
mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme0n1p1 /dev/nvme0n2p1
mkfs.ext4 /dev/md0
mount /dev/md0 /mnt
dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M; sync; rm /mnt/data.raw
echo check >/sys/block/md0/md/sync_action; watch 'grep . /proc/mdstat /sys/block/md0/md/mismatch_cnt' # no mismatch
fstrim -v /mnt
echo check >/sys/block/md0/md/sync_action; watch 'grep . /proc/mdstat /sys/block/md0/md/mismatch_cnt' # mismatch_cnt=256
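For repeated runs, the check-then-watch step above can be scripted instead of eyeballed; this is a sketch (mdstat_busy is my name, not an existing tool) that decides from /proc/mdstat text whether a sync operation is still in flight:

```shell
# Sketch: detect from /proc/mdstat text (on stdin) whether a
# check/resync/repair is still running. md prints a progress line like
# "check = 12.3%" while one is in flight and drops it when idle.
mdstat_busy() {
    grep -qE '(check|resync|repair) ?='
}

# Intended usage against a real array (requires root and /dev/md0):
#   echo check >/sys/block/md0/md/sync_action
#   while mdstat_busy </proc/mdstat; do sleep 5; done
#   cat /sys/block/md0/md/mismatch_cnt
```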

I ran blktrace /dev/md0 /dev/nvme0n1 /dev/nvme0n2 and will upload the
results; I haven't had time to analyse them yet.
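The capture should show which leg the discards were queued to. One way to pull them out, as a sketch (the awk field positions assume blkparse's default column layout, and 'discards' is my name for the helper):

```shell
# Sketch: filter queued discard requests out of blkparse output.
# blkparse default columns: dev cpu seq timestamp pid action rwbs sector + len
# Discards carry 'D' in the RWBS field; print device, sector and length.
discards() {
    awk '$6 == "Q" && $7 ~ /D/ { print $1, $8, $10 }'
}

# Intended usage on the trace captured above:
#   blkparse -i nvme0n1 | discards
```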

Some thoughts:
 - It was asserted that the first disk 'appears' fine, so I wondered
whether we can reliably repair by asking mdadm to do a 'repair' or
'resync'.
 - Since https://www.spinics.net/lists/raid/msg62762.html, reads are at
least sometimes balanced (possibly by PID) across the disks; it is
unclear whether the same selection affects writes (not that it would
help performance).
 - So it's unclear we can reliably say that only a 'passive mirror' is
being corrupted; application reads may or may not return corrupted
data. More testing and understanding of the code is required.
 - This area of RAID10 and RAID1 seems quite under-documented: "man md"
says little about how, or from which disk, a mismatch is repaired
(unlike RAID5, where the parity gives us some assurance as to which
data is wrong).
 - We should try writes from different PIDs, with known different data,
and compare the data on both disks against the known data, to see
whether we can knowingly get wrong data on both disks or only one. And
try that with 4 disks instead of 2.
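The last experiment could start from something like this sketch (write_pattern, the file names and sizes are illustrative assumptions, not an existing tool):

```shell
# Sketch: write a known single-byte pattern from distinct PIDs so each
# leg of the mirror can later be checked against known data.
write_pattern() {
    # fill file $1 with $2 bytes of the single character $3
    head -c "$2" /dev/zero | tr '\0' "$3" > "$1"
}

# Intended usage on the array mounted at /mnt (each subshell is its own
# PID, which may matter if leg selection is PID-based):
#   for i in 1 2 3 4; do
#       ( write_pattern /mnt/pid$i.raw 4194304 "$i" && sync ) &
#   done
#   wait
#   # then dump the same LBA range from each component partition and
#   # compare with the known patterns; note the md superblock region
#   # always differs between legs.
```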

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1907262

Title:
  raid10: discard leads to corrupted file system

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Focal:
  In Progress
Status in linux source package in Groovy:
  In Progress

Bug description:
  Seems to be closely related to
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578

  After updating the Ubuntu 18.04 kernel from 4.15.0-124 to 4.15.0-126
  the fstrim command triggered by fstrim.timer causes a severe number of
  mismatches between two RAID10 component devices.

  This bug affects several machines in our company with different HW
  configurations (All using ECC RAM). Both, NVMe and SATA SSDs are
  affected.

  How to reproduce:
   - Create a RAID10 LVM and filesystem on two SSDs
      mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme0n1p2 /dev/nvme1n1p2
      pvcreate -ff -y /dev/md0
      vgcreate -f -y VolGroup /dev/md0
      lvcreate -n root    -L 100G -ay -y VolGroup
      mkfs.ext4 /dev/VolGroup/root
      mount /dev/VolGroup/root /mnt
   - Write some data, sync and delete it
      dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
      sync
      rm /mnt/data.raw
   - Check the RAID device
      echo check >/sys/block/md0/md/sync_action
   - After finishing (see /proc/mdstat), check the mismatch_cnt (should be 0):
      cat /sys/block/md0/md/mismatch_cnt
   - Trigger the bug
      fstrim /mnt
   - Re-Check the RAID device
      echo check >/sys/block/md0/md/sync_action
   - After finishing (see /proc/mdstat), check the mismatch_cnt (probably in the range of N*10000):
      cat /sys/block/md0/md/mismatch_cnt

  After investigating this issue on several machines it *seems* that the
  first drive does the trim correctly while the second one goes wild. At
  least the number and severity of errors found by fsck.ext4 from a USB
  stick live session suggests this.

  To perform the single-drive evaluation, the RAID10 was started with one
  drive at a time:
    mdadm --assemble /dev/md127 /dev/nvme0n1p2
    mdadm --run /dev/md127
    fsck.ext4 -n -f /dev/VolGroup/root

    vgchange -a n /dev/VolGroup
    mdadm --stop /dev/md127

    mdadm --assemble /dev/md127 /dev/nvme1n1p2
    mdadm --run /dev/md127
    fsck.ext4 -n -f /dev/VolGroup/root

  When starting these fscks without -n, on the first device it seems the
  directory structure is OK while on the second device there is only the
  lost+found folder left.

  Side-note: Another machine using HWE kernel 5.4.0-56 (after using -53
  before) seems to have a quite similar issue.

  Unfortunately the risk/regression assessment in the aforementioned bug
  is not complete: the workaround only mitigates the issues during FS
  creation. This bug on the other hand is triggered by a weekly service
  (fstrim) causing severe file system corruption.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
