On 5/25/25 00:58, Tim Woodall wrote:
Hi All,

This is the third time I've had the same problem. The first time, I put it down to a freak cosmic-ray event. The second time gave me doubts, but I couldn't come up with a good alternative explanation. Now that it's happened a third time, I'm hoping someone can suggest additional logging that might help debug the next occurrence, or even a possible (non-cosmic-ray) explanation.

At the end of this email is the complete log of the failed backup that I get emailed.

The setup: proxy17 is a VM guest running on a Xen host, and backup17 is another guest on the same host. Many other guests run on the same host and all write backups to backup17, but only proxy17 has this problem.

The process:

dump takes a low-level backup of an ext4 partition on proxy17. The backup is written to stdout and piped over ssh to backup17, where tee writes it to disk and also feeds it to sha256sum, which writes the hash to a separate file.
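In condensed form (the full commands appear in the log at the end; DIR here stands in for the per-run directory, e.g. proxy17.20250525.1):

/sbin/dump -z9 -1u -f - /dev/vg-proxy17/root-backup \
  | ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 \
      '/usr/bin/tee /dumps/DIR/_.dmp | /usr/bin/sha256sum >/dumps/DIR/_.dmp.sha256sum'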

The sha256sum matches the data on disk, so I know there's no disk corruption happening.

sha256sum _.dmp
f86f80d8b7370fe7e91570f379f4f53212a534f413878d1a4367a04ac031bfcc  _.dmp
cat _.dmp.sha256sum
f86f80d8b7370fe7e91570f379f4f53212a534f413878d1a4367a04ac031bfcc  -

The data is then read back (this may or may not actually involve reading from disk) and sent over ssh to proxy17, where restore verifies that the data in the backup matches the disk image.
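Again condensed (MOUNTPOINT is a placeholder for the read-only snapshot mount; in today's log it is /tmp/backup.TLzpiF2wx5):

ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 \
    'cat /dumps/DIR/_.dmp' \
  | /sbin/restore -C -D MOUNTPOINT -f -

restore -C reads the archive from stdin (-f -) and compares it against the filesystem rooted at the -D directory.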

This process has worked for 20+ years.

On 2025/04/29 (a Tuesday) the restore verify failed with a decompression error. I checked the sha256sum and then restored the data successfully (no decompression error). Because the failure was on a log file that had changed since the backup ran, I couldn't confirm that the file it failed on was identical to what was dumped, but the log looked normal with no visible corruption. I put this down to a cosmic ray flipping a bit in memory while the restore was running.

Then exactly the same thing happened again on 2025/05/04 (a Sunday). This time the file it aborted on was in sbin, and I could confirm that a subsequent restore (which succeeded with no decompression errors) produced a file identical to the one that was backed up.

It happened for a third time this morning, again on a Sunday. Sundays tend to be bigger backups, and I updated the kernel yesterday, so the compressed backup is around 180MB today.

Both backup17 and proxy17 are Xen guests running on the same Xen host. I back up about 15 guests in the same way on that host; none of the others has ever had this failure. I also back up another 25 or so machines: 15 or so on the same site but otherwise completely independent (they share network switches etc., but this backup doesn't go over the network), and another 10 or so at a different site, where all the backups go over the network to the backup host.

Nothing else has failed: no unexplained crashes, no kernel faults, no disk corruption (that I've noticed). Have I freakishly managed to be hit by a similar cosmic-ray event three times, or is there something else that could give the same symptoms?

Below I include the complete log of today's failure. There is no more data, and I cannot reproduce this at will, so I'm looking for ideas on where to add logging to track this down should it happen again. Is there anything obvious that could give these symptoms? (All three failures produced essentially identical logs.) The problem must surely be on the "return" path: if the data on disk were corrupted, I'd be able to reproduce the decompression error. So I don't see a good reason to sha256sum the data again before sending it over the ssh link, but can anyone else think of one?
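One idea I could try (a sketch only: /var/tmp/_.dmp.received is an arbitrary scratch path that needs roughly 180MB free) is to tee the returning stream to a file on proxy17 and hash it, so a failure leaves the received bytes behind for comparison (DIR and MOUNTPOINT as above):

ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 \
    'cat /dumps/DIR/_.dmp' \
  | tee /var/tmp/_.dmp.received \
  | /sbin/restore -C -D MOUNTPOINT -f -
sha256sum /var/tmp/_.dmp.received

If the received copy's hash differs from the stored _.dmp.sha256sum, the corruption happened on the return path (ssh, or memory on either end); if it matches and restore still errors, suspicion shifts to restore itself.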

The dump is OK and can be restored:
# restore -t -f _.dmp >/dev/null && echo $?
Dump tape is compressed.
0

The file is OK, can be restored, and matches the original:
# restore -i -f _.dmp
Dump tape is compressed.
restore > add usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko
restore > extract
You have not read any volumes yet.
Unless you know which volume your file(s) are on you should start
with the last volume and work towards the first.
Specify next volume # (none if no more volumes): 1
set owner/mode/attributes for '.'? [yn] n
restore > quit

root@backup17:# sha256sum ./usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko
fad808d9b24d457c2235b4e96e7b50872c0ff2b45bb3c2458034b074f9f99997  ./usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko

root@proxy17:# sha256sum /usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko
fad808d9b24d457c2235b4e96e7b50872c0ff2b45bb3c2458034b074f9f99997  /usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko

And, FWIW, it also matches a completely different machine:
tim@dirac:~ (none)$ sha256sum /usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko
fad808d9b24d457c2235b4e96e7b50872c0ff2b45bb3c2458034b074f9f99997  /usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko

Is the only possible explanation that there's a bug in restore and something about this particular backup tickles it once in a blue moon? There isn't an obvious link between an update and this starting to happen; for reference, here is the recent upgrade history for dump:

Start-Date: 2024-12-21  06:25:54
Commandline: apt-get -y upgrade
Upgrade: dump:amd64 (0.4b49-1+tjw12r4, 0.4b49-1+tjw12r5)
End-Date: 2024-12-21  06:25:59

Start-Date: 2024-12-30  06:26:01
Commandline: apt-get -y upgrade
Upgrade: dump:amd64 (0.4b49-1+tjw12r5, 0.4b49-1+tjw12r6)
End-Date: 2024-12-30  06:26:07

Start-Date: 2025-03-24  06:25:51
Commandline: apt-get -y upgrade
Upgrade: dump:amd64 (0.4b49-1+tjw12r6, 0.4b50-1+tjw12r1), local-sendmail-config:amd64 (12.1+tjw12r1, 12.2+tjw12r1)
End-Date: 2025-03-24  06:26:12

Start-Date: 2025-03-26  06:25:52
Commandline: apt-get -y upgrade
Upgrade: dump:amd64 (0.4b50-1+tjw12r1, 0.4b51-1+tjw12r1)
End-Date: 2025-03-26  06:26:02

Start-Date: 2025-05-02  06:26:06
Commandline: apt-get -y upgrade
Upgrade: dump:amd64 (0.4b51-1+tjw12r1, 0.4b52-1+tjw12r2)
End-Date: 2025-05-02  06:26:16

Start-Date: 2025-05-04  04:51:56
Commandline: apt-get -y upgrade
Upgrade: dump:amd64 (0.4b52-1+tjw12r2, 0.4b52-1+tjw12r3)
End-Date: 2025-05-04  04:52:03

Tim.

P.S. The script renamed the dump to _.dmp.bad; I've renamed it back, as there's nothing wrong with it, which is why the tests above don't appear to use the same file as the log.

Date: Sun, 25 May 2025 02:44:15 GMT
From: Cron Daemon <r...@proxy17.home.woodall.me.uk>
To: r...@proxy17.home.woodall.me.uk
Subject: Cron <root@proxy17> /usr/sbin/backup


/sbin/lvcreate -A n -l52 -s -nroot-backup /dev/vg-proxy17/root
   Logical volume "root-backup" created.
   WARNING: This metadata update is NOT backed up.

/sbin/e2fsck -p /dev/vg-proxy17/root-backup
/dev/vg-proxy17/root-backup: Clearing orphaned inode 39474 (uid=0, gid=0, mode=0100600, size=0)
/dev/vg-proxy17/root-backup: clean, 27942/126208 files, 294517/524288 blocks

ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'mkdir -p /dumps/proxy17.20250525.1'

/sbin/dump -z9 -1u -f - /dev/vg-proxy17/root-backup | ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 '/usr/bin/tee /dumps/proxy17.20250525.1/_.dmp | /usr/bin/sha256sum >/dumps/proxy17.20250525.1/_.dmp.sha256sum'
   DUMP: Date of this level 1 dump: Sun May 25 02:32:07 2025
   DUMP: Date of last level 0 dump: Sun May  4 02:32:12 2025
   DUMP: Dumping /dev/vg-proxy17/root-backup (an unlisted file system) to standard output
   DUMP: Label: none
   DUMP: Writing 10 Kilobyte records
   DUMP: Compressing output at transformation level 9 (zlib)
   DUMP: mapping (Pass I) [regular files]
   DUMP: mapping (Pass II) [directories]
   DUMP: estimated 586849 blocks.
   DUMP: Volume 1 started with block 1 at: Sun May 25 02:32:12 2025
   DUMP: dumping (Pass III) [directories]
   DUMP: dumping (Pass IV) [regular files]
   DUMP: 64.64% done at 1264 kB/s, finished in 0:02
   DUMP: Volume 1 completed at: Sun May 25 02:39:02 2025
   DUMP: Volume 1 took 0:06:50
   DUMP: Volume 1 transfer rate: 422 kB/s
   DUMP: Volume 1 583700kB uncompressed, 173243kB compressed, 3.370:1
   DUMP: 583700 blocks (570.02MB)
   DUMP: finished in 410 seconds, throughput 1423 kBytes/sec
   DUMP: Date of this level 1 dump: Sun May 25 02:32:07 2025
   DUMP: Date this dump completed:  Sun May 25 02:39:02 2025
   DUMP: Average transfer rate: 422 kB/s
   DUMP: Wrote 583700kB uncompressed, 173243kB compressed, 3.370:1
   DUMP: DUMP IS DONE

mount -o ro /dev/vg-proxy17/root-backup /tmp/backup.TLzpiF2wx5

ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'cat /dumps/proxy17.20250525.1/_.dmp' | /sbin/restore -C -D /tmp/backup.TLzpiF2wx5 -f -
/sbin/restore: cannot open /dev/tty: No such device or address
Dump tape is compressed.
decompression error, block 207131: data error
Tape decompression error while restoring ./usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko
continue? [yn]
Dump   date: Sun May 25 02:32:07 2025
Dumped from: Sun May  4 02:32:12 2025
Level 1 dump of an unlisted file system on
proxy17.home.woodall.me.uk:/dev/vg-proxy17/root-backup
Label: none
filesys = /tmp/backup.TLzpiF2wx5

umount /tmp/backup.TLzpiF2wx5

ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'mv /dumps/proxy17.20250525.1/_.dmp /dumps/proxy17.20250525.1/_.dmp.bad'

echo 'Verify of /dev/vg-proxy17/root-backup failed'
Verify of /dev/vg-proxy17/root-backup failed

/sbin/lvremove -A n -f /dev/vg-proxy17/root-backup
   Logical volume "root-backup" successfully removed.
   WARNING: This metadata update is NOT backed up.

mount | ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'cat >/dumps/proxy17.20250525.1/mount.log'

df | ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'cat >/dumps/proxy17.20250525.1/df'

fdisk -l | ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'cat >/dumps/proxy17.20250525.1/fdisk'

cat /var/lib/dumpdates | ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'cat >/dumps/proxy17.20250525.1/dumpdates'

cat /tmp/backup.log.7uH9zDqOIo | ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'cat >/dumps/proxy17.20250525.1/_.log'



Please run and post:

# cat /etc/debian_version ; uname -a


Does your computer have ECC memory?


Are you using integrity-assured filesystems (e.g. btrfs or ZFS)?


Can you reduce your process to a minimal Bourne shell script that exits 0 on success and 1 on error, and that readers can run on their own machines to debug the issue?
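For instance, a skeleton along these lines (DEV, MNT and HOST are placeholders for your setup, and it assumes restore -C exits non-zero on a comparison or decompression failure):

#!/bin/sh
# Round-trip a compressed dump through ssh and verify it against the source.
DEV=/dev/vg-test/root-backup   # filesystem to dump (placeholder)
MNT=/mnt/test                  # read-only mount of that filesystem (placeholder)
HOST=backup@backup17           # remote end of the round trip (placeholder)

/sbin/dump -z9 -0 -f - "$DEV" \
  | ssh -e none "$HOST" 'cat >/tmp/test.dmp' || exit 1
ssh -e none "$HOST" 'cat /tmp/test.dmp' \
  | /sbin/restore -C -D "$MNT" -f - || exit 1
exit 0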


David
