Thank you for your efforts! FWIW, yes, I saw the panic two more times
when using the method you described.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to zfs-linux in Ubuntu.
https://bugs.launchpad.net/bugs/2110885

Title:
  Kernel panic when unmounting ZFS snapshots

Status in zfs-linux package in Ubuntu:
  Fix Released
Status in zfs-linux source package in Noble:
  In Progress
Status in zfs-linux source package in Oracular:
  In Progress
Status in zfs-linux source package in Plucky:
  In Progress
Status in zfs-linux source package in Questing:
  Fix Released

Bug description:
  [Impact]
  ZFS mount/unmount operations can get stuck in the 'D' (uninterruptible
  sleep) state, preventing access to any datasets in the affected pool.
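
  A quick way to spot such stuck tasks:

  # ps -eo pid,stat,comm,wchan | awk '$2 ~ /^D/'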

  [Test Plan]
  This is not easily reproducible, but it seems to happen more frequently when
  repeatedly mounting and unmounting ZFS snapshots. Below is a simple test loop
  that eventually triggers the kernel panic on a test system.

  1. Set up a regular ZFS pool
     # zpool create pooltest sda sdb sdc
  2. Create a ZFS filesystem on the new pool
     # zfs create pooltest/data
  3. Write random data to the ZFS dataset, running from the dataset's
     mountpoint (/pooltest/data by default). For convenience, we'll use the
     attached zfs_write_unified.py script (a rough stand-in is sketched after
     this list)
     # python3 zfs_write_unified.py .
  4. Create a snapshot of pooltest/data
     # zfs snapshot pooltest/data@snapshot1
  5. Mount/unmount this snapshot while zfs_write_unified.py is still running
     (create the mountpoint first with: mkdir -p /var/tmp/snapshot1)
     # while true; do sudo mount -t zfs pooltest/data@snapshot1 /var/tmp/snapshot1 && sleep 0.5 && sudo umount /var/tmp/snapshot1; done
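
  If the attached script is not at hand, any writer that keeps dirtying the
  dataset should do; a rough shell stand-in (hypothetical, assuming the
  default /pooltest/data mountpoint):

     # cd /pooltest/data
     # while true; do dd if=/dev/urandom of=junk.$RANDOM bs=1M count=4 status=none; done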

  [Where problems could occur]
  This is a follow-up fix for an upstream zfs_prune patch. Potential regressions
  would most likely show up in ZFS cleanup operations such as pool scrubs, as
  well as on the unmount path.
  We should properly exercise the mount/unmount code paths, as well as snapshot
  creation and deletion.
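
  A simple smoke test for the snapshot paths could look like this (a sketch,
  reusing the pool from the test plan above):

  # for i in $(seq 1 100); do zfs snapshot pooltest/data@s$i && zfs destroy pooltest/data@s$i; done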

  [Other Info]
  The fix was merged as part of the upstream v2.3.2 release, so it's not
  required for Questing. The buggy v1 patch has been backported to Noble, so
  only that release and newer are affected.

  The breaking commit is
  38c0324c0fb6 Linux: Fix zfs_prune panics

  And the fix was introduced by
  a0e62718cfcf Linux: Fix zfs_prune panics v2 (#17121)
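
  To check whether a given installation already carries the fixed code, the
  reported module version is a quick hint (upstream v2.3.2 or later includes
  the fix; for the Ubuntu SRU builds, the package changelog is authoritative):

  # zfs version
  # apt changelog zfsutils-linux | head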

  --
  Every now and then, the `umount` command gets stuck in the `D` state when
  unmounting ZFS snapshots:

  # ps aux | grep umount
  root      912290  0.0  0.0  10344  2560 ?        D    Apr26   0:01 umount /mnt/zfs-snapshot-backup/var/opt/jira
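
  The kernel-side stack of the stuck task can be read from /proc/912290/stack
  (as root); with sysrq enabled, all blocked tasks can also be dumped to the
  kernel log:

  # echo w > /proc/sysrq-trigger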

  At the same time, we can see a kernel oops/panic in `dmesg`:

  Sat 2025-04-26 02:15:43 UTC systemd[1]: mnt-zfs\x2dsnapshot\x2dbackup-var-opt-jira.mount: Deactivated successfully.
  Sat 2025-04-26 02:15:44 UTC kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
  Sat 2025-04-26 02:15:44 UTC kernel: #PF: supervisor instruction fetch in kernel mode
  Sat 2025-04-26 02:15:44 UTC kernel: #PF: error_code(0x0010) - not-present page
  Sat 2025-04-26 02:15:44 UTC kernel: PGD 8000000131251067 P4D 8000000131251067 PUD 0
  Sat 2025-04-26 02:15:44 UTC kernel: Oops: 0010 [#1] PREEMPT SMP PTI
  Sat 2025-04-26 02:15:44 UTC kernel: CPU: 0 PID: 486 Comm: arc_prune Tainted: P           O       6.8.0-58-generic #60-Ubuntu
  Sat 2025-04-26 02:15:44 UTC kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
  Sat 2025-04-26 02:15:44 UTC kernel: RIP: 0010:0x0
  Sat 2025-04-26 02:15:44 UTC kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
  Sat 2025-04-26 02:15:44 UTC kernel: RSP: 0018:ffffb845c0cebd40 EFLAGS: 00010246
  Sat 2025-04-26 02:15:44 UTC kernel: RAX: 0000000000000000 RBX: ffffb845c0cebdac RCX: 0000000000000000
  Sat 2025-04-26 02:15:44 UTC kernel: RDX: 0000000000000000 RSI: ffffb845c0cebd48 RDI: ffff8f8bb1fa4f00
  Sat 2025-04-26 02:15:44 UTC kernel: RBP: ffffb845c0cebd98 R08: 0000000000000000 R09: 0000000000000000
  Sat 2025-04-26 02:15:44 UTC kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000009ca5
  Sat 2025-04-26 02:15:44 UTC kernel: R13: 0000000000000000 R14: ffff8f8ab33dc000 R15: ffff8f8bb1fa4f00
  Sat 2025-04-26 02:15:44 UTC kernel: FS:  0000000000000000(0000) GS:ffff8f8cede00000(0000) knlGS:0000000000000000
  Sat 2025-04-26 02:15:44 UTC kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  Sat 2025-04-26 02:15:44 UTC kernel: CR2: ffffffffffffffd6 CR3: 000000013837e002 CR4: 00000000001706f0
  Sat 2025-04-26 02:15:44 UTC kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  Sat 2025-04-26 02:15:44 UTC kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Sat 2025-04-26 02:15:44 UTC kernel: Call Trace:
  Sat 2025-04-26 02:15:44 UTC kernel:  <TASK>
  Sat 2025-04-26 02:15:44 UTC kernel:  ? show_regs+0x6d/0x80
  Sat 2025-04-26 02:15:44 UTC kernel:  ? __die+0x24/0x80
  Sat 2025-04-26 02:15:44 UTC kernel:  ? page_fault_oops+0x99/0x1b0
  Sat 2025-04-26 02:15:44 UTC kernel:  ? do_user_addr_fault+0x2e9/0x670
  Sat 2025-04-26 02:15:44 UTC kernel:  ? free_large_kmalloc+0x6b/0xc0
  Sat 2025-04-26 02:15:44 UTC kernel:  ? exc_page_fault+0x83/0x1b0
  Sat 2025-04-26 02:15:44 UTC kernel:  ? asm_exc_page_fault+0x27/0x30
  Sat 2025-04-26 02:15:44 UTC kernel:  zfs_prune+0x90/0x130 [zfs]
  Sat 2025-04-26 02:15:44 UTC kernel:  zpl_prune_sb+0x35/0x60 [zfs]
  Sat 2025-04-26 02:15:44 UTC kernel:  arc_prune_task+0x22/0x40 [zfs]
  Sat 2025-04-26 02:15:44 UTC kernel:  taskq_thread+0x1f6/0x3c0 [spl]
  Sat 2025-04-26 02:15:44 UTC kernel:  ? __pfx_default_wake_function+0x10/0x10
  Sat 2025-04-26 02:15:44 UTC kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
  Sat 2025-04-26 02:15:44 UTC kernel:  kthread+0xf2/0x120
  Sat 2025-04-26 02:15:44 UTC kernel:  ? __pfx_kthread+0x10/0x10
  Sat 2025-04-26 02:15:44 UTC kernel:  ret_from_fork+0x47/0x70
  Sat 2025-04-26 02:15:44 UTC kernel:  ? __pfx_kthread+0x10/0x10
  Sat 2025-04-26 02:15:44 UTC kernel:  ret_from_fork_asm+0x1b/0x30
  Sat 2025-04-26 02:15:44 UTC kernel:  </TASK>
  Sat 2025-04-26 02:15:44 UTC kernel: Modules linked in: tls tcp_diag udp_diag inet_diag xt_comment xt_set ip_set_hash_net ip_set_hash_ip ip_set xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables cfg80211 binfmt_misc intel_rapl_msr intel_rapl_common kvm_intel kvm irqbypass rapl qxl drm_ttm_helper ttm i2c_piix4 zfs(PO) pvpanic_mmio pvpanic qemu_fw_cfg spl(O) input_leds joydev mac_hid serio_raw sch_fq_codel dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 crct10dif_pclmul crc32_pclmul hid_generic polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 floppy virtio_rng psmouse pata_acpi usbhid hid aesni_intel crypto_simd cryptd
  Sat 2025-04-26 02:15:44 UTC kernel: CR2: 0000000000000000
  Sat 2025-04-26 02:15:44 UTC kernel: ---[ end trace 0000000000000000 ]---
  Sat 2025-04-26 02:15:44 UTC kernel: RIP: 0010:0x0
  Sat 2025-04-26 02:15:44 UTC kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
  Sat 2025-04-26 02:15:44 UTC kernel: RSP: 0018:ffffb845c0cebd40 EFLAGS: 00010246
  Sat 2025-04-26 02:15:44 UTC kernel: RAX: 0000000000000000 RBX: ffffb845c0cebdac RCX: 0000000000000000
  Sat 2025-04-26 02:15:44 UTC kernel: RDX: 0000000000000000 RSI: ffffb845c0cebd48 RDI: ffff8f8bb1fa4f00
  Sat 2025-04-26 02:15:44 UTC kernel: RBP: ffffb845c0cebd98 R08: 0000000000000000 R09: 0000000000000000
  Sat 2025-04-26 02:15:44 UTC kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000009ca5
  Sat 2025-04-26 02:15:44 UTC kernel: R13: 0000000000000000 R14: ffff8f8ab33dc000 R15: ffff8f8bb1fa4f00
  Sat 2025-04-26 02:15:44 UTC kernel: FS:  0000000000000000(0000) GS:ffff8f8cede00000(0000) knlGS:0000000000000000
  Sat 2025-04-26 02:15:44 UTC kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  Sat 2025-04-26 02:15:44 UTC kernel: CR2: ffffffffffffffd6 CR3: 000000013837e002 CR4: 00000000001706f0
  Sat 2025-04-26 02:15:44 UTC kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  Sat 2025-04-26 02:15:44 UTC kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Sat 2025-04-26 02:15:44 UTC kernel: note: arc_prune[486] exited with irqs disabled

  Here is another stack trace from the same situation, on a different VM:

  [May10 04:58] general protection fault, probably for non-canonical address 0x636f6c2f7273752f: 0000 [#1] PREEMPT SMP NOPTI
  [  +0.000037] CPU: 3 PID: 676 Comm: arc_prune Tainted: P           O       6.8.0-55-generic #57-Ubuntu
  [  +0.000022] Hardware name: Hetzner vServer/Standard PC (Q35 + ICH9, 2009), BIOS 20171111 11/11/2017
  [  +0.000020] RIP: 0010:srso_alias_safe_ret+0x5/0x7
  [  +0.000019] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 48 8d 64 24 08 <c3> cc e8 f4 ff ff ff 0f 0b cc cc cc cc cc cc cc cc cc cc cc cc cc
  [  +0.000044] RSP: 0018:ffff9e32c043fd38 EFLAGS: 00010293
  [  +0.000015] RAX: 636f6c2f7273752f RBX: ffff9e32c043fdac RCX: 0000000000000000
  [  +0.000016] RDX: 0000000000000000 RSI: ffff9e32c043fd48 RDI: ffff8d5002f1db80
  [  +0.000016] RBP: ffff9e32c043fd98 R08: 0000000000000000 R09: 0000000000000000
  [  +0.000021] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000059b
  [  +0.000016] R13: 0000000000000000 R14: ffff8d5010ada000 R15: ffff8d5002f1db80
  [  +0.000019] FS:  0000000000000000(0000) GS:ffff8d5730780000(0000) knlGS:0000000000000000
  [  +0.000019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  +0.000013] CR2: 00007fdf8ea1b000 CR3: 0000000113a3c004 CR4: 0000000000770ef0
  [  +0.000017] PKRU: 55555554
  [  +0.000009] Call Trace:
  [  +0.000009]  <TASK>
  [  +0.000010]  ? show_regs+0x6d/0x80
  [  +0.000014]  ? die_addr+0x37/0xa0
  [  +0.000011]  ? exc_general_protection+0x1db/0x480
  [  +0.000015]  ? srso_alias_return_thunk+0x5/0xfbef5
  [  +0.000015]  ? asm_exc_general_protection+0x27/0x30
  [  +0.000017]  ? srso_alias_safe_ret+0x5/0x7
  [  +0.000012]  ? srso_alias_return_thunk+0x5/0xfbef5
  [  +0.000014]  ? zfs_prune+0xf7/0x130 [zfs]
  [  +0.000234]  zpl_prune_sb+0x35/0x60 [zfs]
  [  +0.000202]  arc_prune_task+0x22/0x40 [zfs]
  [  +0.000211]  taskq_thread+0x1f6/0x3c0 [spl]
  [  +0.000026]  ? __pfx_default_wake_function+0x10/0x10
  [  +0.000019]  ? __pfx_taskq_thread+0x10/0x10 [spl]
  [  +0.000023]  kthread+0xf2/0x120
  [  +0.000013]  ? __pfx_kthread+0x10/0x10
  [  +0.000014]  ret_from_fork+0x47/0x70
  [  +0.000013]  ? __pfx_kthread+0x10/0x10
  [  +0.000013]  ret_from_fork_asm+0x1b/0x30
  [  +0.000017]  </TASK>
  [  +0.000009] Modules linked in: tls tcp_diag udp_diag inet_diag xt_comment xt_set ip_set_hash_net ip_set_hash_ip ip_set xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables binfmt_misc nls_iso8859_1 zfs(PO) spl(O) input_leds joydev serio_raw sch_fq_codel dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 hid_generic usbhid hid crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 ahci psmouse libahci virtio_gpu xhci_pci virtio_rng xhci_pci_renesas virtio_dma_buf aesni_intel crypto_simd cryptd
  [  +0.000208] ---[ end trace 0000000000000000 ]---
  [  +0.758503] RIP: 0010:srso_alias_safe_ret+0x5/0x7
  [  +0.000038] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 48 8d 64 24 08 <c3> cc e8 f4 ff ff ff 0f 0b cc cc cc cc cc cc cc cc cc cc cc cc cc
  [  +0.000039] RSP: 0018:ffff9e32c043fd38 EFLAGS: 00010293
  [  +0.000579] RAX: 636f6c2f7273752f RBX: ffff9e32c043fdac RCX: 0000000000000000
  [  +0.000614] RDX: 0000000000000000 RSI: ffff9e32c043fd48 RDI: ffff8d5002f1db80
  [  +0.000615] RBP: ffff9e32c043fd98 R08: 0000000000000000 R09: 0000000000000000
  [  +0.000575] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000059b
  [  +0.000503] R13: 0000000000000000 R14: ffff8d5010ada000 R15: ffff8d5002f1db80
  [  +0.000450] FS:  0000000000000000(0000) GS:ffff8d5730780000(0000) knlGS:0000000000000000
  [  +0.000504] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  +0.000481] CR2: 00007fdf8ea1b000 CR3: 000000010f716005 CR4: 0000000000770ef0
  [  +0.000381] PKRU: 55555554

  The end result is a system that cannot be shut down cleanly anymore,
  because unmounting never finishes.

  This is *not* easily reproducible. We run about 300 systems with
  Ubuntu 24.04, each one mounting and unmounting ZFS snapshots at least
  once per day. On those, we saw the bug 3 times in the last 2 months or
  so.

  Mounting/unmounting ZFS snapshots is part of our backup software.
  We've been doing that for many years now and this bug only started
  appearing with Ubuntu 24.04.

  Let me know if you need any more info. Thanks!

  More info:

  # lsb_release -rd
  No LSB modules are available.
  Description:    Ubuntu 24.04.2 LTS
  Release:        24.04

  # apt-cache policy zfsutils-linux
  zfsutils-linux:
    Installed: 2.2.2-0ubuntu9.2
    Candidate: 2.2.2-0ubuntu9.2
    Version table:
   *** 2.2.2-0ubuntu9.2 500
          500 mirror+file:/etc/apt/mirrors/ubuntu.txt noble-updates/main amd64 Packages
          100 /var/lib/dpkg/status

  # uname -a
  Linux foo 6.8.0-59-generic #61-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 11 23:16:11 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/2110885/+subscriptions

