[Kernel-packages] [Bug 1832384] Re: Unable to unmount apparently unused filesystem

Colin Ian King Wed, 12 Jun 2019 01:51:26 -0700

I had a quick look, and there are quite a few differences between the
Delphix ZFS git repo and the upstream ZFS git repo. GitHub states: "This
branch is 376 commits ahead of zfsonlinux:master."


So:

1. Does this issue occur with stock ZFS rather than Delphix ZFS?

2. Given the activity you are performing to trigger the bug, it does seem
like there is a race condition on mnt_count, so this does look like a
kernel bug.  One way to quickly sanity check this is with a coarse
bisect using some pre-built Ubuntu kernels based on the mainline
kernels.

The wiki page https://wiki.ubuntu.com/Kernel/MainlineBuilds details our
mainline builds. You can select a mainline kernel from:
https://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D - install it,
run the tests and see whether the problem is fixed or still persists.  I
suggest trying a recent (say 5.1) kernel build first and, if that works,
bisecting between 4.15 and 5.1 until you find the earliest kernel that
contains the fix.
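
As a rough illustration of the coarse bisect, here is a minimal Python
sketch. It assumes the candidate kernels are listed oldest to newest and
that every kernel after the fix stays fixed. The version list and the
is_fixed callback are placeholders I made up for illustration; in
practice is_fixed would mean "install that mainline build, run the
workload, and report whether the EBUSY unmount failure is gone".

```python
from typing import Callable, Optional, Sequence


def earliest_fixed(versions: Sequence[str],
                   is_fixed: Callable[[str], bool]) -> Optional[str]:
    """Return the earliest version for which is_fixed() is true.

    Assumes versions are sorted oldest to newest and that once the fix
    lands, every later version stays fixed.
    """
    if not versions or not is_fixed(versions[-1]):
        return None  # newest kernel still fails: bisecting will not help
    lo, hi = 0, len(versions) - 1  # invariant: versions[hi] is fixed
    while lo < hi:
        mid = (lo + hi) // 2
        if is_fixed(versions[mid]):
            hi = mid        # fix is at mid or earlier
        else:
            lo = mid + 1    # fix landed after mid
    return versions[lo]


# Illustrative candidate list only; pick real builds from the mainline index.
CANDIDATES = ["4.15", "4.16", "4.17", "4.18", "4.19", "4.20", "5.0", "5.1"]
```

With eight candidates this needs at most four test runs (one for the
newest kernel plus three bisect steps), which matters when each run is
slow.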

If the problem still occurs with a 5.1 kernel, then we need to start
digging a bit deeper. But I think it is definitely worth trying a recent
kernel first, as the fix may already be in an upstream kernel.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1832384

Title:
  Unable to unmount apparently unused filesystem

Status in linux package in Ubuntu:
  In Progress

Bug description:
  We periodically see an issue where unmounting a ZFS filesystem fails
  with EBUSY, even though there appears to be no one using it.

      # cat /proc/self/mounts | grep /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/archive
      domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/archive /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/archive zfs rw,nosuid,nodev,noexec,relatime,xattr,noacl 0 0

  'lsof' and 'fuser' show no processes using any of the files in the
  problematic filesystem:

      # ls -l /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/archive/
      total 221
      -rw-r----- 1 500 500  52736 May 22 11:01 1_19_1008904362.dbf
      -rw-r----- 1 500 500 541696 May 22 11:03 1_20_1008904362.dbf
      # fuser /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/archive/1_20_1008904362.dbf
      # fuser /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/archive/1_19_1008904362.dbf
      # fuser /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/archive/
      # lsof | grep /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/archive
      #

  The filesystem was shared over NFS, but has since been unshared:

      # showmount -e | grep /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/archive
      #

  Since no one appears to be using the filesystem, our expectation is
  that it should be possible to unmount the filesystem. However,
  attempts to unmount the filesystem fail with EBUSY:

      # zfs destroy domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/archive
      umount: /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/archive: target is busy.
      cannot unmount '/domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/archive': umount failed
      # umount /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/archive
      umount: /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/archive: target is busy.

  
  Using bpftrace, we can see that the unmount is failing in
  'propagate_mount_busy()' in the kernel. Using a live kernel debugger,
  we can look at the 'mount' struct for this particular mount and see
  that the 'mnt_count' refcount summed across all CPUs is 2. For
  filesystems that are eligible for unmounting, the refcount is 1.

  The only way to work around this issue that we have found is to
  reboot, at which point the filesystem can be unmounted and destroyed.

  
  So far, we have only been able to reproduce this using a workload
  driven by our application. The application manages ZFS filesystems in
  groups, and the lifecycle of each group looks something like this:

      - Create and mount a group of filesystems, 1 parent and 4 children:
          /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370
          /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/datafile
          /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/external
          /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/archive
          /domain0/group-38/oracle_db_container-202/oracle_timeflow-16370/temp
      - Share all 5 filesystems over NFS
      - A client mounts all 5 shares using NFSv3
      - For a few hours, the client does NFS operations on the filesystems
        and the server occasionally takes ZFS snapshots of them
      - Unshare filesystems
      - Unmount filesystems
      - Delete filesystems
      - Unshare filesystems
      - Unmount filesystems
      - Delete filesystems
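
For anyone trying to reproduce this outside the application, the
server-side part of the lifecycle above might be approximated along the
following lines. This is only a sketch, not the application's actual
tooling: it is a dry run that builds the zfs command lines without
executing them, it assumes the ancestor datasets already exist, it uses
the sharenfs property for sharing (an assumption), and the snapshot
names and helper function are made up. The NFSv3 client activity is not
modelled.

```python
from typing import List

# Child dataset names taken from the report.
CHILDREN = ["datafile", "external", "archive", "temp"]


def lifecycle_commands(parent: str, snapshots: int = 1) -> List[List[str]]:
    """Build (but do not run) the zfs commands for one group's lifecycle."""
    datasets = [parent] + [f"{parent}/{child}" for child in CHILDREN]
    cmds: List[List[str]] = []
    # Create the parent first, then the children; with default properties
    # zfs create also mounts each new filesystem.
    for ds in datasets:
        cmds.append(["zfs", "create", ds])
    # Share all five filesystems over NFS.
    for ds in datasets:
        cmds.append(["zfs", "set", "sharenfs=on", ds])
    # Occasional recursive snapshots while the client is active.
    for i in range(snapshots):
        cmds.append(["zfs", "snapshot", "-r", f"{parent}@snap{i}"])
    # Unshare everything.
    for ds in datasets:
        cmds.append(["zfs", "set", "sharenfs=off", ds])
    # Unmount children before the parent; this is the step that
    # occasionally fails with EBUSY in the report.
    for ds in reversed(datasets):
        cmds.append(["zfs", "unmount", ds])
    # Delete the whole group, including its snapshots.
    cmds.append(["zfs", "destroy", "-r", parent])
    return cmds
```

Each command list could be passed to subprocess.run() on a test system;
keeping generation separate from execution makes it easy to review the
sequence first.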

  These groups of filesystems are constantly being created and
  destroyed. At any given time, we have ~30k filesystems on the system,
  about 5k of which are shared. On average, one out of ~200-300k
  unmounts fails with this EBUSY error. To create and destroy this many
  filesystems takes us about a week or so.
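
The failure rate above also bounds how long a test run on a candidate
kernel has to be before "no failure seen" means anything. The following
is a back-of-the-envelope sketch, assuming each unmount fails
independently with a fixed probability (an assumption, since the real
failure is presumably a race and may not behave that way); the function
names are made up for illustration.

```python
import math


def unmounts_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Clean unmounts required before a bug with the given per-unmount
    failure rate would have shown up at least once with the given
    probability, i.e. the smallest n with 1 - (1 - p)**n >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-failure_rate))


def weeks_needed(failure_rate: float, unmounts_per_week: float,
                 confidence: float = 0.95) -> float:
    """Same figure expressed as test duration at a given unmount rate."""
    return unmounts_needed(failure_rate, confidence) / unmounts_per_week


if __name__ == "__main__":
    for one_in in (200_000, 300_000):
        print(one_in, unmounts_needed(1.0 / one_in))
```

At one failure per 200k-300k unmounts, this works out to roughly three
times that many clean unmounts for 95% confidence, so on the order of
three weeks per kernel at the rate described above.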

  Note that we are using ZFS built from https://github.com/delphix/zfs,
  which is essentially master ZFS on Linux.

  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: linux-image-4.15.0-50-generic 4.15.0-50.54
  ProcVersionSignature: Ubuntu 4.15.0-50.54-generic 4.15.18
  Uname: Linux 4.15.0-50-generic x86_64
  NonfreeKernelModules: zfs zunicode zcommon znvpair zavl icp
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 May 20 19:10 seq
   crw-rw---- 1 root audio 116, 33 May 20 19:10 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
  ApportVersion: 2.20.9-0ubuntu7.6
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  Date: Tue Jun 11 05:28:21 2019
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
  Lsusb: Error: [Errno 2] No such file or directory: 'lsusb': 'lsusb'
  MachineType: VMware, Inc. VMware Virtual Platform
  PciMultimedia:
   
  ProcEnviron:
   TERM=xterm-256color
   PATH=(custom, no user)
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcFB: 0 svgadrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/ROOT/username.QbVhgpM/root@/boot/vmlinuz-4.15.0-50-generic root=ZFS=rpool/ROOT/username.QbVhgpM/root ro console=tty0 console=ttyS0,38400n8 ipv6.disable=1 crashkernel=1024M-:512M
  RelatedPackageVersions:
   linux-restricted-modules-4.15.0-50-generic N/A
   linux-backports-modules-4.15.0-50-generic  N/A
   linux-firmware                             1.173.6
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
  SourcePackage: linux
  UpgradeStatus: No upgrade log present (probably fresh install)
  WifiSyslog:
   
  dmi.bios.date: 09/21/2015
  dmi.bios.vendor: Phoenix Technologies LTD
  dmi.bios.version: 6.00
  dmi.board.name: 440BX Desktop Reference Platform
  dmi.board.vendor: Intel Corporation
  dmi.board.version: None
  dmi.chassis.asset.tag: No Asset Tag
  dmi.chassis.type: 1
  dmi.chassis.vendor: No Enclosure
  dmi.chassis.version: N/A
  dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd09/21/2015:svnVMware,Inc.:pnVMwareVirtualPlatform:pvrNone:rvnIntelCorporation:rn440BXDesktopReferencePlatform:rvrNone:cvnNoEnclosure:ct1:cvrN/A:
  dmi.product.name: VMware Virtual Platform
  dmi.product.version: None
  dmi.sys.vendor: VMware, Inc.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1832384/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
