Hi,

I hit this issue on Bionic, Disco and Eoan. Our (server-team) Jenkins
nodes are often filled with stale LXD containers left behind because
of "fails to destroy ZFS filesystem" errors.

Some thoughts and qualitative observations:

0. This is not a corner case; I see the problem all the time.

1. There is probably more than one issue involved here, even though we
get similar error messages when trying to delete a container.

2. One issue is about mount namespaces: stray mounts that prevent the
container from being deleted. This issue can be worked around by
entering the namespace and unmounting; the container can then be
deleted (a sketch follows below). When this happens, retrying
`lxc delete` doesn't help. This is described in [0]. I think newer
versions of LXD are much less prone to ending up in this case.
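
A sketch of that workaround, based on the commands quoted from
#lxc-dev further down (the mount point assumes the snap LXD; replace
<x> with the container name and <PID> with each PID found):

  # find processes whose mount namespace still references the dataset
  $ grep default/containers/<x> /proc/*/mountinfo
  # unmount the stray mount inside each offending namespace
  $ sudo nsenter -t <PID> -m -- \
      umount /var/snap/lxd/common/lxd/storage-pools/default/containers/<x>
  # the delete should now go through
  $ lxc delete <x>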

3. In other cases `lxc delete --force` fails with the "ZFS dataset is
busy" error, but the deletion succeeds if the delete is retried
immediately afterwards. In my case I don't even need to wait a single
second: the second delete in `lxc delete --force <x> ; lxc delete <x>`
already works. Stopping and then deleting the container as separate
operations also works.
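
Spelled out (illustrative; <x> is any affected container):

  # the forced delete stops the container but may fail at the ZFS
  # destroy step; the immediate second delete then succeeds
  $ lxc delete --force <x> || lxc delete <x>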

4. It has been suggested in [0] that LXD could retry the "delete"
operation if it fails. stgraber wrote that LXD *already* retries the
operation 20 times over 10 seconds, yet the outcome is still a
failure. It is not clear to me why retrying manually works while
LXD's automatic retrying does not.
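
As a purely illustrative shell sketch, my understanding of that
auto-retry (LXD implements this internally in Go, not like this; the
dataset name and the 0.5s interval are assumptions derived from "20
times over 10 seconds"):

  # hypothetical equivalent of the retry loop described above
  for i in $(seq 1 20); do
      zfs destroy default/containers/<x> && break
      sleep 0.5
  done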

5. Some weeks ago the error message changed from "Failed to destroy
ZFS filesystem: dataset is busy" to "Failed to destroy ZFS
filesystem:", with no further detail. I can't tell which specific
upgrade triggered this change.

6. I see this problem in both file-backed and device-backed zpools.
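
To be explicit about the two setups (hypothetical pool names, size
and device; this uses LXD's storage API):

  # loop-file-backed pool (a file on the host filesystem)
  $ lxc storage create pool1 zfs size=20GB
  # pool backed by a dedicated block device
  $ lxc storage create pool2 zfs source=/dev/sdb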

7. I'm not sure system load plays a role: I often hit the problem on my
lightly loaded laptop.

8. I don't have clear steps to reproduce the problem. While I can't
trigger it with 100% probability, I see it happening more often than
not. But see the next point.

9. In my experience a system can be in a "bad state" (the problem
always happens) or in a "good state" (the problem never happens). When
the system is in a "good state" we can `lxc delete` hundreds of
containers with no errors. I can't tell what makes a system switch
from a good to a bad state. I'm almost certain I also saw systems
switch from a bad to a good state.

10. The lxcfs package is not installed on the systems where I hit this
issue.

That's it for the moment. Thanks for looking into this!

Paride

[0] https://github.com/lxc/lxd/issues/4656

https://bugs.launchpad.net/bugs/1779156

Title:
  lxc 'delete' fails to destroy ZFS filesystem 'dataset is busy'

Status in linux package in Ubuntu:
  Triaged
Status in lxc package in Ubuntu:
  Confirmed
Status in linux source package in Cosmic:
  Triaged
Status in lxc source package in Cosmic:
  Confirmed

Bug description:
  I'm not sure exactly what got me into this state, but I have several
  lxc containers that cannot be deleted.

  $ lxc info
  <snip>
  api_status: stable
  api_version: "1.0"
  auth: trusted
  public: false
  auth_methods:
  - tls
  environment:
    addresses: []
    architectures:
    - x86_64
    - i686
    certificate: |
      -----BEGIN CERTIFICATE-----
      <snip>
      -----END CERTIFICATE-----
    certificate_fingerprint: 3af6f8b8233c5d9e898590a9486ded5c0bec045488384f30ea921afce51f75cb
    driver: lxc
    driver_version: 3.0.1
    kernel: Linux
    kernel_architecture: x86_64
    kernel_version: 4.15.0-23-generic
    server: lxd
    server_pid: 15123
    server_version: "3.2"
    storage: zfs
    storage_version: 0.7.5-1ubuntu15
    server_clustered: false
    server_name: milhouse

  $ lxc delete --force b1
  Error: Failed to destroy ZFS filesystem: cannot destroy 'default/containers/b1': dataset is busy

  Talking in #lxc-dev, stgraber and sforeshee provided a diagnosis:

   | short version is that something unshared a mount namespace causing
   | them to get a copy of the mount table at the time that dataset was
   | mounted, which then prevents zfs from being able to destroy it

  The workaround provided was:

   | you can unstick this particular issue by doing:
   |  grep default/containers/b1 /proc/*/mountinfo
   | then for any of the hits, do:
   |   nsenter -t PID -m -- umount /var/snap/lxd/common/lxd/storage-pools/default/containers/b1
   | then try the delete again

  ProblemType: Bug
  DistroRelease: Ubuntu 18.10
  Package: linux-image-4.15.0-23-generic 4.15.0-23.25
  ProcVersionSignature: Ubuntu 4.15.0-23.25-generic 4.15.18
  Uname: Linux 4.15.0-23-generic x86_64
  NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
  ApportVersion: 2.20.10-0ubuntu3
  Architecture: amd64
  AudioDevicesInUse:
   USER        PID ACCESS COMMAND
   /dev/snd/controlC1:  smoser    31412 F.... pulseaudio
   /dev/snd/controlC2:  smoser    31412 F.... pulseaudio
   /dev/snd/controlC0:  smoser    31412 F.... pulseaudio
  CurrentDesktop: ubuntu:GNOME
  Date: Thu Jun 28 10:42:45 2018
  EcryptfsInUse: Yes
  InstallationDate: Installed on 2015-07-23 (1071 days ago)
  InstallationMedia: Ubuntu 15.10 "Wily Werewolf" - Alpha amd64 (20150722.1)
  MachineType: (unreadable: 0xff-filled DMI strings)
  ProcEnviron:
   TERM=xterm-256color
   PATH=(custom, no user)
   XDG_RUNTIME_DIR=<set>
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcFB: 0 inteldrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-23-generic root=UUID=f897b32a-eacf-4191-9717-844918947069 ro quiet splash vt.handoff=1
  RelatedPackageVersions:
   linux-restricted-modules-4.15.0-23-generic N/A
   linux-backports-modules-4.15.0-23-generic  N/A
   linux-firmware                             1.174
  SourcePackage: linux
  UpgradeStatus: No upgrade log present (probably fresh install)
  dmi.bios.date: 03/09/2015
  dmi.bios.vendor: Intel Corporation
  dmi.bios.version: RYBDWi35.86A.0246.2015.0309.1355
  dmi.board.asset.tag: (unreadable: 0xff-filled DMI string)
  dmi.board.name: NUC5i5RYB
  dmi.board.vendor: Intel Corporation
  dmi.board.version: H40999-503
  dmi.chassis.asset.tag: (unreadable: 0xff-filled DMI string)
  dmi.chassis.type: 3
  dmi.chassis.vendor: (unreadable: 0xff-filled DMI string)
  dmi.chassis.version: (unreadable: 0xff-filled DMI string)
  dmi.modalias: dmi:bvnIntelCorporation:bvrRYBDWi35.86A.0246.2015.0309.1355:bd03/09/2015:svn:pn:pvr:rvnIntelCorporation:rnNUC5i5RYB:rvrH40999-503:cvn:ct3:cvr:
  dmi.product.family: (unreadable: 0xff-filled DMI string)
  dmi.product.name: (unreadable: 0xff-filled DMI string)
  dmi.product.version: (unreadable: 0xff-filled DMI string)
  dmi.sys.vendor: (unreadable: 0xff-filled DMI string)
