Explanatory note on bionic-proposed (see comment #17):

The flaky fail set had a long tail (tests that failed just a few times).

Symptom:
---

The logs show the failure cause is output mismatch, because '/target'
failed to umount (busy), and it was logged; e.g.,

        btrfs/027       [failed, exit status 1]- output mismatch (see 
/home/ubuntu/xfstests-dev/results//btrfs/027.out.bad)
            --- tests/btrfs/027.out     2020-10-26 22:23:54.626341276 +0000
            +++ /home/ubuntu/xfstests-dev/results//btrfs/027.out.bad    
2020-11-17 06:34:12.623548481 +0000
            @@ -1,2 +1,3 @@
             QA output created by 027
            -Silence is golden
            +umount: /mnt/scratch: target is busy.
            +failed to unmount /dev/loop2
            ...

The number of failed tests increases with the number of umount failures.
Comparing both numbers per log file: (line breaks added for clarity.)

        $ (grep ^Failed xfstests.log.2020-11-1*; \
           grep -c 'umount: /mnt/scratch: target is busy' 
xfstests.log.2020-11-1* | grep -v :0) \
             | sort
        xfstests.log.2020-11-16-21-03-32:Failed 31 of 823 tests // 30ish is 
normal failure count
        xfstests.log.2020-11-16-22-16-38:Failed 34 of 823 tests
        xfstests.log.2020-11-16-23-25-56:Failed 34 of 823 tests
        xfstests.log.2020-11-17-00-36-27:Failed 36 of 823 tests
        xfstests.log.2020-11-17-01-44-04:Failed 37 of 823 tests
        xfstests.log.2020-11-17-02-52-41:Failed 37 of 823 tests // things go 
bad after this. umount hung here.

        xfstests.log.2020-11-17-04-07-11:15
        xfstests.log.2020-11-17-04-07-11:Failed 68 of 823 tests // umount hung 
here too.

        xfstests.log.2020-11-17-05-21-34:76
        xfstests.log.2020-11-17-05-21-34:Failed 158 of 823 tests

        xfstests.log.2020-11-17-06-29-11:148
        xfstests.log.2020-11-17-06-29-11:Failed 278 of 823 tests

        xfstests.log.2020-11-17-07-29-57:31
        xfstests.log.2020-11-17-07-29-57:Failed 104 of 823 tests

        xfstests.log.2020-11-17-08-37-16:33
        xfstests.log.2020-11-17-08-37-16:Failed 87 of 823 tests

        xfstests.log.2020-11-17-09-50-43:36
        xfstests.log.2020-11-17-09-50-43:Failed 101 of 823 tests

        xfstests.log.2020-11-17-11-04-16:43
        xfstests.log.2020-11-17-11-04-16:Failed 111 of 823 tests

        xfstests.log.2020-11-17-12-08-14:68
        xfstests.log.2020-11-17-12-08-14:Failed 143 of 823 tests

        xfstests.log.2020-11-17-13-14-01:93
        xfstests.log.2020-11-17-13-14-01:Failed 200 of 823 tests

And the kernel log files show that 'umount' hung in the timestamps where
the increase started.

        $ sudo grep umount kern.log.for.2020-11-1*
        kern.log.for.2020-11-17-02-52-41:Nov 17 03:52:55 mfo-s390x-bionic 
kernel: [24748.572259] CPU: 4 PID: 10862 Comm: umount Tainted: G        W       
 4.15.0-125-generic #128-Ubuntu
        kern.log.for.2020-11-17-02-52-41:Nov 17 03:53:10 mfo-s390x-bionic 
kernel: [24762.822089] CPU: 4 PID: 11705 Comm: umount Tainted: G        W       
 4.15.0-125-generic #128-Ubuntu
        kern.log.for.2020-11-17-02-52-41:Nov 17 03:53:16 mfo-s390x-bionic 
kernel: [24768.881926] CPU: 4 PID: 12110 Comm: umount Tainted: G        W       
 4.15.0-125-generic #128-Ubuntu

        kern.log.for.2020-11-17-04-07-11:Nov 17 03:52:55 mfo-s390x-bionic 
kernel: [24748.572259] CPU: 4 PID: 10862 Comm: umount Tainted: G        W       
 4.15.0-125-generic #128-Ubuntu
        kern.log.for.2020-11-17-04-07-11:Nov 17 03:53:10 mfo-s390x-bionic 
kernel: [24762.822089] CPU: 4 PID: 11705 Comm: umount Tainted: G        W       
 4.15.0-125-generic #128-Ubuntu
        kern.log.for.2020-11-17-04-07-11:Nov 17 03:53:16 mfo-s390x-bionic 
kernel: [24768.881926] CPU: 4 PID: 12110 Comm: umount Tainted: G        W       
 4.15.0-125-generic #128-Ubuntu

This is the stack trace (s390x kernel pointer addresses.)

        24748.572081] ------------[ cut here ]------------
        [24748.572083] BTRFS: Transaction aborted (error -28) // i.e., -ENOSPC
        [24748.572217] WARNING: CPU: 4 PID: 10862 at 
/build/linux-9nQaV0/linux-4.15.0/fs/btrfs/extent-tree.c:3097 
btrfs_run_delayed_refs+0x208/0x260 [btrfs]
        ...
        [24748.572259] CPU: 4 PID: 10862 Comm: umount Tainted: G        W       
 4.15.0-125-generic #128-Ubuntu
        [24748.572259] Hardware name: IBM 2964 N63 400 (KVM/Linux)
        ...
        [24748.572297] Call Trace:
        [24748.572309] ([<000003ff8039710c>] btrfs_run_delayed_refs+0x204/0x260 
[btrfs])
        [24748.572322]  [<000003ff803b0b46>] 
btrfs_commit_transaction+0x66/0x9e8 [btrfs]
        [24748.572330]  [<00000000003cdf9e>] __sync_filesystem+0x56/0x80
        [24748.572333]  [<000000000038f088>] generic_shutdown_super+0x48/0x168
        [24748.572334]  [<000000000038f50e>] kill_anon_super+0x2e/0x40
        [24748.572344]  [<000003ff80375b60>] btrfs_kill_super+0x30/0x148 [btrfs]
        [24748.572346]  [<000000000038fa90>] deactivate_locked_super+0x70/0xa0
        [24748.572348]  [<00000000003b6b94>] cleanup_mnt+0x64/0xa8
        [24748.572351]  [<000000000019a542>] task_work_run+0xba/0x100
        [24748.572361]  [<0000000000108a6e>] do_notify_resume+0x4e/0x60
        [24748.572365]  [<00000000008ff3be>] system_call+0xe2/0x2c8

After these errors, the failure rate increased; but not all times/tests;
since it seems that at some point the umount finishes, unblocking tests.


Cause:
---

Since this issue didn't happen with the test/patched kernel, it would
seem to originate from other/new patches. There are indeed 2 new ones
not used in the test kernel:

        6f64bf0873eb btrfs: qgroup: fix data leak caused by race between 
writeback and truncate <<< NEW
        d14eef17dcf2 btrfs: don't force read-only after error in drop snapshot  
<<< NEW

$ ls -1r *.patch # test kernel on top of Ubuntu-4.15.0-122.124
0017-btrfs-ctree-check-key-order-before-merging-tree-bloc.patch
0016-btrfs-extent-tree-kill-the-BUG_ON-in-insert_inline_e.patch
0015-btrfs-extent-tree-kill-BUG_ON-in-__btrfs_free_extent.patch
0014-btrfs-extent_io-do-extra-check-for-extent-buffer-rea.patch
0013-btrfs-drop-unnecessary-offset_in_page-in-extent-buff.patch
0012-btrfs-use-BUG-instead-of-BUG_ON-1.patch
0011-btrfs-use-offset_in_page-instead-of-open-coding-it.patch
0010-btrfs-fix-wrong-address-when-faulting-in-pages-in-th.patch
0009-btrfs-fix-lockdep-splat-in-add_missing_dev.patch
0008-btrfs-require-only-sector-size-alignment-for-parent-.patch
0007-btrfs-fix-potential-deadlock-in-the-search-ioctl.patch
0006-uaccess-Add-non-pagefault-user-space-write-function.patch
0005-uaccess-Add-non-pagefault-user-space-read-functions.patch
0004-btrfs-set-the-lockdep-class-for-log-tree-extent-buff.patch
0003-btrfs-Remove-extraneous-extent_buffer_get-from-tree_.patch
0002-btrfs-Remove-redundant-extent_buffer_get-in-get_old_.patch
0001-btrfs-drop-path-before-adding-new-uuid-tree-entry.patch

$ git log --oneline Ubuntu-4.15.0-122.124..Ubuntu-4.15.0-125.128 -- fs/btrfs/
a0a951aaa63f btrfs: ctree: check key order before merging tree blocks
0db62fbc1d08 btrfs: extent-tree: kill the BUG_ON() in 
insert_inline_extent_backref()
4f0e0482c42a btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()
36b1133f57e3 btrfs: extent_io: do extra check for extent buffer read write 
functions
82e64a4e1ff8 btrfs: drop unnecessary offset_in_page in extent buffer helpers
3393e02f4100 btrfs: use BUG() instead of BUG_ON(1)
10679c69e0a8 btrfs: use offset_in_page instead of open-coding it
6f64bf0873eb btrfs: qgroup: fix data leak caused by race between writeback and 
truncate <<< NEW
d14eef17dcf2 btrfs: don't force read-only after error in drop snapshot  <<< NEW
2f54a800e2ae btrfs: fix wrong address when faulting in pages in the search ioctl
71ef83d6ab86 btrfs: fix lockdep splat in add_missing_dev
cb1483e19a40 btrfs: require only sector size alignment for parent eb bytenr
d738658f503d btrfs: fix potential deadlock in the search ioctl
cca53f849a53 btrfs: set the lockdep class for log tree extent buffers
bb11dbfbc405 btrfs: Remove extraneous extent_buffer_get from tree_mod_log_rewind
9a4e81ea6a44 btrfs: Remove redundant extent_buffer_get in get_old_root
725ce0e39e6d btrfs: drop path before adding new uuid tree entry


The commit message in 6f64bf0873eb explicitly describes the umount path
(now in return to userspace, but still.)  So it does look like a match:

commit 6f64bf0873ebc15a4e710cb04c7630576637a845
Author: Qu Wenruo <w...@suse.com>
Date:   Fri Jul 17 15:12:05 2020 +0800

    btrfs: qgroup: fix data leak caused by race between writeback and truncate
    
    BugLink: https://bugs.launchpad.net/bugs/1901257
    
    [ Upstream commit fa91e4aa1716004ea8096d5185ec0451e206aea0 ]
    
    [BUG]
    When running tests like generic/013 on test device with btrfs quota
    enabled, it can normally lead to data leak, detected at unmount time:
    
      BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 
4096
      ------------[ cut here ]------------
      WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 
close_ctree+0x1dc/0x323 [btrfs]
      RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
      Call Trace:
       btrfs_put_super+0x15/0x17 [btrfs]
       generic_shutdown_super+0x72/0x110
       kill_anon_super+0x18/0x30
       btrfs_kill_super+0x17/0x30 [btrfs]
       deactivate_locked_super+0x3b/0xa0
       deactivate_super+0x40/0x50
       cleanup_mnt+0x135/0x190
       __cleanup_mnt+0x12/0x20
       task_work_run+0x64/0xb0
       __prepare_exit_to_usermode+0x1bc/0x1c0
       __syscall_return_slowpath+0x47/0x230
       do_syscall_64+0x64/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      ---[ end trace caf08beafeca2392 ]---
      BTRFS error (device dm-3): qgroup reserved space leaked


Also, this issue happened with Focal's original kernel from master-next,
which happened to have that commit already; and it didn't happen with
Focal's 'pre-original' kernel from -updates, which didn't. More evidence.


Revert?
---

Apparently this doesn't need to be reverted; it doesn't happen everytime,
as seen in this test, and with previous tests for Focal:

As described above, this was also noticed in Focal's results w/ the original
kernel from master-next, which has that commit, but to a lesser extent (see 
comment #8) so it wasn't noticed as much.

Thus the impact seem to vary, and happen more rarely, because the results w/
focal-proposed did NOT hit it a single time in 11 runs, and commit IS there.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1902254

Title:
  Bionic: btrfs: kernel BUG at /build/linux-
  eTBZpZ/linux-4.15.0/fs/btrfs/ctree.c:3233!

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Focal:
  Fix Committed
Status in linux source package in Groovy:
  Fix Committed

Bug description:
  [Impact]

   * Users of btrfs started hitting a kernel BUG() (below)
     after upgrade from 4.15.0-99.100 to 4.15.0-109.110,
     which has 55 btrfs changes.

       kernel BUG at /build/linux-eTBZpZ/linux-4.15.0/fs/btrfs/ctree.c:3233!
       ...
       Krnl PSW : 00000000be9cb874 00000000ef3786e8 
(btrfs_set_item_key_safe+0x152/0x1c0 [btrfs])
       ...
       [...] Call Trace:
       [...] btrfs_set_item_key_safe+0x11c/0x1c0 [btrfs])
       [...] __btrfs_drop_extents+0xb5a/0xda8 [btrfs]
       [...] btrfs_log_changed_extents+0x35c/0xaf0 [btrfs]
       [...] btrfs_log_inode+0x9ee/0x1080 [btrfs]
       [...] btrfs_log_inode_parent+0x224/0xa10 [btrfs]
       [...] btrfs_log_dentry_safe+0x80/0xa8 [btrfs]
       [...] btrfs_sync_file+0x392/0x550 [btrfs]
       [...] do_fsync+0x5e/0x90
       [...] SyS_fdatasync+0x32/0x48
       [...] system_call+0xd8/0x2c8

       $ git log --oneline Ubuntu-4.15.0-99.100..Ubuntu-4.15.0-109.110 -- 
fs/btrfs/ | wc -l
       55

   * The error happens at random moments, regardless of a
     particular activity/load. Workaround is to downgrade.

  [Fix]

   * This BUG()/function is addressed in patch 4/4 [1] of series
     'btrfs: Enhanced runtime defence against fuzzed images' [2],
     after issues in the real world, not just crafted fs images:
     'one internal report has hit one BUG_ON() with real world fs'
   
       kernel BUG at fs/btrfs/ctree.c:3188!
       ...
       RIP: 0010:btrfs_set_item_key_safe+0x16c/0x180
   
   * The patch/set [3] is applied in v5.10-rc1 and Ubuntu Unstable:
     - d16c702fe4f2 btrfs: ctree: check key order before merging tree blocks
     - 07cce5cf3b48 btrfs: extent-tree: kill the BUG_ON() in 
insert_inline_extent_backref()
     - 1c2a07f598d5 btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()
     - f98b6215d7d1 btrfs: extent_io: do extra check for extent buffer read 
write functions
   
  [Test Case]

   * There is working synthetic reproducer for this issue,
     which is hard to reproduce as reported in commit [4]
     that introduces debugging for the issue.
     
   * Regression tests with xfstests and stress-ng shows
     no regressions between un/patched kernels.

  [Other Info]

   * Trivial backports (only refreshing a few context lines)
     with 3 more dependency patches on Bionic and 1 on Focal.
     And Bionic needed one extra hunk to '#include' a header.
     Groovy all apply cleanly.

  [1] https://lore.kernel.org/linux-btrfs/20200819063550.62832-5-...@suse.com/
  [2] https://lore.kernel.org/linux-btrfs/20200819063550.62832-1-...@suse.com/
  [3] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d16c702fe4f274bd77b47d3ab737eadcf24e0b93
  [4] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7c15d41016dc886cc011e3854d855e219759ae68

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1902254/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to