Public bug reported:

BugLink: https://bugs.launchpad.net/bugs/2007219

[Impact]

A deadlock exists in the XFS filesystem that occurs when the XFS log
buffer becomes completely exhausted.

XFS maintains a circular ring buffer for the log, and it is a fixed
size. To be able to create a transaction, you need to be able to reserve
space on the log buffer.

Certain ioend transactions, such as file append, can be preallocated for
a negligible performance gain. This takes up space in the log buffer,
and these preallocated ioends are placed on a workqueue, behind other
ioends that are not preallocated.

The deadlock occurs when the XFS log buffer is running low on space, and
an ioend append transaction comes in. It is preallocated, consuming
nearly all of the remaining XFS log buffer space, and is placed at the
very end of the ioend workqueue. The kernel then takes a ioend from the
top of the ioend workqeueue, creates a transaction, xfs_trans_alloc(),
attempts to allocate space  for it, xfs_trans_reserve(),
xfs_log_reserve(), and then waits in a while loop checking free space in
the log, xlog_grant_head_check(), xlog_grant_head_wait().

Since there is no space, the thread sleeps with schedule(). This happens
with all ioend transactions, since the log is exhausted and I/O is not
moving.

Since I/O never moves, the thread is never woken up, and we get hung
task timeouts, with system failure shortly afterward.

An example hung task timeout is:

INFO: task kworker/60:0:4002982 blocked for more than 360 seconds.
      Tainted: G           OE     5.4.0-137-generic #154-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/60:0    D    0 4002982      2 0x90004080
Workqueue: xfs-conv/dm-3 xfs_end_io [xfs]
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 xlog_grant_head_wait+0xb9/0x1e0 [xfs]
 xlog_grant_head_check+0xde/0x100 [xfs]
 xfs_log_reserve+0xc9/0x1e0 [xfs]
 xfs_trans_reserve+0x17a/0x1e0 [xfs]
 xfs_trans_alloc+0xda/0x170 [xfs]
 xfs_iomap_write_unwritten+0x128/0x2f0 [xfs]
 xfs_end_ioend+0x15b/0x1b0 [xfs]
 xfs_end_io+0xb1/0xe0 [xfs]
 process_one_work+0x1eb/0x3b0
 worker_thread+0x4d/0x400
 kthread+0x104/0x140
 ? process_one_work+0x3b0/0x3b0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x1f/0x40

There is no known workaround, other than to have a larger log buffer at
filesystem creation, but even then, it only buys you time until you get
high enough load to exhaust the log buffer.

[Fix]

This was fixed in 5.13-rc1 by the following patch:

commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
Author: Brian Foster <bfos...@redhat.com>
Date:   Fri Apr 9 10:27:43 2021 -0700
Subject: xfs: drop submit side trans alloc for append ioends
Link: 
https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5

The patch more or less removes all preallocated ioend transactions, and
instead, when ioend appends are needed, they go to the standard
workqueue like any other ioend, where transactions are allocated when
they reach the top of the workqueue.

The patch required some backporting for Focal. The changes to the patch
itself is minimal and should be straightforward to read, however, the
changes to the XFS ioend subsystem between 5.4 and 5.13-rc1 were quite
extreme, with a lot of refactoring taking place over very many commits.

Additionally, the patch was part of a five part series, the first, fixes
the deadlock, and the rest remove all the code to do with transaction
preallocation.

It is safe to leave the rest of the code in place. It will become dead
code, but it will not be reachable, and not cause any risk of
regression, due to ioend->io_append_trans always being NULL, and not
entering into any of the if statements.

[Testcase]

There is currently no known testcase for this issue. The issue has only
been seen in a customer production environment, running under heavy
load. The issue has not been seen in a customer test environment, only
production.

The production workload is a busy Kubernetes cluster running containers
and VMs, with the hosts's filesystem being broken into separate
mountpoints over several partitions, all XFS.

The kubernetes containers are all backed from a large 4TB disk which is
XFS.

A test kernel is available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/sf353709-test

This test kernel has been deployed to several production hosts, and the
deadlock no longer occurs.

[Where problems can occur]

We are changing how certain ioend transactions take place in the XFS
filesystem. If a regression were to occur, it would impact all XFS
users. Users would have to downgrade their kernel, as there are no
workarounds for enabling or disabling the behaviour change.

The overall risk of the change should be low. ioend append transactions
would still be processed in nearly the same way, still being placed at
the end of the ioend workqueue like any other transaction, with the only
change being it is allocated at the time the transaction is created, at
the top of the workqueue when it is processed, and not preallocated when
the ioend is first submitted.

There will be a very minor performance penalty, but it wouldn't be
measurable in any tangible workload.

I have run xfstests for the xfs/* subset against the released
5.4.0-137-generic and the test 5.4.0-137-generic test kernel with the
patch included. The test kernel had identical results, it will likely
not cause any regressions.

There is some additional risk leaving the dead code in place, but I have
read the code and the commits to remove the dead code, and I came to the
conclusion it is less risky to leave it in place, than to backport the
refactor commits.

[Other Info]

The XFS ioend subsystem refactor took place in the following commits:

commit 598ecfbaa742aca0dcdbbea25681406f95cc0b63
From: Christoph Hellwig <h...@lst.de>
Date: Thu, 17 Oct 2019 13:12:15 -0700
Subject: iomap: lift the xfs writeback code to iomap
Link: 
https://github.com/torvalds/linux/commit/598ecfbaa742aca0dcdbbea25681406f95cc0b63

commit 9e91c5728cab3d0aa3197d009c3d63e147914e77
From: Christoph Hellwig <h...@lst.de>
Date: Thu, 17 Oct 2019 13:12:13 -0700
Subject: iomap: lift common tracing code from xfs to iomap
Link: 
https://github.com/torvalds/linux/commit/9e91c5728cab3d0aa3197d009c3d63e147914e77

commit 433dad94ec5d6b90385b56a8bc8718dd9542b289
From: Christoph Hellwig <h...@lst.de>
Date: Thu, 17 Oct 2019 13:12:07 -0700
Subject: xfs: refactor the ioend merging code
Link: 
https://github.com/torvalds/linux/commit/433dad94ec5d6b90385b56a8bc8718dd9542b289

ioend->io_append_trans was renamed to ioend->io_private in the following
commit:

commit 5653017bc44e54baa299f3523f160c23ac0628fd
From: Christoph Hellwig <h...@lst.de>
Date: Thu, 17 Oct 2019 13:12:09 -0700
Subject: xfs: turn io_append_trans into an io_private void pointer
Link: 
https://github.com/torvalds/linux/commit/5653017bc44e54baa299f3523f160c23ac0628fd

The full five part preallocated ioend deadlock patch series is:

commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
Author: Brian Foster <bfos...@redhat.com>
Date:   Fri Apr 9 10:27:43 2021 -0700
Subject: xfs: drop submit side trans alloc for append ioends
Link: 
https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5

commit 7adb8f14e134d5f885d47c4ccd620836235f0b7f
Author: Brian Foster <bfos...@redhat.com>
Date:   Fri Apr 9 10:27:55 2021 -0700
Subject: xfs: open code ioend needs workqueue helper
Link: 
https://github.com/torvalds/linux/commit/7adb8f14e134d5f885d47c4ccd620836235f0b7f

commit 044c6449f18f174ba8d86640936add3fc7582e49
Author: Brian Foster <bfos...@redhat.com>
Date:   Fri Apr 9 10:27:55 2021 -0700
Subject: xfs: drop unused ioend private merge and setfilesize code
Link: 
https://github.com/torvalds/linux/commit/044c6449f18f174ba8d86640936add3fc7582e49

commit e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
Author: Brian Foster <bfos...@redhat.com>
Date:   Fri Apr 9 10:27:56 2021 -0700
Subject: xfs: drop unnecessary setfilesize helper
Link: 
https://github.com/torvalds/linux/commit/e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4

commit 6e552494fb90acae005d74ce6a2ee102d965184b
Author: Brian Foster <bfos...@redhat.com>
Date:   Tue May 4 08:54:29 2021 -0700
Subject: iomap: remove unused private field from ioend
Link: 
https://github.com/torvalds/linux/commit/6e552494fb90acae005d74ce6a2ee102d965184b

As you can see, the four latter commits are not necessary. They simply
remove dead code, which has no harm being left in place. They also do
not backport at all, not without the ALL of the significant refactor
commits.

Hence, we only take "xfs: drop submit side trans alloc for append
ioends" only, as it is the only needed commit to solve the problem.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Fix Released

** Affects: linux (Ubuntu Focal)
     Importance: High
     Assignee: Matthew Ruffell (mruffell)
         Status: In Progress


** Tags: focal sts

** Description changed:

- BugLink: https://bugs.launchpad.net/bugs/
+ BugLink: https://bugs.launchpad.net/bugs/2007219
  
  [Impact]
  
  A deadlock exists in the XFS filesystem that occurs when the XFS log
  buffer becomes completely exhausted.
  
  XFS maintains a circular ring buffer for the log, and it is a fixed
  size. To be able to create a transaction, you need to be able to reserve
  space on the log buffer.
  
  Certain ioend transactions, such as file append, can be preallocated for
  a negligible performance gain. This takes up space in the log buffer,
  and these preallocated ioends are placed on a workqueue, behind other
  ioends that are not preallocated.
  
  The deadlock occurs when the XFS log buffer is running low on space, and
  an ioend append transaction comes in. It is preallocated, consuming
  nearly all of the remaining XFS log buffer space, and is placed at the
  very end of the ioend workqueue. The kernel then takes a ioend from the
  top of the ioend workqeueue, creates a transaction, xfs_trans_alloc(),
  attempts to allocate space  for it, xfs_trans_reserve(),
  xfs_log_reserve(), and then waits in a while loop checking free space in
  the log, xlog_grant_head_check(), xlog_grant_head_wait().
  
  Since there is no space, the thread sleeps with schedule(). This happens
  with all ioend transactions, since the log is exhausted and I/O is not
  moving.
  
  Since I/O never moves, the thread is never woken up, and we get hung
  task timeouts, with system failure shortly afterward.
  
  An example hung task timeout is:
  
  INFO: task kworker/60:0:4002982 blocked for more than 360 seconds.
-       Tainted: G           OE     5.4.0-137-generic #154-Ubuntu
+       Tainted: G           OE     5.4.0-137-generic #154-Ubuntu
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/60:0    D    0 4002982      2 0x90004080
  Workqueue: xfs-conv/dm-3 xfs_end_io [xfs]
  Call Trace:
-  __schedule+0x2e3/0x740
-  schedule+0x42/0xb0
-  xlog_grant_head_wait+0xb9/0x1e0 [xfs]
-  xlog_grant_head_check+0xde/0x100 [xfs]
-  xfs_log_reserve+0xc9/0x1e0 [xfs]
-  xfs_trans_reserve+0x17a/0x1e0 [xfs]
-  xfs_trans_alloc+0xda/0x170 [xfs]
-  xfs_iomap_write_unwritten+0x128/0x2f0 [xfs]
-  xfs_end_ioend+0x15b/0x1b0 [xfs]
-  xfs_end_io+0xb1/0xe0 [xfs]
-  process_one_work+0x1eb/0x3b0
-  worker_thread+0x4d/0x400
-  kthread+0x104/0x140
-  ? process_one_work+0x3b0/0x3b0
-  ? kthread_park+0x90/0x90
-  ret_from_fork+0x1f/0x40
+  __schedule+0x2e3/0x740
+  schedule+0x42/0xb0
+  xlog_grant_head_wait+0xb9/0x1e0 [xfs]
+  xlog_grant_head_check+0xde/0x100 [xfs]
+  xfs_log_reserve+0xc9/0x1e0 [xfs]
+  xfs_trans_reserve+0x17a/0x1e0 [xfs]
+  xfs_trans_alloc+0xda/0x170 [xfs]
+  xfs_iomap_write_unwritten+0x128/0x2f0 [xfs]
+  xfs_end_ioend+0x15b/0x1b0 [xfs]
+  xfs_end_io+0xb1/0xe0 [xfs]
+  process_one_work+0x1eb/0x3b0
+  worker_thread+0x4d/0x400
+  kthread+0x104/0x140
+  ? process_one_work+0x3b0/0x3b0
+  ? kthread_park+0x90/0x90
+  ret_from_fork+0x1f/0x40
  
  There is no known workaround, other than to have a larger log buffer at
  filesystem creation, but even then, it only buys you time until you get
  high enough load to exhaust the log buffer.
  
  [Fix]
  
  This was fixed in 5.13-rc1 by the following patch:
  
  commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Fri Apr 9 10:27:43 2021 -0700
  Subject: xfs: drop submit side trans alloc for append ioends
  Link: 
https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5
  
  The patch more or less removes all preallocated ioend transactions, and
  instead, when ioend appends are needed, they go to the standard
  workqueue like any other ioend, where transactions are allocated when
  they reach the top of the workqueue.
  
  The patch required some backporting for Focal. The changes to the patch
  itself is minimal and should be straightforward to read, however, the
  changes to the XFS ioend subsystem between 5.4 and 5.13-rc1 were quite
  extreme, with a lot of refactoring taking place over very many commits.
  
  Additionally, the patch was part of a five part series, the first, fixes
  the deadlock, and the rest remove all the code to do with transaction
  preallocation.
  
  It is safe to leave the rest of the code in place. It will become dead
  code, but it will not be reachable, and not cause any risk of
  regression, due to ioend->io_append_trans always being NULL, and not
  entering into any of the if statements.
  
  [Testcase]
  
  There is currently no known testcase for this issue. The issue has only
  been seen in a customer production environment, running under heavy
  load. The issue has not been seen in a customer test environment, only
  production.
  
  The production workload is a busy Kubernetes cluster running containers
  and VMs, with the hosts's filesystem being broken into separate
  mountpoints over several partitions, all XFS.
  
  The kubernetes containers are all backed from a large 4TB disk which is
  XFS.
  
  A test kernel is available in the following ppa:
  
  https://launchpad.net/~mruffell/+archive/ubuntu/sf353709-test
  
  This test kernel has been deployed to several production hosts, and the
  deadlock no longer occurs.
  
  [Where problems can occur]
  
  We are changing how certain ioend transactions take place in the XFS
  filesystem. If a regression were to occur, it would impact all XFS
  users. Users would have to downgrade their kernel, as there are no
  workarounds for enabling or disabling the behaviour change.
  
  The overall risk of the change should be low. ioend append transactions
  would still be processed in nearly the same way, still being placed at
  the end of the ioend workqueue like any other transaction, with the only
  change being it is allocated at the time the transaction is created, at
  the top of the workqueue when it is processed, and not preallocated when
  the ioend is first submitted.
  
  There will be a very minor performance penalty, but it wouldn't be
  measurable in any tangible workload.
  
  I have run xfstests for the xfs/* subset against the released
  5.4.0-137-generic and the test 5.4.0-137-generic test kernel with the
  patch included. The test kernel had identical results, it will likely
  not cause any regressions.
  
  There is some additional risk leaving the dead code in place, but I have
  read the code and the commits to remove the dead code, and I came to the
  conclusion it is less risky to leave it in place, than to backport the
  refactor commits.
  
  [Other Info]
  
  The XFS ioend subsystem refactor took place in the following commits:
  
  commit 598ecfbaa742aca0dcdbbea25681406f95cc0b63
  From: Christoph Hellwig <h...@lst.de>
  Date: Thu, 17 Oct 2019 13:12:15 -0700
  Subject: iomap: lift the xfs writeback code to iomap
  Link: 
https://github.com/torvalds/linux/commit/598ecfbaa742aca0dcdbbea25681406f95cc0b63
  
  commit 9e91c5728cab3d0aa3197d009c3d63e147914e77
  From: Christoph Hellwig <h...@lst.de>
  Date: Thu, 17 Oct 2019 13:12:13 -0700
  Subject: iomap: lift common tracing code from xfs to iomap
  Link: 
https://github.com/torvalds/linux/commit/9e91c5728cab3d0aa3197d009c3d63e147914e77
  
  commit 433dad94ec5d6b90385b56a8bc8718dd9542b289
  From: Christoph Hellwig <h...@lst.de>
  Date: Thu, 17 Oct 2019 13:12:07 -0700
  Subject: xfs: refactor the ioend merging code
  Link: 
https://github.com/torvalds/linux/commit/433dad94ec5d6b90385b56a8bc8718dd9542b289
  
  ioend->io_append_trans was renamed to ioend->io_private in the following
  commit:
  
  commit 5653017bc44e54baa299f3523f160c23ac0628fd
  From: Christoph Hellwig <h...@lst.de>
  Date: Thu, 17 Oct 2019 13:12:09 -0700
  Subject: xfs: turn io_append_trans into an io_private void pointer
  Link: 
https://github.com/torvalds/linux/commit/5653017bc44e54baa299f3523f160c23ac0628fd
  
  The full five part preallocated ioend deadlock patch series is:
  
  commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Fri Apr 9 10:27:43 2021 -0700
  Subject: xfs: drop submit side trans alloc for append ioends
  Link: 
https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5
  
  commit 7adb8f14e134d5f885d47c4ccd620836235f0b7f
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Fri Apr 9 10:27:55 2021 -0700
  Subject: xfs: open code ioend needs workqueue helper
  Link: 
https://github.com/torvalds/linux/commit/7adb8f14e134d5f885d47c4ccd620836235f0b7f
  
  commit 044c6449f18f174ba8d86640936add3fc7582e49
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Fri Apr 9 10:27:55 2021 -0700
  Subject: xfs: drop unused ioend private merge and setfilesize code
  Link: 
ttps://github.com/torvalds/linux/commit/044c6449f18f174ba8d86640936add3fc7582e49
  
  commit e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Fri Apr 9 10:27:56 2021 -0700
  Subject: xfs: drop unnecessary setfilesize helper
  Link: 
ttps://github.com/torvalds/linux/commit/e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
  
  commit 6e552494fb90acae005d74ce6a2ee102d965184b
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Tue May 4 08:54:29 2021 -0700
  Subject: iomap: remove unused private field from ioend
  Link: 
ttps://github.com/torvalds/linux/commit/6e552494fb90acae005d74ce6a2ee102d965184b
  
  As you can see, the four latter commits are not necessary. They simply
  remove dead code, which has no harm being left in place. They also do
  not backport at all, not without the ALL of the significant refactor
  commits.
  
  Hence, we only take "xfs: drop submit side trans alloc for append
  ioends" only, as it is the only needed commit to solve the problem.

** Also affects: linux (Ubuntu Focal)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu Focal)
       Status: New => In Progress

** Changed in: linux (Ubuntu)
       Status: New => Fix Released

** Changed in: linux (Ubuntu Focal)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Focal)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Tags added: focal sts

** Description changed:

  BugLink: https://bugs.launchpad.net/bugs/2007219
  
  [Impact]
  
  A deadlock exists in the XFS filesystem that occurs when the XFS log
  buffer becomes completely exhausted.
  
  XFS maintains a circular ring buffer for the log, and it is a fixed
  size. To be able to create a transaction, you need to be able to reserve
  space on the log buffer.
  
  Certain ioend transactions, such as file append, can be preallocated for
  a negligible performance gain. This takes up space in the log buffer,
  and these preallocated ioends are placed on a workqueue, behind other
  ioends that are not preallocated.
  
  The deadlock occurs when the XFS log buffer is running low on space, and
  an ioend append transaction comes in. It is preallocated, consuming
  nearly all of the remaining XFS log buffer space, and is placed at the
  very end of the ioend workqueue. The kernel then takes a ioend from the
  top of the ioend workqeueue, creates a transaction, xfs_trans_alloc(),
  attempts to allocate space  for it, xfs_trans_reserve(),
  xfs_log_reserve(), and then waits in a while loop checking free space in
  the log, xlog_grant_head_check(), xlog_grant_head_wait().
  
  Since there is no space, the thread sleeps with schedule(). This happens
  with all ioend transactions, since the log is exhausted and I/O is not
  moving.
  
  Since I/O never moves, the thread is never woken up, and we get hung
  task timeouts, with system failure shortly afterward.
  
  An example hung task timeout is:
  
  INFO: task kworker/60:0:4002982 blocked for more than 360 seconds.
        Tainted: G           OE     5.4.0-137-generic #154-Ubuntu
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/60:0    D    0 4002982      2 0x90004080
  Workqueue: xfs-conv/dm-3 xfs_end_io [xfs]
  Call Trace:
   __schedule+0x2e3/0x740
   schedule+0x42/0xb0
   xlog_grant_head_wait+0xb9/0x1e0 [xfs]
   xlog_grant_head_check+0xde/0x100 [xfs]
   xfs_log_reserve+0xc9/0x1e0 [xfs]
   xfs_trans_reserve+0x17a/0x1e0 [xfs]
   xfs_trans_alloc+0xda/0x170 [xfs]
   xfs_iomap_write_unwritten+0x128/0x2f0 [xfs]
   xfs_end_ioend+0x15b/0x1b0 [xfs]
   xfs_end_io+0xb1/0xe0 [xfs]
   process_one_work+0x1eb/0x3b0
   worker_thread+0x4d/0x400
   kthread+0x104/0x140
   ? process_one_work+0x3b0/0x3b0
   ? kthread_park+0x90/0x90
   ret_from_fork+0x1f/0x40
  
  There is no known workaround, other than to have a larger log buffer at
  filesystem creation, but even then, it only buys you time until you get
  high enough load to exhaust the log buffer.
  
  [Fix]
  
  This was fixed in 5.13-rc1 by the following patch:
  
  commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Fri Apr 9 10:27:43 2021 -0700
  Subject: xfs: drop submit side trans alloc for append ioends
  Link: 
https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5
  
  The patch more or less removes all preallocated ioend transactions, and
  instead, when ioend appends are needed, they go to the standard
  workqueue like any other ioend, where transactions are allocated when
  they reach the top of the workqueue.
  
  The patch required some backporting for Focal. The changes to the patch
  itself is minimal and should be straightforward to read, however, the
  changes to the XFS ioend subsystem between 5.4 and 5.13-rc1 were quite
  extreme, with a lot of refactoring taking place over very many commits.
  
  Additionally, the patch was part of a five part series, the first, fixes
  the deadlock, and the rest remove all the code to do with transaction
  preallocation.
  
  It is safe to leave the rest of the code in place. It will become dead
  code, but it will not be reachable, and not cause any risk of
  regression, due to ioend->io_append_trans always being NULL, and not
  entering into any of the if statements.
  
  [Testcase]
  
  There is currently no known testcase for this issue. The issue has only
  been seen in a customer production environment, running under heavy
  load. The issue has not been seen in a customer test environment, only
  production.
  
  The production workload is a busy Kubernetes cluster running containers
  and VMs, with the hosts's filesystem being broken into separate
  mountpoints over several partitions, all XFS.
  
  The kubernetes containers are all backed from a large 4TB disk which is
  XFS.
  
  A test kernel is available in the following ppa:
  
  https://launchpad.net/~mruffell/+archive/ubuntu/sf353709-test
  
  This test kernel has been deployed to several production hosts, and the
  deadlock no longer occurs.
  
  [Where problems can occur]
  
  We are changing how certain ioend transactions take place in the XFS
  filesystem. If a regression were to occur, it would impact all XFS
  users. Users would have to downgrade their kernel, as there are no
  workarounds for enabling or disabling the behaviour change.
  
  The overall risk of the change should be low. ioend append transactions
  would still be processed in nearly the same way, still being placed at
  the end of the ioend workqueue like any other transaction, with the only
  change being it is allocated at the time the transaction is created, at
  the top of the workqueue when it is processed, and not preallocated when
  the ioend is first submitted.
  
  There will be a very minor performance penalty, but it wouldn't be
  measurable in any tangible workload.
  
  I have run xfstests for the xfs/* subset against the released
  5.4.0-137-generic and the test 5.4.0-137-generic test kernel with the
  patch included. The test kernel had identical results, it will likely
  not cause any regressions.
  
  There is some additional risk leaving the dead code in place, but I have
  read the code and the commits to remove the dead code, and I came to the
  conclusion it is less risky to leave it in place, than to backport the
  refactor commits.
  
  [Other Info]
  
  The XFS ioend subsystem refactor took place in the following commits:
  
  commit 598ecfbaa742aca0dcdbbea25681406f95cc0b63
  From: Christoph Hellwig <h...@lst.de>
  Date: Thu, 17 Oct 2019 13:12:15 -0700
  Subject: iomap: lift the xfs writeback code to iomap
  Link: 
https://github.com/torvalds/linux/commit/598ecfbaa742aca0dcdbbea25681406f95cc0b63
  
  commit 9e91c5728cab3d0aa3197d009c3d63e147914e77
  From: Christoph Hellwig <h...@lst.de>
  Date: Thu, 17 Oct 2019 13:12:13 -0700
  Subject: iomap: lift common tracing code from xfs to iomap
  Link: 
https://github.com/torvalds/linux/commit/9e91c5728cab3d0aa3197d009c3d63e147914e77
  
  commit 433dad94ec5d6b90385b56a8bc8718dd9542b289
  From: Christoph Hellwig <h...@lst.de>
  Date: Thu, 17 Oct 2019 13:12:07 -0700
  Subject: xfs: refactor the ioend merging code
  Link: 
https://github.com/torvalds/linux/commit/433dad94ec5d6b90385b56a8bc8718dd9542b289
  
  ioend->io_append_trans was renamed to ioend->io_private in the following
  commit:
  
  commit 5653017bc44e54baa299f3523f160c23ac0628fd
  From: Christoph Hellwig <h...@lst.de>
  Date: Thu, 17 Oct 2019 13:12:09 -0700
  Subject: xfs: turn io_append_trans into an io_private void pointer
  Link: 
https://github.com/torvalds/linux/commit/5653017bc44e54baa299f3523f160c23ac0628fd
  
  The full five part preallocated ioend deadlock patch series is:
  
  commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Fri Apr 9 10:27:43 2021 -0700
  Subject: xfs: drop submit side trans alloc for append ioends
  Link: 
https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5
  
  commit 7adb8f14e134d5f885d47c4ccd620836235f0b7f
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Fri Apr 9 10:27:55 2021 -0700
  Subject: xfs: open code ioend needs workqueue helper
  Link: 
https://github.com/torvalds/linux/commit/7adb8f14e134d5f885d47c4ccd620836235f0b7f
  
  commit 044c6449f18f174ba8d86640936add3fc7582e49
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Fri Apr 9 10:27:55 2021 -0700
  Subject: xfs: drop unused ioend private merge and setfilesize code
- Link: 
ttps://github.com/torvalds/linux/commit/044c6449f18f174ba8d86640936add3fc7582e49
+ Link: 
https://github.com/torvalds/linux/commit/044c6449f18f174ba8d86640936add3fc7582e49
  
  commit e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Fri Apr 9 10:27:56 2021 -0700
  Subject: xfs: drop unnecessary setfilesize helper
- Link: 
ttps://github.com/torvalds/linux/commit/e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
+ Link: 
https://github.com/torvalds/linux/commit/e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
  
  commit 6e552494fb90acae005d74ce6a2ee102d965184b
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Tue May 4 08:54:29 2021 -0700
  Subject: iomap: remove unused private field from ioend
- Link: 
ttps://github.com/torvalds/linux/commit/6e552494fb90acae005d74ce6a2ee102d965184b
+ Link: 
https://github.com/torvalds/linux/commit/6e552494fb90acae005d74ce6a2ee102d965184b
  
  As you can see, the four latter commits are not necessary. They simply
  remove dead code, which has no harm being left in place. They also do
  not backport at all, not without the ALL of the significant refactor
  commits.
  
  Hence, we only take "xfs: drop submit side trans alloc for append
  ioends" only, as it is the only needed commit to solve the problem.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2007219

Title:
  xfs: Preallocated ioend transactions cause deadlock due to log buffer
  exhaustion

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Focal:
  In Progress

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/2007219

  [Impact]

  A deadlock exists in the XFS filesystem that occurs when the XFS log
  buffer becomes completely exhausted.

  XFS maintains a circular ring buffer for the log, and it is a fixed
  size. To be able to create a transaction, you need to be able to
  reserve space on the log buffer.

  Certain ioend transactions, such as file append, can be preallocated
  for a negligible performance gain. This takes up space in the log
  buffer, and these preallocated ioends are placed on a workqueue,
  behind other ioends that are not preallocated.

  The deadlock occurs when the XFS log buffer is running low on space,
  and an ioend append transaction comes in. It is preallocated,
  consuming nearly all of the remaining XFS log buffer space, and is
  placed at the very end of the ioend workqueue. The kernel then takes a
  ioend from the top of the ioend workqeueue, creates a transaction,
  xfs_trans_alloc(), attempts to allocate space  for it,
  xfs_trans_reserve(), xfs_log_reserve(), and then waits in a while loop
  checking free space in the log, xlog_grant_head_check(),
  xlog_grant_head_wait().

  Since there is no space, the thread sleeps with schedule(). This
  happens with all ioend transactions, since the log is exhausted and
  I/O is not moving.

  Since I/O never moves, the thread is never woken up, and we get hung
  task timeouts, with system failure shortly afterward.

  An example hung task timeout is:

  INFO: task kworker/60:0:4002982 blocked for more than 360 seconds.
        Tainted: G           OE     5.4.0-137-generic #154-Ubuntu
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/60:0    D    0 4002982      2 0x90004080
  Workqueue: xfs-conv/dm-3 xfs_end_io [xfs]
  Call Trace:
   __schedule+0x2e3/0x740
   schedule+0x42/0xb0
   xlog_grant_head_wait+0xb9/0x1e0 [xfs]
   xlog_grant_head_check+0xde/0x100 [xfs]
   xfs_log_reserve+0xc9/0x1e0 [xfs]
   xfs_trans_reserve+0x17a/0x1e0 [xfs]
   xfs_trans_alloc+0xda/0x170 [xfs]
   xfs_iomap_write_unwritten+0x128/0x2f0 [xfs]
   xfs_end_ioend+0x15b/0x1b0 [xfs]
   xfs_end_io+0xb1/0xe0 [xfs]
   process_one_work+0x1eb/0x3b0
   worker_thread+0x4d/0x400
   kthread+0x104/0x140
   ? process_one_work+0x3b0/0x3b0
   ? kthread_park+0x90/0x90
   ret_from_fork+0x1f/0x40

  There is no known workaround, other than to have a larger log buffer
  at filesystem creation, but even then, it only buys you time until you
  get high enough load to exhaust the log buffer.

  [Fix]

  This was fixed in 5.13-rc1 by the following patch:

  commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Fri Apr 9 10:27:43 2021 -0700
  Subject: xfs: drop submit side trans alloc for append ioends
  Link: 
https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5

  The patch more or less removes all preallocated ioend transactions,
  and instead, when ioend appends are needed, they go to the standard
  workqueue like any other ioend, where transactions are allocated when
  they reach the top of the workqueue.

  The patch required some backporting for Focal. The changes to the
  patch itself is minimal and should be straightforward to read,
  however, the changes to the XFS ioend subsystem between 5.4 and
  5.13-rc1 were quite extreme, with a lot of refactoring taking place
  over very many commits.

  Additionally, the patch was part of a five part series, the first,
  fixes the deadlock, and the rest remove all the code to do with
  transaction preallocation.

  It is safe to leave the rest of the code in place. It will become dead
  code, but it will not be reachable, and not cause any risk of
  regression, due to ioend->io_append_trans always being NULL, and not
  entering into any of the if statements.

  [Testcase]

  There is currently no known testcase for this issue. The issue has
  only been seen in a customer production environment, running under
  heavy load. The issue has not been seen in a customer test
  environment, only production.

  The production workload is a busy Kubernetes cluster running
  containers and VMs, with the hosts's filesystem being broken into
  separate mountpoints over several partitions, all XFS.

  The kubernetes containers are all backed from a large 4TB disk which
  is XFS.

  A test kernel is available in the following ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf353709-test

  This test kernel has been deployed to several production hosts, and
  the deadlock no longer occurs.

  [Where problems can occur]

  We are changing how certain ioend transactions take place in the XFS
  filesystem. If a regression were to occur, it would impact all XFS
  users. Users would have to downgrade their kernel, as there are no
  workarounds for enabling or disabling the behaviour change.

  The overall risk of the change should be low. ioend append
  transactions would still be processed in nearly the same way, still
  being placed at the end of the ioend workqueue like any other
  transaction, with the only change being it is allocated at the time
  the transaction is created, at the top of the workqueue when it is
  processed, and not preallocated when the ioend is first submitted.

  There will be a very minor performance penalty, but it wouldn't be
  measurable in any tangible workload.

  I have run xfstests for the xfs/* subset against the released
  5.4.0-137-generic and the test 5.4.0-137-generic test kernel with the
  patch included. The test kernel had identical results, it will likely
  not cause any regressions.

  There is some additional risk leaving the dead code in place, but I
  have read the code and the commits to remove the dead code, and I came
  to the conclusion it is less risky to leave it in place, than to
  backport the refactor commits.

  [Other Info]

  The XFS ioend subsystem refactor took place in the following commits:

  commit 598ecfbaa742aca0dcdbbea25681406f95cc0b63
  From: Christoph Hellwig <h...@lst.de>
  Date: Thu, 17 Oct 2019 13:12:15 -0700
  Subject: iomap: lift the xfs writeback code to iomap
  Link: 
https://github.com/torvalds/linux/commit/598ecfbaa742aca0dcdbbea25681406f95cc0b63

  commit 9e91c5728cab3d0aa3197d009c3d63e147914e77
  From: Christoph Hellwig <h...@lst.de>
  Date: Thu, 17 Oct 2019 13:12:13 -0700
  Subject: iomap: lift common tracing code from xfs to iomap
  Link: 
https://github.com/torvalds/linux/commit/9e91c5728cab3d0aa3197d009c3d63e147914e77

  commit 433dad94ec5d6b90385b56a8bc8718dd9542b289
  From: Christoph Hellwig <h...@lst.de>
  Date: Thu, 17 Oct 2019 13:12:07 -0700
  Subject: xfs: refactor the ioend merging code
  Link: 
https://github.com/torvalds/linux/commit/433dad94ec5d6b90385b56a8bc8718dd9542b289

  ioend->io_append_trans was renamed to ioend->io_private in the
  following commit:

  commit 5653017bc44e54baa299f3523f160c23ac0628fd
  From: Christoph Hellwig <h...@lst.de>
  Date: Thu, 17 Oct 2019 13:12:09 -0700
  Subject: xfs: turn io_append_trans into an io_private void pointer
  Link: 
https://github.com/torvalds/linux/commit/5653017bc44e54baa299f3523f160c23ac0628fd

  The full five part preallocated ioend deadlock patch series is:

  commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Fri Apr 9 10:27:43 2021 -0700
  Subject: xfs: drop submit side trans alloc for append ioends
  Link: 
https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5

  commit 7adb8f14e134d5f885d47c4ccd620836235f0b7f
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Fri Apr 9 10:27:55 2021 -0700
  Subject: xfs: open code ioend needs workqueue helper
  Link: 
https://github.com/torvalds/linux/commit/7adb8f14e134d5f885d47c4ccd620836235f0b7f

  commit 044c6449f18f174ba8d86640936add3fc7582e49
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Fri Apr 9 10:27:55 2021 -0700
  Subject: xfs: drop unused ioend private merge and setfilesize code
  Link: 
https://github.com/torvalds/linux/commit/044c6449f18f174ba8d86640936add3fc7582e49

  commit e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Fri Apr 9 10:27:56 2021 -0700
  Subject: xfs: drop unnecessary setfilesize helper
  Link: 
https://github.com/torvalds/linux/commit/e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4

  commit 6e552494fb90acae005d74ce6a2ee102d965184b
  Author: Brian Foster <bfos...@redhat.com>
  Date:   Tue May 4 08:54:29 2021 -0700
  Subject: iomap: remove unused private field from ioend
  Link: 
https://github.com/torvalds/linux/commit/6e552494fb90acae005d74ce6a2ee102d965184b

  As you can see, the four latter commits are not necessary. They simply
  remove dead code, which has no harm being left in place. They also do
  not backport at all, not without the ALL of the significant refactor
  commits.

  Hence, we only take "xfs: drop submit side trans alloc for append
  ioends" only, as it is the only needed commit to solve the problem.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2007219/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to