Public bug reported:

[impact]

When running heavy I/O operations on sysfs, the kernel locks up and many
hung-task warnings are printed in dmesg.
This issue is due to a bug in the kernfs filesystem and was fixed upstream
through four commits:

    06fb4736139f kernfs: change kernfs_rename_lock into a read-write lock.
    c9f2dfb7b59e kernfs: Use a per-fs rwsem to protect per-fs list of 
kernfs_super_info.
    9caf69614225 kernfs: Introduce separate rwsem to protect inode attributes.
    393c3714081a kernfs: switch global kernfs_rwsem lock to per-fs lock

I have backported these commits, along with two other commits that they
depend on:

44a41882575b kernfs: dont take i_lock on inode attr read.
f44a3ca1e533 kernfs: move struct kernfs_root out of the public view.

One of the main signs of hitting this bug is a huge number of tasks in the
Uninterruptible (UN) state when analyzing the core dump:

crash> ps -S
  RU: 99
  UN: 7725
  IN: 6113
  ID: 444

Stack traces can differ widely depending on which application locks up, but
what they all have in common is that they are waiting for a mutex to be
released:

[308188.415026] INFO: task REDACTED:15703 blocked for more than 360 seconds.
[308188.415362]       Not tainted 5.15.0-153-generic #163-Ubuntu
[308188.415749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
[308188.416054] task:REDACTED     state:D stack:    0 pid:15703 ppid: 15160 
flags:0x00000000
[308188.416347] Call Trace:
[308188.416634]  <TASK>
[308188.416917]  __schedule+0x24e/0x590
[308188.417202]  schedule+0x69/0x110
[308188.417506]  schedule_preempt_disabled+0xe/0x20
[308188.417789]  __mutex_lock.constprop.0+0x267/0x490
[308188.418052]  ? rtnl_getlink+0x420/0x420
[308188.418311]  __mutex_lock_slowpath+0x13/0x20
[308188.418575]  mutex_lock+0x38/0x50
[308188.418818]  __netlink_dump_start+0xbf/0x2f0
[308188.419061]  ? rtnl_getlink+0x420/0x420
[308188.419321]  rtnetlink_rcv_msg+0x2af/0x400
[308188.419564]  ? rtnl_getlink+0x420/0x420
[308188.419807]  ? rtnl_calcit.isra.0+0x130/0x130
[308188.420051]  netlink_rcv_skb+0x53/0x100
[308188.420297]  rtnetlink_rcv+0x15/0x20
[308188.420539]  netlink_unicast+0x220/0x340
[308188.420781]  netlink_sendmsg+0x24b/0x4c0
[308188.421023]  __sock_sendmsg+0x66/0x70
[308188.421262]  __sys_sendto+0x113/0x190
[308188.421511]  ? __audit_syscall_exit+0x269/0x2d0
[308188.421754]  ? __audit_syscall_entry+0xde/0x120
[308188.422024]  __x64_sys_sendto+0x24/0x30
[308188.422258]  x64_sys_call+0x1bcb/0x1fa0
[308188.422485]  do_syscall_64+0x56/0xb0
[308188.422711]  entry_SYSCALL_64_after_hwframe+0x6c/0xd6
[308188.422946] RIP: 0033:0x48fbae
[308188.423196] RSP: 002b:000000c0038b1480 EFLAGS: 00000216 ORIG_RAX: 
000000000000002c
[308188.423454] RAX: ffffffffffffffda RBX: 000000000000008a RCX: 
000000000048fbae
[308188.423694] RDX: 0000000000000011 RSI: 000000c0074b84e0 RDI: 
000000000000008a
[308188.423935] RBP: 000000c0038b14c0 R08: 000000c0074b84d4 R09: 
000000000000000c
[308188.424180] R10: 0000000000000000 R11: 0000000000000216 R12: 
000000c0074b84e0
[308188.424422] R13: 0000000000000000 R14: 000000c0039a2540 R15: 
00000000000007c6
[308188.424667]  </TASK>

The issue currently affects the Jammy 5.15 kernel and is not present in the
6.8 kernel or later.

[Test Plan]

Reproducing the issue requires a machine able to run a large number of
applications concurrently, all hammering kernfs with I/O requests.

Commit 9caf69614225 ("kernfs: Introduce separate rwsem to protect inode
attributes") contains a C reproducer snippet.

Although in that reproducer the sysfs files being accessed belong to an
InfiniBand device, running multiple instances of any similar program that
reads files from sysfs will trigger the issue.
An example could be reading data from "/sys/class/net/<interface>/statistics/*"
instead.

In other words, running multiple instances of a program that performs I/O
operations on sysfs will confirm whether the issue remains or is resolved.
To reliably reproduce the issue, or to confirm its resolution, dozens if
not hundreds of instances should be executed.
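As a stand-in for the C reproducer, a minimal shell sketch of such a load
generator could look like the following (the directory, iteration count, and
instance count below are assumptions; adjust them for the machine under test):

```shell
# hammer_sysfs: read every file under a sysfs directory, ITERS times.
# The directory is assumed to hold small read-only attribute files,
# e.g. /sys/class/net/<interface>/statistics.
hammer_sysfs() {
    dir=$1
    iters=$2
    i=0
    while [ "$i" -lt "$iters" ]; do
        for f in "$dir"/*; do
            cat "$f" > /dev/null
        done
        i=$((i + 1))
    done
}
```

Launching many concurrent instances, for example
`for n in $(seq 100); do hammer_sysfs /sys/class/net/eth0/statistics 100000 & done; wait`,
should produce hung-task warnings in dmesg on an affected kernel.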

[ Where problems could occur ]

* Since the fix changes the lock protecting kernfs files from a single rwsem 
covering the entire kernfs filesystem to finer-grained per-fs and 
per-inode-attribute locks, there is a chance that this could introduce race 
conditions on read/write paths.
* It might also cause deadlocks similar to the ones reported in this bug.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Affects: linux (Ubuntu Jammy)
     Importance: High
     Assignee: Ghadi Rahme (ghadi-rahme)
         Status: New

** Also affects: linux (Ubuntu Jammy)
   Importance: Undecided
       Status: New


** Changed in: linux (Ubuntu Jammy)
     Assignee: (unassigned) => Ghadi Rahme (ghadi-rahme)

** Changed in: linux (Ubuntu Jammy)
   Importance: Undecided => High

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2125142

Title:
  Hung task when heavily accessing kernfs files

Status in linux package in Ubuntu:
  New
Status in linux source package in Jammy:
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2125142/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
