Public bug reported:

QEMU processes stuck on io_uring lock in Ubuntu 24.04, on kernel
6.8.0-56.

For the past two weeks I have been migrating more hosts to Ubuntu 24.04, coming
from 22.04. Since then I occasionally see a VM get stuck in uninterruptible
sleep (proc state D). dmesg then shows the same call trace as the one pasted below.

On Ubuntu 22.04 I was running the HWE kernel packages with versions 6.5 and 6.8,
although I was not running 6.8 on as many hosts as I am now.

I did find a locking patch in the 6.8.0-56 changelog and am wondering whether it
could be the cause:
+
+               /*
+                * For silly syzbot cases that deliberately overflow by huge
+                * amounts, check if we need to resched and drop and
+                * reacquire the locks if so. Nothing real would ever hit this.
+                * Ideally we'd have a non-posting unlock for this, but hard
+                * to care for a non-real case.
+                */
+               if (need_resched()) {
+                       io_cq_unlock_post(ctx);
+                       mutex_unlock(&ctx->uring_lock);
+                       cond_resched();
+                       mutex_lock(&ctx->uring_lock);
+                       io_cq_lock(ctx);
+               }
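
If I read the hunk right it sits in the CQ-overflow flush path: the unlock/relock
only happens when userspace has overflowed the completion ring and the kernel
needs to reschedule while draining it. Below is a rough liburing sketch (my own,
untested, not taken from the changelog) of how a process can deliberately overflow
a tiny CQ ring and then force the kernel to flush it; the ring size and loop count
are arbitrary, and I do not know whether this actually reaches the need_resched()
branch, so treat it as a starting point only:

/* overflow.c - untested sketch; build with: gcc overflow.c -o overflow -luring */
#include <stdio.h>
#include <liburing.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_cqe *cqe;
        int ret, i, submitted = 0;

        /* Tiny ring: 8 SQ entries, so the CQ ring (twice that by default)
         * fills up almost immediately. */
        ret = io_uring_queue_init(8, &ring, 0);
        if (ret < 0) {
                fprintf(stderr, "queue_init: %d\n", ret);
                return 1;
        }

        /* Queue far more no-op requests than the CQ ring can hold, without
         * reaping any completions, so they pile up on the kernel's
         * overflow list. */
        for (i = 0; i < 4096; i++) {
                struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

                if (!sqe) {
                        ret = io_uring_submit(&ring);
                        if (ret < 0)
                                break;  /* e.g. -EBUSY once the CQ is overflown */
                        submitted += ret;
                        sqe = io_uring_get_sqe(&ring);
                        if (!sqe)
                                break;
                }
                io_uring_prep_nop(sqe);
        }
        ret = io_uring_submit(&ring);
        if (ret > 0)
                submitted += ret;

        /* Reap everything; waiting here is what makes the kernel flush the
         * overflowed completions, i.e. the path the quoted hunk touches. */
        while (submitted-- > 0) {
                if (io_uring_wait_cqe(&ring, &cqe) < 0)
                        break;
                io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return 0;
}

In our setup the more realistic trigger is probably QEMU itself, since as far as
I know it only drives disk I/O through io_uring when the drive is configured with
aio=io_uring, so a guest hammering a slow backing device could pile up completions
in a similar way.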

/proc/cmdline: BOOT_IMAGE=/boot/vmlinuz-6.8.0-56-generic
root=/dev/mapper/hv9-root ro verbose security=apparmor rootdelay=10
max_loop=16 default_hugepagesz=1G hugepagesz=1G hugepages=448
libata.force=noncq iommu=pt crashkernel=512M-4G:128M,4G-8G:256M,8G-:512M


dmesg snippet:
[Thu Mar 27 18:50:48 2025] INFO: task qemu-system-x86:15480 blocked for more than 552 seconds.
[Thu Mar 27 18:50:48 2025]       Tainted: G           OE      6.8.0-56-generic #58-Ubuntu
[Thu Mar 27 18:50:48 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Mar 27 18:50:48 2025] task:qemu-system-x86 state:D stack:0     pid:15480 tgid:15480 ppid:1      flags:0x00024006
[Thu Mar 27 18:50:48 2025] Call Trace:
[Thu Mar 27 18:50:48 2025]  <TASK>
[Thu Mar 27 18:50:48 2025]  __schedule+0x27c/0x6b0
[Thu Mar 27 18:50:48 2025]  schedule+0x33/0x110
[Thu Mar 27 18:50:48 2025]  schedule_preempt_disabled+0x15/0x30
[Thu Mar 27 18:50:48 2025]  __mutex_lock.constprop.0+0x42f/0x740
[Thu Mar 27 18:50:48 2025]  ? srso_return_thunk+0x5/0x5f
[Thu Mar 27 18:50:48 2025]  __mutex_lock_slowpath+0x13/0x20
[Thu Mar 27 18:50:48 2025]  mutex_lock+0x3c/0x50
[Thu Mar 27 18:50:48 2025]  __do_sys_io_uring_enter+0x2e7/0x4a0
[Thu Mar 27 18:50:48 2025]  __x64_sys_io_uring_enter+0x22/0x40
[Thu Mar 27 18:50:48 2025]  x64_sys_call+0xeda/0x25a0
[Thu Mar 27 18:50:48 2025]  do_syscall_64+0x7f/0x180
[Thu Mar 27 18:50:48 2025]  ? srso_return_thunk+0x5/0x5f
[Thu Mar 27 18:50:48 2025]  ? syscall_exit_to_user_mode+0x86/0x260
[Thu Mar 27 18:50:48 2025]  ? srso_return_thunk+0x5/0x5f
[Thu Mar 27 18:50:48 2025]  ? do_syscall_64+0x8c/0x180
[Thu Mar 27 18:50:48 2025]  ? srso_return_thunk+0x5/0x5f
[Thu Mar 27 18:50:48 2025]  ? syscall_exit_to_user_mode+0x86/0x260
[Thu Mar 27 18:50:48 2025]  ? srso_return_thunk+0x5/0x5f
[Thu Mar 27 18:50:48 2025]  ? do_syscall_64+0x8c/0x180
[Thu Mar 27 18:50:48 2025]  ? srso_return_thunk+0x5/0x5f
[Thu Mar 27 18:50:48 2025]  ? __x64_sys_ioctl+0xbb/0xf0
[Thu Mar 27 18:50:48 2025]  ? srso_return_thunk+0x5/0x5f
[Thu Mar 27 18:50:48 2025]  ? __x64_sys_ioctl+0xbb/0xf0
[Thu Mar 27 18:50:48 2025]  ? srso_return_thunk+0x5/0x5f
[Thu Mar 27 18:50:48 2025]  ? syscall_exit_to_user_mode+0x86/0x260
[Thu Mar 27 18:50:48 2025]  ? srso_return_thunk+0x5/0x5f
[Thu Mar 27 18:50:48 2025]  ? __x64_sys_ioctl+0xbb/0xf0
[Thu Mar 27 18:50:48 2025]  ? srso_return_thunk+0x5/0x5f
[Thu Mar 27 18:50:48 2025]  ? syscall_exit_to_user_mode+0x86/0x260
[Thu Mar 27 18:50:48 2025]  ? srso_return_thunk+0x5/0x5f
[Thu Mar 27 18:50:48 2025]  ? do_syscall_64+0x8c/0x180
[Thu Mar 27 18:50:48 2025]  ? irqentry_exit+0x43/0x50
[Thu Mar 27 18:50:48 2025]  ? srso_return_thunk+0x5/0x5f
[Thu Mar 27 18:50:48 2025]  entry_SYSCALL_64_after_hwframe+0x78/0x80

At this moment I have not tried to reproduce this yet; I can try running fio on a
test host with the same kernel to see whether I can break it consistently.
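
Something along these lines is what I have in mind; the device path, sizes and
runtime below are placeholders I would adapt to the test host, and running the
same workload inside a guest whose disk uses QEMU's aio=io_uring backend would
match the failing path more closely:

fio --name=uring-stress --ioengine=io_uring --direct=1 --rw=randrw --bs=4k \
    --iodepth=128 --numjobs=8 --size=8G --runtime=1800 --time_based \
    --filename=/dev/mapper/hv9-testlv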

I also have a crash dump that I captured from one of the affected hosts.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/2105471

Title:
  io_uring process deadlock
