> From: Winiarski, Michal <[email protected]>
> Sent: Thursday, November 6, 2025 6:56 PM
>
> On Thu, Nov 06, 2025 at 09:20:36AM +0100, Tian, Kevin wrote:
> >
> > this deferred_reset logic is a mlx unique thing. See:
> >
> > https://lore.kernel.org/kvm/[email protected]/
>
> Interesting, that doesn't match my observations.
>
> [] ======================================================
> [] WARNING: possible circular locking dependency detected
> [] 6.18.0-rc3-xe+ #90 Tainted: G S U
> [] ------------------------------------------------------
> [] qemu-system-x86/4375 is trying to acquire lock:
> [] ff1100015af3ec30 (&migf->lock){+.+.}-{3:3}, at: xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
> []
> but task is already holding lock:
> [] ff1100014c541a58 (&xe_vdev->state_mutex){+.+.}-{3:3}, at: xe_vfio_pci_set_device_state+0x43/0x440 [xe_vfio_pci]
> []
> which lock already depends on the new lock.
>
> []
> the existing dependency chain (in reverse order) is:
> []
> -> #3 (&xe_vdev->state_mutex){+.+.}-{3:3}:
> []        __mutex_lock+0xba/0x1110
> []        mutex_lock_nested+0x1b/0x30
> []        xe_vfio_pci_reset_done+0x2b/0xc0 [xe_vfio_pci]
> []        pci_try_reset_function+0xd7/0x150
> []        vfio_pci_core_ioctl+0x7f1/0xf20 [vfio_pci_core]
> []        vfio_device_fops_unl_ioctl+0x163/0x310 [vfio]
> []        __se_sys_ioctl+0x71/0xc0
> []        __x64_sys_ioctl+0x1d/0x30
> []        x64_sys_call+0x6ac/0xe50
> []        do_syscall_64+0xa1/0x560
> []        entry_SYSCALL_64_after_hwframe+0x4b/0x53
> []
> -> #2 (&vdev->memory_lock){++++}-{3:3}:
> []        down_read+0x41/0x180
> []        vfio_pci_mmap_huge_fault+0xec/0x310 [vfio_pci_core]
> []        handle_mm_fault+0x8aa/0x13b0
> []        fixup_user_fault+0x124/0x280
> []        vaddr_get_pfns+0x1d2/0x420 [vfio_iommu_type1]
> []        vfio_pin_pages_remote+0x173/0x610 [vfio_iommu_type1]
> []        vfio_pin_map_dma+0xfd/0x340 [vfio_iommu_type1]
> []        vfio_iommu_type1_ioctl+0xfdf/0x1290 [vfio_iommu_type1]
> []        vfio_fops_unl_ioctl+0x106/0x340 [vfio]
> []        __se_sys_ioctl+0x71/0xc0
> []        __x64_sys_ioctl+0x1d/0x30
> []        x64_sys_call+0x6ac/0xe50
> []        do_syscall_64+0xa1/0x560
> []        entry_SYSCALL_64_after_hwframe+0x4b/0x53
> []
> -> #1 (&mm->mmap_lock){++++}-{3:3}:
> []        __might_fault+0x56/0x90
> []        _copy_to_user+0x23/0x70
> []        xe_sriov_migration_data_read+0x17b/0x3f0 [xe]
> []        xe_sriov_vfio_data_read+0x44/0x60 [xe]
> []        xe_vfio_pci_save_read+0x55/0x80 [xe_vfio_pci]
> []        vfs_read+0xc0/0x300
> []        ksys_read+0x79/0xf0
> []        __x64_sys_read+0x1b/0x30
> []        x64_sys_call+0xcc4/0xe50
> []        do_syscall_64+0xa1/0x560
> []        entry_SYSCALL_64_after_hwframe+0x4b/0x53
> []
> -> #0 (&migf->lock){+.+.}-{3:3}:
> []        __lock_acquire+0x1aff/0x3450
> []        lock_acquire+0xde/0x280
> []        __mutex_lock+0xba/0x1110
> []        mutex_lock_nested+0x1b/0x30
> []        xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
> []        vfio_ioctl_device_feature_mig_device_state+0x9c/0x1b0 [vfio]
> []        vfio_device_fops_unl_ioctl+0x289/0x310 [vfio]
> []        __se_sys_ioctl+0x71/0xc0
> []        __x64_sys_ioctl+0x1d/0x30
> []        x64_sys_call+0x6ac/0xe50
> []        do_syscall_64+0xa1/0x560
> []        entry_SYSCALL_64_after_hwframe+0x4b/0x53
> []
> other info that might help us debug this:
>
> [] Chain exists of:
>      &migf->lock --> &vdev->memory_lock --> &xe_vdev->state_mutex
>
> []  Possible unsafe locking scenario:
>
> []        CPU0                    CPU1
> []        ----                    ----
> []   lock(&xe_vdev->state_mutex);
> []                                lock(&vdev->memory_lock);
> []                                lock(&xe_vdev->state_mutex);
> []   lock(&migf->lock);
> []
>  *** DEADLOCK ***
>
> [] 1 lock held by qemu-system-x86/4375:
> []  #0: ff1100014c541a58 (&xe_vdev->state_mutex){+.+.}-{3:3}, at: xe_vfio_pci_set_device_state+0x43/0x440 [xe_vfio_pci]
> []
> stack backtrace:
> [] CPU: 18 UID: 0 PID: 4375 Comm: qemu-system-x86 Tainted: G S U 6.18.0-rc3-xe+ #90 PREEMPT(voluntary)
> [] Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER
> [] Hardware name: Intel Corporation WHITLEY/WHITLEY, BIOS SE5C6200.86B.0027.P18.2206090856 06/09/2022
> [] Call Trace:
> []  <TASK>
> []  __dump_stack+0x19/0x30
> []  dump_stack_lvl+0x66/0x90
> []  dump_stack+0x10/0x14
> []  print_circular_bug+0x2fd/0x310
> []  check_noncircular+0x139/0x160
> []  __lock_acquire+0x1aff/0x3450
> []  ? vprintk_emit+0x3ec/0x560
> []  ? dev_vprintk_emit+0x162/0x1c0
> []  lock_acquire+0xde/0x280
> []  ? xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
> []  ? xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
> []  __mutex_lock+0xba/0x1110
> []  ? xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
> []  mutex_lock_nested+0x1b/0x30
> []  xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
> []  vfio_ioctl_device_feature_mig_device_state+0x9c/0x1b0 [vfio]
> []  vfio_device_fops_unl_ioctl+0x289/0x310 [vfio]
> []  __se_sys_ioctl+0x71/0xc0
> []  ? entry_SYSCALL_64_after_hwframe+0x4b/0x53
> []  __x64_sys_ioctl+0x1d/0x30
> []  x64_sys_call+0x6ac/0xe50
> []  do_syscall_64+0xa1/0x560
> []  ? __lock_acquire+0x73f/0x3450
> []  ? __lock_acquire+0x73f/0x3450
> []  ? __lock_acquire+0x73f/0x3450
> []  ? lock_release+0x10b/0x340
> []  ? wp_page_reuse+0x82/0x100
> []  ? lock_release+0x10b/0x340
> []  ? wp_page_reuse+0xcc/0x100
> []  ? lock_acquire+0xde/0x280
> []  ? count_memcg_event_mm+0x20/0x170
> []  ? lock_is_held_type+0x8f/0x140
> []  ? lock_release+0x10b/0x340
> []  ? count_memcg_event_mm+0x20/0x170
> []  ? count_memcg_event_mm+0x20/0x170
> []  ? count_memcg_event_mm+0x20/0x170
> []  ? count_memcg_event_mm+0x114/0x170
> []  ? handle_mm_fault+0x1300/0x13b0
> []  ? handle_mm_fault+0x3c/0x13b0
> []  ? lock_vma_under_rcu+0x8c/0x230
> []  ? lock_release+0x10b/0x340
> []  ? exc_page_fault+0x77/0xf0
> []  ? irqentry_exit_to_user_mode+0x100/0x130
> []  ? irqentry_exit+0x31/0x80
> []  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> [] RIP: 0033:0x70dff032eb1d
> [] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
> [] RSP: 002b:00007ffcc0367ff0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [] RAX: ffffffffffffffda RBX: 00005748e046d600 RCX: 000070dff032eb1d
> [] RDX: 00007ffcc0368080 RSI: 0000000000003b75 RDI: 000000000000001d
> [] RBP: 00007ffcc0368040 R08: 00000005748df663 R09: 0000000000000007
> [] R10: 00005748df663060 R11: 0000000000000246 R12: 0000000000000001
> [] R13: 0000000000000000 R14: 00005748e055f0b0 R15: 00007ffcc0368080
> []  </TASK>
>
> In short:
>
> 0: set_device_state
> 	xe_vdev->state_mutex : migf->lock
> 1: data_read
> 	migf->lock : mm->mmap_lock
> 2: vfio_pin_dma
> 	mm->mmap_lock : vdev->memory_lock
> 3: vfio_pci_ioctl_reset
> 	vdev->memory_lock : xe_vdev->state_mutex
Oh, that's a good spot! The previous deadlock required three parties, due to
copy_from/to_user() under state_mutex:

0: set_device_state
	vdev->state_mutex : mm->mmap_lock
2: vfio_pin_dma
	mm->mmap_lock : vdev->memory_lock
3: vfio_pci_ioctl_reset
	vdev->memory_lock : vdev->state_mutex

Now, with migf->lock and the additional data_read path, it becomes a
four-party game. And it looks like it's common.

> In other words:
> set_device_state takes xe_vdev->state_mutex and blocks on migf->lock,
> data_read takes migf->lock and blocks on mm->mmap_lock
> vfio_pin_dma takes mm->mmap_lock and blocks on vdev->memory_lock
> reset takes vdev->memory_lock and blocks on xe_vdev->state_mutex
>
> copy_to_user/copy_from_user doesn't have to be called under state_mutex,
> it just needs to be taken under migf->lock.
> The deferred reset trick exists because migf->lock needs to be taken
> under state_mutex as part of reset_done callback, which completes the
> chain and triggers the lockdep splat.

This chain doesn't even reach migf->lock in the reset path. It's already
triggered when acquiring state_mutex.

> To me, it looks like something generic, that will have impact on any
> device specific driver variant.
> What am I missing?
>
> I wonder if drivers that don't implement the deferred reset trick were
> ever executed with lockdep enabled.

@Jason, @Yishai, @Shameer, @Giovanni, @Brett: Sounds like the right thing
is to pull the deferred reset trick back into every driver. Anything
overlooked?
