Public bug reported:

We found a failure in the GPU device disable/enable test that is related to
the call trace below. The trace appears during the disable step, when the
GPU device is removed.

The system does not panic, but the driver is not loaded again afterwards.

%echo 1 > /sys/bus/pci/devices/c09d:00:00.0/remove

Note: after this command, the PCI bus is not removed; only the ‘remove’ file
disappears, and the call trace below is logged. All other PCI devices are
removed successfully.
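
The symptom described above (device directory still present, 'remove' file gone) can be checked from a script. The following is only a minimal sketch: the function names, the state labels, and the SYSFS_ROOT override are made up for illustration, and it assumes the standard sysfs layout under /sys/bus/pci/devices.

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override so the check can be dry-run
# against a fake directory tree; it defaults to the real sysfs mount.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the sysfs directory for a PCI address such as c09d:00:00.0.
pci_dev_dir() {
    printf '%s/bus/pci/devices/%s\n' "$SYSFS_ROOT" "$1"
}

# Classify the device (labels are illustrative, not kernel terms):
#   absent          device directory is gone (normal result of a remove)
#   no-remove-file  directory still exists but 'remove' is missing
#   bound           a driver is bound to the device
#   unbound         device is present with no driver bound
pci_dev_state() {
    dir=$(pci_dev_dir "$1")
    if [ ! -d "$dir" ]; then
        echo "absent"
    elif [ ! -e "$dir/remove" ]; then
        echo "no-remove-file"
    elif [ -e "$dir/driver" ]; then
        echo "bound"
    else
        echo "unbound"
    fi
}
```

For example, `pci_dev_state c09d:00:00.0` run right after the remove command would be expected to print `absent` on a healthy system; in the failure reported here it would print `no-remove-file`.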


<Call trace>
[   56.649648] hv_balloon: Max. dynamic memory size: 57344 MB
[  457.438303] NVRM: Attempting to remove minor device 0 with non-zero usage count!
[  457.438305] ------------[ cut here ]------------
[  457.438465] WARNING: CPU: 4 PID: 5026 at /var/lib/dkms/nvidia/430.50/build/nvidia/nv.c:4068 nvidia_remove+0x39d/0x3b0 [nvidia]
[  457.438466] Modules linked in: xt_owner xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_security bpfilter nvidia_uvm(OE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops nls_iso8859_1 drm drm_panel_orientation_quirks ipmi_devintf ipmi_msghandler i2c_core pci_hyperv hv_balloon serio_raw sch_fq_codel joydev ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hid_generic aesni_intel aes_x86_64 hyperv_fb crypto_simd cryptd glue_helper hid_hyperv cfbfillrect cfbimgblt hyperv_keyboard cfbcopyarea pata_acpi hid hv_netvsc hv_utils
[  457.438493] CPU: 4 PID: 5026 Comm: bash Tainted: P           OE     5.0.0-1025-azure #27~18.04.1-Ubuntu
[  457.438494] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  06/02/2017
[  457.438564] RIP: 0010:nvidia_remove+0x39d/0x3b0 [nvidia]
[  457.438565] Code: ff e8 17 c5 9a f3 41 8b 95 68 04 00 00 48 c7 c6 f8 97 8e c1 bf 04 00 00 00 e8 cf 9c 00 00 48 c7 c7 b0 82 8e c1 e8 b6 8b a1 f3 <0f> 0b e8 cc a2 00 00 eb f9 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
[  457.438566] RSP: 0018:ffffb1578bcfbcf8 EFLAGS: 00010282
[  457.438567] RAX: 0000000000000024 RBX: ffff8ec43bdf0000 RCX: 0000000000000006
[  457.438568] RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff8ec445d15580
[  457.438568] RBP: ffffb1578bcfbd40 R08: 0000000000000001 R09: 000000000000023c
[  457.438569] R10: ffffb1578bcfba38 R11: 0000000000000000 R12: ffff8ec43d3b2000
[  457.438569] R13: ffff8ec4388b3000 R14: ffffffffc19411b0 R15: 0000000000000060
[  457.438570] FS:  00007f92d7263740(0000) GS:ffff8ec445d00000(0000) knlGS:0000000000000000
[  457.438573] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  457.438573] CR2: 0000560d3d973f60 CR3: 0000000e4aeca004 CR4: 00000000001606e0
[  457.438574] Call Trace:
[  457.438579]  pci_device_remove+0x3e/0xc0
[  457.438582]  device_release_driver_internal+0x18d/0x260
[  457.438583]  device_release_driver+0x12/0x20
[  457.438585]  pci_stop_bus_device+0x68/0x90
[  457.438586]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
[  457.438588]  remove_store+0x7c/0x90
[  457.438590]  dev_attr_store+0x1b/0x30
[  457.438592]  sysfs_kf_write+0x3c/0x50
[  457.438593]  kernfs_fop_write+0x125/0x1a0
[  457.438596]  __vfs_write+0x1b/0x40
[  457.438598]  vfs_write+0xb1/0x1a0
[  457.438599]  ksys_write+0x5c/0xe0
[  457.438601]  __x64_sys_write+0x1a/0x20
[  457.438603]  do_syscall_64+0x64/0x1b0
[  457.438607]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  457.438608] RIP: 0033:0x7f92d6947154
[  457.438609] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8d 05 b1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
[  457.438610] RSP: 002b:00007ffe5a69f208 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  457.438611] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f92d6947154
[  457.438612] RDX: 0000000000000002 RSI: 0000560d3d7bd8c0 RDI: 0000000000000001
[  457.438612] RBP: 0000560d3d7bd8c0 R08: 000000000000000a R09: 0000000000000001
[  457.438613] R10: 000000000000000a R11: 0000000000000246 R12: 00007f92d6c23760
[  457.438613] R13: 0000000000000002 R14: 00007f92d6c1f2a0 R15: 00007f92d6c1e760
[  457.438615] ---[ end trace 64ddc7a9a2dd8bd8 ]---
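
The NVRM line at the top of the trace ("non-zero usage count") suggests that something still had the device open when the remove was issued. One way to look for user-space holders is to scan open file descriptors under /proc, as sketched below. This is an assumption-laden illustration: the function name and the PROC_ROOT override are hypothetical, the /dev/nvidia prefix is the usual device node naming rather than something stated in this report, and in-kernel users (for example the nvidia_uvm or nvidia_drm modules) would not show up in such a scan.

```shell
#!/bin/sh
# PROC_ROOT is a hypothetical override so the scan can be dry-run
# against a fake tree; it defaults to the real procfs mount.
PROC_ROOT="${PROC_ROOT:-/proc}"

# Print "PID TARGET" for every open file descriptor whose symlink
# target starts with the given prefix (e.g. /dev/nvidia).
fd_holders() {
    prefix=$1
    for fd in "$PROC_ROOT"/[0-9]*/fd/*; do
        # Skip entries that vanish or cannot be read (needs root
        # to see other users' processes).
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            "$prefix"*)
                # Extract the PID component from PROC_ROOT/PID/fd/N.
                pid=${fd#"$PROC_ROOT"/}
                pid=${pid%%/*}
                printf '%s %s\n' "$pid" "$target"
                ;;
        esac
    done
}
```

Running `fd_holders /dev/nvidia` as root before the remove step would list any processes that still hold a device node open; an empty result does not rule out kernel-side references.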

Kernel: 5.0.0-1025-azure

This issue occurs on Ubuntu 18.04 but not on 16.04.

** Affects: linux-azure (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1853014

Title:
  GPU device disable/enable test failure

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1853014/+subscriptions

