[Bug 1873952] [NEW] Call trace during manual controller reset on NVMe/RoCE array

Jennifer Duong Mon, 20 Apr 2020 13:06:04 -0700

Public bug reported:

After manually resetting one of my E-Series NVMe/RoCE controllers, I hit
the following call trace:


Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958231] workqueue: WQ_MEM_RECLAIM nvme-wq:nvme_rdma_reconnect_ctrl_work [nvme_rdma] is flushing !WQ_MEM_RECLAIM ib_addr:process_one_req [ib_core]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958244] WARNING: CPU: 11 PID: 6260 at kernel/workqueue.c:2610 check_flush_dependency+0x11c/0x140
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958245] Modules linked in: xfs nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache rpcrdma rdma_ucm ib_iser ib_umad libiscsi ib_ipoib scsi_transport_iscsi intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel kvm intel_cstate intel_rapl_perf joydev input_leds dcdbas mei_me mei ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel nvme_rdma rdma_cm iw_cm ib_cm nvme_fabrics nvme_core sunrpc ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs uas usb_storage ib_core hid_generic usbhid hid mgag200 crct10dif_pclmul drm_vram_helper crc32_pclmul i2c_algo_bit ttm ghash_clmulni_intel drm_kms_helper ixgbe aesni_intel syscopyarea sysfillrect mxm_wmi xfrm_algo sysimgblt crypto_simd mlx5_core fb_sys_fops dca cryptd drm glue_helper mdio pci_hyperv_intf ahci lpc_ich tg3
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958305]  tls libahci mlxfw wmi scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958315] CPU: 11 PID: 6260 Comm: kworker/u34:3 Not tainted 5.4.0-24-generic #28-Ubuntu
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958316] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.8.0 005/17/2018
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958321] Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958326] RIP: 0010:check_flush_dependency+0x11c/0x140
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958329] Code: 8d 8b b0 00 00 00 48 8b 50 18 4d 89 e0 48 8d b1 b0 00 00 00 48 c7 c7 40 f8 75 9d 4c 89 c9 c6 05 f1 d9 74 01 01 e8 1f 14 fe ff <0f> 0b e9 07 ff ff ff 80 3d df d9 74 01 00 75 92 e9 3c ff ff ff 66
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958331] RSP: 0018:ffffb34bc4e87bf0 EFLAGS: 00010086
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958333] RAX: 0000000000000000 RBX: ffff946423812400 RCX: 0000000000000000
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958334] RDX: 0000000000000089 RSI: ffffffff9df926a9 RDI: 0000000000000046
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958336] RBP: ffffb34bc4e87c10 R08: ffffffff9df92620 R09: 0000000000000089
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958337] R10: ffffffff9df92a00 R11: 000000009df9268f R12: ffffffffc09be560
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958338] R13: ffff9468238b2f00 R14: 0000000000000001 R15: ffff94682dbbb700
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958341] FS:  0000000000000000(0000) GS:ffff94682fd40000(0000) knlGS:0000000000000000
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958344] CR2: 00007ff61cbf4ff8 CR3: 000000040a40a001 CR4: 00000000003606e0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958345] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958347] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958348] Call Trace:
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958355]  __flush_work+0x97/0x1d0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958360]  __cancel_work_timer+0x10e/0x190
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958368]  ? dev_printk_emit+0x4e/0x65
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958371]  cancel_delayed_work_sync+0x13/0x20
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958387]  rdma_addr_cancel+0x8a/0xb0 [ib_core]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958393]  cma_cancel_operation+0x72/0x1e0 [rdma_cm]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958398]  rdma_destroy_id+0x56/0x2f0 [rdma_cm]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958402]  nvme_rdma_alloc_queue.cold+0x28/0x5b [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958406]  nvme_rdma_setup_ctrl+0x37/0x720 [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958412]  ? try_to_wake_up+0x224/0x6a0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958416]  nvme_rdma_reconnect_ctrl_work+0x27/0x40 [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958419]  process_one_work+0x1eb/0x3b0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958422]  worker_thread+0x4d/0x400
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958427]  kthread+0x104/0x140
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958430]  ? process_one_work+0x3b0/0x3b0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958432]  ? kthread_park+0x90/0x90
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958439]  ret_from_fork+0x35/0x40
Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958442] ---[ end trace 859f78e32cc2aa61 ]---
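
For triage, the two sides of the flush dependency named in the first
"workqueue:" message can be pulled out mechanically. The following is a
minimal sketch, assuming the message has been rejoined onto a single line;
the function name parse_flush_warning and the regular expression are
illustrative only and are not part of any kernel or nvme-cli tooling.

```python
import re

# Illustrative pattern for the "workqueue: WQ_MEM_RECLAIM ... is flushing
# !WQ_MEM_RECLAIM ..." message quoted in the log above. The group names are
# assumptions made for this sketch.
_FLUSH_RE = re.compile(
    r"workqueue: WQ_MEM_RECLAIM (?P<flusher_wq>[^:\s]+):(?P<flusher_fn>\S+)"
    r"(?: \[(?P<flusher_mod>\w+)\])? is flushing !WQ_MEM_RECLAIM "
    r"(?P<target_wq>[^:\s]+):(?P<target_fn>\S+)(?: \[(?P<target_mod>\w+)\])?"
)


def parse_flush_warning(line):
    """Return the flusher and target workqueue, function, and module.

    Returns None when the line does not contain the warning text.
    """
    match = _FLUSH_RE.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Message text copied from the log above, joined onto one line.
    sample = (
        "workqueue: WQ_MEM_RECLAIM nvme-wq:nvme_rdma_reconnect_ctrl_work "
        "[nvme_rdma] is flushing !WQ_MEM_RECLAIM "
        "ib_addr:process_one_req [ib_core]"
    )
    for key, value in parse_flush_warning(sample).items():
        print(f"{key}: {value}")
```

Run against the message above, this separates the flushing side (nvme-wq,
nvme_rdma_reconnect_ctrl_work in nvme_rdma) from the flushed side (ib_addr,
process_one_req in ib_core), which are the two modules involved in the
warning.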

This seems to occur consistently on my direct-connect host but not on my
fabric-attached hosts. I'm running Ubuntu 20.04 with kernel
5.4.0-24-generic. I have the following Mellanox cards installed:

MCX416A-CCAT FW 12.27.1016
MCX4121A-ACAT FW 14.27.1016

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: nvme-cli 1.9-1
ProcVersionSignature: Ubuntu 5.4.0-24.28-generic 5.4.30
Uname: Linux 5.4.0-24-generic x86_64
ApportVersion: 2.20.11-0ubuntu27
Architecture: amd64
CasperMD5CheckResult: skip
Date: Fri Apr 17 14:28:50 2020
InstallationDate: Installed on 2020-04-15 (2 days ago)
InstallationMedia: Ubuntu-Server 20.04 LTS "Focal Fossa" - Alpha amd64 
(20200124)
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: nvme-cli
UpgradeStatus: No upgrade log present (probably fresh install)
modified.conffile..etc.nvme.hostnqn: ictm1611s01h4-hostnqn
mtime.conffile..etc.nvme.hostnqn: 2020-04-15T13:43:48.076829

** Affects: nvme-cli (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug focal

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1873952

Title:
  Call trace during manual controller reset on NVMe/RoCE array

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/nvme-cli/+bug/1873952/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
