Public bug reported:

After manually resetting one of my E-Series NVMe/RoCE controllers, I hit the following call trace:

Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958231] workqueue: WQ_MEM_RECLAIM nvme-wq:nvme_rdma_reconnect_ctrl_work [nvme_rdma] is flushing !WQ_MEM_RECLAIM ib_addr:process_one_req [ib_core]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958244] WARNING: CPU: 11 PID: 6260 at kernel/workqueue.c:2610 check_flush_dependency+0x11c/0x140
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958245] Modules linked in: xfs nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache rpcrdma rdma_ucm ib_iser ib_umad libiscsi ib_ipoib scsi_transport_iscsi intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel kvm intel_cstate intel_rapl_perf joydev input_leds dcdbas mei_me mei ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel nvme_rdma rdma_cm iw_cm ib_cm nvme_fabrics nvme_core sunrpc ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs uas usb_storage ib_core hid_generic usbhid hid mgag200 crct10dif_pclmul drm_vram_helper crc32_pclmul i2c_algo_bit ttm ghash_clmulni_intel drm_kms_helper ixgbe aesni_intel syscopyarea sysfillrect mxm_wmi xfrm_algo sysimgblt crypto_simd mlx5_core fb_sys_fops dca cryptd drm glue_helper mdio pci_hyperv_intf ahci lpc_ich tg3
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958305] tls libahci mlxfw wmi scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958315] CPU: 11 PID: 6260 Comm: kworker/u34:3 Not tainted 5.4.0-24-generic #28-Ubuntu
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958316] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.8.0 005/17/2018
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958321] Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958326] RIP: 0010:check_flush_dependency+0x11c/0x140
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958329] Code: 8d 8b b0 00 00 00 48 8b 50 18 4d 89 e0 48 8d b1 b0 00 00 00 48 c7 c7 40 f8 75 9d 4c 89 c9 c6 05 f1 d9 74 01 01 e8 1f 14 fe ff <0f> 0b e9 07 ff ff ff 80 3d df d9 74 01 00 75 92 e9 3c ff ff ff 66
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958331] RSP: 0018:ffffb34bc4e87bf0 EFLAGS: 00010086
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958333] RAX: 0000000000000000 RBX: ffff946423812400 RCX: 0000000000000000
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958334] RDX: 0000000000000089 RSI: ffffffff9df926a9 RDI: 0000000000000046
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958336] RBP: ffffb34bc4e87c10 R08: ffffffff9df92620 R09: 0000000000000089
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958337] R10: ffffffff9df92a00 R11: 000000009df9268f R12: ffffffffc09be560
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958338] R13: ffff9468238b2f00 R14: 0000000000000001 R15: ffff94682dbbb700
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958341] FS: 0000000000000000(0000) GS:ffff94682fd40000(0000) knlGS:0000000000000000
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958342] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958344] CR2: 00007ff61cbf4ff8 CR3: 000000040a40a001 CR4: 00000000003606e0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958345] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958347] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958348] Call Trace:
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958355] __flush_work+0x97/0x1d0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958360] __cancel_work_timer+0x10e/0x190
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958368] ? dev_printk_emit+0x4e/0x65
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958371] cancel_delayed_work_sync+0x13/0x20
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958387] rdma_addr_cancel+0x8a/0xb0 [ib_core]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958393] cma_cancel_operation+0x72/0x1e0 [rdma_cm]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958398] rdma_destroy_id+0x56/0x2f0 [rdma_cm]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958402] nvme_rdma_alloc_queue.cold+0x28/0x5b [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958406] nvme_rdma_setup_ctrl+0x37/0x720 [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958412] ? try_to_wake_up+0x224/0x6a0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958416] nvme_rdma_reconnect_ctrl_work+0x27/0x40 [nvme_rdma]
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958419] process_one_work+0x1eb/0x3b0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958422] worker_thread+0x4d/0x400
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958427] kthread+0x104/0x140
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958430] ? process_one_work+0x3b0/0x3b0
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958432] ? kthread_park+0x90/0x90
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958439] ret_from_fork+0x35/0x40
Apr 20 14:08:24 ICTM1611S01H4 kernel: [ 949.958442] ---[ end trace 859f78e32cc2aa61 ]---

This seems to consistently occur on my direct-connect host and not my fabric-attached hosts. I'm running with Ubuntu 20.04 kernel-5.4.0-24.

I have the following Mellanox cards installed:
MCX416A-CCAT FW 12.27.1016
MCX4121A-ACAT FW 14.27.1016

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: nvme-cli 1.9-1
ProcVersionSignature: Ubuntu 5.4.0-24.28-generic 5.4.30
Uname: Linux 5.4.0-24-generic x86_64
ApportVersion: 2.20.11-0ubuntu27
Architecture: amd64
CasperMD5CheckResult: skip
Date: Fri Apr 17 14:28:50 2020
InstallationDate: Installed on 2020-04-15 (2 days ago)
InstallationMedia: Ubuntu-Server 20.04 LTS "Focal Fossa" - Alpha amd64 (20200124)
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: nvme-cli
UpgradeStatus: No upgrade log present (probably fresh install)
modified.conffile..etc.nvme.hostnqn: ictm1611s01h4-hostnqn
mtime.conffile..etc.nvme.hostnqn: 2020-04-15T13:43:48.076829

** Affects: nvme-cli (Ubuntu)
     Importance: Undecided
         Status: New

** Tags: amd64 apport-bug focal

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1873952

Title:
  Call trace during manual controller reset on NVMe/RoCE array

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/nvme-cli/+bug/1873952/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs