Public bug reported:

We're running a server at AWS which collects data from machines over CIFS. This involves a lot of mounting and unmounting of CIFS shares (about 100 targets with 2 shares each, with a delay of 10 in between). The targets sometimes become unavailable when they are turned off for the weekend or rebooted.
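For context, the collection job boils down to a loop roughly like the following. This is only a simplified sketch; the target list, credentials file and local paths are placeholders rather than our actual script, and the full mount option string is given further below:

  #!/bin/sh
  # Simplified sketch of the collection loop (placeholder paths/files).
  while read -r target; do
      for share in Meldung Wartung; do
          mnt="/mnt/$target/$share"
          mkdir -p "$mnt"
          # Subset of the real options; see the full option string below.
          if mount -t cifs "//$target/$share" "$mnt" \
                 -o ro,vers=1.0,soft,credentials=/etc/cifs.cred,echo_interval=60,actimeo=1; then
              cp -a "$mnt/." "/data/$target/$share/"   # collect the files
              umount "$mnt"
          fi
      done
      sleep 10   # the delay between targets mentioned above
  done < /etc/connector/targets.txt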
The server doing this has to be rebooted every few hours because CIFS connections start to hang and don't recover. The usual symptom is:

Jul 24 10:12:59 connector kernel: [ 7765.705409] CIFS: Attempting to mount //172.22.2.112/Meldung
Jul 24 10:13:01 connector kernel: [ 7767.689258] CIFS: Attempting to mount //172.22.2.112/Wartung
Jul 24 10:13:06 connector kernel: [ 7772.758283] CIFS: Attempting to mount //172.30.113.108/Meldung
Jul 24 10:13:06 connector kernel: [ 7773.300475] CIFS: Attempting to mount //172.30.113.108/Wartung
Jul 24 10:13:09 connector kernel: [ 7776.364516] CIFS: Attempting to mount //172.30.99.55/Meldung
Jul 24 10:13:11 connector kernel: [ 7777.978731] CIFS: Attempting to mount //172.30.99.55/Wartung
[...]
Jul 24 10:16:13 connector kernel: [ 7960.390529] CIFS VFS: \\172.30.113.108 has not responded in 180 seconds. Reconnecting...
Jul 24 10:16:15 connector kernel: [ 7962.468649] CIFS VFS: \\172.30.93.171 has not responded in 180 seconds. Reconnecting...
Jul 24 10:16:18 connector kernel: [ 7964.999037] CIFS VFS: \\172.30.99.55 has not responded in 180 seconds. Reconnecting...
Jul 24 10:16:31 connector kernel: [ 7977.798821] INFO: task cifsd:26252 blocked for more than 120 seconds.
Jul 24 10:16:31 connector kernel: [ 7977.803730] Not tainted 5.4.0-1020-aws #20-Ubuntu
Jul 24 10:16:31 connector kernel: [ 7977.808526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 24 10:16:31 connector kernel: [ 7977.820291] cifsd D 0 26252 2 0x80004000
Jul 24 10:16:31 connector kernel: [ 7977.820298] Call Trace:
Jul 24 10:16:31 connector kernel: [ 7977.820307] __schedule+0x2e3/0x740
Jul 24 10:16:31 connector kernel: [ 7977.820310] ? __switch_to_asm+0x40/0x70
Jul 24 10:16:31 connector kernel: [ 7977.820313] ? __switch_to_asm+0x34/0x70
Jul 24 10:16:31 connector kernel: [ 7977.820315] schedule+0x42/0xb0
Jul 24 10:16:31 connector kernel: [ 7977.820318] rwsem_down_read_slowpath+0x16c/0x4a0
Jul 24 10:16:31 connector kernel: [ 7977.820321] down_read+0x85/0xa0
Jul 24 10:16:31 connector kernel: [ 7977.820324] iterate_supers_type+0x70/0xf0
Jul 24 10:16:31 connector kernel: [ 7977.820411] ? cifs_set_cifscreds.isra.0+0x800/0x800 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820429] cifs_reconnect+0x8a/0xdc0 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820433] ? vprintk_func+0x4c/0xbc
Jul 24 10:16:31 connector kernel: [ 7977.820449] cifs_readv_from_socket+0x17a/0x260 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820465] cifs_read_from_socket+0x4c/0x70 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820482] ? allocate_buffers+0x43/0x130 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820497] cifs_demultiplex_thread+0xe1/0xcc0 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820500] kthread+0x104/0x140
Jul 24 10:16:31 connector kernel: [ 7977.820516] ? cifs_handle_standard+0x1b0/0x1b0 [cifs]
Jul 24 10:16:31 connector kernel: [ 7977.820518] ? kthread_park+0x90/0x90
Jul 24 10:16:31 connector kernel: [ 7977.820520] ret_from_fork+0x22/0x40
Jul 24 10:16:31 connector kernel: [ 7977.820524] INFO: task cifsd:26328 blocked for more than 120 seconds.
Jul 24 10:16:31 connector kernel: [ 7977.827503] Not tainted 5.4.0-1020-aws #20-Ubuntu

That is, cifsd gets stuck fetching credentials for the reconnect (blocked in down_read() via iterate_supers_type(), called from cifs_reconnect). I'm attaching the full syslog with stack traces from all hung cifsd tasks (I can't see where the deadlock is from those). The mounting/unmounting is done in a privileged Docker container.
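For reference, the container is started roughly like this (image name, script and volume are placeholders, not our exact setup); --privileged is what allows mount(2)/umount(2) to be run inside the container:

  # Hypothetical invocation of the collector container.
  # --privileged grants CAP_SYS_ADMIN (among other things), which the
  # CIFS mounts performed inside the container require.
  docker run -d --name connector \
      --privileged \
      -v /data:/data \
      connector-image /usr/local/bin/collect.sh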
If we restart that container, we usually run into an Oops:

Jul 25 07:43:29 connector kernel: [64677.164367] Oops: 0000 [#1] SMP NOPTI
Jul 25 07:43:29 connector kernel: [64677.164370] CPU: 0 PID: 265452 Comm: cifsd Not tainted 5.4.0-1020-aws #20-Ubuntu
Jul 25 07:43:29 connector kernel: [64677.164370] Hardware name: Amazon EC2 t3a.large/, BIOS 1.0 10/16/2017
Jul 25 07:43:29 connector kernel: [64677.164400] RIP: 0010:cifs_reconnect+0x9be/0xdc0 [cifs]
Jul 25 07:43:29 connector kernel: [64677.164403] Code: e8 bb 43 0c d5 66 90 48 8b 45 c0 48 8d 55 c0 4c 8d 6d b8 48 39 c2 74 62 49 be 00 01 00 00 00 00 ad de 48 8b 45 c0 4c 8d 78 f8 <48> 8b 00 48 8d 58 f8 4d 39 ef 74 3d 49 8b 57 10 48 89 50 08 48 89
Jul 25 07:43:29 connector kernel: [64677.218175] RSP: 0018:ffffbf25c0b27cf8 EFLAGS: 00010286
Jul 25 07:43:29 connector kernel: [64677.222539] RAX: 0000000000000000 RBX: ffff9cdef66f0800 RCX: ffffffff95cd8510
Jul 25 07:43:29 connector kernel: [64677.227607] RDX: ffffbf25c0b27d30 RSI: ffffbf25c0b27d18 RDI: ffffffffc0aeec18
Jul 25 07:43:29 connector kernel: [64677.232638] RBP: ffffbf25c0b27d70 R08: 0000000000000180 R09: 0000000000000000
Jul 25 07:43:29 connector kernel: [64677.237666] R10: ffff9cdf32a173c8 R11: 0000000000000000 R12: 00000000fffffffe
Jul 25 07:43:29 connector kernel: [64677.242789] R13: ffffbf25c0b27d28 R14: dead000000000100 R15: fffffffffffffff8
Jul 25 07:43:29 connector kernel: [64677.247874] FS: 0000000000000000(0000) GS:ffff9cdf32a00000(0000) knlGS:0000000000000000
Jul 25 07:43:29 connector kernel: [64677.254956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 25 07:43:29 connector kernel: [64677.259348] CR2: 0000000000000000 CR3: 00000001cddce000 CR4: 00000000003406f0
Jul 25 07:43:29 connector kernel: [64677.264439] Call Trace:
Jul 25 07:43:29 connector kernel: [64677.267345] ? vprintk_func+0x4c/0xbc
Jul 25 07:43:29 connector kernel: [64677.270720] cifs_readv_from_socket+0x17a/0x260 [cifs]
Jul 25 07:43:29 connector kernel: [64677.274889] cifs_read_from_socket+0x4c/0x70 [cifs]
Jul 25 07:43:29 connector kernel: [64677.278914] ? cifs_add_credits+0x56/0x60 [cifs]
Jul 25 07:43:29 connector kernel: [64677.282722] ? allocate_buffers+0x6d/0x130 [cifs]
Jul 25 07:43:29 connector kernel: [64677.286453] cifs_demultiplex_thread+0xe1/0xcc0 [cifs]
Jul 25 07:43:29 connector kernel: [64677.290566] kthread+0x104/0x140
Jul 25 07:43:29 connector kernel: [64677.293969] ? cifs_handle_standard+0x1b0/0x1b0 [cifs]
Jul 25 07:43:29 connector kernel: [64677.298096] ? kthread_park+0x90/0x90
Jul 25 07:43:29 connector kernel: [64677.301535] ret_from_fork+0x22/0x40
Jul 25 07:43:29 connector kernel: [64677.304799] Modules linked in: md4 nls_utf8 cifs libarc4 libdes rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache xt_nat veth vxlan ip6_udp_tunnel udp_tunnel xt_policy iptable_mangle xt_mark xt_u32 xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bpfilter br_netfilter bridge stp llc aufs overlay dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ena serio_raw parport_pc parport sch_fq_codel drm i2c_core sunrpc ip_tables x_tables autofs4
Jul 25 07:43:29 connector kernel: [64677.387761] CR2: 0000000000000000
Jul 25 07:43:29 connector kernel: [64677.391027] ---[ end trace b498d70d7111f607 ]---

The mount options used are:

  ro,relatime,vers=1.0,cache=strict,username=xxx,domain=xxx,uid=0,noforceuid,gid=0,noforcegid,addr=172.30.2.138,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,rsize=61440,wsize=65536,bsize=1048576,echo_interval=60,actimeo=1

The attached log files also contain some CIFS debug messages, generated with:

  echo 'module cifs +p' > /sys/kernel/debug/dynamic_debug/control
  echo 'file fs/cifs/* +p' > /sys/kernel/debug/dynamic_debug/control
  echo 1 > /proc/fs/cifs/cifsFYI

Is there any way of trying a newer kernel? https://github.com/torvalds/linux/commits/master/fs/cifs suggests that some of the problems (at least the Oops) might already have been fixed.
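If it would help with narrowing this down, we could test one of the Ubuntu mainline kernel builds from https://kernel.ubuntu.com/~kernel-ppa/mainline/ on a spare instance, roughly along these lines (only a sketch of what we would try, not something we have done yet; the exact .deb names depend on the version chosen):

  # Install a mainline test kernel (unsupported, for debugging only),
  # using the linux-image/linux-modules .debs downloaded manually from
  # the mainline build page for the chosen version:
  sudo dpkg -i ./linux-image-*.deb ./linux-modules-*.deb
  sudo reboot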
ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.4.0-1020-aws 5.4.0-1020.20
ProcVersionSignature: User Name 5.4.0-1020.20-aws 5.4.44
Uname: Linux 5.4.0-1020-aws x86_64
ApportVersion: 2.20.11-0ubuntu27.4
Architecture: amd64
CasperMD5CheckResult: skip
Date: Sat Jul 25 11:55:47 2020
Ec2AMI: ami-07d14b5d47292e022
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: eu-central-1a
Ec2InstanceType: t3a.large
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/usr/bin/zsh
SourcePackage: linux-aws
UpgradeStatus: No upgrade log present (probably fresh install)

** Affects: linux-aws (Ubuntu)
     Importance: Undecided
         Status: New

** Tags: amd64 apport-bug ec2-images focal

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-aws in Ubuntu.
https://bugs.launchpad.net/bugs/1888936

Title:
  cifsd deadlocks / CIFS related Oopses

Status in linux-aws package in Ubuntu:
  New