Hi, we have the same issue on our k8s cluster.
Description: Ubuntu 16.04.1 LTS x86_64
Release: 16.04.1
Kernel: 4.4.0-104-generic

The dump file can be downloaded as follows:

wget http://129.226.115.161/dump.202006231820.tar.gz

I did some analysis, but I still have not found the root cause.

Load the vmcore in crash (please refer to the hyperlink above). Crash should present details similar to the following:

crash> bt
PID: 11388  TASK: ffff880eb1f79e00  CPU: 29  COMMAND: "heartbeat"
 #0 [ffff8809131a7b08] machine_kexec at ffffffff8105c22b
 #1 [ffff8809131a7b68] crash_kexec at ffffffff8110e852
 #2 [ffff8809131a7c38] oops_end at ffffffff81031c49
 #3 [ffff8809131a7c60] die at ffffffff810320fb
 #4 [ffff8809131a7c90] do_trap at ffffffff8102f121
 #5 [ffff8809131a7ce0] do_error_trap at ffffffff8102f4a9
 #6 [ffff8809131a7da0] do_invalid_op at ffffffff8102fa10
 #7 [ffff8809131a7db0] invalid_op at ffffffff8184638e
    [exception RIP: __fput+541]
    RIP: ffffffff812126ad  RSP: ffff8809131a7e68  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff880ef6915700  RCX: 0000000365fb1705
    RDX: 0000000000000001  RSI: ffff880fff55a020  RDI: 0000000000000000
    RBP: ffff8809131a7ea0   R8: 000000000001a020   R9: ffffffff811b591d
    R10: ffffea002b69b300  R11: ffff880ef6915710  R12: 0000000000000010
    R13: ffff880ed152aef8  R14: ffff8800bba18aa0  R15: ffff880ed1513a40
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff8809131a7e60] __fput at ffffffff812125ac
 #9 [ffff8809131a7ea8] ____fput at ffffffff812126ee
#10 [ffff8809131a7eb8] task_work_run at ffffffff8109f101
#11 [ffff8809131a7ef8] exit_to_usermode_loop at ffffffff81003242
#12 [ffff8809131a7f30] syscall_return_slowpath at ffffffff81003c6e
#13 [ffff8809131a7f50] int_ret_from_sys_call at ffffffff818449d0
    RIP: 000000000047f704  RSP: 000000c423b77c98  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: 000000000047f704
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 00000000000000ca
    RBP: 000000c423b77ce0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000000
    R13: 0000000000000000  R14: 000000c423b78ee0  R15: 0000000000000008
    ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b
crash>

crash> log
[19156101.592212] ------------[ cut here ]------------
[19156101.593103] kernel BUG at /build/linux-SwhOyu/linux-4.4.0/include/linux/fs.h:2582!
[19156101.594385] invalid opcode: 0000 [#1] SMP
[19156101.595083] Modules linked in: binfmt_misc af_packet_diag netlink_diag dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag veth br_netfilter ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_set xt_mark ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_bitmap_port ip_set_hash_ipport ip_set dummy xt_comment xt_addrtype iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_tcpudp bridge stp llc nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack aufs isofs ppdev crct10dif_pclmul parport_pc crc32_pclmul input_leds joydev ghash_clmulni_intel parport serio_raw ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr sunrpc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov
[19156101.606434] async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd psmouse floppy
[19156101.609129] CPU: 29 PID: 11388 Comm: heartbeat Not tainted 4.4.0-104-generic #127-Ubuntu
[19156101.610384] Hardware name: Smdbmds KVM, BIOS seabios-1.9.1-qemu-project.org 04/01/2014
[19156101.611637] task: ffff880eb1f79e00 ti: ffff8809131a4000 task.ti: ffff8809131a4000
[19156101.612905] RIP: 0010:[<ffffffff812126ad>] [<ffffffff812126ad>] __fput+0x21d/0x220
[19156101.614188] RSP: 0018:ffff8809131a7e68 EFLAGS: 00010246
[19156101.614989] RAX: 0000000000000000 RBX: ffff880ef6915700 RCX: 0000000365fb1705
[19156101.616143] RDX: 0000000000000001 RSI: ffff880fff55a020 RDI: 0000000000000000
[19156101.617285] RBP: ffff8809131a7ea0 R08: 000000000001a020 R09: ffffffff811b591d
[19156101.618422] R10: ffffea002b69b300 R11: ffff880ef6915710 R12: 0000000000000010
[19156101.619574] R13: ffff880ed152aef8 R14: ffff8800bba18aa0 R15: ffff880ed1513a40
[19156101.620785] FS: 000000c42085bc90(0000) GS:ffff880fff540000(0000) knlGS:0000000000000000
[19156101.622074] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19156101.622921] CR2: 00007f508b166b04 CR3: 0000000e0981b000 CR4: 00000000003406e0
[19156101.624062] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[19156101.625210] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[19156101.626349] Stack:
[19156101.626765] ffff880ed152aef8 ffff880ef6915710 ffff880eb1f79e00 ffffffff8210ad50
[19156101.628018] ffff880ef6915700 0000000000000000 ffff880eb1f7a4a0 ffff8809131a7eb0
[19156101.629305] ffffffff812126ee ffff8809131a7ef0 ffffffff8109f101 ffff880eb1f7a4d4
[19156101.630568] Call Trace:
[19156101.631036] [<ffffffff812126ee>] ____fput+0xe/0x10
[19156101.631804] [<ffffffff8109f101>] task_work_run+0x81/0xa0
[19156101.632612] [<ffffffff81003242>] exit_to_usermode_loop+0xc2/0xd0
[19156101.633497] [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[19156101.634402] [<ffffffff818449d0>] int_ret_from_sys_call+0x25/0x8f
[19156101.635285] Code: 0f 84 cf fe ff ff 48 8b 43 28 48 8b 80 80 00 00 00 48 85 c0 0f 84 bb fe ff ff 31 d2 48 89 de bf ff ff ff ff ff d0 e9 aa fe ff ff <0f> 0b 90 0f 1f 44 00 00 31 ff 48 87 3d 8a 6e fc 00 48 85 ff 74
[19156101.639244] RIP [<ffffffff812126ad>] __fput+0x21d/0x220
[19156101.640049] RSP <ffff8809131a7e68>

Looking up the line reported above, fs.h:2582, in the source code, we see the panic is due to a BUG_ON:

static void __fput(struct file *file)
{
        ....
        if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
                i_readcount_dec(inode);
        ....
}

static inline void i_readcount_dec(struct inode *inode)
{
        BUG_ON(!atomic_read(&inode->i_readcount));
        atomic_dec(&inode->i_readcount);
}

And the corresponding disassembly for the panic location:

crash> dis -r __fput+541
...
0xffffffff812126a1 <__fput+529>: mov $0xffffffff,%edi
0xffffffff812126a6 <__fput+534>: callq *%rax
0xffffffff812126a8 <__fput+536>: jmpq 0xffffffff81212557 <__fput+199>
0xffffffff812126ad <__fput+541>: ud2
crash>

We jumped to the ud2 that caused the panic. Where did we jump from?

crash> dis __fput | grep __fput+541
0xffffffff81212638 <__fput+424>: je 0xffffffff812126ad <__fput+541>
0xffffffff812126ad <__fput+541>: ud2
crash>

And the assembly before the je:

crash> dis __fput | grep __fput+541 -B3
0xffffffff8121262e <__fput+414>: retq
0xffffffff8121262f <__fput+415>: mov 0x154(%r13),%eax
0xffffffff81212636 <__fput+422>: test %eax,%eax
0xffffffff81212638 <__fput+424>: je 0xffffffff812126ad <__fput+541>

Above, r13 is likely the inode, so 0x154(%r13) is inode.i_readcount:

crash> struct inode.i_readcount -xo
struct inode {
  [0x154] atomic_t i_readcount;
}
crash>

r13 is ffff880ed152aef8, and the value of inode.i_readcount read from the dump at that address is 111:

crash> bt | grep R13
    R13: ffff880ed152aef8  R14: ffff8800bba18aa0  R15: ffff880ed1513a40
crash> inode.i_readcount.counter ffff880ed152aef8
  i_readcount.counter = 111
crash>

inode.i_readcount.counter is not equal to 0, so why did the BUG_ON trigger?

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1653498

Title: Server reboots every 4.1 weeks

Status in linux package in Ubuntu: Confirmed
Status in linux source package in Xenial: Confirmed

Bug description:
Every 4.1 weeks of uptime (more or less), our 34 servers reboot with the logs below.

Description: Ubuntu 16.04.1 LTS
Release: 16.04

The servers host the stack flanneld (0.5.5), docker (1.11.2, build b9f10c9), kubernetes (v1.3.6), plus etcd (2.3.7).

Jan 02 06:40:32 prd-node021 kernel: ------------[ cut here ]------------
Jan 02 06:40:32 prd-node021 kernel: kernel BUG at /build/linux-xHzv4a/linux-4.4.0/include/linux/fs.h:2569!
Jan 02 06:40:32 prd-node021 kernel: invalid opcode: 0000 [#1] SMP
Jan 02 06:40:32 prd-node021 kernel: Modules linked in: nf_conntrack_netlink nfnetlink veth xt_statistic xt_nat xt_recent ipt_REJECT nf_reject_ipv4 xt_tcpudp tcp_diag inet_diag
Jan 02 06:40:32 prd-node021 kernel: raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc
Jan 02 06:40:32 prd-node021 kernel: CPU: 46 PID: 22749 Comm: iptables-restor Not tainted 4.4.0-47-generic #68-Ubuntu
Jan 02 06:40:32 prd-node021 kernel: Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 09/13/2016
Jan 02 06:40:32 prd-node021 kernel: task: ffff882f7ccb44c0 ti: ffff882fcb810000 task.ti: ffff882fcb810000
Jan 02 06:40:32 prd-node021 kernel: RIP: 0010:[<ffffffff8120f9ed>] [<ffffffff8120f9ed>] __fput+0x21d/0x220
Jan 02 06:40:32 prd-node021 kernel: RSP: 0018:ffff882fcb813e68 EFLAGS: 00010246
Jan 02 06:40:32 prd-node021 kernel: RAX: 0000000000000000 RBX: ffff8829dbd92700 RCX: 00000000308c6396
Jan 02 06:40:32 prd-node021 kernel: RDX: 0000000000000001 RSI: ffff88301f299f60 RDI: 0000000000000000
Jan 02 06:40:32 prd-node021 kernel: RBP: ffff882fcb813ea0 R08: 0000000000019f60 R09: ffffffff811b3a1d
Jan 02 06:40:32 prd-node021 kernel: R10: ffffea0064756680 R11: ffff8829dbd92710 R12: 0000000000000010
Jan 02 06:40:32 prd-node021 kernel: R13: ffff8817d3520518 R14: ffff881021a34da0 R15: ffff8817d3542540
Jan 02 06:40:32 prd-node021 kernel: FS: 00007fcdaa753700(0000) GS:ffff88301f280000(0000) knlGS:0000000000000000
Jan 02 06:40:32 prd-node021 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 02 06:40:32 prd-node021 kernel: CR2: 00007fcdaa758000 CR3: 0000001917993000 CR4: 00000000003406e0
Jan 02 06:40:32 prd-node021 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 02 06:40:32 prd-node021 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jan 02 06:40:32 prd-node021 kernel: Stack:
Jan 02 06:40:32 prd-node021 kernel: ffff8817d3520518 ffff8829dbd92710 ffff882f7ccb44c0 ffffffff82103a30
Jan 02 06:40:32 prd-node021 kernel: ffff8829dbd92700 0000000000000000 ffff882f7ccb4b38 ffff882fcb813eb0
Jan 02 06:40:32 prd-node021 kernel: ffffffff8120fa2e ffff882fcb813ef0 ffffffff8109ee01 ffff882f7ccb4b6c
Jan 02 06:40:32 prd-node021 kernel: Call Trace:
Jan 02 06:40:32 prd-node021 kernel: [<ffffffff8120fa2e>] ____fput+0xe/0x10
Jan 02 06:40:32 prd-node021 kernel: [<ffffffff8109ee01>] task_work_run+0x81/0xa0
Jan 02 06:40:32 prd-node021 kernel: [<ffffffff81003242>] exit_to_usermode_loop+0xc2/0xd0
Jan 02 06:40:32 prd-node021 kernel: [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
Jan 02 06:40:32 prd-node021 kernel: [<ffffffff81835150>] int_ret_from_sys_call+0x25/0x8f
Jan 02 06:40:32 prd-node021 kernel: Code: 0f 84 cf fe ff ff 48 8b 43 28 48 8b 80 80 00 00 00 48 85 c0 0f 84 bb fe ff ff 31 d2 48 89 de bf ff ff ff ff ff d0 e9 aa fe ff ff <0f>
Jan 02 06:40:32 prd-node021 kernel: RIP [<ffffffff8120f9ed>] __fput+0x21d/0x220
-- Reboot --
Jan 02 06:42:56 prd-node021 systemd-journald[819]: Runtime journal (/run/log/journal/) is 8.0M, max 1.8G, 1.8G free.
Jan 02 06:42:56 prd-node021 kernel: Initializing cgroup subsys cpuset

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-47-generic 4.4.0-47.68
ProcVersionSignature: Ubuntu 4.4.0-47.68-generic 4.4.24
Uname: Linux 4.4.0-47-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jan 2 06:42 seq
 crw-rw---- 1 root audio 116, 33 Jan 2 06:42 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.1
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Mon Jan 2 10:49:15 2017
HibernationDevice: RESUME=/dev/mapper/vg00-swap
InstallationDate: Installed on 2016-09-13 (110 days ago)
InstallationMedia: Ubuntu-Server 16.04.1 LTS "Xenial Xerus" - Beta amd64 (20160803)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
MachineType: HP ProLiant DL360 Gen9
PciMultimedia:
ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.4.0-47-generic root=/dev/mapper/vg00-root ro cgroup_enable=memory swapaccount=1
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-47-generic N/A
 linux-backports-modules-4.4.0-47-generic N/A
 linux-firmware 1.157.5
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 09/13/2016
dmi.bios.vendor: HP
dmi.bios.version: P89
dmi.board.name: ProLiant DL360 Gen9
dmi.board.vendor: HP
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrP89:bd09/13/2016:svnHP:pnProLiantDL360Gen9:pvr:rvnHP:rnProLiantDL360Gen9:rvr:cvnHP:ct23:cvr:
dmi.product.name: ProLiant DL360 Gen9
dmi.sys.vendor: HP
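
For anyone repeating the analysis from the 4.4.0-104 comment on another dump, here is a rough Python sketch of the same register arithmetic. It takes the register lines and the tail of the Code: line from that oops, computes where i_readcount should live (R13 + 0x154), and shows what was in eax at the moment of the trap. The helper names (parse_registers, trapping_bytes) are made up for this example and are not part of crash or any kernel tooling. The 0x154 offset is taken from the "struct inode.i_readcount -xo" output for that kernel and will likely differ on other builds.

```python
import re

# Register lines copied from the 4.4.0-104 oops quoted in this thread.
OOPS_REGS = """\
RAX: 0000000000000000 RBX: ffff880ef6915700 RCX: 0000000365fb1705
RDX: 0000000000000001 RSI: ffff880fff55a020 RDI: 0000000000000000
RBP: ffff8809131a7ea0 R08: 000000000001a020 R09: ffffffff811b591d
R10: ffffea002b69b300 R11: ffff880ef6915710 R12: 0000000000000010
R13: ffff880ed152aef8 R14: ffff8800bba18aa0 R15: ffff880ed1513a40
"""

# Tail of the "Code:" line from the same oops; <..> marks the byte at RIP.
CODE_LINE = "Code: ff d0 e9 aa fe ff ff <0f> 0b 90 0f 1f 44 00 00"

# Offset reported by `struct inode.i_readcount -xo` for this kernel build
# (assumption: only valid for this exact kernel).
I_READCOUNT_OFFSET = 0x154


def parse_registers(text):
    """Return {register name: value} for every 'NAME: 16-hex-digits' pair."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}):\s+([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def trapping_bytes(code_line):
    """Return the byte marked with <..> plus the byte that follows it."""
    tokens = code_line.split()
    idx = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return [tokens[idx].strip("<>")] + tokens[idx + 1:idx + 2]


regs = parse_registers(OOPS_REGS)
readcount_addr = regs["R13"] + I_READCOUNT_OFFSET
eax_at_trap = regs["RAX"] & 0xFFFFFFFF  # low 32 bits of RAX

print(f"i_readcount address: {readcount_addr:#x}")
print(f"eax at trap time:    {eax_at_trap}")
print(f"bytes at RIP:        {' '.join(trapping_bytes(CODE_LINE))}")
```

The bytes at RIP are 0f 0b, which is ud2, matching the disassembly. One thing that stands out: RAX in the exception frame is 0, and nothing between "mov 0x154(%r13),%eax" and the ud2 writes to eax, so the CPU may well have read 0 from i_readcount when BUG_ON fired. If so, the 111 seen in the vmcore could reflect changes to the counter made after that read and before the dump was taken. That is only a reading of the register dump, not a confirmed root cause.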