Hi,

We have the same issue on our k8s cluster.

Description: ubuntu16.04.1 LTSx86_64
Release: 16.04.1

Kernel: 4.4.0-104-generic

the dump file can be downloaded via following way:
wget http://129.226.115.161/dump.202006231820.tar.gz


I did some analysis, however i still didnot find the root cause:

Load the vmcore in crash (please refer to the hyperlink above). Crash should 
present details similar to the following:
crash> bt
PID: 11388  TASK: ffff880eb1f79e00  CPU: 29  COMMAND: "heartbeat"
 #0 [ffff8809131a7b08] machine_kexec at ffffffff8105c22b
 #1 [ffff8809131a7b68] crash_kexec at ffffffff8110e852
 #2 [ffff8809131a7c38] oops_end at ffffffff81031c49
 #3 [ffff8809131a7c60] die at ffffffff810320fb
 #4 [ffff8809131a7c90] do_trap at ffffffff8102f121
 #5 [ffff8809131a7ce0] do_error_trap at ffffffff8102f4a9
 #6 [ffff8809131a7da0] do_invalid_op at ffffffff8102fa10
 #7 [ffff8809131a7db0] invalid_op at ffffffff8184638e
    [exception RIP: __fput+541]
    RIP: ffffffff812126ad  RSP: ffff8809131a7e68  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff880ef6915700  RCX: 0000000365fb1705
    RDX: 0000000000000001  RSI: ffff880fff55a020  RDI: 0000000000000000
    RBP: ffff8809131a7ea0   R8: 000000000001a020   R9: ffffffff811b591d
    R10: ffffea002b69b300  R11: ffff880ef6915710  R12: 0000000000000010
    R13: ffff880ed152aef8  R14: ffff8800bba18aa0  R15: ffff880ed1513a40
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff8809131a7e60] __fput at ffffffff812125ac
 #9 [ffff8809131a7ea8] ____fput at ffffffff812126ee
#10 [ffff8809131a7eb8] task_work_run at ffffffff8109f101
#11 [ffff8809131a7ef8] exit_to_usermode_loop at ffffffff81003242
#12 [ffff8809131a7f30] syscall_return_slowpath at ffffffff81003c6e
#13 [ffff8809131a7f50] int_ret_from_sys_call at ffffffff818449d0
    RIP: 000000000047f704  RSP: 000000c423b77c98  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: 000000000047f704
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 00000000000000ca
    RBP: 000000c423b77ce0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000000
    R13: 0000000000000000  R14: 000000c423b78ee0  R15: 0000000000000008
    ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b
crash>

crash> log
[19156101.592212] ------------[ cut here ]------------
[19156101.593103] kernel BUG at 
/build/linux-SwhOyu/linux-4.4.0/include/linux/fs.h:2582!
[19156101.594385] invalid opcode: 0000 [#1] SMP
[19156101.595083] Modules linked in: binfmt_misc af_packet_diag netlink_diag 
dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag veth br_netfilter 
ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_set xt_mark ip_set_hash_ipportnet 
ip_set_hash_ipportip ip_set_bitmap_port ip_set_hash_ipport ip_set dummy 
xt_comment xt_addrtype iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 
ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_tcpudp bridge stp llc nf_conntrack_netlink 
nfnetlink xfrm_user xfrm_algo iptable_filter ip_tables xt_conntrack x_tables 
nf_nat nf_conntrack aufs isofs ppdev crct10dif_pclmul parport_pc crc32_pclmul 
input_leds joydev ghash_clmulni_intel parport serio_raw ib_iser rdma_cm iw_cm 
ib_cm ib_sa ib_mad ib_core ib_addr sunrpc iscsi_tcp libiscsi_tcp libiscsi 
scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov
[19156101.606434]  async_memcpy async_pq async_xor async_tx xor raid6_pq 
libcrc32c raid1 raid0 multipath linear aesni_intel aes_x86_64 glue_helper lrw 
gf128mul ablk_helper cryptd psmouse floppy
[19156101.609129] CPU: 29 PID: 11388 Comm: heartbeat Not tainted 
4.4.0-104-generic #127-Ubuntu
[19156101.610384] Hardware name: Smdbmds KVM, BIOS 
seabios-1.9.1-qemu-project.org 04/01/2014
[19156101.611637] task: ffff880eb1f79e00 ti: ffff8809131a4000 task.ti: 
ffff8809131a4000
[19156101.612905] RIP: 0010:[<ffffffff812126ad>]  [<ffffffff812126ad>] 
__fput+0x21d/0x220
[19156101.614188] RSP: 0018:ffff8809131a7e68  EFLAGS: 00010246
[19156101.614989] RAX: 0000000000000000 RBX: ffff880ef6915700 RCX: 
0000000365fb1705
[19156101.616143] RDX: 0000000000000001 RSI: ffff880fff55a020 RDI: 
0000000000000000
[19156101.617285] RBP: ffff8809131a7ea0 R08: 000000000001a020 R09: 
ffffffff811b591d
[19156101.618422] R10: ffffea002b69b300 R11: ffff880ef6915710 R12: 
0000000000000010
[19156101.619574] R13: ffff880ed152aef8 R14: ffff8800bba18aa0 R15: 
ffff880ed1513a40
[19156101.620785] FS:  000000c42085bc90(0000) GS:ffff880fff540000(0000) 
knlGS:0000000000000000
[19156101.622074] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19156101.622921] CR2: 00007f508b166b04 CR3: 0000000e0981b000 CR4: 
00000000003406e0
[19156101.624062] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[19156101.625210] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[19156101.626349] Stack:
[19156101.626765]  ffff880ed152aef8 ffff880ef6915710 ffff880eb1f79e00 
ffffffff8210ad50
[19156101.628018]  ffff880ef6915700 0000000000000000 ffff880eb1f7a4a0 
ffff8809131a7eb0
[19156101.629305]  ffffffff812126ee ffff8809131a7ef0 ffffffff8109f101 
ffff880eb1f7a4d4
[19156101.630568] Call Trace:
[19156101.631036]  [<ffffffff812126ee>] ____fput+0xe/0x10
[19156101.631804]  [<ffffffff8109f101>] task_work_run+0x81/0xa0
[19156101.632612]  [<ffffffff81003242>] exit_to_usermode_loop+0xc2/0xd0
[19156101.633497]  [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[19156101.634402]  [<ffffffff818449d0>] int_ret_from_sys_call+0x25/0x8f
[19156101.635285] Code: 0f 84 cf fe ff ff 48 8b 43 28 48 8b 80 80 00 00 00 48 
85 c0 0f 84 bb fe ff ff 31 d2 48 89 de bf ff ff ff ff ff d0 e9 aa fe ff ff <0f> 
0b 90 0f 1f 44 00 00 31 ff 48 87 3d 8a 6e fc 00 48 85 ff 74
[19156101.639244] RIP  [<ffffffff812126ad>] __fput+0x21d/0x220
[19156101.640049]  RSP <ffff8809131a7e68>


Referencing the line above in the source code, fs.h:2582!, we see the panic is 
due to a BUG_ON:

static void __fput(struct file *file)
{
    ....   
        if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
                i_readcount_dec(inode);//
    ....   
}


static inline void i_readcount_dec(struct inode *inode)
{
        BUG_ON(!atomic_read(&inode->i_readcount));
        atomic_dec(&inode->i_readcount);
}


And the corresponding dissasembly for the panic location:

crash> dis -r __fput+541
...
0xffffffff812126a1 <__fput+529>:        mov    $0xffffffff,%edi
0xffffffff812126a6 <__fput+534>:        callq  *%rax
0xffffffff812126a8 <__fput+536>:        jmpq   0xffffffff81212557 <__fput+199>
0xffffffff812126ad <__fput+541>:        ud2
crash>

Jumped to the ud2 that caused the panic. Where did we jump from?

crash> dis  __fput | grep __fput+541
0xffffffff81212638 <__fput+424>:        je     0xffffffff812126ad <__fput+541>
0xffffffff812126ad <__fput+541>:        ud2
crash>


And the assembly before the je:

crash> dis  __fput | grep __fput+541 -B3
0xffffffff8121262e <__fput+414>:        retq
0xffffffff8121262f <__fput+415>:        mov    0x154(%r13),%eax
0xffffffff81212636 <__fput+422>:        test   %eax,%eax
0xffffffff81212638 <__fput+424>:        je     0xffffffff812126ad <__fput+541>


Above r13 is likely the inode, so the 0x154(%r13) is inode.i_readcount:

crash> struct inode.i_readcount -xo
struct inode {
  [0x154] atomic_t i_readcount;
}
crash>



The r13 is  ffff880ed152aef8, so get the value of inode.i_readcount is
111:

crash> bt | grep R13
    R13: ffff880ed152aef8  R14: ffff8800bba18aa0  R15: ffff880ed1513a40


crash> inode.i_readcount.counter ffff880ed152aef8
  i_readcount.counter = 111
crash>


the inode.i_readcount.counter is not equal 0, why call BUG_ON?

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1653498

Title:
  Server reboots every 4.1 weeks

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  Confirmed

Bug description:
  Every 4.1 uptime weeks (more or less), our 34 servers reboots with the
  logs below.

  Description:    Ubuntu 16.04.1 LTS
  Release:        16.04

  The servers hosts the stack flanneld (0.5.5) docker (1.11.2, build
  b9f10c9) kubernetes (v1.3.6) plus etcd (2.3.7)


  Jan 02 06:40:32 prd-node021 kernel: ------------[ cut here ]------------
  Jan 02 06:40:32 prd-node021 kernel: kernel BUG at 
/build/linux-xHzv4a/linux-4.4.0/include/linux/fs.h:2569!
  Jan 02 06:40:32 prd-node021 kernel: invalid opcode: 0000 [#1] SMP
  Jan 02 06:40:32 prd-node021 kernel: Modules linked in: nf_conntrack_netlink 
nfnetlink veth xt_statistic xt_nat xt_recent ipt_REJECT nf_reject_ipv4 
xt_tcpudp tcp_diag inet_diag
  Jan 02 06:40:32 prd-node021 kernel:  raid456 async_raid6_recov async_memcpy 
async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 
crct10dif_pclmul crc
  Jan 02 06:40:32 prd-node021 kernel: CPU: 46 PID: 22749 Comm: iptables-restor 
Not tainted 4.4.0-47-generic #68-Ubuntu
  Jan 02 06:40:32 prd-node021 kernel: Hardware name: HP ProLiant DL360 
Gen9/ProLiant DL360 Gen9, BIOS P89 09/13/2016
  Jan 02 06:40:32 prd-node021 kernel: task: ffff882f7ccb44c0 ti: 
ffff882fcb810000 task.ti: ffff882fcb810000
  Jan 02 06:40:32 prd-node021 kernel: RIP: 0010:[<ffffffff8120f9ed>]  
[<ffffffff8120f9ed>] __fput+0x21d/0x220
  Jan 02 06:40:32 prd-node021 kernel: RSP: 0018:ffff882fcb813e68  EFLAGS: 
00010246
  Jan 02 06:40:32 prd-node021 kernel: RAX: 0000000000000000 RBX: 
ffff8829dbd92700 RCX: 00000000308c6396
  Jan 02 06:40:32 prd-node021 kernel: RDX: 0000000000000001 RSI: 
ffff88301f299f60 RDI: 0000000000000000
  Jan 02 06:40:32 prd-node021 kernel: RBP: ffff882fcb813ea0 R08: 
0000000000019f60 R09: ffffffff811b3a1d
  Jan 02 06:40:32 prd-node021 kernel: R10: ffffea0064756680 R11: 
ffff8829dbd92710 R12: 0000000000000010
  Jan 02 06:40:32 prd-node021 kernel: R13: ffff8817d3520518 R14: 
ffff881021a34da0 R15: ffff8817d3542540
  Jan 02 06:40:32 prd-node021 kernel: FS:  00007fcdaa753700(0000) 
GS:ffff88301f280000(0000) knlGS:0000000000000000
  Jan 02 06:40:32 prd-node021 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
  Jan 02 06:40:32 prd-node021 kernel: CR2: 00007fcdaa758000 CR3: 
0000001917993000 CR4: 00000000003406e0
  Jan 02 06:40:32 prd-node021 kernel: DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
  Jan 02 06:40:32 prd-node021 kernel: DR3: 0000000000000000 DR6: 
00000000fffe0ff0 DR7: 0000000000000400
  Jan 02 06:40:32 prd-node021 kernel: Stack:
  Jan 02 06:40:32 prd-node021 kernel:  ffff8817d3520518 ffff8829dbd92710 
ffff882f7ccb44c0 ffffffff82103a30
  Jan 02 06:40:32 prd-node021 kernel:  ffff8829dbd92700 0000000000000000 
ffff882f7ccb4b38 ffff882fcb813eb0
  Jan 02 06:40:32 prd-node021 kernel:  ffffffff8120fa2e ffff882fcb813ef0 
ffffffff8109ee01 ffff882f7ccb4b6c
  Jan 02 06:40:32 prd-node021 kernel: Call Trace:
  Jan 02 06:40:32 prd-node021 kernel:  [<ffffffff8120fa2e>] ____fput+0xe/0x10
  Jan 02 06:40:32 prd-node021 kernel:  [<ffffffff8109ee01>] 
task_work_run+0x81/0xa0
  Jan 02 06:40:32 prd-node021 kernel:  [<ffffffff81003242>] 
exit_to_usermode_loop+0xc2/0xd0
  Jan 02 06:40:32 prd-node021 kernel:  [<ffffffff81003c6e>] 
syscall_return_slowpath+0x4e/0x60
  Jan 02 06:40:32 prd-node021 kernel:  [<ffffffff81835150>] 
int_ret_from_sys_call+0x25/0x8f
  Jan 02 06:40:32 prd-node021 kernel: Code: 0f 84 cf fe ff ff 48 8b 43 28 48 8b 
80 80 00 00 00 48 85 c0 0f 84 bb fe ff ff 31 d2 48 89 de bf ff ff ff ff ff d0 
e9 aa fe ff ff <0f>
  Jan 02 06:40:32 prd-node021 kernel: RIP  [<ffffffff8120f9ed>] 
__fput+0x21d/0x220
  -- Reboot --
  Jan 02 06:42:56 prd-node021 systemd-journald[819]: Runtime journal 
(/run/log/journal/) is 8.0M, max 1.8G, 1.8G free.
  Jan 02 06:42:56 prd-node021 kernel: Initializing cgroup subsys cpuset

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: linux-image-4.4.0-47-generic 4.4.0-47.68
  ProcVersionSignature: Ubuntu 4.4.0-47.68-generic 4.4.24
  Uname: Linux 4.4.0-47-generic x86_64
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Jan  2 06:42 seq
   crw-rw---- 1 root audio 116, 33 Jan  2 06:42 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.1-0ubuntu2.1
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  Date: Mon Jan  2 10:49:15 2017
  HibernationDevice: RESUME=/dev/mapper/vg00-swap
  InstallationDate: Installed on 2016-09-13 (110 days ago)
  InstallationMedia: Ubuntu-Server 16.04.1 LTS "Xenial Xerus" - Beta amd64 
(20160803)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: HP ProLiant DL360 Gen9
  PciMultimedia:
   
  ProcFB: 0 VESA VGA
  ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.4.0-47-generic 
root=/dev/mapper/vg00-root ro cgroup_enable=memory swapaccount=1
  RelatedPackageVersions:
   linux-restricted-modules-4.4.0-47-generic N/A
   linux-backports-modules-4.4.0-47-generic  N/A
   linux-firmware                            1.157.5
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  SourcePackage: linux
  UpgradeStatus: No upgrade log present (probably fresh install)
  dmi.bios.date: 09/13/2016
  dmi.bios.vendor: HP
  dmi.bios.version: P89
  dmi.board.name: ProLiant DL360 Gen9
  dmi.board.vendor: HP
  dmi.chassis.type: 23
  dmi.chassis.vendor: HP
  dmi.modalias: 
dmi:bvnHP:bvrP89:bd09/13/2016:svnHP:pnProLiantDL360Gen9:pvr:rvnHP:rnProLiantDL360Gen9:rvr:cvnHP:ct23:cvr:
  dmi.product.name: ProLiant DL360 Gen9
  dmi.sys.vendor: HP

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1653498/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to