Before trying the upstream kernel, I tried to replicate the issue. After
noticing that it happened every time there was heavy file I/O, I was
able to reproduce it at will by running apps that do a lot of file
I/O. I was also monitoring free memory every second to understand why
the kernel was invoking oom-killer and killing random applications. When
oom-killer started to kill applications, the memory looked like
this:

Every 1.0s: free -h         gorilla: Sat Jan 14 09:52:01 2017
              total        used        free      shared  buff/cache   available
Mem:           5.9G        755M        127M         17M        5.1G        4.6G
Swap:          2.0G          0B        2.0G
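
For anyone who wants to log the same numbers instead of watching them, here is a minimal sketch that reads them from /proc/meminfo. It assumes a kernel that exposes MemAvailable (3.14 or later). The function name and the optional file argument are my own illustration, not something from the original setup.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original report): print selected
# /proc/meminfo fields, in kB, on a single line.
# $1 optionally overrides the file to read, which makes it easy to try
# the function against a saved copy of meminfo.
meminfo_snapshot() {
    awk '
        /^MemFree:/      { free = $2 }
        /^MemAvailable:/ { avail = $2 }
        /^Cached:/       { cached = $2 }   # anchored, so SwapCached is not matched
        END { printf "free=%s available=%s cached=%s\n", free, avail, cached }
    ' "${1:-/proc/meminfo}"
}
```

Calling meminfo_snapshot from a "while sleep 1" loop and appending the output to a file would give a per-second history to compare against the oom-killer timestamps in syslog.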

As you can see, there is a lot of available memory (mostly in cache, and I am
very sure most of it is clean cache), but for some reason it was not reclaimed
by the kernel (kswapd0?). So I decided to run "echo 3 > /proc/sys/vm/drop_caches"
frequently to force the cache to be dropped, and sure enough everything worked
fine. I have not seen this problem in the last 2+ days.
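
The workaround can be wrapped in a small script along these lines. This is only a sketch: the function names, the 60-second default interval and the optional target argument are assumptions for illustration, not the exact setup used here. Writing to the real file requires root.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original report): flush dirty
# pages, then ask the kernel to drop clean caches.
# $1 optionally overrides the target file so the function can be tried
# without root; the default is the real sysctl file.
drop_caches() {
    target="${1:-/proc/sys/vm/drop_caches}"
    sync                  # dirty pages are not freeable, so write them back first
    echo 3 > "$target"    # 1 = page cache, 2 = dentries and inodes, 3 = both
}

# Repeat the workaround every $1 seconds (default 60, an arbitrary choice).
drop_caches_every() {
    interval="${1:-60}"
    while sleep "$interval"; do
        drop_caches
    done
}
```

Running drop_caches_every as root in the background (or calling drop_caches from a cron job) repeats the manual "echo 3" step. It only hides the reclaim problem and costs I/O performance, since the cache has to be refilled each time.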
 
root@gorilla:~# cat /var/log/syslog|egrep "NMI watchdog: BUG: soft lockup|oom-killer"
root@gorilla:~# uptime
 07:37:29 up 2 days, 19:34,  1 user,  load average: 1.63, 0.77, 0.29
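
The same check can be turned into counts, which makes it easier to compare runs with and without the workaround. A minimal sketch, assuming the log path shown above; the function name and output format are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original report): count soft-lockup
# and oom-killer lines in a log file.
# $1 optionally overrides the log path (default /var/log/syslog).
count_events() {
    awk '
        /NMI watchdog: BUG: soft lockup/ { lockups++ }
        /invoked oom-killer/             { ooms++ }
        END { printf "soft_lockups=%d oom_invocations=%d\n", lockups, ooms }
    ' "${1:-/var/log/syslog}"
}
```

On a log with no matching lines this prints zero for both counters, which corresponds to the empty egrep output above.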

Now that I suspect this may be a bug in kswapd0, I searched here for
similar kswapd0 issues and found one (see below). I am not sure it is
the same problem, though the symptoms and workaround are the same.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457

The end of that report (comment #142) says they have no problem with the
4.4.0-45 kernel, but the Yakkety-based 4.8+ kernel has this problem.
Assuming this is the same issue, I can confirm the same, as I never had
this problem before upgrading to Yakkety. I am wondering if the bug made
its way back in since that fix. Since I have a workaround, I am going to
continue with it; it is not ideal but seems to hold. The last note on
the above report says the bug is fixed and any new problem should be
opened as a new bug. Can this report be treated as a new bug to address
this problem?

Thanks
 


** Changed in: linux (Ubuntu)
       Status: Incomplete => Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1655356

Title:
  NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kswapd0:50];
  oom-killer; and eventual kernel panic on 16.10 (upgrade from 16.04)

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  I have a Dell (PowerEdge T110/0V52N7, BIOS 1.6.4 03/02/2011) that was
  running Ubuntu 16.04 for a while. Ever since I upgraded to 16.10, this
  problem started with errors, OOM and an eventual kernel panic. It can
  run fine for about 3-4 hours or so. I see the following errors in
  syslog (also attached with other logs and information I could gather).

  Jan  9 07:36:32 gorilla kernel: [69304.099302] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kswapd0:50]
  Jan  9 07:37:00 gorilla kernel: [69332.119587] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kswapd0:50]
  Jan  9 07:37:33 gorilla kernel: [69364.114705] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kswapd0:50]
  Jan  9 07:38:01 gorilla kernel: [69392.127352] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kswapd0:50]
  Jan  9 07:38:37 gorilla kernel: [69428.134132] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kswapd0:50]
  Jan  9 07:39:45 gorilla kernel: [69496.112694] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kswapd0:50]
  Jan  9 07:40:13 gorilla kernel: [69524.112050] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kswapd0:50]
  Jan  9 07:40:49 gorilla kernel: [69560.104511] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kswapd0:50]
  Jan  9 07:41:17 gorilla kernel: [69588.107302] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kswapd0:50]
  Jan  9 07:41:45 gorilla kernel: [69616.104843] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kswapd0:50]

  Jan  8 11:52:27 gorilla kernel: [ 2852.818471] rsync invoked oom-killer: gfp_mask=0x26000d0(GFP_TEMPORARY|__GFP_NOTRACK), order=0, oom_score_adj=0
  Jan  9 07:38:56 gorilla kernel: [69448.096571] kthreadd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
  Jan  9 07:39:46 gorilla kernel: [69497.705922] apache2 invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
  Jan  9 07:40:50 gorilla kernel: [69561.956773] sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
  Jan  9 07:41:10 gorilla kernel: [69582.329364] rsync invoked oom-killer: gfp_mask=0x26000d0(GFP_TEMPORARY|__GFP_NOTRACK), order=0, oom_score_adj=0
  Jan  9 07:42:40 gorilla kernel: [69672.181041] sessionclean invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
  Jan  9 07:42:41 gorilla kernel: [69673.298714] apache2 invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
  Jan  9 07:42:59 gorilla kernel: [69691.320169] apache2 invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
  Jan  9 07:43:03 gorilla kernel: [69694.769140] sessionclean invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
  Jan  9 07:43:20 gorilla kernel: [69712.255535] kthreadd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0

  
  Jan  8 11:46:11 gorilla kernel: [ 2476.342532] perf: interrupt took too long (2512 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
  Jan  8 11:49:04 gorilla kernel: [ 2650.045417] perf: interrupt took too long (3147 > 3140), lowering kernel.perf_event_max_sample_rate to 63500
  Jan  8 11:49:56 gorilla kernel: [ 2701.973751] perf: interrupt took too long (3982 > 3933), lowering kernel.perf_event_max_sample_rate to 50000
  Jan  8 11:51:47 gorilla kernel: [ 2812.208307] perf: interrupt took too long (4980 > 4977), lowering kernel.perf_event_max_sample_rate to 40000
  Jan  8 13:56:06 gorilla kernel: [ 5678.539070] perf: interrupt took too long (2513 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
  Jan  8 15:59:49 gorilla kernel: [13101.158417] perf: interrupt took too long (3148 > 3141), lowering kernel.perf_event_max_sample_rate to 63500
  Jan  9 02:15:54 gorilla kernel: [50065.939132] perf: interrupt took too long (3942 > 3935), lowering kernel.perf_event_max_sample_rate to 50500
  Jan  9 07:35:30 gorilla kernel: [69241.742219] perf: interrupt took too long (4932 > 4927), lowering kernel.perf_event_max_sample_rate to 40500
  Jan  9 07:35:54 gorilla kernel: [69265.928531] perf: interrupt took too long (6170 > 6165), lowering kernel.perf_event_max_sample_rate to 32250
  Jan  9 07:36:53 gorilla kernel: [69325.386696] perf: interrupt took too long (7723 > 7712), lowering kernel.perf_event_max_sample_rate to 25750

  
  Just to make sure this is not memory related, I ran memtest for 12 passes
  overnight and found no memory errors. I removed the external backup drives
  to isolate the problem. I checked similar issues on launchpad.net, but most
  of them are related to video drivers and power supplies.

  Appreciate help.

  Thanks
  -Arul

  Attachments:
  ------------

  syslog
  uname.txt
  swaps.txt
  dmesg.txt
  df.txt
  lspci.txt
  lsusb.txt
  meminfo.txt
  cpuinfo.txt
  --- 
  ApportVersion: 2.20.3-0ubuntu8.2
  Architecture: i386
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  DistroRelease: Ubuntu 16.10
  HibernationDevice: RESUME=UUID=f7ae4452-47ac-43eb-992d-8f5fc8e26f93
  InstallationDate: Installed on 2011-08-13 (1977 days ago)
  InstallationMedia: Ubuntu-Server 11.04 "Natty Narwhal" - Release i386 (20110426)
  IwConfig: Error: [Errno 2] No such file or directory
  MachineType: Dell Inc. PowerEdge T110
  Package: linux (not installed)
  ProcEnviron:
   TERM=xterm-256color
   PATH=(custom, no user)
   XDG_RUNTIME_DIR=<set>
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcFB: 0 VESA VGA
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.8.0-32-generic root=UUID=3dcb6aa5-7006-4707-9985-1125faa68aca ro quiet
  ProcVersionSignature: Ubuntu 4.8.0-32.34-generic 4.8.11
  PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
  RelatedPackageVersions:
   linux-restricted-modules-4.8.0-32-generic N/A
   linux-backports-modules-4.8.0-32-generic  N/A
   linux-firmware                            1.161.1
  RfKill: Error: [Errno 2] No such file or directory
  Tags:  yakkety
  Uname: Linux 4.8.0-32-generic i686
  UnreportableReason: The report belongs to a package that is not installed.
  UpgradeStatus: Upgraded to yakkety on 2017-01-01 (9 days ago)
  UserGroups: fuse
  _MarkForUpload: False
  dmi.bios.date: 03/02/2011
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 1.6.4
  dmi.board.name: 0V52N7
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A02
  dmi.chassis.type: 17
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr1.6.4:bd03/02/2011:svnDellInc.:pnPowerEdgeT110:pvr:rvnDellInc.:rn0V52N7:rvrA02:cvnDellInc.:ct17:cvr:
  dmi.product.name: PowerEdge T110
  dmi.sys.vendor: Dell Inc.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1655356/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
