FYI, the "4.13.0-38.43" set of fixes referenced in the bug you mentioned
has been in linux-azure since 4.13.0-1013. Looking at your trace, I
think the fsnotify/VFS race condition I referenced in my previous
comment may be more applicable.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1772264

Title:
  watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]

Status in linux-azure package in Ubuntu:
  New

Bug description:
  Hello Team,

  I have a Customer who is experiencing this issue once every 2 days and
  here are the details of the bug :

  May 14 05:24:21 localhost kernel: [6006808.160001] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]
  May 14 05:24:21 localhost kernel: [6006808.160055] Modules linked in: ufs msdos xfs ip6table_filter ip6_tables iptable_filter nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack iptable_security ip_tables x_tables udf crc_itu_t i2c_piix4 hv_balloon joydev i2c_core serio_raw ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper hid_hyperv cryptd hyperv_fb pata_acpi hv_utils cfbfillrect cfbimgblt ptp cfbcopyarea hid hyperv_keyboard hv_netvsc pps_core
  May 14 05:24:21 localhost kernel: [6006808.160055] CPU: 5 PID: 5783 Comm: java Not tainted 4.13.0-1011-azure #14-Ubuntu
  May 14 05:24:21 localhost kernel: [6006808.160055] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  06/02/2017
  May 14 05:24:21 localhost kernel: [6006808.160055] task: ffff8b91a48fc5c0 task.stack: ffffb5c4cd014000
  May 14 05:24:21 localhost kernel: [6006808.160055] RIP: 0010:fsnotify+0x1f9/0x4f0
  May 14 05:24:21 localhost kernel: [6006808.160055] RSP: 0018:ffffb5c4cd017e08 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
  May 14 05:24:21 localhost kernel: [6006808.160055] RAX: 0000000000000001 RBX: ffff8ba0f6246020 RCX: 00000000ffffffff
  May 14 05:24:21 localhost kernel: [6006808.160055] RDX: ffff8ba0f6246048 RSI: 0000000000000000 RDI: ffffffff9bc57020
  May 14 05:24:21 localhost kernel: [6006808.160055] RBP: ffffb5c4cd017ea8 R08: 0000000000000000 R09: 0000000000000000
  May 14 05:24:21 localhost kernel: [6006808.160055] R10: ffffe93042d21080 R11: 0000000000000000 R12: 0000000000000000
  May 14 05:24:21 localhost kernel: [6006808.160055] R13: ffff8ba0f6246048 R14: 0000000000000000 R15: 0000000000000000
  May 14 05:24:21 localhost kernel: [6006808.160055] FS:  00007f154838a700(0000) GS:ffff8ba0fd740000(0000) knlGS:0000000000000000
  May 14 05:24:21 localhost kernel: [6006808.160055] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  May 14 05:24:21 localhost kernel: [6006808.160055] CR2: 00007f3fcc254000 CR3: 0000000165c38000 CR4: 00000000001406e0
  May 14 05:24:21 localhost kernel: [6006808.160055] Call Trace:
  May 14 05:24:21 localhost kernel: [6006808.160055]  ? new_sync_write+0xe5/0x140
  May 14 05:24:21 localhost kernel: [6006808.160055]  vfs_write+0x15a/0x1b0
  May 14 05:24:21 localhost kernel: [6006808.160055]  ? syscall_trace_enter+0xcd/0x2f0
  May 14 05:24:21 localhost kernel: [6006808.160055]  SyS_write+0x55/0xc0
  May 14 05:24:21 localhost kernel: [6006808.160055]  do_syscall_64+0x61/0xd0
  May 14 05:24:21 localhost kernel: [6006808.160055]  entry_SYSCALL64_slow_path+0x25/0x25
  May 14 05:24:21 localhost kernel: [6006808.160055] RIP: 0033:0x7f489076a2dd. #“This indicates Softlockup error message stored in the RIP Register”
  May 14 05:24:21 localhost kernel: [6006808.160055] RSP: 002b:00007f15483872a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
  May 14 05:24:21 localhost kernel: [6006808.160055] RAX: ffffffffffffffda RBX: 00007f1548389380 RCX: 00007f489076a2dd
  May 14 05:24:21 localhost kernel: [6006808.160055] RDX: 0000000000001740 RSI: 00007f1548387310 RDI: 000000000000063c
  May 14 05:24:21 localhost kernel: [6006808.160055] RBP: 00007f15483872d0 R08: 00007f1548387310 R09: 00007f418a55b0b8
  May 14 05:24:21 localhost kernel: [6006808.160055] R10: 00000000005b31ee R11: 0000000000000293 R12: 0000000000001740
  May 14 05:24:21 localhost kernel: [6006808.160055] R13: 00007f1548387310 R14: 000000000000063c R15: 00007f40000051e0
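
  Since the same event recurs across many servers, it may help to pull
  the key fields out of the watchdog header line automatically. The
  following is a minimal sketch, not an existing tool: the function name
  parse_soft_lockup and the regular expression are illustrative
  assumptions based only on the message format shown in the trace above.

```python
import re

# Pattern for the watchdog header line as it appears in the trace above.
# This is an assumption derived from that one message, not a kernel-defined API.
SOFT_LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<seconds>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return CPU, stall duration, command name and PID, or None if no match."""
    match = SOFT_LOCKUP_RE.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match.group("cpu")),
        "seconds": int(match.group("seconds")),
        "comm": match.group("comm"),
        "pid": int(match.group("pid")),
    }


if __name__ == "__main__":
    sample = (
        "May 14 05:24:21 localhost kernel: [6006808.160001] "
        "watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]"
    )
    print(parse_soft_lockup(sample))
```

  Running this over the syslog of each affected server would show
  whether the lockups always hit the same process and how long the
  reported stalls are.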

  The customer is running Elasticsearch and therefore filed an issue on
  the Elasticsearch GitHub project. The maintainers there indicate that
  this is a kernel issue and not an Elasticsearch issue:

  GitHub issue for reference:
  https://github.com/elastic/elasticsearch/issues/30667

  For now, as a temporary mitigation, I have asked him to increase the
  kernel.watchdog_thresh parameter from 10 to 20.
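
  For context, the kernel's soft lockup threshold is twice
  kernel.watchdog_thresh, so the default of 10 reports stalls at 20
  seconds (consistent with the "stuck for 23s" message) and a value of
  20 delays the report until 40 seconds. Raising the value only
  suppresses reports of shorter stalls; it does not remove the stall
  itself. A minimal sketch of that arithmetic follows; the helper names
  are illustrative, and the /proc path is assumed to be readable on the
  system where it runs.

```python
from pathlib import Path


def soft_lockup_threshold(watchdog_thresh):
    """Soft lockup threshold in seconds: 2 * kernel.watchdog_thresh."""
    return 2 * watchdog_thresh


def read_watchdog_thresh(path="/proc/sys/kernel/watchdog_thresh"):
    """Read the current watchdog_thresh value, or None if the file is absent."""
    sysctl_file = Path(path)
    if not sysctl_file.exists():
        return None
    return int(sysctl_file.read_text().strip())


if __name__ == "__main__":
    for thresh in (10, 20):
        print(f"watchdog_thresh={thresh}: soft lockup reported after "
              f"{soft_lockup_threshold(thresh)}s")
    current = read_watchdog_thresh()
    if current is not None:
        print(f"current setting: {current} "
              f"(threshold {soft_lockup_threshold(current)}s)")
```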

  The customer wants to know for certain whether this is a kernel bug.
  I also asked him to update the kernel. He is willing to do so on all
  65 servers, but only once it is confirmed that this is a bug in the
  current kernel.

  The customer also filed a bug with the team responsible for the Java
  process, which seems to be causing the issue. Their reply was that it
  is a kernel issue, and they pointed to the following Launchpad bug,
  although I personally think it does not really apply here. However, I
  may be wrong:

  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717

  This is the resource usage of the Java process on the customer's
  system:

  Avg. Load: Avg=3, max=9
  CPU: Avg=29, max=73
  MEM: Avg=18, max=23

  CGROUP:
  ubuntu@prod-elasticsearch-data-008:~$ cat /proc/1399/cgroup
  12:rdma:/
  11:devices:/system.slice/elasticsearch.service
  10:pids:/system.slice/elasticsearch.service
  9:cpuset:/
  8:blkio:/system.slice/elasticsearch.service
  7:memory:/system.slice/elasticsearch.service
  6:perf_event:/
  5:cpu,cpuacct:/system.slice/elasticsearch.service
  4:net_cls,net_prio:/
  3:freezer:/
  2:hugetlb:/
  1:name=systemd:/system.slice/elasticsearch.service

  /etc/lsb-release
  DISTRIB_ID=Ubuntu
  DISTRIB_RELEASE=16.04
  DISTRIB_CODENAME=xenial
  DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"

  Kernel Version : 4.13.0-1011-azure #14-Ubuntu

  Please let me know your thoughts given the above information. Also, if
  additional information is required, I will be happy to gather and
  provide it.

  Regards,
  Sriharsha B S,
  Microsoft Azure Linux Team
