You have been subscribed to a public bug:

---Problem Description---
Got a message from Watchdog about self-detected hard LOCKUP
 
---uname output---
Linux power 5.0.0-23-generic #24~18.04.1-Ubuntu SMP Mon Jul 29 16:08:34 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux
 
---Additional Hardware Info---
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        6
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2300.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
NUMA node252 CPU(s):
NUMA node253 CPU(s):
NUMA node254 CPU(s):
NUMA node255 CPU(s):
---
free
              total        used        free      shared  buff/cache   available
Mem:     1071807104     5110016   985192768     6229440    81504320  1056273664
Swap:       2097088           0     2097088
--
lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda       8:0    1 894.3G  0 disk
├─sda1    8:1    1     7M  0 part
└─sda2    8:2    1 894.3G  0 part /
sdb       8:16   1 894.3G  0 disk
nvme0n1 259:1    0   2.9T  0 disk /nvmdisk1
---
 
Machine Type = AC922, bare metal 
 
---Steps to Reproduce---
 I encountered this problem while running a customer workload. I switched SMT 
levels from SMT2 to SMT1 and got a lockup error right away. This seems to be a 
different one. The postgresql DB daemon was running on the system.
 
Stack trace output:
 [756383.688067] watchdog: CPU 53 self-detected hard LOCKUP @ 
_raw_spin_lock+0x54/0xe0
[756383.688068] watchdog: CPU 53 TB:387344180861438, last heartbeat 
TB:387337108856720 (13812ms ago)
[756383.688069] Modules linked in: binfmt_misc veth ipt_MASQUERADE 
nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 
xt_addrtype iptable_filter bpfilter xt_conntrack nf_nat nf_conntrack 
nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc aufs overlay 
vmx_crypto ofpart cmdlinepart powernv_flash ipmi_powernv opal_prd mtd 
ipmi_devintf at24 ibmpowernv ipmi_msghandler uio_pdrv_genirq uio sch_fq_codel 
ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi 
scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 
raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 
libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core ast 
crct10dif_vpmsum i2c_algo_bit crc32c_vpmsum ttm mlx5_core drm_kms_helper 
syscopyarea nvme sysfillrect sysimgblt fb_sys_fops drm nvme_core ahci libahci 
tls mlxfw devlink tg3 drm_panel_orientation_quirks
[756383.688088] CPU: 53 PID: 119744 Comm: postgres Not tainted 5.0.0-23-generic 
#24~18.04.1-Ubuntu
[756383.688088] NIP:  c000000000e0fcc4 LR: c00000000015fd90 CTR: 
c000000000600460
[756383.688089] REGS: c000007fffb3bd70 TRAP: 0900   Not tainted  
(5.0.0-23-generic)
[756383.688089] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28242824  
XER: 00000000
[756383.688091] CFAR: c000000000e0fcec IRQMASK: 1 
[756383.688092] GPR00: c00000000015fd90 c000206f2cdf7970 c00000000185c700 
c00020732ea49100 
[756383.688093] GPR04: c000206f2cdf7a38 0000000000000000 c000206f2cdf7b00 
0000000000000001 
[756383.688095] GPR08: 0000000000000003 000000008000007d 0000000080000035 
fffffffffffffffd 
[756383.688096] GPR12: 0000000000002000 c000007ffffc5080 00007cde07504dd8 
00000f495eee0d68 
[756383.688097] GPR16: 00007fffc0eb2bd7 00007fffc0eb2aa0 00000f496c289088 
00007fffc0eb2a74 
[756383.688098] GPR20: 0000000000000000 0000000000000001 0000000000000001 
0000000000000000 
[756383.688099] GPR24: 0000000000000000 c000206f2cdf7a38 c000000001349100 
000020732d700000 
[756383.688100] GPR28: c000000001891c70 c000206f36d8b400 c000000001895c78 
c00020732ea49100 
[756383.688102] NIP [c000000000e0fcc4] _raw_spin_lock+0x54/0xe0
[756383.688102] LR [c00000000015fd90] __task_rq_lock+0x80/0x150
[756383.688102] Call Trace:
[756383.688103] [c000206f2cdf7970] [c000206f2cdf79d0] 0xc000206f2cdf79d0 
(unreliable)
[756383.688103] [c000206f2cdf79a0] [c000007fd3847818] 0xc000007fd3847818
[756383.688104] [c000206f2cdf7a10] [c0000000001649c0] try_to_wake_up+0x380/0x710
[756383.688105] [c000206f2cdf7aa0] [c000000000164de0] wake_up_q+0x70/0xd0
[756383.688105] [c000206f2cdf7ae0] [c0000000005fab54] do_semtimedop+0x474/0xcc0
[756383.688106] [c000206f2cdf7d60] [c0000000005fc634] ksys_semtimedop+0xd4/0xf0
[756383.688107] [c000206f2cdf7dc0] [c00000000060047c] sys_ipc+0x14c/0x470
[756383.688107] [c000206f2cdf7e20] [c00000000000b288] system_call+0x5c/0x70
[756383.688108] Instruction dump:
[756383.688108] 40c20010 7d40192d 40c2fff0 7c2004ac 2fa90000 4d9e0020 fbc1fff0 
3fc20004 
[756383.688110] 3bde9578 fbe1fff8 7c7f1b78 f821ffd1 <7c210b78> e93e0000 
75290010 41820014 
[756386.336267] watchdog: CPU 53 became unstuck TB:387345536789288
[756386.336292] CPU: 53 PID: 330 Comm: migration/53 Not tainted 
5.0.0-23-generic #24~18.04.1-Ubuntu
[756386.336294] Call Trace:
[756386.336301] [c000007fed49fb40] [c000000000dea90c] dump_stack+0xb0/0xf4 
(unreliable)
[756386.336307] [c000007fed49fb80] [c0000000000342dc] 
wd_smp_clear_cpu_pending+0x41c/0x430
[756386.336311] [c000007fed49fc30] [c00000000022909c] multi_cpu_stop+0x14c/0x210
[756386.336313] [c000007fed49fc90] [c0000000002294bc] 
cpu_stopper_thread+0xfc/0x1e0
[756386.336317] [c000007fed49fd40] [c000000000157d00] 
smpboot_thread_fn+0x270/0x2c0
[756386.336321] [c000007fed49fdb0] [c000000000151608] kthread+0x1a8/0x1b0
[756386.336324] [c000007fed49fe20] [c00000000000b65c] 
ret_from_kernel_thread+0x5c/0x80
[771875.432658] irq_migrate_all_off_this_cpu: 91 callbacks suppressed
[771875.432660] IRQ 110: no longer affine to CPU1
[771875.432694] IRQ 194: no longer affine to CPU1
[771875.498115] IRQ 192: no longer affine to CPU5
[771875.498124] IRQ 193: no longer affine to CPU5
[771875.498133] IRQ 201: no longer affine to CPU5
[771875.551051] IRQ 153: no longer affine to CPU9
[771875.551073] IRQ 229: no longer affine to CPU9
[771875.551149] IRQ 543: no longer affine to CPU9
[771875.602160] IRQ 199: no longer affine to CPU13
[771875.602170] IRQ 226: no longer affine to CPU13
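
The watchdog lines in the trace above give raw timebase (TB) readings next to a derived "13812ms ago". As a sanity check, the arithmetic can be reproduced with the short sketch below. It assumes the POWER9 timebase ticks at 512 MHz; that frequency is not stated in the log, and the function and variable names are made up for illustration.

```python
# Assumption: the POWER9 timebase runs at 512 MHz (not stated in the log).
TB_HZ = 512_000_000


def tb_delta_ms(tb_now, tb_then, tb_hz=TB_HZ):
    """Milliseconds between two timebase readings, truncated to an integer."""
    return (tb_now - tb_then) * 1000 // tb_hz


# Values copied from the watchdog messages for CPU 53.
lockup_tb = 387344180861438     # "self-detected hard LOCKUP" TB
heartbeat_tb = 387337108856720  # "last heartbeat" TB
unstuck_tb = 387345536789288    # "became unstuck" TB

stalled_ms = tb_delta_ms(lockup_tb, heartbeat_tb)
recovered_ms = tb_delta_ms(unstuck_tb, lockup_tb)

print(f"no heartbeat for {stalled_ms} ms before the report")
print(f"unstuck {recovered_ms} ms after the report")
```

Under that assumption the first value comes out as 13812 ms, matching the log, and the second as 2648 ms, which agrees with the gap between the two dmesg timestamps (756383.688 and 756386.336).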


== srikar.dronamr...@in.ibm.com ==
Also these false positives will probably be fixed by the commit 

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7ae3f6e130e8dc6188b59e3b4ebc2f16e9c8d053

which reads 
From 7ae3f6e130e8dc6188b59e3b4ebc2f16e9c8d053 Mon Sep 17 00:00:00 2001
From: Nicholas Piggin <npig...@gmail.com>
Date: Tue, 9 Apr 2019 14:40:05 +1000
Subject: [PATCH] powerpc/watchdog: Use hrtimers for per-CPU heartbeat

Using a jiffies timer creates a dependency on the tick_do_timer_cpu
incrementing jiffies. If that CPU has locked up and jiffies is not
incrementing, the watchdog heartbeat timer for all CPUs stops and
creates false positives and confusing warnings on local CPUs, and
also causes the SMP detector to stop, so the root cause is never
detected.

Fix this by using hrtimer based timers for the watchdog heartbeat,
like the generic kernel hardlockup detector.
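
The dependency described in the commit message can be illustrated with a toy two-CPU model. This is only a sketch and not the kernel implementation: the function name `simulate`, the tick granularity and the threshold are all hypothetical, and the model covers only the false positive on a healthy CPU, not the SMP detector.

```python
def simulate(ticks, stalled_from, use_hrtimer, threshold=10):
    """Return the set of CPUs flagged as locked up in a toy two-CPU model.

    CPU 0 is the timekeeper that advances `jiffies`; it stalls at tick
    `stalled_from`. CPU 1 stays healthy for the whole run.
    """
    jiffies = 0
    last_heartbeat = {0: 0, 1: 0}
    next_timer = 1  # jiffies value at which CPU 1's heartbeat timer fires
    flagged = set()

    for now in range(1, ticks + 1):
        if now < stalled_from:
            # CPU 0 is alive: it advances jiffies and beats its own heart.
            jiffies += 1
            last_heartbeat[0] = now

        if use_hrtimer:
            # Heartbeat driven by CPU 1's own clock, independent of jiffies.
            last_heartbeat[1] = now
        elif jiffies >= next_timer:
            # Heartbeat fires only if jiffies has advanced far enough.
            last_heartbeat[1] = now
            next_timer = jiffies + 1

        for cpu, beat in last_heartbeat.items():
            if now - beat > threshold:
                flagged.add(cpu)

    return flagged


jiffies_based = simulate(ticks=50, stalled_from=20, use_hrtimer=False)
hrtimer_based = simulate(ticks=50, stalled_from=20, use_hrtimer=True)
print("jiffies timer flags CPUs:", sorted(jiffies_based))
print("per-CPU timer flags CPUs:", sorted(hrtimer_based))
```

In this model the jiffies-driven heartbeat flags the healthy CPU 1 along with the stalled CPU 0, while the independently clocked heartbeat flags only CPU 0.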

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
         Status: New


** Tags: architecture-ppc64le bugnameltc-180737 severity-high 
targetmilestone-inin18041
-- 
Watchdog error about hard lockup
https://bugs.launchpad.net/bugs/1842465
You received this bug notification because you are a member of Kernel Packages, 
which is subscribed to linux in Ubuntu.

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
