** Changed in: ubuntu-power-systems
     Assignee: Canonical Kernel Team (canonical-kernel-team) => Frank Heimes (frank-heimes)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1842465

Title:
  Watchdog error about hard lockup

Status in The Ubuntu-power-systems project:
  Confirmed
Status in linux package in Ubuntu:
  Confirmed

Bug description:
  ---Problem Description---
  Got a message from Watchdog about self-detected hard LOCKUP

  ---uname output---
  Linux power 5.0.0-23-generic #24~18.04.1-Ubuntu SMP Mon Jul 29 16:08:34 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

  ---Additional Hardware Info---
  Architecture:        ppc64le
  Byte Order:          Little Endian
  CPU(s):              128
  On-line CPU(s) list: 0-127
  Thread(s) per core:  4
  Core(s) per socket:  16
  Socket(s):           2
  NUMA node(s):        6
  Model:               2.2 (pvr 004e 1202)
  Model name:          POWER9, altivec supported
  CPU max MHz:         3800.0000
  CPU min MHz:         2300.0000
  L1d cache:           32K
  L1i cache:           32K
  L2 cache:            512K
  L3 cache:            10240K
  NUMA node0 CPU(s):   0-63
  NUMA node8 CPU(s):   64-127
  NUMA node252 CPU(s):
  NUMA node253 CPU(s):
  NUMA node254 CPU(s):
  NUMA node255 CPU(s):

  --- free
                total        used        free      shared  buff/cache   available
  Mem:     1071807104     5110016   985192768     6229440    81504320  1056273664
  Swap:       2097088           0     2097088

  -- lsblk
  NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
  sda       8:0    1 894.3G  0 disk
  ├─sda1    8:1    1     7M  0 part
  └─sda2    8:2    1 894.3G  0 part /
  sdb       8:16   1 894.3G  0 disk
  nvme0n1 259:1    0   2.9T  0 disk /nvmdisk1

  --- Machine Type = AC922, bare metal

  ---Steps to Reproduce---
  I encountered this problem while running a customer workload: I switched the SMT level from SMT2 to SMT1 and got a lockup error right away. This seems to be a different one... The PostgreSQL DB daemon was running on the system.

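As a reading aid for the watchdog lines in the trace that follows, the TB (timebase) values can be turned into elapsed time. This is a minimal sketch, assuming the POWER9 timebase ticks at 512 MHz (an assumption, not stated in the report); the helper name `tb_delta_ms` is made up for illustration.

```python
# Assumption: the timebase runs at 512 MHz; the bug report does not state this.
TIMEBASE_HZ = 512_000_000


def tb_delta_ms(tb_now: int, tb_then: int, hz: int = TIMEBASE_HZ) -> int:
    """Whole milliseconds elapsed between two timebase readings."""
    return (tb_now - tb_then) * 1000 // hz


# TB values copied from the watchdog messages in the trace.
detected = 387344180861438        # "CPU 53 TB:..."
last_heartbeat = 387337108856720  # "last heartbeat TB:..."
unstuck = 387345536789288         # "became unstuck TB:..."

stall_ms = tb_delta_ms(detected, last_heartbeat)
recovery_ms = tb_delta_ms(unstuck, detected)
print(stall_ms, recovery_ms)
```

Under that assumption the first figure matches the "13812ms ago" printed by the watchdog, and the second agrees with the roughly 2.65 s gap between the dmesg timestamps of the lockup and "became unstuck" messages.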
  Stack trace output:
  [756383.688067] watchdog: CPU 53 self-detected hard LOCKUP @ _raw_spin_lock+0x54/0xe0
  [756383.688068] watchdog: CPU 53 TB:387344180861438, last heartbeat TB:387337108856720 (13812ms ago)
  [756383.688069] Modules linked in: binfmt_misc veth ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter bpfilter xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc aufs overlay vmx_crypto ofpart cmdlinepart powernv_flash ipmi_powernv opal_prd mtd ipmi_devintf at24 ibmpowernv ipmi_msghandler uio_pdrv_genirq uio sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core ast crct10dif_vpmsum i2c_algo_bit crc32c_vpmsum ttm mlx5_core drm_kms_helper syscopyarea nvme sysfillrect sysimgblt fb_sys_fops drm nvme_core ahci libahci tls mlxfw devlink tg3 drm_panel_orientation_quirks
  [756383.688088] CPU: 53 PID: 119744 Comm: postgres Not tainted 5.0.0-23-generic #24~18.04.1-Ubuntu
  [756383.688088] NIP:  c000000000e0fcc4 LR: c00000000015fd90 CTR: c000000000600460
  [756383.688089] REGS: c000007fffb3bd70 TRAP: 0900   Not tainted  (5.0.0-23-generic)
  [756383.688089] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28242824  XER: 00000000
  [756383.688091] CFAR: c000000000e0fcec IRQMASK: 1
  [756383.688092] GPR00: c00000000015fd90 c000206f2cdf7970 c00000000185c700 c00020732ea49100
  [756383.688093] GPR04: c000206f2cdf7a38 0000000000000000 c000206f2cdf7b00 0000000000000001
  [756383.688095] GPR08: 0000000000000003 000000008000007d 0000000080000035 fffffffffffffffd
  [756383.688096] GPR12: 0000000000002000 c000007ffffc5080 00007cde07504dd8 00000f495eee0d68
  [756383.688097] GPR16: 00007fffc0eb2bd7 00007fffc0eb2aa0 00000f496c289088 00007fffc0eb2a74
  [756383.688098] GPR20: 0000000000000000 0000000000000001 0000000000000001 0000000000000000
  [756383.688099] GPR24: 0000000000000000 c000206f2cdf7a38 c000000001349100 000020732d700000
  [756383.688100] GPR28: c000000001891c70 c000206f36d8b400 c000000001895c78 c00020732ea49100
  [756383.688102] NIP [c000000000e0fcc4] _raw_spin_lock+0x54/0xe0
  [756383.688102] LR [c00000000015fd90] __task_rq_lock+0x80/0x150
  [756383.688102] Call Trace:
  [756383.688103] [c000206f2cdf7970] [c000206f2cdf79d0] 0xc000206f2cdf79d0 (unreliable)
  [756383.688103] [c000206f2cdf79a0] [c000007fd3847818] 0xc000007fd3847818
  [756383.688104] [c000206f2cdf7a10] [c0000000001649c0] try_to_wake_up+0x380/0x710
  [756383.688105] [c000206f2cdf7aa0] [c000000000164de0] wake_up_q+0x70/0xd0
  [756383.688105] [c000206f2cdf7ae0] [c0000000005fab54] do_semtimedop+0x474/0xcc0
  [756383.688106] [c000206f2cdf7d60] [c0000000005fc634] ksys_semtimedop+0xd4/0xf0
  [756383.688107] [c000206f2cdf7dc0] [c00000000060047c] sys_ipc+0x14c/0x470
  [756383.688107] [c000206f2cdf7e20] [c00000000000b288] system_call+0x5c/0x70
  [756383.688108] Instruction dump:
  [756383.688108] 40c20010 7d40192d 40c2fff0 7c2004ac 2fa90000 4d9e0020 fbc1fff0 3fc20004
  [756383.688110] 3bde9578 fbe1fff8 7c7f1b78 f821ffd1 <7c210b78> e93e0000 75290010 41820014
  [756386.336267] watchdog: CPU 53 became unstuck TB:387345536789288
  [756386.336292] CPU: 53 PID: 330 Comm: migration/53 Not tainted 5.0.0-23-generic #24~18.04.1-Ubuntu
  [756386.336294] Call Trace:
  [756386.336301] [c000007fed49fb40] [c000000000dea90c] dump_stack+0xb0/0xf4 (unreliable)
  [756386.336307] [c000007fed49fb80] [c0000000000342dc] wd_smp_clear_cpu_pending+0x41c/0x430
  [756386.336311] [c000007fed49fc30] [c00000000022909c] multi_cpu_stop+0x14c/0x210
  [756386.336313] [c000007fed49fc90] [c0000000002294bc] cpu_stopper_thread+0xfc/0x1e0
  [756386.336317] [c000007fed49fd40] [c000000000157d00] smpboot_thread_fn+0x270/0x2c0
  [756386.336321] [c000007fed49fdb0] [c000000000151608] kthread+0x1a8/0x1b0
  [756386.336324] [c000007fed49fe20] [c00000000000b65c] ret_from_kernel_thread+0x5c/0x80
  [771875.432658] irq_migrate_all_off_this_cpu: 91 callbacks suppressed
  [771875.432660] IRQ 110: no longer affine to CPU1
  [771875.432694] IRQ 194: no longer affine to CPU1
  [771875.498115] IRQ 192: no longer affine to CPU5
  [771875.498124] IRQ 193: no longer affine to CPU5
  [771875.498133] IRQ 201: no longer affine to CPU5
  [771875.551051] IRQ 153: no longer affine to CPU9
  [771875.551073] IRQ 229: no longer affine to CPU9
  [771875.551149] IRQ 543: no longer affine to CPU9
  [771875.602160] IRQ 199: no longer affine to CPU13
  [771875.602170] IRQ 226: no longer affine to CPU13

  == srikar.dronamr...@in.ibm.com ==
  Also these false positives will probably be fixed by the commit
  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7ae3f6e130e8dc6188b59e3b4ebc2f16e9c8d053
  which reads

  From 7ae3f6e130e8dc6188b59e3b4ebc2f16e9c8d053 Mon Sep 17 00:00:00 2001
  From: Nicholas Piggin <npig...@gmail.com>
  Date: Tue, 9 Apr 2019 14:40:05 +1000
  Subject: [PATCH] powerpc/watchdog: Use hrtimers for per-CPU heartbeat

  Using a jiffies timer creates a dependency on the tick_do_timer_cpu
  incrementing jiffies. If that CPU has locked up and jiffies is not
  incrementing, the watchdog heartbeat timer for all CPUs stops and
  creates false positives and confusing warnings on local CPUs, and
  also causes the SMP detector to stop, so the root cause is never
  detected.

  Fix this by using hrtimer based timers for the watchdog heartbeat,
  like the generic kernel hardlockup detector.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1842465/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp