------- Comment From dnban...@us.ibm.com 2018-04-26 10:58 EDT-------

I took a quick look at the crash stacks mentioned in c191-c193. Since we don't have a debug kernel for "4.15.0-15-generic #16+bug166588", I could only go by the stacks themselves. From those, it seems reasonable to conclude that these are all manifestations of issues we have seen before. I have tried to categorize them below.

Note that some of these were hit before booting into the actual kernel, so it would be a good idea to install a skiroot kernel with the above patches as well (as was indeed decided in the meeting, and as Klaus mentions in #194).
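As a rough aid to the categorization below, here is a minimal Python sketch that buckets oops excerpts by the function named on the NIP line. The helper names (`nip_symbol`, `group_by_nip`) and the regular expression are illustrative assumptions, not part of any kernel tooling. Grouping by NIP alone is only a heuristic and does not replace reading the stacks.

```python
import re
from collections import defaultdict

# Matches a line such as:
#   "NIP [c000000000389760] kmem_cache_alloc+0x2e0/0x340"
# Group 1 is the address, group 2 the symbol name (assumed format).
NIP_RE = re.compile(
    r"NIP \[([0-9a-f]{16})\] ([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def nip_symbol(oops_text):
    """Return the symbol on the first NIP line, or None if there is none."""
    match = NIP_RE.search(oops_text)
    return match.group(2) if match else None


def group_by_nip(crashes):
    """Map each faulting symbol to the list of crash IDs that hit it."""
    groups = defaultdict(list)
    for crash_id, text in crashes.items():
        groups[nip_symbol(text)].append(crash_id)
    return dict(groups)


# NIP lines copied from the four crashes discussed in this comment.
crashes = {
    "201804260138": "[ 27.682301] NIP [c000000000389760] kmem_cache_alloc+0x2e0/0x340",
    "201804252219": "[ 84.702368] NIP [c000000000389ed0] kmem_cache_alloc_node+0x2f0/0x350",
    "201804251933": "[ 7083.142916] NIP [c00000000013277c] process_one_work+0x3c/0x5a0",
    "201804251726": "[ 48.707329] NIP [c000000000389ed0] kmem_cache_alloc_node+0x2f0/0x350",
}

for symbol, ids in group_by_nip(crashes).items():
    print(symbol, ids)
```

Crashes that fault in the slab allocator fall into the `kmem_cache_alloc*` buckets, while the workqueue crash lands in its own bucket.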

crash 201804260138
==============
[ 27.682301] NIP [c000000000389760] kmem_cache_alloc+0x2e0/0x340
[ 27.682343] LR [c00000000038974c] kmem_cache_alloc+0x2cc/0x340
[ 27.682386] Call Trace:
[ 27.682406] [c000000005fef5c0] [c000000005fef610] 0xc000000005fef610 (unreliable)
[ 27.682459] [c000000005fef620] [c0000000002dfacc] mempool_alloc_slab+0x2c/0x40
[ 27.682510] [c000000005fef640] [c0000000002dff18] mempool_alloc+0x88/0x1e0
[ 27.682555] [c000000005fef6d0] [c0000000006724fc] bio_alloc_bioset+0x1ac/0x2e0
[ 27.682607] [c000000005fef740] [c00000000042a904] submit_bh_wbc+0xd4/0x240
[ 27.682650] [c000000005fef790] [c00000000042b9a0] ll_rw_block+0x130/0x1a0
[ 27.682694] [c000000005fef7f0] [c00000000042bae4] __breadahead+0x44/0xb0
[ 27.682739] [c000000005fef820] [c0000000004cb9a8] __ext4_get_inode_loc+0x448/0x5c0
[ 27.682789] [c000000005fef8e0] [c0000000004cffbc] ext4_iget+0x9c/0xc40
[ 27.682832] [c000000005fef9d0] [c0000000004ef234] ext4_lookup+0x1b4/0x2e0

GPR24: e6eef6af4c054c5f c000200e585a3901 26eed6a1145f755e c0000000002dfacc ^^^^^^^^^^^^
GPR28: c000000ff901ee00 0000000001011200 c000200e585a3901 c000000ff901ee00 ^^^^^^^^^^^^ ^^^^^^^^^^^^

Appears to be kmem cache corruption; likely another instance of the double-free issue.

crash 201804252219
==============
[ 84.702368] NIP [c000000000389ed0] kmem_cache_alloc_node+0x2f0/0x350
[ 84.702407] LR [c000000000389ebc] kmem_cache_alloc_node+0x2dc/0x350
[ 84.702446] Call Trace:
[ 84.702463] [c000000005e77940] [c000000000389d94] kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
[ 84.702520] [c000000005e779b0] [c000000000b2eb6c] __alloc_skb+0x6c/0x220
[ 84.702560] [c000000005e77a10] [c000000000b30a6c] alloc_skb_with_frags+0x7c/0x2e0
[ 84.702608] [c000000005e77aa0] [c000000000b246cc] sock_alloc_send_pskb+0x29c/0x2c0
[ 84.702655] [c000000005e77b50] [c000000000c569e4] unix_stream_sendmsg+0x264/0x5c0
[ 84.702703] [c000000005e77c30] [c000000000b1eb64] sock_sendmsg+0x64/0x90
[ 84.702743] [c000000005e77c60] [c000000000b1ec48] sock_write_iter+0xb8/0x120
[ 84.702791] [c000000005e77d00] [c0000000003cf494] new_sync_write+0x104/0x160
[ 84.702838] [c000000005e77d90] [c0000000003d2bd8] vfs_write+0xd8/0x220
[ 84.702878] [c000000005e77de0] [c0000000003d2ef8] SyS_write+0x68/0x110
[ 84.702919] [c000000005e77e30] [c00000000000b184] system_call+0x58/0x6c

GPR24: c000200e585ebc01 26eed6a1145bf0fd c000000000b2eb6c c000000ff901ee00 ^^^^^^^^^^^^
GPR28: ffffffffffffffff 00000000015004c0 c000200e585ebc01 c000000ff901ee00 ^^^^^^^^^^^^ ^^^^^^^^^^^^

appears to be kmem cache corruption. another case of double free (?)

crash 201804251933
=============
[ 7083.142916] NIP [c00000000013277c] process_one_work+0x3c/0x5a0
[ 7083.142965] LR [c000000000132d78] worker_thread+0x98/0x630
[ 7083.143004] Call Trace:
[ 7083.143026] [c000200bb70b7c90] [c0000000001329f4] process_one_work+0x2b4/0x5a0 (unreliable)
[ 7083.143085] [c000200bb70b7d20] [c000000000132d78] worker_thread+0x98/0x630
[ 7083.143134] [c000200bb70b7dc0] [c00000000013b9a8] kthread+0x1a8/0x1b0
[ 7083.143185] [c000200bb70b7e30] [c00000000000b528] ret_from_kernel_thread+0x5c/0xb4

GPR08: c000200e60eb7df0 0000000000000000 0000000000002040 c000200e60ea10a8 ^^^^^^^^^^^^

the worker object issue again.

crash 201804251726
==============
[ 48.707329] NIP [c000000000389ed0] kmem_cache_alloc_node+0x2f0/0x350
[ 48.707376] LR [c000000000389ebc] kmem_cache_alloc_node+0x2dc/0x350
[ 48.707422] Call Trace:
[ 48.707444] [c000200e46c07890] [c000000000389d94] kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
[ 48.707511] [c000200e46c07900] [c000000000b2eb6c] __alloc_skb+0x6c/0x220
[ 48.707561] [c000200e46c07960] [c000000000cf4004] kobject_uevent_env+0x804/0xa40
[ 48.707620] [c000200e46c07a40] [c000000000aa3338] dm_kobject_uevent+0x78/0xd0
[ 48.707676] [c000200e46c07ae0] [c000000000aab930] dev_suspend+0x360/0x390
[ 48.707725] [c000200e46c07b30] [c000000000aac110] ctl_ioctl+0x200/0x5a0
[ 48.707773] [c000200e46c07d20] [c000000000aac4d0] dm_ctl_ioctl+0x20/0x30
[ 48.707822] [c000200e46c07d40] [c0000000003ef9f4] do_vfs_ioctl+0xd4/0xa00
[ 48.707870] [c000200e46c07de0] [c0000000003f03e4] SyS_ioctl+0xc4/0x130
[ 48.707920] [c000200e46c07e30] [c00000000000b184] system_call+0x58/0x6c

GPR24: c000200e585e3a01 26eed6a1145b76a7 c000000000b2eb6c c000000ff901ee00 ^^^^^^^^^^^^
GPR28: ffffffffffffffff 00000000014000c0 c000200e585e3a01 c000000ff901ee00 ^^^^^^^^^^^^ ^^^^^^^^^^^^

appears to be a case of kmem cache corruption again.

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1762844

Title:
  ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp3: Host crashed & enters into xmon after moving to 4.15.0-15.16 kernel

Status in The Ubuntu-power-systems project:
  Incomplete
Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Bionic:
  Incomplete

Bug description:
  Problem Description:
  ===================
  Host crashed & enters into xmon after updating to 4.15.0-15.16 kernel.

  Steps to re-create:
  ==================
  1. boslcp3 is up with BMC:118 & PNOR: 20180330 levels
  2. Installed boslcp3 with latest kernel 4.15.0-13-generic
  3. Enabled "-proposed" kernel in /etc/apt/sources.list file
  4. Ran sudo apt-get update & apt-get upgrade
  5. root@boslcp3:~# ls /boot
  abi-4.15.0-13-generic         retpoline-4.15.0-13-generic
  abi-4.15.0-15-generic         retpoline-4.15.0-15-generic
  config-4.15.0-13-generic      System.map-4.15.0-13-generic
  config-4.15.0-15-generic      System.map-4.15.0-15-generic
  grub                          vmlinux
  initrd.img                    vmlinux-4.15.0-13-generic
  initrd.img-4.15.0-13-generic  vmlinux-4.15.0-15-generic
  initrd.img-4.15.0-15-generic  vmlinux.old
  initrd.img.old
  6. Rebooted & booted with 4.15.0-15 kernel
  7. Enabled xmon by editing file "vi /etc/default/grub" and ran update-grub
  8. Rebooted host.
  9. Booted with 4.15.0-15 & provided root/password credentials in login prompt
  10. Host crashed & enters into XMON state with 'Unable to handle kernel paging request'

  root@boslcp3:~# [ 66.295233] Unable to handle kernel paging request for data at address 0x8882f6ed90e9151a
  [ 66.295297] Faulting instruction address: 0xc00000000038a110
  cpu 0x50: Vector: 380 (Data Access Out of Range) at [c00000000692f650]
      pc: c00000000038a110: kmem_cache_alloc_node+0x2f0/0x350
      lr: c00000000038a0fc: kmem_cache_alloc_node+0x2dc/0x350
      sp: c00000000692f8d0
     msr: 9000000000009033
     dar: 8882f6ed90e9151a
    current = 0xc00000000698fd00
    paca = 0xc00000000fab7000 softe: 0 irq_happened: 0x01
    pid = 1762, comm = systemd-journal
  Linux version 4.15.0-15-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 (Ubuntu 7.3.0-14ubuntu1)) #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 (Ubuntu 4.15.0-15.16-generic 4.15.15)
  enter ? for help
  [c00000000692f8d0] c000000000389fd4 kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
  [c00000000692f940] c000000000b2ec6c __alloc_skb+0x6c/0x220
  [c00000000692f9a0] c000000000b30b6c alloc_skb_with_frags+0x7c/0x2e0
  [c00000000692fa30] c000000000b247cc sock_alloc_send_pskb+0x29c/0x2c0
  [c00000000692fae0] c000000000c5705c unix_dgram_sendmsg+0x15c/0x8f0
  [c00000000692fbc0] c000000000b1ec64 sock_sendmsg+0x64/0x90
  [c00000000692fbf0] c000000000b20abc ___sys_sendmsg+0x31c/0x390
  [c00000000692fd90] c000000000b221ec __sys_sendmsg+0x5c/0xc0
  [c00000000692fe30] c00000000000b184 system_call+0x58/0x6c
  --- Exception: c00 (System Call) at 000074826f6fa9c4
  SP (7ffff5dc5510) is in userspace
  50:mon>
  50:mon>

  11. Attached Host console logs

  I rebooted the host just to see if it would hit the issue again, and this time I didn't even get to the login prompt, but it crashed in the same location:

  50:mon> r
  R00 = c000000000389fd4   R16 = c000200e0b20fdc0
  R01 = c000200e0b20f8d0   R17 = 0000000000000048
  R02 = c0000000016eb400   R18 = 000000000001fe80
  R03 = 0000000000000001   R19 = 0000000000000000
  R04 = 0048ca1cff37803d   R20 = 0000000000000000
  R05 = 0000000000000688   R21 = 0000000000000000
  R06 = 0000000000000001   R22 = 0000000000000048
  R07 = 0000000000000687   R23 = 4882d6e3c8b7ab55
  R08 = 48ca1cff37802b68   R24 = c000200e5851df01
  R09 = 0000000000000000   R25 = 8882f6ed90e67454
  R10 = 0000000000000000   R26 = c000000000b2ec6c
  R11 = c000000000d10f78   R27 = c000000ff901ee00
  R12 = 0000000000002000   R28 = ffffffffffffffff
  R13 = c00000000fab7000   R29 = 00000000015004c0
  R14 = c000200e4c973fc8   R30 = c000200e5851df01
  R15 = c000200e4c974238   R31 = c000000ff901ee00
  pc = c00000000038a110 kmem_cache_alloc_node+0x2f0/0x350
  cfar= c000000000016e1c arch_local_irq_restore+0x1c/0x90
  lr = c00000000038a0fc kmem_cache_alloc_node+0x2dc/0x350
  msr = 9000000000009033   cr = 28002844
  ctr = c00000000061e1b0   xer = 0000000000000000   trap = 380
  dar = 8882f6ed90e67454   dsisr = c000200e40bd8400
  50:mon> t
  [c000200e0b20f8d0] c000000000389fd4 kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
  [c000200e0b20f940] c000000000b2ec6c __alloc_skb+0x6c/0x220
  [c000200e0b20f9a0] c000000000b30b6c alloc_skb_with_frags+0x7c/0x2e0
  [c000200e0b20fa30] c000000000b247cc sock_alloc_send_pskb+0x29c/0x2c0
  [c000200e0b20fae0] c000000000c56ae4 unix_stream_sendmsg+0x264/0x5c0
  [c000200e0b20fbc0] c000000000b1ec64 sock_sendmsg+0x64/0x90
  [c000200e0b20fbf0] c000000000b20abc ___sys_sendmsg+0x31c/0x390
  [c000200e0b20fd90] c000000000b221ec __sys_sendmsg+0x5c/0xc0
  [c000200e0b20fe30] c00000000000b184 system_call+0x58/0x6c
  --- Exception: c01 (System Call) at 00007d16a993a940
  SP (7ffffbee2270) is in userspace

  Mirroring to Canonical to advise them that this might be a possible regression. I didn't see any obvious changes in this area in the changelog published at https://launchpad.net/ubuntu/+source/linux/4.15.0-15.16, but it would be good to have Canonical help review the deltas as we try to isolate this further.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1762844/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp