------- Comment From dougm...@us.ibm.com 2018-05-02 14:39 EDT-------
The SAN incident in the previous dmesg log shows only a single port (WWPN) 
glitching. The logs from panics showed two ports glitching at the same time. 
Also, this incident did not show the port logging back in for about 8 minutes, 
whereas the panics showed immediate/concurrent login. So, I'm not certain if 
we've proven the fix yet.

------- Comment From kla...@br.ibm.com 2018-05-02 16:32 EDT-------
I think next steps here are:

1) Apply all the known firmware workarounds (GH 1158).
2) Bring up the system with Doug's recommendations for log verbosity (comments 
211 and 215). Also capture the console output to a separate file if possible.
3) Restart the test using this same kernel, but with no stress on the host: 
proceed to restart the 3 guests with stress, and have a 4th guest migrating 
between boslcp3 and boslcp4.
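For step 2, a minimal sketch of capturing the console to its own file (the serial-over-LAN command and BMC details are assumptions, not something from this thread):

```shell
# Hedged sketch: console capture for step 2. On these OpenPOWER hosts
# the console is typically reached through the BMC's serial-over-LAN;
# the hostname and credentials below are placeholders.
#   ipmitool -I lanplus -H <boslcp3-bmc> -U <user> -P <pass> sol activate \
#       | tee boslcp3-console.log
# tee writes the stream to a file while still echoing it to the screen,
# demonstrated here with a stand-in stream:
echo "console line" | tee /tmp/boslcp3-console.log
cat /tmp/boslcp3-console.log
```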

------- Comment From dougm...@us.ibm.com 2018-05-02 16:36 EDT-------
(In reply to comment #218)
> I think next steps here are:
>
> 1) apply all the known firmware workarounds (GH 1158)
> 2) Bring up system with Doug's recommendations  for log verbosity (comment
> 211 and 215). Also capture the console output to a separate file if possible.
> 3) re-start the test using this same kernel, but with no stress on the host:
> proceed to restart the 3 guests with stress, and have a 4th guest migrating
> between boslcp3 and 4.

Klaus, let's hold off on making more changes right now. I'd like to let
things run as-is a little longer.

------- Comment From indira.pr...@in.ibm.com 2018-05-02 23:21 EDT-------
Attached boslcp3 host console tee logs.

------- Comment From indira.pr...@in.ibm.com 2018-05-03 03:22 EDT-------
The boslcp3 host console is dumping messages related to the qlogic driver.

Latest tee logs for boslcp3 host :

kte111.isst.aus.stglabs.ibm.com 9.3.111.155 [kte/don2rry]

kte111:/LOGS/boslcp3-host-may1.txt

[ipjoga@kte (AUS) ~]$ ls -l /LOGS/boslcp3-host-may1.txt
-rwxrwxr-x 1 ipjoga ipjoga 20811302 May  3 02:12 /LOGS/boslcp3-host-may1.txt

Regards,
Indira

------- Comment From dougm...@us.ibm.com 2018-05-03 08:20 EDT-------
There were a large number of SAN incidents in the evening, although none 
involved two ports at the same time. Still, many involved relogin while the 
logout was still being processed - so there is some confidence that the patches 
may be working.

There was a large period of SAN instability between May  2 21:42:09 and
21:58:47. This involved only one port (21:00:00:24:ff:7e:f6:fe). It
would be interesting if this could be traced back to some activity,
either on this machine or on the SAN (e.g. was migration being tested on
other machines at this point?).

We still have not seen the same situation that was associated with the
panics (two or more ports experiencing instability at the same time), so
it's not clear if we can conclude that the patches fix the original
problem. If we could find some trigger for the instability, we might be
able to orchestrate the situation originally seen.

------- Comment From indira.pr...@in.ibm.com 2018-05-04 11:10 EDT-------
We could not install the 'sar' package due to the prior 166588 patch, and 
'xfs' was still in use on the system from the prior run. To overcome both, we 
planned a fresh installation. We installed the latest Ubuntu 18.04 kernel 
(4.15.0-20) on the LSI disk and booted from it. The login prompt appeared and 
we entered credentials; within less than a minute the system dumped messages 
and started rebooting, leaving no time to run anything at the console prompt.

We made multiple attempts to boot with the latest kernel; once logged in,
the system reboots with call traces as below.

Ubuntu 18.04 LTS boslcp3 hvc0

boslcp3 login: [   51.679446] sd 3:0:1:0: rejecting I/O to offline device
[   58.251326] Unable to handle kernel paging request for data at address 
0xbf52a78fa0cf2419
[   58.251413] Faulting instruction address: 0xc00000000038ae70
[   58.251462] Oops: Kernel access of bad area, sig: 11 [#1]
[   58.251500] LE SMP NR_CPUS=2048 NUMA PowerNV
[   58.251543] Modules linked in: rpcsec_gss_krb5(E) nfsv4(E) nfs(E) fscache(E) 
binfmt_misc(E) dm_service_time(E) dm_multipath(E) scsi_dh_rdac(E) 
scsi_dh_emc(E) scsi_dh_alua(E) joydev(E) input_leds(E) mac_hid(E) 
idt_89hpesx(E) at24(E) uio_pdrv_genirq(E) uio(E) vmx_crypto(E) ofpart(E) 
crct10dif_vpmsum(E) cmdlinepart(E) powernv_flash(E) mtd(E) opal_prd(E) 
ipmi_powernv(E) ibmpowernv(E) ipmi_devintf(E) ipmi_msghandler(E) nfsd(E) 
auth_rpcgss(E) nfs_acl(E) sch_fq_codel(E) lockd(E) grace(E) sunrpc(E) 
ip_tables(E) x_tables(E) autofs4(E) ses(E) enclosure(E) hid_generic(E) 
usbhid(E) hid(E) qla2xxx(E) ast(E) i2c_algo_bit(E) ttm(E) mpt3sas(E) ixgbe(E) 
drm_kms_helper(E) nvme_fc(E) syscopyarea(E) sysfillrect(E) nvme_fabrics(E) 
sysimgblt(E) fb_sys_fops(E) nvme_core(E) raid_class(E) crc32c_vpmsum(E) drm(E) 
i40e(E)
[   58.252067]  scsi_transport_sas(E) aacraid(E) scsi_transport_fc(E) mdio(E)
[   58.252120] CPU: 80 PID: 1740 Comm: ureadahead Tainted: G            E    
4.15.0-20-generic #21+bug166588
[   58.252186] NIP:  c00000000038ae70 LR: c00000000038ae5c CTR: c000000000621860
[   58.252245] REGS: c000000fd98b76c0 TRAP: 0380   Tainted: G            E     
(4.15.0-20-generic)
[   58.252309] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24002844  
XER: 00000000
[   58.252373] CFAR: c000000000016e1c SOFTE: 1
[   58.252373] GPR00: c00000000038ad34 c000000fd98b7940 c0000000016eae00 
0000000000000001
[   58.252373] GPR04: 007f2daa2bd342ac 00000000000005ea 0000000000000001 
00000000000005e9
[   58.252373] GPR08: 7f2daa2bd34242b4 0000000000000000 0000000000000000 
0000000000000000
[   58.252373] GPR12: 0000000000002000 c00000000fab7000 c000000fd9d9f848 
c000000fd9d9fab8
[   58.252373] GPR16: c000000fd98b7c90 000000000000002a 000000000001fe80 
0000000000000000
[   58.252373] GPR20: 0000000000000000 0000000000000000 000000000000002a 
7f528781f8910018
[   58.252373] GPR24: c000200e585e2401 bf52a78fa0cf2419 c000000000b2142c 
c000000ff901ee00
[   58.252373] GPR28: ffffffffffffffff 00000000015004c0 c000200e585e2401 
c000000ff901ee00
[   58.252879] NIP [c00000000038ae70] kmem_cache_alloc_node+0x2f0/0x350
[   58.252927] LR [c00000000038ae5c] kmem_cache_alloc_node+0x2dc/0x350
[   58.252974] Call Trace:
[   58.252996] [c000000fd98b7940] [c00000000038ad34] 
kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
[   58.253066] [c000000fd98b79b0] [c000000000b2142c] __alloc_skb+0x6c/0x220
[   58.253116] [c000000fd98b7a10] [c000000000b2332c] 
alloc_skb_with_frags+0x7c/0x2e0
[   58.253174] [c000000fd98b7aa0] [c000000000b16f8c] 
sock_alloc_send_pskb+0x29c/0x2c0
[   58.253233] [c000000fd98b7b50] [c000000000c492c4] 
unix_stream_sendmsg+0x264/0x5c0
[   58.253292] [c000000fd98b7c30] [c000000000b11424] sock_sendmsg+0x64/0x90
[   58.253342] [c000000fd98b7c60] [c000000000b11508] sock_write_iter+0xb8/0x120
[   58.253401] [c000000fd98b7d00] [c0000000003d0434] new_sync_write+0x104/0x160
[   58.253459] [c000000fd98b7d90] [c0000000003d3b78] vfs_write+0xd8/0x220
[   58.253509] [c000000fd98b7de0] [c0000000003d3e98] SyS_write+0x68/0x110
[   58.253560] [c000000fd98b7e30] [c00000000000b184] system_call+0x58/0x6c
[   58.253607] Instruction dump:
[   58.253637] 7c97ba78 fb210038 38a50001 7f19ba78 fb290000 f8aa0000 4bc8bfb1 
60000000
[   58.253698] 7fb8b840 419e0028 e93f0022 e91f0140 <7d59482a> 7d394a14 7d4a4278 
7fa95040
[   58.253760] ---[ end trace 21f1ccbedad3db06 ]---
[   58.360858] device-mapper: multipath: Reinstating path 65:240.
[   58.362107] sd 3:0:1:0: Power-on or device reset occurred
[   58.369695] sd 2:0:1:0: Power-on or device reset occurred
[   58.371943] sd 3:0:1:0: alua: port group 00 state A non-preferred supports 
tolusna
[   58.376534] sd 3:0:0:0: Power-on or device reset occurred
[   58.381190] sd 2:0:0:0: Power-on or device reset occurred
[   58.391738] sd 3:0:0:0: alua: port group 01 state N non-preferred supports 
tolusna
[   59.265054]

Attached boslcp3 host console logs.
Please let us know if this is a different issue that should be tracked via a separate bug.

Regards,
Indira

------- Comment From cdead...@us.ibm.com 2018-05-05 10:31 EDT-------
Yesterday, the decision was made at Padma's daily KVM meeting to only track 
System Firmware Mustfix issues using the LC GA1 Mustfix label since that is all 
that applies to the Supermicro team. The OS Kernel/KVM issues will be managed 
with a spreadsheet tracked by the KVM team and also in the internal slack 
channel. Removing the Mustfix label.

------- Comment From dougm...@us.ibm.com 2018-05-05 13:23 EDT-------
The boslcp6 logs look characteristic of the qla2xxx issue (panic in 
process_one_work()). Don't have detailed qla2xxx logging so can't determine SAN 
disposition.

------- Comment From dougm...@us.ibm.com 2018-05-07 12:10 EDT-------
Of the "boslcp" systems, only 3 appear to have QLogic adapters. Of those, one 
has been running without the extended error logging and so collected no data, 
and one has been down (or non-functional) for about 36 hours. Of the data 
collected, though, there is no evidence of any SAN instability since Friday - 
before starting the patched kernels. This means that we have no new data on 
whether the patches fix the problem.

------- Comment From dougm...@us.ibm.com 2018-05-08 12:09 EDT-------
It appears that there were some SAN incidents yesterday on boslcp3, approx. 
times were May  7 12:44:54 through 14:28:17. All were for one port, so not 
exactly the situation I think caused the panic. If we could correlate these SAN 
incidents with other activity on neighboring systems, that might help.

[207374.827928] = first incident
[213578.181860] = last incident
[287293.677076] Tue May  8 10:56:52 CDT 2018
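The bracketed numbers are seconds of uptime; with the dated marker as a reference point they convert to wall-clock time as in the following sketch (GNU date, whole seconds only, so results can differ from the quoted window by about a second):

```shell
# Convert dmesg uptime stamps to wall clock, using the reference
# marker [287293.677076] = Tue May  8 10:56:52 CDT 2018 above.
ref_epoch=$(TZ=America/Chicago date -d '2018-05-08 10:56:52' +%s)
for t in 207374 213578; do   # first and last incident, whole seconds
    TZ=America/Chicago date -d "@$((ref_epoch - 287293 + t))"
done
# yields approximately May 7 12:44:53 and May 7 14:28:17, matching the
# 12:44:54 through 14:28:17 window quoted above
```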

------- Comment From dougm...@us.ibm.com 2018-05-09 11:34 EDT-------
There was a period of SAN instability observed on boslcp1 this morning, at 
about May  9 05:01:28 to 05:51:56. This involved 2 ports simultaneously 
handling relogins. This was a Pegas kernel that should be susceptible to the 
panic, but no panic was seen. But since we don't know enough about the exact 
timing required to produce the panic, we can't say just what that means.

------- Comment From dougm...@us.ibm.com 2018-05-10 12:59 EDT-------
I have had some luck reproducing this on ltc-boston113 (previously unable to 
reproduce there). I altered the boot parameters to remove "quiet splash" and 
add "qla2xxx.logging=0x1e400000", and got the kworker panic during boot (it 
did not even reach the login prompt). I also hit this panic while booting the 
Pegas 1.1 installer, so it looks like Pegas is also affected. I am completing 
the Pegas install with qla2xxx blacklisted, and will characterize some more.
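The boot-parameter change described above can be sketched as follows, shown against a scratch copy rather than the live /etc/default/grub (the blacklist file name is illustrative, not from this thread):

```shell
# Sketch of the boot changes described above, applied to a scratch copy
# of /etc/default/grub so the edit can be reviewed before going live.
printf 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n' > /tmp/grub.sketch
sed -i 's/quiet splash/qla2xxx.logging=0x1e400000/' /tmp/grub.sketch
cat /tmp/grub.sketch
# After copying the edited file back into place: sudo update-grub
# Blacklisting the driver for the install (file name is illustrative):
#   echo 'blacklist qla2xxx' | sudo tee /etc/modprobe.d/blacklist-qla2xxx.conf
#   sudo update-initramfs -u
```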

------- Comment From dougm...@us.ibm.com 2018-05-10 14:13 EDT-------
Being able to reproduce this on ltc-boston113 seems to have been a temporary 
condition. I can no longer reproduce there, Pegas or Ubuntu. Without some idea 
of what external conditions are causing this, it will be very difficult to 
pursue.

------- Comment From dougm...@us.ibm.com 2018-05-11 12:12 EDT-------
Some information coming in on the SAN where this reproduces. It appears that 
there is some undesirable configuration, where fast switches are backed by 
slower switches between host and disks. The current theory is that other 
activity on the fabric causes bottle-necks in the slow switches and results in 
the temporary loss of login. Working on a way to reproduce this on-demand.

But if this is true, I think it is unlikely to be hit by customers. It
seems customers would not be mixing slow switches with fast ones,
especially in such a dysfunctional setup.

Still investigating, though, so nothing conclusive yet.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1762844

Title:
  ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp3: Host crashed & enters into
  xmon after moving to 4.15.0-15.16 kernel

Status in The Ubuntu-power-systems project:
  Incomplete
Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Bionic:
  Incomplete

Bug description:
  Problem Description:
  ===================
  Host crashed & enters into xmon after updating to the 4.15.0-15.16 kernel.

  Steps to re-create:
  ==================

  1. boslcp3 is up with BMC:118 & PNOR: 20180330 levels
  2. Installed boslcp3 with latest kernel 
      4.15.0-13-generic 
  3. Enabled "-proposed" kernel in /etc/apt/sources.list file
  4. Ran sudo apt-get update & apt-get upgrade

  5. root@boslcp3:~# ls /boot
  abi-4.15.0-13-generic         retpoline-4.15.0-13-generic
  abi-4.15.0-15-generic         retpoline-4.15.0-15-generic
  config-4.15.0-13-generic      System.map-4.15.0-13-generic
  config-4.15.0-15-generic      System.map-4.15.0-15-generic
  grub                          vmlinux
  initrd.img                    vmlinux-4.15.0-13-generic
  initrd.img-4.15.0-13-generic  vmlinux-4.15.0-15-generic
  initrd.img-4.15.0-15-generic  vmlinux.old
  initrd.img.old

  6. Rebooted & booted with 4.15.0-15 kernel
  7. Enabled xmon by editing file "vi /etc/default/grub" and ran update-grub
  8. Rebooted host.
  9. Booted with 4.15.0-15  & provided root/password credentials in login 
prompt 

  10. Host crashed & enters into XMON state with 'Unable to handle
  kernel paging request'

  root@boslcp3:~# [   66.295233] Unable to handle kernel paging request for 
data at address 0x8882f6ed90e9151a
  [   66.295297] Faulting instruction address: 0xc00000000038a110
  cpu 0x50: Vector: 380 (Data Access Out of Range) at [c00000000692f650]
      pc: c00000000038a110: kmem_cache_alloc_node+0x2f0/0x350
      lr: c00000000038a0fc: kmem_cache_alloc_node+0x2dc/0x350
      sp: c00000000692f8d0
     msr: 9000000000009033
     dar: 8882f6ed90e9151a
    current = 0xc00000000698fd00
    paca    = 0xc00000000fab7000   softe: 0        irq_happened: 0x01
      pid   = 1762, comm = systemd-journal
  Linux version 4.15.0-15-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 
(Ubuntu 7.3.0-14ubuntu1)) #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 (Ubuntu 
4.15.0-15.16-generic 4.15.15)
  enter ? for help
  [c00000000692f8d0] c000000000389fd4 kmem_cache_alloc_node+0x1b4/0x350 
(unreliable)
  [c00000000692f940] c000000000b2ec6c __alloc_skb+0x6c/0x220
  [c00000000692f9a0] c000000000b30b6c alloc_skb_with_frags+0x7c/0x2e0
  [c00000000692fa30] c000000000b247cc sock_alloc_send_pskb+0x29c/0x2c0
  [c00000000692fae0] c000000000c5705c unix_dgram_sendmsg+0x15c/0x8f0
  [c00000000692fbc0] c000000000b1ec64 sock_sendmsg+0x64/0x90
  [c00000000692fbf0] c000000000b20abc ___sys_sendmsg+0x31c/0x390
  [c00000000692fd90] c000000000b221ec __sys_sendmsg+0x5c/0xc0
  [c00000000692fe30] c00000000000b184 system_call+0x58/0x6c
  --- Exception: c00 (System Call) at 000074826f6fa9c4
  SP (7ffff5dc5510) is in userspace
  50:mon>
  50:mon>

  11. Attached Host console logs
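Steps 3, 7 and 8 above can be sketched as follows, run against scratch copies here rather than the live files (the -proposed line assumes the standard bionic-proposed pocket for Ubuntu ports architectures):

```shell
# Step 3: enable the -proposed pocket (scratch path here; normally a
# file under /etc/apt/sources.list.d/, followed by apt-get update).
echo 'deb http://ports.ubuntu.com/ubuntu-ports bionic-proposed main restricted universe multiverse' \
    > /tmp/bionic-proposed.list
# Steps 7-8: add xmon=on to the kernel command line, standing in for
# the same edit to /etc/default/grub, then refresh grub and reboot.
printf 'GRUB_CMDLINE_LINUX=""\n' > /tmp/grub.xmon
sed -i 's/^GRUB_CMDLINE_LINUX="/&xmon=on/' /tmp/grub.xmon
cat /tmp/grub.xmon
# on the live system: sudo update-grub && sudo reboot
```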

  I rebooted the host just to see if it would hit the issue again and
  this time I didn't even get to the login prompt but it crashed in the
  same location:

  50:mon> r
  R00 = c000000000389fd4   R16 = c000200e0b20fdc0
  R01 = c000200e0b20f8d0   R17 = 0000000000000048
  R02 = c0000000016eb400   R18 = 000000000001fe80
  R03 = 0000000000000001   R19 = 0000000000000000
  R04 = 0048ca1cff37803d   R20 = 0000000000000000
  R05 = 0000000000000688   R21 = 0000000000000000
  R06 = 0000000000000001   R22 = 0000000000000048
  R07 = 0000000000000687   R23 = 4882d6e3c8b7ab55
  R08 = 48ca1cff37802b68   R24 = c000200e5851df01
  R09 = 0000000000000000   R25 = 8882f6ed90e67454
  R10 = 0000000000000000   R26 = c000000000b2ec6c
  R11 = c000000000d10f78   R27 = c000000ff901ee00
  R12 = 0000000000002000   R28 = ffffffffffffffff
  R13 = c00000000fab7000   R29 = 00000000015004c0
  R14 = c000200e4c973fc8   R30 = c000200e5851df01
  R15 = c000200e4c974238   R31 = c000000ff901ee00
  pc  = c00000000038a110 kmem_cache_alloc_node+0x2f0/0x350
  cfar= c000000000016e1c arch_local_irq_restore+0x1c/0x90
  lr  = c00000000038a0fc kmem_cache_alloc_node+0x2dc/0x350
  msr = 9000000000009033   cr  = 28002844
  ctr = c00000000061e1b0   xer = 0000000000000000   trap =  380
  dar = 8882f6ed90e67454   dsisr = c000200e40bd8400
  50:mon> t
  [c000200e0b20f8d0] c000000000389fd4 kmem_cache_alloc_node+0x1b4/0x350 
(unreliable)
  [c000200e0b20f940] c000000000b2ec6c __alloc_skb+0x6c/0x220
  [c000200e0b20f9a0] c000000000b30b6c alloc_skb_with_frags+0x7c/0x2e0
  [c000200e0b20fa30] c000000000b247cc sock_alloc_send_pskb+0x29c/0x2c0
  [c000200e0b20fae0] c000000000c56ae4 unix_stream_sendmsg+0x264/0x5c0
  [c000200e0b20fbc0] c000000000b1ec64 sock_sendmsg+0x64/0x90
  [c000200e0b20fbf0] c000000000b20abc ___sys_sendmsg+0x31c/0x390
  [c000200e0b20fd90] c000000000b221ec __sys_sendmsg+0x5c/0xc0
  [c000200e0b20fe30] c00000000000b184 system_call+0x58/0x6c
  --- Exception: c01 (System Call) at 00007d16a993a940
  SP (7ffffbee2270) is in userspace

  Mirroring to Canonical to advise them that this might be a possible
  regression. Didn't see any obvious changes in this area in the
  changelog published at
  https://launchpad.net/ubuntu/+source/linux/4.15.0-15.16 but it would
  be good to have Canonical help reviewing the deltas as we try to
  isolate this further.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1762844/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
