This is the debdiff with the retry/delay mechanism, for Eoan. I've
discussed this with Cascardo and we agreed that he will do the SRU to
the older releases (X/B/C/D) after applying some other SRUs he's working
on now.

I'd especially like to thank Hari, Murilo and Pavithra from IBM, who
reported this issue, worked on it and proposed a solution!
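
For anyone who wants the idea without opening the debdiff, here is a
minimal sketch of a retry/delay wrapper. It is only an illustration and
not the code in the attached debdiff: the function name
retry_network_cmd and the RETRIES variable (with its default of 5) are
hypothetical, and DELAY=3 simply mirrors the 3 second delay mentioned in
the regression potential text.

```shell
#!/bin/sh
# Illustrative sketch only; NOT the code from lp1681909_eoan.debdiff.
# RETRIES, DELAY and retry_network_cmd are hypothetical names.
RETRIES=${RETRIES:-5}   # assumed number of attempts
DELAY=${DELAY:-3}       # seconds to sleep between attempts

# Run the given command until it succeeds or RETRIES attempts were made.
# Returns 0 on the first success, 1 if every attempt failed.
retry_network_cmd() {
    attempt=1
    while [ "$attempt" -le "$RETRIES" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt/$RETRIES failed: $*" >&2
        # No point in sleeping after the last attempt.
        if [ "$attempt" -lt "$RETRIES" ]; then
            sleep "$DELAY"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Possible use in the SSH dump path (variables as in kdump-config):
#   retry_network_cmd ssh -i "$KDUMP_SSH_KEY" "$KDUMP_REMOTE_HOST" \
#       mkdir -p "$KDUMP_STAMPDIR" || exit 1
```

With these assumed defaults, the worst case adds (RETRIES - 1) * DELAY =
12 seconds of sleeping before giving up, plus whatever time each failed
attempt itself takes.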

** Description changed:

- == Comment: #0 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07 05:00:29 ==
- ---Problem Description---
+ [Impact]
  
- Ubuntu 17.04: dump is not captured in remote host when kdump over ssh is
- configured on firestone.
+ * Kdump over the network (an NFS mount or an SSH dump, for example)
+ relies on systemd's network-online target. Even so, some NICs report a
+ "Link Up" state before they are ready to transmit packets. This bad
+ behavior is probably caused by NIC firmware delays and is usually not
+ fixable in the drivers. Adapters known to behave this way include
+ bnx2x, tg3 and ixgbe.
  
- ---Steps to Reproduce---
+ * Kdump may be the last resort for debugging complex or
+ hard-to-reproduce issues, so it is worth increasing its reliability
+ and resilience. We therefore propose a workaround for network dumps: a
+ retry/delay mechanism. For a network dump, kdump retries a few times
+ and sleeps between the attempts, to cover NICs that aren't ready yet
+ but will soon be able to transmit packets.
  
- 1. Configure kdump.
- 2. Check whether kdump is operational using "# kdump-config show".
- 3. Install "kernel-debuginfo" and "kernel-debuginfo-common" rpms.
- 4. Setup password less ssh connection, generate rsa key.
- # ssh-keygen -t rsa
- 5. verify id_rsa and id_rsa.pub are created under /root/.ssh/
- 6. Edit /etc/default/kdump-tools and add below entries.
- SSH="ubuntu@9.114.15.239"
- SSH_KEY=/root/.ssh/id_rsa
- 7. Propagate RSA key.
- # kdump-config propagate
- 8. Restart kdump service.
- # kdump-config load
- 9. Trigger Crash using below commands.
- # echo "1" > /proc/sys/kernel/sysrq
- # echo "c" > /proc/sysrq-trigger
- 10. Verify dump is available in remote server in configured path.
+ * Although first reported by IBM on the PowerPC architecture, this
+ issue depends on the NIC, not on the architecture, and it was later
+ reported on x86 too.
  
- Machine details
- ===========
+ [Test case]
  
- $ ipmitool -I lanplus -H  9.47.70.3 -U ADMIN -P admin sol activate
+ It's usually difficult to reproduce this issue naturally in a
+ deterministic way, but there is an artificial test case in comment #24
+ of this LP bug.
+ Also, one user in this bug managed to reproduce the problem
+ consistently, and it was fixed when they tested our solution.
  
- $ ssh ubuntu@9.47.70.29
+ [Regression potential]
  
- PW: shriya101
- 
- 
- Attaching logs
- 
- == Comment: #1 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07
- 05:01:42 ==
- 
- 
- == Comment: #5 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07 23:19:46 ==
- Hi, 
- 
- Attaching the logs.
- 
- Network info:
- 
- root@ltc-firep3:~# hwinfo --network
- 36: None 00.0: 10700 Loopback                                   
-   [Created at net.126]
-   Unique ID: ZsBS.GQNx7L4uPNA
-   SysFS ID: /class/net/lo
-   Hardware Class: network interface
-   Model: "Loopback network interface"
-   Device File: lo
-   Link detected: yes
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
- 
- 37: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: 2lHw.ndpeucax6V1
-   Parent ID: mIXc.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f2
-   SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.2
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f2
-   HW Address: 98:be:94:03:18:4a
-   Permanent HW Address: 98:be:94:03:18:4a
-   Link detected: no
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #15 (Ethernet controller)
- 
- 38: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: 7Onn.ndpeucax6V1
-   Parent ID: sx0U.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f0
-   SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.0
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f0
-   HW Address: 98:be:94:03:18:48
-   Permanent HW Address: 98:be:94:03:18:48
-   Link detected: yes
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #16 (Ethernet controller)
- 
- 39: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: VwX_.ndpeucax6V1
-   Parent ID: DUng.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f3
-   SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.3
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f3
-   HW Address: 98:be:94:03:18:4b
-   Permanent HW Address: 98:be:94:03:18:4b
-   Link detected: no
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #25 (Ethernet controller)
- 
- 40: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: bZ1s.ndpeucax6V1
-   Parent ID: J7HY.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f1
-   SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.1
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f1
-   HW Address: 98:be:94:03:18:49
-   Permanent HW Address: 98:be:94:03:18:49
-   Link detected: no
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #4 (Ethernet controller)
- root@ltc-firep3:~# 
- 
- 
- Thanks,
- Pavithra
- 
- == Comment: #6 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07
- 23:20:47 ==
- 
- 
- == Comment: #7 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07 23:21:27 ==
- 
- 
- == Comment: #8 - Urvashi Jawere <urjaw...@in.ibm.com> - 2017-03-08 02:48:15 ==
- I am able to see some errors in syslog ;
- 
- auxiliary
- Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 114.15.239:/home/ubuntu/test IN SOA: failed-auxiliary
- Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 9.114.15.239:/home/ubuntu/test IN DS: failed-auxiliary
- Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 9.114.15.239:/home/ubuntu/test IN SOA: failed-auxiliary
- Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 9.114.15.239:/home/ubuntu/test IN A: failed-auxiliary
- Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: Server 9.12.16.2 does not support DNSSEC, downgrading to non-DNSSEC mode.
- Mar  7 04:57:44 ltc-firep3 kdump-config: /root/.ssh/id_rsa failed to be sent to ubuntu@9.114.15.239:/home/ubuntu/test
- Mar  7 04:58:04 ltc-firep3 systemd[1]: Reloading.
- Mar  7 04:59:15 ltc-firep3 systemd[1]: Reloading.
- Mar  7 04:59:16 ltc-firep3 kdump-config: propagated ssh key /root/.ssh/id_rsa to server ubuntu@9.114.15.239
- .
- .
- .
- 
- Mar  7 05:06:55 ltc-firep3 systemd[1]: Started Accounts Service.
- Mar  7 05:06:56 ltc-firep3 kdump-tools[3498]: Starting kdump-tools: Modified cmdline:root=UUID=1e76cfd5-988c-46f4-bdc4-39fe1ed01152 ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0 elfcorehdr=155136K
- Mar  7 05:06:57 ltc-firep3 kdump-tools[3498]:  * loaded kdump kernel
- Mar  7 05:06:57 ltc-firep3 kdump-tools: /sbin/kexec -p --command-line="root=UUID=1e76cfd5-988c-46f4-bdc4-39fe1ed01152 ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
- Mar  7 05:06:57 ltc-firep3 kdump-tools: loaded kdump kernel
- Mar  7 05:06:57 ltc-firep3 systemd[1]: Started Kernel crash dump capture service.
- Mar  7 05:06:57 ltc-firep3 apport[3584]: ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/linux-image-4.10.0-9-generic-201703060521.crash'
- Mar  7 05:06:57 ltc-firep3 apport[3584]:    ...done.
- 
- == Comment: #18 - Hari Krishna Bathini <hbath...@in.ibm.com> - 2017-03-28 06:55:20 ==
- Looks like tg3 module was not needed after all. Interesting thing though is
- even after enP34p1s0f0 is up (ifup) and network.online target is reached,
- network was not really active. It took about 30 seconds, after reaching 
- network.online target, for the network to be active, even on a normal boot.
- Adding this wait time in kdump script, before saving dump, ensured that
- vmcore is captured successful. Attaching the log for the same..
- 
- Not sure why enP34p1s0f0 is taking that long to configure/initialize. Even so,
- this delay should be part of ifup/network-online.target if it is inevitable,
- so that network is pingable after network-online.target
-  
- Thanks
- Hari
- 
- == Comment: #19 - Hari Krishna Bathini <hbath...@in.ibm.com> - 2017-03-28 07:01:52 ==
- The workaround snippet adding delay in kdump script:
- 
- 
- --- kdump-config.orig 2017-03-28 03:35:17.753542107 -0500
- +++ kdump-config      2017-03-28 06:59:22.887576623 -0500
- @@ -761,6 +761,7 @@
-       KDUMP_DMESGFILE="$KDUMP_STAMPDIR/dmesg.$KDUMP_STAMP"
-       ERROR=0
-  
- +     sleep 30
-       ssh -i $KDUMP_SSH_KEY $KDUMP_REMOTE_HOST mkdir -p $KDUMP_STAMPDIR
-       ERROR=$?
-       # If remote connections fails, no need to continue
- 
- ---
- 
- Thanks
- Hari
- 
- == Comment: #20 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-30 01:33:56 ==
- (In reply to comment #19)
- > The workaround snippet adding delay in kdump script:
- > 
- > 
- > --- kdump-config.orig       2017-03-28 03:35:17.753542107 -0500
- > +++ kdump-config    2017-03-28 06:59:22.887576623 -0500
- > @@ -761,6 +761,7 @@
- >     KDUMP_DMESGFILE="$KDUMP_STAMPDIR/dmesg.$KDUMP_STAMP"
- >     ERROR=0
- >  
- > +   sleep 30
- >     ssh -i $KDUMP_SSH_KEY $KDUMP_REMOTE_HOST mkdir -p $KDUMP_STAMPDIR
- >     ERROR=$?
- >     # If remote connections fails, no need to continue
- > 
- > ---
- > 
- > Thanks
- > Hari
- 
- With above workaround dump captured successfully in remote host.
- 
- Thanks,
- Pavithra
- 
- == Comment: #22 - Hari Krishna Bathini <hbath...@in.ibm.com> - 2017-04-10 22:14:27 ==
- (In reply to comment #18)
- > Created attachment 117088 [details]
- > Console log of successful dump capture after adding a time delay of 'sleep
- > 30'
- > 
- > Looks like tg3 module was not needed after all. Interesting thing though is
- > even after enP34p1s0f0 is up (ifup) and network.online target is reached,
- > network was not really active. It took about 30 seconds, after reaching 
- > network.online target, for the network to be active, even on a normal boot.
- > Adding this wait time in kdump script, before saving dump, ensured that
- > vmcore is captured successful. Attaching the log for the same..
- > 
- > Not sure why enP34p1s0f0 is taking that long to configure/initialize. Even
- > so,
- > this delay should be part of ifup/network-online.target if it is inevitable,
- > so that network is pingable after network-online.target
- 
- Hi Canonical,
- 
- Since this falls outside the realm of kdump, should we add a NET_WAIT_TIME field
- in /etc/default/kdump-tools file that defaults to 0 but can be changed when the
- user sees timing troubles?
- 
- Thanks
- Hari
+ There's no clear regression potential here, since this is just a
+ retry/delay mechanism. Problems could come from coding mistakes in the
+ script.
+ The delay between attempts is only 3 seconds per iteration, so it
+ shouldn't block kdump progress for long at any one time.

** Patch added: "lp1681909_eoan.debdiff"
   https://bugs.launchpad.net/ubuntu-power-systems/+bug/1681909/+attachment/5275117/+files/lp1681909_eoan.debdiff

** Changed in: makedumpfile (Ubuntu Xenial)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Bionic)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Cosmic)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Disco)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Eoan)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Disco)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo (cascardo)

** Changed in: makedumpfile (Ubuntu Cosmic)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo (cascardo)

** Changed in: makedumpfile (Ubuntu Bionic)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo (cascardo)

** Changed in: makedumpfile (Ubuntu Xenial)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo (cascardo)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to makedumpfile in Ubuntu.
https://bugs.launchpad.net/bugs/1681909

Title:
  kdump is not captured in remote host when kdump over ssh is configured

Status in The Ubuntu-power-systems project:
  Confirmed
Status in makedumpfile package in Ubuntu:
  In Progress
Status in makedumpfile source package in Xenial:
  In Progress
Status in makedumpfile source package in Bionic:
  In Progress
Status in makedumpfile source package in Cosmic:
  In Progress
Status in makedumpfile source package in Disco:
  In Progress
Status in makedumpfile source package in Eoan:
  In Progress

Bug description:
  [Impact]

  * Kdump over the network (an NFS mount or an SSH dump, for example)
  relies on systemd's network-online target. Even so, some NICs report a
  "Link Up" state before they are ready to transmit packets. This bad
  behavior is probably caused by NIC firmware delays and is usually not
  fixable in the drivers. Adapters known to behave this way include
  bnx2x, tg3 and ixgbe.

  * Kdump may be the last resort for debugging complex or
  hard-to-reproduce issues, so it is worth increasing its reliability
  and resilience. We therefore propose a workaround for network dumps: a
  retry/delay mechanism. For a network dump, kdump retries a few times
  and sleeps between the attempts, to cover NICs that aren't ready yet
  but will soon be able to transmit packets.

  * Although first reported by IBM on the PowerPC architecture, this
  issue depends on the NIC, not on the architecture, and it was later
  reported on x86 too.

  [Test case]

  It's usually difficult to reproduce this issue naturally in a
  deterministic way, but there is an artificial test case in comment #24
  of this LP bug.
  Also, one user in this bug managed to reproduce the problem
  consistently, and it was fixed when they tested our solution.

  [Regression potential]

  There's no clear regression potential here, since this is just a
  retry/delay mechanism. Problems could come from coding mistakes in the
  script.
  The delay between attempts is only 3 seconds per iteration, so it
  shouldn't block kdump progress for long at any one time.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1681909/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
