This is the debdiff with the retry/delay mechanism, for Eoan. I've discussed it with Cascardo and we agreed he will do the SRU to the older releases (X/B/C/D) after applying some other SRUs he's working on now.
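For readers without the debdiff at hand, the retry/delay idea can be sketched roughly as below. This is only an illustration under assumptions, not the code in the debdiff: the function name retry_with_delay and the RETRIES default are hypothetical; only the 3-second delay between attempts comes from the bug description.

```shell
#!/bin/sh
# Hypothetical sketch of a retry/delay wrapper for network dumps.
# RETRIES is an assumed value; DELAY=3 matches the "3 sec per iteration"
# mentioned in the regression notes.
RETRIES=${RETRIES:-5}
DELAY=${DELAY:-3}

# Run the given command until it succeeds or RETRIES attempts are used up.
# Returns 0 on the first success, 1 if every attempt failed.
retry_with_delay() {
    attempt=1
    while [ "$attempt" -le "$RETRIES" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt/$RETRIES failed: $*" >&2
        # Sleep only if another attempt will follow.
        if [ "$attempt" -lt "$RETRIES" ]; then
            sleep "$DELAY"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Possible use with the ssh step from Hari's workaround snippet:
# retry_with_delay ssh -i "$KDUMP_SSH_KEY" "$KDUMP_REMOTE_HOST" mkdir -p "$KDUMP_STAMPDIR"
```

Compared with a fixed "sleep 30", a bounded retry loop costs nothing when the NIC is already usable and gives up after a known maximum time when it is not.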
I'd like to especially thank Hari, Murilo and Pavithra from IBM, who reported, worked on and proposed a solution for this issue!

** Description changed:

- == Comment: #0 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07 05:00:29 ==
- ---Problem Description---
+ [Impact]

- Ubuntu 17.04: dump is not captured in remote host when kdump over ssh is
- configured on firestone.
+ * Kdump over network (like NFS mount or SSH dump) relies on network-
+ online target from systemd. Even so, there are some NICs that report
+ "Link Up" state but aren't ready to transmit packets. This is a
+ generally bad behavior that is credited probably to NIC firmware delays,
+ usually not fixable from drivers. Some adapters known to act like this
+ are bnx2x, tg3 and ixgbe.

- ---Steps to Reproduce---
+ * Kdump is a mechanism that may be a last resort to debug complex/hard
+ to reproduce issues, so it's interesting to increase its reliability /
+ resilience. We then propose here a solution/quirk to this issue on
+ network dump by adding a retry/delay mechanism; if it's a network dump,
+ kdump will retry some times and sleep between the attempts in order to
+ exclude the case of NICs that aren't ready yet but will soon be able to
+ transmit packets.

- 1. Configure kdump.
- 2. Check whether kdump is operational using '# kdump-config show'.
- 3. Install 'kernel-debuginfo' and 'kernel-debuginfo-common' rpms.
- 4. Setup password less ssh connection, generate rsa key.
- # ssh-keygen -t rsa
- 5. verify id_rsa and id_rsa.pub are created under /root/.ssh/
- 6. Edit /etc/default/kdump-tools and add below entries.
- SSH="ubuntu@9.114.15.239"
- SSH_KEY=/root/.ssh/id_rsa
- 7. Propagate RSA key.
- # kdump-config propagate
- 8. Restart kdump service.
- # kdump-config load
- 9. Trigger Crash using below commands.
- # echo "1" > /proc/sys/kernel/sysrq
- # echo "c" > /proc/sysrq-trigger
- 10. Verify dump is available in remote server in configured path.
+ * Although first reported by IBM in PowerPC arch, the scope for this
+ issue is the NIC, and it was later reported in x86 arch too.

- Machine details
- ===========
+ [Test case]

- $ ipmitool -I lanplus -H 9.47.70.3 -U ADMIN -P admin sol activate
+ Usually it's difficult to naturally reproduce this issue in a deterministic way, but we have an artificial test case on comment #24 of this LP.
+ Also, we have a report from this bug in which the user managed to reproduce the problem consistently - it's fixed after testing our solution.

- $ ssh ubuntu@9.47.70.29
+ [Regression potential]

- PW: shriya101
-
-
- Attaching logs
-
- == Comment: #1 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07
- 05:01:42 ==
-
-
- == Comment: #5 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07 23:19:46 ==
- Hi,
-
- Attaching the logs.
-
- Network info:
-
- root@ltc-firep3:~# hwinfo --network
- 36: None 00.0: 10700 Loopback
-   [Created at net.126]
-   Unique ID: ZsBS.GQNx7L4uPNA
-   SysFS ID: /class/net/lo
-   Hardware Class: network interface
-   Model: "Loopback network interface"
-   Device File: lo
-   Link detected: yes
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-
- 37: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: 2lHw.ndpeucax6V1
-   Parent ID: mIXc.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f2
-   SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.2
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f2
-   HW Address: 98:be:94:03:18:4a
-   Permanent HW Address: 98:be:94:03:18:4a
-   Link detected: no
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #15 (Ethernet controller)
-
- 38: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: 7Onn.ndpeucax6V1
-   Parent ID: sx0U.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f0
-   SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.0
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f0
-   HW Address: 98:be:94:03:18:48
-   Permanent HW Address: 98:be:94:03:18:48
-   Link detected: yes
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #16 (Ethernet controller)
-
- 39: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: VwX_.ndpeucax6V1
-   Parent ID: DUng.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f3
-   SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.3
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f3
-   HW Address: 98:be:94:03:18:4b
-   Permanent HW Address: 98:be:94:03:18:4b
-   Link detected: no
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #25 (Ethernet controller)
-
- 40: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: bZ1s.ndpeucax6V1
-   Parent ID: J7HY.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f1
-   SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.1
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f1
-   HW Address: 98:be:94:03:18:49
-   Permanent HW Address: 98:be:94:03:18:49
-   Link detected: no
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #4 (Ethernet controller)
- root@ltc-firep3:~#
-
-
- Thanks,
- Pavithra
-
- == Comment: #6 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07
- 23:20:47 ==
-
-
- == Comment: #7 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07 23:21:27 ==
-
-
- == Comment: #8 - Urvashi Jawere <urjaw...@in.ibm.com> - 2017-03-08 02:48:15 ==
- I am able to see some errors in syslog ;
-
- auxiliary
- Mar 7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 114.15.239:/home/ubuntu/test IN SOA: failed-auxiliary
- Mar 7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 9.114.15.239:/home/ubuntu/test IN DS: failed-auxiliary
- Mar 7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 9.114.15.239:/home/ubuntu/test IN SOA: failed-auxiliary
- Mar 7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 9.114.15.239:/home/ubuntu/test IN A: failed-auxiliary
- Mar 7 04:57:44 ltc-firep3 systemd-resolved[3486]: Server 9.12.16.2 does not support DNSSEC, downgrading to non-DNSSEC mode.
- Mar 7 04:57:44 ltc-firep3 kdump-config: /root/.ssh/id_rsa failed to be sent to ubuntu@9.114.15.239:/home/ubuntu/test
- Mar 7 04:58:04 ltc-firep3 systemd[1]: Reloading.
- Mar 7 04:59:15 ltc-firep3 systemd[1]: Reloading.
- Mar 7 04:59:16 ltc-firep3 kdump-config: propagated ssh key /root/.ssh/id_rsa to server ubuntu@9.114.15.239
- .
- .
- .
-
- Mar 7 05:06:55 ltc-firep3 systemd[1]: Started Accounts Service.
- Mar 7 05:06:56 ltc-firep3 kdump-tools[3498]: Starting kdump-tools: Modified cmdline:root=UUID=1e76cfd5-988c-46f4-bdc4-39fe1ed01152 ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0 elfcorehdr=155136K
- Mar 7 05:06:57 ltc-firep3 kdump-tools[3498]: * loaded kdump kernel
- Mar 7 05:06:57 ltc-firep3 kdump-tools: /sbin/kexec -p --command-line="root=UUID=1e76cfd5-988c-46f4-bdc4-39fe1ed01152 ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
- Mar 7 05:06:57 ltc-firep3 kdump-tools: loaded kdump kernel
- Mar 7 05:06:57 ltc-firep3 systemd[1]: Started Kernel crash dump capture service.
- Mar 7 05:06:57 ltc-firep3 apport[3584]: ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/linux-image-4.10.0-9-generic-201703060521.crash'
- Mar 7 05:06:57 ltc-firep3 apport[3584]: ...done.
-
- == Comment: #18 - Hari Krishna Bathini <hbath...@in.ibm.com> - 2017-03-28 06:55:20 ==
- Looks like tg3 module was not needed after all. Interesting thing though is
- even after enP34p1s0f0 is up (ifup) and network.online target is reached,
- network was not really active. It took about 30 seconds, after reaching
- network.online target, for the network to be active, even on a normal boot.
- Adding this wait time in kdump script, before saving dump, ensured that
- vmcore is captured successful. Attaching the log for the same..
-
- Not sure why enP34p1s0f0 is taking that long to configure/initialize. Even so,
- this delay should be part of ifup/network-online.target if it is inevitable,
- so that network is pingable after network-online.target
-
- Thanks
- Hari
-
- == Comment: #19 - Hari Krishna Bathini <hbath...@in.ibm.com> - 2017-03-28 07:01:52 ==
- The workaround snippet adding delay in kdump script:
-
-
- --- kdump-config.orig 2017-03-28 03:35:17.753542107 -0500
- +++ kdump-config 2017-03-28 06:59:22.887576623 -0500
- @@ -761,6 +761,7 @@
-     KDUMP_DMESGFILE="$KDUMP_STAMPDIR/dmesg.$KDUMP_STAMP"
-     ERROR=0
-
- +   sleep 30
-     ssh -i $KDUMP_SSH_KEY $KDUMP_REMOTE_HOST mkdir -p $KDUMP_STAMPDIR
-     ERROR=$?
-     # If remote connections fails, no need to continue
-
- ---
-
- Thanks
- Hari
-
- == Comment: #20 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-30 01:33:56 ==
- (In reply to comment #19)
- > The workaround snippet adding delay in kdump script:
- >
- >
- > --- kdump-config.orig 2017-03-28 03:35:17.753542107 -0500
- > +++ kdump-config 2017-03-28 06:59:22.887576623 -0500
- > @@ -761,6 +761,7 @@
- >     KDUMP_DMESGFILE="$KDUMP_STAMPDIR/dmesg.$KDUMP_STAMP"
- >     ERROR=0
- >
- > +   sleep 30
- >     ssh -i $KDUMP_SSH_KEY $KDUMP_REMOTE_HOST mkdir -p $KDUMP_STAMPDIR
- >     ERROR=$?
- >     # If remote connections fails, no need to continue
- >
- > ---
- >
- > Thanks
- > Hari
-
- With above workaround dump captured successfully in remote host.
-
- Thanks,
- Pavithra
-
- == Comment: #22 - Hari Krishna Bathini <hbath...@in.ibm.com> - 2017-04-10 22:14:27 ==
- (In reply to comment #18)
- > Created attachment 117088 [details]
- > Console log of successful dump capture after adding a time delay of 'sleep
- > 30'
- >
- > Looks like tg3 module was not needed after all. Interesting thing though is
- > even after enP34p1s0f0 is up (ifup) and network.online target is reached,
- > network was not really active. It took about 30 seconds, after reaching
- > network.online target, for the network to be active, even on a normal boot.
- > Adding this wait time in kdump script, before saving dump, ensured that
- > vmcore is captured successful. Attaching the log for the same..
- >
- > Not sure why enP34p1s0f0 is taking that long to configure/initialize. Even
- > so,
- > this delay should be part of ifup/network-online.target if it is inevitable,
- > so that network is pingable after network-online.target
-
- Hi Canonical,
-
- Since this falls outside the realm of kdump, should we add a NET_WAIT_TIME field
- in /etc/default/kdump-tools file that defaults to 0 but can be changed when the
- user sees timing troubles?
-
- Thanks
- Hari
+ There's not a clear regression potential here since it's just a retry/delay mechanism. Some potential problems may come from bad coding in the script.
+ The delay between attempts is only 3 sec per iteration, so it shouldn't block the kdump progress for a high amount of time at once.

** Patch added: "lp1681909_eoan.debdiff"
   https://bugs.launchpad.net/ubuntu-power-systems/+bug/1681909/+attachment/5275117/+files/lp1681909_eoan.debdiff

** Changed in: makedumpfile (Ubuntu Xenial)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Bionic)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Cosmic)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Disco)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Eoan)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Disco)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo (cascardo)

** Changed in: makedumpfile (Ubuntu Cosmic)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo (cascardo)

** Changed in: makedumpfile (Ubuntu Bionic)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo (cascardo)

** Changed in: makedumpfile (Ubuntu Xenial)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo (cascardo)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to makedumpfile in Ubuntu.
https://bugs.launchpad.net/bugs/1681909

Title:
  kdump is not captured in remote host when kdump over ssh is configured

Status in The Ubuntu-power-systems project:
  Confirmed
Status in makedumpfile package in Ubuntu:
  In Progress
Status in makedumpfile source package in Xenial:
  In Progress
Status in makedumpfile source package in Bionic:
  In Progress
Status in makedumpfile source package in Cosmic:
  In Progress
Status in makedumpfile source package in Disco:
  In Progress
Status in makedumpfile source package in Eoan:
  In Progress

Bug description:
  [Impact]

  * Kdump over network (like NFS mount or SSH dump) relies on
  network-online target from systemd. Even so, there are some NICs that
  report "Link Up" state but aren't ready to transmit packets. This is a
  generally bad behavior that is credited probably to NIC firmware
  delays, usually not fixable from drivers. Some adapters known to act
  like this are bnx2x, tg3 and ixgbe.

  * Kdump is a mechanism that may be a last resort to debug complex/hard
  to reproduce issues, so it's interesting to increase its reliability /
  resilience. We then propose here a solution/quirk to this issue on
  network dump by adding a retry/delay mechanism; if it's a network
  dump, kdump will retry some times and sleep between the attempts in
  order to exclude the case of NICs that aren't ready yet but will soon
  be able to transmit packets.

  * Although first reported by IBM in PowerPC arch, the scope for this
  issue is the NIC, and it was later reported in x86 arch too.

  [Test case]

  Usually it's difficult to naturally reproduce this issue in a deterministic way, but we have an artificial test case on comment #24 of this LP.
  Also, we have a report from this bug in which the user managed to reproduce the problem consistently - it's fixed after testing our solution.

  [Regression potential]

  There's not a clear regression potential here since it's just a retry/delay mechanism. Some potential problems may come from bad coding in the script.
  The delay between attempts is only 3 sec per iteration, so it shouldn't block the kdump progress for a high amount of time at once.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1681909/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp