Public bug reported:

== Comment: #0 - HARSHA THYAGARAJA - 2016-11-03 08:05:59 ==
---Problem Description---
kdump over nfs did not generate complete vmcore
 
---uname output---
Linux ltciofvtr-firestone1 4.8.0-26-generic #28-Ubuntu SMP Tue Oct 18 14:41:40 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
 
Machine Type = PowerNV (Baremetal) - Firestone 
 
---Steps to Reproduce---
 1. Set up NFS
 2. Trigger crash: echo c > /proc/sysrq-trigger


== Comment: #6 - Kevin W. Rudd - 2016-11-04 16:30:49 ==

Hi Harsha.

It looks like the base kdump NFS functionality works just fine.  The
known issue with makedumpfile is causing it to drop back to using "cp"
to transfer the entire, non-compressed /proc/vmcore image.  That's a
rather large amount of data to send over to the remote server, and it
appears to be sending back an I/O error after the first 122G.

Further debugging would be needed to determine whether this is a
client-side or server-side issue.  I recommend first bringing your remote
NFS server up to the current release, as it is currently a bit down-rev.

== Comment: #8 - HARSHA THYAGARAJA - 2016-11-10 02:02:31 ==

Hi Kevin,
I updated my peer to Ubuntu 16.10 and still saw the same failure.
A snippet of the log output is pasted below.

[   20.610748] kdump-tools[4559]: Starting kdump-tools:  * Mounting NFS mountpoint 150.1.1.20:/home/tools ...
[   53.400516] kdump-tools[4559]:  * Dumping to NFS mountpoint 150.1.1.20:/home/tools/201611100158
[   53.409242] kdump-tools[4559]:  * running makedumpfile -c -d 31 /proc/vmcore /mnt/var/crash/9.47.84.18-201611100158/dump-incomplete
[   53.526593] kdump-tools[4559]: get_mem_map: Can't distinguish the memory type.
[   53.527154] kdump-tools[4559]: The kernel version is not supported.
[   53.527488] kdump-tools[4559]: The makedumpfile operation may be incomplete.
[   53.527813] kdump-tools[4559]: makedumpfile Failed.
[   53.528117] kdump-tools[4559]:  * kdump-tools: makedumpfile failed, falling back to 'cp'
[   90.754092] kdump-tools[4559]: cp: error writing '/mnt/var/crash/9.47.84.18-201611100158/vmcore-incomplete': Input/output error
[   90.754857] kdump-tools[4559]:  * kdump-tools: failed to save vmcore in /mnt/var/crash/9.47.84.18-201611100158
[   90.756155] kdump-tools[4559]:  * running makedumpfile --dump-dmesg /proc/vmcore /mnt/var/crash/9.47.84.18-201611100158/dmesg.201611100158
[   90.758731] kdump-tools[4559]: get_mem_map: Can't distinguish the memory type.
[   90.759089] kdump-tools[4559]: The kernel version is not supported.
[   90.759436] kdump-tools[4559]: The makedumpfile operation may be incomplete.
[   90.759780] kdump-tools[4559]: makedumpfile Failed.
[   90.760094] kdump-tools[4559]:  * kdump-tools: makedumpfile --dump-dmesg failed. dmesg content will be unavailable
[   90.760668] kdump-tools[4559]:  * kdump-tools: failed to save dmesg content in /mnt/var/crash/9.47.84.18-201611100158
[   90.846117] kdump-tools[4559]: Thu, 10 Nov 2016 01:59:56 -0500
[   90.886629] kdump-tools[4559]: Failed to read reboot parameter file: No such file or directory
[   90.887070] kdump-tools[4559]: Rebooting.

== Comment: #13 - Kevin W. Rudd - 2016-11-11 17:12:33 ==

I was able to replicate this with debugging at both the kdump client and
remote NFS server.  The server was perfectly happy with the data coming
at it, and appeared to be processing a COMMIT request from the client
when the client shut down the connection.

Looking at the client-side logs after a failure showed that it was
logging "server ... not responding" messages, and bailed on the
connection within the span of just a few seconds.

This appears to be due to a very over-aggressive timeout being specified
in /usr/sbin/kdump-config:

mount -t nfs -o nolock -o tcp -o soft -o timeo=5 -o retrans=5 $NFS $KDUMP_COREDIR

The timeo value is in deciseconds (tenths of a second), and "5" is far
too aggressive for this type of connection.  From my observations, the
COMMIT was not issued until about 60G was transferred, and most remote
servers will take a lot longer than 5 tenths of a second to flush that
amount of data and respond to the COMMIT.

I'm not sure what problem specifying this timeo value was supposed to
address, but it would be better to leave the timeo value at its default
for a tcp connection (let the TCP protocol handle any communication
timeouts on its own).  When I modified kdump-config to use the default
timeo of 600 (60 seconds), the kdump process transferred the entire
vmcore without error.
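
For illustration only, here is a minimal shell sketch of the unit
conversion and of the mount options with timeo left at its default.
The helper names timeo_to_seconds and nfs_mount_opts are hypothetical
and are not part of kdump-config; the comma-separated option list is
assumed to be equivalent to the repeated -o flags in the original
command, and this is not necessarily the change that would be made to
the package.

```shell
#!/bin/sh
# Hypothetical helper: convert an NFS timeo value (deciseconds) to seconds.
# Uses integer arithmetic only: whole seconds, then the tenths digit.
timeo_to_seconds() {
    echo "$(( $1 / 10 )).$(( $1 % 10 ))"
}

# Hypothetical helper: build the NFS mount option string used for the dump.
# With no argument, timeo is omitted so the NFS client default applies.
nfs_mount_opts() {
    opts="nolock,tcp,soft,retrans=5"
    if [ -n "${1:-}" ]; then
        opts="$opts,timeo=$1"
    fi
    echo "$opts"
}

# Compare the value hard-coded in kdump-config with the TCP default.
echo "timeo=5   waits $(timeo_to_seconds 5) s before the first retry"
echo "timeo=600 waits $(timeo_to_seconds 600) s before the first retry"

# Print (do not run) the mount command with timeo left at its default.
echo "mount -t nfs -o $(nfs_mount_opts) \$NFS \$KDUMP_COREDIR"
```

The script only prints the command, so it can be run without root or a
reachable NFS server.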

** Affects: makedumpfile (Ubuntu)
     Importance: Undecided
     Assignee: Taco Screen team (taco-screen-team)
         Status: New


** Tags: architecture-ppc64le bugnameltc-148148 severity-high 
targetmilestone-inin1610

** Tags added: architecture-ppc64le bugnameltc-148148 severity-high
targetmilestone-inin1610

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1641235

Title:
  Ubuntu 16.10: kdump over nfs did not generate complete vmcore

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/makedumpfile/+bug/1641235/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
