Thanks everyone for trying to tackle this long-standing issue. fwiw,
here's my $0.02 on how we could proceed:

Someone should draft a special case page for makedumpfile:
https://wiki.ubuntu.com/StableReleaseUpdates#Documentation_for_Special_Cases
I'm happy to review/provide feedback, but I'd rather someone who would be 
carrying out the plan drive it.

As others have mentioned, testing is the hard part, and we need to
define what will be tested in the special case documentation. Since
makedumpfile is really just a filter, I don't think we need to (or
reasonably could) boot a bunch of systems in different configs and
generate crashdumps for every new update. Rather, I think we could build
a repository of representative, unfiltered, /proc/vmcore files that
focal's existing makedumpfile can parse. Then we can just check that all
of those files can still be parsed by the proposed makedumpfile. With
some scripting and a multi-architecture cloud, this could be automated.
In fact, if this vmcore repo were online, we could implement this as an
autopkgtest (w/ needs-internet set). But we should also do at least one
end-to-end kdump, just to make sure the kdump-tools->makedumpfile
interface hasn't been broken.
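
A minimal sketch of that repository check, assuming the samples sit in one
directory as *.vmcore files (a hypothetical layout) and that both the
existing and the proposed makedumpfile binaries are installed at known paths.
The function names are illustrative. The -c (compress) and -d 31 (dump level)
options are one common invocation, and the special case page may well settle
on something else:

```python
import subprocess
import tempfile
from pathlib import Path


def makedumpfile_ok(binary, vmcore):
    """Return True if `binary` filters and compresses `vmcore` without error."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "dump.out"
        # -c: compress the output, -d 31: exclude every excludable page type.
        result = subprocess.run(
            [binary, "-c", "-d", "31", str(vmcore), str(out)],
            capture_output=True,
        )
        return result.returncode == 0


def find_regressions(vmcore_dir, baseline, proposed, check=makedumpfile_ok):
    """List samples that the baseline binary parses but the proposed one does not."""
    regressions = []
    for vmcore in sorted(Path(vmcore_dir).glob("*.vmcore")):
        # Only count a failure as a regression if the baseline handled the file.
        if check(baseline, vmcore) and not check(proposed, vmcore):
            regressions.append(vmcore.name)
    return regressions
```

An empty list from find_regressions() would mean the proposed build still
parses everything the existing one does. Each sample would have to be checked
on a host of the matching architecture.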

What is a representative sample? One of each of the current LTS and HWE
kernels on amd64, arm64, ppc64el and s390x seems like an obvious start
(or the subset of those that actually work today). I don't think the
machine type is as important, VMs should be fine IMO. If we know of
examples where different machines expose structures differently in a way
that makedumpfile cares about, then perhaps add those as well. Once a
new makedumpfile lands that adds support for a new HWE kernel, we should
probably then update the repo w/ vmcore samples from that kernel, so we
can make sure the next update doesn't regress that support (probably
convenient to do when verifying the SRU, since I imagine we'd be testing
that it works w/ the new HWE kernel then anyway).
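
To keep track of which samples the repo still lacks, something like the
following could work. It is only a sketch: the "<arch>-<flavour>.vmcore"
naming scheme and the "lts"/"hwe" labels are made up for illustration, and
the architecture list is simply the one proposed above.

```python
from itertools import product

ARCHES = ("amd64", "arm64", "ppc64el", "s390x")
KERNEL_FLAVOURS = ("lts", "hwe")  # hypothetical labels for the two kernel series


def expected_samples(arches=ARCHES, flavours=KERNEL_FLAVOURS):
    """Return the file names a complete sample set would contain."""
    return [f"{arch}-{flavour}.vmcore" for arch, flavour in product(arches, flavours)]


def missing_samples(present, arches=ARCHES, flavours=KERNEL_FLAVOURS):
    """Return the expected sample names that are absent from `present`."""
    have = set(present)
    return [name for name in expected_samples(arches, flavours) if name not in have]
```

When a new HWE kernel gains support, adding a flavour label for it would make
missing_samples() flag the vmcores that still need to be collected.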

It'd be good to note in the special case request that kdump-tools does fall
back to a raw /proc/vmcore file cp if makedumpfile fails, which can mitigate
regressions for a subset of users: those with the necessary disk space and
no time constraints.
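
For anyone unfamiliar with that fallback, the logic amounts to roughly the
following. This is an illustrative sketch, not kdump-tools' actual code
(which is shell), and the function names and makedumpfile options are
assumptions:

```python
import shutil
import subprocess


def run_makedumpfile(src, dest):
    """Return True if makedumpfile wrote a filtered, compressed dump to `dest`."""
    result = subprocess.run(["makedumpfile", "-c", "-d", "31", src, dest])
    return result.returncode == 0


def save_vmcore(src, dest, filter_dump=run_makedumpfile, raw_copy=shutil.copyfile):
    """Try the filtered dump first, then fall back to a raw, unfiltered copy."""
    if filter_dump(src, dest):
        return "filtered"
    # The raw copy is as large as the crashed system's memory, so it only
    # helps users who have the disk space and the time for it.
    raw_copy(src, dest)
    return "raw copy"
```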

While I agree that crash falls into the same category, I don't think it
necessarily needs to happen at the same time. Obviously users running
focal need to dump their vmcore using focal, but for developers
debugging a crash, I don't think it is too onerous to use a newer version
of Ubuntu. Again, I'm not saying we *shouldn't* add crash to the special
case, it just seems like a makedumpfile exception is significantly more
important.

Finally, I don't think we need to commit to a frequency of backports, or
a point at which they will stop. Rather we can just stick to agreeing on
how it *can* be done when someone has the time/interest in doing it.
Guillherme's LTS->LTS+1 scheme sounds like a reasonable pattern to shoot
for, but if that doesn't happen every time, we're still improving the
situation over the status quo.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to makedumpfile in Ubuntu.
https://bugs.launchpad.net/bugs/1970672

Title:
  makedumpfile falls back to cp with "__vtop4_x86_64: Can't get a valid
  pmd_pte."

Status in makedumpfile package in Ubuntu:
  New

Bug description:
  [Impact] 
   * On Focal with an HWE (>=5.12) kernel, makedumpfile can sometimes fail
  with "__vtop4_x86_64: Can't get a valid pmd_pte."

   * makedumpfile falls back to cp for the dump, resulting in extremely
  large vmcores. This can impact both collection and analysis due to
  lack of space for the resulting vmcore.

   * This is fixed by an upstream commit present in versions 1.7.0 and 1.7.1:
  
https://github.com/makedumpfile/makedumpfile/commit/646456862df8926ba10dd7330abf3bf0f887e1b6

  commit 646456862df8926ba10dd7330abf3bf0f887e1b6
  Author: Kazuhito Hagio <k-hagio...@nec.com>
  Date:   Wed May 26 14:31:26 2021 +0900

      [PATCH] Increase SECTION_MAP_LAST_BIT to 5
      
      * Required for kernel 5.12
      
      Kernel commit 1f90a3477df3 ("mm: teach pfn_to_online_page() about
      ZONE_DEVICE section collisions") added a section flag
      (SECTION_TAINT_ZONE_DEVICE) and causes makedumpfile an error on
      some machines like this:
      
        __vtop4_x86_64: Can't get a valid pmd_pte.
        readmem: Can't convert a virtual address(ffffe2bdc2000000) to physical address.
        readmem: type_addr: 0, addr:ffffe2bdc2000000, size:32768
        __exclude_unnecessary_pages: Can't read the buffer of struct page.
        create_2nd_bitmap: Can't exclude unnecessary pages.
      
      Increase SECTION_MAP_LAST_BIT to 5 to fix this.  The bit had not
      been used until the change, so we can just increase the value.
      
      Signed-off-by: Kazuhito Hagio <k-hagio...@nec.com>

  [Test Plan]
   * Confirm that makedumpfile works as expected by triggering a kdump.

   * Confirm that the patched makedumpfile works as expected on a system
  known to experience the issue.

   * Confirm that the patched makedumpfile is able to work with a cp-
  generated known affected vmcore to compress it. The unpatched version
  fails.

  [Where problems could occur]

   * This change could adversely affect the collection/compression of
  vmcores during a kdump situation resulting in fallback to cp.
