This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel), please enter the following command in a terminal window:

apport-collect 1921211

and then change the status of the bug to 'Confirmed'. If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'. This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

** Changed in: linux (Ubuntu)
   Status: New => Incomplete

** Tags added: xenial

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1921211

Title:
  Taking a memory dump of user mode process on Xenial hosts causes bugcheck/kernel panic and core dump

Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Xenial:
  In Progress

Bug description:

[Impact]

We have some Ubuntu 16.04 hosts (in Hyper-V) being used for testing an Ubuntu 20.04 container. As part of the testing, we attempted to take a memory dump of a container running SQL Server on Ubuntu 20.04 on the Ubuntu 16.04 host, and we started seeing a kernel panic and core dump. It started happening after a specific Xenial kernel update on the host:

  4.4.0-204-generic - systems that are crashing
  4.4.0-201-generic - systems that are able to capture a dump

A note from the developer indicates the following logging shows up.

----

Now the following is output right after I attempt to start the dump.
(gdb, attach ###, generate-core-file /var/opt/mssql/log/rdorr.delme.core)

[Fri Mar 19 20:01:38 2021] systemd-journald[581]: Successfully sent stream file descriptor to service manager.
[Fri Mar 19 20:01:41 2021] cni0: port 9(vethdec5d2b7) entered forwarding state
[Fri Mar 19 20:02:42 2021] systemd-journald[581]: Successfully sent stream file descriptor to service manager.
[Fri Mar 19 20:03:04 2021] ------------[ cut here ]------------
[Fri Mar 19 20:03:04 2021] kernel BUG at /build/linux-qlAbvR/linux-4.4.0/mm/memory.c:3214!
[Fri Mar 19 20:03:04 2021] invalid opcode: 0000 [#1] SMP
[Fri Mar 19 20:03:04 2021] Modules linked in: veth vxlan ip6_udp_tunnel udp_tunnel xt_statistic xt_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs libcrc32c ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables xt_comment xt_mark xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables br_netfilter bridge stp llc aufs overlay nls_utf8 isofs crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd input_leds serio_raw i2c_piix4 hv_balloon hyperv_fb 8250_fintek joydev mac_hid autofs4 hid_generic hv_utils hid_hyperv ptp hv_netvsc hid hv_storvsc pps_core
[Fri Mar 19 20:03:04 2021] hyperv_keyboard scsi_transport_fc psmouse pata_acpi hv_vmbus floppy fjes
[Fri Mar 19 20:03:04 2021] CPU: 1 PID: 24869 Comm: gdb Tainted: G W 4.4.0-204-generic #236-Ubuntu
[Fri Mar 19 20:03:04 2021] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 05/18/2018
[Fri Mar 19 20:03:04 2021] task: ffff880db9229c80 ti: ffff880d93b9c000 task.ti: ffff880d93b9c000
[Fri Mar 19 20:03:04 2021] RIP: 0010:[<ffffffff811cd93e>] [<ffffffff811cd93e>] handle_mm_fault+0x13de/0x1b80
[Fri Mar 19 20:03:04 2021] RSP: 0018:ffff880d93b9fc28 EFLAGS: 00010246
[Fri Mar 19 20:03:04 2021] RAX: 0000000000000100 RBX: 0000000000000000 RCX: 0000000000000120
[Fri Mar 19 20:03:04 2021] RDX: ffff880ea635f3e8 RSI: 00003ffffffff000 RDI: 0000000000000000
[Fri Mar 19 20:03:04 2021] RBP: ffff880d93b9fce8 R08: 00003ff32179a120 R09: 000000000000007d
[Fri Mar 19 20:03:04 2021] R10: ffff8800000003e8 R11: 00000000000003e8 R12: ffff8800ea672708
[Fri Mar 19 20:03:04 2021] R13: 0000000000000000 R14: 000000010247d000 R15: ffff8800f27fe400
[Fri Mar 19 20:03:04 2021] FS: 00007fdc26061600(0000) GS:ffff881025640000(0000) knlGS:0000000000000000
[Fri Mar 19 20:03:04 2021] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Mar 19 20:03:04 2021] CR2: 000055e3a0011290 CR3: 0000000d93ba4000 CR4: 0000000000160670
[Fri Mar 19 20:03:04 2021] Stack:
[Fri Mar 19 20:03:04 2021] ffffffff81082929 fffffffffffffffd ffffffff81082252 ffff880d93b9fca8
[Fri Mar 19 20:03:04 2021] ffffffff811c7bca ffff8800f27fe400 000000010247d000 ffff880e74a88090
[Fri Mar 19 20:03:04 2021] 000000003a98d7f0 ffff880e00000001 ffff8800000003e8 0000000000000017
[Fri Mar 19 20:03:04 2021] Call Trace:
[Fri Mar 19 20:03:04 2021] [<ffffffff81082929>] ? mm_access+0x79/0xa0
[Fri Mar 19 20:03:04 2021] [<ffffffff81082252>] ? mmput+0x12/0x130
[Fri Mar 19 20:03:04 2021] [<ffffffff811c7bca>] ? follow_page_pte+0x1ca/0x3d0
[Fri Mar 19 20:03:04 2021] [<ffffffff811c7fe4>] ? follow_page_mask+0x214/0x3a0
[Fri Mar 19 20:03:04 2021] [<ffffffff811c82a0>] __get_user_pages+0x130/0x680
[Fri Mar 19 20:03:04 2021] [<ffffffff8122b248>] ? path_openat+0x348/0x1360
[Fri Mar 19 20:03:04 2021] [<ffffffff811c8b74>] get_user_pages+0x34/0x40
[Fri Mar 19 20:03:04 2021] [<ffffffff811c90f4>] __access_remote_vm+0xe4/0x2d0
[Fri Mar 19 20:03:04 2021] [<ffffffff811ef6ac>] ? alloc_pages_current+0x8c/0x110
[Fri Mar 19 20:03:04 2021] [<ffffffff811cfe3f>] access_remote_vm+0x1f/0x30
[Fri Mar 19 20:03:04 2021] [<ffffffff8128d3fa>] mem_rw.isra.16+0xfa/0x190
[Fri Mar 19 20:03:04 2021] [<ffffffff8128d4c8>] mem_read+0x18/0x20
[Fri Mar 19 20:03:04 2021] [<ffffffff8121c89b>] __vfs_read+0x1b/0x40
[Fri Mar 19 20:03:04 2021] [<ffffffff8121d016>] vfs_read+0x86/0x130
[Fri Mar 19 20:03:04 2021] [<ffffffff8121df65>] SyS_pread64+0x95/0xb0
[Fri Mar 19 20:03:04 2021] [<ffffffff8186acdb>] entry_SYSCALL_64_fastpath+0x22/0xd0
[Fri Mar 19 20:03:04 2021] Code: d4 ee ff ff 48 8b 7d 98 89 45 88 e8 2d c7 fd ff 8b 45 88 89 c3 e9 be ee ff ff 48 8b bd 70 ff ff ff e8 c7 cf 69 00 e9 ad ee ff ff <0f> 0b 4c 89 e7 4c 89 9d 70 ff ff ff e8 f1 c9 00 00 85 c0 4c 8b
[Fri Mar 19 20:03:04 2021] RIP [<ffffffff811cd93e>] handle_mm_fault+0x13de/0x1b80
[Fri Mar 19 20:03:04 2021] RSP <ffff880d93b9fc28>
[Fri Mar 19 20:03:04 2021] ---[ end trace 9d28a7e662aea7df ]---
[Fri Mar 19 20:03:04 2021] systemd-journald[581]: Compressed data object 806 -> 548 using XZ

------------------------

We think the following code may be relevant to the crashing behavior. I think this is the relevant source for Ubuntu 4.4.0-204 (BTW, are you sure this is Ubuntu 20.04? 4.4.0 is a Xenial kernel):

memory.c\mm - ~ubuntu-kernel/ubuntu/+source/linux/+git/xenial - [no description] (launchpad.net)

    static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
                            unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
    {
    ...
            /* A PROT_NONE fault should not end up here */
            BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));    <- line 3214

We see the following fix, but we are not certain yet whether it is relevant.
This is interesting…

mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing · torvalds/linux@38e0885 · GitHub

    mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing

    The NUMA balancing logic uses an arch-specific PROT_NONE page table flag defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page PMDs respectively as requiring balancing upon a subsequent page fault. User-defined PROT_NONE memory regions which also have this flag set will not normally invoke the NUMA balancing code, as do_page_fault() will send a segfault to the process before handle_mm_fault() is even called. However, if access_remote_vm() is invoked to access a PROT_NONE region of memory, handle_mm_fault() is called via faultin_page() and __get_user_pages() without any access checks being performed, meaning the NUMA balancing logic is incorrectly invoked on a non-NUMA memory region.

    A simple means of triggering this problem is to access PROT_NONE mmap'd memory using /proc/self/mem, which reliably results in the NUMA handling functions being invoked when CONFIG_NUMA_BALANCING is set.

    This issue was reported in bugzilla (issue 99101), which includes some simple repro code. There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page() added at commit c0e7cad to avoid accidentally provoking strange behavior by attempting to apply NUMA balancing to pages that are in fact PROT_NONE. The BUG_ON()s are consistently triggered by the repro.

    This patch moves the PROT_NONE check into mm/memory.c rather than invoking BUG_ON(), as faulting in these pages via faultin_page() is a valid reason for reaching the NUMA check with the PROT_NONE page table flag set and is therefore not always a bug.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101

We need help in understanding how to prevent the core dump/kernel panic while taking a memory dump of a Focal container on a Xenial host.
[Test Plan]

Testing on a 16.04 Azure instance, follow these steps:

$ echo 'GRUB_FLAVOUR_ORDER="generic"' | sudo tee -a /etc/default/grub.d/99-custom.cfg
$ sudo apt install linux-generic
$ sudo reboot
# log in again and confirm the system is booted with the 4.4 kernel
$ sudo apt install docker.io gdb
$ sudo docker pull mcr.microsoft.com/mssql/server:2019-latest
$ sudo docker run -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=<YourStrong@Passw0rd>" \
    -p 1433:1433 --name sql1 -h sql1 \
    -d mcr.microsoft.com/mssql/server:2019-latest
$ ps -ef | grep sqlservr    # note the PID of the sqlservr process
$ sudo gdb -p $PID -ex generate-core-file    # $PID is the sqlservr PID found above
# A kernel BUG should be triggered

[Where problems could occur]

The patch touches the mm subsystem, so there is always the potential for significant regressions, in which case a revert and a re-spin would probably be necessary. On the other hand, this patch has been included in the mainline kernel since 4.8 without problems.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1921211/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp