Hi Jiatong,

Thanks for emailing me. I'm happy to answer questions anytime.

> 1. why linux-hwe-4.15.0 source code is used?

If you look closely at the oops in the description, the customer I was
working with was running:

4.15.0-106-generic #107~16.04.1-Ubuntu

This is the Xenial (16.04) HWE kernel. I used the linux-hwe-4.15.0
source code to make sure the source matched the debug symbols in the
debug symbol package exactly.

In your case:

4.15.0-72-generic #81-Ubuntu

you are running the 4.15 kernel on normal Bionic (18.04), so we can use
the normal linux-4.15.0 source code.

> 2. we are using linux-4.15.0-unsigned and by skimming through the source code, looks like try_get_page is not defined at that time?

Yes! You are correct: the original mainline 4.15 kernel does not have
try_get_page() at:

https://elixir.bootlin.com/linux/v4.15/source/mm/gup.c#L156

But if you look closely at the actual kernel sources for
4.15.0-72-generic:

https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/mm/gup.c?h=Ubuntu-4.15.0-72.81#n156

we see that try_get_page() is there. That is because we backported:

commit 8fde12ca79aff9b5ba951fce1a2641901b8d8e64
Author: Linus Torvalds <torva...@linux-foundation.org>
Date: Thu Apr 11 10:49:19 2019 -0700
Subject: mm: prevent get_user_pages() from overflowing page refcount
Link: https://github.com/torvalds/linux/commit/8fde12ca79aff9b5ba951fce1a2641901b8d8e64

Ubuntu 4.15 backport link: https://paste.ubuntu.com/p/2bF5WWQy2r/

That commit first turned up in 4.15.0-59-generic, via upstream-stable.
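
As a quick way to place a given 4.15.0 kernel relative to the two
releases mentioned in this email, here is a minimal sketch. The function
names abi_of and classify_kernel are made up for illustration, and it
assumes the ABI number after "4.15.0-" is the only thing that matters,
with 59 (try_get_page() backport) and 118 (zero_page fix) as the
thresholds. It has not been checked against other series or flavours.

```sh
#!/bin/sh
# Hypothetical helpers; the thresholds 59 and 118 are the ABI numbers
# named in this email, not values looked up from any package database.

# Extract the ABI number from a release string such as "4.15.0-72-generic".
abi_of() {
    rest=${1#4.15.0-}      # "4.15.0-72-generic" -> "72-generic"
    echo "${rest%%-*}"     # "72-generic"        -> "72"
}

# Say which of the two changes a 4.15.0-N kernel should contain.
classify_kernel() {
    case "$1" in
        4.15.0-*) ;;
        *) echo "not a 4.15.0 kernel"; return 1 ;;
    esac
    abi=$(abi_of "$1")
    if [ "$abi" -lt 59 ]; then
        echo "no try_get_page() backport"
    elif [ "$abi" -lt 118 ]; then
        echo "try_get_page() present, zero_page fix missing"
    else
        echo "zero_page fix included"
    fi
}

classify_kernel "4.15.0-72-generic"
classify_kernel "4.15.0-118-generic"
```

On a running machine the argument would normally come from "uname -r".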

Anyway, let's have a look at your stack trace:

4.15.0-72-generic #81-Ubuntu
RIP: 0010:follow_page_pte+0x663/0x6d0

I downloaded the debug symbols:

http://ddebs.ubuntu.com/ubuntu/pool/main/l/linux/linux-image-unsigned-4.15.0-72-generic-dbgsym_4.15.0-72.81_amd64.ddeb

Extracted them:

$ dpkg -x linux-image-unsigned-4.15.0-72-generic-dbgsym_4.15.0-72.81_amd64.ddeb debug

and looked up:

$ eu-addr2line -e ./vmlinux-4.15.0-72-generic -f follow_page_pte+0x663
try_get_page inlined at /build/linux-E6MDAa/linux-4.15.0/mm/gup.c:156 in follow_page_pte
/build/linux-E6MDAa/linux-4.15.0/mm/gup.c:138

We see that you hit try_get_page() in mm/gup.c:156

155         if (flags & FOLL_GET) {
156                 if (unlikely(!try_get_page(page))) {
157                         page = ERR_PTR(-ENOMEM);
158                         goto out;
159                 }

Looking at try_get_page() in include/linux/mm.h:

854 static inline __must_check bool try_get_page(struct page *page)
855 {
856         page = compound_head(page);
857         if (WARN_ON_ONCE(page_ref_count(page) <= 0))
858                 return false;
859         page_ref_inc(page);
860         return true;
861 }

We see that you hit the exact same WARN_ON_ONCE for the
(page_ref_count(page) <= 0) condition.

So whatever page you are trying to access has a reference count that is
zero or negative, which suggests that it has either wrapped around or
been decremented too many times.

Looking at your error log, I can't tell for sure whether it is the
zero_page, but it's quite likely. The zero_page is a frequently used
page in the system, and it is used outside of KSM; it's just that KSM is
a heavy user of the zero_page. If you are constantly allocating large
amounts of new memory, you will be using the zero_page much as KSM does,
and the reference counter will eventually overflow.

I think there is a good chance that the fix I submitted in
4.15.0-118-generic will solve your problems.

Please run "apt update" and "apt upgrade" to move to a newer kernel, the
newer the better; it will most likely fix the problem.

Let me know if you have any more questions.

Thanks,
Matthew

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1837810

Title:
  KVM: Fix zero_page reference counter overflow when using KSM on KVM
  compute host

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Focal:
  Fix Released

Bug description:

BugLink: https://bugs.launchpad.net/bugs/1837810

[Impact]

We are seeing a problem on OpenStack compute nodes, and KVM hosts, where
a kernel oops is generated, and all running KVM machines are placed into
the pause state.

This is caused by the kernel's reserved zero_page reference counter
overflowing from a positive number to a negative number, and hitting a
(WARN_ON_ONCE(page_ref_count(page) <= 0)) condition in try_get_page().

This only happens if the machine has Kernel Samepage Merging (KSM)
enabled, with "use_zero_pages" turned on. Each time a new VM starts and
the kernel does a KSM merge run during an EPT violation, the reference
counter for the zero_page is incremented in try_async_pf() and never
decremented. Eventually, the reference counter will overflow, causing
the KVM subsystem to fail.
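
To make the overflow concrete, here is a small sketch of what happens to
a 32-bit signed counter when one more reference is taken at its maximum.
The helper name to_s32 is made up for illustration, and the sketch
assumes the shell does its arithmetic in 64 bits, so the 32-bit wrap has
to be emulated by hand. It models only the arithmetic, not the kernel's
page_ref_inc().

```sh
#!/bin/sh
# Hypothetical helper: interpret the low 32 bits of a value as a signed
# two's-complement int, the way an int-sized refcount is read.
to_s32() {
    v=$(( $1 & 0xffffffff ))
    if [ "$v" -ge 2147483648 ]; then    # bit 31 set, so the value is negative
        v=$(( v - 4294967296 ))
    fi
    echo "$v"
}

# Take one more reference on a counter that is already at INT_MAX.
refcount=$(( 0x7fffffff ))
echo "before: $(to_s32 "$refcount")"
refcount=$(( refcount + 1 ))
echo "after:  $(to_s32 "$refcount")"

# The condition checked by try_get_page(): a count <= 0 triggers the warning.
if [ "$(to_s32 "$refcount")" -le 0 ]; then
    echo "WARN_ON_ONCE condition is true"
fi
```

The counter goes from 2147483647 to -2147483648 in a single increment,
which is the positive-to-negative flip described above.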

Syslog:

error : qemuMonitorJSONCheckError:392 : internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required

QEMU Logs:

error: kvm run failed Bad address
EAX=000afe00 EBX=0000000b ECX=00000080 EDX=00000cfe
ESI=0003fe00 EDI=000afe00 EBP=00000007 ESP=00006d74
EIP=000ee344 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000f7040 00000037
IDT= 000f707e 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=c3 57 56 b8 00 fe 0a 00 be 00 fe 03 00 b9 80 00 00 00 89 c7 <f3> a5 a1 00 80 03 00 8b 15 04 80 03 00 a3 00 80 0a 00 89 15 04 80 0a 00 b8 ae e2 00 00 31

Kernel Oops:

[ 167.695986] WARNING: CPU: 1 PID: 3016 at /build/linux-hwe-FEhT7y/linux-hwe-4.15.0/include/linux/mm.h:852 follow_page_pte+0x6f4/0x710
[ 167.696023] CPU: 1 PID: 3016 Comm: CPU 0/KVM Tainted: G OE 4.15.0-106-generic #107~16.04.1-Ubuntu
[ 167.696023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[ 167.696025] RIP: 0010:follow_page_pte+0x6f4/0x710
[ 167.696026] RSP: 0018:ffffa81802023908 EFLAGS: 00010286
[ 167.696027] RAX: ffffed8786e33a80 RBX: ffffed878c6d21b0 RCX: 0000000080000000
[ 167.696027] RDX: 0000000000000000 RSI: 00003ffffffff000 RDI: 80000001b8cea225
[ 167.696028] RBP: ffffa81802023970 R08: 80000001b8cea225 R09: ffff90c4d55fa340
[ 167.696028] R10: 0000000000000000 R11: 0000000000000000 R12: ffffed8786e33a80
[ 167.696029] R13: 0000000000000326 R14: ffff90c4db94fc50 R15: ffff90c4d55fa340
[ 167.696030] FS: 00007f6a7798c700(0000) GS:ffff90c4edc80000(0000) knlGS:0000000000000000
[ 167.696030] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 167.696031] CR2: 0000000000000000 CR3: 0000000315580002 CR4: 0000000000162ee0
[ 167.696033] Call Trace:
[ 167.696047]  follow_pmd_mask+0x273/0x630
[ 167.696049]  follow_page_mask+0x178/0x230
[ 167.696051]  __get_user_pages+0xb8/0x740
[ 167.696052]  get_user_pages+0x42/0x50
[ 167.696068]  __gfn_to_pfn_memslot+0x18b/0x3b0 [kvm]
[ 167.696079]  ? mmu_set_spte+0x1dd/0x3a0 [kvm]
[ 167.696090]  try_async_pf+0x66/0x220 [kvm]
[ 167.696101]  tdp_page_fault+0x14b/0x2b0 [kvm]
[ 167.696104]  ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
[ 167.696114]  kvm_mmu_page_fault+0x62/0x180 [kvm]
[ 167.696117]  handle_ept_violation+0xbc/0x160 [kvm_intel]
[ 167.696119]  vmx_handle_exit+0xa5/0x580 [kvm_intel]
[ 167.696129]  vcpu_enter_guest+0x414/0x1260 [kvm]
[ 167.696138]  ? kvm_arch_vcpu_load+0x4d/0x280 [kvm]
[ 167.696148]  kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
[ 167.696157]  ? kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
[ 167.696165]  kvm_vcpu_ioctl+0x33a/0x610 [kvm]
[ 167.696166]  ? do_futex+0x129/0x590
[ 167.696171]  ? __switch_to+0x34c/0x4e0
[ 167.696174]  ? __switch_to_asm+0x35/0x70
[ 167.696176]  do_vfs_ioctl+0xa4/0x600
[ 167.696177]  SyS_ioctl+0x79/0x90
[ 167.696180]  ? exit_to_usermode_loop+0xa5/0xd0
[ 167.696181]  do_syscall_64+0x73/0x130
[ 167.696182]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 167.696184] RIP: 0033:0x7f6a80482007
[ 167.696184] RSP: 002b:00007f6a7798b8b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 167.696185] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f6a80482007
[ 167.696185] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000016
[ 167.696186] RBP: 000055fe135f3240 R08: 000055fe118be530 R09: 0000000000000001
[ 167.696186] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 167.696187] R13: 00007f6a85852000 R14: 0000000000000000 R15: 000055fe135f3240
[ 167.696188] Code: 4d 63 e6 e9 f2 fc ff ff 4c 89 45 d0 48 8b 47 10 e8 22 f0 9e 00 4c 8b 45 d0 e9 89 fc ff ff 4c 89 e7 e8 81 3f fd ff e9 aa fc ff ff <0f> 0b 49 c7 c4 f4 ff ff ff e9 c1 fc ff ff 0f 1f 40 00 66 2e 0f
[ 167.696200] ---[ end trace 7573f6868ea8f069 ]---

[Fix]

This was fixed in 5.6-rc1 with the following commit:

commit 7df003c85218b5f5b10a7f6418208f31e813f38f
Author: Zhuang Yanying <ann.zhuangyany...@huawei.com>
Date: Sat Oct 12 11:37:31 2019 +0800
Subject: KVM: fix overflow of zero page refcount with ksm running
Link: https://github.com/torvalds/linux/commit/7df003c85218b5f5b10a7f6418208f31e813f38f

The fix adds a check to see if the Page Frame Number (pfn) is linked to
the zero page, and if it is, treats it as reserved. This has the effect
that put_page() is no longer called on the zero_page, and reference
counting is no longer needed.

This is a clean cherry pick to Bionic and Focal kernels.

[Testcase]

Create a new KVM host, and make sure it has plenty of RAM. 16 GB should
be okay.

Install KVM packages:

$ sudo apt install -y qemu-kvm libvirt-bin qemu-utils genisoimage virtinst

Enable Kernel Samepage Merging, and use_zero_pages:

$ echo 10000 | sudo tee /sys/kernel/mm/ksm/pages_to_scan
$ echo 1 | sudo tee /sys/kernel/mm/ksm/run
$ echo 1 | sudo tee /sys/kernel/mm/ksm/use_zero_pages

I wrote a script which creates and destroys Xenial KVM VMs in an
infinite loop:

https://paste.ubuntu.com/p/CvRTsDkdC7/

Save the script to disk, and execute it:

$ chmod +x ksm_refcnt_overflow.sh
$ ./ksm_refcnt_overflow.sh

Each time a VM is created and destroyed the reference counter will
increase. I wrote a kernel module which exposes a /proc interface, which
we can use to look at the value of the zero_page reference counter.

It works by taking the memory allocated for the zero page,
empty_zero_page, which is defined in arch/x86/include/asm/pgtable.h, and
running virt_to_page() on it to get the struct page, which we can then
dereference to read _refcount:

https://paste.ubuntu.com/p/MJMN8jMVds/

Save the module to disk, create its Makefile from the included
documentation, and build it:

$ make
$ sudo insmod zero_page_refcount.ko

From there, we can examine the reference counter with:

$ cat /proc/zero_page_refcount
Zero Page Refcount: 0x687 or 1671
$ cat /proc/zero_page_refcount
Zero Page Refcount: 0x846 or 2118
$ cat /proc/zero_page_refcount
Zero Page Refcount: 0x9f8 or 2552
$ cat /proc/zero_page_refcount
Zero Page Refcount: 0xcb2 or 3250

We see it steadily increase. Instead of waiting months for it to
overflow, I implemented a /proc entry to set it to near overflow.
You can use it with:

$ cat /proc/zero_page_refcount_set
Zero Page Refcount set to 0x1FFFFFFFFF000

After that, wait a few seconds and the reference counter will overflow:

$ cat /proc/zero_page_refcount
Zero Page Refcount: 0x7fffff16 or 2147483414
$ cat /proc/zero_page_refcount
Zero Page Refcount: 0x80000000 or -2147483648

All VMs will become paused:

$ virsh list
 Id    Name                           State
----------------------------------------------------
 1     instance-0                     paused
 2     instance-1                     paused

QEMU will error out, and the kernel will oops with the messages in the
impact section.

I built a test kernel, which is available here:

https://launchpad.net/~mruffell/+archive/ubuntu/sf290373-test

If you install the test kernel and try to reproduce, you will notice
that the reference counter is never incremented past 1:

$ cat /proc/zero_page_refcount
Zero Page Refcount: 0x1 or 1
$ cat /proc/zero_page_refcount
Zero Page Refcount: 0x1 or 1
$ cat /proc/zero_page_refcount
Zero Page Refcount: 0x1 or 1

This resolves the problem.

[Regression Potential]

While the change itself seems simple, it changes how the kernel treats
the zero_page. The zero_page is important, since it is just a page full
of zeros. Each time memory is allocated that is all zeros, the kernel
maps it to the zero_page to save memory. When an application writes to
the buffer, an EPT violation happens, and the kernel does a COW to new
pages to hold the data.

The change is limited to how the KVM subsystem handles the zero_page.
This will not break the entire kernel if a regression occurs, only KVM.

If a regression were to occur, users could turn off KSM and disable KSM
use_zero_pages until a fix is ready, as this particular use of
zero_pages is limited to KSM.

The fix landed in upstream 5.6, and has not been backported to stable
kernels.

I have read a bit of the paging code, especially around where the
zero_page is used, and where its reference counters were being
incorrectly incremented.
I think the fix is correct, and I believe it won't cause any
regressions.
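
For anyone repeating the test case, the lines printed by
/proc/zero_page_refcount can be turned into a rough "how far from
overflow" figure. This is only a sketch: the function names refcount_of
and refcount_status are made up for illustration, and it assumes the
output format is exactly "Zero Page Refcount: 0x... or N" as shown in
the test case.

```sh
#!/bin/sh
# Hypothetical helpers for the "Zero Page Refcount: 0x687 or 1671" format.

# Pull the signed decimal value out of one output line.
refcount_of() {
    echo "${1##* or }"     # drop everything up to and including the last " or "
}

# Report how many increments remain before INT_MAX (2147483647),
# or that the counter has already gone to zero or negative.
refcount_status() {
    count=$(refcount_of "$1")
    if [ "$count" -le 0 ]; then
        echo "overflowed"
    else
        echo "headroom $(( 2147483647 - count ))"
    fi
}

refcount_status "Zero Page Refcount: 0x687 or 1671"
refcount_status "Zero Page Refcount: 0x80000000 or -2147483648"
```

With the module loaded, the argument would come from
"$(cat /proc/zero_page_refcount)" instead of a literal string.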