[Kernel-packages] [Bug 1837810] Re: KVM: Fix zero_page reference counter overflow when using KSM on KVM compute host

Matthew Ruffell Sun, 16 May 2021 23:12:20 -0700

Hi Jiatong,

Thanks for emailing me, happy to answer questions anytime.


> 1. why linux-hwe-4.15.0 source code is used?

If you look closely at the oops in the description, the customer I was
working with was running:

4.15.0-106-generic #107~16.04.1-Ubuntu
 
This is the Xenial (16.04) HWE kernel. I was using the linux-hwe-4.15.0 source
code to make sure the source I was reading matched the debug symbol package
exactly.

In your case:

4.15.0-72-generic #81-Ubuntu

you are running the 4.15 kernel on normal Bionic (18.04), so we can use
the normal linux-4.15.0 source code.
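
As a rough illustration of how the two build strings above differ, here is a
minimal sketch. It assumes that an HWE build tag carries a "~<release>" suffix
(as in "#107~16.04.1-Ubuntu") while a normal build tag does not (as in
"#81-Ubuntu"). The helper name looks_like_hwe_build() is hypothetical, and the
check is only a heuristic, not an official way to identify a kernel flavour.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns true when the build tag contains a '~',
 * which is assumed here to mark a "~<release>" HWE suffix.
 */
bool looks_like_hwe_build(const char *build_tag)
{
    return build_tag != NULL && strchr(build_tag, '~') != NULL;
}
```

With the two strings from this thread, looks_like_hwe_build("#107~16.04.1-Ubuntu")
is true and looks_like_hwe_build("#81-Ubuntu") is false.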

> 2. we are using linux-4.15.0-unsigned and by skimming through the
> source code, looks like try_get_page is not defined at that time?

Yes! You are correct, the original mainline 4.15 kernel did not have
try_get_page() defined at:

https://elixir.bootlin.com/linux/v4.15/source/mm/gup.c#L156

But if you look closely at the actual kernel sources for
4.15.0-72-generic:

https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/mm/gup.c?h=Ubuntu-4.15.0-72.81#n156

We see that try_get_page() is there. That is because we backported:

commit 8fde12ca79aff9b5ba951fce1a2641901b8d8e64
Author: Linus Torvalds <torva...@linux-foundation.org>
Date:   Thu Apr 11 10:49:19 2019 -0700
Subject: mm: prevent get_user_pages() from overflowing page refcount
Link: https://github.com/torvalds/linux/commit/8fde12ca79aff9b5ba951fce1a2641901b8d8e64

Ubuntu 4.15 backport link: https://paste.ubuntu.com/p/2bF5WWQy2r/

That commit first turned up in 4.15.0-59-generic, via upstream-stable.

Anyway, let's have a look at your stack trace:

4.15.0-72-generic #81-Ubuntu
RIP: 0010:follow_page_pte+0x663/0x6d0
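
The RIP value has the form symbol+offset/size: the faulting instruction is
0x663 bytes into follow_page_pte(), which is 0x6d0 bytes long. A small sketch
of splitting such a string follows. The struct and function names are
hypothetical and exist only for illustration.

```c
#include <stdio.h>

/* Hypothetical container for one "symbol+0xOFF/0xSIZE" location. */
struct rip_location {
    char symbol[64];       /* function name */
    unsigned long offset;  /* bytes from the start of the function */
    unsigned long size;    /* total size of the function in bytes */
};

/* Returns 0 on success, -1 if the text is not in the expected form. */
int parse_rip_location(const char *text, struct rip_location *out)
{
    /* %lx accepts an optional "0x" prefix, so "0x663" parses directly. */
    if (sscanf(text, "%63[^+]+%lx/%lx",
               out->symbol, &out->offset, &out->size) != 3)
        return -1;
    return 0;
}
```

The symbol and offset are what get passed to eu-addr2line as symbol+offset.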

I downloaded the debug symbols:

http://ddebs.ubuntu.com/ubuntu/pool/main/l/linux/linux-image-unsigned-4.15.0-72-generic-dbgsym_4.15.0-72.81_amd64.ddeb

Extracted them:

dpkg -x linux-image-unsigned-4.15.0-72-generic-dbgsym_4.15.0-72.81_amd64.ddeb debug

and looked up:

$ eu-addr2line -e ./vmlinux-4.15.0-72-generic -f follow_page_pte+0x663
try_get_page inlined at /build/linux-E6MDAa/linux-4.15.0/mm/gup.c:156 in follow_page_pte
/build/linux-E6MDAa/linux-4.15.0/mm/gup.c:138

We see that you hit try_get_page() in mm/gup.c:156:

 155     if (flags & FOLL_GET) {
 156         if (unlikely(!try_get_page(page))) {
 157             page = ERR_PTR(-ENOMEM);
 158             goto out;
 159         }
 
Looking at try_get_page() in include/linux/mm.h:

 854 static inline __must_check bool try_get_page(struct page *page)
 855 {
 856     page = compound_head(page);
 857     if (WARN_ON_ONCE(page_ref_count(page) <= 0))
 858         return false;
 859     page_ref_inc(page);
 860     return true;
 861 }
 
We see that you hit the exact same WARN_ON_ONCE(page_ref_count(page) <= 0)
check.

So, whatever page you are trying to access has a reference counter that is
zero or negative, which suggests that it has either wrapped around or
been decremented too many times.
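
To make the wrap-around case concrete, here is a userspace sketch (not kernel
code) of a 32-bit signed counter guarded the same way as try_get_page(). The
fake_page type and the fake_* function names are hypothetical. The increment
is done in unsigned arithmetic so the wrap is well defined in C, and the
conversion back to int32_t is assumed to wrap in the usual two's complement
way.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page, holding only the counter. */
struct fake_page {
    int32_t refcount;
};

/* Increment with wrap-around, computed in unsigned to avoid signed overflow. */
void fake_page_ref_inc(struct fake_page *page)
{
    page->refcount = (int32_t)((uint32_t)page->refcount + 1u);
}

/* Mirrors the shape of try_get_page(): refuse when the count is <= 0. */
bool fake_try_get_page(struct fake_page *page)
{
    if (page->refcount <= 0)
        return false;
    fake_page_ref_inc(page);
    return true;
}
```

Starting from INT32_MAX, one successful fake_try_get_page() call leaves the
counter at INT32_MIN, and every later call fails the <= 0 check.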

Looking at your error log, I can't tell for sure if it is the zero_page,
but it's quite likely to be. The zero_page is a frequently used
page in the system, and it is used outside of ksm; it's just that ksm is
a heavy user of the zero_page. If you are constantly allocating large
amounts of new memory, you will be using the zero_page much as ksm does,
and the reference counter will eventually overflow.

I think there is a good chance that the fix I submitted in
4.15.0-118-generic will solve your problems. Please run "apt update"
and "apt upgrade" to move to a newer kernel, the newer the better,
and it will most likely fix the problem.

Let me know if you have any more questions.

Thanks,
Matthew

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1837810

Title:
  KVM: Fix zero_page reference counter overflow when using KSM on KVM
  compute host

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Focal:
  Fix Released

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/1837810

  [Impact]

  We are seeing a problem on OpenStack compute nodes, and KVM hosts,
  where a kernel oops is generated, and all running KVM machines are
  placed into the pause state.

  This is caused by the kernel's reserved zero_page reference counter
  overflowing from a positive number to a negative number, and hitting a
  (WARN_ON_ONCE(page_ref_count(page) <= 0)) condition in try_get_page().

  This only happens if the machine has Kernel Samepage Merging (KSM)
  enabled, with "use_zero_pages" turned on. Each time a new VM starts
  and the kernel does a KSM merge run during an EPT violation, the
  reference counter for the zero_page is incremented in try_async_pf()
  and never decremented. Eventually, the reference counter will
  overflow, causing the KVM subsystem to fail.
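
The leak described above can be pictured with a toy model (a sketch, not
kernel code). It assumes each guest fault takes exactly one reference on the
zero page, and that the release path either drops that reference or skips
the drop. All names here are hypothetical.

```c
#include <stdbool.h>

/* Toy model: a single counter standing in for the zero_page refcount. */
struct zero_page_model {
    long refcount;
};

/* One guest fault: take a reference, then optionally drop it on release. */
void model_fault(struct zero_page_model *zp, bool release_drops_ref)
{
    zp->refcount++;          /* reference taken on the fault path */
    if (release_drops_ref)
        zp->refcount--;      /* matching drop on the release path */
}

/* Counter value after a number of faults, starting from 1. */
long model_refcount_after(long faults, bool release_drops_ref)
{
    struct zero_page_model zp = { 1 };

    for (long i = 0; i < faults; i++)
        model_fault(&zp, release_drops_ref);
    return zp.refcount;
}
```

When the drop is skipped the counter grows by one per fault, and when the
drop happens the counter stays at its starting value.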

  Syslog:
  error : qemuMonitorJSONCheckError:392 : internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required

  QEMU Logs:
  error: kvm run failed Bad address
  EAX=000afe00 EBX=0000000b ECX=00000080 EDX=00000cfe
  ESI=0003fe00 EDI=000afe00 EBP=00000007 ESP=00006d74
  EIP=000ee344 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
  ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
  SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
  TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
  GDT=     000f7040 00000037
  IDT=     000f707e 00000000
  CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
  DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
  DR6=00000000ffff0ff0 DR7=0000000000000400
  EFER=0000000000000000
  Code=c3 57 56 b8 00 fe 0a 00 be 00 fe 03 00 b9 80 00 00 00 89 c7 <f3> a5 a1 00 80 03 00 8b 15 04 80 03 00 a3 00 80 0a 00 89 15 04 80 0a 00 b8 ae e2 00 00 31

  Kernel Oops:

  [  167.695986] WARNING: CPU: 1 PID: 3016 at /build/linux-hwe-FEhT7y/linux-hwe-4.15.0/include/linux/mm.h:852 follow_page_pte+0x6f4/0x710
  [  167.696023] CPU: 1 PID: 3016 Comm: CPU 0/KVM Tainted: G           OE    4.15.0-106-generic #107~16.04.1-Ubuntu
  [  167.696023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
  [  167.696025] RIP: 0010:follow_page_pte+0x6f4/0x710
  [  167.696026] RSP: 0018:ffffa81802023908 EFLAGS: 00010286
  [  167.696027] RAX: ffffed8786e33a80 RBX: ffffed878c6d21b0 RCX: 0000000080000000
  [  167.696027] RDX: 0000000000000000 RSI: 00003ffffffff000 RDI: 80000001b8cea225
  [  167.696028] RBP: ffffa81802023970 R08: 80000001b8cea225 R09: ffff90c4d55fa340
  [  167.696028] R10: 0000000000000000 R11: 0000000000000000 R12: ffffed8786e33a80
  [  167.696029] R13: 0000000000000326 R14: ffff90c4db94fc50 R15: ffff90c4d55fa340
  [  167.696030] FS:  00007f6a7798c700(0000) GS:ffff90c4edc80000(0000) knlGS:0000000000000000
  [  167.696030] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  167.696031] CR2: 0000000000000000 CR3: 0000000315580002 CR4: 0000000000162ee0
  [  167.696033] Call Trace:
  [  167.696047]  follow_pmd_mask+0x273/0x630
  [  167.696049]  follow_page_mask+0x178/0x230
  [  167.696051]  __get_user_pages+0xb8/0x740
  [  167.696052]  get_user_pages+0x42/0x50
  [  167.696068]  __gfn_to_pfn_memslot+0x18b/0x3b0 [kvm]
  [  167.696079]  ? mmu_set_spte+0x1dd/0x3a0 [kvm]
  [  167.696090]  try_async_pf+0x66/0x220 [kvm]
  [  167.696101]  tdp_page_fault+0x14b/0x2b0 [kvm]
  [  167.696104]  ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
  [  167.696114]  kvm_mmu_page_fault+0x62/0x180 [kvm]
  [  167.696117]  handle_ept_violation+0xbc/0x160 [kvm_intel]
  [  167.696119]  vmx_handle_exit+0xa5/0x580 [kvm_intel]
  [  167.696129]  vcpu_enter_guest+0x414/0x1260 [kvm]
  [  167.696138]  ? kvm_arch_vcpu_load+0x4d/0x280 [kvm]
  [  167.696148]  kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
  [  167.696157]  ? kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
  [  167.696165]  kvm_vcpu_ioctl+0x33a/0x610 [kvm]
  [  167.696166]  ? do_futex+0x129/0x590
  [  167.696171]  ? __switch_to+0x34c/0x4e0
  [  167.696174]  ? __switch_to_asm+0x35/0x70
  [  167.696176]  do_vfs_ioctl+0xa4/0x600
  [  167.696177]  SyS_ioctl+0x79/0x90
  [  167.696180]  ? exit_to_usermode_loop+0xa5/0xd0
  [  167.696181]  do_syscall_64+0x73/0x130
  [  167.696182]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
  [  167.696184] RIP: 0033:0x7f6a80482007
  [  167.696184] RSP: 002b:00007f6a7798b8b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [  167.696185] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f6a80482007
  [  167.696185] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000016
  [  167.696186] RBP: 000055fe135f3240 R08: 000055fe118be530 R09: 0000000000000001
  [  167.696186] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [  167.696187] R13: 00007f6a85852000 R14: 0000000000000000 R15: 000055fe135f3240
  [  167.696188] Code: 4d 63 e6 e9 f2 fc ff ff 4c 89 45 d0 48 8b 47 10 e8 22 f0 9e 00 4c 8b 45 d0 e9 89 fc ff ff 4c 89 e7 e8 81 3f fd ff e9 aa fc ff ff <0f> 0b 49 c7 c4 f4 ff ff ff e9 c1 fc ff ff 0f 1f 40 00 66 2e 0f
  [  167.696200] ---[ end trace 7573f6868ea8f069 ]---

  [Fix]

  This was fixed in 5.6-rc1 with the following commit:

  commit 7df003c85218b5f5b10a7f6418208f31e813f38f
  Author: Zhuang Yanying <ann.zhuangyany...@huawei.com>
  Date:   Sat Oct 12 11:37:31 2019 +0800
  Subject:  KVM: fix overflow of zero page refcount with ksm running
  Link: https://github.com/torvalds/linux/commit/7df003c85218b5f5b10a7f6418208f31e813f38f
 

  The fix adds a check to see if the Page Frame Number (pfn) refers
  to the zero page, and if it does, KVM no longer treats it as reserved.
  This has the effect that put_page() is now called on the zero_page when
  KVM releases it, so every reference taken during a fault is dropped
  again and the counter no longer climbs.

  This is a clean cherry pick to Bionic and Focal kernels.

  [Testcase]

  Create a new KVM host, and make sure it has plenty of RAM. 16 GB should
  be okay.

  Install KVM packages:

  $ sudo apt install -y qemu-kvm libvirt-bin qemu-utils genisoimage virtinst

  Enable Kernel Samepage Merging, and use_zero_pages:

  $ echo 10000 | sudo tee /sys/kernel/mm/ksm/pages_to_scan
  $ echo 1 | sudo tee /sys/kernel/mm/ksm/run
  $ echo 1 | sudo tee /sys/kernel/mm/ksm/use_zero_pages

  I wrote a script which creates and destroys xenial KVM VMs in an infinite loop:
  https://paste.ubuntu.com/p/CvRTsDkdC7/

  Save the script to disk, and execute it:

  $ chmod +x ksm_refcnt_overflow.sh
  $ ./ksm_refcnt_overflow.sh

  Each time a VM is created and destroyed the reference counter will
  increase.

  I wrote a kernel module which exposes a /proc interface, which we can
  use to look at the value of the zero_page reference counter. It works
  by taking the memory allocated for the zero page, empty_zero_page
  (declared in arch/x86/include/asm/pgtable.h), and running
  virt_to_page() on it to get the page struct, which we can then
  dereference to get _refcount:

  https://paste.ubuntu.com/p/MJMN8jMVds/

  Save the module to disk, create its Makefile from the included
  documentation, and build it:

  $ make
  $ sudo insmod zero_page_refcount.ko 

  From there, we can examine the reference counter with:

  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x687 or 1671
  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x846 or 2118
  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x9f8 or 2552
  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0xcb2 or 3250 

  We see it steadily increase. Instead of waiting months for it to
  overflow, I implemented a /proc entry to set it to near overflow. You
  can use it with:

  $ cat /proc/zero_page_refcount_set
  Zero Page Refcount set to 0x1FFFFFFFFF000 

  After that, wait a few seconds and the reference counter will
  overflow:

  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x7fffff16 or 2147483414
  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x80000000 or -2147483648 
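
The jump from a large positive value to a large negative one comes from
reading the same 32 bits as a signed number. A small sketch of the formatting
used in the output above follows. The function name is hypothetical, it is
not taken from the module, and the cast to int32_t is assumed to wrap in the
usual two's complement way.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Prints the raw 32-bit value in hex, then the same bits as a signed int. */
int format_refcount(char *buf, size_t len, uint32_t raw)
{
    return snprintf(buf, len, "Zero Page Refcount: 0x%" PRIx32 " or %" PRId32,
                    raw, (int32_t)raw);
}
```

For 0x7fffff16 this gives 2147483414, and for 0x80000000 it gives
-2147483648, matching the two readings shown.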

  All VMs will become paused:

  $ virsh list
   Id    Name                           State
  ----------------------------------------------------
   1     instance-0                     paused
   2     instance-1                     paused

  QEMU will error out, and the kernel will oops with the messages in the
  impact section.

  I built a test kernel, which is available here:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf290373-test

  If you install the test kernel and try to reproduce, you will notice the
  reference counter is never incremented past 1:

  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x1 or 1
  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x1 or 1
  $ cat /proc/zero_page_refcount
  Zero Page Refcount: 0x1 or 1 

  This resolves the problem.

  [Regression Potential]

  While the change itself seems simple, it changes how the kernel treats
  the zero_page. The zero_page is important, since it is just a page
  full of 0s. Each time memory is allocated which is all 0s, the kernel
  sets it to use the zero_page to save memory. When an application
  writes to the buffer, an EPT violation happens, and the kernel does a
  COW to new pages to hold the data.

  The change is limited to how the KVM subsystem handles the zero_page.
  This will not break the entire kernel if a regression occurs, only
  KVM.

  If a regression were to occur, users could turn off KSM and disable
  KSM use_zero_pages until a fix is ready, as this particular use of
  zero_pages is limited to KSM.

  The fix landed in upstream 5.6, and has not been backported to stable
  kernels.

  I have read a bit of the paging code, especially around where the
  zero_page is used, and where its reference counters were being
  incorrectly incremented.

  I think the fix is correct, and I believe it won't cause any
  regressions.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1837810/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
