On Tue, May 04, 2021 at 01:04:17PM +0200, David Hildenbrand wrote:
> On 04.05.21 12:32, Daniel P. Berrangé wrote:
> > On Tue, May 04, 2021 at 12:21:25PM +0200, David Hildenbrand wrote:
> > > On 04.05.21 12:09, Daniel P. Berrangé wrote:
> > > > On Wed, Apr 28, 2021 at 03:37:48PM +0200, David Hildenbrand wrote:
> > > > > Let's support RAM_NORESERVE via MAP_NORESERVE on Linux. The flag has no
> > > > > effect on most shared mappings - except for hugetlbfs and anonymous
> > > > > memory.
> > > > >
> > > > > Linux man page:
> > > > >   "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap
> > > > >   space is reserved, one has the guarantee that it is possible to modify
> > > > >   the mapping. When swap space is not reserved one might get SIGSEGV
> > > > >   upon a write if no physical memory is available. See also the
> > > > >   discussion of the file /proc/sys/vm/overcommit_memory in proc(5). In
> > > > >   kernels before 2.6, this flag had effect only for private writable
> > > > >   mappings."
> > > > >
> > > > > Note that the "guarantee" part is wrong with memory overcommit in Linux.
> > > > >
> > > > > Also, in Linux hugetlbfs is treated differently - we configure
> > > > > reservation of huge pages from the pool, not reservation of swap space
> > > > > (huge pages cannot be swapped).
> > > > >
> > > > > The rough behavior is [1]:
> > > > >
> > > > > a) !Hugetlbfs:
> > > > >
> > > > >    1) Without MAP_NORESERVE *or* with memory overcommit under Linux
> > > > >       disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
> > > > >       accounting/reservation happens:
> > > > >        For a file backed map
> > > > >         SHARED or READ-only - 0 cost (the file is the map not swap)
> > > > >         PRIVATE WRITABLE - size of mapping per instance
> > > > >
> > > > >        For an anonymous or /dev/zero map
> > > > >         SHARED - size of mapping
> > > > >         PRIVATE READ-only - 0 cost (but of little use)
> > > > >         PRIVATE WRITABLE - size of mapping per instance
> > > > >
> > > > >    2) With MAP_NORESERVE, no accounting/reservation happens.
> > > > >
> > > > > b) Hugetlbfs:
> > > > >
> > > > >    1) Without MAP_NORESERVE, huge pages are reserved.
> > > > >
> > > > >    2) With MAP_NORESERVE, no huge pages are reserved.
> > > > >
> > > > > Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able
> > > > > to configure it for !hugetlbfs globally; this toggle now allows
> > > > > configuring it more fine-grained, not for the whole system.
> > > > >
> > > > > The target use case is virtio-mem, which dynamically exposes memory
> > > > > inside a large, sparse memory area to the VM.
> > > >
> > > > Can you explain this use case in more real world terms, as I'm not
> > > > understanding what a mgmt app would actually do with this in
> > > > practice ?
> > >
> > > Let's consider huge pages for simplicity. Assume you have 128 free huge
> > > pages in your hypervisor that you want to dynamically assign to VMs.
> > >
> > > Further assume you have two VMs running. A workflow could look like
> > >
> > > 1. Assign all huge pages to VM 0
> > > 2. Reassign 64 huge pages to VM 1
> > > 3. Reassign another 32 huge pages to VM 1
> > > 4. Reassign 16 huge pages to VM 0
> > > 5. ...
> > >
> > > Basically what we're used to doing with "ordinary" memory.
> >
> > What does this look like in terms of the memory backend configuration
> > when you boot VM 0 and VM 1 ?
> >
> > Are you saying that we boot both VMs with
> >
> >   -object hostmem-memfd,size=128G,hugetlb=yes,hugetlbsize=1G,reserve=off
> >
> > and then we have another property set on 'virtio-mem' to tell it
> > how much/little of that 128 G, to actually give to the guest ?
> > How do we change that at runtime ?
>
> Roughly, yes. We only special-case memory backends managed by virtio-mem
> devices.
>
> An advanced example for a single VM could look like this:
>
> sudo build/qemu-system-x86_64 \
>     ... \
>     -m 4G,maxmem=64G \
>     -smp sockets=2,cores=2 \
>     -object hostmem-memfd,id=bmem0,size=2G,hugetlb=yes,hugetlbsize=2M \
>     -numa node,nodeid=0,cpus=0-1,memdev=bmem0 \
>     -object hostmem-memfd,id=bmem1,size=2G,hugetlb=yes,hugetlbsize=2M \
>     -numa node,nodeid=1,cpus=2-3,memdev=bmem1 \
>     ... \
>     -object hostmem-memfd,id=mem0,size=30G,hugetlb=yes,hugetlbsize=2M,reserve=off \
>     -device virtio-mem-pci,id=vmem0,memdev=mem0,node=0,requested-size=0G \
>     -object hostmem-memfd,id=mem1,size=30G,hugetlb=yes,hugetlbsize=2M,reserve=off \
>     -device virtio-mem-pci,id=vmem1,memdev=mem1,node=1,requested-size=0G \
>     ... \
>
> We can request a size change by adjusting the "requested-size" property
> (e.g., via qom-set) and observe the current size by reading the "size"
> property (e.g., qom-get). Think of it as an advanced device-local memory
> balloon mixed with the concept of a memory hotplug.
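For illustration, reusing the device id "vmem0" from the example above, a
resize could be requested from the monitor roughly like this (the exact
qom-set/qom-get syntax shown here, with a partial QOM path and a suffixed
size value, is an assumption and may need the full /machine/peripheral/vmem0
path on some setups):

  (qemu) qom-set vmem0 requested-size 16G
  (qemu) qom-get vmem0 size

The first command asks the virtio-mem device to grow towards 16 GiB, backed
by huge pages taken from the pool; the second reads back how much memory the
device currently provides to the guest.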
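The reserve=off part of those memory backends maps to the MAP_NORESERVE
behaviour quoted at the top of the thread. A minimal C sketch of that
mapping-level difference for a hugetlb-backed backend, assuming Linux with
glibc >= 2.27 for memfd_create() and MFD_HUGETLB (illustrative only, not
QEMU code):

/*
 * Illustrative sketch: create a hugetlb-backed memfd and map it with
 * MAP_NORESERVE, so no huge pages are reserved from the pool up front.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t size = 1UL << 30;  /* 1 GiB; must be a multiple of the huge page size */

    int fd = memfd_create("guest-ram", MFD_CLOEXEC | MFD_HUGETLB);
    if (fd < 0) {
        perror("memfd_create");
        return 1;
    }
    if (ftruncate(fd, size) < 0) {
        perror("ftruncate");
        return 1;
    }

    /*
     * Without MAP_NORESERVE this mmap() would reserve the whole 1 GiB worth
     * of huge pages (e.g., 512 x 2 MiB with a 2 MiB default huge page size)
     * and fail with ENOMEM if the pool cannot cover it.  With MAP_NORESERVE
     * it succeeds regardless; huge pages are only taken from the pool when
     * the memory is actually touched, and a later access can get SIGBUS if
     * the pool is exhausted by then.
     */
    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_NORESERVE, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(addr, 0, 2 * 1024 * 1024);  /* touching memory allocates pages now */
    printf("mapped %zu bytes without up-front huge page reservation\n", size);

    munmap(addr, size);
    close(fd);
    return 0;
}

In other words, reserve=on turns a shortage of huge pages into an mmap()
failure at setup time, while reserve=off defers the failure to the moment
the memory is actually used.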
Ok, so in this example, the initial 4 GB of RAM have the normal reserve=on,
so if there are insufficient hugepages we'll see the startup failure, IIUC.

What happens when we set qom-set requested-size=10GB at runtime, but there
are only 8 GB of hugepages left available ?

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|