On 24.02.20 23:49, Peter Xu wrote:
> On Fri, Feb 21, 2020 at 05:42:04PM +0100, David Hildenbrand wrote:
>> When we partially change mappings (esp., mmap over parts of an existing
>> mmap like qemu_ram_remap() does) where we have a userfaultfd handler
>> registered, the handler will implicitly be unregistered from the parts that
>> changed.
>>
>> Trying to place pages onto mappings where there is no longer a handler
>> registered will fail. Let's make sure that any waiter is woken up - we
>> have to do that manually.
>>
>> Let's also document how UFFDIO_UNREGISTER will handle this scenario.
>>
>> This is mainly a preparation for RAM blocks with resizable allcoations,
>> where the mapping of the invalid RAM range will change. The source will
>> keep sending pages that are outside of the new (shrunk) RAM size. We have
>> to treat these pages like they would have been migrated, but can
>> essentially simply drop the content (ignore the placement error).
>>
>> Keep printing a warning on EINVAL, to avoid hiding other (programming)
>> issues. ENOENT is unique.
>>
>> Cc: "Dr. David Alan Gilbert" <[email protected]>
>> Cc: Juan Quintela <[email protected]>
>> Cc: Peter Xu <[email protected]>
>> Cc: Andrea Arcangeli <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> migration/postcopy-ram.c | 37 +++++++++++++++++++++++++++++++++++++
>> 1 file changed, 37 insertions(+)
>>
>> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
>> index c68caf4e42..f023830b9a 100644
>> --- a/migration/postcopy-ram.c
>> +++ b/migration/postcopy-ram.c
>> @@ -506,6 +506,12 @@ static int cleanup_range(RAMBlock *rb, void *opaque)
>> range_struct.start = (uintptr_t)host_addr;
>> range_struct.len = length;
>>
>> + /*
>> + * In case the mapping was partially changed since we enabled userfault
>> + * (e.g., via qemu_ram_remap()), the userfaultfd handler was already
>> removed
>> + * for the mappings that changed. Unregistering will, however, still
>> work
>> + * and ignore mappings without a registered handler.
>> + */
>
> Ideally we should still only unregister what we have registered.
> After all we do have this information because we know what we
> registered, we know what has unmapped (in your new resize() hook, when
> postcopy_state==RUNNING).
Not in the case of qemu_ram_remap(). And whatever you propose will
require synchronization (see my other mail) and more complicated
handling than this. uffd allows you to handle races with mmap changes in
a very elegant way (e.g., -ENOENT, or unregisterignoring changed mappings).
>
> An extreme example is when we register with pages in range [A, B),
> then shrink it to [A, C), then we mapped something else within [C, B)
> (note, with virtio-mem logically B can be very big and C can be very
> small, it means [B, C) can cover quite some address space). Then if:
>
> - [C, B) memory type is not compatible with uffd, or
That will never happen in the near future. Without resizable allocations:
- All memory is either anonymous or from a single fd
In addition, right now, only anonymous memory can be used for resizable
RAM. However, with resizable allocations we could have:
- All used_length memory is either anonymous or from a single fd
- All remaining memory is either anonymous or from a single fd
Everything else does not make any sense IMHO and I don't think this is
relevant long term. You cannot arbitrarily map things into the
used_length part of a RAMBlock. That would contradict to its page_size
and its fd. E.g., you would break qemu_ram_remap().
>
> - [C, B) could be registered with uffd again due to some other
> reason (so far QEMU should not have such a reason)
Any code that wants to make use of uffd properly has to synchronize
against postcopy code either way IMHO. It just doesn't work otherwise.
E.g., once I would use it to protect unplugged memory in virtio-mem
(something I am looking into right now and teaching QEMU not to touch
all RAMBlock memory is complicated :) ), virtio-mem would unregister any
uffd handler when notified that postcopy will start, and re-register
after postcopy finished.
[...]
>>
>> +static int qemu_ufd_wake_ioctl(int userfault_fd, void *host_addr,
>> + uint64_t pagesize)
>> +{
>> + struct uffdio_range range = {
>> + .start = (uint64_t)(uintptr_t)host_addr,
>> + .len = pagesize,
>> + };
>> +
>> + return ioctl(userfault_fd, UFFDIO_WAKE, &range);
>> +}
>> +
>> static int qemu_ufd_copy_ioctl(int userfault_fd, void *host_addr,
>> void *from_addr, uint64_t pagesize, RAMBlock
>> *rb)
>> {
>> @@ -1198,6 +1215,26 @@ static int qemu_ufd_copy_ioctl(int userfault_fd, void
>> *host_addr,
>> zero_struct.mode = 0;
>> ret = ioctl(userfault_fd, UFFDIO_ZEROPAGE, &zero_struct);
>> }
>> +
>> + /*
>> + * When the mapping gets partially changed (e.g., qemu_ram_remap())
>> before
>> + * we try to place a page, the userfaultfd handler will be removed for
>> the
>> + * changed mappings and placing pages will fail. We can safely ignore
>> this,
>> + * because mappings that changed on the destination don't need data
>> from the
>> + * source (e.g., qemu_ram_remap()). Wake up any waiter waiting for that
>> page
>> + * (unlikely but possible). Waking up waiters is always possible, even
>> + * without a registered userfaultfd handler.
>> + *
>> + * Old kernels report EINVAL, new kernels report ENOENT in case there is
>> + * no longer a userfaultfd handler for a mapping.
>> + */
>> + if (ret && (errno == ENOENT || errno == EINVAL)) {
>> + if (errno == EINVAL) {
>> + warn_report("%s: Failed to place page %p. Waking up any
>> waiters.",
>> + __func__, host_addr);
>> + }
>> + ret = qemu_ufd_wake_ioctl(userfault_fd, host_addr, pagesize);
>
> ... if with above information (takes notes on where we registered
> uffd), I think we don't need to capture error, but we can simply skip
> those outliers.
I think we could skip them in general. Nobody should be touching that
memory. But debugging e.g., a SEGFAULT is easier than debugging some
sleeping thread IMHO.
--
Thanks,
David / dhildenb