On 25/06/2024 15:53, Peter Xu wrote:
> On Tue, Jun 25, 2024 at 12:38:50PM +0100, Joao Martins wrote:
>> On 24/06/2024 20:41, Peter Xu wrote:
>>> On Fri, Jun 21, 2024 at 07:32:21AM -0700, Elena Ufimtseva wrote:
>>>> @@ -2659,6 +2698,18 @@ qemu_loadvm_section_start_full(QEMUFile *f, MigrationIncomingState *mis,
>>>>          if (!check_section_footer(f, se)) {
>>>>              return -EINVAL;
>>>> @@ -2714,6 +2765,19 @@ qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis,
>>>>                            se->instance_id, end_ts - start_ts);
>>>>      }
>>>>
>>>> +    if (migrate_switchover_abort() && type == QEMU_VM_SECTION_END &&
>>>> +        mis->downtime_start) {
>>>> +        mis->downtime_now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>>> +        uint64_t dst_downtime = mis->downtime_now - mis->downtime_start;
>>>> +        if (mis->src_downtime + dst_downtime >= mis->abort_limit) {
>>>> +            error_report("Shutdown destination migration, migration abort_limit (%lu ms) "
>>>> +                         "was reached.", mis->abort_limit);
>>>> +            trace_qemu_loadvm_downtime_abort(mis->abort_limit, dst_downtime,
>>>> +                                             mis->src_downtime);
>>>> +            return -EINVAL;
>>>> +        }
>>>> +    }
>>>
>>> So this traps both iteration and non-iteration phase. What if the downtime
>>> was caused by after these, or unluckily at the last non-iterator device
>>> state?
>>>
>>> After trying to think about it, I figured this is not easy to do right.
>>> Also, I start to doubt whether it's definitely a good idea on having this
>>> in the first place.
>>>
>>> Firstly, I'm wondering how we should treat this new feature
>>> v.s. downtime_limit. It's about how the user should set both.
>>>
>>> If this is about "cancel migration if downtime more than xxx",
>>> then.. AFAICT that's exactly what downtime_limit was "designed" to be..
>>> It's just that downtime_limit says the other way round, as: "don't
>>> switchover if the downtime will be more than xxx".
>>>
>>> Then, if the user has option to set both these parameters, what would be
>>> the right thing to do? Shouldn't they simply always set both parameters to
>>> be the same value already? But if so, what's the point on having two?
>>>
>>> This caused me to think whether the 2nd parameter is needed at all, instead
>>> whether we should simply make downtime_limit more accurate, so that it will
>>> start to account more things than before. It won't be accurate, but the
>>> same case here: making this new feature "accurate" can also be very hard.
>>>
>>
>> The way I think about it is that the current downtime-limit nicely captures
>> the data part, as the only calculation it cares about is how much outstanding
>> data it sends to the destination (be it VF device state or RAM). This second
>> parameter captures what is *not* related to data, i.e. the cost of the
>> hypervisor quiescing the VM, or added latencies in the hypervisor, *in
>> addition* to sending the outstanding data to the destination.
>>
>> If we were to merge this all into a single parameter (aka downtime-limit) we
>> would possibly be artificially increasing the downtime by relaxing the
>> outstanding-data part, as opposed to trying to capture the switchover cost
>> (hence the name of the parameter), which can't be reflected in the former
>> equation.
>>
>> So I feel this needs two parameters, whereby the second new parameter covers
>> the 'switchover cost' (hence the name) under the current migration algorithm.
>
> Then the question is how should we suggest the user to specify these two
> parameters.
>
> The cover letter used:
>
> migrate_set_parameter downtime-limit 300
> migrate_set_parameter switchover-limit 10
>
> But it does look like a testing sample only and not valid in real world
> setup, as logically the new limit should be larger than the old one,
> afaict. If the new limit is larger, how larger should it be?
>
> So far I can understand how it works internally, but even with that I don't
> know how I should set this parameter, e.g., when downtime_limit used to be
> 300ms, I'd say 500ms could be a suitable value, but is it? In that case,
> perhaps the 500ms is the "real" limit that I don't want a VM to be halted
> for more than 500ms, but in that case the user should have already setup
> downtime_limit to 500ms.
>
Yeap -- I think you're right that it presents a UX confusion over what a user
should set.
> I also don't know how should an user understand all these details and
> figure out how to set these. Note that currently we definitely allow the
> user to specify downtime_limit and it's so far understandable, even though
> the user may assume that contains the whole blackout phase (that's also
> what we wish it can achieve..), rather than the fact that it only takes
> ram/vfio/... reported remaining data into account v.s. the bandwidth.
>
> Maybe we could consider introducing the parameters/caps in some other form,
> so that we can keep the downtime_limit to be "the total downtime", instead
> of a new definition of downtime. E.g., it's at least not fair indeed to
> take the whole "downtime_limit" for data transfers, so maybe some ratio
> parameter can be introduced to say how much portion of that downtime can be
> used for data transfer
I like this idea -- it also fixes another problem, namely that downtime-limit
(on its own) makes the algorithm too optimistic and just utilize the whole
bandwidth (e.g. migrating right after the first dirty sync, depending on
downtime-limit). We could use e.g. "iterable-downtime" (or "precopy-downtime",
for lack of a better name) for such a property.
Then, when we set downtime-limit, it can represent the whole thing including
blackout, as is the case today, and we configure it by minimizing
iterable-downtime to give enough headroom for the switchover.
>, and then it might be ok to work with the new cap
> introduced so that when total downtime is over the limit it will abort the
> migration (but without a new "max downtime" definition which might be
> confusing).
>
Yes, improving 'downtime-limit' with a sub-parameter as above for RAM/state
downtime. The migration capability then becomes a boolean instead of a value,
where it's more like 'downtime-limit-strict', and going above the downtime
limit is enforced/aborted just like in these patches. To prevent essentially
this:
> IOW, I wonder whether there can
> be case where user specifies these parameters, migration simply keeps
> switching over and keep aborting, requiring a retry. That's pretty
> unwanted. We may simply doesn't allow that switchover to happen at all.
(...)
> Said that, I also wonder whether you have thought about improving
> downtime_limit itself. It'll be definitely better if we simply don't
> switchover at all if that won't make it.
The device-state multifd scaling is one take on improving the switchover phase,
and we will keep improving it whenever we find things... but the switchover
itself can't be 'precomputed' ahead of time into a downtime equation that
encompasses all possible latencies/costs. That's part of the reason we couldn't
think of a way besides this proposal here, which at its core is meant to
bounds-check the switchover. Even without taking VFs/HW into account [0], how
long the switchover might take is simply not considered today, and giving some
sort of downtime buffer, coupled with enforcement, helps avoid violating
migration SLAs.
[0]
https://lore.kernel.org/qemu-devel/[email protected]/