On 05/09/2023 20:38, Peter Xu wrote:
> Migration bandwidth is a very important value for live migration.  It is
> one of the major factors we use to decide when to switch over to the
> destination in a precopy process.
>
> This value is currently estimated by QEMU during the whole live migration
> process by monitoring how fast we were sending the data.  This would be
> the most accurate bandwidth in an ideal world, where we're always feeding
> unlimited data to the migration channel and throughput is limited only by
> the bandwidth that is available.
>
> However in reality it may be very different, e.g., over a 10Gbps network we
> can see query-migrate showing a migration bandwidth of only a few tens of
> MB/s just because there are plenty of other things the migration thread
> might be doing.  For example, the migration thread can be busy scanning
> zero pages, or it can be fetching dirty bitmaps from other external dirty
> sources (like vhost or KVM).  It means we may not be pushing data as fast
> as possible to the migration channel, so the bandwidth estimated from "how
> much data we sent on the channel" can sometimes be dramatically inaccurate.
>
> That affects the decision to switch over: we may assume that we cannot
> switch over at all with such a low bandwidth, when in reality we can.
>
> With such a wrong estimation of bandwidth, the migration may not even
> converge with the specified downtime limit, keeping iterating forever.
>
> The issue is that QEMU itself may not be able to avoid those uncertainties
> when measuring the real "available migration bandwidth".  At least not in
> any way I can think of so far.
>
> One way to fix this is to let the user, who is fully aware of the
> available bandwidth, provide an accurate value.
> For example, if the user has a dedicated channel of 10Gbps for migration
> of this specific VM, the user can specify this bandwidth so QEMU can
> always do the calculation based on this fact, trusting the user-specified
> value.  It may not be the exact bandwidth when switching over (in which
> case QEMU will push migration data as fast as possible), but it is much
> better than QEMU's wild guess, especially when that guess is very wrong.
>
> A new parameter "avail-switchover-bandwidth" is introduced just for this.
> When the user specifies this parameter, instead of trusting the value
> estimated by QEMU itself (based on the QEMUFile send speed), QEMU trusts
> the user more by using this value to decide when to switch over, assuming
> that such bandwidth will be available then.
>
> Note that specifying this value will not throttle the bandwidth during
> switchover, so QEMU will always use the full bandwidth possible for
> sending switchover data, assuming that should always be the most important
> way to use the network at that time.
>
> This can resolve issues like a "non-converging migration", which is caused
> by a spuriously low "migration bandwidth" being detected for whatever
> reason.
>
> Reported-by: Zhiyi Guo <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
FWIW, Reviewed-by: Joao Martins <[email protected]>

> ---
>  qapi/migration.json            | 11 +++++++++++
>  migration/migration.h          |  2 +-
>  migration/options.h            |  2 ++
>  migration/migration-hmp-cmds.c | 14 ++++++++++++++
>  migration/migration.c          | 24 +++++++++++++++++++++---
>  migration/options.c            | 25 +++++++++++++++++++++++++
>  migration/trace-events         |  2 +-
>  7 files changed, 75 insertions(+), 5 deletions(-)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index eeb1878c4f..49c36ec9c0 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -766,6 +766,16 @@
>  # @max-bandwidth: to set maximum speed for migration.  maximum speed
>  #     in bytes per second.  (Since 2.8)
>  #
> +# @avail-switchover-bandwidth: to set the available bandwidth that
> +#     migration can use during switchover phase.  NOTE!  This does not
> +#     limit the bandwidth during switchover, but only for calculations when
> +#     making decisions to switchover.  By default, this value is zero,
> +#     which means QEMU will estimate the bandwidth automatically.  This can
> +#     be set when the estimated value is not accurate, while the user is
> +#     able to guarantee such bandwidth is available when switching over.
> +#     When specified correctly, this can make the switchover decision much
> +#     more accurate.  (Since 8.2)
> +#
>  # @downtime-limit: set maximum tolerated downtime for migration.
>  #     maximum downtime in milliseconds (Since 2.8)
>  #
> @@ -856,6 +866,7 @@
>              '*tls-hostname': 'StrOrNull',
>              '*tls-authz': 'StrOrNull',
>              '*max-bandwidth': 'size',
> +            '*avail-switchover-bandwidth': 'size',
>              '*downtime-limit': 'uint64',
>              '*x-checkpoint-delay': { 'type': 'uint32',
>                                       'features': [ 'unstable' ] },
> diff --git a/migration/migration.h b/migration/migration.h
> index 6eea18db36..ce910c1db2 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -283,7 +283,7 @@ struct MigrationState {
>      /*
>       * The final stage happens when the remaining data is smaller than
>       * this threshold; it's calculated from the requested downtime and
> -     * measured bandwidth
> +     * measured bandwidth, or avail-switchover-bandwidth if specified.
>       */
>      int64_t threshold_size;
>
> diff --git a/migration/options.h b/migration/options.h
> index 4591545c62..d78b437e82 100644
> --- a/migration/options.h
> +++ b/migration/options.h
> @@ -83,6 +83,7 @@ typedef enum {
>      MIGRATION_PARAMETER_TLS_HOSTNAME,
>      MIGRATION_PARAMETER_TLS_AUTHZ,
>      MIGRATION_PARAMETER_MAX_BANDWIDTH,
> +    MIGRATION_PARAMETER_AVAIL_SWITCHOVER_BANDWIDTH,
>      MIGRATION_PARAMETER_DOWNTIME_LIMIT,
>      MIGRATION_PARAMETER_X_CHECKPOINT_DELAY,
>      MIGRATION_PARAMETER_BLOCK_INCREMENTAL,
> @@ -128,6 +129,7 @@ int migrate_decompress_threads(void);
>  uint64_t migrate_downtime_limit(void);
>  uint8_t migrate_max_cpu_throttle(void);
>  uint64_t migrate_max_bandwidth(void);
> +uint64_t migrate_avail_switchover_bandwidth(void);
>  uint64_t migrate_max_postcopy_bandwidth(void);
>  int migrate_multifd_channels(void);
>  MultiFDCompression migrate_multifd_compression(void);
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index 0a35a87b7e..07a98a4636 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -327,6 +327,10 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>          monitor_printf(mon, "%s: %" PRIu64 " bytes/second\n",
>              MigrationParameter_str(MIGRATION_PARAMETER_MAX_BANDWIDTH),
>              params->max_bandwidth);
> +        assert(params->has_avail_switchover_bandwidth);
> +        monitor_printf(mon, "%s: %" PRIu64 " bytes/second\n",
> +            MigrationParameter_str(MIGRATION_PARAMETER_AVAIL_SWITCHOVER_BANDWIDTH),
> +            params->avail_switchover_bandwidth);
>          assert(params->has_downtime_limit);
>          monitor_printf(mon, "%s: %" PRIu64 " ms\n",
>              MigrationParameter_str(MIGRATION_PARAMETER_DOWNTIME_LIMIT),
> @@ -577,6 +581,16 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
>          }
>          p->max_bandwidth = valuebw;
>          break;
> +    case MIGRATION_PARAMETER_AVAIL_SWITCHOVER_BANDWIDTH:
> +        p->has_avail_switchover_bandwidth = true;
> +        ret = qemu_strtosz_MiB(valuestr, NULL, &valuebw);
> +        if (ret < 0 || valuebw > INT64_MAX
> +            || (size_t)valuebw != valuebw) {
> +            error_setg(&err, "Invalid size %s", valuestr);
> +            break;
> +        }
> +        p->avail_switchover_bandwidth = valuebw;
> +        break;
>      case MIGRATION_PARAMETER_DOWNTIME_LIMIT:
>          p->has_downtime_limit = true;
>          visit_type_size(v, param, &p->downtime_limit, &err);
> diff --git a/migration/migration.c b/migration/migration.c
> index 5528acb65e..85be4f019b 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2684,17 +2684,33 @@ static void migration_update_counters(MigrationState *s,
>  {
>      uint64_t transferred, transferred_pages, time_spent;
>      uint64_t current_bytes; /* bytes transferred since the beginning */
> +    uint64_t switchover_bw;
> +    /* Expected bandwidth when switching over to destination QEMU */
> +    double expected_bw_per_ms;
>      double bandwidth;
>
>      if (current_time < s->iteration_start_time + BUFFER_DELAY) {
>          return;
>      }
>
> +    switchover_bw = migrate_avail_switchover_bandwidth();
>      current_bytes = migration_transferred_bytes(s->to_dst_file);
>      transferred = current_bytes - s->iteration_initial_bytes;
>      time_spent = current_time - s->iteration_start_time;
>      bandwidth = (double)transferred / time_spent;
> -    s->threshold_size = bandwidth * migrate_downtime_limit();
> +
> +    if (switchover_bw) {
> +        /*
> +         * If the user specified a switchover bandwidth, let's trust the
> +         * user so that can be more accurate than what we estimated.
> +         */
> +        expected_bw_per_ms = switchover_bw / 1000;
> +    } else {
> +        /* If the user doesn't specify bandwidth, we use the estimated */
> +        expected_bw_per_ms = bandwidth;
> +    }
> +
> +    s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
>
>      s->mbps = (((double) transferred * 8.0) /
>                 ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
> @@ -2711,7 +2727,7 @@ static void migration_update_counters(MigrationState *s,
>      if (stat64_get(&mig_stats.dirty_pages_rate) &&
>          transferred > 10000) {
>          s->expected_downtime =
> -            stat64_get(&mig_stats.dirty_bytes_last_sync) / bandwidth;
> +            stat64_get(&mig_stats.dirty_bytes_last_sync) / expected_bw_per_ms;
>      }
>
>      migration_rate_reset(s->to_dst_file);
> @@ -2719,7 +2735,9 @@ static void migration_update_counters(MigrationState *s,
>      update_iteration_initial_status(s);
>
>      trace_migrate_transferred(transferred, time_spent,
> -                              bandwidth, s->threshold_size);
> +                              /* Both in unit bytes/ms */
> +                              bandwidth, switchover_bw / 1000,
> +                              s->threshold_size);
>  }
>
>  static bool migration_can_switchover(MigrationState *s)
> diff --git a/migration/options.c b/migration/options.c
> index c9b90d932d..65d0b58fa9 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -101,6 +101,7 @@ const char *MigrationParameter_string[MIGRATION_PARAMETER__MAX] = {
>      [MIGRATION_PARAMETER_TLS_HOSTNAME] = "tls-hostname",
>      [MIGRATION_PARAMETER_TLS_AUTHZ] = "tls-authz",
>      [MIGRATION_PARAMETER_MAX_BANDWIDTH] = "max-bandwidth",
> +    [MIGRATION_PARAMETER_AVAIL_SWITCHOVER_BANDWIDTH] = "avail-switchover-bandwidth",
>      [MIGRATION_PARAMETER_DOWNTIME_LIMIT] = "downtime-limit",
>      [MIGRATION_PARAMETER_X_CHECKPOINT_DELAY] = "x-checkpoint-delay",
>      [MIGRATION_PARAMETER_BLOCK_INCREMENTAL] = "block-incremental",
> @@ -176,6 +177,8 @@ Property migration_properties[] = {
>                       parameters.cpu_throttle_tailslow, false),
>      DEFINE_PROP_SIZE("x-max-bandwidth", MigrationState,
>                       parameters.max_bandwidth, MAX_THROTTLE),
> +    DEFINE_PROP_SIZE("avail-switchover-bandwidth", MigrationState,
> +                     parameters.avail_switchover_bandwidth, 0),
>      DEFINE_PROP_UINT64("x-downtime-limit", MigrationState,
>                         parameters.downtime_limit,
>                         DEFAULT_MIGRATE_SET_DOWNTIME),
> @@ -868,6 +871,13 @@ uint64_t migrate_max_bandwidth(void)
>      return s->parameters.max_bandwidth;
>  }
>
> +uint64_t migrate_avail_switchover_bandwidth(void)
> +{
> +    MigrationState *s = migrate_get_current();
> +
> +    return s->parameters.avail_switchover_bandwidth;
> +}
> +
>  uint64_t migrate_max_postcopy_bandwidth(void)
>  {
>      MigrationState *s = migrate_get_current();
> @@ -1004,6 +1014,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
>      params->tls_authz = QAPI_CLONE(StrOrNull, s->parameters.tls_authz);
>      params->has_max_bandwidth = true;
>      params->max_bandwidth = s->parameters.max_bandwidth;
> +    params->has_avail_switchover_bandwidth = true;
> +    params->avail_switchover_bandwidth = s->parameters.avail_switchover_bandwidth;
>      params->has_downtime_limit = true;
>      params->downtime_limit = s->parameters.downtime_limit;
>      params->has_x_checkpoint_delay = true;
> @@ -1144,6 +1156,15 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
>          return false;
>      }
>
> +    if (params->has_avail_switchover_bandwidth &&
> +        (params->avail_switchover_bandwidth > SIZE_MAX)) {
> +        error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
> +                   "avail_switchover_bandwidth",
> +                   "an integer in the range of 0 to "stringify(SIZE_MAX)
> +                   " bytes/second");
> +        return false;
> +    }
> +
>      if (params->has_downtime_limit &&
>          (params->downtime_limit > MAX_MIGRATE_DOWNTIME)) {
>          error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
> @@ -1320,6 +1341,10 @@ static void migrate_params_apply(MigrationParameters *params, Error **errp)
>          }
>      }
>
> +    if (params->has_avail_switchover_bandwidth) {
> +        s->parameters.avail_switchover_bandwidth = params->avail_switchover_bandwidth;
> +    }
> +
>      if (params->has_downtime_limit) {
>          s->parameters.downtime_limit = params->downtime_limit;
>      }
> diff --git a/migration/trace-events b/migration/trace-events
> index 4666f19325..ebd88de3d0 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -185,7 +185,7 @@ source_return_path_thread_shut(uint32_t val) "0x%x"
>  source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
>  source_return_path_thread_switchover_acked(void) ""
>  migration_thread_low_pending(uint64_t pending) "%" PRIu64
> -migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
> +migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %" PRId64
>  process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
>  process_incoming_migration_co_postcopy_end_main(void) ""
>  postcopy_preempt_enabled(bool value) "%d"
