Hi Akihiko, It took some time to set up the machines; apologies for the delay in response.
On Mon, Sep 01, 2025 at 02:12:54AM +0900, Akihiko Odaki wrote: > On 2025/09/01 1:38, Arun Menon wrote: > > Hi, > > > > On Mon, Sep 01, 2025 at 01:04:40AM +0900, Akihiko Odaki wrote: > > > On 2025/09/01 0:45, Arun Menon wrote: > > > > Hi Akihiko, > > > > Thanks for the review. > > > > > > > > On Sat, Aug 30, 2025 at 02:58:05PM +0900, Akihiko Odaki wrote: > > > > > On 2025/08/30 5:01, Arun Menon wrote: > > > > > > This is an incremental step in converting vmstate loading > > > > > > code to report error via Error objects instead of directly > > > > > > printing it to console/monitor. > > > > > > It is ensured that qemu_loadvm_state() must report an error > > > > > > in errp, in case of failure. > > > > > > > > > > > > When postcopy live migration runs, the device states are loaded by > > > > > > both the qemu coroutine process_incoming_migration_co() and the > > > > > > postcopy_ram_listen_thread(). Therefore, it is important that the > > > > > > coroutine also reports the error in case of failure, with > > > > > > error_report_err(). Otherwise, the source qemu will not display > > > > > > any errors before going into the postcopy pause state. > > > > > > > > > > > > Reviewed-by: Marc-André Lureau <[email protected]> > > > > > > Reviewed-by: Fabiano Rosas <[email protected]> > > > > > > Signed-off-by: Arun Menon <[email protected]> > > > > > > --- > > > > > > migration/migration.c | 9 +++++---- > > > > > > migration/savevm.c | 30 ++++++++++++++++++------------ > > > > > > migration/savevm.h | 2 +- > > > > > > 3 files changed, 24 insertions(+), 17 deletions(-) > > > > > > > > > > > > diff --git a/migration/migration.c b/migration/migration.c > > > > > > index > > > > > > 10c216d25dec01f206eacad2edd24d21f00e614c..c6768d88f45c870c7fad9b9957300766ff69effc > > > > > > 100644 > > > > > > --- a/migration/migration.c > > > > > > +++ b/migration/migration.c > > > > > > @@ -881,7 +881,7 @@ process_incoming_migration_co(void *opaque) > > > > > > MIGRATION_STATUS_ACTIVE); > > > > > > mis->loadvm_co = qemu_coroutine_self(); > > > > > > - ret = qemu_loadvm_state(mis->from_src_file); > > > > > > + ret = qemu_loadvm_state(mis->from_src_file, &local_err); > > > > > > mis->loadvm_co = NULL; > > > > > > > > > > > > trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed"); > > > > > > @@ -908,7 +908,8 @@ process_incoming_migration_co(void *opaque) > > > > > > } > > > > > > if (ret < 0) { > > > > > > - error_setg(&local_err, "load of migration failed: %s", > > > > > > strerror(-ret)); > > > > > > + error_prepend(&local_err, "load of migration failed: %s: ", > > > > > > + strerror(-ret)); > > > > > > goto fail; > > > > > > } > > > > > > @@ -924,13 +925,13 @@ fail: > > > > > > migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE, > > > > > > MIGRATION_STATUS_FAILED); > > > > > > migrate_set_error(s, local_err); > > > > > > - error_free(local_err); > > > > > > + error_report_err(local_err); > > > > > > > > > > This is problematic because it results in duplicate error reports when > > > > > !mis->exit_on_error; in that case the query-migrate QMP command > > > > > reports the > > > > > error and this error reporting is redundant. > > > > > > > > If I comment this change, then all of the errors propagated up to now, > > > > using > > > > error_setg() will not be reported. This is the place where it is > > > > finally reported, > > > > when qemu_loadvm_state() fails. In other words, all the error_reports() > > > > we removed > > > > from all the files, replacing them with error_setg(), will finally be > > > > reported here > > > > using error_report_err(). > > > > > > My understanding of the code without these two changes is: > > > - If the migrate-incoming QMP command is used with false as > > > exit-on-error, this function will not report the error but > > > the query-migrate QMP command will report the error. > > > - Otherwise, this function reports the error. > > > > With my limited experience in testing, I have a question, > > So there are 2 scenarios, > > 1. running the virsh migrate command on the source host. Something like the > > following, > > virsh -c 'qemu:///system' migrate --live --verbose --domain guest-vm > > --desturi qemu+ssh://10.6.120.20/system > > OR for postcopy-ram, > > virsh migrate guest-vm --live qemu+ssh://10.6.120.20/system --verbose > > --postcopy --timeout 10 --timeout-postcopy > > > > 2. Using QMP commands, performing a migration from source to destination. > > Running something like the following on the destination: > > { > > "execute": "migrate-incoming", > > "arguments": { > > "uri": "tcp:127.0.0.1:7777", > > "exit-on-error": false > > } > > } > > { > > "execute": "migrate-incoming", > > "arguments": { > > "uri": "tcp:127.0.0.1:7777", > > "exit-on-error": false > > } > > } > > and the somthing like the following on source: > > { > > "execute": "migrate", > > "arguments": { > > "uri": "tcp:127.0.0.1:7777" > > } > > } > > {"execute" : "query-migrate"} > > > > In 1, previously, the user used to get an error message on migration > > failure. > > This was because there were error_report() calls in all of the files. > > Now that they are replaced with error_setg() and the error is stored in > > errp, > > we need to display that using error_report_err(). Hence I introduced an > > error_report_err() > > call in the fail section. > > > > In 2, we have 2 QMP sessions, one for the source and another for the > > destination. > > The QMP command migrate will be issued on the source, and the errp will be > > set. > > I did not understand the part where the message will be displayed because > > of the > > error_report_err() call. I did not see such a message on failure scenario > > on both > > the sessions. > > If the user wants to check for errors, then the destination qemu will not > > exit > > (exit-on-error = false ) and we can retrieve it using {"execute" : > > "query-migrate"} > > > > Aren't the 2 scenarios different by nature? > > In 1, doesn't libvirt query the error with query-migrate and print it? Ideally it should find the the error, and print the whole thing. It does work in the normal scenario. However, the postcopy scenario does not show the same result, which is mentioned in the commit message. > > In any case, it would be nice if you describe how libvirt interacts with > QEMU in 1. Please find below the difference in the command output at source, when we run a live migration with postcopy enabled. ========= With the current changes: [root@dell-per750-42 qemu-priv]# virsh migrate-setspeed guest-vm 1 [root@dell-per750-42 build]# virsh migrate guest-vm --live qemu+ssh://10.6.120.9/system --verbose --postcopy --timeout 10 --timeout-postcopy [email protected]'s password: Migration: [ 1.26 %]error: internal error: QEMU unexpectedly closed the monitor (vm='guest-vm'): 2025-09-03T06:19:15.076547Z qemu-system-x86_64: -accel kvm: warning: Number of SMP cpus requested (2) exceeds the recommended cpus supported by KVM (1) 2025-09-03T06:19:15.076586Z qemu-system-x86_64: -accel kvm: warning: Number of hotpluggable cpus requested (2) exceeds the recommended cpus supported by KVM (1) 2025-09-03T06:19:27.776715Z qemu-system-x86_64: load of migration failed: Input/output error: error while loading state for instance 0x0 of device 'tpm-emulator': post load hook failed for: tpm-emulator, version_id: 0, minimum_version: 0, ret: -5: tpm-emulator: Setting the stateblob (type 1) failed with a TPM error 0x21 decryption error [root@dell-per750-42 build]# ========= Without the current changes: [root@dell-per750-42 qemu-priv]# virsh migrate-setspeed guest-vm 1 [root@dell-per750-42 qemu-priv]# virsh migrate guest-vm --live qemu+ssh://10.6.120.9/system --verbose --postcopy --timeout 10 --timeout-postcopy [email protected]'s password: Migration: [ 1.28 %]error: internal error: QEMU unexpectedly closed the monitor (vm='guest-vm'): 2025-09-03T06:26:17.733786Z qemu-system-x86_64: -accel kvm: warning: Number of SMP cpus requested (2) exceeds the recommended cpus supported by KVM (1) 2025-09-03T06:26:17.733830Z qemu-system-x86_64: -accel kvm: warning: Number of hotpluggable cpus requested (2) exceeds the recommended cpus supported by KVM (1) [root@dell-per750-42 qemu-priv]# ========= The original behavior was to print the error to the console regardless of whether the migration is normal or postcopy. The source machine goes in to a paused state after this. > > Regards, > Akihiko Odaki > Regards, Arun Menon
