Hi Akihiko,

On Sat, Sep 06, 2025 at 05:22:31AM +0200, Akihiko Odaki wrote:
> On 2025/09/03 8:47, Arun Menon wrote:
> > Hi Akihiko,
> >
> > It took some time to set up the machines; apologies for the delay in
> > response.
> >
> > On Mon, Sep 01, 2025 at 02:12:54AM +0900, Akihiko Odaki wrote:
> > > On 2025/09/01 1:38, Arun Menon wrote:
> > > > Hi,
> > > >
> > > > On Mon, Sep 01, 2025 at 01:04:40AM +0900, Akihiko Odaki wrote:
> > > > > On 2025/09/01 0:45, Arun Menon wrote:
> > > > > > Hi Akihiko,
> > > > > > Thanks for the review.
> > > > > >
> > > > > > On Sat, Aug 30, 2025 at 02:58:05PM +0900, Akihiko Odaki wrote:
> > > > > > > On 2025/08/30 5:01, Arun Menon wrote:
> > > > > > > > This is an incremental step in converting vmstate loading
> > > > > > > > code to report error via Error objects instead of directly
> > > > > > > > printing it to console/monitor.
> > > > > > > > It is ensured that qemu_loadvm_state() must report an error
> > > > > > > > in errp, in case of failure.
> > > > > > > >
> > > > > > > > When postcopy live migration runs, the device states are loaded by
> > > > > > > > both the qemu coroutine process_incoming_migration_co() and the
> > > > > > > > postcopy_ram_listen_thread(). Therefore, it is important that the
> > > > > > > > coroutine also reports the error in case of failure, with
> > > > > > > > error_report_err(). Otherwise, the source qemu will not display
> > > > > > > > any errors before going into the postcopy pause state.
> > > > > > > >
> > > > > > > > Reviewed-by: Marc-André Lureau <[email protected]>
> > > > > > > > Reviewed-by: Fabiano Rosas <[email protected]>
> > > > > > > > Signed-off-by: Arun Menon <[email protected]>
> > > > > > > > ---
> > > > > > > >  migration/migration.c |  9 +++++----
> > > > > > > >  migration/savevm.c    | 30 ++++++++++++++++++------------
> > > > > > > >  migration/savevm.h    |  2 +-
> > > > > > > >  3 files changed, 24 insertions(+), 17 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > > > > index 10c216d25dec01f206eacad2edd24d21f00e614c..c6768d88f45c870c7fad9b9957300766ff69effc 100644
> > > > > > > > --- a/migration/migration.c
> > > > > > > > +++ b/migration/migration.c
> > > > > > > > @@ -881,7 +881,7 @@ process_incoming_migration_co(void *opaque)
> > > > > > > >                        MIGRATION_STATUS_ACTIVE);
> > > > > > > >      mis->loadvm_co = qemu_coroutine_self();
> > > > > > > > -    ret = qemu_loadvm_state(mis->from_src_file);
> > > > > > > > +    ret = qemu_loadvm_state(mis->from_src_file, &local_err);
> > > > > > > >      mis->loadvm_co = NULL;
> > > > > > > >
> > > > > > > >      trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
> > > > > > > > @@ -908,7 +908,8 @@ process_incoming_migration_co(void *opaque)
> > > > > > > >      }
> > > > > > > >      if (ret < 0) {
> > > > > > > > -        error_setg(&local_err, "load of migration failed: %s", strerror(-ret));
> > > > > > > > +        error_prepend(&local_err, "load of migration failed: %s: ",
> > > > > > > > +                      strerror(-ret));
> > > > > > > >          goto fail;
> > > > > > > >      }
> > > > > > > > @@ -924,13 +925,13 @@ fail:
> > > > > > > >      migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
> > > > > > > >                        MIGRATION_STATUS_FAILED);
> > > > > > > >      migrate_set_error(s, local_err);
> > > > > > > > -    error_free(local_err);
> > > > > > > > +    error_report_err(local_err);
> > > > > > >
> > > > > > > This is problematic because it
> > > > > > > results in duplicate error reports when
> > > > > > > !mis->exit_on_error; in that case the query-migrate QMP command reports the
> > > > > > > error and this error reporting is redundant.
> > > > > >
> > > > > > If I comment out this change, then all of the errors propagated up to now, using
> > > > > > error_setg() will not be reported. This is the place where it is finally reported,
> > > > > > when qemu_loadvm_state() fails. In other words, all the error_reports() we removed
> > > > > > from all the files, replacing them with error_setg(), will finally be reported here
> > > > > > using error_report_err().
> > > > >
> > > > > My understanding of the code without these two changes is:
> > > > > - If the migrate-incoming QMP command is used with false as
> > > > >   exit-on-error, this function will not report the error but
> > > > >   the query-migrate QMP command will report the error.
> > > > > - Otherwise, this function reports the error.
> > > >
> > > > With my limited experience in testing, I have a question,
> > > > So there are 2 scenarios,
> > > > 1. running the virsh migrate command on the source host. Something like the following,
> > > >    virsh -c 'qemu:///system' migrate --live --verbose --domain guest-vm --desturi qemu+ssh://10.6.120.20/system
> > > >    OR for postcopy-ram,
> > > >    virsh migrate guest-vm --live qemu+ssh://10.6.120.20/system --verbose --postcopy --timeout 10 --timeout-postcopy
> > > >
> > > > 2. Using QMP commands, performing a migration from source to destination.
> > > >    Running something like the following on the destination:
> > > >    {
> > > >      "execute": "migrate-incoming",
> > > >      "arguments": {
> > > >        "uri": "tcp:127.0.0.1:7777",
> > > >        "exit-on-error": false
> > > >      }
> > > >    }
> > > >    and something like the following on source:
> > > >    {
> > > >      "execute": "migrate",
> > > >      "arguments": {
> > > >        "uri": "tcp:127.0.0.1:7777"
> > > >      }
> > > >    }
> > > >    {"execute" : "query-migrate"}
> > > >
> > > > In 1, previously, the user used to get an error message on migration failure.
> > > > This was because there were error_report() calls in all of the files.
> > > > Now that they are replaced with error_setg() and the error is stored in errp,
> > > > we need to display that using error_report_err(). Hence I introduced an error_report_err()
> > > > call in the fail section.
> > > >
> > > > In 2, we have 2 QMP sessions, one for the source and another for the destination.
> > > > The QMP command migrate will be issued on the source, and the errp will be set.
> > > > I did not understand the part where the message will be displayed because of the
> > > > error_report_err() call. I did not see such a message on failure scenario on both
> > > > the sessions.
> > > > If the user wants to check for errors, then the destination qemu will not exit
> > > > (exit-on-error = false) and we can retrieve it using {"execute" : "query-migrate"}
> > > >
> > > > Aren't the 2 scenarios different by nature?
> > >
> > > In 1, doesn't libvirt query the error with query-migrate and print it?
> >
> > Ideally it should find the error, and print the whole thing. It does work
> > in the normal scenario. However, the postcopy scenario does not show the same result,
> > which is mentioned in the commit message.
> >
> > >
> > > In any case, it would be nice if you describe how libvirt interacts with
> > > QEMU in 1.
> >
> > Please find below the difference in the command output at source, when we run a live migration
> > with postcopy enabled.
> >
> > =========
> > With the current changes:
> > [root@dell-per750-42 qemu-priv]# virsh migrate-setspeed guest-vm 1
> >
> > [root@dell-per750-42 build]# virsh migrate guest-vm --live qemu+ssh://10.6.120.9/system --verbose --postcopy --timeout 10 --timeout-postcopy
> > [email protected]'s password:
> > Migration: [ 1.26 %]error: internal error: QEMU unexpectedly closed the monitor (vm='guest-vm'): 2025-09-03T06:19:15.076547Z qemu-system-x86_64: -accel kvm: warning: Number of SMP cpus requested (2) exceeds the recommended cpus supported by KVM (1)
> > 2025-09-03T06:19:15.076586Z qemu-system-x86_64: -accel kvm: warning: Number of hotpluggable cpus requested (2) exceeds the recommended cpus supported by KVM (1)
> > 2025-09-03T06:19:27.776715Z qemu-system-x86_64: load of migration failed: Input/output error: error while loading state for instance 0x0 of device 'tpm-emulator': post load hook failed for: tpm-emulator, version_id: 0, minimum_version: 0, ret: -5: tpm-emulator: Setting the stateblob (type 1) failed with a TPM error 0x21 decryption error
> >
> > [root@dell-per750-42 build]#
> >
> > =========
> >
> > Without the current changes:
> > [root@dell-per750-42 qemu-priv]# virsh migrate-setspeed guest-vm 1
> >
> > [root@dell-per750-42 qemu-priv]# virsh migrate guest-vm --live qemu+ssh://10.6.120.9/system --verbose --postcopy --timeout 10 --timeout-postcopy
> > [email protected]'s password:
> > Migration: [ 1.28 %]error: internal error: QEMU unexpectedly closed the monitor (vm='guest-vm'): 2025-09-03T06:26:17.733786Z qemu-system-x86_64: -accel kvm: warning: Number of SMP cpus requested (2) exceeds the recommended cpus supported by KVM (1)
> > 2025-09-03T06:26:17.733830Z qemu-system-x86_64: -accel kvm: warning: Number of hotpluggable cpus requested (2) exceeds the recommended cpus supported by KVM (1)
> >
> > [root@dell-per750-42 qemu-priv]#
> >
> > =========
> > The original behavior was to print the error to the console regardless of
> > whether the migration is normal or postcopy.
>
> This was true for messages in qemu_loadvm_state(), but the message "load of
> migration failed" was printed or queried with query-migrate, not both. We
> should think of which behavior is more appropriate, and I think we should
> avoid duplicate reports.
>
> > The source machine goes into a paused state after this.
>
> The output is informative. It implies the destination machine exited, and it
> makes sense to print error messages as it is done for
> mis->exit_on_error. I wonder if it is possible to detect the condition and
> treat it identically to mis->exit_on_error.

I see that we want to catch a specific scenario in postcopy RAM migration,
where the destination exits abruptly without a graceful shutdown and so
cannot tell the source why it failed through 'query-migrate', even though
'exit-on-error' was set to false on the destination. However, I am not sure
how to reliably detect the kind of connection close that you have described.

Given that this is already a large patch series, could we keep the current
change as is for now? From what I can tell, the additional log message
"load of migration failed" is not a breaking change and will not cause a
crash. We can develop a more elegant solution to the duplicate reporting in
a separate patch.

>
> Regards,
> Akihiko Odaki
>

Regards,
Arun Menon
