Hi Akihiko,

On Sat, Sep 06, 2025 at 05:22:31AM +0200, Akihiko Odaki wrote:
> On 2025/09/03 8:47, Arun Menon wrote:
> > Hi Akihiko,
> > 
> > It took some time to set up the machines; apologies for the delay in 
> > response.
> > 
> > On Mon, Sep 01, 2025 at 02:12:54AM +0900, Akihiko Odaki wrote:
> > > On 2025/09/01 1:38, Arun Menon wrote:
> > > > Hi,
> > > > 
> > > > On Mon, Sep 01, 2025 at 01:04:40AM +0900, Akihiko Odaki wrote:
> > > > > On 2025/09/01 0:45, Arun Menon wrote:
> > > > > > Hi Akihiko,
> > > > > > Thanks for the review.
> > > > > > 
> > > > > > On Sat, Aug 30, 2025 at 02:58:05PM +0900, Akihiko Odaki wrote:
> > > > > > > On 2025/08/30 5:01, Arun Menon wrote:
> > > > > > > > This is an incremental step in converting vmstate loading
> > > > > > > > code to report error via Error objects instead of directly
> > > > > > > > printing it to console/monitor.
> > > > > > > > It is ensured that qemu_loadvm_state() must report an error
> > > > > > > > in errp, in case of failure.
> > > > > > > > 
> > > > > > > > When postcopy live migration runs, the device states are loaded 
> > > > > > > > by
> > > > > > > > both the qemu coroutine process_incoming_migration_co() and the
> > > > > > > > postcopy_ram_listen_thread(). Therefore, it is important that 
> > > > > > > > the
> > > > > > > > coroutine also reports the error in case of failure, with
> > > > > > > > error_report_err(). Otherwise, the source qemu will not display
> > > > > > > > any errors before going into the postcopy pause state.
> > > > > > > > 
> > > > > > > > Reviewed-by: Marc-AndrĂ© Lureau <[email protected]>
> > > > > > > > Reviewed-by: Fabiano Rosas <[email protected]>
> > > > > > > > Signed-off-by: Arun Menon <[email protected]>
> > > > > > > > ---
> > > > > > > >      migration/migration.c |  9 +++++----
> > > > > > > >      migration/savevm.c    | 30 ++++++++++++++++++------------
> > > > > > > >      migration/savevm.h    |  2 +-
> > > > > > > >      3 files changed, 24 insertions(+), 17 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > > > > index 
> > > > > > > > 10c216d25dec01f206eacad2edd24d21f00e614c..c6768d88f45c870c7fad9b9957300766ff69effc
> > > > > > > >  100644
> > > > > > > > --- a/migration/migration.c
> > > > > > > > +++ b/migration/migration.c
> > > > > > > > @@ -881,7 +881,7 @@ process_incoming_migration_co(void *opaque)
> > > > > > > >                            MIGRATION_STATUS_ACTIVE);
> > > > > > > >          mis->loadvm_co = qemu_coroutine_self();
> > > > > > > > -    ret = qemu_loadvm_state(mis->from_src_file);
> > > > > > > > +    ret = qemu_loadvm_state(mis->from_src_file, &local_err);
> > > > > > > >          mis->loadvm_co = NULL;
> > > > > > > >          
> > > > > > > > trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
> > > > > > > > @@ -908,7 +908,8 @@ process_incoming_migration_co(void *opaque)
> > > > > > > >          }
> > > > > > > >          if (ret < 0) {
> > > > > > > > -        error_setg(&local_err, "load of migration failed: %s", 
> > > > > > > > strerror(-ret));
> > > > > > > > +        error_prepend(&local_err, "load of migration failed: 
> > > > > > > > %s: ",
> > > > > > > > +                      strerror(-ret));
> > > > > > > >              goto fail;
> > > > > > > >          }
> > > > > > > > @@ -924,13 +925,13 @@ fail:
> > > > > > > >          migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
> > > > > > > >                            MIGRATION_STATUS_FAILED);
> > > > > > > >          migrate_set_error(s, local_err);
> > > > > > > > -    error_free(local_err);
> > > > > > > > +    error_report_err(local_err);
> > > > > > > 
> > > > > > > This is problematic because it results in duplicate error reports 
> > > > > > > when
> > > > > > > !mis->exit_on_error; in that case the query-migrate QMP command 
> > > > > > > reports the
> > > > > > > error and this error reporting is redundant.
> > > > > > 
> > > > > > If I comment this change, then all of the errors propagated up to 
> > > > > > now, using
> > > > > > error_setg() will not be reported. This is the place where it is 
> > > > > > finally reported,
> > > > > > when qemu_loadvm_state() fails. In other words, all the 
> > > > > > error_reports() we removed
> > > > > > from all the files, replacing them with error_setg(), will finally 
> > > > > > be reported here
> > > > > > using error_report_err().
> > > > > 
> > > > > My understanding of the code without these two changes is:
> > > > > - If the migrate-incoming QMP command is used with false as
> > > > >     exit-on-error, this function will not report the error but
> > > > >     the query-migrate QMP command will report the error.
> > > > > - Otherwise, this function reports the error.
> > > > 
> > > > With my limited experience in testing, I have a question,
> > > > So there are 2 scenarios,
> > > > 1. running the virsh migrate command on the source host. Something like 
> > > > the following,
> > > >     virsh -c 'qemu:///system' migrate --live --verbose --domain 
> > > > guest-vm --desturi qemu+ssh://10.6.120.20/system
> > > >     OR for postcopy-ram,
> > > >     virsh migrate guest-vm --live qemu+ssh://10.6.120.20/system 
> > > > --verbose --postcopy --timeout 10 --timeout-postcopy
> > > > 
> > > > 2. Using QMP commands, performing a migration from source to 
> > > > destination.
> > > >     Running something like the following on the destination:
> > > >     {
> > > >       "execute": "migrate-incoming",
> > > >       "arguments": {
> > > >         "uri": "tcp:127.0.0.1:7777",
> > > >         "exit-on-error": false
> > > >       }
> > > >     }
> > > >     {
> > > >       "execute": "migrate-incoming",
> > > >       "arguments": {
> > > >         "uri": "tcp:127.0.0.1:7777",
> > > >         "exit-on-error": false
> > > >       }
> > > >     }
> > > >     and the somthing like the following on source:
> > > >     {
> > > >       "execute": "migrate",
> > > >       "arguments": {
> > > >         "uri": "tcp:127.0.0.1:7777"
> > > >       }
> > > >     }
> > > >     {"execute" : "query-migrate"}
> > > > 
> > > > In 1, previously, the user used to get an error message on migration 
> > > > failure.
> > > > This was because there were error_report() calls in all of the files.
> > > > Now that they are replaced with error_setg() and the error is stored in 
> > > > errp,
> > > > we need to display that using error_report_err(). Hence I introduced an 
> > > > error_report_err()
> > > > call in the fail section.
> > > > 
> > > > In 2, we have 2 QMP sessions, one for the source and another for the 
> > > > destination.
> > > > The QMP command migrate will be issued on the source, and the errp will 
> > > > be set.
> > > > I did not understand the part where the message will be displayed 
> > > > because of the
> > > > error_report_err() call. I did not see such a message on failure 
> > > > scenario on both
> > > > the sessions.
> > > > If the user wants to check for errors, then the destination qemu will 
> > > > not exit
> > > > (exit-on-error = false ) and we can retrieve it using {"execute" : 
> > > > "query-migrate"}
> > > > 
> > > > Aren't the 2 scenarios different by nature?
> > > 
> > > In 1, doesn't libvirt query the error with query-migrate and print it?
> > 
> > Ideally it should find the the error, and print the whole thing. It does 
> > work
> > in the normal scenario. However, the postcopy scenario does not show the 
> > same result,
> > which is mentioned in the commit message.
> > 
> > > 
> > > In any case, it would be nice if you describe how libvirt interacts with
> > > QEMU in 1.
> > 
> > Please find below the difference in the command output at source, when we 
> > run a live migration
> > with postcopy enabled.
> > 
> > =========
> > With the current changes:
> > [root@dell-per750-42 qemu-priv]# virsh migrate-setspeed guest-vm 1
> > 
> > [root@dell-per750-42 build]# virsh migrate guest-vm --live 
> > qemu+ssh://10.6.120.9/system --verbose --postcopy --timeout 10 
> > --timeout-postcopy
> > [email protected]'s password:
> > Migration: [ 1.26 %]error: internal error: QEMU unexpectedly closed the 
> > monitor (vm='guest-vm'): 2025-09-03T06:19:15.076547Z qemu-system-x86_64: 
> > -accel kvm: warning: Number of SMP cpus requested (2) exceeds the 
> > recommended cpus supported by KVM (1)
> > 2025-09-03T06:19:15.076586Z qemu-system-x86_64: -accel kvm: warning: Number 
> > of hotpluggable cpus requested (2) exceeds the recommended cpus supported 
> > by KVM (1)
> > 2025-09-03T06:19:27.776715Z qemu-system-x86_64: load of migration failed: 
> > Input/output error: error while loading state for instance 0x0 of device 
> > 'tpm-emulator': post load hook failed for: tpm-emulator, version_id: 0, 
> > minimum_version: 0, ret: -5: tpm-emulator: Setting the stateblob (type 1) 
> > failed with a TPM error 0x21 decryption error
> > 
> > [root@dell-per750-42 build]#
> > 
> > =========
> > 
> > Without the current changes:
> > [root@dell-per750-42 qemu-priv]# virsh migrate-setspeed guest-vm 1
> > 
> > [root@dell-per750-42 qemu-priv]# virsh migrate guest-vm --live 
> > qemu+ssh://10.6.120.9/system --verbose --postcopy --timeout 10 
> > --timeout-postcopy
> > [email protected]'s password:
> > Migration: [ 1.28 %]error: internal error: QEMU unexpectedly closed the 
> > monitor (vm='guest-vm'): 2025-09-03T06:26:17.733786Z qemu-system-x86_64: 
> > -accel kvm: warning: Number of SMP cpus requested (2) exceeds the 
> > recommended cpus supported by KVM (1)
> > 2025-09-03T06:26:17.733830Z qemu-system-x86_64: -accel kvm: warning: Number 
> > of hotpluggable cpus requested (2) exceeds the recommended cpus supported 
> > by KVM (1)
> > 
> > [root@dell-per750-42 qemu-priv]#
> > 
> > =========
> > The original behavior was to print the error to the console regardless of 
> > whether the migration is normal or postcopy.
> 
> This was true for messages in qemu_loadvm_state(), but the message "load of
> migration failed" was printed or queried with query-migrate, not both. We
> should think of which behavior is more appropriate, and I think we should
> avoid duplicate reports.
> 
> > The source machine goes in to a paused state after this.
> 
> The output is informative. It implies the destination machine exited, and it
> makes sense to print error messages as it is done for
> mis->exit_on_error. I wonder if it is possible to detect the condition and
> treat it identically to mis->exit_on_error.

I see that we want to catch a specific scenario in postcopy ram migration
where the destination abruptly exits without a graceful shutdown,
thus failing to inform the source the reason for its failure through a
'query-migrate' even though 'exit-on-error' was set to false on the destination.

However, I am not sure how to reliably detect the specific error condition of
such a connection close that you have described. Given that this is a large
patch series already, could we keep the current change as is for now?
>From what I can tell, the additional log message "load of migration failed"
is not a breaking change and will not cause a crash. We can develop a more
elegant solution to handle the issue of duplication in a separate patch.

> 
> Regards,
> Akihiko Odaki
> 

Regards,
Arun Menon


Reply via email to