On Wed, Aug 02, 2023 at 05:04:45PM -0300, Fabiano Rosas wrote:
> Peter Xu <[email protected]> writes:
>
> >> + if (await_return_path_close_on_source(s)) {
> >> + trace_migration_return_path_pause_err();
> >> + return MIG_THR_ERR_FATAL;
> >> + }
> >
> > I see that here on return path failures we'll bail out, and actually it's
> > against the instinction (that when pause it should have failed, so it's
> > weird why it's returning 0).
> >
> > So how about above suggestion, plus here we just call
> > await_return_path_close_on_source(), without caring about the retval?
>
> So you are suggesting to remove the knowledge of the retry entirely from
> the thread. It just reports the error and the postcopy_pause takes the
> responsibility of ignoring it when we want to retry... It could be
> clearer that way indeed.
That error doesn't really important IMHO here, because the to-dst-file
should have already errored out anyway.
I just think it cleaner if we reset rp_error only until the new thread
created.
>
> >> +
> >> migrate_set_state(&s->state, s->state,
> >> MIGRATION_STATUS_POSTCOPY_PAUSED);
> >>
> >> @@ -2566,12 +2542,6 @@ static MigThrError postcopy_pause(MigrationState *s)
> >> if (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER) {
> >> /* Woken up by a recover procedure. Give it a shot */
> >>
> >> - /*
> >> - * Firstly, let's wake up the return path now, with a new
> >> - * return path channel.
> >> - */
> >> - qemu_sem_post(&s->postcopy_pause_rp_sem);
> >> -
> >> /* Do the resume logic */
> >> if (postcopy_do_resume(s) == 0) {
> >> /* Let's continue! */
> >> @@ -3259,7 +3229,7 @@ void migrate_fd_connect(MigrationState *s, Error
> >> *error_in)
> >> * QEMU uses the return path.
> >> */
> >> if (migrate_postcopy_ram() || migrate_return_path()) {
> >> - if (open_return_path_on_source(s, !resume)) {
> >> + if (open_return_path_on_source(s)) {
> >> error_report("Unable to open return-path for postcopy");
> >> migrate_set_state(&s->state, s->state,
> >> MIGRATION_STATUS_FAILED);
> >> migrate_fd_cleanup(s);
> >> @@ -3320,7 +3290,6 @@ static void migration_instance_finalize(Object *obj)
> >> qemu_sem_destroy(&ms->rate_limit_sem);
> >> qemu_sem_destroy(&ms->pause_sem);
> >> qemu_sem_destroy(&ms->postcopy_pause_sem);
> >> - qemu_sem_destroy(&ms->postcopy_pause_rp_sem);
> >> qemu_sem_destroy(&ms->rp_state.rp_sem);
> >> qemu_sem_destroy(&ms->rp_state.rp_pong_acks);
> >> qemu_sem_destroy(&ms->postcopy_qemufile_src_sem);
> >> @@ -3340,7 +3309,6 @@ static void migration_instance_init(Object *obj)
> >> migrate_params_init(&ms->parameters);
> >>
> >> qemu_sem_init(&ms->postcopy_pause_sem, 0);
> >> - qemu_sem_init(&ms->postcopy_pause_rp_sem, 0);
> >> qemu_sem_init(&ms->rp_state.rp_sem, 0);
> >> qemu_sem_init(&ms->rp_state.rp_pong_acks, 0);
> >> qemu_sem_init(&ms->rate_limit_sem, 0);
> >> diff --git a/migration/migration.h b/migration/migration.h
> >> index b7c8b67542..e78db5361c 100644
> >> --- a/migration/migration.h
> >> +++ b/migration/migration.h
> >> @@ -382,7 +382,6 @@ struct MigrationState {
> >>
> >> /* Needed by postcopy-pause state */
> >> QemuSemaphore postcopy_pause_sem;
> >> - QemuSemaphore postcopy_pause_rp_sem;
> >> /*
> >> * Whether we abort the migration if decompression errors are
> >> * detected at the destination. It is left at false for qemu
> >> diff --git a/migration/trace-events b/migration/trace-events
> >> index 5259c1044b..19ec649d1d 100644
> >> --- a/migration/trace-events
> >> +++ b/migration/trace-events
> >> @@ -164,6 +164,7 @@ migration_rate_limit_pre(int ms) "%d ms"
> >> migration_rate_limit_post(int urgent) "urgent: %d"
> >> migration_return_path_end_before(void) ""
> >> migration_return_path_end_after(int rp_error) "%d"
> >> +migration_return_path_pause_err(void) ""
> >
> > If it should never trigger, it shouldn't need a tracepoint. It needs an
> > assertion if we're 100% confident, or error_report_once() perhaps would be
> > more suitable.
>
> It would trigger when a rp error happened that wasn't related to the
> QEMUFile. If we go with your suggestion above, then this goes away.
With your current patch where rp_error seems to be always reset when thread
quit, if that's true then it'll 100% happen that this will not trigger.
But yeah this is a trivial spot, feel free to choose the best if you plan
to reorganize this patch a bit. Thanks.
--
Peter Xu