Test result for migration of multiple Mellanox CX-7 VFs: PASS
[1] create two VFs and bind them to the mlx5_vfio_pci driver
[2] start a VM with the two VFs
[3] migrate the VM
[4] check that the VM works well after migration

Tested-by: YangHang Liu <[email protected]>

Best Regards,
YangHang Liu


On Wed, Aug 2, 2023 at 4:43 PM Cédric Le Goater <[email protected]> wrote:
>
> On 8/2/23 10:14, Avihai Horon wrote:
> > VFIO migration uAPI defines an optional intermediate P2P quiescent
> > state. While in the P2P quiescent state, P2P DMA transactions cannot be
> > initiated by the device, but the device can respond to incoming ones.
> > Additionally, all outstanding P2P transactions are guaranteed to have
> > been completed by the time the device enters this state.
> >
> > The purpose of this state is to support migration of multiple devices
> > that might do P2P transactions between themselves.
> >
> > Add support for P2P migration by transitioning all the devices to the
> > P2P quiescent state before stopping or starting the devices. Use the new
> > VMChangeStateHandler prepare_cb to achieve that behavior.
> >
> > This will allow migration of multiple VFIO devices if all of them
> > support P2P migration.
> >
> > Signed-off-by: Avihai Horon <[email protected]>
>
>
> Reviewed-by: Cédric Le Goater <[email protected]>
>
> Thanks,
>
> C.
>
>
> > ---
> >   docs/devel/vfio-migration.rst | 93 +++++++++++++++++++++--------------
> >   hw/vfio/common.c              |  6 ++-
> >   hw/vfio/migration.c           | 46 +++++++++++++++--
> >   hw/vfio/trace-events          |  1 +
> >   4 files changed, 105 insertions(+), 41 deletions(-)
> >
> > diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> > index b433cb5bb2..605fe60e96 100644
> > --- a/docs/devel/vfio-migration.rst
> > +++ b/docs/devel/vfio-migration.rst
> > @@ -23,9 +23,21 @@ and recommends that the initial bytes are sent and loaded in the destination
> >   before stopping the source VM. Enabling this migration capability will
> >   guarantee that and thus, can potentially reduce downtime even further.
> >
> > -Note that currently VFIO migration is supported only for a single device. This
> > -is due to VFIO migration's lack of P2P support. However, P2P support is planned
> > -to be added later on.
> > +To support migration of multiple devices that might do P2P transactions between
> > +themselves, VFIO migration uAPI defines an intermediate P2P quiescent state.
> > +While in the P2P quiescent state, P2P DMA transactions cannot be initiated by
> > +the device, but the device can respond to incoming ones. Additionally, all
> > +outstanding P2P transactions are guaranteed to have been completed by the time
> > +the device enters this state.
> > +
> > +All the devices that support P2P migration are first transitioned to the P2P
> > +quiescent state and only then are they stopped or started. This makes migration
> > +safe P2P-wise, since starting and stopping the devices is not done atomically
> > +for all the devices together.
> > +
> > +Thus, multiple VFIO devices migration is allowed only if all the devices
> > +support P2P migration. Single VFIO device migration is allowed regardless of
> > +P2P migration support.
> >
> >   A detailed description of the UAPI for VFIO device migration can be found in
> >   the comment for the ``vfio_device_mig_state`` structure in the header file
> > @@ -132,54 +144,63 @@ will be blocked.
> >   Flow of state changes during Live migration
> >   ===========================================
> >
> > -Below is the flow of state change during live migration.
> > +Below is the state change flow during live migration for a VFIO device that
> > +supports both precopy and P2P migration. The flow for devices that don't
> > +support it is similar, except that the relevant states for precopy and P2P are
> > +skipped.
> >   The values in the parentheses represent the VM state, the migration state, and
> >   the VFIO device state, respectively.
> > -The text in the square brackets represents the flow if the VFIO device supports
> > -pre-copy.
> >
> >   Live migration save path
> >   ------------------------
> >
> >   ::
> >
> > -                        QEMU normal running state
> > -                        (RUNNING, _NONE, _RUNNING)
> > -                                  |
> > +                           QEMU normal running state
> > +                           (RUNNING, _NONE, _RUNNING)
> > +                                      |
> >                        migrate_init spawns migration_thread
> > -                Migration thread then calls each device's .save_setup()
> > -                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
> > -                                  |
> > -                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
> > -      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
> > -          If total pending_bytes >= threshold_size, call .save_live_iterate()
> > -                  [Data of VFIO device for pre-copy phase is copied]
> > -        Iterate till total pending bytes converge and are less than threshold
> > -                                  |
> > -  On migration completion, vCPU stops and calls .save_live_complete_precopy for
> > -  each active device. The VFIO device is then transitioned into _STOP_COPY state
> > -                  (FINISH_MIGRATE, _DEVICE, _STOP_COPY)
> > -                                  |
> > -     For the VFIO device, iterate in .save_live_complete_precopy until
> > -                         pending data is 0
> > -                   (FINISH_MIGRATE, _DEVICE, _STOP)
> > -                                  |
> > -                 (FINISH_MIGRATE, _COMPLETED, _STOP)
> > -             Migraton thread schedules cleanup bottom half and exits
> > +            Migration thread then calls each device's .save_setup()
> > +                          (RUNNING, _SETUP, _PRE_COPY)
> > +                                      |
> > +                         (RUNNING, _ACTIVE, _PRE_COPY)
> > +  If device is active, get pending_bytes by .state_pending_{estimate,exact}()
> > +       If total pending_bytes >= threshold_size, call .save_live_iterate()
> > +                Data of VFIO device for pre-copy phase is copied
> > +      Iterate till total pending bytes converge and are less than threshold
> > +                                      |
> > +       On migration completion, the vCPUs and the VFIO device are stopped
> > +              The VFIO device is first put in P2P quiescent state
> > +                    (FINISH_MIGRATE, _ACTIVE, _PRE_COPY_P2P)
> > +                                      |
> > +                Then the VFIO device is put in _STOP_COPY state
> > +                     (FINISH_MIGRATE, _ACTIVE, _STOP_COPY)
> > +         .save_live_complete_precopy() is called for each active device
> > +      For the VFIO device, iterate in .save_live_complete_precopy() until
> > +                               pending data is 0
> > +                                      |
> > +                     (POSTMIGRATE, _COMPLETED, _STOP_COPY)
> > +            Migration thread schedules cleanup bottom half and exits
> > +                                      |
> > +                           .save_cleanup() is called
> > +                        (POSTMIGRATE, _COMPLETED, _STOP)
> >
> >   Live migration resume path
> >   --------------------------
> >
> >   ::
> >
> > -              Incoming migration calls .load_setup for each device
> > -                       (RESTORE_VM, _ACTIVE, _STOP)
> > -                                 |
> > -       For each device, .load_state is called for that device section data
> > -                       (RESTORE_VM, _ACTIVE, _RESUMING)
> > -                                 |
> > -    At the end, .load_cleanup is called for each device and vCPUs are started
> > -                       (RUNNING, _NONE, _RUNNING)
> > +             Incoming migration calls .load_setup() for each device
> > +                          (RESTORE_VM, _ACTIVE, _STOP)
> > +                                      |
> > +     For each device, .load_state() is called for that device section data
> > +                        (RESTORE_VM, _ACTIVE, _RESUMING)
> > +                                      |
> > +  At the end, .load_cleanup() is called for each device and vCPUs are started
> > +              The VFIO device is first put in P2P quiescent state
> > +                        (RUNNING, _ACTIVE, _RUNNING_P2P)
> > +                                      |
> > +                           (RUNNING, _NONE, _RUNNING)
> >
> >   Postcopy
> >   ========
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 16cf79a76c..7c3d636025 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -441,14 +441,16 @@ bool vfio_device_state_is_running(VFIODevice *vbasedev)
> >   {
> >       VFIOMigration *migration = vbasedev->migration;
> >
> > -    return migration->device_state == VFIO_DEVICE_STATE_RUNNING;
> > +    return migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> > +           migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P;
> >   }
> >
> >   bool vfio_device_state_is_precopy(VFIODevice *vbasedev)
> >   {
> >       VFIOMigration *migration = vbasedev->migration;
> >
> > -    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
> > +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
> > +           migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P;
> >   }
> >
> >   static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 48f9c23cbe..71855468fe 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -71,8 +71,12 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
> >           return "STOP_COPY";
> >       case VFIO_DEVICE_STATE_RESUMING:
> >           return "RESUMING";
> > +    case VFIO_DEVICE_STATE_RUNNING_P2P:
> > +        return "RUNNING_P2P";
> >       case VFIO_DEVICE_STATE_PRE_COPY:
> >           return "PRE_COPY";
> > +    case VFIO_DEVICE_STATE_PRE_COPY_P2P:
> > +        return "PRE_COPY_P2P";
> >       default:
> >           return "UNKNOWN STATE";
> >       }
> > @@ -652,6 +656,39 @@ static const SaveVMHandlers savevm_vfio_handlers = {
> >
> >   /* ---------------------------------------------------------------------- */
> >
> > +static void vfio_vmstate_change_prepare(void *opaque, bool running,
> > +                                        RunState state)
> > +{
> > +    VFIODevice *vbasedev = opaque;
> > +    VFIOMigration *migration = vbasedev->migration;
> > +    enum vfio_device_mig_state new_state;
> > +    int ret;
> > +
> > +    new_state = migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ?
> > +                    VFIO_DEVICE_STATE_PRE_COPY_P2P :
> > +                    VFIO_DEVICE_STATE_RUNNING_P2P;
> > +
> > +    /*
> > +     * If setting the device in new_state fails, the device should be reset.
> > +     * To do so, use ERROR state as a recover state.
> > +     */
> > +    ret = vfio_migration_set_state(vbasedev, new_state,
> > +                                   VFIO_DEVICE_STATE_ERROR);
> > +    if (ret) {
> > +        /*
> > +         * Migration should be aborted in this case, but vm_state_notify()
> > +         * currently does not support reporting failures.
> > +         */
> > +        if (migrate_get_current()->to_dst_file) {
> > +            qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
> > +        }
> > +    }
> > +
> > +    trace_vfio_vmstate_change_prepare(vbasedev->name, running,
> > +                                      RunState_str(state),
> > +                                      mig_state_to_str(new_state));
> > +}
> > +
> >   static void vfio_vmstate_change(void *opaque, bool running, RunState state)
> >   {
> >       VFIODevice *vbasedev = opaque;
> > @@ -758,6 +795,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
> >       char id[256] = "";
> >       g_autofree char *path = NULL, *oid = NULL;
> >       uint64_t mig_flags = 0;
> > +    VMChangeStateHandler *prepare_cb;
> >
> >       if (!vbasedev->ops->vfio_get_object) {
> >           return -EINVAL;
> > @@ -798,9 +836,11 @@ static int vfio_migration_init(VFIODevice *vbasedev)
> >       register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
> >                            vbasedev);
> >
> > -    migration->vm_state = qdev_add_vm_change_state_handler(vbasedev->dev,
> > -                                                           vfio_vmstate_change,
> > -                                                           vbasedev);
> > +    prepare_cb = migration->mig_flags & VFIO_MIGRATION_P2P ?
> > +                     vfio_vmstate_change_prepare :
> > +                     NULL;
> > +    migration->vm_state = qdev_add_vm_change_state_handler_full(
> > +        vbasedev->dev, vfio_vmstate_change, prepare_cb, vbasedev);
> >       migration->migration_state.notify = vfio_migration_state_notifier;
> >       add_migration_state_change_notifier(&migration->migration_state);
> >
> > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > index ee7509e68e..329736a738 100644
> > --- a/hw/vfio/trace-events
> > +++ b/hw/vfio/trace-events
> > @@ -167,3 +167,4 @@ vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer
> >   vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> >   vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> >   vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
> > +vfio_vmstate_change_prepare(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>
>

