* Jiří Denemark ([email protected]) wrote:
> On Wed, Oct 01, 2025 at 11:05:59 +0000, Dr. David Alan Gilbert wrote:
> > * Jiří Denemark ([email protected]) wrote:
> > > On Tue, Sep 30, 2025 at 16:04:54 -0400, Peter Xu wrote:
> > > > On Tue, Sep 30, 2025 at 09:53:31AM +0200, Jiří Denemark wrote:
> > > > > On Thu, Sep 25, 2025 at 14:22:06 -0400, Peter Xu wrote:
> > > > > > On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> > > > > > > On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > > > > > So far, dest QEMU will try to resume the VM after getting the RUN
> > > > > > command, that is what loadvm_postcopy_handle_run_bh() does, and it
> > > > > > will (when autostart=1 is set): (1) firstly try to activate all
> > > > > > block devices, and iff that succeeded, (2) do vm_start(), at the
> > > > > > end of which the RESUME event will be generated.  So RESUME
> > > > > > currently implies both that disk activation succeeded and that vm
> > > > > > start worked.
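
For anyone skimming the thread, that ordering boils down to something like
the sketch below.  The helper names are made up purely for illustration,
this is not the real loadvm_postcopy_handle_run_bh() body:

    /* Illustration only: invented helper names, not QEMU's real code. */
    #include <stdbool.h>
    #include <stdio.h>

    static bool autostart = true;            /* autostart=1 in the text above */

    static bool activate_all_block_devices(void)
    {
        /* the step that takes the image locks on the destination */
        return true;
    }

    static void start_vcpus(void)
    {
        /* the RESUME event is only emitted at the end of this step */
        printf("RESUME\n");
    }

    static void handle_postcopy_run(void)
    {
        if (!autostart) {
            return;                          /* wait for an explicit 'cont' */
        }
        if (!activate_all_block_devices()) { /* (1) */
            fprintf(stderr, "disk activation failed, vCPUs not started\n");
            return;
        }
        start_vcpus();                       /* (2) */
    }

    int main(void)
    {
        handle_postcopy_run();
        return 0;
    }

i.e. if step (1) fails we never reach vm_start(), so no RESUME event is
emitted at all.
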
> > > > > > 
> > > > > > > may still fail when locking disks fails (not sure if this is
> > > > > > > the only way cont may fail). In this case we cannot cancel the
> > > > > > > migration on the
> > > > > >
> > > > > > Is there any known issue with locking disks that would make the
> > > > > > dest fail?  This really sounds like we should have the admin take
> > > > > > a look.
> > > > > 
> > > > > Oh definitely, it would be some kind of storage access issue on the
> > > > > destination. But we'd like to give the admin an option to actually do
> > > > > something other than just kill the VM :-) Either by automatically
> > > > > canceling the migration or by allowing recovery once the storage
> > > > > issues are solved.
> > > > 
> > > > The problem is, if the storage locking stopped working properly, then
> > > > how can we guarantee the shared storage itself is working properly?
> > > >
> > > > When I was replying previously, I was expecting the admin to take a
> > > > look and fix the storage; I didn't expect the VM could still be
> > > > recovered if there's no confidence that the block devices will work
> > > > fine.  The locking errors to me may imply a block corruption already,
> > > > or should I not see it like that?
> > > 
> > > If the storage itself is broken, there's clearly nothing we can do. But
> > > the thing is we're accessing it from two distinct hosts. So while it may
> > > work on the source, it can be broken on the destination. For example,
> > > the connection between the destination host and the storage may be
> > > broken. Not sure how often this can happen in real life, but we have a
> > > bug report that (artificially) breaking storage access on the
> > > destination results in a paused VM on the source which can only be
> > > killed.
> > 
> > I've got a vague memory that a tricky case is when some of your storage
> > devices are broken on the destination, but not all.
> > So you tell the block layer you want to take them on the destination:
> > some take their lock, one fails; now what state are you in?
> > I'm not sure if the block layer had a way of telling you what state
> > you were in when I was last involved in that.
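
To make that "what state are you in" question concrete, the all-or-nothing
version needs something like the rollback below (invented helpers, not the
block layer's actual interface):

    /* Invented helpers, just to show why partial failure is awkward:
     * either you roll back the locks you already took, or you are left
     * half-activated with no good way to report which half. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_DISKS 3

    static bool take_image_lock(int disk)
    {
        return disk != 2;               /* pretend the last disk fails */
    }

    static void release_image_lock(int disk)
    {
        printf("released lock on disk %d\n", disk);
    }

    static bool activate_all_or_none(void)
    {
        for (int i = 0; i < NUM_DISKS; i++) {
            if (!take_image_lock(i)) {
                while (i-- > 0) {       /* undo what we already took */
                    release_image_lock(i);
                }
                return false;
            }
        }
        return true;
    }

    int main(void)
    {
        printf("all disks activated: %d\n", activate_all_or_none());
        return 0;
    }

Killing the destination QEMU is the blunt version of that rollback, since
its locks go away with the process, which is where your question below
comes in.
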
> 
> Wouldn't those locks be automatically released when we kill QEMU on the
> destination as a reaction to a failure to start vCPUs?

Oh hmm, yeh that might work OK.

Dave

> Jirka
> 
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/
