On Fri, May 19, 2017 at 05:51:43PM +0800, Peter Xu wrote:
> On Fri, May 19, 2017 at 09:25:38AM +0100, Daniel P. Berrange wrote:
> > On Fri, May 19, 2017 at 02:43:27PM +0800, Peter Xu wrote:
> > > We don't really have a return path for the other types yet. Let's check
> > > this when .get_return_path() is called.
> > >
> > > For this, we introduce a new feature bit, and set it up only for socket
> > > typed IO channels.
> > >
> > > This will help detect earlier failure for postcopy, e.g., logically
> > > speaking postcopy cannot work with "exec:". Before this patch, when we
> > > try to migrate with "migrate -d exec:cat>out", we'll hang the system.
> > > With this patch, we'll get:
> > >
> > > (qemu) migrate -d exec:cat>out
> > > Unable to open return-path for postcopy
> >
> > This is wrong - post-copy migration *can* work with exec: - it just entirely
> > depends on what command you are running. Your example ran a command which is
> > unidirectional, but if you ran 'exec:socat ...' you would have a fully
> > bidirectional channel. Actually the channel is always bi-directional, but
> > 'cat' simply won't ever send data back to QEMU.
>
> Indeed. I should not block postcopy if the user used a TCP tunnel
> between the source and destination in some way, using this exec: way.
> Thanks for pointing that out.
>
> However I still think the idea is needed here. Say, we'd better know
> whether the transport would be able to respond (though current
> approach of "assuming sockets are the only ones that can reply" is not
> a good solution...). Please see below.
>
> >
> > If QEMU hangs when the other end doesn't send data back, that actually seems
> > like a potentially serious bug in migration code. Even if using the normal
> > 'tcp' migration protocol, if the target QEMU server hangs and fails to
> > send data to QEMU on the return path, the source QEMU must never hang.
>
> Firstly I should not say it's a hang - it's actually by-design here
> imho - migration thread is in the last phase now, waiting for a SHUT
> message from destination (which I think is wise). But from the
> behavior, indeed src VM is not usable during the time, just like what
> happened for most postcopy cases on the source side. So, we can see
> that postcopy "assumes" that destination side can reply now.
>
> Meanwhile, I see it reasonable for postcopy to have such an
> assumption. After all, postcopy means "start VM on destination before
> pages are moved over completely", then there must be someone to reply
> to source, no matter whether it'll be via some kind of io channel.
>
> That's why I think we still need the general idea here, that we need
> to know whether destination end is able to reply.
>
> But, I still have no good idea (after knowing this patch won't work)
> on how we can do this... Any further suggestions would be greatly
> welcomed.

IMHO this is nothing more than a documentation issue for the 'exec'
protocol, i.e. document that you should provide a bi-directional
transport for live migration. A uni-directional transport is arguably
only valid if you're using migrate to save/restore the VM state to a
file.

Regards,
Daniel
-- 
|: https://berrange.com       -o-   https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org        -o-   https://fstop138.berrange.com           :|
|: https://entangle-photo.org -o-   https://www.instagram.com/dberrange     :|
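
To illustrate the uni- versus bi-directional point, here is a minimal
Python sketch. It is not QEMU code and makes no claim about how QEMU wires
up the exec: transport internally; the helper name probe_return_path and
the use of a socketpair are assumptions made purely for illustration. It
runs a shell command with stdin/stdout attached to one end of a
bidirectional channel and reports what, if anything, comes back on the
other end.

```python
import socket
import subprocess


def probe_return_path(command, payload=b"ping\n", timeout=2.0):
    """Hypothetical helper: run `command` on one end of a socketpair and
    return whatever bytes it sends back on the other end."""
    parent, child = socket.socketpair()
    # The command reads from and writes to the same bidirectional channel.
    proc = subprocess.Popen(
        command, shell=True, stdin=child.fileno(), stdout=child.fileno()
    )
    child.close()  # the spawned process keeps its own copy of the descriptor

    parent.settimeout(timeout)
    parent.sendall(payload)
    parent.shutdown(socket.SHUT_WR)  # signal EOF so the command can finish

    received = b""
    try:
        while True:
            chunk = parent.recv(4096)
            if not chunk:  # peer closed: nothing more will arrive
                break
            received += chunk
    except socket.timeout:
        pass  # the command stayed silent for the whole timeout
    finally:
        parent.close()
        proc.wait()
    return received


if __name__ == "__main__":
    # 'cat' echoes its input back, so the return direction carries data.
    print(probe_return_path("cat"))
    # 'cat > /dev/null' consumes the stream but never writes back.
    print(probe_return_path("cat > /dev/null"))
```

The channel is the same in both runs; only the command decides whether
anything travels in the return direction, which is why a one-way command
is only suitable for saving state to a file.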