On Mon, Jun 27, 2022 at 04:32:00PM -0400, Peter Xu wrote:
> On Mon, Jun 27, 2022 at 04:03:09PM +0100, Daniel P. Berrangé wrote:
> > On Wed, Jun 22, 2022 at 03:34:52PM -0400, Peter Xu wrote:
> > > On Wed, Jun 22, 2022 at 07:39:06PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > > diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> > > > index 74f919de67..e206b05550 100644
> > > > --- a/migration/qemu-file.c
> > > > +++ b/migration/qemu-file.c
> > > > @@ -377,8 +377,22 @@ static ssize_t qemu_fill_buffer(QEMUFile *f)
> > > >          return 0;
> > > >      }
> > > >
> > > > -    len = f->ops->get_buffer(f->ioc, f->buf + pending, f->total_transferred,
> > > > -                             IO_BUF_SIZE - pending, &local_error);
> > > > +    do {
> > > > +        len = qio_channel_read(f->ioc,
> > > > +                               (char *)f->buf + pending,
> > > > +                               IO_BUF_SIZE - pending,
> > > > +                               &local_error);
> > > > +        if (len == QIO_CHANNEL_ERR_BLOCK) {
> > > > +            if (qemu_in_coroutine()) {
> > > > +                qio_channel_yield(f->ioc, G_IO_IN);
> > > > +            } else {
> > > > +                qio_channel_wait(f->ioc, G_IO_IN);
> > > > +            }
> > > > +        } else if (len < 0) {
> > > > +            len = EIO;
> > >
> > > This should be -EIO.
> > >
> > > > +        }
> > > > +    } while (len == QIO_CHANNEL_ERR_BLOCK);
> > >
> > > It's failing only with the new TLS test I added for postcopy somehow (at
> > > least /x86_64/migration/postcopy/recovery/tls).. I also verified after the
> > > change it'll work again.
> >
> > Assuming you can still reproduce the pre-existing flaw, can you capture
> > a stack trace when it hangs. I'm wondering if it is a sign that the
> > migration is not converging when using TLS under certain load conditions,
> > because the test waits forever for converge.
>
> Yes it is, and it reproduces here every time. It hangs at:
>
>     if (!got_stop) {
>         qtest_qmp_eventwait(from, "STOP");
>     }
>
> >
> > Also what scenario are you running in ? Bare metal or a VM, and what
> > host arch ? Wondering if the machine is at all slow, or for example
> > missing AES hardware acceleration or some such thing.
>
> It's Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 40 cores.
>
> It'll pass after I modify the downtime:
>
>     migrate_set_parameter_int(from, "downtime-limit", 100000);
>
> And with QTEST_LOG=1 I found that the bw is indeed low, ~700mbps.
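
To make the downtime observation above concrete, here is a rough sketch of the arithmetic, not QEMU code. It assumes that "mbps" means megabits per second and that precopy only moves to the stop phase once the remaining dirty data fits within bandwidth times downtime-limit. The function names and the 100 MB dirty-data figure are hypothetical, and the 300 ms value is QEMU's default downtime-limit; the test may well set something different.

```c
#include <stdint.h>

/*
 * Bytes that can be transferred within the allowed downtime at a given
 * bandwidth. bw_mbps is assumed to be megabits per second; downtime_ms
 * is the downtime-limit in milliseconds. Illustrative helper only.
 */
static uint64_t downtime_budget_bytes(uint64_t bw_mbps, uint64_t downtime_ms)
{
    uint64_t bytes_per_sec = bw_mbps * 1000000ULL / 8;

    return bytes_per_sec * downtime_ms / 1000;
}

/*
 * Returns 1 if the remaining dirty data fits in the downtime budget,
 * i.e. the migration could stop the guest and finish; 0 otherwise.
 */
static int can_converge(uint64_t remaining_bytes, uint64_t bw_mbps,
                        uint64_t downtime_ms)
{
    return remaining_bytes <= downtime_budget_bytes(bw_mbps, downtime_ms);
}
```

Under these assumptions, 700 mbps with a 300 ms limit leaves a budget of 26,250,000 bytes (about 26 MB), so a guest that keeps roughly 100 MB dirty never gets there and the test waits for STOP indefinitely. With downtime-limit raised to 100000 ms the budget becomes 8,750,000,000 bytes, which would explain why the test passes after that change.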

Good, this all makes sense, and I've got pending patches I'm testing that
will fix this.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|