On Mon, 13 Jul 2020 12:03:33 +0100 "Dr. David Alan Gilbert" <[email protected]> wrote:
> * Claudio Fontana ([email protected]) wrote: > > The following workarounds hide the problem (make the test pass): > > > > 1) always including the icount field in the (unrelated) timers field that > > are sent before in the migration stream (ie not applying the reproducer > > patch). > > > > 2) increasing the IO_BUF_SIZE also hides the problem: > > > > ----------------------cut-------------------------- > > diff --git a/migration/qemu-file.c b/migration/qemu-file.c > > index be21518c57..f81d1272eb 100644 > > --- a/migration/qemu-file.c > > +++ b/migration/qemu-file.c > > @@ -30,7 +30,7 @@ > > #include "trace.h" > > #include "qapi/error.h" > > > > -#define IO_BUF_SIZE 32768 > > +#define IO_BUF_SIZE 65536 > > #define MAX_IOV_SIZE MIN_CONST(IOV_MAX, 64) > > > > struct QEMUFile { > > ----------------------cut-------------------------- > > > > 3) adding a qemu_fflush in hw/s390x/s390-skeys.c after EOS also "fixes" the > > problem: > > > > ----------------------cut-------------------------- > > diff --git a/hw/s390x/s390-skeys.c b/hw/s390x/s390-skeys.c > > index 1e036cc602..47c9a015af 100644 > > --- a/hw/s390x/s390-skeys.c > > +++ b/hw/s390x/s390-skeys.c > > @@ -252,6 +252,8 @@ static const TypeInfo qemu_s390_skeys_info = { > > .class_size = sizeof(S390SKeysClass), > > }; > > > > +extern void qemu_fflush(QEMUFile *f); > > + > > static void s390_storage_keys_save(QEMUFile *f, void *opaque) > > { > > S390SKeysState *ss = S390_SKEYS(opaque); > > @@ -302,6 +304,7 @@ static void s390_storage_keys_save(QEMUFile *f, void > > *opaque) > > g_free(buf); > > end_stream: > > qemu_put_be64(f, eos); > > + qemu_fflush(f); > > } > > > > static int s390_storage_keys_load(QEMUFile *f, void *opaque, int > > version_id) > > ----------------------cut-------------------------- > > > > Do any of you with better understanding of migration/, block and s390 have > > a suggestion on what could be the issue here, > > and what could be the next step in the investigation? > > > > Is the fact that migration/ram.c always does fflush after writing the EOS > > have any relevance here? why does it do it, > > and should s390 code also follow the same pattern? > > I didn't think it was required. > And qemu_put_buffer loops if needed and calls qemu_fflush internally. > It's possible here that the storage key code is just the canary - the > first thing that detects that the stream is invalid after it all goes > wrong. Yes, that seems possible. Especially as we end up with all zeroes after the skeys section in the bad case -- it seems like weird problem to have to be cured by an individual device. No good idea *what* actually goes wrong, though.
