Paolo,

> Do you mean something like this?
>
> destination
>     socket()
>     bind() to { sin_port = 0, sin_addr.s_addr = INADDR_ANY }
>     listen()
>     getsockname()
>     send address to source
>     accept()
>     start QEMU with file descriptor returned by accept
>
> source
>     read address
>     socket()
>     connect()
>     pass socket file descriptor to QEMU and migrate to it
>
> Anything that doesn't use sin_port = 0 and getsockname() is prone to
> race conditions.

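For reference, the socket half of the quoted sequence might look roughly like
this in Python. It is only a sketch: the helper names (listen_on_any_port,
connect_to) are made up for illustration, and handing the resulting file
descriptor to QEMU is deliberately left out.

```python
import socket


def listen_on_any_port(host="0.0.0.0"):
    """Destination side: let the kernel pick a free port, then report it."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))           # port 0: the kernel assigns an unused port
    srv.listen(1)
    port = srv.getsockname()[1]   # the port that was actually assigned
    return srv, port


def connect_to(host, port):
    """Source side: connect to the address the destination reported."""
    return socket.create_connection((host, port))


if __name__ == "__main__":
    srv, port = listen_on_any_port("127.0.0.1")
    print("destination listening on port", port)  # would be sent to the source
    src = connect_to("127.0.0.1", port)
    conn, peer = srv.accept()
    # conn.fileno() and src.fileno() are the descriptors that would be
    # handed to the destination and source QEMU respectively (not shown).
    conn.close()
    src.close()
    srv.close()
```

Because the port is chosen by the kernel at bind() time and only read back
afterwards, there is no window in which another process can take it.
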
From memory, we bind() to a specific randomly chosen port and, if that
fails, retry until bind() succeeds. This is because we want the port to
be within a given range. I believe that is race free as only one bind()
can run at once.

>> Approx 10% of migrations die after many minutes on the
>> customer's platform. This does not appear to happen if migrations are
>> not carried out 50 at a time.
>
> Dying after many minutes usually means that the destination is not set
> up the same as the source, as you said below.

Hmmm. OK, I thought that produced an immediate error. Is there any way of
logging what's going on to stderr or similar?

Alex

>
> Paolo
>
>> We appear to be getting something other than 'ms' returned through the
>> monitoring system. Unhelpfully what that is is not logged.
>>
>> Is there anything (apart from the socket closing prematurely) which can
>> cause a failed migration after many minutes? We've seen problems where
>> the destination is not set up the same as the source (e.g. different
>> numbers of NICs) but IIRC that fails much earlier.
>>
>> To make things easier (cough), this is qemu 1.0 (as shipped with Ubuntu
>> Precise).
>>
>
>

-- 
Alex Bligh