On Thu, Dec 11, 2025 at 08:44:27PM +0000, Chaney, Ben wrote:
> On 12/11/25, 2:13 PM, "Peter Xu" <[email protected]> wrote:
> > On Thu, Dec 11, 2025 at 06:42:05PM +0000, Chaney, Ben wrote:
> > > On 12/9/25, 1:55 PM, "Peter Xu" <[email protected]> wrote:
> > > > On Mon, Dec 08, 2025 at 07:32:41PM +0000, Chaney, Ben wrote:
> > > > > On 12/5/25, 10:13 AM, "Peter Xu" <[email protected]> wrote:
> > > > > > Maybe you can stick with -incoming defer, then it'll be after step [3],
> > > > > > which will inherit the modified uid, and mgmt doesn't need to bother
> > > > > > monitoring.
> > > > >
> > > > > I tried this approach, but It doesn't look like it is possible to create the
> > > > > cprsocket later with -incoming defer.
> > > >
> > > > You'll still need to chmod for the cpr socket. "defer" will still help the
> > > > main channel to be created with the uid provided.
> > >
> > > Thanks for the pointers. I was able to get the incoming defer method
> > > working, but it has much worse performance than the other method.
> > >
> > > Would you be open to a solution where we chown only the migration
> > > sockets, or would that run into similar concerns?
> >
> > We can evaluate, but before that, could you explain your current solution
> > first?
> >
> > And, what is the performance you mentioned here that is worse?
> >
> > I at least didn't expect it to be downtime, because IIUC what your mgmt
> > needs to do is to chmod on the cpr channel first (during which migration
> > hasn't started), then chmod once more on the main channel after CPR channel
> > migrated and before main channel migration happens (during which VM should
> > be running on src), hence it should have nothing to do with downtime.
>
> I wouldn't have expected this to affect downtime either, but it does increase the
> downtime by about 3.5 seconds (700-800ms to just over 4s). I am using the
> following setup to defer the creation of the main socket:
>
> qemu-system-x86_64 ... -incoming defer -incoming \
> '{"channel-type": "cpr", "addr": { "transport": "socket", "type": "unix",
> "path": "cpr.sock"}}'
>
> chown $UID:$GID cpr.sock
>
> echo '{"execute":"qmp_capabilities"}
> {"execute": "query-status"}
> {"execute":"migrate-set-parameters",
> "arguments":{"mode":"cpr-transfer"}}
> {"execute": "migrate", "arguments": { "channels": [
> {"channel-type": "main", "addr": { "transport": "socket", "type": "unix",
> "path": "main.sock"}},
> {"channel-type": "cpr",
> "addr": { "transport": "socket", "type": "unix",
> "path": "cpr.sock"}}]}}
>
> {"execute": "query-status"}
>
> {"execute": "query-migrate"}
> ' | $SSH_COMMAND socat STDIO unix-connect:qemu_src.monitor
>
> echo '{"execute":"qmp_capabilities"}
> {"execute": "migrate-incoming", "arguments": { "channels": [
> {"channel-type": "main", "addr": { "transport": "socket", "type": "unix",
> "path": "main.sock"}}]}}
> {"execute": "query-status"}
> {"execute": "query-migrate"}
> ' | $SSH_COMMAND socat STDIO unix-connect:qemu_dst.monitor
>
> The migration finishes as soon as the migrate-incoming command is issued.

This really sounds weird, because this window should be the maximum
downtime. If it finished so fast, something is wrong.

Could you spend some time investigating this problem? IMHO something is
very off, and a few seconds of downtime shouldn't be hard to chase.

If we need to justify a chmod on top of migration channels, we still need
to know why it's needed.

Thanks,

> There is no opportunity to chown the main socket, but because it is being
> hot plugged it gets created with the appropriate permissions.
>
> I should also note that I am testing this in combination with the patch set for
> cpr transfer for tap devices, which makes the issue more pronounced in terms
> of network interruption, however the reported downtime increases by 3.5s
> regardless of if that patch set is applied or not.

--
Peter Xu
