On 21.02.2023 at 11:57, Fiona Ebner wrote:
> On 14.02.23 at 17:19, Vladimir Sementsov-Ogievskiy wrote:
> > On 02.02.23 16:27, Fiona Ebner wrote:
> >> On 02.02.23 at 12:34, Kevin Wolf wrote:
> >>> But having to switch the mirror job to sync mode just to avoid doing I/O
> >>> on an inactive device sounds wrong to me. It doesn't fix the root cause
> >>> of that problem, but just papers over it.
> >>
> >> If you say the root cause is "the job not being completed before
> >> switchover", then yes. But if the root cause is "switchover happening
> >> while the drive is not actively synced", then a way to switch modes can
> >> fix the root cause :)
> >>
> >>> Why does your management tool not complete the mirror job before it
> >>> does the migration switchover that inactivates images?
> >>
> >> I did talk with my team leader about the possibility, but we decided
> >> not to go for it, because it requires doing the migration in two steps
> >> with pause-before-switchover and has the potential to increase guest
> >> downtime quite a bit. So I went for this approach instead.
> >
> > Interesting point. Maybe we need a way to automatically complete all the
> > jobs before switchover? There seems to be no reason to break the jobs if
> > the user didn't cancel them. (And of course no reason to allow a code
> > path leading to an assertion.)
>
> Wouldn't that be a bit unexpected? There could be jobs unrelated to
> migration or jobs at early stages. But sure, being able to trigger the
> assertion is not nice.
>
> Potential alternatives could be pausing the jobs or failing migration
> with a clean error?

I wonder if the latter is what we would get if child_job.inactivate()
just returned an error if the job isn't completed yet?

It would potentially result in a mix of active and inactive BDSes,
though, which is a painful state to be in. If you don't 'cont'
afterwards, you're likely to hit another assertion once you try to do
something with an inactive BDS.

Pausing the job feels a bit dangerous, because it means that you can
resume it as a user. We'd have to add code to check that the image has
actually been activated again before we allow resuming the job.

> For us, the former is still best done in combination with a way to
> switch to active (i.e. write-blocking) mode for drive-mirror.

Switching between these modes is a useful thing to have either way.

> The latter would force us to complete the drive-mirror job before
> switchover even with active (i.e. write-blocking) mode, breaking our
> usage of drive-mirror+migration that worked (in almost all cases, but it
> would have been all cases if we had used active mode ;)) for many years now.
>
> Maybe adding an option for how the jobs should behave upon switchover
> (e.g. complete/pause/cancel/cancel-migration) could help? Either as a
> job-specific option (more flexible) or a migration option?

Kevin
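To make the options discussed in this thread concrete, here is a toy model in Python. It is only a sketch: every name in it (`OnSwitchover`, `Job`, `Node`, `inactivate_all()`, `resume()`) is hypothetical and does not correspond to QEMU's actual block job or migration code. It assumes a per-job switchover option with the four values proposed above (complete/pause/cancel/cancel-migration), fails with a clean error instead of an assertion, and refuses to resume a paused job while its node is still inactive.

```python
from dataclasses import dataclass, field
from enum import Enum


class OnSwitchover(Enum):
    # Hypothetical per-job option: what a block job does at switchover.
    COMPLETE = "complete"
    PAUSE = "pause"
    CANCEL = "cancel"
    CANCEL_MIGRATION = "cancel-migration"


class SwitchoverError(Exception):
    """A clean error instead of an assertion failure."""


@dataclass
class Job:
    name: str
    on_switchover: OnSwitchover
    status: str = "running"  # "running", "paused", "completed" or "cancelled"


@dataclass
class Node:
    # Stand-in for a block node (BDS) and the jobs doing I/O on it.
    name: str
    jobs: list = field(default_factory=list)
    active: bool = True


def _finished(job: Job) -> bool:
    return job.status in ("completed", "cancelled")


def inactivate_all(nodes: list) -> None:
    # Phase 1: fail before touching anything, so an error cannot leave
    # some nodes inactive while others are still active.
    blocking = [job.name for node in nodes for job in node.jobs
                if not _finished(job)
                and job.on_switchover is OnSwitchover.CANCEL_MIGRATION]
    if blocking:
        raise SwitchoverError("jobs not completed: " + ", ".join(blocking))
    # Phase 2: apply each job's policy, then inactivate its node.
    for node in nodes:
        for job in node.jobs:
            if _finished(job):
                continue
            if job.on_switchover is OnSwitchover.COMPLETE:
                job.status = "completed"  # toy model: completion is instant
            elif job.on_switchover is OnSwitchover.PAUSE:
                job.status = "paused"
            else:  # CANCEL; unfinished CANCEL_MIGRATION jobs were rejected
                job.status = "cancelled"
        node.active = False


def resume(job: Job, node: Node) -> None:
    # A paused job may only be resumed once its node is active again.
    if not node.active:
        raise SwitchoverError(f"node {node.name} is still inactive")
    if job.status == "paused":
        job.status = "running"


if __name__ == "__main__":
    mirror = Job("mirror0", OnSwitchover.PAUSE)
    backup = Job("backup0", OnSwitchover.CANCEL_MIGRATION)
    nodes = [Node("disk0", [mirror]), Node("disk1", [backup])]
    try:
        inactivate_all(nodes)
    except SwitchoverError as err:
        print(err)  # jobs not completed: backup0
    print([n.active for n in nodes], mirror.status)  # [True, True] running
```

Checking all jobs before inactivating any node is one way to avoid the mixed active/inactive state mentioned above; it is an assumption of this sketch rather than a description of existing QEMU behaviour.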
