This needs to be fixed ASAP; it has been failing very frequently for 2+ days. Of course, the underlying OS is also changing, so it doesn't necessarily have to be a vdsm change.
> On 12. 11. 2021, at 20:15, Milan Zamazal <[email protected]> wrote:
>
> Hi,
>
> Michal has observed occasional OST failures in test_vdsm_recovery last
> days, which hadn't been seen before. When `systemctl stop vdsmd' is
> called (via Ansible) there, vdsmd (almost?) never finishes its shutdown
> within the 10 seconds timeout and gets then killed with SIGKILL. If
> this action is accompanied by "Job for vdsmd.service canceled." message
> then the test fails; otherwise OST continues normally.
>
> The situation is reproducible by running OST basic-suite-master and
> making it artificially failing after test_vdsm_recovery. Then running
> `systemctl stop vdsmd' manually on the given OST host (can be done
> repeatedly, so it provides a good opportunity to examine the problem).
>
> There are two problems there:
>
> - "Job for vdsmd.service canceled." message that sometimes occurs after
>   `systemctl stop vdsmd' and then the test fails. I don't know what it
>   means and I can't identify any difference in journal between when the
>   message occurs and when it doesn't.
>
> - The fact that Vdsm doesn't stop within the timeout and must be killed.
>   This doesn't happen in my normal oVirt installation. It apparently
>   blocks in self.irs.prepareForShutdown() call from clientIF.py.
>   Journal says:
>
> systemd[1]: Stopping Virtual Desktop Server Manager...
> systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
> systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal
> SIGKILL.
> systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal
> SIGKILL.
> systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal
> SIGKILL.
> systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal
> SIGKILL.
> systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
> systemd[1]: vdsmd.service: Failed with result 'timeout'.
> systemd[1]: Stopped Virtual Desktop Server Manager.
>
> And vdsm.log (from a different run, sorry):
>
> 2021-11-12 07:09:30,274+0000 INFO (MainThread) [vdsm.api] START
> prepareForShutdown() from=internal,
> task_id=21b12bbd-1d61-4217-b92d-641a53d5f7bb (api:48)
> 2021-11-12 07:09:30,317+0000 DEBUG (vmchannels) [vds] VM channels listener
> thread has ended. (vmchannels:214)
> 2021-11-12 07:09:30,317+0000 DEBUG (vmchannels) [root] FINISH thread
> <Thread(vmchannels, stopped daemon 140163095193344)> (concurrent:261)
> 2021-11-12 07:09:30,516+0000 DEBUG (mailbox-hsm/4) [root] FINISH thread
> <Thread(mailbox-hsm/4, stopped daemon 140162799089408)> (concurrent:261)
> 2021-11-12 07:09:30,517+0000 INFO (ioprocess/143197) [IOProcess]
> (3a729aa1-8e14-4ea0-8794-e3d67fbde542) Starting ioprocess (__init__:465)
> 2021-11-12 07:09:30,521+0000 INFO (ioprocess/143199) [IOProcess]
> (ost-he-basic-suite-master-storage:_exports_nfs_share1) Starting ioprocess
> (__init__:465)
> 2021-11-12 07:09:30,535+0000 INFO (ioprocess/143193) [IOProcess]
> (ost-he-basic-suite-master-storage:_exports_nfs_share2) Starting ioprocess
> (__init__:465)
> 2021-11-12 07:09:30,679+0000 INFO (ioprocess/143195) [IOProcess]
> (0187cf2f-2344-48de-a2a0-dd007315399f) Starting ioprocess (__init__:465)
> 2021-11-12 07:09:30,719+0000 INFO (ioprocess/143192) [IOProcess]
> (15fa3d6c-671b-46ef-af9a-00337011fa26) Starting ioprocess (__init__:465)
> 2021-11-12 07:09:30,756+0000 INFO (ioprocess/143194) [IOProcess]
> (ost-he-basic-suite-master-storage:_exports_nfs_exported) Starting ioprocess
> (__init__:465)
> 2021-11-12 07:09:30,768+0000 INFO (ioprocess/143198) [IOProcess]
> (ost-he-basic-suite-master-storage:_exports_nfs__he) Starting ioprocess
> (__init__:465)
> 2021-11-12 07:09:30,774+0000 INFO (ioprocess/143196) [IOProcess]
> (a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c) Starting ioprocess (__init__:465)
> 2021-11-12 07:09:30,957+0000 DEBUG (mailbox-hsm/2) [root] FINISH thread
> <Thread(mailbox-hsm/2, stopped daemon 140162815874816)> (concurrent:261)
> 2021-11-12 07:09:31,629+0000 INFO (mailbox-hsm)
> [storage.mailbox.hsmmailmonitor] HSM_MailboxMonitor - Incoming mail
> monitoring thread stopped, clearing outgoing mail (mailbox:500)
> 2021-11-12 07:09:31,629+0000 INFO (mailbox-hsm)
> [storage.mailbox.hsmmailmonitor] HSM_MailMonitor sending mail to SPM -
> ['/usr/bin/dd',
> 'of=/rhev/data-center/f54c6052-437f-11ec-9094-54527d140533/mastersd/dom_md/inbox',
> 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=4096', 'count=1',
> 'seek=2'] (mailbox:382)
> 2021-11-12 07:09:32,610+0000 DEBUG (mailbox-hsm/1) [root] FINISH thread
> <Thread(mailbox-hsm/1, stopped daemon 140162841052928)> (concurrent:261)
> 2021-11-12 07:09:32,792+0000 DEBUG (mailbox-hsm/3) [root] FINISH thread
> <Thread(mailbox-hsm/3, stopped daemon 140162807482112)> (concurrent:261)
> 2021-11-12 07:09:32,818+0000 DEBUG (mailbox-hsm/0) [root] FINISH thread
> <Thread(mailbox-hsm/0, stopped daemon 140162824267520)> (concurrent:261)
> 2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Shutting
> down domain monitors (monitor:243)
> 2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop
> monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True) (monitor:268)
> 2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop
> monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True) (monitor:268)
> 2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop
> monitoring 3a729aa1-8e14-4ea0-8794-e3d67fbde542 (shutdown=True) (monitor:268)
> 2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop
> monitoring 0187cf2f-2344-48de-a2a0-dd007315399f (shutdown=True) (monitor:268)
> 2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop
> monitoring 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (shutdown=True) (monitor:268)
> 2021-11-12 07:09:32,831+0000 INFO (monitor/0187cf2) [storage.check] Stop
> checking
> '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_exported/0187cf2f-2344-48de-a2a0-dd007315399f/dom_md/metadata'
> (check:135)
> 2021-11-12 07:09:32,838+0000 INFO (monitor/a8bab4e) [storage.check] Stop
> checking
> '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata'
> (check:135)
> 2021-11-12 07:09:32,844+0000 INFO (monitor/15fa3d6) [storage.check] Stop
> checking
> '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs__he/15fa3d6c-671b-46ef-af9a-00337011fa26/dom_md/metadata'
> (check:135)
> 2021-11-12 07:09:32,864+0000 INFO (monitor/3a729aa) [storage.check] Stop
> checking
> '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share2/3a729aa1-8e14-4ea0-8794-e3d67fbde542/dom_md/metadata'
> (check:135)
> 2021-11-12 07:09:32,865+0000 INFO (monitor/186180c) [storage.check] Stop
> checking '/dev/186180cb-5cc5-4aa4-868a-9e1ed7965ddf/metadata' (check:135)
> 2021-11-12 07:09:32,867+0000 INFO (monitor/186180c) [storage.blocksd]
> Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
> 2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c) [common.commands]
> /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config
> 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1
> write_cache_state=0 disable_after_error_count=3
> filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|",
> "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global {
> locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0
> use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }' --available n
> 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
> 2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c) [common.commands]
> FAILED: <err> = b' Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids
> in use.\n Can\'t deactivate volume group
> "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\n'; <rc>
> = 5 (commands:186)
> 2021-11-12 07:09:38,066+0000 WARN (monitor/186180c) [storage.lvm] Command
> with specific filter failed or returned no data, retrying with a wider
> filter: LVM command failed: 'cmd=[\'/sbin/lvm\', \'vgchange\', \'--config\',
> \'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1
> write_cache_state=0 disable_after_error_count=3
> filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|",
> "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global {
> locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0
> use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }\', \'--available\',
> \'n\', \'186180cb-5cc5-4aa4-868a-9e1ed7965ddf\'] rc=5 out=[] err=[\' Logical
> volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\', \' Can\\\'t
> deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open
> logical volume(s)\']' (lvm:532)
> 2021-11-12 07:09:38,067+0000 DEBUG (monitor/186180c) [common.commands]
> /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config
> 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1
> write_cache_state=0 disable_after_error_count=3
> filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405d40a7e761def4ed09aac63282$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|^/dev/mapper/36001405ece10ae48e99470295edda1bb$|^/dev/mapper/36001405f269e8e9efb34e5f9170b349b$|^/dev/mapper/36001405f8c5fe6c239e46a0976b842df$|",
> "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global {
> locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0
> use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }' --available n
> 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
>
> What is interesting here is that ioprocess is started here rather than
> being killed as on clean vdsmd shutdowns.
>
> Any ideas?
>
> Thanks,
> Milan
>
> _______________________________________________
> Devel mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/UZ67MRXHZIEAVL63AOW4BADQH4OU447V/
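Two things in the quoted logs seem worth checking directly on the affected host. "Job for vdsmd.service canceled." typically means the queued stop job was replaced by a conflicting job in systemd's queue (e.g. a start triggered by a socket or dependency), and "Logical volume .../ids in use" suggests some process still holds the ids LV open during teardown. A sketch of the checks I'd try; the fuser/journalctl invocations are assumptions about what would be useful there, not commands already run on that host:

```shell
# On the affected OST host (untested there, paths taken from the logs above):
#
#   # What still holds the ids LV open when vgchange --available n fails?
#   fuser -v /dev/186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids
#
#   # Was the stop job replaced by a competing (re)start job?
#   journalctl -u vdsmd.service -b --no-pager | grep -E 'canceled|stop-sigterm'
#
# Runnable stand-in for that journal filter, fed with lines quoted above:
sample_journal() {
    cat <<'EOF'
systemd[1]: Stopping Virtual Desktop Server Manager...
systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
systemd[1]: vdsmd.service: Failed with result 'timeout'.
EOF
}
sample_journal | grep -E 'canceled|stop-sigterm'
```

On a real host the commented fuser/journalctl commands would replace the stand-in; comparing a run that prints the "canceled" line with one that doesn't might show the conflicting job.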
