This needs to be fixed ASAP; it has been failing very frequently for 2+ days.
Note that the underlying OS is also changing, of course, so it doesn't
necessarily have to be a vdsm change.


> On 12. 11. 2021, at 20:15, Milan Zamazal <[email protected]> wrote:
> 
> Hi,
> 
> Michal has observed occasional OST failures in test_vdsm_recovery in
> recent days, which hadn't been seen before.  When `systemctl stop vdsmd'
> is called there (via Ansible), vdsmd (almost?) never finishes its shutdown
> within the 10-second timeout and is then killed with SIGKILL.  If this
> action is accompanied by a "Job for vdsmd.service canceled." message,
> then the test fails; otherwise OST continues normally.
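For reference, the failing condition is detectable from systemctl's stderr
alone; below is a minimal sketch (the helper names `was_canceled` and
`stop_vdsmd` are mine, not OST code) of how the stop could be classified
and retried:

```python
import subprocess

# The exact wording observed in the failing OST runs (quoted above).
CANCELED_MSG = "Job for vdsmd.service canceled."

def was_canceled(stderr):
    """Return True when systemctl reported the stop job as canceled,
    i.e. the case in which test_vdsm_recovery fails."""
    return CANCELED_MSG in stderr

def stop_vdsmd(retries=3):
    """Try `systemctl stop vdsmd`, retrying when the stop job gets
    canceled.  Returns True once a stop completes without being
    canceled, False if every attempt was canceled."""
    for _ in range(retries):
        result = subprocess.run(
            ["systemctl", "stop", "vdsmd"],
            capture_output=True,
            text=True,
        )
        if not was_canceled(result.stderr):
            return True
    return False
```

Whether retrying would actually help depends on what cancels the job in
the first place, which is exactly the open question.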
> 
> The situation is reproducible by running OST basic-suite-master and
> making it fail artificially after test_vdsm_recovery.  Then `systemctl
> stop vdsmd' can be run manually on the given OST host (repeatedly, so it
> provides a good opportunity to examine the problem).
> 
> There are two problems there:
> 
> - "Job for vdsmd.service canceled." message that sometimes occurs after
>  `systemctl stop vdsmd' and then the test fails.  I don't know what it
>  means and I can't identify any difference in journal between when the
>  message occurs and when it doesn't.
> 
> - The fact that Vdsm doesn't stop within the timeout and must be killed.
>  This doesn't happen in my normal oVirt installation.  It apparently
>  blocks in the self.irs.prepareForShutdown() call from clientIF.py.
>  Journal says:
> 
>    systemd[1]: Stopping Virtual Desktop Server Manager...
>    systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
>    systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal 
> SIGKILL.
>    systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal 
> SIGKILL.
>    systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal 
> SIGKILL.
>    systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal 
> SIGKILL.
>    systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
>    systemd[1]: vdsmd.service: Failed with result 'timeout'.
>    systemd[1]: Stopped Virtual Desktop Server Manager.
> 
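The journal excerpt above can also be classified mechanically when
bisecting runs; a sketch assuming journald keeps the exact wording shown
(the function name is hypothetical):

```python
def shutdown_timed_out(journal_lines):
    """Return True when the journal shows vdsmd hitting the
    stop-sigterm timeout (and being SIGKILLed) instead of exiting
    cleanly.  Matches only the message wording quoted above; other
    journald phrasings are not handled."""
    return any("State 'stop-sigterm' timed out" in line
               for line in journal_lines)
```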
>  And vdsm.log (from a different run, sorry):
> 
>    2021-11-12 07:09:30,274+0000 INFO  (MainThread) [vdsm.api] START 
> prepareForShutdown() from=internal, 
> task_id=21b12bbd-1d61-4217-b92d-641a53d5f7bb (api:48)
>    2021-11-12 07:09:30,317+0000 DEBUG (vmchannels) [vds] VM channels listener 
> thread has ended. (vmchannels:214)
>    2021-11-12 07:09:30,317+0000 DEBUG (vmchannels) [root] FINISH thread 
> <Thread(vmchannels, stopped daemon 140163095193344)> (concurrent:261)
>    2021-11-12 07:09:30,516+0000 DEBUG (mailbox-hsm/4) [root] FINISH thread 
> <Thread(mailbox-hsm/4, stopped daemon 140162799089408)> (concurrent:261)
>    2021-11-12 07:09:30,517+0000 INFO  (ioprocess/143197) [IOProcess] 
> (3a729aa1-8e14-4ea0-8794-e3d67fbde542) Starting ioprocess (__init__:465)
>    2021-11-12 07:09:30,521+0000 INFO  (ioprocess/143199) [IOProcess] 
> (ost-he-basic-suite-master-storage:_exports_nfs_share1) Starting ioprocess 
> (__init__:465)
>    2021-11-12 07:09:30,535+0000 INFO  (ioprocess/143193) [IOProcess] 
> (ost-he-basic-suite-master-storage:_exports_nfs_share2) Starting ioprocess 
> (__init__:465)
>    2021-11-12 07:09:30,679+0000 INFO  (ioprocess/143195) [IOProcess] 
> (0187cf2f-2344-48de-a2a0-dd007315399f) Starting ioprocess (__init__:465)
>    2021-11-12 07:09:30,719+0000 INFO  (ioprocess/143192) [IOProcess] 
> (15fa3d6c-671b-46ef-af9a-00337011fa26) Starting ioprocess (__init__:465)
>    2021-11-12 07:09:30,756+0000 INFO  (ioprocess/143194) [IOProcess] 
> (ost-he-basic-suite-master-storage:_exports_nfs_exported) Starting ioprocess 
> (__init__:465)
>    2021-11-12 07:09:30,768+0000 INFO  (ioprocess/143198) [IOProcess] 
> (ost-he-basic-suite-master-storage:_exports_nfs__he) Starting ioprocess 
> (__init__:465)
>    2021-11-12 07:09:30,774+0000 INFO  (ioprocess/143196) [IOProcess] 
> (a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c) Starting ioprocess (__init__:465)
>    2021-11-12 07:09:30,957+0000 DEBUG (mailbox-hsm/2) [root] FINISH thread 
> <Thread(mailbox-hsm/2, stopped daemon 140162815874816)> (concurrent:261)
>    2021-11-12 07:09:31,629+0000 INFO  (mailbox-hsm) 
> [storage.mailbox.hsmmailmonitor] HSM_MailboxMonitor - Incoming mail 
> monitoring thread stopped, clearing outgoing mail (mailbox:500)
>    2021-11-12 07:09:31,629+0000 INFO  (mailbox-hsm) 
> [storage.mailbox.hsmmailmonitor] HSM_MailMonitor sending mail to SPM - 
> ['/usr/bin/dd', 
> 'of=/rhev/data-center/f54c6052-437f-11ec-9094-54527d140533/mastersd/dom_md/inbox',
>  'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=4096', 'count=1', 
> 'seek=2'] (mailbox:382)
>    2021-11-12 07:09:32,610+0000 DEBUG (mailbox-hsm/1) [root] FINISH thread 
> <Thread(mailbox-hsm/1, stopped daemon 140162841052928)> (concurrent:261)
>    2021-11-12 07:09:32,792+0000 DEBUG (mailbox-hsm/3) [root] FINISH thread 
> <Thread(mailbox-hsm/3, stopped daemon 140162807482112)> (concurrent:261)
>    2021-11-12 07:09:32,818+0000 DEBUG (mailbox-hsm/0) [root] FINISH thread 
> <Thread(mailbox-hsm/0, stopped daemon 140162824267520)> (concurrent:261)
>    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Shutting 
> down domain monitors (monitor:243)
>    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop 
> monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True) (monitor:268)
>    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop 
> monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True) (monitor:268)
>    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop 
> monitoring 3a729aa1-8e14-4ea0-8794-e3d67fbde542 (shutdown=True) (monitor:268)
>    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop 
> monitoring 0187cf2f-2344-48de-a2a0-dd007315399f (shutdown=True) (monitor:268)
>    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop 
> monitoring 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (shutdown=True) (monitor:268)
>    2021-11-12 07:09:32,831+0000 INFO  (monitor/0187cf2) [storage.check] Stop 
> checking 
> '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_exported/0187cf2f-2344-48de-a2a0-dd007315399f/dom_md/metadata'
>  (check:135)
>    2021-11-12 07:09:32,838+0000 INFO  (monitor/a8bab4e) [storage.check] Stop 
> checking 
> '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata'
>  (check:135)
>    2021-11-12 07:09:32,844+0000 INFO  (monitor/15fa3d6) [storage.check] Stop 
> checking 
> '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs__he/15fa3d6c-671b-46ef-af9a-00337011fa26/dom_md/metadata'
>  (check:135)
>    2021-11-12 07:09:32,864+0000 INFO  (monitor/3a729aa) [storage.check] Stop 
> checking 
> '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share2/3a729aa1-8e14-4ea0-8794-e3d67fbde542/dom_md/metadata'
>  (check:135)
>    2021-11-12 07:09:32,865+0000 INFO  (monitor/186180c) [storage.check] Stop 
> checking '/dev/186180cb-5cc5-4aa4-868a-9e1ed7965ddf/metadata' (check:135)
>    2021-11-12 07:09:32,867+0000 INFO  (monitor/186180c) [storage.blocksd] 
> Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
>    2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c) [common.commands] 
> /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 
> 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  
> write_cache_state=0  disable_after_error_count=3  
> filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|",
>  "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  
> locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  
> use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }' --available n 
> 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
>    2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c) [common.commands] 
> FAILED: <err> = b'  Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids 
> in use.\n  Can\'t deactivate volume group 
> "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\n'; <rc> 
> = 5 (commands:186)
>    2021-11-12 07:09:38,066+0000 WARN  (monitor/186180c) [storage.lvm] Command 
> with specific filter failed or returned no data, retrying with a wider 
> filter: LVM command failed: 'cmd=[\'/sbin/lvm\', \'vgchange\', \'--config\', 
> \'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  
> write_cache_state=0  disable_after_error_count=3  
> filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|",
>  "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  
> locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  
> use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }\', \'--available\', 
> \'n\', \'186180cb-5cc5-4aa4-868a-9e1ed7965ddf\'] rc=5 out=[] err=[\'  Logical 
> volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\', \'  Can\\\'t 
> deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open 
> logical volume(s)\']' (lvm:532)
>    2021-11-12 07:09:38,067+0000 DEBUG (monitor/186180c) [common.commands] 
> /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 
> 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  
> write_cache_state=0  disable_after_error_count=3  
> filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405d40a7e761def4ed09aac63282$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|^/dev/mapper/36001405ece10ae48e99470295edda1bb$|^/dev/mapper/36001405f269e8e9efb34e5f9170b349b$|^/dev/mapper/36001405f8c5fe6c239e46a0976b842df$|",
>  "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  
> locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  
> use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }' --available n 
> 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
> 
>  What is interesting here is that ioprocess instances are being started
>  rather than killed, as they are on clean vdsmd shutdowns.
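The vgchange failure in the log also looks mechanical to spot: the `ids`
LV is still open (in oVirt block storage domains that LV typically backs
the sanlock lockspace, so sanlock would be my first suspect for the
holder, though that is an assumption on my part). A sketch of extracting
the busy LVs from LVM's stderr (the helper name is mine):

```python
import re

def busy_lvs(lvm_stderr):
    """Extract the vg/lv names that LVM reports as 'in use', as in
    the vgchange failure quoted above."""
    return re.findall(r"Logical volume (\S+) in use", lvm_stderr)
```

On the failure above this yields
['186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids'], which would at least tell
the test which holder to go looking for (e.g. with lsof, or `sanlock
client status` if sanlock is indeed the holder).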
> 
> Any ideas?
> 
> Thanks,
> Milan
> 
_______________________________________________
Devel mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/[email protected]/message/UZ67MRXHZIEAVL63AOW4BADQH4OU447V/