TimServers opened a new issue, #12920:
URL: https://github.com/apache/cloudstack/issues/12920
### problem
##### ISSUE TYPE
* Bug Report
##### COMPONENT NAME
KVM, Orchestration, HA
##### CLOUDSTACK VERSION
4.22.x
##### CONFIGURATION
- KVM hypervisor
- Shared primary storage on NFS
- HA-enabled user VM
- sync.interval = 60
- no ha.tag configured
- tested with an HA-enabled VM deployed on a healthy KVM host
- multiple management servers in the zone/cluster
##### OS / ENVIRONMENT
- CloudStack management servers on Ubuntu 24.04
- MySQL 8
- KVM hosts on Linux/libvirt
- Primary storage: NFS
##### SUMMARY
On CloudStack 4.22.x, if a KVM VM is stopped unexpectedly on the hypervisor
using `virsh destroy`, CloudStack detects `PowerReportMissing`, waits for the
grace period, and schedules HA restart work. However, the HA worker then fails
to restart the VM because `KVMInvestigator` reports the VM as alive
(`alive? true`) while the host is still up.
As a result:
- the VM remains in `Running` state in CloudStack/UI
- the VM is not transitioned to `Stopped`
- HA does not restart it
- the same HA scheduling/investigation loop repeats on subsequent sync cycles
This appears related to #10406 / #10407, which were intended to fix cases
where VMs were not moving to `Stopped` when `PowerReportMissing` is processed.
##### EXPECTED RESULTS
After the grace period passes, CloudStack should process
`PowerReportMissing`, transition the VM to `Stopped`, and, because HA is
enabled, restart the VM automatically.
Expected behavior for this test case:
1. `virsh destroy <domain>` removes the libvirt domain.
2. CloudStack detects the VM as missing.
3. After the grace period expires, CloudStack updates the VM power report
to `PowerReportMissing`.
4. CloudStack transitions the VM state from `Running` to `Stopped`.
5. HA schedules a restart for the VM.
6. The VM is restarted automatically on an eligible host.
7. The CloudStack UI/API reflects the VM state correctly and does not
continue to show the VM as `Running`.
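
The expected sequence above can be sketched as a small decision function. This is an illustrative model only, not CloudStack code: `VmObservation`, `expected_actions`, and all field names are hypothetical and exist just to make the expected ordering explicit.

```python
from dataclasses import dataclass


@dataclass
class VmObservation:
    """Hypothetical snapshot of what the management server knows about a VM."""
    power_report_missing: bool  # the host no longer reports the VM
    grace_period_passed: bool   # enough time has elapsed since the last update
    domain_exists: bool         # whether libvirt still has the domain
    ha_enabled: bool            # whether the VM is HA-enabled


def expected_actions(obs: VmObservation) -> list[str]:
    """Return the actions this report expects, in order."""
    if not (obs.power_report_missing and obs.grace_period_passed):
        return []  # still inside the grace period: nothing to do yet
    if obs.domain_exists:
        return []  # the domain is really there: no restart needed
    actions = ["transition Running -> Stopped"]
    if obs.ha_enabled:
        actions.append("restart on an eligible host")
    return actions


# The test case in this report: domain destroyed, grace period over, HA on.
print(expected_actions(VmObservation(True, True, False, True)))
# ['transition Running -> Stopped', 'restart on an eligible host']
```

In the failing case described below, the observed behavior corresponds to the second branch being taken even though the domain no longer exists.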
##### ACTUAL RESULTS
CloudStack detects the VM as missing, and the grace-period check behaves as
expected:
```text
2026-03-31 02:28:43,791 DEBUG ... Detected missing VM. host: 6, vm id:
91(...), power state: PowerReportMissing, last state update:
2026-03-31T02:27:43+0000
2026-03-31 02:28:43,791 DEBUG ... vm id: 91 - time since last state
update(60791 ms) has not passed graceful period yet
2026-03-31 02:29:43,722 DEBUG ... Detected missing VM. host: 6, vm id:
91(...), power state: PowerReportMissing, last state update:
2026-03-31T02:27:43+0000
2026-03-31 02:29:43,722 DEBUG ... vm id: 91 - time since last state
update(120722 ms) has passed graceful period
```
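
The elapsed times in these log lines can be reproduced from the timestamps. The sketch below does that arithmetic. The 120,000 ms threshold is an assumption, picked only because it falls between the two logged values; the actual grace period is not stated in this report. Both timestamps are assumed to be in the same time zone.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: the real grace period is not stated in this report.
# 120,000 ms is assumed only because it lies between the two logged values.
ASSUMED_GRACE_PERIOD_MS = 120_000

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_ms(last_update: str, detected_at: str) -> int:
    """Whole milliseconds between the last state update and the detection."""
    start = datetime.strptime(last_update, LOG_TIME_FORMAT)
    end = datetime.strptime(detected_at, LOG_TIME_FORMAT)
    return (end - start) // timedelta(milliseconds=1)


def grace_period_passed(elapsed: int, grace_ms: int = ASSUMED_GRACE_PERIOD_MS) -> bool:
    """True once more time has elapsed than the (assumed) grace period."""
    return elapsed > grace_ms


first = elapsed_ms("2026-03-31 02:27:43,000", "2026-03-31 02:28:43,791")
second = elapsed_ms("2026-03-31 02:27:43,000", "2026-03-31 02:29:43,722")
print(first, grace_period_passed(first))    # 60791 False
print(second, grace_period_passed(second))  # 120722 True
```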
After the grace period passes, CloudStack updates the VM power report and
schedules HA restart work:
```
2026-03-31 02:29:43,742 DEBUG ... VM state report is updated. Host {...}, VM
instance {"id":91,"instanceName":"i-2-91-VM","state":"Running"...}, power
state: PowerReportMissing
2026-03-31 02:29:43,775 INFO ... Detected out-of-band stop of a HA enabled
VM ... will schedule restart.
2026-03-31 02:29:43,798 INFO ... Schedule vm for HA: VM instance
{"id":91,"instanceName":"i-2-91-VM","state":"Running"...}
2026-03-31 02:29:43,820 INFO ... HA on VM instance
{"id":91,"instanceName":"i-2-91-VM","state":"Running"...}
```
The HA worker checks the VM, and the host-side agent confirms that the
libvirt domain no longer exists:
```
2026-03-31 02:29:43,855 DEBUG ... Unable to get vm state on VM instance
{"id":91,"instanceName":"i-2-91-VM","state":"Running"...}
```
KVM host agent log:
```
2026-03-31 02:29:43,928 ERROR ... Could not get state for VM [i-2-91-VM]
(retry=0) due to: org.libvirt.LibvirtException: Domain not found: no domain
with matching name 'i-2-91-VM'
```
However, KVMInvestigator then reports the VM as alive, and the HA restart is
cancelled:
```
2026-03-31 02:29:43,859 INFO ... KVMInvestigator found VM instance
{"id":91,"instanceName":"i-2-91-VM","state":"Running"...} to be alive? true
2026-03-31 02:29:43,860 INFO ... VM instance
{"id":91,"instanceName":"i-2-91-VM","state":"Running"...} is alive and host is
up. No need to restart it.
```
This same pattern repeats on later sync cycles, including 02:31:43,
02:33:43, and 02:36:43.
Final observed behavior:
- the VM remains in `Running` state in CloudStack/UI
- the VM is not transitioned to `Stopped`
- HA does not restart the VM
- the missing-domain / HA-scheduled / KVMInvestigator `alive? true` loop
repeats continuously
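
One way to confirm the loop from a management-server log is to count how often an HA restart was scheduled for the VM and then cancelled because the investigator reported it alive. This is a minimal sketch, assuming the message fragments quoted in this report appear verbatim and that each log entry sits on one line. The function name is hypothetical, and the sample lines are synthetic, modelled on the excerpts above with timestamps omitted.

```python
import re

SCHEDULED = re.compile(r"Schedule vm for HA")
ALIVE = re.compile(r"KVMInvestigator found .* to be alive\? true")


def count_cancelled_ha_cycles(lines, instance_name):
    """Count HA schedules for one VM that were followed by 'alive? true'."""
    cycles = 0
    pending = False  # True after a schedule line, until the verdict arrives
    for line in lines:
        if instance_name not in line:
            continue  # ignore entries for other VMs
        if SCHEDULED.search(line):
            pending = True
        elif pending and ALIVE.search(line):
            cycles += 1
            pending = False
    return cycles


# Synthetic sample: two sync cycles showing the same schedule/cancel pattern.
sample = [
    'INFO Schedule vm for HA: VM instance {"instanceName":"i-2-91-VM"}',
    'INFO KVMInvestigator found VM instance {"instanceName":"i-2-91-VM"} to be alive? true',
    'INFO Schedule vm for HA: VM instance {"instanceName":"i-2-91-VM"}',
    'INFO KVMInvestigator found VM instance {"instanceName":"i-2-91-VM"} to be alive? true',
]
print(count_cancelled_ha_cycles(sample, "i-2-91-VM"))  # 2
```

A count that keeps growing by one per sync cycle would match the behavior reported here.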
### versions
cloudstack-management 4.22.0.0
cloudstack-agent 4.22.0.0
libvirt 10.0.0-2ubuntu8.11
ubuntu 24.04 LTS
### The steps to reproduce the bug
1. Deploy a user VM on a KVM host with HA enabled.
2. Confirm the VM is in `Running` state in CloudStack.
3. On the KVM host, destroy the domain unexpectedly:
```bash
virsh destroy <domain-name>
```
### What to do about it?
_No response_