VM HA not working when there's an issue with vm restart

### problem

We're trying VM HA on cloudstack. We disabled HOST HA (see issue [13371](https://github.com/apache/cloudstack/issues/13371)) and then simulated a crash on an host with a "fake" kernel panic. The investigators correctly found the host dead and put it in alert/disconnected state. 
```text
2026-06-29 10:01:03,855 DEBUG [o.a.c.k.h.KVMHostActivityChecker] (AgentTaskPool-3:[ctx-5facfbaf]) (logid:9ccd9518) Checking Host {"id":76,"name":"fakehost1.ha","type":"Routing","uuid":"ad118d97-03e2-4555-9d70-595864e87
1fd"} status...
2026-06-29 10:01:20,849 INFO  [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-4:[ctx-dd435ced]) (logid:69b8079e) Investigating why host Host {"id":76,"name":"fakehost1.ha","type":"Routing","uuid":"ad118d97-03e2-4555
-9d70-595864e871fd"} has disconnected with event
2026-06-29 10:01:23,858 DEBUG [o.a.c.k.h.KVMHostActivityChecker] (AgentTaskPool-3:[ctx-5facfbaf]) (logid:9ccd9518) Setting Host {"id":76,"name":"fakehost1.ha","type":"Routing","uuid":"ad118d97-03e2-4555-9d70-595864e871
fd"} to "Disconnected" status.
2026-06-29 10:01:23,860 DEBUG [o.a.c.k.h.KVMHostActivityChecker] (AgentTaskPool-3:[ctx-5facfbaf]) (logid:9ccd9518) Investigating Host {"id":76,"name":"fakehost1.ha","type":"Routing","uuid":"ad118d97-03e2-4555-9d70-5958
64e871fd"} via neighboring Host {"id":75,"name":"fakehost2.ha","type":"Routing","uuid":"ffea4f8e-0d91-4a22-b600-aea1891be419"}.
2026-06-29 10:01:23,918 DEBUG [o.a.c.k.h.KVMHostActivityChecker] (AgentTaskPool-3:[ctx-5facfbaf]) (logid:9ccd9518) Host {"id":76,"name":"fakehost1.ha","type":"Routing","uuid":"ad118d97-03e2-4555-9d70-595864e871fd"} is 
not active according to neighbor Host {"id":75,"name":"fakehost2.ha","type":"Routing","uuid":"ffea4f8e-0d91-4a22-b600-aea1891be419"}, details: Heart is not beating.
2026-06-29 10:01:23,918 DEBUG [o.a.c.k.h.KVMHostActivityChecker] (AgentTaskPool-3:[ctx-5facfbaf]) (logid:9ccd9518) HA: HOST is ineligible legacy state Disconnected for host Host {"id":76,"name":"fakehost1.ha","type":"R
outing","uuid":"ad118d97-03e2-4555-9d70-595864e871fd"}
2026-06-29 10:01:23,918 INFO  [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-3:[ctx-5facfbaf]) (logid:9ccd9518) The agent from host Host {"id":76,"name":"fakehost1.ha","type":"Routing","uuid":"ad118d97-03e2-4555-9d
70-595864e871fd"} state determined is Disconnected
2026-06-29 10:01:23,918 WARN  [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-3:[ctx-5facfbaf]) (logid:9ccd9518) Agent is disconnected but the host is still up: Host {"id":76,"name":"fakehost1.ha","type":"Routing","
uuid":"ad118d97-03e2-4555-9d70-595864e871fd"} state: Enabled
```

The management servers then tried to start the vms on another host

```text
2026-06-29 10:01:54,745 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl$5] (Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200]) (logid:28a7d0f4) Executing AsyncJob {"accountId":1,"cmd":"com.cloud.vm.VmWorkStart","cmdInfo":"rO0ABXNyABhjb2
0uY2xvdWQudm0uVm1Xb3JrU3RhcnR9cMGsvxz73gIAC0oABGRjSWRMAAZhdm9pZHN0ADBMY29tL2Nsb3VkL2RlcGxveS9EZXBsb3ltZW50UGxhbm5lciRFeGNsdWRlTGlzdDtMAAljbHVzdGVySWR0ABBMamF2YS9sYW5nL0xvbmc7TAAGaG9zdElkcQB-AAJMAAtqb3VybmFsTmFtZXQAEkxqYXZhL2xhbmcvU3R
yaW5nO0wAEXBoeXNpY2FsTmV0d29ya0lkcQB-AAJMAAdwbGFubmVycQB-AANMAAVwb2RJZHEAfgACTAAGcG9vbElkcQB-AAJMAAlyYXdQYXJhbXN0AA9MamF2YS91dGlsL01hcDtMAA1yZXNlcnZhdGlvbklkcQB-AAN4cgATY29tLmNsb3VkLnZtLlZtV29ya5-ZtlbwJWdrAgAESgAJYWNjb3VudElkSgAGdXNl
cklkSgAEdm1JZEwAC2hhbmRsZXJOYW1lcQB-AAN4cAAAAAAAAAABAAAAAAAAAAEAAAAAAAAFy3QAGVZpcnR1YWxNYWNoaW5lTWFuYWdlckltcGwAAAAAAAAAAHBwcHBwcHBwc3IAEWphdmEudXRpbC5IYXNoTWFwBQfawcMWYNEDAAJGAApsb2FkRmFjdG9ySQAJdGhyZXNob2xkeHA_QAAAAAAADHcIAAAAEAAAA
AF0AAtIYU9wZXJhdGlvbnQAP3JPMEFCWE55QUJGcVlYWmhMbXhoYm1jdVFtOXZiR1ZoYnMwZ2NvRFZuUHJ1QWdBQldnQUZkbUZzZFdWNGNBRXhw","cmdVersion":0,"completeMsid":null,"created":"Mon Jun 29 10:01:52 CEST 2026","id":21200,"initMsid":90520741747699,"insta
nceId":null,"instanceType":null,"lastPolled":null,"lastUpdated":null,"processStatus":0,"removed":null,"result":null,"resultCode":0,"status":"IN_PROGRESS","userId":1,"uuid":"319b5aa3-a42e-42d6-a538-c9c172529d70"}
2026-06-29 10:01:54,788 INFO  [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98, FirstFitRoutingAllocator]) (logid:28a7d0f4)  Guest VM is requested with Custom[UEFI] Boot Typ
e false
2026-06-29 10:01:54,831 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) (logid:28a7d0f4) Network Network {"id": 347, "name": "DR-FE_11NET", "uuid": "48655ada-11ad-4101-abc
b-1531f6575018", "networkofferingid": 32} is already implemented
2026-06-29 10:01:54,842 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) (logid:28a7d0f4) Changing active number of NICs for Network ID=Network {"id": 347, "name": "DR-FE_1
1NET", "uuid": "48655ada-11ad-4101-abcb-1531f6575018", "networkofferingid": 32} on 1
2026-06-29 10:01:54,849 DEBUG [o.a.c.e.o.VolumeOrchestrator] (Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) (logid:28a7d0f4) No need to recreate the volume [{"name":"ROOT-1483","uuid":"37824f75-50a3-4dea-9062
-f0ee8cde341c"}] since it already has an assigned pool: [0f61b84c-4c67-3856-8e16-eb920a49c200]. Adding disk to the VM.
.....
2026-06-29 10:01:55,976 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) (logid:28a7d0f4) The nic Nic {"broadcastUri":"vlan:\/\/20","deviceId":0,"iPv4Address":null,"id":264
6,"instanceId":1483,"reservationId":"22066ce0-4ddd-4d9a-9fa3-c1d7c611ba2d","uuid":"9a4c66dc-e3f9-484f-a1af-2817ebfba998"} on NicProfile {"broadcastUri":null,"deviceId":0,"iPv4Address":null,"id":2646,"reservationId":"22066ce0-4ddd-4d9
a-9fa3-c1d7c611ba2d","uuid":"9a4c66dc-e3f9-484f-a1af-2817ebfba998","vmId":1483} was released according to VM instance {"id":1483,"instanceName":"i-2-1483-VM","state":"Starting","type":"User","uuid":"1f7afd4f-398b-4789-ae14-73d28903f0
36"} by guru com.cloud.network.guru.ExternalGuestNetworkGuru@601e3c8a, now updating record.
2026-06-29 10:01:55,977 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) (logid:28a7d0f4) Changing active number of NICs for Network ID=Network {"id": 347, "name": "DR-FE_1
1NET", "uuid": "48655ada-11ad-4101-abcb-1531f6575018", "networkofferingid": 32} on -1
2026-06-29 10:01:56,038 WARN  [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) (logid:28a7d0f4) The last host [Host {"id":76,"name":"fakehost1.ha","type":"Ro
uting","uuid":"ad118d97-03e2-4555-9d70-595864e871fd"}] of VM [VM instance {"id":1483,"instanceName":"i-2-1483-VM","state":"Starting","type":"User","uuid":"1f7afd4f-398b-4789-ae14-73d28903f036"}] is in the avoid set. Skipping this and
 trying other available hosts.
....
2026-06-29 10:01:56,189 INFO  [o.a.c.s.a.ZoneWideStoragePoolAllocator] (Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) (logid:28a7d0f4) Using volume allocation algorithm firstfit to reorder pools.
2026-06-29 10:01:56,221 ERROR [c.c.v.ClusteredVirtualMachineManagerImpl] (Work-Job-Executor-7:[ctx-fa35a62a, job-21199/job-21200, ctx-60b8df98]) (logid:28a7d0f4) Unable to orchestrate start VM instance {"id":1483,"instanceName":"i-2-
1483-VM","state":"Stopped","type":"User","uuid":"1f7afd4f-398b-4789-ae14-73d28903f036"} due to [Unable to create a deployment for VM instance {"id":1483,"instanceName":"i-2-1483-VM","state":"Starting","type":"User","uuid":"1f7afd4f-3
98b-4789-ae14-73d28903f036"} after 3 attempts Last known error: Unable to start VM on Host {"id":75,"name":"fakehost2.ha","type":"Routing","uuid":"ffea4f8e-0d91-4a22-b600-aea1891be419"} due to internal error: QEMU unex
pectedly closed the monitor (vm='i-2-1483-VM'): 2026-06-29T08:01:55.363939Z qemu-kvm: -blockdev {"node-name":"libvirt-2-format","read-only":false,"discard":"unmap","cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"lib
virt-2-storage","backing":null}: Failed to get "write" lock
2026-06-29 10:01:56,249 WARN  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-1:[ctx-9f19d45c, work-17]) (logid:9228a43f) Encountered unhandled exception during HA process, reschedule work HAWork[17-HA-1483-Stopped-Scheduled] com.cloud.utils.exception.CloudRuntimeException: Unable to orchestrate the start of VM instance {"instanceName":"i-2-1483-VM","uuid":"1f7afd4f-398b-4789-ae14-73d28903f036"}.
        at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:6012)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:569)
        at com.cloud.vm.VmWorkJobHandlerProxy.handleVmWorkJob(VmWorkJobHandlerProxy.java:102)
        at com.cloud.vm.VirtualMachineManagerImpl.handleVmWorkJob(VirtualMachineManagerImpl.java:6133)
        at com.cloud.vm.VmWorkJobDispatcher.runJob(VmWorkJobDispatcher.java:99)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:698)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:646)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)

2026-06-29 10:01:56,250 WARN  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-1:[ctx-9f19d45c, work-17]) (logid:9228a43f) Rescheduling work HAWork[17-HA-1483-Stopped-Scheduled] to try again at 2026-06-29T10:02:26.432+0200. Finished attempt 1/5 times.
```
The job fails (expectedly i would say) because of the nfs write lock held from the dead host.
But when retrying the vm restart we get
```
2026-06-29 10:01:52,843 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-1:[ctx-9f19d45c, work-17]) (logid:9228a43f) Processing work HAWork[17-HA-1483-Stopped-Scheduled]
2026-06-29 10:01:52,845 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-1:[ctx-9f19d45c, work-17]) (logid:9228a43f) HA on VM instance {"id":1483,"instanceName":"i-2-1483-VM","state":"Stopped","type":"User","uuid":"1f7afd4f-398b-4789-ae14-73d28903f036"}
2026-06-29 10:01:52,860 WARN  [o.a.c.f.j.AsyncJobExecutionContext] (HA-Worker-1:[ctx-9f19d45c, work-17, ctx-b9bac272]) (logid:9228a43f) Job is executed without a context, setup psudo job for the executing thread
2026-06-29 10:01:52,871 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (HA-Worker-1:[ctx-9f19d45c, work-17, ctx-b9bac272]) (logid:9228a43f) Sync job-21200 execution on object VmWorkJobQueue.1483
2026-06-29 10:01:56,249 WARN  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-1:[ctx-9f19d45c, work-17]) (logid:9228a43f) Encountered unhandled exception during HA process, reschedule work HAWork[17-HA-1483-Stopped-Scheduled] com.cloud.utils.exception.CloudRuntimeException: Unable to orchestrate the start of VM instance {"instanceName":"i-2-1483-VM","uuid":"1f7afd4f-398b-4789-ae14-73d28903f036"}.
2026-06-29 10:01:56,250 WARN  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-1:[ctx-9f19d45c, work-17]) (logid:9228a43f) Rescheduling work HAWork[17-HA-1483-Stopped-Scheduled] to try again at 2026-06-29T10:02:26.432+0200. Finished attempt 1/5 times.
2026-06-29 10:02:54,848 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-0:[ctx-fce5735e, work-17]) (logid:f2990b6c) Processing work HAWork[17-HA-1483-Stopped-Scheduled]
2026-06-29 10:02:54,852 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-0:[ctx-fce5735e, work-17]) (logid:f2990b6c) HA on VM instance {"id":1483,"instanceName":"i-2-1483-VM","state":"Stopped","type":"User","uuid":"1f7afd4f-398b-4789-ae14-73d28903f036"}
2026-06-29 10:02:54,852 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-0:[ctx-fce5735e, work-17]) (logid:f2990b6c) VM VM instance {"id":1483,"instanceName":"i-2-1483-VM","state":"Stopped","type":"User","uuid":"1f7afd4f-398b-4789-ae14-73d28903f036"} has been changed.  Current State = Stopped Previous State = Stopped last updated = 20 previous updated = 17
2026-06-29 10:02:54,852 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-0:[ctx-fce5735e, work-17]) (logid:f2990b6c) Completed work HAWork[17-HA-1483-Stopped-Scheduled]. Took 2/5 attempts.
```
The restart process didn't even try a second time because of the mismatch in the update counter between the vm and the work (20 vs 17). The final state of the vm (without manual intervention) is Stopped.


### versions

Cloudstack version: 4.22.1.0, with management server running on ubuntu22 and agent running on oracle linux 9
Shared nfsv4 (4.2) storage 

### The steps to reproduce the bug

1. Disable HOST HA
2. Enable VM HA on some vm
3. Shutdown forcefully the host with aforementioned vm and check for nfs locks


### What to do about it?

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

VM HA not working when there's an issue with vm restart #13510

problem

versions

The steps to reproduce the bug

What to do about it?

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

VM HA not working when there's an issue with vm restart #13510

Description

problem

versions

The steps to reproduce the bug

What to do about it?

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions