[XCP-ng] Enable and test High Availability for virtual machines

Enabling and testing High Availability (HA) for virtual machines helps to ensure continuous availability of critical applications in the event of a physical host failure. With HA, virtual machines are automatically restarted on alternative physical hosts in the same cluster, minimizing downtime and ensuring that business-critical services are not interrupted.

After completing the two preparation steps, creating the Pool and setting up NFS Shared Storage, it’s time to enable High Availability for a virtual machine and test it.

To follow the instructions in this article, you need to complete all the steps shared in the previous 2 parts:

  • [XCP-ng] Resource Pool Management
  • [XCP-ng] Set up NFS Shared Storage and enable High Availability

1. Enable High Availability for the virtual machine

To be able to enable High Availability, the virtual machine must satisfy the following two conditions:

  • The Pool that contains the virtual machine must have High Availability enabled.
  • The virtual machine’s hard drive must be stored on Shared Storage.
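The two conditions above can be written down as a small pre-flight check. This is only a sketch to make the logic explicit: the Disk class, the ha_blockers function, and the example disk names are all hypothetical, and the values would have to be read off Xen Orchestra by hand rather than fetched from any real API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Disk:
    """Hypothetical description of one virtual disk of the VM."""
    name: str
    sr_shared: bool  # True when the disk lives on Shared Storage (e.g. NFS)


def ha_blockers(pool_ha_enabled: bool, disks: List[Disk]) -> List[str]:
    """Return the reasons a VM cannot be HA-protected (empty list = eligible)."""
    reasons = []
    # Condition 1: the Pool itself must have High Availability enabled.
    if not pool_ha_enabled:
        reasons.append("HA is not enabled on the pool")
    # Condition 2: every disk of the VM must sit on Shared Storage.
    for disk in disks:
        if not disk.sr_shared:
            reasons.append(f"disk '{disk.name}' is on local storage")
    return reasons


# Example values are made up for illustration.
vm_disks = [Disk("ubuntu-root", sr_shared=True), Disk("scratch", sr_shared=False)]
for reason in ha_blockers(pool_ha_enabled=True, disks=vm_disks):
    print("Blocked:", reason)
```

An empty result means both conditions hold and HA can be switched on for the VM.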

Open Xen Orchestra, open the virtual machine that needs HA, switch to the Advanced tab, scroll down to HA, and select restart.
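For readers who prefer the command line, the same setting can, to my knowledge, be applied on a pool host with the xe CLI by setting the VM's ha-restart-priority parameter to restart. The sketch below only builds and prints that command as a dry run; the helper name ha_restart_command and the all-zero UUID are placeholders I introduce for illustration, so check the command against the XCP-ng documentation before running it.

```python
import shlex
from typing import List


def ha_restart_command(vm_uuid: str) -> List[str]:
    """Build the xe call that sets a VM's HA restart priority to 'restart'."""
    if not vm_uuid:
        raise ValueError("vm_uuid must not be empty")
    return ["xe", "vm-param-set", f"uuid={vm_uuid}", "ha-restart-priority=restart"]


# Placeholder UUID: replace it with the UUID of the VM shown in Xen Orchestra.
cmd = ha_restart_command("00000000-0000-0000-0000-000000000000")
# Dry run: print the command instead of executing it.
print(shlex.join(cmd))
```

Printing first and pasting the command into an SSH session on the host keeps the step easy to review.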

In this article, I enable HA for the Ubuntu 22.04 virtual machine, which is currently running on the xcp-ng-18 server in My Pool.

2. Experiment 1: Shut down the virtual machine from inside the guest

After enabling the High Availability feature, the Ubuntu 22.04 virtual machine is monitored by Xen Orchestra so that it stays running, with the shortest possible downtime whenever a problem occurs. If Xen Orchestra detects that the virtual machine has stopped unexpectedly, it will attempt to restart it on another server in the same Pool.

For testing, I log in to the Ubuntu 22.04 virtual machine via SSH or the Console and shut it down with the command sudo shutdown now.

Since Xen Orchestra cannot know that the shutdown command was executed inside the virtual machine, it assumes that the virtual machine has failed. It immediately moves the virtual machine to another host and restarts it. In less than 1 minute, the Ubuntu 22.04 virtual machine was running on the xcp-ng-16 server and working properly again.
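To put a number on "less than 1 minute", the recovery can be timed from another machine by polling the VM's SSH port until it answers again. This is a minimal sketch, assuming the VM accepts SSH on port 22; the function names, the polling interval, and the example address are my own choices, not part of XCP-ng or Xen Orchestra.

```python
import socket
import time
from typing import Optional


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True when a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all of them as "down".
        return False


def wait_until_reachable(host: str, port: int, max_wait: float = 300.0,
                         interval: float = 1.0) -> Optional[float]:
    """Poll until the port answers; return seconds waited, or None on give-up."""
    start = time.monotonic()
    while time.monotonic() - start < max_wait:
        if is_port_open(host, port):
            return time.monotonic() - start
        time.sleep(interval)
    return None
```

Started right after the shutdown command, a call such as wait_until_reachable("192.0.2.10", 22) (the address is a documentation placeholder) returns roughly how long the VM took to come back.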

When the virtual machine needs to be shut down for maintenance, you can prevent the automatic restart by issuing the shutdown from Xen Orchestra with the Force shutdown button. Xen Orchestra then knows that the shutdown was requested, so it will not attempt to restart the virtual machine.

3. Experiment 2: Shut down the xcp-ng-18 server

In this experiment, I SSH into the xcp-ng-18 server and shut it down with the command sudo shutdown now.

Xen Orchestra received the shutdown signal from xcp-ng-18 and immediately started moving the Ubuntu 22.04 virtual machine elsewhere.

The Ubuntu 22.04 virtual machine switched to a yellow (busy) state.

Xen Orchestra started working as soon as it learned that xcp-ng-18 was about to shut down.

In less than 1 minute, the Ubuntu 22.04 virtual machine had been transferred to the xcp-ng-16 server.

During the roughly 1 minute of processing, the virtual machine kept working normally without losing its connection. I started pinging the virtual machine before the operation, and by the time it finished the connection had stayed stable with no dropped packets.
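A ping log like the one I watched can be reduced to a single figure: the longest stretch without a reply. The sketch below assumes the probes have already been collected as (timestamp in seconds, reachable) pairs; the function name and the sample data are illustrative, not measurements from my lab.

```python
from typing import List, Tuple


def longest_outage(samples: List[Tuple[float, bool]]) -> float:
    """Longest run of failed probes, in seconds.

    An outage lasts from the first failed probe to the next successful one.
    An outage still open at the end is measured up to the last sample.
    """
    longest = 0.0
    outage_start = None
    for timestamp, reachable in samples:
        if not reachable and outage_start is None:
            outage_start = timestamp  # first lost reply of a new outage
        elif reachable and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None  # the VM answered again
    if outage_start is not None:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


# Made-up samples: every probe answered, as in this experiment.
no_loss = [(0.0, True), (1.0, True), (2.0, True), (3.0, True)]
print(longest_outage(no_loss))  # 0.0
```

A result of 0.0 matches what I saw here; the crash experiments would show an outage of a few minutes instead.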

4. Experiment 3: Unplug the power cord

In this experiment, I will shut down the xcp-ng-18 server by unplugging the power cord to simulate a sudden crash.

After I unplugged the power cord, Xen Orchestra did not detect the problem right away as it had in the previous experiment. The Ubuntu 22.04 virtual machine was still shown as running, but its Console could no longer be accessed.

After 1 minute and 40 seconds, Xen Orchestra realized that the connection to xcp-ng-18 was lost and immediately moved the virtual machine to another host. Adding about 1 minute for the virtual machine to boot again, the total downtime was about 3 minutes.
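The downtime figure is simply detection time plus restart time. As a quick worked example using the numbers from this experiment (the helper names are mine):

```python
def total_downtime(detection_s: int, restart_s: int) -> int:
    """Downtime = time to notice the failure + time to boot the VM elsewhere."""
    return detection_s + restart_s


def format_duration(seconds: int) -> str:
    """Render a number of seconds as 'M min SS s'."""
    minutes, rest = divmod(seconds, 60)
    return f"{minutes} min {rest:02d} s"


# 1 minute 40 seconds to detect the failure, about 1 minute to restart the VM.
downtime = total_downtime(detection_s=100, restart_s=60)
print(format_duration(downtime))  # 2 min 40 s
```

That is 2 minutes 40 seconds, which I round to about 3 minutes.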

5. Experiment 4: Unplug the network

Instead of unplugging the power cord, this time I unplugged the network cable.

As in the previous experiment, Xen Orchestra took time to react to the incident. This time, 1 minute and 30 seconds passed before it started moving the virtual machine. Total downtime was again about 3 minutes.

The scenarios I simulated above are mainly intended to show how XCP-ng’s High Availability feature works. In practice, running a stable High Availability system requires a much more complicated setup and configuration.

In addition to the configuration of servers and virtual machines, High Availability must also be set up for the NFS Shared Storage, network switches, power supply, and so on, so that the entire system stays available through any incident.

Reference: Xen-Orchestra


