Here is a little peek at something we are working on.
Instance High Availability for OpenStack.
Some may be familiar with this from their days as a VMware admin, where it was generally known as VMware HA.
In summary, OpenStack doesn’t have a native ability to restart instances (virtual machines) on other hosts when the host they are running on becomes unavailable (i.e., goes down). So, Awnix began a project to create something that will perform this function, and a little more.
Ok, but what does it actually do?
Today all of the main functionality exists and works as intended. The software will:
Identify a down compute node, warn an admin, evacuate all instances, and restart them on running healthy nodes.
Identify a “sick” compute node by running a series of “health checks” (which can easily be customized) and then warn an administrator of a “sick” condition, or warn the admin and drain all instances from that node and disable it.
Identify a node that is acting rather odd but otherwise appears to be working: disable the node to prevent it from hosting any new workloads, notify the administrator, but otherwise leave the running instances and the host alone so that an admin can take a look.
Automatically add repaired compute nodes back into the monitoring system.
Integrates with Slack.com for notifications.
Scales to… unknown at this time. A lot. Tested on over 30 compute nodes in a cluster thus far without issue.
Minimal footprint/resource usage.
Customizable health checks. (ok, they’re just scripts).
Runs on any OpenStack distro with standard APIs.
and probably some other cool things I’m overlooking right now.
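As a rough illustration of how the customizable health checks described above could work, here is a minimal sketch of a runner that executes every script in a checks directory and treats a non-zero exit code as a failed check. The function name, directory default, and timeout here are my own assumptions, not the actual implementation (the “health” directory name is mentioned later in this post):

```python
import os
import subprocess
from pathlib import Path

def run_health_checks(health_dir="health", timeout=30):
    """Run every executable script in health_dir.

    Returns a dict mapping script name to True (exit code 0, check
    passed) or False (non-zero exit or timeout, check failed).
    """
    results = {}
    for script in sorted(Path(health_dir).iterdir()):
        # Only regular, executable files count as checks.
        if not (script.is_file() and os.access(script, os.X_OK)):
            continue
        try:
            proc = subprocess.run([str(script)], capture_output=True,
                                  timeout=timeout)
            results[script.name] = proc.returncode == 0
        except subprocess.TimeoutExpired:
            results[script.name] = False  # a hung check counts as failed
    return results
```

Because the checks are plain scripts, an admin can drop in anything from a SMART status probe to a fan-speed check without touching the monitoring code itself.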
Pics or it didn’t happen!
Here are a few screenshots from our Slack channel this evening, showing a few scenarios in action.
1. Safety First!
The software isn’t working right, isn’t running, and isn’t checking in.
Is the host down? Or, did someone kill the HA service?
Before taking actions to migrate or evacuate VMs, we should run some safety checks to see whether the host is actually alive, to avoid doing anything dramatic to instances that may actually be fine.
So the software runs some checks to see if the host appears to be alive. It is.
Something else must be wrong that isn’t affecting the instances. But this is automation, so we don’t want to automatically take actions that may be unnecessary, and we also don’t want new instances started on a host that is no longer monitored.
The software notifies the admin that the host appears to be alive but something is wrong, so the host is disabled in the scheduler and will no longer be able to accept new instances.
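The safety-first decision in this scenario can be sketched as a small pure function. The names and return values below are illustrative assumptions, not the actual implementation: when the heartbeat is missing, the host is probed independently, and it is only treated as down if every probe fails.

```python
def decide_action(heartbeat_ok, liveness_probes):
    """Sketch of the safety-first logic described above.

    liveness_probes: zero-argument callables (e.g. a ping or an SSH
    check) that return True if the host appears to be alive.
    """
    if heartbeat_ok:
        return "healthy"            # nothing to do
    if any(probe() for probe in liveness_probes):
        # Host is alive but not checking in: don't touch the instances,
        # just stop the scheduler from placing new ones here.
        return "disable-scheduling"
    return "host-down"              # no sign of life from any probe
```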
2. A host is “sick”
A “sick” host can be caused by any number of things, e.g., a bad disk, a failed fan, an overheating CPU, etc.
Any of these can be defined by the person administering the OpenStack cluster by writing custom health-check scripts; i.e., the actions taken are tunable.
As you can see, the software uses a custom script to perform this check (the software runs all check scripts placed in the “health” directory).
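To make the “scripts in a health directory” idea concrete, here is a hypothetical custom check, with a threshold and names of my own invention rather than anything from the actual software. The only contract a check needs to honor is its exit code: zero means healthy, non-zero means the check failed.

```python
#!/usr/bin/env python3
"""Hypothetical custom health check: fail when the root filesystem
is more than 90% full. Exit code 0 = check passed, non-zero = failed."""
import shutil
import sys

def disk_ok(path="/", max_used_fraction=0.90):
    # shutil.disk_usage returns (total, used, free) in bytes.
    usage = shutil.disk_usage(path)
    return usage.used / usage.total <= max_used_fraction

if __name__ == "__main__":
    sys.exit(0 if disk_ok() else 1)
```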
Since this run failed a critical “health check” the host is considered to be in such a sorry state that all the instances must be migrated elsewhere.
There are a few additional built-in safety checks that the software runs, as above, to double-check the actual criticality of the problem before initiating the migration.
The software then identifies all instances on the host, initiates their migration to healthy nodes, monitors the moves, and reports the success or failure of each move once complete.
The time the software will wait before considering the migrations a failure is variable and is determined by the number of instances to be migrated.
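That variable timeout can be expressed as a simple linear function of the instance count. The constants below are illustrative assumptions, not the software’s actual values:

```python
def migration_timeout(num_instances, base_seconds=300,
                      per_instance_seconds=120):
    """Give the batch of migrations longer to finish when more
    instances are being moved, before declaring them failed."""
    return base_seconds + per_instance_seconds * num_instances
```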
3. Another possible course of action is an evacuation of a dead host.
I.e. automated checks identify that the host is gone and can’t be found.
E.g. Kernel panic, network cables disconnected, power failure, etc…
Subsequent confirmation tests to see if the host is alive at all have failed as well.
So, instead of attempting to migrate the instances from the dead host, the software will now “evacuate” all instances, restarting them on healthy nodes.
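Putting scenarios 2 and 3 together, the choice between migrating and evacuating comes down to whether the host is reachable. A sketch of that decision, with names that are my assumptions rather than the software’s:

```python
def recovery_action(host_reachable, health_checks_passed):
    """Sketch of choosing a recovery path for a compute node.

    - Host unreachable: instances can't be live-migrated off it,
      so evacuate them (restart them on healthy nodes).
    - Host reachable but failing a critical check: live-migrate
      the instances away, then disable the node.
    """
    if not host_reachable:
        return "evacuate"
    if not health_checks_passed:
        return "live-migrate"
    return "none"
```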
4. When writing health checks, you probably want to test them without doing dramatic things to your instances and hosts, right?
OK, run the software in DRY RUN mode to test and tweak your checks and their results.
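A dry-run mode like this is commonly implemented as a guard between the decision and the action; here is a minimal sketch under that assumption (the logger name and function are mine, not the software’s):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("instance-ha")

def perform(action, dry_run=False):
    """Execute a zero-argument callable, or in dry-run mode just
    log what would have been done without doing it."""
    if dry_run:
        log.info("DRY RUN: would execute %s", action.__name__)
        return None
    return action()
```

In this shape the checks still run and the decisions are still made in dry-run mode; only the final migrate/evacuate/disable actions are replaced with log messages.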
That’s cool! When can I get it?
There is no release date set at this time, but we will announce the release when the time draws near. We are going to do some further testing/tweaking, then testing with some beta volunteers first.
If you are interested in being a beta volunteer, please let us know. We are only accepting a few volunteers to start and have a couple thus far, so if you are interested, please reach out soon. We will, of course, expect beta testers to run tests in a timely manner and to provide detailed, valuable feedback.