Overlords, or Big Brother is watching you

An important component in an HA system is an overlord or Big Brother process (as in Orwell, not the TV show). This process is responsible for ensuring that all of the other processes in the system are running. When a process faults, we need to be able to restart it or make a standby process active.

That's the job of the overlord process. It monitors the processes for basic sanity (the definition of which is fairly broad — we'll come back to this), and performs an orderly shutdown, restart, fail-over, or whatever else is required for the failed (or failing) component.
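As a rough illustration of that job, here is a minimal sketch in Python. It assumes the simplest possible definition of sanity ("the process has not exited") and the simplest recovery action (restart it). The class and method names (`Overlord`, `watch`, `check_once`, `shutdown`) are invented for this example and do not come from any particular HA framework.

```python
import subprocess


class Overlord:
    """Hypothetical supervisor: starts child processes and restarts any that exit."""

    def __init__(self):
        self.commands = {}  # name -> argv used to (re)start the process
        self.children = {}  # name -> subprocess.Popen handle
        self.restarts = {}  # name -> number of restarts performed so far

    def watch(self, name, argv):
        """Start a process and place it under supervision."""
        self.commands[name] = argv
        self.restarts[name] = 0
        self.children[name] = subprocess.Popen(argv)

    def check_once(self):
        """Run one monitoring pass and return the names that were restarted."""
        restarted = []
        for name, child in list(self.children.items()):
            # poll() returns None while the process is still running.
            if child.poll() is not None:
                self.children[name] = subprocess.Popen(self.commands[name])
                self.restarts[name] += 1
                restarted.append(name)
        return restarted

    def shutdown(self):
        """Orderly shutdown: terminate whatever is still running, then reap it."""
        for child in self.children.values():
            if child.poll() is None:
                child.terminate()
            child.wait()
```

A real overlord would call something like `check_once()` periodically, or block waiting for a death notification from the operating system, and it would use a richer sanity check than "has it exited", since a hung process never exits on its own.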

One remaining question is “who watches the watcher?” What happens when the overlord process faults? How do we recover from that? There are a number of steps that you should take with the overlord process regardless of anything I'll tell you later on:

However, since the overlord is a piece of software that's more complex than “Hello, world” it will have bugs and it will fail.

Simply saying that we need an overlord to watch the overlord leads to infinite regress: the second overlord needs a watcher of its own, and we end up with a never-ending chain of overlords.

What we really need is a standby overlord that waits for the primary overlord to die or become unresponsive. When the primary fails, the standby takes over (possibly killing the faulty primary), becomes the primary, and starts up a new standby of its own. We'll discuss this mechanism next.
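The takeover sequence can be sketched as a small in-process simulation. Everything here is an assumption made for illustration: the heartbeat-with-timeout failure detector, the `RedundantOverlord` class, and its method names. A real system would carry heartbeats over some form of interprocess communication and would actually kill and spawn processes, where this sketch only flips flags.

```python
import itertools


class RedundantOverlord:
    """Hypothetical overlord that runs as either the primary or the standby."""

    _ids = itertools.count(1)

    def __init__(self, role, timeout=3.0):
        self.name = f"overlord-{next(RedundantOverlord._ids)}"
        self.role = role        # "primary" or "standby"
        self.timeout = timeout  # seconds of silence tolerated from the primary
        self.alive = True
        self.last_seen = 0.0    # when the standby last heard from the primary
        self.standby = None     # the standby this overlord started, if any

    def heartbeat_received(self, now):
        """Standby side: record that the primary checked in at time `now`."""
        self.last_seen = now

    def check_primary(self, primary, now):
        """Standby side: return whichever overlord is primary after this check."""
        if now - self.last_seen <= self.timeout:
            return primary                 # primary is still responsive
        primary.alive = False              # stands in for killing the faulty primary
        self.role = "primary"              # the standby takes over
        self.standby = RedundantOverlord("standby", self.timeout)
        self.standby.last_seen = now       # the new standby starts watching us
        return self
```

With a timeout of 3 seconds and a last heartbeat at time 10, a check at time 12 leaves the original primary in charge. A check at time 14 promotes the standby, which then creates its own standby, so the system is back to one primary and one standby.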