Graceful fail-over

To avoid a cascade failure, the clients of a process must be coded so they can tolerate a momentary outage of a lower-level process.

It would almost completely defeat the purpose of having hot standby processes if the processes that used their services couldn't gracefully handle the failure of a lower-level process. We discussed the impacts of cascade failures, but not their solution.

In general, the higher-level processes need to be aware that the lower-level process they rely on may fault. The higher-level processes need to maintain the state of their interactions with the lower-level process—they need to know what they were doing in order to be able to recover.

Let's look at a simple example first. Suppose that a process is using the serial port. When it starts up, it issues commands to configure the port; in this example, it sets 38400 baud, eight data bits, one stop bit, no parity, and raw mode.
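The startup configuration might look something like the following minimal sketch. It assumes the POSIX termios interface and the settings used in this example (38400 baud, eight data bits, one stop bit, no parity, raw mode). The function names apply_38400_8n1_raw() and configure_port() are hypothetical and are not taken from any particular driver or library.

```c
#include <termios.h>

/*
 * Hypothetical helper: edit *t in place so that it describes
 * 38400 baud, 8 data bits, 1 stop bit, no parity, raw mode.
 * Returns 0 on success, -1 on failure.
 */
int apply_38400_8n1_raw(struct termios *t)
{
    /* Raw input: no break handling, no CR/NL translation, no XON/XOFF. */
    t->c_iflag &= ~(IGNBRK | BRKINT | PARMRK | INPCK | ISTRIP |
                    INLCR | IGNCR | ICRNL | IXON);
    /* Raw output: no post-processing. */
    t->c_oflag &= ~OPOST;
    /* No echo, no line editing, no signal characters. */
    t->c_lflag &= ~(ECHO | ECHONL | ICANON | ISIG | IEXTEN);
    /* Clear character size, parity, and two-stop-bit settings... */
    t->c_cflag &= ~(CSIZE | PARENB | CSTOPB);
    /* ...then select 8 data bits, enable the receiver, ignore modem lines. */
    t->c_cflag |= CS8 | CREAD | CLOCAL;
    /* read() returns as soon as at least one byte is available. */
    t->c_cc[VMIN] = 1;
    t->c_cc[VTIME] = 0;

    if (cfsetispeed(t, B38400) == -1 || cfsetospeed(t, B38400) == -1)
        return -1;
    return 0;
}

/*
 * Hypothetical helper: apply the settings above to an open port.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int configure_port(int fd)
{
    struct termios t;

    if (tcgetattr(fd, &t) == -1)
        return -1;
    if (apply_38400_8n1_raw(&t) == -1)
        return -1;
    return tcsetattr(fd, TCSANOW, &t);
}
```

Keeping the settings in one function means the same call can be reused whenever the port has to be configured again, not just at startup.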

Suppose that the serial port is supervised by the overlord process, and that it follows the cold standby model.

When the serial port driver fails, the overlord restarts it. Unfortunately, the overlord has no idea of what settings the individual ports should have; the serial port driver will set them to whatever defaults it has, which may not match what the higher-level process expects.

The higher-level process may notice that the serial port has disappeared when, for example, it gets an error from a write(). When that happens, the higher-level process needs to determine what happened and recover. In non-HA software, this would cause a cascade failure: the higher-level process would get the error from the write() and call exit() because it didn't handle the error in an HA-compatible manner.

Let's assume that our higher-level process is smarter than that. It notices the error, and because this is an HA system, assumes that someone else (the overlord) will notice the error as well and restart the serial port driver. The main trick is that the higher-level process needs to restore its operating context—in our example, it needs to reset the serial port to 38400 baud, eight data bits, one stop bit, and no parity, and it needs to reset the port to operate in raw mode.

Only after it has completed those tasks can the higher-level process continue where it left off in its operation. Even then, it may need to perform some higher-level reinitialization—not only does the serial port need to be set for a certain speed, but the peripheral that the high-level process was communicating with may need to be reset as well (for example, a modem may need to be hung up and the phone number redialed).
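The recovery sequence described above (notice the error, wait for the driver to be restarted, restore the operating context, then continue) can be sketched roughly as follows. This is only an illustration under stated assumptions: struct ha_port, ha_port_recover(), ha_port_write(), the restore hook, and the retry limits are all hypothetical names and values. The sketch also assumes that any write() error other than EINTR may mean the driver has gone away, and that the restarted driver reappears under the same pathname.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define REOPEN_TRIES   10   /* assumed limit on reopen attempts */
#define REOPEN_DELAY_S 1    /* assumed pause between attempts, in seconds */

/* Hypothetical hook that restores the client's operating context
 * (port settings, peripheral reset, and so on). Returns 0 on success. */
typedef int (*restore_fn)(int fd);

/* The state the client keeps so that it knows how to recover. */
struct ha_port {
    const char *path;     /* pathname of the serial device */
    int         fd;       /* current descriptor, or -1 if none */
    restore_fn  restore;  /* context-restoring hook, may be NULL */
    int         recoveries; /* number of successful recoveries so far */
};

/* write() that simply retries when interrupted by a signal. */
static ssize_t write_retry_eintr(int fd, const void *buf, size_t len)
{
    ssize_t n;

    do {
        n = write(fd, buf, len);
    } while (n == -1 && errno == EINTR);
    return n;
}

/* Drop the stale descriptor, wait for the restarted driver to
 * reappear, then restore the context. Returns 0 on success, -1 if
 * the port did not come back within REOPEN_TRIES attempts. */
int ha_port_recover(struct ha_port *p)
{
    if (p->fd >= 0) {
        close(p->fd);
        p->fd = -1;
    }
    for (int i = 0; i < REOPEN_TRIES; i++) {
        int fd = open(p->path, O_RDWR | O_NOCTTY);

        if (fd >= 0) {
            if (p->restore == NULL || p->restore(fd) == 0) {
                p->fd = fd;
                p->recoveries++;
                return 0;
            }
            close(fd);  /* context not restored; try again */
        }
        sleep(REOPEN_DELAY_S);
    }
    return -1;
}

/* Write to the port; on failure, recover once and retry the write
 * instead of giving up and calling exit(). */
ssize_t ha_port_write(struct ha_port *p, const void *buf, size_t len)
{
    ssize_t n = write_retry_eintr(p->fd, buf, len);

    if (n >= 0)
        return n;
    if (ha_port_recover(p) == -1)
        return -1;
    return write_retry_eintr(p->fd, buf, len);
}
```

A client would fill in a struct ha_port once at startup and route its writes through ha_port_write(). Any higher-level reinitialization, such as hanging up a modem and redialing, would belong in the restore hook, since only the client knows what that context is.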

This is the concept of fault tolerance: being able to handle a fault and to recover gracefully.

If the serial port were implemented using the hot standby model, some of this reinitialization work might not be required. However, since the state carried by the serial port is minimal (i.e., the only state that's generally important is the baud rate and configuration), and the serial port driver is generally very small, a cold standby solution may be sufficient for most applications.