In-service upgrades

The most interesting thing that happens when you combine the concepts of fault tolerance and the various standby models is that you get in-service upgrades almost for free.

An in-service upgrade means changing the version of software running in your system without affecting the system's ability to do whatever it's doing.

As an interesting implementation, and for a bit of contrast, some six nines systems, like central office telephone switches, accomplish this in a unique manner. In the switch, there are two processors running the main software. This is for reliability—the two processors are operating in lock-step synchronization, meaning that they execute the exact same instruction, from the exact same address, at the exact same time. If there is ever any discrepancy between the two CPUs, service is briefly interrupted as both CPUs go into an independent diagnostic mode, and the failing CPU is taken offline (alarm bells ring, logs are generated, the building is evacuated, etc.).

This dual-CPU mechanism is also used for upgrading the software. One CPU is manually placed offline, and the switch runs with only the other CPU (granted, this is a small “asking for trouble” kind of window, but these things are generally done at 3:00 AM on a Sunday). The offline CPU is given a new software load, and then the two CPUs switch roles: the currently running CPU goes offline, and the CPU with the new software becomes the controlling CPU. If the upgrade passes sanity testing, the CPU that is now offline (which has to end up running the same load before it can execute in lock-step again) is placed online, and full dual-redundant mode is reestablished. Even scarier things can happen, such as live software patches!
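To make the sequence of steps concrete, here is a minimal sketch of the role swap as a tiny simulation. Everything in it is hypothetical: the Cpu class, the upgrade_switch function and the sanity_ok callback are names invented for illustration, not anything a real switch exposes. Two steps are assumptions on my part rather than something described above: the old CPU is given the new load before rejoining, and a failed sanity test falls back to the untouched CPU.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Cpu:
    name: str
    load: str            # software load currently installed
    online: bool = True


def upgrade_switch(active: Cpu, standby: Cpu, new_load: str,
                   sanity_ok: Callable[[Cpu], bool]) -> bool:
    """Walk two CPUs through the upgrade; return True if it was committed."""
    # 1. Take one CPU offline; the switch now runs on a single CPU.
    standby.online = False

    # 2. Give the offline CPU the new software load.
    standby.load = new_load

    # 3. Switch roles: the upgraded CPU takes control, the old one goes offline.
    standby.online, active.online = True, False

    # 4. Sanity-test the new load while the old CPU is still untouched.
    if sanity_ok(standby):
        # Assumption: both CPUs need the same load to run in lock-step again.
        active.load = new_load
        active.online = True
        return True

    # Assumption: on failure, fall back to the CPU that still has the old load.
    active.online, standby.online = True, False
    return False


a, b = Cpu("A", "v1"), Cpu("B", "v1")
committed = upgrade_switch(a, b, "v2", lambda cpu: cpu.load == "v2")
print(committed, a, b)
```

Note that at every step at least one CPU is online, which is the whole point: the switch keeps running, just without redundancy for a while.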

We can do something very similar with software, using the HA concepts that we've discussed so far (and good design—see the Design Philosophy chapter).

What's the real difference between killing a driver and restarting it with the same version, versus killing a driver and restarting it with a newer version (or, in certain special cases, an older version)? If you've made the versions of the driver compatible, there is no difference. That's what I meant when I said that you get in-service upgrades almost for free!

To upgrade a driver, you kill the current version. The higher-level software notices the outage, and expects something like the overlord process to come in and fix things. However, the overlord process not only fixes things, but also upgrades the version that it loads. The higher-level software doesn't really notice; it retries its connection to the driver, eventually discovers that a driver exists, and continues running.
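The kill-and-restart idea can be sketched in a few lines. This is a single-threaded simulation under stated assumptions, not a real process manager: Driver, Overlord and send_with_retry are hypothetical names, and the "outage" is modeled as the driver simply being absent until the overlord gets a turn to run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Driver:
    """Hypothetical driver; only its version number changes across an upgrade."""
    version: int

    def handle(self, request: str) -> str:
        # Compatible versions accept the same requests and reply the same way.
        return f"ok:{request}"


@dataclass
class Overlord:
    """Hypothetical supervisor that restarts the driver whenever it is missing."""
    target_version: int
    driver: Optional[Driver] = None

    def recover(self) -> None:
        # One code path serves both fault recovery and upgrades:
        # start whatever version is currently configured.
        if self.driver is None:
            self.driver = Driver(self.target_version)

    def upgrade(self, new_version: int) -> None:
        # An upgrade is just "pick a new version, then kill the old driver".
        self.target_version = new_version
        self.driver = None


def send_with_retry(overlord: Overlord, request: str, max_retries: int = 5) -> str:
    """Higher-level software: keep retrying until some driver answers."""
    for _ in range(max_retries):
        if overlord.driver is not None:
            return overlord.driver.handle(request)
        # Outage. A real client would sleep and reconnect; this single-threaded
        # sketch hands the overlord its turn to run instead.
        overlord.recover()
    raise ConnectionError("driver did not come back")


overlord = Overlord(target_version=1)
overlord.recover()
before = send_with_retry(overlord, "read")

overlord.upgrade(2)                      # kill version 1, configure version 2
after = send_with_retry(overlord, "read")
print(before, after, overlord.driver.version)
```

The client code is identical before and after the upgrade; it sees one failed attempt, retries, and gets the same reply from a driver whose version has moved from 1 to 2. That only holds because handle() behaves the same way in both versions, which is exactly the compatibility requirement.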

Figure 1. Preparation for in-service upgrade; the secondary server has a higher version number.

Of course, I've deliberately oversimplified things to give you the big picture. A botched in-service upgrade is an excellent way to get yourself fired. Here are just some of the kinds of things that can go wrong:

These are things that require testing, testing, and more testing.