Cascade failures

In a typical system, the software fits into several natural layers. The GUI is at the topmost level in the hierarchy, and might interact with a database or a control program layer. These layers then interact with other layers, until finally, the lowest layer controls the hardware.

What happens when a process in the lowest layer fails? The next layer often fails as well: it sees that its driver is no longer available and faults. The layer above that notices a similar condition (the resource that it depends on has gone away), so it faults too. This can propagate right up to the highest layer, which may report some kind of diagnostic, such as "database not present." The trouble is that this diagnostic masks the true cause: the problem wasn't with the database at all, but with the lowest-level driver.

We call this a cascade failure—lower levels causing higher levels to fail, with the failure propagating higher and higher until the highest level fails.
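
As a rough illustration, the following Python sketch models three layers and shows how the top-level diagnostic can hide the real cause. The class names (Driver, Database, Gui, LayerFault) and the `alive` flag are hypothetical, chosen only for this example; a real system would have separate processes rather than objects in one program.

```python
class LayerFault(Exception):
    """Hypothetical error raised when a layer cannot do its work."""


class Driver:
    """Lowest layer: stands in for the process that controls the hardware."""

    def __init__(self):
        self.alive = True

    def read(self):
        if not self.alive:
            raise LayerFault("driver not running")
        return 42


class Database:
    """Middle layer: depends on the driver."""

    def __init__(self, driver):
        self.driver = driver

    def query(self):
        try:
            return self.driver.read()
        except LayerFault:
            # This layer simply faults and reports its own failure,
            # discarding the information about what really went wrong.
            raise LayerFault("database not present") from None


class Gui:
    """Topmost layer: depends on the database."""

    def __init__(self, database):
        self.database = database

    def refresh(self):
        try:
            return f"value: {self.database.query()}"
        except LayerFault as fault:
            return f"error: {fault}"


driver = Driver()
gui = Gui(Database(driver))
healthy_report = gui.refresh()

driver.alive = False           # the lowest layer fails...
failed_report = gui.refresh()  # ...and the failure surfaces at the top
```

After the driver fails, the GUI reports "error: database not present", even though the database code itself has no defect. Nothing in that message points back to the driver.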

In this case, maximizing the MTBF means not only making the lower-level drivers more stable, but also preventing the cascade failure in the first place. Preventing the cascade also decreases the MTTR, because there are fewer things to repair. When we talk about in-service upgrades, below, we'll see that preventing cascade failures also has some unexpected benefits.

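To prevent a cascade failure, you can:

- prepare each higher-level layer to handle a temporary outage of the layer below it gracefully, rather than faulting
- make each lower-level layer recover quickly, for example by having a standby take over
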
What might not be immediately obvious is that these two points are interrelated. It does little good to have a higher-level layer prepared to deal with an outage of a lower-level layer, if the lower-level layer takes a long time to recover. It also doesn't help much if the low-level driver fails and its standby takes over, but the higher-level layer isn't prepared to gracefully handle that momentary outage.
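
The interplay between the two points can be sketched in a few lines of Python. This is a simplified simulation under stated assumptions, not a real recovery mechanism: the names Driver, DriverDown, tolerant_read, recovery_ticks, and max_retries are all hypothetical, and a call to tick() stands in for waiting one retry interval while a standby takes over.

```python
class DriverDown(Exception):
    """Hypothetical error raised while the driver is unavailable."""


class Driver:
    """Low-level layer whose standby takes over after `recovery_ticks`."""

    def __init__(self, recovery_ticks):
        self.recovery_ticks = recovery_ticks
        self.down_for = None  # None means the driver is healthy

    def crash(self):
        self.down_for = 0

    def tick(self):
        """Advance simulated time; recover once enough ticks have passed."""
        if self.down_for is not None:
            self.down_for += 1
            if self.down_for >= self.recovery_ticks:
                self.down_for = None

    def read(self):
        if self.down_for is not None:
            raise DriverDown("driver not available")
        return "sample"


def tolerant_read(driver, max_retries):
    """Higher layer: ride out a short outage instead of faulting at once."""
    for attempt in range(max_retries + 1):
        try:
            return driver.read()
        except DriverDown:
            if attempt == max_retries:
                raise  # the outage outlasted our patience; the fault propagates
            driver.tick()  # stands in for waiting one retry interval


def survives(recovery_ticks, max_retries):
    """Return True if the higher layer gets its data despite a driver crash."""
    driver = Driver(recovery_ticks)
    driver.crash()
    try:
        return tolerant_read(driver, max_retries) == "sample"
    except DriverDown:
        return False
```

With these assumptions, survives(recovery_ticks=2, max_retries=3) is True: the driver recovers quickly and the higher layer is willing to wait. Both survives(recovery_ticks=10, max_retries=3) (slow recovery) and survives(recovery_ticks=2, max_retries=0) (an intolerant higher layer) are False, so either weakness alone is enough to let the failure cascade.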