Parallel calculations

What if your systems are in parallel? How does that look?

In a parallel system, the picture is as follows:

Figure 1. Aggregate module formed by two modules in parallel.

If module A has an availability of Xa, and module B has an availability of Xb, the combined availability of a subsystem constructed of modules A and B connected in parallel is:

availability = 1 - (1 - Xa) × (1 - Xb)

Practically speaking, if each module has five-nines availability (0.99999), the availability of the system built by connecting the two modules in parallel is:

availability = 1 - (1 - 0.99999) × (1 - 0.99999)
             = 1 - 0.00001 × 0.00001
             = 1 - 0.0000000001
             = 0.9999999999

That number is ten nines!
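
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name parallel_availability is made up for illustration, and it assumes, as the formula does, that the two modules fail independently of each other.

```python
def parallel_availability(xa: float, xb: float) -> float:
    """Availability of two independent modules connected in parallel.

    The pair is unavailable only when both modules are down at once,
    which (assuming independence) happens with probability
    (1 - xa) * (1 - xb).
    """
    return 1.0 - (1.0 - xa) * (1.0 - xb)


# Two five-nines modules, as in the worked example above.
combined = parallel_availability(0.99999, 0.99999)

# Floating-point results carry tiny rounding errors, so format the
# value to ten decimal places rather than printing it raw.
print(f"{combined:.10f}")  # prints 0.9999999999
```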

The thing to remember here is that you're not heavily penalized for serial dependencies, but the rewards for parallel dependencies are very worthwhile! Therefore, you'll want to construct your systems with as much parallel flow, and as little serial flow, as possible.

In terms of software, just what is a parallel flow? A parallel flow is one in which either module A or module B (in our example) can handle the work. This is accomplished by having a redundant server and the ability to seamlessly use either server, whichever one happens to be available. A parallel flow is more reliable because a single fault is far more likely to occur than a double fault, and a single fault still leaves one module able to do the work.

A double fault isn't impossible, just much less likely. Since the two (or more) modules operate in parallel, meaning that they fail independently of each other and either one can satisfy the request, it would take a double fault to take out both modules.

A hardware example of this is powering your machine from two independent power grids. The chance that both power grids will fail simultaneously is far less than the chance of either power grid failing. Since we're assuming that the hardware can take power from either grid, and that the power grids are truly independent of each other, you can plug the availability numbers of both power grids into the formula above to calculate the availability of power to your system; subtracting that result from 1 gives the likelihood that your system will be without power. (And then there was the North American blackout of August 14, 2003 to reassure everyone of the power grid's stability! :-))
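
To make the power-grid case concrete, here is a minimal sketch. The two grid availability figures are invented purely for illustration (the text gives none), the name chance_without_power is hypothetical, and the calculation holds only if the grids really do fail independently.

```python
def chance_without_power(grid_a: float, grid_b: float) -> float:
    """Probability that both independent grids are down at the same time.

    This is 1 minus the parallel availability, which simplifies to the
    product of the two individual unavailabilities.
    """
    return (1.0 - grid_a) * (1.0 - grid_b)


# Hypothetical availability figures, chosen only for illustration.
GRID_A = 0.999
GRID_B = 0.9995

p_dark = chance_without_power(GRID_A, GRID_B)
hours_per_year = 365 * 24

print(f"chance of being without power: {p_dark:.2e}")
print(f"expected hours dark per year:  {p_dark * hours_per_year:.4f}")
```

With these made-up inputs the chance of losing both feeds works out to about 5 in 10 million. A shared upstream failure breaks the independence assumption, and then the real figure can be much worse than the formula suggests.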

For another example, take a cluster of web servers that handle requests in parallel and are all connected to the same filesystem (running on a RAID box). If one of the servers fails, the users will still be able to access the data, even if performance suffers a little. They might not even notice a performance hit.

You can, of course, extend the formulas to subsystems that have more than two components in series or in parallel. This is left as an exercise for the reader.
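
For readers who want to check their answer, one possible sketch of the general case follows. It assumes the usual rules: components in series multiply their availabilities, and components in parallel multiply their unavailabilities. The function names and all the availability figures are hypothetical, and every component is assumed to fail independently. It needs Python 3.8 or later for math.prod.

```python
import math
from typing import Iterable


def series_availability(parts: Iterable[float]) -> float:
    """All components must be up: multiply the availabilities."""
    return math.prod(parts)


def parallel_availability_n(parts: Iterable[float]) -> float:
    """Any one component is enough: multiply the unavailabilities."""
    return 1.0 - math.prod(1.0 - x for x in parts)


# Hypothetical numbers: three web servers at 0.99 each, in parallel,
# all depending on one shared RAID-backed filesystem at 0.9999.
web_tier = parallel_availability_n([0.99, 0.99, 0.99])
whole_site = series_availability([web_tier, 0.9999])

print(f"web tier alone:       {web_tier:.6f}")    # prints 0.999999
print(f"with shared storage:  {whole_site:.6f}")  # prints 0.999899
```

Under these assumed numbers the redundant web tier reaches six nines, but the site as a whole ends up just below the availability of the single filesystem it depends on, because that filesystem sits in series with the web tier.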