Where's the problem?

The ideal High Availability (HA) system is one that stays up and running, without interruption, indefinitely. In practical terms, HA systems strive for "five nines" availability: the system is up 99.999% of the time, which leaves room for only about five minutes of downtime per year.
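
The arithmetic behind the five-nines figure can be checked with a short sketch. It assumes a 365-day year, and the function name `downtime_minutes_per_year` is purely illustrative:

```python
# Illustrative sketch: convert an availability percentage into a yearly
# downtime budget. Assumes a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime allowed per year, in minutes."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for availability in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(availability)
        print(f"{availability}% -> {minutes:.2f} minutes/year")
```

Each additional nine cuts the downtime budget by a factor of ten: 99.9% allows roughly 526 minutes per year, 99.99% roughly 53 minutes, and 99.999% roughly 5.3 minutes.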

Obviously, systems fail. For one reason or another, systems aren't as available for use as their users and designers would like them to be. Of all the possible causes of system failure — power outages, component breakdowns, operator errors, software faults, etc. — the lion's share belongs to software faults.

Many HA systems try to address the problem of system failure by turning to hardware solutions such as redundant hardware and PCI-based HA products.

But if so many system crashes are caused by software faults, then throwing more hardware at the problem may not solve it at all. What if the system's memory state isn't properly restored after recovery? What if yours is an HA system (e.g., a consumer appliance) where redundant hardware simply isn't an option? Or what if your particular HA system is based on a custom chassis for which a PCI-based HA "solution" would be pointless?