Where's the problem?
Obviously, systems fail. For one reason or another, systems aren't as available for use as their users and designers would like them to be. Of all the possible causes of system failure — power outages, component breakdowns, operator errors, software faults, etc. — the lion's share belongs to software faults.
Many HA systems try to address the problem of system failure by turning to hardware solutions such as:
- rugged hardware
- redundant systems/components
- hot-swap CompactPCI components
- clustering
But if so many system crashes are caused by software faults, then throwing more hardware at the problem may not solve it at all. What if the system's memory state isn't properly restored after recovery? What if yours is an HA system (e.g., a consumer appliance) where redundant hardware simply isn't an option? Or what if your particular HA system is based on a custom chassis for which a PCI-based HA solution would be pointless?