The ideal High Availability (HA) system is one that remains up and running continuously, without interruption, for an indefinite period of time. In practical terms, HA systems strive for “five nines” availability, a metric referring to the percentage of uptime a system can sustain in a year: 99.999% uptime amounts to about five minutes of downtime per year.
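The “five minutes” figure follows directly from the percentage. The short sketch below works through the arithmetic for a few availability levels; the function name is illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given uptime percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # Each extra "nine" shrinks the downtime budget by a factor of ten.
    for percent in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(percent)
        print(f"{percent}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

At 99.999%, the budget works out to roughly 5.26 minutes per year, which is where the “about five minutes” figure comes from.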
Obviously, systems fail. For one reason or another, systems aren't as available for use as their users and designers would like them to be. Of all the possible causes of system failure — power outages, component breakdowns, operator errors, software faults, etc. — the lion's share belongs to software faults.
Many HA systems try to address the problem of system failure by turning to hardware solutions, such as redundant hardware or PCI-based components (e.g. CompactPCI).
But if so many system crashes are caused by software faults, then throwing more hardware at the problem may not solve it at all. What if the system's memory state isn't properly restored after recovery? What if yours is an HA system (e.g. a consumer appliance) where redundant hardware simply isn't an option? Or what if your particular HA system is based on a custom chassis for which a PCI-based HA “solution” would be pointless?
Most system designers wouldn't think of using a “standard” desktop PC as the foundation for an effective HA system. Apart from the reliability issues arising from the hardware itself, the underlying software isn't meant for continuous operation. When desktop operating systems and applications need to be patched or upgraded, most users expect to reboot their machines. Unfortunately, they might also have become accustomed to rebooting as part of their daily operations!
But in an HA system, software components may need to be upgraded while the system is live. Individual modules should be readily accessible for analysis and repair, without jeopardizing the availability of the system itself.
In our view, effective HA systems must address the main problem, software faults, through a modular approach to system design and implementation. Based on a microkernel architecture, the QNX Neutrino RTOS not only helps isolate problem areas throughout the system, but also keeps system components independent of one another. Each component enjoys full MMU-based memory protection. And system-level modules such as device drivers benefit from the same isolation and protection as any other process. You can start and stop a driver, networking protocol, filesystem, etc., without touching the kernel. A microkernel RTOS inherently keeps the number of single points of failure (SPOFs) as low as possible.
The QNX High Availability Framework provides a reliable software infrastructure on which to build highly effective HA systems. In addition to support for hardware-oriented HA solutions (e.g. CompactPCI as well as custom hardware), you also have the tools to isolate and even repair software faults before they spread throughout your entire system.
For example, suppose a device driver crashes because it tried to write to memory that was allocated to another process. The MMU will alert the QNX Neutrino microkernel, which in turn will alert the High Availability Manager (HAM). A HAM can then restart the driver. In addition, a dump file can be generated for postmortem analysis.
Viewing this dump file, you can immediately determine which line of code is the culprit and then prepare a fix that you can download to all other units in the field before they run into the same bug. With a conventional OS, a rogue driver may run for days before the system becomes corrupted enough to fail — and then it's too late to identify the problem, let alone dynamically install an upgraded driver!
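The crash, restart, and postmortem sequence described above can be illustrated with a small conceptual model. This is only a sketch of the general pattern, not the HAM API: `FlakyDriver` and `supervise` are hypothetical names, the "driver" is an ordinary Python object rather than a separate process, and a saved traceback stands in for a dump file.

```python
import traceback


class FlakyDriver:
    """Hypothetical component that faults on its first start and runs normally afterward."""

    def __init__(self):
        self.starts = 0

    def run(self):
        self.starts += 1
        if self.starts == 1:
            raise MemoryError("simulated illegal memory access")
        return "running"


def supervise(component, max_restarts=3):
    """Start the component; on a fault, keep a postmortem record and restart it.

    Returns (status, postmortems). status is None if every attempt faulted.
    """
    postmortems = []
    for _ in range(max_restarts + 1):  # one initial start plus up to max_restarts restarts
        try:
            return component.run(), postmortems
        except Exception:
            # Stand-in for a dump file: keep the traceback for later analysis.
            postmortems.append(traceback.format_exc())
    return None, postmortems


status, dumps = supervise(FlakyDriver())
print(status, len(dumps))
```

In this model the fault is recorded at the moment it happens, so the evidence needed to find the offending line is captured before the component is restarted.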
A HAM can perform a multistage recovery, executing several actions in a certain order. This technique is useful whenever strict dependencies exist between various actions in a sequence, so that the system can restore itself to the state it was in before a failure.
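As a rough illustration of ordered, dependent recovery actions, the sketch below runs a list of stages in sequence and stops at the first one that fails, since later stages would depend on it. The stage names and the `run_recovery` helper are hypothetical and are not part of the High Availability Framework API.

```python
def run_recovery(stages):
    """Execute recovery stages in order and stop at the first stage that fails.

    Each stage is a (name, action) pair, where action() returns True on success.
    Returns the names of the stages that completed.
    """
    completed = []
    for name, action in stages:
        if not action():
            break  # later stages depend on this one, so do not continue
        completed.append(name)
    return completed


def make_stage(log, label, succeeds=True):
    """Build a hypothetical action that records its label and reports success or failure."""
    def action():
        log.append(label)
        return succeeds
    return action


log = []
stages = [
    ("restart filesystem", make_stage(log, "fs")),
    ("remount volume", make_stage(log, "mount")),
    ("restart application", make_stage(log, "app")),
]
print(run_recovery(stages))
```

Keeping the stages in an explicit ordered list makes the dependencies visible: the application is restarted only after the filesystem it relies on is back.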
Equipped with the QNX Neutrino RTOS itself, as well as the special tools and API in the High Availability Framework, you should be able to anticipate the kinds of problems that are likely to happen, isolate them, and then plan accordingly. In other words, assuming that failure will occur, you can now design for it and build systems that can recover intelligently.