The QNX Approach to HA


The reset “solution”

Traditional approaches to dealing with software malfunctions have included such mechanisms as:

Hardware/software watchdog
This is a piece of hardware that's known to be fault-free. It periodically triggers code that checks the sanity of the system. This sanity check usually involves examining a set of registers that are continuously updated by properly functioning software components. If one of the components isn't working properly, its registers stop changing and the system is reset.
Manual operator intervention
Many systems aren't designed to include automatic fault detection, but rely instead on a manual approach: an operator monitors the health of the system. If the system state is deemed invalid, the operator takes the appropriate action, which usually includes a system reset.
Memory constraint faulting
Several operating systems (and hardware platforms) include features that generate a fault when a program accesses memory that it doesn't own. Once this occurs, the program becomes unreliable. With most realtime executives, the result is that the system must be reset in order to return to a sane operating state.

All of these approaches are relatively successful at detecting a software fault. But the net result of this detection, especially when faced with a multitude of faults in several potentially separate software components, is the rather drastic action of a system reset.
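The register-based sanity check described for the watchdog above can be sketched in C as follows. This is a minimal illustration rather than a real watchdog driver: the names `watchdog_t`, `component_kick()`, and `watchdog_check()`, as well as the component count, are hypothetical, and the "registers" are modeled as plain counters in memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_COMPONENTS 3  /* hypothetical number of monitored components */

/* Heartbeat "registers": each healthy component increments its own counter. */
typedef struct {
    uint32_t heartbeat[NUM_COMPONENTS];  /* updated by the components */
    uint32_t last_seen[NUM_COMPONENTS];  /* snapshot from the previous check */
} watchdog_t;

/* Called by component `id` on every pass through its main loop. */
void component_kick(watchdog_t *wd, size_t id)
{
    assert(wd != NULL && id < NUM_COMPONENTS);
    wd->heartbeat[id]++;
}

/*
 * Periodic sanity check, as triggered by the watchdog.
 * Returns true if every counter advanced since the previous check.
 * Returns false if any component has stalled; a traditional system
 * would respond to that by resetting the whole machine.
 */
bool watchdog_check(watchdog_t *wd)
{
    assert(wd != NULL);
    bool all_alive = true;
    for (size_t i = 0; i < NUM_COMPONENTS; i++) {
        if (wd->heartbeat[i] == wd->last_seen[i]) {
            all_alive = false;  /* no progress since the last check */
        }
        wd->last_seen[i] = wd->heartbeat[i];
    }
    return all_alive;
}
```

Note that the check can tell that some component has stalled, but the only response available in the designs described above is the same for every component: a full system reset.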

Traditional RTOS architecture

One of the principal reasons for this lack of graceful recovery is the monolithic architecture of a traditional realtime embedded system. At the heart of most of these systems lies a realtime executive — a single memory image consisting of the RTOS itself and often numerous tasks.

Since all tasks — including critical system-level services — share the very same address space, when the integrity of one task is called into question, the integrity of the entire system is at risk. If a single component such as a device driver fails, the RTOS itself could fail. In HA terms, each software component becomes a single point of failure (SPOF).

The only sure recovery mechanism in such an environment is to reset the system and start from scratch.

Such realtime systems offer only very coarse-grained fault recovery. This makes the HA procedure of planning for and dealing with failure seemingly straightforward (a system reset), yet often very costly (in terms of downtime, system restoration, etc.). For some embedded applications, a reset may involve a specialized, time-consuming procedure in order to restore the system to full operation in the field.

Modularity means granularity

What is really needed here is a more modular approach. System architects often de-couple and modularize their systems from a design/implementation point of view. Ideally, these modules would be the focus not only of the design, but also of the fault-recovery process, so that if one module malfunctions, then only that module would require a reset — the integrity of the rest of the system would remain intact. In other words, that particular module wouldn't be a SPOF.

This modular approach would also help us address the fact that the mean time to repair (MTTR) for a system reboot is an order of magnitude larger than the MTTR for replacing a single running task.
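To see why MTTR matters, recall the standard steady-state relation: availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures. The sketch below applies that relation in C. The function names are hypothetical, and any figures you feed it are illustrative assumptions rather than measurements of a real system.

```c
#include <assert.h>

/*
 * Steady-state availability: the fraction of time the system is up.
 * mtbf and mttr must be expressed in the same time unit.
 */
double availability(double mtbf, double mttr)
{
    assert(mtbf > 0.0 && mttr >= 0.0);
    return mtbf / (mtbf + mttr);
}

/* Expected downtime per year, in seconds, for the same inputs. */
double downtime_per_year(double mtbf, double mttr)
{
    const double seconds_per_year = 365.0 * 24.0 * 3600.0;
    return (1.0 - availability(mtbf, mttr)) * seconds_per_year;
}
```

For example, with an assumed MTBF of 360000 seconds, cutting MTTR from 60 seconds (a hypothetical reboot) to 6 seconds (a hypothetical task restart) reduces the expected yearly downtime to roughly a tenth of its former value.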

This finer-grained recovery of individual tasks is precisely what the QNX Neutrino microkernel offers. The architecture of the QNX Neutrino realtime operating system itself provides so many intrinsic HA features that many QNX users take them for granted and often design recoverability into their systems without giving it a second thought.

Let's look briefly at the key features of the QNX Neutrino RTOS and see how system designers can easily make use of these built-in HA-ready features to build effective HA systems.

Intrinsic HA

Three key factors of the QNX Neutrino architecture contribute directly to intrinsic HA:

QNX Neutrino microkernel
Only a few essential services are provided (e.g. message passing and realtime scheduling). The result is a robust, dependable system — fewer lines of code in the kernel reduce the probability of OS errors.

Also, the kernel's fixed-priority preemptive scheduler ensures a predictable system — there are fewer HA software paths to analyze and deal with separately.

POSIX process model
This means full MMU-supported memory protection between system processes, making it easy to isolate and protect individual tasks.

The process model also offers dynamic process creation and destruction, which is especially important for HA systems, because you can more readily perform fault detection, recovery, and live upgrades in the field.

The POSIX API provides a standard programming environment and can help achieve system simplification, validation, and verification.

In addition, the process model lets you easily monitor external tasks, which aids not only in fault detection and diagnosis, but also in service distribution.

Message passing
In the QNX Neutrino realtime operating system, all interprocess communication happens through standard message passing. For HA systems, this facilitates task decoupling, task simplification, and service distribution.

Local and network-remote messaging is identical and practically transparent for the application. In a network-distributed HA system, the QNX message-based approach fosters replication, redundancy, and system simplification.

These represent some of the more prominent HA-oriented features that become readily apparent when the QNX Neutrino RTOS forms the basis of an HA design.
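As an illustration of how the process model supports fault detection and recovery at the level of a single task, here is a minimal supervisor sketch. It uses only the portable POSIX calls `fork()` and `waitpid()`, not any QNX-specific HA facility. The names `task_fn`, `flaky_task()`, and `supervise()` are hypothetical, and the deliberate crash on the first attempt is an assumption made purely for demonstration.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef int (*task_fn)(int attempt);

/* Hypothetical task: crashes on its first run, succeeds once restarted. */
int flaky_task(int attempt)
{
    if (attempt == 0) {
        abort();  /* simulate a fault; only this process dies */
    }
    return 0;
}

/*
 * Run `task` in a child process and restart it if it terminates abnormally.
 * Returns the number of restarts that were needed, or -1 if the task still
 * failed after `max_restarts` restarts or if fork()/waitpid() failed.
 */
int supervise(task_fn task, int max_restarts)
{
    assert(task != NULL && max_restarts >= 0);
    for (int attempt = 0; attempt <= max_restarts; attempt++) {
        pid_t pid = fork();
        if (pid < 0) {
            return -1;             /* could not create the process */
        }
        if (pid == 0) {
            _exit(task(attempt));  /* child: run the task in isolation */
        }
        int status;
        if (waitpid(pid, &status, 0) < 0) {
            return -1;
        }
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            return attempt;        /* task completed normally */
        }
        /* Killed by a signal or exited non-zero: loop and restart it. */
    }
    return -1;
}
```

Because the faulty task runs in its own protected address space, its crash is reported to the supervisor through the exit status and the supervisor itself keeps running. In this sketch, calling `supervise(flaky_task, 3)` restarts only the failed task instead of resetting the whole system.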