Glossary

5 nines (high availability): A system that's characterized as having a “5 nines” availability rating (99.999%). This means that the system has a downtime of just over 5 minutes per year. A “6 nines” system will have a downtime of about 20 minutes every forty years (roughly 30 seconds per year). This is also known in the industry as “carrier-class” or “telco-class” availability.
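The downtime figures quoted for a given number of nines follow from simple arithmetic. The sketch below is only an illustration: the function name downtime_minutes_per_year() is hypothetical, and it assumes a 365-day year.

```c
#include <assert.h>

/* Minutes in a non-leap year: 365 days * 24 hours * 60 minutes. */
#define MINUTES_PER_YEAR (365.0 * 24.0 * 60.0)

/*
 * Hypothetical helper: expected downtime, in minutes per year, for a
 * system whose availability has the given number of leading nines
 * (5 means 99.999%).
 */
double downtime_minutes_per_year(int nines)
{
    double unavailability = 1.0;

    assert(nines >= 0);
    for (int i = 0; i < nines; i++) {
        unavailability /= 10.0;   /* each extra nine cuts downtime tenfold */
    }
    return unavailability * MINUTES_PER_YEAR;
}
```

Under these assumptions, downtime_minutes_per_year(5) comes to a little over 5 minutes, and downtime_minutes_per_year(6) multiplied by 40 comes to about 21 minutes, in line with the figures in the entry.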
analog (data acquisition): Indicates an input or output signal that corresponds to a range of voltages. In the real world, an analog signal is continuously variable, meaning that it can take on any value within the range. When the analog value is used with a data acquisition card, it will be digitized, meaning that only a finite number of discrete values are represented. Analog inputs are digitized by an “analog to digital” (A/D) converter, and analog outputs are synthesized by a “digital to analog” (D/A) converter. Generally, most converters have a resolution of 8, 12, 16, or more bits. Compare digital.
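To make the idea of digitizing concrete, here is a minimal sketch of an idealized converter. Everything in it is an assumption made for illustration: the names adc_code() and dac_volts(), the unipolar 0 to v_ref input range, and the round-to-nearest scaling. Real data acquisition cards have their own ranges, coding schemes, and error sources.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of an ideal unipolar A/D converter: map a voltage
 * in [0, v_ref] onto the codes 0 .. 2^bits - 1. Offset, gain, and noise
 * errors of real hardware are ignored.
 */
uint32_t adc_code(double volts, double v_ref, unsigned bits)
{
    assert(bits >= 1 && bits <= 16);
    assert(v_ref > 0.0);

    uint32_t max_code = (1u << bits) - 1u;   /* e.g. 4095 for 12 bits */

    if (volts <= 0.0) {
        return 0;                            /* clamp below the range */
    }
    if (volts >= v_ref) {
        return max_code;                     /* clamp above the range */
    }
    /* Scale into the code range and round to the nearest code. */
    return (uint32_t)(volts / v_ref * max_code + 0.5);
}

/* Inverse mapping, as an ideal D/A converter would synthesize it. */
double dac_volts(uint32_t code, double v_ref, unsigned bits)
{
    assert(bits >= 1 && bits <= 16);
    assert(v_ref > 0.0);

    uint32_t max_code = (1u << bits) - 1u;

    if (code > max_code) {
        code = max_code;                     /* ignore out-of-range codes */
    }
    return (double)code / max_code * v_ref;
}
```

With an 8-bit converter and a 5 V reference, for example, only 256 distinct values can be represented, so any two inputs closer together than about 20 mV may map to the same code.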
asynchronous: Used to indicate that a given operation is not synchronized to another operation. For example, a pulse is a form of asynchronous message-passing in that the sender is not blocked waiting for the message to be received. Contrast with synchronous. See also blocking and receive a message.
availability (high availability): Availability is a ratio that expresses the amount of time that a system is “up” and available for use. It is calculated by taking the MTBF and dividing it by the sum of the MTBF plus the MTTR, and is usually expressed as a percentage. To increase a system's availability, you need to raise the MTBF and/or lower the MTTR. The availability is usually stated as the number of leading 9s in the ratio (see 5 nines). An availability of 100% (also known as continuous availability) is extremely difficult, if not impossible, to attain because that would imply that the value of MTTR was zero (and the availability was just MTBF divided by MTBF, or 1) or that the MTBF was infinity.
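The ratio described above is easy to express in code. This is a minimal sketch; the function name availability() is hypothetical, and both arguments are assumed to be in the same unit (hours here).

```c
#include <assert.h>

/*
 * Availability as the ratio MTBF / (MTBF + MTTR). The result is a
 * fraction between 0 and 1; multiply by 100 for a percentage.
 */
double availability(double mtbf_hours, double mttr_hours)
{
    assert(mtbf_hours > 0.0);
    assert(mttr_hours >= 0.0);
    return mtbf_hours / (mtbf_hours + mttr_hours);
}
```

For an MTBF of 10,000 hours, an MTTR of 1 hour gives roughly 99.99% (four nines), while cutting the MTTR to 0.1 hours (6 minutes) gives roughly 99.999% (five nines). Only an MTTR of zero yields exactly 1.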
blocking: A means for a thread to synchronize with other threads or events. In the blocking state (of which there are about a dozen), a thread doesn't consume any CPU — it's waiting on a list maintained within the kernel. When the event that the thread was waiting for occurs (or a timeout occurs), the thread is unblocked and is able to consume CPU again. See also unblock.
cascade failure (high availability): A cascade failure is one in which several modules fail as the result of a single failure. For example, suppose a process is using a driver, and the driver fails. If the process isn't fault tolerant, then it too may fail, and any other processes that depend on the driver and aren't fault tolerant may fail as well. This chain of failures is called a “cascade failure.” The North American power outage of August 14, 2003 is a good example. See also fault tolerance.
client (message-passing): QNX Neutrino's message-passing architecture is structured around a client/server relationship. The client is the party that requests services from a server. It generally accesses these services using standard file-descriptor-based function calls (e.g., lseek()), which are synchronous, in that the client's call doesn't return until the request is completed by the server. A thread can be both a client and a server at the same time.
code (memory): A code segment is one that is executable. This means that instructions can be executed from it. Also known as a “text” segment. Contrast with data or stack segments.
cold standby (high availability): Cold standby mode refers to the way a failed software component is restarted. In cold-standby mode, the software component is generally restarted by loading it from media (disk, network), having the component go through initializations, and then having the component advertise itself as ready. Cold standby is the slowest of the three modes (cold, warm, and hot), and, while its timing is system specific, it usually takes on the order of tens of milliseconds to seconds. Cold standby is the simplest standby model to implement, but also the one that impacts MTTR the most negatively. See also hot standby, restartability, and warm standby.
continuous availability (high availability): A system with an availability of 100%. Such a system has no downtime, which is difficult, if not impossible, to attain with moderately complex systems, because every piece of software, hardware, and infrastructure has some kind of failure rate. There is always some non-zero probability of a catastrophic failure for the system as a whole.
data (memory): A data segment is one that is not executable. It's typically used for storing data, and as such, can be marked read-only, write-only, read/write, or no access. Contrast with code or stack segments.
deadlock: A failure condition reached when two threads are mutually blocked on each other, with each thread waiting for the other to respond. This condition can be generated quite easily; simply have two threads send each other a message — at this point, both threads are waiting for the other thread to reply to the request. Since each thread is blocked, it will not have a chance to reply, hence deadlock. To avoid deadlock, clients and servers should be structured around a send hierarchy. (Of course, deadlock can occur with more than two threads; A sends to B, B sends to C, and C sends back to A, for example.) See also blocking, client, reply to a message, send a message, server, and thread.
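The circular wait described above can be modeled as a chain of "who is blocked on whom." The following sketch is an illustration only, not something the operating system provides: the array waits_on[], the NO_SEND marker, and the function in_send_cycle() are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_SEND (-1)

/*
 * Hypothetical model: waits_on[i] is the index of the thread that
 * thread i is send-blocked on, or NO_SEND if thread i isn't sending.
 * A blocked sender has exactly one outstanding send, so each thread
 * has at most one outgoing edge.
 *
 * Returns true only if following the chain from `start` leads back to
 * `start`, that is, if `start` is itself part of a circular wait.
 */
bool in_send_cycle(const int *waits_on, int n, int start)
{
    assert(n > 0 && start >= 0 && start < n);

    int current = waits_on[start];
    for (int steps = 0; steps < n && current != NO_SEND; steps++) {
        assert(current >= 0 && current < n);
        if (current == start) {
            return true;     /* the chain closed on itself: deadlock */
        }
        current = waits_on[current];
    }
    return false;            /* chain ended, or the cycle excludes start */
}
```

With threads A, B, and C numbered 0, 1, and 2, the array {1, 2, 0} models the three-thread example from the entry and reports a cycle; the array {1, 2, NO_SEND} models a chain that ends at a thread free to reply, and does not.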
digital (data acquisition): Indicates an input or output signal that has two states only, usually identified as on or off (other names are commonly used as well, “energized” and “de-energized” for example). Compare analog.
exponential backoff (high availability): A policy that determines the intervals at which a failed process is restarted: the delay between restart attempts grows (typically doubling) after each consecutive failure. It's used to keep a component that keeps failing from overburdening the system. See also restartability.
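One common way to compute such restart intervals is to double the delay after each consecutive failure, up to a ceiling. The sketch below assumes that scheme; the function name restart_delay_ms() and its parameters are hypothetical, and a real policy might use a different growth factor or add random jitter.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical restart-delay policy: the delay starts at base_ms,
 * doubles after every consecutive failure, and never exceeds max_ms.
 */
uint32_t restart_delay_ms(unsigned failures, uint32_t base_ms, uint32_t max_ms)
{
    assert(base_ms > 0 && base_ms <= max_ms);

    uint32_t delay = base_ms;
    for (unsigned i = 0; i < failures; i++) {
        /* Doubling would pass the cap; this test also avoids overflow. */
        if (delay > max_ms / 2) {
            return max_ms;
        }
        delay *= 2;
    }
    return delay;
}
```

With a base of 100 ms and a cap of 5000 ms, successive failures yield delays of 100, 200, 400, 800, 1600, and 3200 ms, after which every further restart waits the full 5000 ms.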
fault tolerance (high availability): A term used in conjunction with high availability that refers to a system's ability to handle a fault. When a fault occurs in a fault-tolerant system, the software is able to work around the fault, for example, by retrying the operation or switching to an alternate server. Generally, fault tolerance is incorporated into a system to avoid cascade failures. See also cascade failure.
guard page (stack): An inaccessible data area present at the end of the valid virtual address range for a stack. The purpose of the guard page is to cause a memory-access exception should the stack overflow past its defined range.
HA (or high availability): A designation applied to a system to indicate that it has a high level of availability. A system that's designed for high-availability needs to consider cascade failures, restartability, and fault tolerance. Generally speaking, a system designated as “high availability” will have an availability of 5 nines or better. See also cascade failure.
hot standby (high availability): Hot-standby mode refers to the way in which a failed software component is restarted. In hot-standby mode, the software component is actively running, and effectively shadows the state of the primary process. The primary process feeds it updates, so that the secondary (or “standby”) process is ready to take over the instant that the primary process fails. Hot standby is the fastest of the three modes (cold, warm, and hot); while its timing is system specific, it's usually thought of as being on the order of microseconds to milliseconds. It's also the most expensive to implement, because the standby must continually shadow the data updates from the primary process, and must be able to assume operation when the primary dies. Hot standby, however, is the preferred solution for minimizing MTTR and hence increasing availability. See also cold standby, restartability, and warm standby.
in-service upgrade or ISU (high availability): An upgrade performed on a live system, with the least amount of impact to the operation of the system. The basic algorithm is to simulate a fault; then, rather than restarting the failed component, the overlord process starts a new version of it. In certain cases, the policy of the overlord may be to perform a version downgrade instead of an upgrade. See also fault tolerance and restartability.
message-passing: The QNX Neutrino operating system is based on a message-passing model, where all services are provided in a synchronous manner by passing messages around from client to server. The client will send a message to the server and block. The server will receive a message from the client, perform some amount of processing, and then reply to the client's message, which will unblock the client. See also blocking and reply to a message.
MTBF or Mean Time Between Failures (high availability): The MTBF is expressed in hours and indicates the mean time that elapses between failures. MTBF is applied to both software and hardware, and is used, in conjunction with the MTTR, in the calculation of availability. A computer backplane, for example, may have an MTBF that's measured in the tens of thousands of hours of operation (several years). Software usually has a lower MTBF than hardware.
MTTR or Mean Time To Repair (high availability): The MTTR is expressed in hours, and indicates the mean time required to repair a system. MTTR is applied to both software and hardware, and is used, in conjunction with the MTBF, in the calculation of availability. A server, for example, may have an MTTR that's measured in milliseconds, whereas a hardware component may have an MTTR that's measured in minutes or hours, depending on the component. Software usually has a much lower MTTR than hardware.
overlord (high availability): A process responsible for monitoring the stability of various system processes, according to the policy, and performing actions (such as restarting processes based on a restart policy). The overlord may also be involved with an in-service upgrade or downgrade. See also restartability.
policy (high availability): A set of rules used in a high-availability system to determine the limits that are enforced by the overlord process against other processes in the system. The policy also determines how such processes are restarted, and may include algorithms such as exponential backoff. See also restartability.
primary (high availability): The “primary” designation refers to the active process when used in discussions of cold, warm, and hot standby. The primary system is running, and the secondary system(s) is/are the “backup” system(s). See also cold standby, warm standby, and hot standby.
process (noun): A non-schedulable entity that occupies memory, effectively acting as a container for one or more threads. See also thread.
pulse (message-passing): A nonblocking message received in a manner similar to a regular message. It is nonblocking for the sender, and can be waited on by the receiver using the standard message-passing functions MsgReceive() and MsgReceivev() or the special pulse-only receive function MsgReceivePulse(). While most messages are typically sent from client to server, pulses are generally sent in the opposite direction, so as not to break the send hierarchy (which could cause deadlock). See also receive a message.
QNX Software Systems: The company responsible for the QNX 2, QNX 4, and QNX Neutrino operating systems.
QSS: An abbreviation for QNX Software Systems.
receive a message (message-passing): A thread can receive a message by calling MsgReceive() or MsgReceivev(). If there is no message available, the thread will block, waiting for one. A thread that receives a message is said to be a server. See also blocking.
reply to a message (message-passing): A server will reply to a client's message to deliver the results of the client's request back to the client, and unblock the client. See also client.
resource manager: A server process that provides certain well-defined file-descriptor-based services to arbitrary clients. A resource manager supports a limited set of messages that correspond to standard client C library functions such as open(), read(), write(), lseek(), devctl(), etc. See also client.
restartability (high availability): The characteristic of a system or process that lets it be gracefully restarted from a faulted state. Restartability is key in lowering MTTR, and hence in increasing availability. The overlord process is responsible for determining that another process has exceeded some kind of limit, and then, based on the policy, the overlord process may be responsible for restarting the component.
secondary (or standby) (high availability): Refers to the inactive process when used in discussions of cold, warm, and hot standby. The primary system is the one that's currently running; the secondary system is the “backup” system. There may be more than one secondary process. See also cold standby, warm standby, and hot standby.
segment (memory): A contiguous “chunk” of memory with the same accessibility permissions throughout. Note that this is different from the (now archaic) x86 term, which indicated something accessible via a segment register. In this definition, a segment can be of an arbitrary size. Segments typically represent code (or “text”), data, stack, or other uses.
send a message (message-passing): A thread can send a message to another thread. The MsgSend*() series of functions are used to send the message; the sending thread blocks until the receiving thread replies to the message. A thread that sends a message is said to be a client. See also blocking, message-passing, and reply to a message.
send hierarchy: A design paradigm where messages are sent in one direction, and replies flow in the opposite direction. The primary purpose of having a send hierarchy is to avoid deadlock. A send hierarchy is accomplished by assigning clients and servers a “level,” and ensuring that messages are sent only to a higher level. Two threads can then never send to each other, because one of those sends would have to go to a lower level and would therefore violate the hierarchy. See also client, reply to a message, send a message, server, and thread.
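The level rule can be checked mechanically at design time. The sketch below is a hypothetical illustration: the struct send_edge type, the level[] table, and the function hierarchy_ok() are invented for this example and aren't part of any QNX Neutrino API.

```c
#include <assert.h>
#include <stdbool.h>

/* One planned blocking send, from one component to another. */
struct send_edge {
    int from;   /* index of the sending component   */
    int to;     /* index of the receiving component */
};

/*
 * Hypothetical design-time check: level[i] is the level assigned to
 * component i. The hierarchy holds only if every send goes to a
 * strictly higher level; replies and pulses flow the other way and
 * aren't listed as edges.
 */
bool hierarchy_ok(const int *level, int n_components,
                  const struct send_edge *edges, int n_edges)
{
    for (int i = 0; i < n_edges; i++) {
        assert(edges[i].from >= 0 && edges[i].from < n_components);
        assert(edges[i].to >= 0 && edges[i].to < n_components);
        if (level[edges[i].to] <= level[edges[i].from]) {
            return false;   /* send to the same or a lower level */
        }
    }
    return true;
}
```

If an application at level 0 sends to a server at level 1, which in turn sends to a driver at level 2, the check passes. Adding a send from the driver back to the application makes it fail, which is the situation where a pulse would be used instead.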
server (message-passing): A regular, user-level process that provides certain types of functionality (usually file-descriptor-based) to clients. Servers are typically resource managers. QNX Neutrino provides an extensive library that performs much of the functionality of a resource manager for you. The server's job is to receive messages from clients, process them, and then reply to the messages, which unblocks the clients. A thread within a process can be both a client and a server at the same time. See also client, receive a message, reply to a message, resource manager, and unblock.
stack (memory): A stack segment is one used for the stack of a thread. It generally is placed at a special virtual address location, can be grown on demand, and has a guard page. Contrast with data or code segments.
synchronous: Used to indicate that a given operation has some synchronization to another operation. For example, during a message-passing operation, when the server does a MsgReply() (to reply to the client), unblocking the client is said to be synchronous to the reply operation. Contrast with asynchronous. See also message-passing and unblock.
thread: A single, schedulable, flow of execution. Threads are implemented directly within the QNX Neutrino kernel and are manipulated by the POSIX pthread*() function calls. A thread will need to synchronize with other threads (if any) by using various synchronization primitives such as mutexes, condition variables, semaphores, etc. Threads are scheduled in either FIFO or Round Robin scheduling mode. A thread is always associated with a process.
timeout: Many kernel calls support the concept of a timeout, which limits the time spent in a blocked state. The blocked state will be exited if whatever condition was being waited upon has been satisfied, or the timeout time has elapsed. See also blocking.
unblock: A thread that had been blocked will be unblocked when the condition it has been blocked on is met, or a timeout occurs. For example, a thread might be blocked waiting to receive a message. When the message is sent, the thread will be unblocked. See also blocking and send a message.
warm standby (high availability): Warm-standby mode refers to the way a failed software component is restarted. In warm-standby mode, the software component is lying in a “dormant” state, perhaps having performed some rudimentary initialization. The component is waiting for the failure of its primary component; when that happens, the component completes its initializations, and then advertises itself as being ready to serve requests. Warm standby is the middle-of-the-road version of the three modes (cold, warm, and hot). While its timing is system-specific, this is usually thought of as being on the order of milliseconds. Warm standby is relatively easy to implement, because it performs its usual initializations (as if it were running in primary mode), then halts and waits for the failure of the primary before continuing operation. See also cold standby, hot standby, restartability, and server.