The Dangers of Over-Engineering a Safe System

April 2013
15 pages

Attempts to increase dependability of a specific component without considering the question of overall system dependability may lead to the introduction of new problems. We examine the effect on dependability of adding software error detection to a 2oo2 system, consider the benefits and adverse consequences of this addition, and suggest other approaches to improving dependability.

Contents

Two tragic corrections; About safety claims; A simple safe system; A new vulnerability; Software error detection; About the 2oo2 system; Assumptions about the memory devices; Assumptions about handling memory errors; Calculation with no software error detection; Calculation with software error detection; Summary of findings concerning software error correction; Rethinking the problem

Two tragic corrections

Two tragedies, one maritime and the other in aviation, illustrate how well-meaning but ill-considered solutions can precipitate the very disasters they are meant to avoid. The first tragedy occurred in the Chicago River in 1915. The SS Eastland listed and rolled over, killing ...

About safety claims

When we design a safe software system, one of our first tasks must be to determine its safety requirements. This means that we must determine the system’s required level of dependability; or, inversely, the acceptable level ...

A simple safe system

The system we will use for our discussion is a very simple, hypothetical in-cab controller (for an equally hypothetical ATO system) running a driverless Light Rapid Transit (LRT) system. Figure 1 below illustrates this system. For simplicity, we have ...

A new vulnerability

The problem we face is that, though the effects of radiation on computer memory have long been known, when the original specifications for our system were written no one thought to include the threat of memory errors caused by cosmic rays. ...

Software error detection

Since the problem is memory errors, it seems obvious that the solution is to add memory error detection to our system. Of course, before we do this we should be certain that this solution will a) be effective and b) not compromise ...

About the 2oo2 system

The 2oo2 (two-out-of-two) system, which allows our ATO controller to leave its design safe state and perform its tasks running the LRT, functions as follows: 1. Two independent processing subsystems receive the same stimuli (events) from the outside environment; 2. Each processing subsystem uses the events it receives from the ...
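The comparison at the heart of a 2oo2 arrangement can be sketched in a few lines of Python. The summary above lists only the first steps, so this sketch assumes the conventional 2oo2 behaviour: both subsystems process the same event, their outputs are compared, and any disagreement sends the controller to its design safe state. The names `two_out_of_two` and `SAFE_STATE`, and the subsystem callables, are hypothetical and do not come from the paper.

```python
from typing import Any, Callable

# Hypothetical marker for the controller's design safe state.
SAFE_STATE = "SAFE_STATE"


def two_out_of_two(
    event: Any,
    subsystem_a: Callable[[Any], Any],
    subsystem_b: Callable[[Any], Any],
) -> Any:
    """Return the agreed output, or SAFE_STATE if the subsystems disagree.

    Assumption: the two subsystems are independent and receive the same event.
    """
    output_a = subsystem_a(event)
    output_b = subsystem_b(event)
    if output_a == output_b:
        return output_a  # both subsystems agree, so the output may be acted on
    return SAFE_STATE    # any disagreement is treated as a fault
```

In such an arrangement, a memory error that corrupts one subsystem's result shows up as a disagreement and forces the safe state: safety is preserved at the cost of availability.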

Assumptions about the memory devices

We assume that the memory devices (DIMMs) in our 2oo2 system have single-bit error correction and double-bit error detection (SECDED) ECC built in, based on a Hamming code with a minimum distance of 4. We also assume that the ...

Assumptions about handling memory errors

Three types of memory failure are possible. In our calculations we make the following assumptions about how these three error types are handled. Detected and correctable memory errors are counted, but otherwise ignored ...

Calculation with no software error detection

To estimate the dangerous failure rate, we ran a simulation of 10⁹ years (about 8.8 × 10¹² hours) 100 times, enough to obtain sufficient results for us to calculate a confidence interval. The results of our simulation are shown in Table 1, which ...
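As an illustration of how repeated simulation runs yield a confidence interval, the sketch below computes a normal-approximation interval for the mean of a set of per-run results. The paper's actual statistical method is not stated in this summary, so the approach, the function name `confidence_interval`, and the 95% default are all assumptions.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(samples: list[float], level: float = 0.95) -> tuple[float, float]:
    """Normal-approximation confidence interval for the mean of `samples`.

    Assumption: the runs are independent and numerous enough (e.g. 100)
    for the sample mean to be approximately normally distributed.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two runs to estimate the spread")
    centre = mean(samples)
    std_error = stdev(samples) / sqrt(n)        # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)   # two-sided critical value
    return centre - z * std_error, centre + z * std_error
```

Each element of `samples` would be the number of dangerous failures (or the failure rate) observed in one simulated run; more runs narrow the interval roughly in proportion to the square root of the run count.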

Calculation with software error detection

Given the relatively slow speed at which application-level software error detection operates (about 23 hours to test two gigabytes of memory), it is likely that ECC hardware will find both correctable and detectable but ...

Summary of findings concerning software error correction

The 2oo2 model provides an excellent controller design for providing sys

Author
Chris Hobbs
chobbs@qnx.com

Chris Hobbs is a kernel developer at QNX. He specializes in "sufficiently-available" software, that is, software created with the minimum development effort needed to meet the customer's availability and reliability needs, and in producing safe software (in conformance with IEC 61508 SIL3). He is also a specialist in WBEM/CIM device, network and service management, and the author of A Practical Approach to WBEM/CIM Management (2004).

In addition to his software development work, Chris is a flying instructor, a singer with a particular interest in Schubert's Lieder, and the author of several books, including Learning to Fly in Canada (2000) and The Largest Number Smaller than Five (2007). His blog, Software Musings, focuses "primarily on software and analytical philosophy".

Chris Hobbs earned a B.Sc., Honours in Pure Mathematics and Mathematical Philosophy at the University of London's Queen Mary and Westfield College.