Two tragic corrections
About safety claims
A simple safe system
A new vulnerability
Software error detection
About the 2oo2 system
Assumptions about the memory devices
Assumptions about handling memory errors
Calculation with no software error detection
Calculation with software error detection
Summary of findings concerning software error correction
Rethinking the problem
Two tragic corrections
Two tragedies, one maritime and one in aviation, illustrate how well-meaning but ill-considered solutions can precipitate the very disasters they are meant to avoid. The first tragedy occurred in the Chicago River in 1915. The SS Eastland listed and rolled over, killing ...
About safety claims
When we design a safe software system, one of our first tasks must be to determine its safety requirements. This means that we must determine the system’s required level of dependability; or, inversely, the acceptable level ...
A simple safe system
The system we will use for our discussion is a very simple, hypothetical in-cab controller for an (equally hypothetical) automatic train operation (ATO) system running a driverless Light Rapid Transit (LRT) system. Figure 1 below illustrates this system. For simplicity, we have ...
A new vulnerability
The problem we face is that, although the effects of radiation on computer memory have long been known, no one thought to include the threat of memory errors caused by cosmic rays when the original specifications for our system were written. ...
Software error detection
Since the problem is memory errors, it seems obvious that the solution is to add memory error detection to our system. Of course, before we do this we should be certain that this solution will a) be effective and b) not compromise ...
About the 2oo2 system
The 2oo2 (two-out-of-two) system, which allows our ATO controller to leave its design safe state and perform its tasks running the LRT, functions as follows:
1. Two independent processing subsystems receive the same stimuli (events) from the outside environment;
2. Each processing subsystem uses the events it receives from the ...
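The general two-out-of-two idea can be sketched in a few lines. This is a minimal illustration of the principle, not the controller described here: the names `two_out_of_two`, `healthy_channel`, and `corrupted_channel` are hypothetical, and the rule that any disagreement returns the system to its design safe state is the usual 2oo2 convention, assumed here rather than taken from the steps above.

```python
from enum import Enum


class State(Enum):
    SAFE = "design safe state"
    OPERATING = "operating"


def two_out_of_two(channel_a, channel_b, events):
    """Return (state, output); output is None unless both channels agree."""
    # Both channels receive the same events and compute independently.
    out_a = channel_a(events)
    out_b = channel_b(events)
    # Act only on agreement; any disagreement forces the design safe state.
    if out_a == out_b:
        return State.OPERATING, out_a
    return State.SAFE, None


def healthy_channel(events):
    # Hypothetical stand-in for a subsystem's processing.
    return sum(events) % 256


def corrupted_channel(events):
    # Same computation with one bit flipped, as a memory error might cause.
    return healthy_channel(events) ^ 0x04


agree = two_out_of_two(healthy_channel, healthy_channel, [1, 2, 3])
disagree = two_out_of_two(healthy_channel, corrupted_channel, [1, 2, 3])
```

In this sketch a single corrupted channel cannot produce an output on its own: it can only cause a disagreement, which costs availability rather than safety.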
Assumptions about the memory devices
We assume that the memory devices (DIMMs) in our 2oo2 system have built-in single-bit error correction and double-bit error detection (SECDED) ECC algorithms based on a Hamming code with a minimum distance of 4. We also assume that the ...
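To make the distance-4 property concrete, here is a toy SECDED code: an extended Hamming (8,4) code protecting a 4-bit value. It is a sketch for illustration only. ECC DIMMs commonly protect 64-bit words with 8 check bits, and nothing here is taken from the devices assumed above. The function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity; 1..7: Hamming(7,4) positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); nibble is None if uncorrectable."""
    # The syndrome is the XOR of the positions of all set bits.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_odd = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd parity: treat as a single flip at the syndrome position
        # (position 0 is the overall parity bit itself).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrectable", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


word = encode(0b1011)
single = list(word)
single[5] ^= 1            # one flipped bit
double = list(single)
double[2] ^= 1            # a second flipped bit
```

Note that three or more flipped bits can be miscorrected or go unnoticed by a distance-4 code, which is why multi-bit errors need separate consideration.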
Assumptions about handling memory errors
Three types of memory failure are possible. In our calculations we make the following assumptions about how these three error types are handled. Detected and correctable memory errors are counted, but otherwise ignored ...
Calculation with no software error detection
To estimate the dangerous failure rate, we ran a simulation of 10^9 years (about 8.8 × 10^12 hours) 100 times, enough runs to calculate a confidence interval. The results of our simulation are shown in Table 1, which ...
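One common way to turn repeated simulation runs into a confidence interval is a normal approximation on the per-run failure counts. The sketch below assumes that approach; the method actually used for Table 1 is not stated here. The function name `failure_rate_interval`, the sample counts, and the `hours_per_run` value are all hypothetical placeholders, not results.

```python
import math
import statistics


def failure_rate_interval(failures_per_run, hours_per_run, z=1.96):
    """Approximate 95% CI for the failure rate per hour (normal approximation)."""
    n = len(failures_per_run)
    mean = statistics.fmean(failures_per_run)
    # Standard error of the mean from the sample standard deviation.
    half_width = z * statistics.stdev(failures_per_run) / math.sqrt(n)
    low = max(0.0, mean - half_width)  # a rate cannot be negative
    high = mean + half_width
    return low / hours_per_run, mean / hours_per_run, high / hours_per_run


# Placeholder counts, NOT simulation results; a real study would use one
# count per run. The run length assumes 10^9 years at 8,766 hours per year.
placeholder_counts = [2, 4, 4, 4, 5, 5, 7, 9]
low, mean, high = failure_rate_interval(placeholder_counts, hours_per_run=8.766e12)
```

With counts this small a Poisson-based interval may be more appropriate than the normal approximation; the sketch only shows the shape of the calculation.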
Calculation with software error detection
Given the relatively slow speed at which application-level software error detection operates (about 23 hours to test two gigabytes of memory), it is likely that ECC hardware will find both correctable and detectable but ...
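The figures quoted above imply a very low sweep rate, which a little arithmetic makes concrete. This sketch assumes that "two gigabytes" means 2 × 2^30 bytes and that sweeps run back to back; the names `sweep_rate_bytes_per_second` and `sweeps_per_year` are hypothetical.

```python
GIB = 2**30
HOURS_PER_YEAR = 8766  # 365.25 days


def sweep_rate_bytes_per_second(memory_bytes, sweep_hours):
    """Average rate at which one full software sweep covers memory."""
    return memory_bytes / (sweep_hours * 3600)


def sweeps_per_year(sweep_hours):
    """Number of complete sweeps per year if run continuously."""
    return HOURS_PER_YEAR / sweep_hours


rate = sweep_rate_bytes_per_second(2 * GIB, 23)  # roughly 26 kB/s
per_year = sweeps_per_year(23)                   # roughly 380 sweeps
```

Any given memory location is therefore checked by software only about once a day, whereas ECC hardware checks a word each time it is read.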
Summary of findings concerning software error correction
The 2oo2 model provides an excellent controller design for providing sys