Fault-Tolerant Systems

As mentioned previously in this chapter, simulating a system to obtain its reliability or similar attributes requires knowledge of parameters such as the components' failure rates. These can be obtained either through lengthy observation or, much faster, through fault injection experiments. In such experiments, various faults are injected either into a simulation model of the target system or into a hardware and software prototype of the system. The behavior of the system in the presence of each fault is then observed and classified. Parameters that can be estimated from such experiments include the probability that a fault will cause an error, and the probability that the system will successfully perform the actions required to recover from that error (the latter probability is often called the coverage factor; see Chapter 2). These actions consist of detecting the fault, identifying the system component affected by the fault, and taking an appropriate recovery action, which may involve system reconfiguration. The time each of these actions takes is not constant: it may vary from one fault to another and may also depend on the current workload. Thus, fault injection experiments, in addition to providing estimates of the coverage factor, can also be used to estimate the distribution of the delay associated with each of the above actions.
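To make the estimation step concrete, the following is a minimal sketch of how the outcomes of a fault injection campaign might be tallied. It is not tied to any particular injection tool. The function inject_fault is a hypothetical stand-in for one injection into a simulation model or prototype, and the probabilities and mean delays inside it are invented purely so that the example runs. In a real campaign, each outcome would come from observing and classifying the behavior of the target system.

```python
import random
import statistics
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    """Classified result of one injected fault."""
    error: bool                          # did the fault cause an error?
    recovered: bool                      # did the system recover from it?
    detect_ms: Optional[float] = None    # time to detect the fault
    identify_ms: Optional[float] = None  # time to identify the faulty component
    recover_ms: Optional[float] = None   # time to complete the recovery action


def inject_fault(rng: random.Random) -> Outcome:
    """Hypothetical stand-in for one injection; all numbers are assumptions."""
    if rng.random() >= 0.6:       # assumed: 40% of faults never cause an error
        return Outcome(error=False, recovered=False)
    if rng.random() >= 0.95:      # assumed: 5% of errors are not recovered from
        return Outcome(error=True, recovered=False)
    # Assumed exponentially distributed delays with means of 2, 5 and 20 ms.
    return Outcome(
        error=True,
        recovered=True,
        detect_ms=rng.expovariate(1 / 2.0),
        identify_ms=rng.expovariate(1 / 5.0),
        recover_ms=rng.expovariate(1 / 20.0),
    )


def mean_or_nan(values: list) -> float:
    """Sample mean, or NaN when no samples were collected."""
    return statistics.fmean(values) if values else float("nan")


def run_campaign(n: int, seed: int = 0) -> dict:
    """Inject n faults and estimate the parameters discussed in the text."""
    rng = random.Random(seed)
    outcomes = [inject_fault(rng) for _ in range(n)]
    errors = [o for o in outcomes if o.error]
    recovered = [o for o in errors if o.recovered]
    return {
        # Fraction of injected faults that caused an error.
        "p_error": len(errors) / n,
        # Coverage factor: fraction of errors that were recovered from.
        "coverage": len(recovered) / len(errors) if errors else float("nan"),
        # Sample means of the three delays, over successful recoveries only.
        "mean_detect_ms": mean_or_nan([o.detect_ms for o in recovered]),
        "mean_identify_ms": mean_or_nan([o.identify_ms for o in recovered]),
        "mean_recover_ms": mean_or_nan([o.recover_ms for o in recovered]),
    }


if __name__ == "__main__":
    for name, value in run_campaign(20_000, seed=1).items():
        print(f"{name}: {value:.3f}")
```

Only the sample means of the delays are reported here. Since the recorded delays are kept per injection, the same data could equally be used to build a histogram or fit a distribution for each action, and the campaign could be repeated under different workloads to expose the workload dependence noted above.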
In addition, fault injection experiments can be used to evaluate and validate the system dependability. For example, errors in the implementation of fault tolerance mechanisms can be discovered, and system components whose failure is...