Fault-Tolerant Systems

This chapter introduces the reader to statistical simulation approaches for evaluating the reliability and related attributes of fault-tolerant computer systems.
Simulation is frequently used when analytical approaches are either not feasible or not sufficiently accurate. Simulation, in general, has a deep theoretical foundation in statistics that can take years to master, and to which many books have been devoted. However, learning to write a basic simulation program and to use the fundamental statistical tools for analyzing the output data is much easier. These basic techniques are what we concentrate on in this chapter. Even so, this chapter is meant primarily for readers with a reasonably strong understanding of probability theory.
We start by explaining how to write a simulation program. We then show how the output can be analyzed to deduce the system attributes. Next, we consider ways in which the results can be made more accurate by reducing the variance of the simulation output. We end the chapter by considering a different kind of simulation: fault injection, which is an experimental technique to characterize a system's response to faults.
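To make the first two steps concrete (writing a simulation program, then analyzing its output), here is a minimal sketch in Python. The system being simulated is an assumption chosen purely for illustration: a triple modular redundant (TMR) arrangement of three independent modules with exponentially distributed lifetimes, which survives a mission if at least two modules do. The function names, failure rate, and mission time are hypothetical, and the confidence interval uses the usual normal approximation for a proportion.

```python
import math
import random


def simulate_tmr_survives(rate, mission_time, rng):
    """One simulation run: True if at least 2 of 3 modules outlive the mission.

    Assumption (not from the chapter): module lifetimes are independent and
    exponentially distributed with the given failure rate.
    """
    lifetimes = [rng.expovariate(rate) for _ in range(3)]
    return sum(t > mission_time for t in lifetimes) >= 2


def estimate_reliability(rate, mission_time, runs, seed=1):
    """Estimate mission reliability and a 95% confidence half-width."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    successes = sum(
        simulate_tmr_survives(rate, mission_time, rng) for _ in range(runs)
    )
    p_hat = successes / runs
    # Normal-approximation interval for a binomial proportion.
    half_width = 1.96 * math.sqrt(p_hat * (1.0 - p_hat) / runs)
    return p_hat, half_width


def exact_tmr_reliability(rate, mission_time):
    """Closed-form check for this simple case: 3r^2 - 2r^3, r = exp(-rate*t)."""
    r = math.exp(-rate * mission_time)
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    p_hat, hw = estimate_reliability(rate=0.001, mission_time=500.0, runs=100_000)
    print(f"simulated reliability: {p_hat:.4f} +/- {hw:.4f}")
    print(f"closed-form value:     {exact_tmr_reliability(0.001, 500.0):.4f}")
```

A case this simple has a closed-form answer, so simulation is not actually needed here; the point is only that the estimate and its interval can be checked against a known value before the same pattern is applied to systems that are too complex to analyze directly.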
When faced with the need to construct a simulation model, one has three options:
Write a program in a high-level general-purpose programming language, such as C, Java, or C++.
Use a special-purpose simulation language such as SIMSCRIPT, GPSS, or SIMAN.
Use or modify an available simulation package that has been designed to simulate such systems. Examples include SimpleScalar for computer architectures and OPNET for network...