Fault-Tolerant Systems

This chapter illustrates the practical use of the methods described earlier in the book by highlighting the fault-tolerance aspects of six computer systems whose designs implement a variety of fault-tolerance techniques. We do not aim to provide a comprehensive, low-level description; for that, the interested reader should consult the references mentioned in the Further Reading section.
Several generations of NonStop systems have been developed since 1976 by Tandem Computers (since acquired by Hewlett Packard). These fault-tolerant systems have been used mainly for online transaction processing, where a reliable real-time response to inquiries must be guaranteed. Their fault-tolerance features have evolved from one generation to the next, taking advantage of better technologies and newer approaches to fault tolerance. In this section we present the main fault-tolerance aspects of the NonStop designs, without attempting to cover all of them.
The NonStop systems have followed four key design principles, listed below.
Modularity. The hardware and software are constructed of modules of fine granularity. These modules constitute units of failure, diagnosis, service, and repair. Keeping the modules as decoupled as possible reduces the probability that a fault in one module will affect the operation of another.
Fail-Fast Operation. A fail-fast module either works properly or stops. Thus, each module is self-checking and stops upon detecting a failure. Hardware checks (through error-detecting codes; see Chapter 3) and software consistency tests (see Chapter 5) support fail-fast operation.
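To make the fail-fast idea concrete, the following is a minimal sketch in Python of a self-checking module. It is not taken from any NonStop design: the names FailFastMemory, ModuleStopped, and inject_bit_flip are hypothetical, and a single even-parity bit stands in for the stronger error-detecting codes a real system would use. The module either returns data that has passed its check or halts and refuses all further requests.

```python
class ModuleStopped(Exception):
    """Raised when the (hypothetical) fail-fast module has halted itself."""


def parity(word):
    """Return the even-parity bit of a non-negative integer."""
    return bin(word).count("1") % 2


class FailFastMemory:
    """Toy memory module: it either returns checked data or stops."""

    def __init__(self):
        self._cells = {}      # address -> (word, parity bit stored at write time)
        self.stopped = False

    def _require_running(self):
        # Once halted, the module never serves another request.
        if self.stopped:
            raise ModuleStopped("module is halted")

    def write(self, addr, word):
        self._require_running()
        self._cells[addr] = (word, parity(word))

    def read(self, addr):
        self._require_running()
        word, stored_parity = self._cells[addr]
        if parity(word) != stored_parity:
            # Self-check failed: stop rather than hand back suspect data.
            self.stopped = True
            raise ModuleStopped("parity error at address %d" % addr)
        return word

    def inject_bit_flip(self, addr, bit):
        """Test hook (an assumption of this sketch): corrupt one stored bit."""
        word, stored_parity = self._cells[addr]
        self._cells[addr] = (word ^ (1 << bit), stored_parity)


if __name__ == "__main__":
    mem = FailFastMemory()
    mem.write(0, 0b1011)
    print(mem.read(0))            # the check passes, so the word is returned
    mem.inject_bit_flip(0, 2)     # simulate a single-bit hardware fault
    try:
        mem.read(0)
    except ModuleStopped as err:
        print("stopped:", err)    # the module halts instead of returning bad data
```

The point of the sketch is the behavior after detection: the module does not try to repair or mask the error itself. It stops, so that the fault cannot propagate, and recovery is left to the rest of the system. Single-bit parity misses any even number of bit flips, which is one reason real designs rely on stronger codes.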
Single Failure Tolerance. When...