System-on-Chip Test Architectures: Nanometer Design for Testability

Fault tolerance is the ability of a system to continue error-free operation in the presence of unexpected faults. Faults can be either temporary (because of radiation, noise, ground bounce, etc.) or permanent (because of manufacturing defects, oxide breakdown, electromigration, etc.). As technology continues to scale, circuit behavior is becoming less predictable and more prone to failures, thereby increasing the need for fault-tolerant design. Fault tolerance requires some form of redundancy in time, space, or information. When an error occurs, it must either be masked or corrected, or, if the fault is temporary, the operation must be retried. If the fault is permanent, retrying the operation will not solve the problem. In that case, sufficient redundancy or spare units are required to continue error-free operation, or the part needs to be repaired or replaced.
This chapter gives an overview of fault-tolerant design schemes suitable for nanometer system-on-chip (SOC) applications. It starts with an introduction to the basic concepts in fault-tolerant design and the metrics used to specify and evaluate the dependability of a design. Coding theory is reviewed next, and some commonly used error-detecting and error-correcting codes are described. Fault-tolerant design schemes using hardware, time, and information redundancy are then discussed, followed by examples of fault-tolerant applications used in industry.
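The contrast drawn above between retrying an operation and masking an error can be illustrated in software. The following Python sketch is only an illustration under stated assumptions, not a scheme taken from the chapter: `make_unit`, `retry`, and `majority_vote` are hypothetical names, the "unit" is a toy adder whose faults flip the low bit of the result, and the `check` callback stands in for whatever error-detection mechanism a real design would provide.

```python
from collections import Counter


def make_unit(is_faulty):
    """Hypothetical adder; is_faulty(call_number) decides when the sum is corrupted."""
    state = {"calls": 0}

    def add(a, b):
        state["calls"] += 1
        total = a + b
        # Model a fault as a flipped least-significant bit in the result.
        return total ^ 1 if is_faulty(state["calls"]) else total

    return add


def retry(operation, check, attempts=3):
    """Time redundancy: repeat the operation until its result passes the check."""
    for _ in range(attempts):
        result = operation()
        if check(result):
            return result
    raise RuntimeError("fault persisted across retries; likely permanent")


def majority_vote(outputs):
    """Space redundancy: mask a minority of faulty replicas by voting."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among replica outputs")
    return value


# Assumed fault models: a transient fault hits only the first call,
# a permanent fault hits every call, and a good unit is never faulty.
transient = make_unit(lambda n: n == 1)
permanent = make_unit(lambda n: True)
good_a = make_unit(lambda n: False)
good_b = make_unit(lambda n: False)

# The check is a stand-in for an error detector; here it simply knows 2 + 3 = 5.
recovered = retry(lambda: transient(2, 3), lambda r: r == 5)

# Retrying cannot help the permanently faulty unit, but voting across
# three replicas masks its wrong output.
masked = majority_vote([good_a(2, 3), permanent(2, 3), good_b(2, 3)])
```

In this sketch the retry succeeds on the second attempt for the transient fault, while the permanently faulty unit would exhaust its retries and raise an error; the voter returns the correct sum only because two of the three replicas are fault-free.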