Fault-Tolerant Systems

Errors in data may occur when the data are being transferred from one unit to another, from one system to another, or even while the data are stored in a memory unit. To tolerate such errors, we introduce redundancy into the data: this is called information redundancy. The most common form of information redundancy is coding, which adds check bits to the data, allowing us to verify the correctness of the data before using them and, in some cases, even to correct the erroneous data bits. Several commonly used error-detecting and error-correcting codes are discussed in Section 3.1.
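As a minimal illustration of what adding check bits means, the following Python sketch appends a single even-parity bit to a data word and uses it to detect a one-bit error. The function names add_even_parity and parity_ok are hypothetical, chosen only for this example, and a single parity bit is just the simplest detecting code, not necessarily one the chapter singles out.

```python
def add_even_parity(bits):
    """Append one check bit so that the total number of 1s is even."""
    parity = sum(bits) % 2
    return bits + [parity]


def parity_ok(word):
    """Return True if the word (data bits plus check bit) has even parity."""
    return sum(word) % 2 == 0


data = [1, 0, 1, 1, 0, 0, 1]      # 7 data bits (illustrative values)
word = add_even_parity(data)      # 8-bit codeword
assert parity_ok(word)            # an error-free word passes the check

corrupted = word.copy()
corrupted[2] ^= 1                 # flip a single bit to simulate an error
assert not parity_ok(corrupted)   # the single-bit error is detected
```

A single parity bit detects any odd number of bit errors but cannot locate them, and an even number of flipped bits goes unnoticed. Correcting errors requires more check bits.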
Introducing information redundancy through coding is not limited to the level of individual data words but can be extended to provide fault tolerance for larger data structures. The best-known example of such a use is the Redundant Array of Independent Disks (RAID) storage system. Various RAID organizations are presented in Section 3.2, and the resulting improvements in reliability and availability are analyzed.
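To suggest how coding can protect a larger data structure, the sketch below computes a bytewise XOR parity block over several data blocks and uses it to rebuild one lost block, which is the idea behind the parity-based RAID levels. The block contents and the helper name xor_blocks are assumptions made for illustration; a real array adds striping, parity placement, and failure detection, none of which is modeled here.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three equal-length data blocks, one per hypothetical disk.
data_blocks = [b"disk0 da", b"disk1 da", b"disk2 da"]
parity_block = xor_blocks(data_blocks)   # stored on an additional disk

# Simulate the loss of disk 1: XOR the surviving blocks with the parity block.
surviving = [data_blocks[0], data_blocks[2], parity_block]
recovered = xor_blocks(surviving)
assert recovered == data_blocks[1]
```

One parity block tolerates the loss of any single block in the group, provided the failed disk is known. It cannot recover from two simultaneous losses.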
In a distributed system where the same data sets may be needed by different nodes in the system, data replication may help with data accessibility. Keeping a copy of the data on just a single node could cause this node to become a performance bottleneck and leave the data vulnerable to the failure of that node. An alternative approach would be to keep identical copies of the data on multiple nodes. Several schemes for managing the replicated copies of the same data are presented in Section...