Systems Reliability and Failure Prevention

Differences between software and hardware failure mechanisms pose problems in the assessment of digital systems. It is essential to distinguish between software faults, which can take considerable time to correct, and software failures, which can typically be cleared by restarting the program or resetting the computer. Because one fault can cause many failures, it may be advisable to restrict the operation of a system until the fault has been corrected.
Once a program has successfully operated on a data set under specific timing conditions, it will always process that data set correctly under the same conditions. It follows that a program that has been debugged through reviews and testing is not likely to fail while processing normal data under normal conditions. Failures are more likely when the program processes exception conditions, particularly multiple rare events (REs) that occur together.
To make a program as reliable as possible, it must therefore be tested under conditions that cause many exceptions to be processed. Random selection of exception conditions, combined with a test environment that generates multiple exceptions for a given test case, promises economical elimination of software faults.
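The testing approach described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the condition names, the toy `process` function with its deliberately seeded fault, and the helpers `generate_test_cases` and `run_campaign` are hypothetical and do not come from any particular tool. The sketch draws exception conditions at random, activates at least two of them in every test case, and records which combinations make the program fail.

```python
import random

# Hypothetical exception conditions; the names are illustrative only.
EXCEPTION_CONDITIONS = [
    "sensor_timeout",
    "checksum_error",
    "buffer_overflow",
    "power_glitch",
    "stale_data",
]


def generate_test_cases(n_cases, max_exceptions=3, seed=0):
    """Build test cases that each activate 2..max_exceptions distinct conditions.

    A fixed seed makes the random selection repeatable, so a failing
    combination can be replayed after the fault has been corrected.
    """
    rng = random.Random(seed)
    cases = []
    for _ in range(n_cases):
        k = rng.randint(2, max_exceptions)
        cases.append(sorted(rng.sample(EXCEPTION_CONDITIONS, k)))
    return cases


def process(reading, active):
    """Toy program under test (an assumption, not a real system).

    Each condition is handled correctly on its own, but a seeded fault
    appears when 'sensor_timeout' and 'stale_data' occur together.
    """
    value = reading
    if "sensor_timeout" in active:
        value = None  # no fresh reading is available
    if "stale_data" in active:
        # Seeded fault: this branch assumes a numeric value is present.
        value = value * 0.5
    return value


def run_campaign(cases, reading=10.0):
    """Run every test case and collect (conditions, error name) for failures."""
    failures = []
    for active in cases:
        try:
            process(reading, set(active))
        except Exception as exc:
            failures.append((active, type(exc).__name__))
    return failures


if __name__ == "__main__":
    for conditions, error in run_campaign(generate_test_cases(50)):
        print(conditions, "->", error)
```

In this sketch the fault never shows up when a single condition is active; it is exposed only by test cases in which both triggering conditions are selected together, which is why each generated case activates more than one exception.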
The Unified Modeling Language (UML) and software development using UML tools can eliminate omissions and ambiguities in requirements statements. They can also facilitate the generation of a software failure modes and effects analysis (FMEA) with detail and objectivity comparable to a hardware FMEA, so that it can serve as the centerpiece of a reliability program.
Fault tolerant software can be and has been used...