Mission-Critical Microsoft Exchange 2000: Building Highly Available Messaging and Knowledge Management Systems

If you recall the discussion of the "black box" of downtime from Chapter 2, you will remember my point that downtime is not a single event but a series of individual outage components. My main emphasis was that, by identifying and evaluating each component of a downtime occurrence, we can look for ways to shorten the downtime event as a whole. By looking inside each outage point, we may find process improvements that substantially reduce or even eliminate periods of time that are unnecessary or too lengthy. In Chapter 2, I identified seven points or components of a typical outage: prefailure errors, the failure point, the notification point, the decision point, the recovery action point, the postrecovery point, and the normal operational point. Each of these components contains many subcomponents, and within them we may find errors or oversights that, once addressed, substantially reduce or eliminate the time that component consumes.
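As a rough illustration of this decomposition, the following Python sketch treats an outage as a timeline of the points named above and measures the time spent between each consecutive pair. Every timestamp is invented for illustration, and the names timeline, segment_durations, and largest_segment are hypothetical. Prefailure errors are left out because they occur before the failure point and so add no downtime in this simplified model.

```python
# Hypothetical timeline: minutes elapsed since the failure point for each
# outage point. All figures are invented for illustration only.
timeline = [
    ("failure point", 0),
    ("notification point", 20),
    ("decision point", 55),
    ("recovery action point", 70),
    ("postrecovery point", 250),
    ("normal operational point", 280),
]


def segment_durations(points):
    """Return (start_label, end_label, minutes) for each consecutive pair of points."""
    return [
        (a_name, b_name, b_time - a_time)
        for (a_name, a_time), (b_name, b_time) in zip(points, points[1:])
    ]


def largest_segment(points):
    """Return the segment that accounts for the most elapsed time."""
    return max(segment_durations(points), key=lambda seg: seg[2])


segments = segment_durations(timeline)
total = sum(minutes for _, _, minutes in segments)

# Report each segment and its share of the whole outage.
for start, end, minutes in segments:
    print(f"{start} -> {end}: {minutes} min ({minutes / total:.0%} of outage)")

print("Total downtime:", total, "min")
print("Largest segment begins at the", largest_segment(timeline)[0])
```

Breaking the total into segments this way shows which part of the outage to examine first. With these made-up numbers, the stretch that begins at the recovery action point dominates the total.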
This chapter focuses on the recovery action point. I believe that this component is responsible for the majority of the "chargeable" time within a downtime event. For example, I have seen many organizations accumulate hours of downtime simply because they did not have a good backup or because they interfered with Exchange Server's own recovery measures. I believe that lack of knowledge and poor operational...