Tru64 UNIX Troubleshooting: Diagnosing and Correcting System Problems

Most system crashes are caused by kernel panics. A panic occurs when the UNIX kernel detects a severe software or hardware error and deliberately brings the system down rather than continuing to operate in an unsafe manner. It is also possible for a system to crash without panicking. This kind of crash is almost invariably due to hardware or environmental issues, such as power or temperature problems. (In rare cases, a system may crash due to a kernel panic that leaves no trace of itself; an example of this type of problem was discussed in section 2.2.4.)
In general terms, crashes can be divided into three major classes:
Kernel panics that produce crash dumps
Kernel panics that don't produce crash dumps
Non-panic crashes
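As a rough illustration, the three classes above can be expressed as a small decision function. This is only a sketch: the names CrashClass and classify_crash and the two boolean inputs are hypothetical, not part of Tru64 UNIX, and how the evidence is gathered (for example, by checking the console output or the system message file for a panic message) is left to the administrator.

```python
from enum import Enum


class CrashClass(Enum):
    """The three major crash classes described in the text."""
    PANIC_WITH_DUMP = "kernel panic that produced a crash dump"
    PANIC_WITHOUT_DUMP = "kernel panic that did not produce a crash dump"
    NON_PANIC = "non-panic crash"


def classify_crash(panic_message_found: bool, crash_dump_found: bool) -> CrashClass:
    """Pick a crash class from two observations (both hypothetical inputs).

    panic_message_found: a panic message was seen on the console or in the logs.
    crash_dump_found:    a crash dump was recovered after the reboot.
    """
    if crash_dump_found:
        # A dump implies the kernel's dump path ran, so treat it as a panic
        # with a dump (a forced dump of a hung system also lands here).
        return CrashClass.PANIC_WITH_DUMP
    if panic_message_found:
        return CrashClass.PANIC_WITHOUT_DUMP
    # Provisional: in rare cases a panic leaves no trace at all, so this
    # result should be read as "no evidence of a panic", not as proof.
    return CrashClass.NON_PANIC


if __name__ == "__main__":
    print(classify_crash(panic_message_found=True, crash_dump_found=False).value)
```

The crash dump is checked first because it is the strongest evidence available; a result of NON_PANIC points toward hardware or environmental causes but does not rule out a panic that left no record.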
These three classes require different troubleshooting techniques. Before getting into these, we'll discuss how Tru64 UNIX crash dumps are created.
When the kernel encounters a severe problem that causes it to panic, it first writes a panic message to the system console, the system message file, and the binary error log. The panic routine then stops all running processes and calls a kernel routine named "dumpsys" to dump the contents of physical memory to disk, specifically to one or more of the active swap devices. (The dumpsys routine can also be invoked by entering the console command "CRASH"; in this way, a forced crash dump can be created when a system is hung.) The dumpsys routine locates the...