Using the High Availability Manager
The High Availability Manager (HAM) provides a mechanism for monitoring processes and services on your system. The goal is to provide a resilient manager (or "smart watchdog") that can perform multistage recovery when system services or processes fail, do not respond, or provide an unacceptable level of service. The HA framework, including the HAM, uses a simple publish/subscribe mechanism to communicate interesting system events between interested components in the system.
The HAM acts as a conduit through which the rest of the system can both obtain and deliver information regarding the state of the system as a whole. The HAM can monitor specific processes and can control the behavior of the system when specific components fail and need to be recovered. The HAM also permits external detectors to report interesting events to the system, and can associate actions with the occurrence of these events.
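The idea of associating actions with events through a publish/subscribe mechanism can be illustrated with a small, self-contained C sketch. This is not the HAM API: the type and function names (event_bus_t, bus_subscribe(), bus_publish(), count_action()) and the event names are hypothetical, and the "bus" lives inside a single process purely for illustration.

```c
/* Hypothetical sketch of an event/action registry.
 * None of these names belong to the HAM API. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SUBSCRIPTIONS 16

/* An action is a callback that runs when its event is published. */
typedef void (*action_fn)(const char *event, void *ctx);

typedef struct {
    const char *event;   /* name of the event the action is tied to */
    action_fn   action;  /* callback to run when the event occurs */
    void       *ctx;     /* caller-supplied context passed to the callback */
} subscription_t;

typedef struct {
    subscription_t subs[MAX_SUBSCRIPTIONS];
    size_t         count;
} event_bus_t;

void bus_init(event_bus_t *bus)
{
    bus->count = 0;
}

/* Associate an action with an event.
 * Returns 0 on success, -1 if the table is full. */
int bus_subscribe(event_bus_t *bus, const char *event,
                  action_fn action, void *ctx)
{
    assert(bus != NULL && event != NULL && action != NULL);
    if (bus->count == MAX_SUBSCRIPTIONS)
        return -1;
    bus->subs[bus->count].event  = event;
    bus->subs[bus->count].action = action;
    bus->subs[bus->count].ctx    = ctx;
    bus->count++;
    return 0;
}

/* Deliver an event to every matching subscriber, in registration order.
 * Returns the number of actions that were run. */
int bus_publish(event_bus_t *bus, const char *event)
{
    int ran = 0;
    for (size_t i = 0; i < bus->count; i++) {
        if (strcmp(bus->subs[i].event, event) == 0) {
            bus->subs[i].action(event, bus->subs[i].ctx);
            ran++;
        }
    }
    return ran;
}

/* Example action: count how many times it has been triggered. */
void count_action(const char *event, void *ctx)
{
    (void)event;
    (*(int *)ctx)++;
}
```

In this sketch an external detector would call bus_publish() with an event name, and every action registered for that name would run. A real recovery action would restart or notify a component rather than increment a counter.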
In many HA systems, single points of failure (SPOFs) must be identified and dealt with carefully. Since the HAM maintains information about the health of the system and also provides the basic recovery framework, the HAM itself must never become a SPOF.
As a self-monitoring manager, the HAM is resilient to internal failures. If, for whatever reason, the HAM itself is stopped abnormally, it can immediately and completely reconstruct its own state. A mirror process called the Guardian perpetually stands ready and waiting to take over the HAM's role. Since all state information is maintained in shared memory, the Guardian can assume the exact same state that the original HAM was in before the failure.
But what happens to the Guardian role once the Guardian has taken over? The Guardian (now the new HAM) creates a new Guardian for itself before taking the place of the original HAM. Practically speaking, therefore, one can't exist without the other.
Since the HAM and the Guardian monitor each other, the system can recover completely from the failure of either one. The only way to stop the HAM is to explicitly instruct it to terminate the Guardian and then to terminate itself.
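The takeover and shutdown ordering described above can be modeled in a few lines of C. This is only a sketch under stated assumptions: shared_state_t stands in for the state kept in shared memory, the two "processes" are reduced to boolean flags, and every name here is hypothetical. In particular, having the HAM re-create a failed Guardian is an assumption drawn from the statement that the pair monitor each other.

```c
/* Hypothetical model of the HAM/Guardian pair.
 * None of these names belong to the HAM API. */
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the state information kept in shared memory. */
typedef struct {
    int  monitored_entities; /* illustrative payload that must survive a failure */
    bool ham_alive;
    bool guardian_alive;
    int  guardians_spawned;  /* how many Guardians have been created so far */
} shared_state_t;

void pair_start(shared_state_t *s, int entities)
{
    s->monitored_entities = entities;
    s->ham_alive          = true;
    s->guardian_alive     = true;
    s->guardians_spawned  = 1;
}

/* The HAM stops abnormally: the Guardian first creates a new Guardian
 * for itself, then takes the place of the original HAM. */
void on_ham_failure(shared_state_t *s)
{
    assert(s->guardian_alive);  /* a Guardian must exist to take over */
    s->ham_alive = false;
    s->guardians_spawned++;     /* step 1: spawn a replacement Guardian */
    s->ham_alive = true;        /* step 2: old Guardian becomes the HAM */
    /* guardian_alive stays true: the new Guardian fills that role, and
     * monitored_entities is untouched because it lives in shared state. */
}

/* The Guardian stops abnormally: assumed here to be recovered by the
 * HAM creating a new Guardian. */
void on_guardian_failure(shared_state_t *s)
{
    assert(s->ham_alive);
    s->guardian_alive = false;
    s->guardians_spawned++;
    s->guardian_alive = true;
}

/* Explicit shutdown: terminate the Guardian first, then the HAM,
 * so that nothing is left to restart the other. */
void pair_stop(shared_state_t *s)
{
    s->guardian_alive = false;
    s->ham_alive      = false;
}
```

The order inside pair_stop() matters in the model for the same reason it matters in the text: if the HAM terminated first, the surviving Guardian would treat it as a failure and take over.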