Compound restart

Recovery often involves more than restarting a single component. The death of one component can require restarting and resetting many other components. We might also have to do some cleanup before the dead component is restarted.

A HAM lets you specify a list of actions that will be performed when a given condition is triggered. For example, suppose the entity being monitored is fs-nfs2, and there's a set of directories that have been mounted and are currently in use. If fs-nfs2 dies, simply restarting it won't remount those directories and make them available again. We'd have to restart fs-nfs2, and then explicitly mount the appropriate directories.

Similarly, if io-pkt* were to die, it would take down the network drivers and the TCP/IP stack (npm-tcpip.so) with it. So restarting io-pkt* also involves reinitializing the network driver. Any other components that use the network connection (such as inetd) also need to be reset so that they can reestablish their connections.
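The idea of an ordered action list can be sketched in portable C. The following is a toy model only, not the HAM API: the types condition_t and action_t, the functions condition_add_action() and condition_trigger(), and the step descriptions are all hypothetical names chosen for illustration. It shows a condition that owns a list of actions and runs them in the order they were added.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_ACTIONS 8

/* One recovery step: a short name plus a description of what it does. */
typedef struct {
    const char *name;
    const char *what;
} action_t;

/* A condition owns an ordered list of actions (hypothetical model). */
typedef struct {
    const char *name;
    action_t actions[MAX_ACTIONS];
    int count;
} condition_t;

/* Append an action; returns 0 on success, -1 if the list is full. */
int condition_add_action(condition_t *c, const char *name, const char *what)
{
    assert(c != NULL);
    if (c->count >= MAX_ACTIONS)
        return -1;
    c->actions[c->count].name = name;
    c->actions[c->count].what = what;
    c->count++;
    return 0;
}

/*
 * "Run" every action in the order it was added. This model only records
 * the action names in log, separated by commas; it starts no processes.
 * Returns the number of actions recorded.
 */
int condition_trigger(const condition_t *c, char *log, size_t logsize)
{
    size_t used = 0;
    int i;

    assert(c != NULL && log != NULL && logsize > 0);
    log[0] = '\0';
    for (i = 0; i < c->count; i++) {
        int n = snprintf(log + used, logsize - used, "%s%s",
                         i ? "," : "", c->actions[i].name);
        if (n < 0 || (size_t)n >= logsize - used)
            break;      /* log buffer is full; stop recording */
        used += (size_t)n;
    }
    return i;
}
```

For the io-pkt* case, you might add three steps to a "Death" condition in this model (restart io-pkt*, reinitialize the driver, reset inetd); triggering the condition then records them in exactly that order.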

Consider the following example of a compound restart.

/* addnfs.c */

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/netmgr.h>
#include <fcntl.h>
#include <ha/ham.h>

int main(int argc, char *argv[])
{
    ham_entity_t *ehdl;
    ham_condition_t *chdl;
    ham_action_t *ahdl;
    char *fsnfspath;
    int fsnfs2pid;

    /* Optional arguments: the path to fs-nfs2, then the pid of a running instance. */
    if (argc > 1)
        fsnfspath = strdup(argv[1]);
    else
        fsnfspath = strdup("/usr/sbin/fs-nfs2");
    if (argc > 2)
        fsnfs2pid = atoi(argv[2]);
    else
        fsnfs2pid = -1;

    if (ham_connect(0) == -1) {
        printf("connect to HAM failed\n");
        exit(1);
    }

    ehdl = ham_attach("Fs-nfs2", ND_LOCAL_NODE, fsnfs2pid, fsnfspath, 0);
    if (ehdl != NULL) {
        chdl = ham_condition(ehdl, CONDDEATH, "Death", HREARMAFTERRESTART);
        if (chdl != NULL) {
            /* Restart fs-nfs2 first; the remaining actions follow it in order. */
            ahdl = ham_action_restart(chdl, "Restart", fsnfspath,
                                      HREARMAFTERRESTART);
            if (ahdl == NULL)
                printf("add action failed\n");
            else {
                /* Give fs-nfs2 time to come up before mounting. */
                ahdl = ham_action_waitfor(chdl, "Delay1", NULL, 2000,
                                          HREARMAFTERRESTART);
                if (ahdl == NULL)
                    printf("add action failed\n");
                /* If no pid was given, HACTIONDONOW also runs the mount now. */
                ahdl = ham_action_execute(chdl, "MountPPCBE",
                        "/bin/mount -t nfs 10.12.1.115:/ppcbe /ppcbe",
                        HREARMAFTERRESTART |
                        ((fsnfs2pid == -1) ? HACTIONDONOW : 0));
                if (ahdl == NULL)
                    printf("add action failed\n");
                ahdl = ham_action_waitfor(chdl, "Delay2", NULL, 2000,
                                          HREARMAFTERRESTART);
                if (ahdl == NULL)
                    printf("add action failed\n");
                ahdl = ham_action_execute(chdl, "MountWeb",
                        "/bin/mount -t nfs 10.12.1.115:/web /web",
                        HREARMAFTERRESTART |
                        ((fsnfs2pid == -1) ? HACTIONDONOW : 0));
                if (ahdl == NULL)
                    printf("add action failed\n");
            }
        }
        else
            printf("add condition failed\n");
    }
    else
        printf("add entity failed\n");
    ham_disconnect(0);
    exit(0);
}

This example attaches fs-nfs2 as an entity, adds a death condition to it, and then attaches a restart action followed by a series of waitfor and execute actions to that condition. When fs-nfs2 dies, the HAM restarts it and then remounts the remote directories in sequence. Note that you can specify delays as actions, and you can also wait for specific names to appear in the namespace.
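To make the idea of waiting for a name concrete, here is a minimal portable sketch of what such a wait amounts to: poll until a path exists or a timeout expires. This is not how the HAM implements its waitfor action; the function wait_for_path(), its millisecond timeout, and the 10 ms polling interval are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Poll until path exists or roughly timeout_ms milliseconds have passed.
 * Returns 0 if the path appeared, -1 on timeout.
 * (Hypothetical helper; not part of the HAM library.)
 */
int wait_for_path(const char *path, int timeout_ms)
{
    const int step_ms = 10;                          /* polling interval */
    struct timespec step = { 0, step_ms * 1000000L };
    struct stat sb;
    int waited = 0;

    assert(path != NULL);
    for (;;) {
        if (stat(path, &sb) == 0)
            return 0;            /* the name is now in the namespace */
        if (waited >= timeout_ms)
            return -1;           /* gave up waiting */
        nanosleep(&step, NULL);
        waited += step_ms;
    }
}
```

A recovery step could call wait_for_path() on the name that a restarted server registers, and only go on to the mount step once it returns 0. Waiting on a name this way is usually more robust than a fixed delay, because the next step starts as soon as the server is ready and fails visibly if it never comes up.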