Compound restart


Recovery often involves more than restarting a single component. The death of one component might actually require restarting and resetting many other components. We might also have to do some initial cleanup before the dead component is restarted.

A HAM lets you specify a list of actions that will be performed when a given condition is triggered. For example, suppose the entity being monitored is fs-nfs3, and there's a set of directories that have been mounted and are currently in use. If fs-nfs3 were to die, the simple restart of that component won't remount the directories and make them available again! We'd have to restart fs-nfs3, and then follow that up with the explicit mounting of the appropriate directories.

Similarly, if io-sock were to die, it would take down the network drivers and TCP/IP stack with it. So recovering from an io-sock failure involves not just restarting io-sock, but also resetting any other components that use a network connection (like ptpd2) so that they can establish their connections again.

The following example sets up a compound restart for fs-nfs3.

/* addnfs.c */

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/netmgr.h>
#include <fcntl.h>
#include <ha/ham.h>

int main(int argc, char *argv[])
{
    ham_entity_t *ehdl;
    ham_condition_t *chdl;
    ham_action_t *ahdl;
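    const char *fsnfspath = "/usr/sbin/fs-nfs3";

    /* Connect to the HAM before making any other HAM calls. */
    if (ham_connect(0) == -1) {
        perror("ham_connect");
        exit(EXIT_FAILURE);
    }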
    ehdl = ham_attach("Fs-nfs3", 0, -1, fsnfspath, 0);
    if (ehdl == NULL) {
        perror("ham_attach");
        exit(EXIT_FAILURE);
    }
    chdl = ham_condition(ehdl, CONDDEATH, "Death", HREARMAFTERRESTART);
    if (chdl == NULL) {
        perror("ham_condition");
        exit(EXIT_FAILURE);
    }
    ahdl = ham_action_restart(chdl, "Restart", fsnfspath, HREARMAFTERRESTART);
    if (ahdl == NULL) {
        perror("ham_action_restart");
        exit(EXIT_FAILURE);
    }
    /* NULL path: wait the full 2000 ms, giving the restarted fs-nfs3 time to start. */
    ahdl = ham_action_waitfor(chdl, "Delay1", NULL, 2000, HREARMAFTERRESTART);
    if (ahdl == NULL) {
        perror("ham_action_waitfor(Delay1)");
        exit(EXIT_FAILURE);
    }
    ahdl = ham_action_execute(chdl, "Mount_bin",
            "/bin/mount -t nfs 10.12.1.115:/qnx/bin /bin",
            HREARMAFTERRESTART|HACTIONDONOW);
    if (ahdl == NULL) {
        perror("ham_action_execute(Mount_bin)");
        exit(EXIT_FAILURE);
    }

    ahdl = ham_action_waitfor(chdl, "Delay2", NULL, 2000, HREARMAFTERRESTART);
    if (ahdl == NULL) {
        perror("ham_action_waitfor(Delay2)");
        exit(EXIT_FAILURE);
    }

    ahdl = ham_action_execute(chdl, "MountWeb",
            "/bin/mount -t nfs 10.12.1.115:/web /web",
            HREARMAFTERRESTART|HACTIONDONOW);
    if (ahdl == NULL) {
        perror("ham_action_execute(MountWeb)");
        exit(EXIT_FAILURE);
    }

    ham_disconnect(0);
    exit(EXIT_SUCCESS);
}

This example attaches fs-nfs3 as an entity, and then attaches a restart action followed by a series of waitfor and execute actions to its death condition. When fs-nfs3 dies, the HAM restarts it and then remounts the remote directories in sequence, pausing before each mount. Note that a waitfor action can serve as a plain delay, as it does here with a NULL path, or it can wait for a specific name to appear in the namespace.
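To illustrate what waiting for a name to appear in the namespace amounts to, here is a minimal portable sketch of a polling wait with a timeout. It is not HAM's implementation and not part of the HAM API: wait_for_path() is a hypothetical helper, and the 10 ms polling interval is an arbitrary choice made for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical helper (not part of the HAM API): poll until `path`
 * exists or roughly `timeout_ms` milliseconds have elapsed.
 * Returns 0 if the path is present, -1 if the wait timed out.
 * Elapsed time is tracked approximately, by counting sleep intervals.
 */
int wait_for_path(const char *path, int timeout_ms)
{
    const int step_ms = 10;                       /* polling interval */
    struct timespec step = { 0, step_ms * 1000000L };
    struct stat sb;
    int waited = 0;

    for (;;) {
        if (stat(path, &sb) == 0)
            return 0;                             /* name is present */
        if (waited >= timeout_ms)
            return -1;                            /* gave up waiting */
        nanosleep(&step, NULL);
        waited += step_ms;
    }
}
```

In the HAM itself you would express this as a waitfor action with a non-NULL path argument instead of writing your own loop; the sketch only shows the shape of the behavior, which is to return as soon as the name exists or to give up once the delay has passed.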
