Using the High Availability Manager

Introduction

The High Availability Manager (HAM) provides a mechanism for monitoring processes and services on your system. The goal is to provide a resilient manager (or “smart watchdog”) that can perform multistage recovery when system services or processes fail, do not respond, or provide an unacceptable level of service. The HA framework, including the HAM, uses a simple publish/subscribe mechanism to communicate interesting system events between interested components in the system. By automatically integrating into the native networking mechanism (QNET), this framework transparently extends a local monitoring mechanism to a network.

The HAM acts as a conduit through which the rest of the system can both obtain and deliver information regarding the state of the system as a whole. The system could be a single node or a collection of nodes connected via QNET. The HAM can monitor specific processes and can control the behavior of the system when specific components fail and need to be recovered. The HAM also permits external detectors to report interesting events to the system, and can associate actions with the occurrence of these events.

In many HA systems, single points of failure (SPOFs) must be identified and dealt with carefully. Since the HAM maintains information about the health of the system and also provides the basic recovery framework, the HAM itself must never become a SPOF.

As a self-monitoring manager, the HAM is resilient to internal failures. If, for whatever reason, the HAM itself is stopped abnormally, it can immediately and completely reconstruct its own state. A mirror process called the Guardian perpetually stands ready and waiting to take over the HAM's role. Since all state information is maintained in shared memory, the Guardian can assume the exact same state that the original HAM was in before the failure.

But what protects the Guardian once it has taken over? Before taking the place of the original HAM, the Guardian (now the new HAM) creates a new Guardian for itself. Practically speaking, therefore, one can't exist without the other.

Since the HAM/Guardian pair monitor each other, the failure of either one can be completely recovered from. The only way to stop HAM is to explicitly instruct it to terminate the Guardian and then to terminate itself.

HAM hierarchy

HAM consists of three main components: entities, conditions, and actions.

Entities

Entities are the fundamental units of observation/monitoring in the system. Essentially, an entity is a process (pid). As processes, all entities are uniquely identifiable by their pids. Associated with each entity is a symbolic name that can be used to refer to that specific entity. Again, the names associated with entities are unique across the system. Managers are currently associated with a node, so uniqueness rules apply to a node. As we'll see later, this uniqueness requirement is very similar to the naming scheme used in a hierarchical filesystem.

The basic entity types are:

Self-attached entities
These are processes that explicitly choose to be HA-aware. These processes use the ham_attach_self() and ham_detach_self() functions to connect to and disconnect from a HAM. Self-attached processes are compiled against the HAM API library, and the lifetime of the monitoring is from the time of the ham_attach_self() call to the time of the ham_detach_self() call.

Self-attached entities can also choose to send heartbeats to a HAM, which will then monitor them for failure. Since arbitrary processes on the system aren't necessarily “trackable” for failure (i.e. they're not in session 1, not child processes, etc.), you can use this heartbeat mechanism to monitor such processes.

Self-attached entities can, on their own, decide at exactly what point in their lifespan they want to be monitored, what conditions they want acted upon, and when they want to stop the monitoring. In other words, this is a situation where a process says, “Do the following if I die.”

Externally attached entities
These are generic processes in the system that are being monitored. These could be arbitrary daemons or service providers whose health is deemed important. This method is useful for the case where Process A says, “Tell me when Process B dies” but Process B needn't know about this at all.

Global entity
A global entity is really just a placeholder for matching any entity. It can be used to associate actions that will be triggered when an interesting event is detected with respect to any entity on the system. The term global refers to the set of entities being monitored in the system. This permits one to say things like “when any process dies or when any process misses a heartbeat, do the following”. The global entity is never added or removed, but is only referred to. Conditions can be added to or removed from the global entity as usual, and actions can be added to or removed from any of the conditions.

Note: To get a handle for a global entity, call ham_entity_handle(), passing NULL for the ename argument.

The dumper process is normally used to obtain core images of processes that terminate abnormally as a result of performing an illegal operation. A HAM receives notification of such terminations from dumper. In addition, the HAM receives notification, from the system, of the termination of any process that is in session 1. This includes daemon processes that call procmgr_daemon(), thereby detaching themselves from their controlling terminal.

If a process calls daemon(), a new process is created and replaces the original one, becoming the session leader. If the HAM was monitoring the original process, it automatically switches to monitoring the new process instead.

Conditions

Conditions are associated with entities. These conditions represent the state of the entity. Here are some examples of conditions:

CONDDEATH
The entity has died.
CONDABNORMALDEATH
The entity has died an abnormal death. This condition is triggered whenever an entity dies by a mechanism that results in the generation of a core file (see dumper in the Utilities Reference for details).
CONDDETACH
The entity that was being monitored is detaching. This ends HAM's monitoring of that entity.
CONDATTACH
An entity for which a placeholder was previously created (someone has subscribed to events relating to this entity) has joined the system. This is also the start of the monitoring of the entity by a HAM.
CONDHBEATMISSEDHIGH
The entity missed sending a heartbeat message specified for a condition of “high” severity.
CONDHBEATMISSEDLOW
The entity missed sending a heartbeat message specified for a condition of “low” severity.
CONDRESTART
The entity was restarted. This condition is true after the entity is successfully restarted.
CONDRAISE
An externally detected condition is reported to a HAM. Subscribers can associate actions with these externally detected conditions.
CONDSTATE
An entity reports a state transition to a HAM. Subscribers can associate actions with specific state transitions.
CONDANY
This condition type matches any condition type. It can be used to associate the same actions with one of many conditions.

The conditions described above, with the exception of CONDSTATE, CONDRAISE, and CONDANY, are automatically detected and/or triggered by a HAM (i.e. the HAM is the publisher of the conditions). The CONDSTATE and CONDRAISE conditions are published to a HAM by external detectors. For all conditions, subscribers can associate lists of actions that will be performed in sequence when the condition is triggered. Both the CONDSTATE and CONDRAISE conditions provide filtering capabilities so that subscribers can selectively associate actions with individual conditions, based on the information published.

Conditions are also associated with symbolic names, which also need to be unique within an entity.


Note: The HAM architecture is extensible. Several conditions are automatically detected by a HAM. Also, by using the Condition Raise mechanism other components in the system can notify a HAM of interesting events in the system. These conditions can be fully customized. Also, by studying the source code, it is possible to add the capability of detecting other conditions into the HAM (e.g. low memory, high CPU utilization, low disk space, etc.) to suit your HA application.

Actions

Actions are associated with conditions. A condition can contain multiple actions. The actions are executed whenever the corresponding condition is true. Actions within a condition execute in FIFO order (the order in which they were added into the condition). Multiple conditions that are true are triggered simultaneously in an arbitrary order. Conditions specified as HCONDINDEPENDENT will execute in a separate thread of execution, in parallel with other conditions. (See the section Condition functions in this chapter.)

The HAM API includes several functions for different kinds of actions:

ham_action_restart()
This action restarts the entity.
ham_action_execute()
Executes an arbitrary command (e.g. to start a process).
ham_action_notify_pulse()
Notifies some process that this condition has occurred. This notification is sent using a specific pulse with a value specified by the process that wished to receive this notify message. Pulses can be delivered to remote nodes, by specifying the appropriate node specifier.
ham_action_notify_signal()
Notifies some process that this condition has occurred. This notification is sent using a specific realtime signal with a value specified by the process that wished to receive this notify message. Signals can be delivered to remote nodes, by specifying the appropriate node specifier.
ham_action_notify_pulse_node()
This is the same as the ham_action_notify_pulse() described above, except that the node name specified for the recipient of the pulse can be given using the fully qualified node name instead of the node identifier.
ham_action_notify_signal_node()
This is the same as the ham_action_notify_signal() described above, except that the node name specified for the recipient of the signal can be given using the fully qualified node name instead of the node identifier.
ham_action_waitfor()
This action lets you insert delays between consecutive actions in a sequence. You can also wait for certain names to appear in the namespace.
ham_action_heartbeat_healthy()
Resets the heartbeat mechanism for an entity that had previously missed sending heartbeats, and had triggered a missed heartbeat condition, but has now recovered.
ham_action_log()
This allows one to insert a customizable verbosity message into the activity log maintained by a HAM.

Actions are also associated with symbolic names, which are unique within a specific condition.


Note: Again, the HAM architecture is extensible, so you may add your own action functions as you see fit.

Action Fail actions

When an action in a list of actions fails, one can specify an alternate list of actions that will be performed to recover from the failure of the given action. These actions are referred to as action_fail actions, and are associated with each individual action. The action_fail actions are essentially the same set of actions that would normally be executed, with the exception of ham_action_restart() and ham_action_heartbeat_healthy(). Here's the list of action fail actions:

ham_action_fail_execute()
Executes an arbitrary command (e.g. to start a process).
ham_action_fail_notify_pulse()
Notifies some process that this condition has occurred. This notification is sent using a specific pulse with a value specified by the process that wished to receive this notify message. Pulses can be delivered to remote nodes by specifying the appropriate node specifier.
ham_action_fail_notify_signal()
Notifies some process that this condition has occurred. This notification is sent using a specific realtime signal with a value specified by the process that wished to receive this notify message. Signals can be delivered to remote nodes by specifying the appropriate node specifier.
ham_action_fail_notify_pulse_node()
This is the same as the ham_action_fail_notify_pulse() described above, except that the node name specified for the recipient of the pulse can be given using the fully qualified node name instead of the node identifier.
ham_action_fail_notify_signal_node()
This is the same as the ham_action_fail_notify_signal() described above, except that the node name specified for the recipient of the signal can be given using the fully qualified node name instead of the node identifier.
ham_action_fail_waitfor()
This action lets you insert delays between consecutive actions in a sequence. You can also wait for certain names to appear in the namespace.
ham_action_fail_log()
This allows one to insert a customizable verbosity message into the activity log maintained by a HAM.
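
The relationship between an action list and its action_fail lists can be pictured with a small model in plain C. This is only a sketch: the types and function names (model_action, run_condition, and the sample steps) are hypothetical and aren't part of the HAM API, and the sketch assumes that the main sequence carries on with the next action after a failure, which this chapter doesn't state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TRACE_MAX 32

/* Hypothetical model types; none of these are part of the HAM API. */
typedef int (*step_fn)(char *trace);   /* returns 0 on success, -1 on failure */

typedef struct {
    step_fn run;              /* the action itself */
    const step_fn *on_fail;   /* fail actions, run only if `run` fails */
    size_t n_fail;
} model_action;

/* Append a one-letter tag to the trace without overflowing it. */
static void trace_put(char *trace, const char *tag)
{
    strncat(trace, tag, TRACE_MAX - strlen(trace) - 1);
}

/* Sample steps: a restart that fails, plus a log and a notify that succeed. */
static int restart_fails(char *trace) { trace_put(trace, "R"); return -1; }
static int log_ok(char *trace)        { trace_put(trace, "L"); return 0; }
static int notify_ok(char *trace)     { trace_put(trace, "N"); return 0; }

/* Run the actions of one condition in FIFO order.
   Returns the number of actions that failed. */
int run_condition(const model_action *actions, size_t n, char *trace)
{
    int failures = 0;

    assert(actions != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (actions[i].run(trace) == 0)
            continue;
        failures++;
        /* The action failed: perform its own fail actions, in order. */
        for (size_t j = 0; j < actions[i].n_fail; j++)
            actions[i].on_fail[j](trace);
        /* Assumption: the main sequence then continues with the next action. */
    }
    return failures;
}

/* Build a two-action condition, run it, and report how many actions failed. */
int demo(char trace[TRACE_MAX])
{
    static const step_fn restart_fail_list[] = { log_ok };
    const model_action actions[] = {
        { restart_fails, restart_fail_list, 1 },
        { notify_ok,     NULL,              0 },
    };

    trace[0] = '\0';
    return run_condition(actions, 2, trace);
}
```

Calling demo() with a TRACE_MAX-byte buffer records the order in which the steps ran: the failed restart, its fail action, and then the next action in the main list.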

Multistaged recovery

This complete mechanism allows us to perform recovery of a failure of a single service or process in a multi-staged fashion.

For example, suppose you've started fs-nfs2 (the NFS filesystem) and then mounted a few directories from multiple sources. You can instruct HAM to restart fs-nfs2 upon failure, and also to remount the appropriate directories as required after restarting the NFS process. And if during the lifespan of fs-nfs2 some directories are unmounted, you can remove those particular actions from the set of actions to be performed.

As another example, suppose io-pkt* (network I/O manager) were to die. We can tell a HAM to restart it and also to load the appropriate network drivers (and maybe a few more services that essentially depend on network services in order to function).

State of the HAM

Effectively, a HAM's internal state is like a hierarchical filesystem, where entities are like directories, conditions associated with those entities are like subdirectories, and actions inside those conditions are like leaf nodes of this tree structure.

A HAM also presents this state as a read-only filesystem under /proc/ham. As a result, arbitrary processes can also view the current state (e.g. you can do ls /proc/ham).

Besides presenting a view of the state as a filesystem, for each item (entity/condition/action) a HAM can also display statistics and information relating to it in a corresponding .info file at each level in a HAM filesystem under /proc/ham.
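
To make the filesystem analogy concrete, here is a minimal sketch that models the three levels as nested C structures and builds the path an action would have under /proc/ham. The structure and function names are hypothetical and chosen for illustration; they don't reflect how a HAM stores its state internally.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model of the three-level hierarchy; not the HAM's own types. */
typedef struct {
    const char *name;
} tree_action;

typedef struct {
    const char *name;
    const tree_action *actions;
    size_t n_actions;
} tree_condition;

typedef struct {
    const char *name;
    const tree_condition *conds;
    size_t n_conds;
} tree_entity;

/* One entity (inetd) with one condition (death) holding one action (restart). */
static const tree_action    inetd_actions[] = { { "restart" } };
static const tree_condition inetd_conds[]   = { { "death", inetd_actions, 1 } };
const tree_entity           inetd           = { "inetd", inetd_conds, 1 };

/* Count every action below an entity, across all of its conditions. */
size_t count_actions(const tree_entity *e)
{
    size_t total = 0;

    for (size_t c = 0; c < e->n_conds; c++)
        total += e->conds[c].n_actions;
    return total;
}

/* Write the path of action `a` of condition `c` as it would appear under
   /proc/ham. Returns the snprintf() result, or -1 for a bad index. */
int action_path(char *out, size_t outsz, const tree_entity *e,
                size_t c, size_t a)
{
    if (c >= e->n_conds || a >= e->conds[c].n_actions)
        return -1;
    return snprintf(out, outsz, "/proc/ham/%s/%s/%s",
                    e->name, e->conds[c].name, e->conds[c].actions[a].name);
}
```

Entities map to directories, conditions to subdirectories, and actions to the leaf files, so a path always has exactly three components below /proc/ham.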

Example of the view shown in /proc/ham

Consider the following simple example where a HAM is monitoring inetd and restarts it when it dies:

# ls -al /proc/ham
total 2
-r--------  1 root      root            175 Aug 30 23:05 .info
dr-x------  1 root      root              1 Aug 30 23:06 inetd

The .info file at the highest level provides information about the HAM and the Guardian, as well as an overview of the entities and other objects in the system:

# cat /proc/ham/.info
Ham Pid            : 10993674
Guardian Pid       : 10997782
Ham Failures       : 0
Guardian Failures  : 0
Num Entities       : 1
Num Conditions     : 1
Num Actions        : 1

In this case the only entity being monitored is inetd, which appears as a directory at the top level under /proc/ham:

# ls -al /proc/ham/inetd
total 2
-r--------  1 root      root            173 Aug 30 23:06 .info
dr-x------  1 root      root              1 Aug 30 23:06 death

# cat /proc/ham/inetd/.info
Path            : inetd
Entity Pid      : 11014167
Num conditions  : 1
Entity type     : ATTACHED
Stats:
Created         : 2001/08/30 23:04:49:930148650
Num Restarts    : 0

As you can see, the .info provides information and statistics relating to the inetd entity that is being monitored. The information is generated dynamically and contains up-to-date data for each entity.

The inetd entity has associated with it only one condition (i.e. death), which is triggered when the entity dies.

# ls -al /proc/ham/inetd/death
total 2
-r--------  1 root      root            126 Aug 30 23:07 .info
-r--------  1 root      root            108 Aug 30 23:07 restart

# cat /proc/ham/inetd/death/.info
Path            : inetd/death
Entity Pid      : 11014167
Num Actions     : 1
Condition ReArm : ON
Condition type  : CONDDEATH

Similarly, there's only one action associated with this death condition: the restart mechanism. Each action under the condition appears as a file under the appropriate condition directory. The file contains details about the action that will be performed when the condition is triggered.

# cat /proc/ham/inetd/death/restart
Path         : inetd/death/restart
Entity Pid   : 11014167
Action ReArm : ON
Restart Line : /usr/sbin/inetd -D

Note: If inetd isn't a self-attached entity, you need to specify the -D option to it, to force inetd to daemonize by calling procmgr_daemon() instead of by calling daemon(). The HAM can see death messages only from self-attached entities, processes that terminate abnormally, and tasks that are running in session 1, and the call to daemon() doesn't put the caller into that session.

If inetd is a self-attached entity, you don't need to specify the -D option because the HAM automatically switches to monitoring the new process that daemon() creates.


When inetd dies, all the actions associated with a death condition under it are executed:

# slay inetd

# cat /proc/ham/inetd/.info
Path            : inetd
Entity Pid      : 11071511  <- new pid of entity
Num conditions  : 1
Entity type     : ATTACHED
Stats:
Created         : 2001/08/30 23:04:49:930148650
Last Death      : 2001/08/30 23:10:31:889820814
Restarted       : 2001/08/30 23:10:31:904818519
Num Restarts    : 1

As you can see, the statistics relating to the entity inetd are updated.

Similarly, if a HAM itself is terminated, the Guardian takes over as the new HAM, and creates a Guardian for itself.

# cat /proc/ham/.info
Ham Pid            : 10993674  <----- This is the HAM
Guardian Pid       : 10997782  <----- This is the Guardian
Ham Failures       : 0
Guardian Failures  : 0
Num Entities       : 1
Num Conditions     : 1
Num Actions        : 1

... Kill the ham ....

# /bin/kill -9 10993674        <---- Simulate failure

... re-read the stats ...

# cat /proc/ham/.info  
Ham Pid            : 10997782  <----- This is the new HAM
Guardian Pid       : 11124746  <----- This is the Guardian
Ham Failures       : 1
Guardian Failures  : 0
Num Entities       : 1
Num Conditions     : 1
Num Actions        : 1

As you can see, the old Guardian is now the new HAM, and a new Guardian has been created. All entities and conditions remain as before; the monitoring continues as usual. The HAM and the Guardian ignore all signals that they can.

HAM API

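A HAM provides an API for you to use in order to interact with it. This API provides a collection of functions to connect to and disconnect from a HAM, attach and detach entities, create entity placeholders, and add or remove conditions and actions.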

The API is implemented as a library that you can link against. The library is thread-safe and also cancellation-safe.

Connect/disconnect functions

The HAM API library maintains only one connection to the HAM. The library itself is thread-safe, and multiple connections (from different threads or from the same thread) are multiplexed on the same single connection to a HAM. The library maintains reference counts.

Here are the basic connect functions:

/* Basic connect functions
   return success (0) or failure (-1, with errno set) */

int ham_connect(unsigned flags);
int ham_connect_nd(int nd, unsigned flags);
int ham_connect_node(const char *nodename, unsigned flags);

int ham_disconnect(unsigned flags);
int ham_disconnect_nd(int nd, unsigned flags);
int ham_disconnect_node(const char *nodename, unsigned flags);

These functions are used to open or close connections to a HAM. The first call to ham_connect*() will open the fd, while subsequent calls will increment the reference count.

Similarly, ham_disconnect*() will decrement the reference count until zero, with the call that makes the count zero closing the fd. The functions return -1 on error with errno set, and 0 on success.

In a multithreaded situation, there will exist only one open connection to a given HAM at any given time, even if multiple threads were to perform ham_connect*()/ham_disconnect*() calls.

The ham_*_nd() and ham_*_node() versions of the calls are used to open a connection to a remote HAM across QNET. The nd that is passed to the function is the node identifier that refers to the remote host at the instant the call is made. Since node identifiers are transient values, it is essential that the node identifier is obtained just prior to the call. The other option is to use the fully qualified node name (FQNN) of the host and to pass this as the nodename parameter. An nd of ND_LOCAL_NODE (a constant defined in sys/netmgr.h) or a nodename of NULL (or the empty string) are equivalent, and refer to the current node. (This is also the same as calling ham_connect() or ham_disconnect() directly).

Calls to ham_connect(), ham_connect_nd(), and ham_connect_node() can be freely mixed, as long as the number of connect calls equals the number of disconnect calls for each connection to a specific (local or remote) HAM before the connection (fd) is closed.
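
The reference-counting behavior described above can be sketched with a small model. The names conn_t, model_connect(), and model_disconnect() are hypothetical; the sketch is single-threaded and leaves out the locking that a thread-safe library would need. Returning -1 for an unbalanced disconnect is an assumption of the sketch.

```c
/* Hypothetical model of one reference-counted connection; not the HAM library. */
typedef struct {
    int refs;      /* connect calls not yet balanced by a disconnect */
    int fd_open;   /* 1 while the simulated descriptor is open */
    int opens;     /* how many times the descriptor was actually opened */
} conn_t;

/* The first connect opens the descriptor; later ones only bump the count. */
int model_connect(conn_t *c)
{
    if (c->refs == 0) {
        c->fd_open = 1;
        c->opens++;
    }
    c->refs++;
    return 0;
}

/* Each disconnect drops the count; the one that reaches zero closes the
   descriptor. Assumption: an unbalanced disconnect is reported as -1. */
int model_disconnect(conn_t *c)
{
    if (c->refs == 0)
        return -1;
    if (--c->refs == 0)
        c->fd_open = 0;
    return 0;
}
```

With a zero-initialized conn_t, two connects followed by two disconnects open the descriptor once and close it only on the second disconnect.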

Attach/detach functions

For self-attached entities

ham_entity_t *ham_attach_self(char *ename, uint64_t hp, int hpdl, 
              int hpdh, unsigned flags);
int ham_detach_self(ham_entity_t *ehdl, unsigned flags);

You use these two functions to attach/detach a process to/from a HAM as a self-attached entity.

The ename argument represents the symbolic name for this entity, which needs to be unique in the system (of all monitored entities at the instant the call is made).

The hp argument represents time values in nanoseconds for the heartbeat period. Heartbeating can be used to ensure “liveness” of the monitored entity. Liveness is a property that describes a component's useful progress. In many cases, the availability of a system component is compromised not because the component has necessarily died, but because it isn't responding or making any progress. The heartbeating mechanism lets you specify that a component will issue a heartbeat at a given interval, and if it misses a certain number of heartbeats, then that would constitute a heartbeat-missed condition.

The hpdl and hpdh arguments represent the number of heartbeats that can be missed before the CONDHBEATMISSEDLOW and CONDHBEATMISSEDHIGH conditions are triggered. The HAM API library registers this request with a HAM and also creates a thread that keeps the connection to a HAM open. If the entity were to abnormally terminate, the connection to the HAM is closed, and the HAM will know that this is an abnormal termination (since ham_detach_self() wasn't called first).

On the other hand, if a HAM were to abnormally fail (extremely unlikely) and the Guardian takes over as the new HAM, the connection to the old HAM will have gone stale. In that case, the Guardian notifies all self-attached entities to reattach. The extra thread mentioned above handles this reattach transparently.

If a connection to a HAM is already open, then ham_attach_self() uses the same connection, but increments the reference count of connections opened by this client. A client that indicates that it will heartbeat at a certain period must call ham_heartbeat() to actually transmit a heartbeat to the HAM.
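
As an illustration of how hp, hpdl, and hpdh relate, the following sketch counts missed heartbeats from elapsed time and maps the count to a severity. The function and type names are hypothetical. The sketch also assumes that a condition triggers once the missed count reaches the threshold; the HAM's actual boundary behavior may differ.

```c
#include <stdint.h>

/* Hypothetical model; these names are not part of the HAM API. */
typedef enum { HB_OK, HB_MISSED_LOW, HB_MISSED_HIGH } hb_state;

/* Number of whole heartbeat periods elapsed since the last heartbeat. */
int hb_missed(uint64_t now_ns, uint64_t last_ns, uint64_t hp_ns)
{
    if (hp_ns == 0 || now_ns <= last_ns)
        return 0;
    return (int)((now_ns - last_ns) / hp_ns);
}

/* Map a missed-heartbeat count to a severity.
   Assumption: the threshold itself already counts as a trigger. */
hb_state hb_classify(int missed, int hpdl, int hpdh)
{
    if (missed >= hpdh)
        return HB_MISSED_HIGH;
    if (missed >= hpdl)
        return HB_MISSED_LOW;
    return HB_OK;
}
```

With a 5-second period, hpdl = 4, and hpdh = 8, an entity that has been silent for 22 seconds has missed four heartbeats and falls into the “low” case under this model, while 40 seconds of silence reaches the “high” case.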

The library also verifies whether the ename provided by the caller is unique. If it doesn't already exist, then this request is forwarded to a HAM, which also checks it again to avoid any race conditions in creating new entities. The ham_attach_self() returns a generic handle, which can be used to detach the process from the HAM later. Note that this handle is an opaque pointer that's also used to add conditions and actions as shown below.

The ham_detach_self() function is used to close the connection to a HAM. From this point on, the HAM will no longer monitor this process as a self-attached entity. The extra thread is canceled. The ham_detach_self() function takes as an argument the handle returned by ham_attach_self().

Code snippet using self-attach/detach calls

The following snippet of code uses the ham_attach_self() and ham_detach_self() functions:

...
ham_entity_t *ehdl; /* The entity handle */
int status;

/*
 Connects to a HAM with a heartbeat period of 5 seconds
 (expressed in nanoseconds), an entity name of "client1",
 and no flags. It also specifies hpdl = 4 and hpdh = 8.
*/

ehdl = ham_attach_self("client1", 5000000000ULL, 4, 8, 0);
if (ehdl == NULL) {
  printf("Could not attach to Ham\n");
  exit(EXIT_FAILURE);
}
/* Detach from a HAM using the original handle */
status = ham_detach_self(ehdl, 0);
...

For attaching/detaching all other entities

ham_entity_t *ham_attach(char *ename, int nd, pid_t pid, char *line, 
              unsigned flags);
ham_entity_t *ham_attach_node(char *ename, const char *nodename, pid_t pid, 
              char *line, unsigned flags);
int ham_detach(ham_entity_t *ehdl, unsigned flags);
int ham_detach_name(int nd, char *ename, unsigned flags);
int ham_detach_name_node(const char *nodename, char *ename, unsigned flags);

These attach/detach/detach-name functions are very similar to the *_self() functions above, except here the calling process asks a HAM to monitor a different process.

This mechanism allows for arbitrary monitoring of entities that already exist and aren't compiled against the HAM API library. In fact, the entities that are being monitored needn't even be aware that they're being monitored.

You can use the ham_attach() call either to start an entity and then monitor it, or to begin monitoring an entity that's already running.

In the ham_attach() call, if pid is -1, then we assume that the entity isn't running. The entity is started now using line as the startup command line for it. But if pid is greater than 0, then line is ignored and the pid given is attached to as an entity. Again ename needs to be unique across all entities currently registered.

The nd specifier in ham_attach() and ham_detach_name(), and the nodename specifier in the ham_attach_node() and ham_detach_name_node() versions of the calls, are used to refer to a remote HAM across QNET. The nd that is passed to the function is the node identifier that refers to the remote host at the instant the call is made. Since node identifiers are transient values, it is essential that the node identifier is obtained just prior to the call. The other option is to use the fully qualified node name (FQNN) of the host and to pass this as the nodename parameter. An nd of ND_LOCAL_NODE (a constant defined in sys/netmgr.h) or a nodename of NULL (or the empty string) are equivalent, and refer to the current node.

The ham_detach*() functions stop monitoring a given entity. The ham_detach() call takes as an argument the original handle returned by ham_attach(). You can also call ham_detach_name(), which uses the entity's name instead of the handle.

Note that the entity handle can also be used later to add conditions to the entity (described below).

Code snippet using attach/detach calls

...
ham_entity_t *ehdl;
int status;
ehdl = ham_attach("inetd", 0, -1, "/usr/sbin/inetd -D", 0);
/* inetd is started, running and monitored now */
... 
...
status = ham_detach(ehdl,0);
...
...

Of course the attach and detach needn't necessarily be performed by the same caller:

...
ham_entity_t *ehdl;
int status;
/* starts and begins monitoring inetd */
ehdl = ham_attach("inetd", 0, -1, "/usr/sbin/inetd -D", 0);
...
...
/* disconnect from Ham (monitoring still continues) */
exit(0);

And to detach inetd:

...
int status;
/* stops monitoring inetd. */
status = ham_detach_name(0, "inetd", 0);
...
exit(0);

If inetd were already running, say with pid 105328676, then we can write the attach/detach code as follows:

ham_entity_t *ehdl;
int status;
ehdl = ham_attach("inetd", 0, 105328676, NULL, 0);
...
...
status = ham_detach(ehdl,0);
/* status = ham_detach_name(0, "inetd",0); */
...
...
exit(0);

For convenience, the ham_attach() and ham_detach() functions connect to a HAM if such a connection doesn't already exist. We do this only to make the use of the functions easier.

The connections to a HAM persist only for the duration of the attach/detach calls; any subsequent requests to the HAM must be preceded by the appropriate ham_connect() calls.

The best way to perform a large sequence of requests to a HAM is to:

  1. Call ham_connect() before the first request.
  2. Call ham_disconnect() after the last request.

This is the most efficient method, because it guarantees that there's always the same connection open to the HAM.

Entity functions

The ham_attach*() functions are normally used when an entity is either already running or will be started by a HAM, and monitoring begins with the invocation of the ham_attach*() call. The HAM API also provides two functions that allow users to create placeholders for entities that are not yet running and that might be started in the future. This allows subscribers of interesting events to indicate their interest in these events, without necessarily waiting for a publisher (other entity/HAM) to create the entity.

ham_entity_t *ham_entity(const char *ename, int nd, unsigned flags);
ham_entity_t *ham_entity_node(const char *ename, const char *nodename, 
              unsigned flags);

These functions create entity placeholders with the name specified by ename, on the node described by either the node identifier nd or the node name nodename. Once created, these placeholders can be used to add conditions and actions to their associated entities. When a subsequent ham_attach*() call is made that references the same ename, it will fill the entity placeholder with the appropriate process ID. From that time onwards, the entity is monitored normally.

Condition functions

ham_condition_t *ham_condition(ham_entity_t *ehdl, int type,
                 const char *cname, unsigned flags);
int ham_condition_remove(ham_condition_t *chdl, unsigned flags);

Each entity can be associated with various conditions. And for each of these conditions there's a set of actions that will be performed in sequence when the condition is true. If an entity has multiple conditions that are true simultaneously with different sets of actions associated with each condition, then all the actions are performed for each condition, in sequence.

This mechanism lets you combine actions together into sets and choose to remove/control them as a single “group” instead of as individual items.

Since conditions are associated with entities, an entity handle must be available in order to add conditions. The ham_condition*() functions return an opaque pointer that is a condition handle, which you can use later to either remove a condition or add actions to the condition.
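
The following sketch strings the calls together: connect, attach an entity, add a death condition using the entity handle, and disconnect. It uses only the prototypes shown in this chapter. Two things are assumptions: the header name <ha/ham.h> comes from the QNX library reference rather than from this chapter, and the #else branch holds stand-in stubs (with a placeholder value for CONDDEATH) so the sketch also builds on systems without the HAM library. The function name watch_inetd() is hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>

#ifdef __QNXNTO__
#include <ha/ham.h>   /* assumed header name for the HAM API library */
#else
/* Stand-in stubs so the sketch builds off-target. They mimic only the
   signatures and return conventions shown in this chapter. */
typedef struct { int unused; } ham_entity_t;
typedef struct { int unused; } ham_condition_t;
enum { CONDDEATH = 1 };   /* placeholder value, not the real constant */
static ham_entity_t    stub_entity;
static ham_condition_t stub_condition;

static int ham_connect(unsigned flags)    { (void)flags; return 0; }
static int ham_disconnect(unsigned flags) { (void)flags; return 0; }

static ham_entity_t *ham_attach(char *ename, int nd, pid_t pid,
                                char *line, unsigned flags)
{
    (void)ename; (void)nd; (void)pid; (void)line; (void)flags;
    return &stub_entity;
}

static ham_condition_t *ham_condition(ham_entity_t *ehdl, int type,
                                      const char *cname, unsigned flags)
{
    (void)type; (void)cname; (void)flags;
    return ehdl != NULL ? &stub_condition : NULL;
}
#endif

/* Start inetd under HAM control and add a death condition to it.
   Returns 0 on success, -1 on failure. */
int watch_inetd(void)
{
    char ename[] = "inetd";
    char line[]  = "/usr/sbin/inetd -D";
    ham_entity_t *ehdl;
    ham_condition_t *chdl = NULL;

    if (ham_connect(0) == -1)
        return -1;

    /* A pid of -1 asks the HAM to start the entity from `line`. */
    ehdl = ham_attach(ename, 0, -1, line, 0);
    if (ehdl != NULL)
        chdl = ham_condition(ehdl, CONDDEATH, "death", 0);

    /* Balance the ham_connect() above whether or not the calls succeeded. */
    ham_disconnect(0);
    return chdl != NULL ? 0 : -1;
}
```

The explicit ham_connect()/ham_disconnect() pair keeps a single connection open across both requests. The condition handle in chdl is what the action functions would take to add actions to the "death" condition.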

Condition types

You can specify any of the following for type:

CONDDEATH
The entity has died.
CONDABNORMALDEATH
The entity has died an abnormal death. This condition is triggered whenever an entity dies by a mechanism that results in the generation of a core file (see dumper in the Utilities Reference for details).
CONDDETACH
The entity that was being monitored is detaching. This ends HAM's monitoring of that entity.
CONDATTACH
An entity for which a placeholder was previously created (someone has subscribed to events relating to this entity) has joined the system. This is also the start of the monitoring of the entity by a HAM.
CONDHBEATMISSEDHIGH
The entity missed sending a heartbeat message specified for a condition of “high” severity.
CONDHBEATMISSEDLOW
The entity missed sending a heartbeat message specified for a condition of “low” severity.
CONDRESTART
The entity was restarted. This condition is true after the entity is successfully restarted.
CONDANY
This condition type matches any condition type. It can be used to associate the same actions with one of many conditions.

The CONDATTACH, CONDDETACH and CONDRESTART conditions are triggered by the HAM, when entities attach, detach, or restart respectively. The CONDHBEATMISSEDHIGH and CONDHBEATMISSEDLOW conditions are triggered internally by the HAM when it detects the missed heartbeat conditions, as defined by the entities when they indicated their original intent to heartbeat.

CONDDEATH is triggered whenever an entity dies. CONDABNORMALDEATH is triggered only when an abnormal death takes place, but such an abnormal death also triggers a CONDDEATH condition.
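
The matching rules described above can be sketched in a few lines of C. This is an illustration only: the `sk_condtype_t` enumeration and the `sk_condition_triggers()` helper are hypothetical stand-ins invented for this sketch, not part of the HAM API, and the real condition-type constants come from <ha/ham.h>.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the condition types; the real constants
   (and their values) are defined in <ha/ham.h>. */
typedef enum {
    SK_DEATH,
    SK_ABNORMALDEATH,
    SK_DETACH,
    SK_ATTACH,
    SK_HBEATMISSEDHIGH,
    SK_HBEATMISSEDLOW,
    SK_RESTART,
    SK_ANY
} sk_condtype_t;

/* Return true if a condition registered with type `registered` should be
   triggered when an event of type `occurred` takes place. */
bool sk_condition_triggers(sk_condtype_t registered, sk_condtype_t occurred)
{
    if (registered == SK_ANY)
        return true;                /* matches any condition type */
    if (registered == occurred)
        return true;                /* exact match */
    /* An abnormal death also triggers plain death conditions,
       but a normal death does not trigger abnormal-death conditions. */
    return registered == SK_DEATH && occurred == SK_ABNORMALDEATH;
}
```

With this sketch, a condition registered as death fires for both normal and abnormal deaths, while one registered as abnormal death fires only when a core-generating death occurs.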

You use the detach condition to perform some actions whenever a monitored entity properly detaches from a HAM. After this point, the HAM will no longer monitor the entity. In effect, you can use this to “notify” interested clients when the HAM can no longer provide any more information about the detaching entity.

The restart condition is asserted and triggered by a HAM automatically if an entity dies and is restarted.

Condition flags

HCONDNOWAIT
Guarantees that there can be no “waitfor” statements in the list of actions in this condition. All conditions that are flagged HCONDNOWAIT are handled in a separate thread, and thus aren't delayed in any way by the nature of the actions in other conditions.
HCONDINDEPENDENT
If this flag is set, then all actions in this condition are executed in a separate thread. This lets you insert delays into a condition, without incurring any delays in other conditions.

If a condition is flagged with both HCONDINDEPENDENT and HCONDNOWAIT, then HCONDNOWAIT takes precedence, and all actions in this condition are executed in the same thread as all other conditions that are also flagged as HCONDNOWAIT. This is because all HCONDNOWAIT conditions are guaranteed to have minimal delays already.

If a condition is flagged with neither HCONDNOWAIT nor HCONDINDEPENDENT, it's treated as an HCONDOTHER condition, implying that it will be executed in FIFO order among all conditions that are true.

To sum up:

  1. Whenever a condition (e.g. CONDDEATH, CONDDETACH, etc.) occurs, all conditions flagged HCONDNOWAIT are executed in FIFO order in a single thread.
  2. All conditions flagged HCONDINDEPENDENT (but not HCONDNOWAIT) are executed each in a separate thread.
  3. All other conditions are executed in FIFO order in one single thread.

This limits the total number of threads to at most:

(number of HCONDINDEPENDENT conditions) + 2

That is, one thread for each HCONDINDEPENDENT condition, one for all the conditions flagged HCONDNOWAIT, and one for all other conditions.

In addition, within a condition, all actions are also executed in FIFO order. This is true irrespective of whether the conditions are HCONDNOWAIT or HCONDINDEPENDENT.
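
As a rough illustration of the threading rules summarized above, the following sketch classifies conditions by their flags and computes the resulting upper bound on threads. The flag values `SK_NOWAIT` and `SK_INDEPENDENT` and both helper functions are hypothetical, chosen only for this example; the real HCONDNOWAIT and HCONDINDEPENDENT values are defined in <ha/ham.h>, and the HAM's internal scheduling may differ in detail.

```c
#include <stddef.h>

/* Hypothetical flag bits for illustration only; the real values of
   HCONDNOWAIT and HCONDINDEPENDENT are defined in <ha/ham.h>. */
#define SK_NOWAIT       0x1u
#define SK_INDEPENDENT  0x2u

typedef enum {
    SK_THREAD_NOWAIT,   /* shared thread for all "nowait" conditions */
    SK_THREAD_OWN,      /* a separate thread for this condition      */
    SK_THREAD_OTHER     /* shared thread for all remaining conditions */
} sk_thread_t;

/* Decide which kind of thread handles a condition with the given flags.
   "Nowait" takes precedence when both flags are set. */
sk_thread_t sk_classify(unsigned flags)
{
    if (flags & SK_NOWAIT)
        return SK_THREAD_NOWAIT;
    if (flags & SK_INDEPENDENT)
        return SK_THREAD_OWN;
    return SK_THREAD_OTHER;
}

/* Upper bound on the number of threads used for n conditions:
   one per independent condition, plus the two shared threads. */
size_t sk_max_threads(const unsigned *flags, size_t n)
{
    size_t independent = 0;

    for (size_t i = 0; i < n; i++) {
        if (sk_classify(flags[i]) == SK_THREAD_OWN)
            independent++;
    }
    return independent + 2;
}
```

Note that a condition carrying both flags is counted with the shared "nowait" thread here, following the precedence rule, so it doesn't add to the independent count.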

Action functions

/* action operations             */
ham_action_t *ham_action_restart(ham_condition_t *chdl, const char *aname, 
              const char *path, unsigned flags);
ham_action_t *ham_action_execute(ham_condition_t *chdl, const char *aname, 
              const char *path, unsigned flags);
ham_action_t *ham_action_waitfor(ham_condition_t *chdl, const char *aname, 
              const char *path, int delay, unsigned flags);
ham_action_t *ham_action_notify_pulse(ham_condition_t *chdl, const char *aname, 
              int nd, int topid, int chid, int pulsecode, int value, 
              unsigned flags);
ham_action_t *ham_action_notify_signal(ham_condition_t *chdl, const char *aname, 
              int nd, pid_t topid, int signum, int code, int value, 
              unsigned flags);
ham_action_t *ham_action_notify_pulse_node(ham_condition_t *chdl, 
              const char *aname, const char *nodename, int topid, int chid, 
              int pulsecode, int value, unsigned flags);
ham_action_t *ham_action_notify_signal_node(ham_condition_t *chdl, 
              const char *aname, const char *nodename, pid_t topid, 
              int signum, int code, int value, unsigned flags);
ham_action_t *ham_action_heartbeat_healthy(ham_condition_t *chdl, 
              const char *aname, unsigned flags);
ham_action_t *ham_action_log(ham_condition_t *chdl, const char *aname, 
              const char *msg, unsigned attachprefix, int verbosity, 
              unsigned flags);

/* remove an action              */
int ham_action_remove(ham_action_t *ahdl, unsigned flags);

As mentioned earlier, a HAM currently supports several different types of action functions, but note that you can add your own action functions to suit your particular HA application.

ham_action_restart()
Provides a restart mechanism for the entity in the event that a death condition has occurred. This implies that the entity in question has terminated; the restart action will restart the entity and also keep track of the new pid that the entity will now be associated with.

Note: Restart actions can be associated only with death conditions, and across all conditions of type death there can be only a single restart action at any time. This ensures that the entity is restarted only if it terminates, and only once. (Conditions of type death include conditions of the types CONDDEATH and CONDABNORMALDEATH.)

ham_action_execute()
Executes an arbitrary command in the event that the condition is true. This could be any executable command line. When the condition in question is true, the list of actions is traversed and executed in sequence.

This executes a command line as specified in the parameters. The command line must contain the full path to the executable along with all parameters to be passed to it. The HAM in turn passes the command line on to a spawn command to create a new process that will execute the command.

You'll find execute actions useful when you need to set up a multistage recovery. For example, if fs-nfs2 dies and is restarted, the ham_action_execute() function lets you remount any directories that are required after fs-nfs2 is restarted.

You can have an execute action take place immediately by setting the HACTIONDONOW flag. Again, this is useful in startup situations when an entity is created in many stages.

Note that HACTIONDONOW is ignored for waitfor actions. So in order to insert delays into a sequence of actions flagged HACTIONDONOW, you'll need to insert the delays in the client program (between calls to ham_action*()).

ham_action_waitfor()
Given a sequence of actions in a condition that will execute in FIFO order, you can insert delays into the execution sequence by using ham_action_waitfor() (as long as the condition permits it — see the section Condition functions in this chapter). The delay specified is in multiples of 100 msecs.

The ham_action_waitfor() call takes a path argument, which can be used to wait for a specific name to appear in the namespace. If path is NULL, the waitfor lasts for exactly delay msecs. If path is specified, the waitfor lasts for delay msecs or until path appears in the namespace, whichever occurs earlier.

If a pathname is specified, the delays will be the closest integral multiple of 100 msecs, rounding up. A delay of 0 effectively disables the waitfor, making the pathname specification redundant.
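
The rounding rule for delays given together with a pathname can be sketched as follows. The helper name `sk_effective_delay_ms()` is hypothetical and exists only for this illustration; it simply rounds a millisecond value up to the next integral multiple of 100, and treats a non-positive delay as "no wait".

```c
/* Round a waitfor delay (in msecs) up to the next integral multiple
   of 100 msecs. A delay of 0 (or less) means no wait at all.
   This helper is a sketch and is not part of the HAM API. */
int sk_effective_delay_ms(int delay_ms)
{
    const int step = 100;

    if (delay_ms <= 0)
        return 0;
    return ((delay_ms + step - 1) / step) * step;
}
```

Under this sketch, a delay of 6532 msecs specified with a pathname behaves like 6600 msecs, while a delay of 2000 msecs is unchanged.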

ham_action_notify_pulse(), ham_action_notify_signal()
The ham_action_notify_pulse() function sends the appropriate pulse to the given nd/pid/chid.

The ham_action_notify_signal() function sends an appropriate realtime signal with a value to the pid that requests it.

Actions can persist across a restart if the entity is restarted. Similarly, conditions can also be set to persist (i.e. you can rearm them) after a restart of the entity. You can do this by ORing HREARMAFTERRESTART into the flags argument to either the ham_condition() call or to the appropriate action statement.

If a condition persists when an entity is restarted, each individual action is checked to see if it also persists. Actions that needn't be rearmed are performed once and removed. Any actions that fail are also removed, even if they're set to be rearmed.

If a condition isn't marked as rearmed, then all actions under it are automatically removed, since the actions are associated only with the condition and can't be retained if the condition no longer exists.

The persistence of conditions and actions across a restart depends on the restart of the entity itself. So if the entity isn't restarted (i.e. there's no ACTION_RESTART or the ACTION_RESTART fails for some reason), then the entity is removed, along with all conditions and actions associated with the entity as well.
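
The persistence rules above amount to a short decision chain. The sketch below encodes them for a single action; the `sk_fate_t` type and `sk_action_fate()` function are hypothetical names used only for this illustration, and the boolean parameters stand in for the HREARMAFTERRESTART flag on the condition and on the action.

```c
#include <stdbool.h>

typedef enum { SK_KEPT, SK_REMOVED } sk_fate_t;

/* Decide whether an action is still present after its condition has
   been handled. This is a sketch of the documented rules, not HAM code. */
sk_fate_t sk_action_fate(bool entity_restarted, bool condition_rearmed,
                         bool action_rearmed, bool action_failed)
{
    if (!entity_restarted)
        return SK_REMOVED;  /* entity, conditions, and actions all go */
    if (!condition_rearmed)
        return SK_REMOVED;  /* actions can't outlive their condition  */
    if (action_failed)
        return SK_REMOVED;  /* failed actions are removed regardless  */
    return action_rearmed ? SK_KEPT : SK_REMOVED;
}
```

An action therefore survives only when the entity was restarted, its condition is rearmed, the action itself is rearmed, and the action didn't fail.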

ham_action_notify_pulse_node()
This is the same as ham_action_notify_pulse() above, except that the recipient's node is specified by its fully qualified node name instead of by its node identifier (nd).
ham_action_notify_signal_node()
This is the same as ham_action_notify_signal() above, except that the recipient's node is specified by its fully qualified node name instead of by its node identifier (nd).

Action fail functions

/* action fail operations          */
int ham_action_fail_execute(ham_action_t *ahdl, const char *aname, 
    const char *path, unsigned flags);
int ham_action_fail_waitfor(ham_action_t *ahdl, const char *aname, 
    const char *path, int delay, unsigned flags);
int ham_action_fail_notify_pulse(ham_action_t *ahdl, const char *aname, 
    int nd, int topid, int chid, int pulsecode, int value, unsigned flags);
int ham_action_fail_notify_signal(ham_action_t *ahdl, const char *aname, 
    int nd, pid_t topid, int signum, int code, int value, unsigned flags);
int ham_action_fail_notify_pulse_node(ham_action_t *ahdl, const char *aname, 
    const char *nodename, int topid, int chid, int pulsecode, int value, 
    unsigned flags);
int ham_action_fail_notify_signal_node(ham_action_t *ahdl, const char *aname, 
    const char *nodename, pid_t topid, int signum, int code, int value, 
    unsigned flags);
int ham_action_fail_log(ham_action_t *ahdl, const char *aname, 
    const char *message, unsigned attachprefix, int verbosity, unsigned flags);

/* remove an action fail operation */
int ham_action_fail_remove(ham_action_t *ahdl, const char *aname, 
    unsigned flags);

These actions are used to associate a list of actions that will be executed when an action in a condition fails. These functions are similar to the corresponding action functions described in the previous section, the primary difference being the first parameter, which in the case of these functions is a handle to an action (as opposed to a handle to a condition).

Example to monitor inetd

The following code snippet shows how to begin monitoring the inetd process:



#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/netmgr.h>
#include <fcntl.h>
#include <ha/ham.h>

int main(int argc, char *argv[])
{
    ham_entity_t *ehdl;
    ham_condition_t *chdl;
    ham_action_t *ahdl;
    char *inetdpath;
    int inetdpid;

    /* Command line used to start (and restart) inetd */
    inetdpath = strdup("/usr/sbin/inetd -D");
    inetdpid = -1;

    ham_connect(0);

    /* Attach inetd as an entity on the local node */
    ehdl = ham_attach("inetd", ND_LOCAL_NODE, inetdpid, inetdpath, 0);
    if (ehdl != NULL) {
        /* Add a death condition that is rearmed after a restart */
        chdl = ham_condition(ehdl, CONDDEATH, "death", HREARMAFTERRESTART);
        if (chdl != NULL) {
            /* Restart inetd whenever the death condition is triggered */
            ahdl = ham_action_restart(chdl, "restart", inetdpath,
                                      HREARMAFTERRESTART);
            if (ahdl == NULL)
                printf("add action failed\n");
        }
        else
            printf("add condition failed\n");
    }
    else
        printf("add entity failed\n");

    ham_disconnect(0);
    exit(0);
}

Example to monitor fs-nfs2

The following code snippet shows how to begin monitoring the fs-nfs2 process:



#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/netmgr.h>
#include <fcntl.h>
#include <ha/ham.h>

int main(int argc, char *argv[])
{
    ham_entity_t *ehdl;
    ham_condition_t *chdl;
    ham_action_t *ahdl;
    char *fsnfspath;
    int fsnfs2pid;

    fsnfspath = strdup("/usr/sbin/fs-nfs2");
    fsnfs2pid = -1;

    ham_connect(0);
    ehdl = ham_attach("Fs-nfs2", ND_LOCAL_NODE, fsnfs2pid, fsnfspath, 0);
    if (ehdl != NULL) {
        chdl = ham_condition(ehdl, CONDDEATH, "Death", HREARMAFTERRESTART);
        if (chdl != NULL) {
            /* Stage 1: restart fs-nfs2 when it dies */
            ahdl = ham_action_restart(chdl, "Restart", fsnfspath,
                                      HREARMAFTERRESTART);
            if (ahdl == NULL)
                printf("add action failed\n");
            else {
                /* Stage 2: wait, then remount the first directory */
                ahdl = ham_action_waitfor(chdl, "Delay1", NULL, 2000,
                                          HREARMAFTERRESTART);
                if (ahdl == NULL)
                    printf("add action failed\n");
                ahdl = ham_action_execute(chdl, "MountDir1",
                           "/bin/mount -t nfs a.b.c.d:/dir1 /dir1",
                           HREARMAFTERRESTART|HACTIONDONOW);
                if (ahdl == NULL)
                    printf("add action failed\n");

                /* Stage 3: wait again, then remount the second directory */
                ahdl = ham_action_waitfor(chdl, "Delay2", NULL, 2000,
                                          HREARMAFTERRESTART);
                if (ahdl == NULL)
                    printf("add action failed\n");
                ahdl = ham_action_execute(chdl, "Mountdir2",
                           "/bin/mount -t nfs a.b.c.d:/dir2 /dir2",
                           HREARMAFTERRESTART|HACTIONDONOW);
                if (ahdl == NULL)
                    printf("add action failed\n");
            }
        }
        else
            printf("add condition failed\n");
    }
    else
        printf("add entity failed\n");

    ham_disconnect(0);
    exit(0);
}

Functions to operate on handles

/* Get/Free handles    */
ham_entity_t *ham_entity_handle(int nd, const char *ename, unsigned flags);
ham_condition_t *ham_condition_handle(int nd, const char *ename, 
                 const char *cname, unsigned flags);
ham_action_t *ham_action_handle(int nd, const char *ename, const char *cname, 
              const char *aname, unsigned flags);
ham_entity_t *ham_entity_handle_node(const char *nodename, const char *ename, 
              unsigned flags);
ham_condition_t *ham_condition_handle_node(const char * nodename, 
              const char *ename, const char *cname, unsigned flags);
ham_action_t *ham_action_handle_node(const char * nodename, const char *ename, 
              const char *cname, const char *aname, unsigned flags);
int ham_entity_handle_free(ham_entity_t *ehdl);
int ham_condition_handle_free(ham_condition_t *chdl);
int ham_action_handle_free(ham_action_t *ahdl);

You use the handle functions to get or free handles based on entity, condition, and action names. You can then use these handles later to add or remove conditions and actions. As with all the other functions, the *_node() variations refer to a HAM that isn't necessarily local, using a fully qualified node name (FQNN).

A client example

Here's an example of a client that obtains notifications via pulses and signals about significant events from a HAM. It registers a pulse-notification scheme in the event that inetd dies or detaches. It also registers a signal-notification mechanism for the death of fs-nfs2.

This example also demonstrates how the delayed notification occurs, and shows how to overcome this using an HCONDINDEPENDENT condition.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/neutrino.h>
#include <sys/iomsg.h>
#include <sys/netmgr.h>
#include <signal.h>
#include <ha/ham.h>

/* Parenthesized so the macros expand safely inside larger expressions */
#define PCODEINETDDEATH      (_PULSE_CODE_MINAVAIL+1)
#define PCODEINETDDETACH     (_PULSE_CODE_MINAVAIL+2)
#define PCODENFSDELAYED      (_PULSE_CODE_MINAVAIL+3)
#define PCODEINETDRESTART1   (_PULSE_CODE_MINAVAIL+4)
#define PCODEINETDRESTART2   (_PULSE_CODE_MINAVAIL+5)

#define MYSIG (SIGRTMIN+1)

int fsnfs_value;

/* Signal handler to handle the death notify of fs-nfs2 */
void MySigHandler(int signo, siginfo_t *info, void *extra)
{
  printf("Received signal %d, with code = %d, value %d\n",
        signo, info->si_code, info->si_value.sival_int);
  if (info->si_value.sival_int == fsnfs_value)
    printf("FS-nfs2 died, this is the notify signal\n");
  return;
}

int main(int argc, char *argv[])
{
  int chid, coid, rcvid;
  struct _pulse pulse;
  pid_t pid;
  int status;
  int value;
  ham_entity_t *ehdl;
  ham_condition_t *chdl;
  ham_action_t *ahdl;
  struct sigaction sa;
  int scode;
  int svalue;

  /* we need a channel to receive the pulse notification on */
  chid = ChannelCreate( 0 ); 

  /* and we need a connection to that channel for the pulse to be
     delivered on */
  coid = ConnectAttach( 0, 0, chid, _NTO_SIDE_CHANNEL, 0 );

  /* set up the pid and value used in the pulse notifications */
  pid = getpid();
  value = 13;
  ham_connect(0);
  /* Assumes there is already an entity by the name "inetd" */
  chdl = ham_condition_handle(ND_LOCAL_NODE, "inetd","death",0);
  ahdl = ham_action_notify_pulse(chdl, "notifypulsedeath",ND_LOCAL_NODE, pid, 
             chid, PCODEINETDDEATH, value, HREARMAFTERRESTART);

  ham_action_handle_free(ahdl);
  ham_condition_handle_free(chdl);

  ehdl = ham_entity_handle(ND_LOCAL_NODE, "inetd", 0);
  chdl = ham_condition(ehdl, CONDDETACH, "detach", HREARMAFTERRESTART);
  ahdl = ham_action_notify_pulse(chdl, "notifypulsedetach",ND_LOCAL_NODE, pid, 
           chid, PCODEINETDDETACH, value, HREARMAFTERRESTART);
  ham_action_handle_free(ahdl);
  ham_condition_handle_free(chdl);
  ham_entity_handle_free(ehdl);

  fsnfs_value = 18; /* value we expect when fs-nfs dies */
  scode = 0;
  svalue = fsnfs_value; 
  sa.sa_sigaction = MySigHandler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_SIGINFO;
  sigaction(MYSIG, &sa, NULL);

  /*
   Assumes there is an entity by the name "Fs-nfs2".
   We use "Fs-nfs2" to symbolically represent the entity
   fs-nfs2. Any name can be used to represent the
   entity, but it's best to use a readable and meaningful name.
  */
  ehdl = ham_entity_handle(ND_LOCAL_NODE, "Fs-nfs2", 0);

  /*
   Add a new condition, which will be an "independent" condition.
   This means that notifications/actions inside this condition
   are not affected by "waitfor" delays in other action
   sequence threads
  */
  chdl = ham_condition(ehdl,CONDDEATH, "DeathSep",
                    HCONDINDEPENDENT|HREARMAFTERRESTART);
  ahdl = ham_action_notify_signal(chdl, "notifysignaldeath",ND_LOCAL_NODE, 
                    pid, MYSIG, scode, svalue, HREARMAFTERRESTART);
  ham_action_handle_free(ahdl);
  ham_condition_handle_free(chdl);
  ham_entity_handle_free(ehdl);

  chdl = ham_condition_handle(ND_LOCAL_NODE, "Fs-nfs2","Death",0);
  /*
   This action is added to a condition that does not
   have an HCONDNOWAIT. Since we are unaware what the condition
   already contains, we might end up getting a delayed notification
   since the action sequence might have "arbitrary" delays and
   "waits" in it.
  */
  ahdl = ham_action_notify_pulse(chdl, "delayednfsdeathpulse", ND_LOCAL_NODE, 
             pid, chid, PCODENFSDELAYED, value, HREARMAFTERRESTART);

  ham_action_handle_free(ahdl);
  ham_condition_handle_free(chdl);

  ehdl = ham_entity_handle(ND_LOCAL_NODE, "inetd", 0);

  /* We force this condition to be independent of all others. */
  chdl = ham_condition(ehdl, CONDRESTART, "restart", 
                             HREARMAFTERRESTART|HCONDINDEPENDENT);
  ahdl = ham_action_notify_pulse(chdl, "notifyrestart_imm", ND_LOCAL_NODE, 
                    pid, chid, PCODEINETDRESTART1, value, HREARMAFTERRESTART);
  ham_action_handle_free(ahdl);
  ahdl = ham_action_waitfor(chdl, "delay",NULL,6532, HREARMAFTERRESTART); 
  ham_action_handle_free(ahdl);
  ahdl = ham_action_notify_pulse(chdl, "notifyrestart_delayed", ND_LOCAL_NODE, 
                    pid, chid, PCODEINETDRESTART2, value, HREARMAFTERRESTART);

  ham_action_handle_free(ahdl);
  ham_condition_handle_free(chdl);
  ham_entity_handle_free(ehdl);

  while (1) {
    rcvid = MsgReceivePulse( chid, &pulse, sizeof( pulse ), NULL );
    if (rcvid < 0) {
      if (errno != EINTR) {
        exit(-1);
      }
    }
    else {
      switch (pulse.code) {
        case PCODEINETDDEATH:
          printf("Inetd Death Pulse\n");
          break;
        case PCODENFSDELAYED:
          printf("Fs-nfs2 died: this is the possibly delayed pulse\n");
          break;
        case PCODEINETDDETACH:
          printf("Inetd detached, so quitting\n");
          goto the_end;
        case PCODEINETDRESTART1:
          printf("Inetd Restart Pulse: Immediate\n");
          break;
        case PCODEINETDRESTART2:
          printf("Inetd Restart Pulse: Delayed\n");
          break;
      }
    }
  }
  /*
   At this point we are no longer waiting for the
   information about inetd, since we know that it
   has exited.
   We will still continue to obtain information about the
   death of fs-nfs2, since we did not remove those actions.
   If we exit now, the next time those actions are executed
   they will fail (notifications fail if the receiver does not
   exist anymore), and they will automatically get removed and
   cleaned up.
  */
the_end:
  ham_disconnect(0);
  exit(0);
}


Note: The HAM API has certain restrictions:
  • The names of entities, conditions, and actions (ename, cname, and aname) must not contain a / character.
  • All names are subject to the length restriction imposed by _POSIX_PATH_MAX (as defined in <limits.h>). Since the names are manifested inside the namespace, the effective length of a name is the maximum length of the name as a path component. In other words, the combined length of an entity/condition/action name — including the /proc/ham prefix — must not exceed _POSIX_PATH_MAX.
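
A client could check names against these restrictions before calling into the API. The following is a minimal sketch under stated assumptions: the helper `sk_names_ok()` is hypothetical, and it assumes the names appear in the namespace as /proc/ham/ename/cname/aname with one separator per component, which this chapter doesn't spell out. The fallback value for _POSIX_PATH_MAX is likewise an assumption, used only if <limits.h> doesn't expose the constant.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#ifndef _POSIX_PATH_MAX
#define _POSIX_PATH_MAX 256   /* assumed fallback; normally from <limits.h> */
#endif

#define SK_HAM_PREFIX "/proc/ham"

/* Check entity, condition, and action names against the documented
   restrictions. cname and aname may be NULL when not applicable.
   The path layout assumed here is an illustration only. */
bool sk_names_ok(const char *ename, const char *cname, const char *aname)
{
    const char *parts[] = { ename, cname, aname };
    size_t total = strlen(SK_HAM_PREFIX);

    if (ename == NULL)
        return false;

    for (size_t i = 0; i < sizeof(parts) / sizeof(parts[0]); i++) {
        if (parts[i] == NULL)
            continue;
        if (strchr(parts[i], '/') != NULL)
            return false;               /* names must not contain '/' */
        total += 1 + strlen(parts[i]);  /* separator plus the name    */
    }
    return total <= (size_t)_POSIX_PATH_MAX;
}
```

A check like this only catches problems early; the HAM API functions remain the authority on whether a given name is accepted.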

Starting and stopping a HAM

You start a HAM by running the ham utility at the command line:

ham

The ham utility has these command-line options:

-?|h
Display usage message
-d
Disable internal verbosity.
-f
Log verbose output to a file (default is stderr).
-t none|relative|absolute|shortabs
Specify the timestamping method. The default is relative.
-v
Set verbosity level — extra -v's increase verbosity.
-Vn
Set verbosity level — use a number to specify the level (e.g. -V3).

When a HAM starts, it also starts the Guardian process for itself.


Note: You must start ham with its full path or with the PATH variable set to include the path to ham as a component.

You must be root in order to start or stop a HAM.


Stopping a HAM

To stop the HAM, you must use either the ham_stop() function or the hamctrl utility. These are the only correct (and the only guaranteed) ways to stop the HAM.

The ham_stop() function or the hamctrl utility instructs a HAM to terminate. The HAM in turn first instructs the Guardian to terminate, and then terminates itself. To stop the HAM from the command line, use the hamctrl utility:

hamctrl -stop

To stop a remote HAM, use the -node option to the hamctrl utility:

hamctrl -node "nodename" -stop

To stop the HAM programmatically using the API, use the following functions:

/* terminate                     */
int ham_stop(void);
int ham_stop_nd(int nd);
int ham_stop_node(const char *nodename);

Control functions

The following set of functions have been provided to permit control of entities, conditions, and actions that are currently configured.

/* control operations                           */
int ham_entity_control(ham_entity_t *ehdl, int command, unsigned flags);
int ham_condition_control(ham_condition_t *chdl, int command, unsigned flags);
int ham_action_control(ham_action_t *ahdl, int command, unsigned flags);

The permitted operations (commands) are:

HENABLE                 /* enable item          */
HDISABLE                /* disable item         */
HADDFLAGS               /* add flag             */
HREMOVEFLAGS            /* remove flag          */
HSETFLAGS               /* set flag to specific */
HGETFLAGS               /* get flag             */

The “enable” and “disable” commands can be used to temporarily unhide/hide an entity, condition, or action.

An entity that is hidden isn't removed, but it won't be monitored for any conditions. Similarly, a condition that is hidden will never be triggered, and actions that are hidden won't be executed. By default, the enable and disable operations don't operate recursively (although disabling an entity prevents the triggering of any conditions below it, and disabling a condition prevents the execution of the actions in it).
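
The non-recursive behavior can be pictured with a small model. In this sketch, which uses a hypothetical `sk_chain_t` structure and `sk_action_will_run()` helper rather than anything from the HAM API, each item keeps its own enabled state, and an action runs only when the action, its condition, and its entity are all enabled.

```c
#include <stdbool.h>

/* Enabled state of one entity/condition/action chain. Each level keeps
   its own state; disabling a parent does not modify its children. */
typedef struct {
    bool entity_enabled;
    bool condition_enabled;
    bool action_enabled;
} sk_chain_t;

/* An action is executed only if every level above it is enabled too. */
bool sk_action_will_run(const sk_chain_t *c)
{
    return c->entity_enabled && c->condition_enabled && c->action_enabled;
}
```

Because the child states are left untouched in this model, re-enabling a disabled entity restores exactly the behavior that was configured before it was disabled.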

To understand the finer distinctions of the recursive operation of the control functions, refer to the API descriptions for ham_entity_control(), ham_condition_control(), and ham_action_control().

The “addflags”, “removeflags”, “setflags”, and “getflags” commands can be used to obtain or modify the flags associated with any of the entities, conditions, or actions. For more details, refer to the API descriptions of the ham_*_control() functions.

Verbosity control

You can use the ham_verbose() function to programmatically get or set (increase or decrease) the verbosity:

int ham_verbose(const char *nodename, int op, int value);

You can also use the hamctrl utility to interactively control the verbosity:

hamctrl -verbose /* increase    verbosity */
hamctrl +verbose /* decrease    verbosity */
hamctrl =verbose /* get current verbosity */

To operate on a remote HAM, use the hamctrl utility with the -node option:

hamctrl -node "nodename" -verbose /* increase    verbosity */
hamctrl -node "nodename" +verbose /* decrease    verbosity */
hamctrl -node "nodename" =verbose /* get current verbosity */

where nodename is a valid name that represents a remote (or local) node.

Publishing autonomously detected conditions

Entities or other components on the system can publish conditions that they deem interesting to a HAM, and the HAM can in turn deliver these to other components in the system that have expressed interest and subscribed to them. This allows arbitrary system components that can detect error conditions, or potentially erroneous conditions, to report them to the HAM, which in turn can notify other components to start corrective procedures or take preventive action.

There are currently two different ways of publishing information to a HAM. Both of these are designed to be general enough to permit clients to build more complex information exchange mechanisms using them.

Publish state transitions

An entity can report its state transitions to a HAM. The HAM maintains the current state of every entity (as reported by the entity). The HAM doesn't interpret the meaning of the state value itself, nor does it try to validate the state transitions, but it can generate events based on transitions from one state to another.

Components can publish transitions that they want the external world to know about. These states need not represent a specific state that the application uses internally for decision making.

The following function can be used to notify a HAM of a state transition. Since the HAM is only interested in the next state in the transition, this is the only information that is transmitted to the HAM. The HAM then triggers a condition state change event internally, which other components can subscribe to, using the ham_condition_state() API call described below.

/* report a state transition */
int ham_entity_condition_state(ham_entity_t *ehdl, unsigned tostate, 
    unsigned flags);

Publish other conditions

In addition to the above, components on the system can also publish autonomously detected conditions by using the ham_entity_condition_raise() API call. The component raising the condition can also specify a type, class, and severity of its choice, to allow subscribers further granularity in filtering out specific conditions to subscribe to. This call results in the HAM triggering a condition-raise event internally, which other components can subscribe to using the ham_condition_raise() API call described below.

/* publish autonomously detected condition */
int ham_entity_condition_raise(ham_entity_t *ehdl, unsigned rtype, 
    unsigned rclass, unsigned severity, unsigned flags);

Subscribing to autonomously published conditions

Subscribers can express their interest in events published by other components by using the ham_condition_state() and ham_condition_raise() API calls.

These calls are similar to the ham_condition() API call and return a handle to a condition, but they allow the subscriber to customize which of several possible published conditions it is interested in.

Trigger based on state transitions

When an entity publishes a state transition, a state transition condition is raised for that entity, based on the two states involved in the transition (the from state and the to state). Subscribers indicate which states they are interested in by specifying values for the fromstate and tostate parameters in the API call.

For more details, refer to the API reference documentation for ham_condition_state().

ham_condition_t *ham_condition_state(ham_entity_t *ehdl, const char *cname, 
                 unsigned fromstate, unsigned tostate, unsigned flags);
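
To make the publish/subscribe relationship concrete, here is a small model of how a reported "to" state can be combined with the previously stored state to form a transition, and how a subscription might be compared against it. All names in this sketch (`sk_entity_state_t`, `sk_transition_t`, `sk_report_state()`, `sk_subscription_matches()`) are hypothetical, and exact equality on both states is an assumption made for simplicity; refer to the ham_condition_state() API reference for how its flags affect matching.

```c
#include <stdbool.h>

/* Last state reported by an entity, as tracked by this sketch. */
typedef struct {
    unsigned current;
} sk_entity_state_t;

/* A transition, or a subscription to one, as a from/to pair. */
typedef struct {
    unsigned fromstate;
    unsigned tostate;
} sk_transition_t;

/* Record a newly reported state. Only the "to" state is supplied;
   the "from" state is whatever was stored before. */
sk_transition_t sk_report_state(sk_entity_state_t *e, unsigned tostate)
{
    sk_transition_t t = { e->current, tostate };

    e->current = tostate;
    return t;
}

/* Exact-match comparison of a subscription against a transition
   (a simplifying assumption for this sketch). */
bool sk_subscription_matches(const sk_transition_t *sub,
                             const sk_transition_t *t)
{
    return sub->fromstate == t->fromstate && sub->tostate == t->tostate;
}
```

The point of the model is that the publisher never sends the "from" state: it is derived from the state that was tracked before the report arrived.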

Trigger based on specific published condition

Subscribers can express interest in conditions raised by entities by using ham_condition_raise(), indicating as parameters to the call what sort of conditions they are interested in.

For more information, refer to the API documentation for ham_condition_raise().

ham_condition_t *ham_condition_raise(ham_entity_t *ehdl, const char *cname, 
                 unsigned rtype, unsigned rclass, unsigned rseverity, 
                 unsigned flags);