Using the Client Recovery Library

Introduction

The client recovery library provides drop-in enhancements for many standard libc I/O operations. The HA library's cover functions automatically recover failed connections whenever recovery is possible in a high availability (HA) scenario.

The goal is to provide an API for high availability I/O that gives clients transparent recovery, especially in an environment where the servers must also be highly available. The recovery is configurable to suit specific client needs; this chapter provides examples of ways to develop more complicated recovery mechanisms.

The main principle of the HA library is to provide drop-in replacements for all the “transmission” functions (e.g. MsgSend*()). The API lets a client choose specific connections that it would like to make highly available — all other connections will operate as ordinary connections.

Normally, when a server that the client is talking to fails, or if there's a transient network fault, the MsgSend*() functions return an error indicating that the connection ID (or file descriptor) is stale or invalid (EBADF).

In an HA-aware scenario, these transient faults are often recovered from almost immediately (on the server end), making the services available again. Unfortunately, clients using standard I/O might not be able to take full advantage of this unless they provide mechanisms to recover from these errors and then retransmit the data, which often involves a nontrivial rework of client programs.

By performing recovery inside the HA library itself, clients can automatically take advantage of HA-aware services that restart themselves or are restarted automatically, or of services that are provided in a transparent clustered or redundant way.

Since recovery itself is a connection-specific task, we allow clients to provide recovery mechanisms that will be used to restore connections when they fail. Irrecoverable errors are propagated back reliably so that any client that doesn't wish to recover will get the I/O library semantics that it expects.

The recovery mechanism can be anything ranging from a simple reopen of the connection to a more complex scenario that includes the retransmission/renegotiation of connection-specific information.

MsgSend*() functions

Normally, the MsgSend*() functions return EBADF or ESRCH when a connection is stale or closed on the server end (e.g. because the server dies). In many cases, the servers themselves return (e.g. they're restarted) and begin to offer the services properly almost immediately (in an HA scenario). Rather than merely terminate the message transmission with an error, in some cases it might be possible to perform recovery and continue with the message transmission.

The HA library functions that “cover” all the MsgSend*() varieties are designed to do exactly this. When a specific invocation of one of the MsgSend*() functions fails, a client-provided recovery function is called. This recovery function can attempt to reestablish the connection and return control to the HA library's MsgSend*() function. As long as the connection ID returned by the recovery function is the same as the old connection ID (which in many cases is easy to ensure via close/open/dup2() sequences), then the MsgSend*() functions can now attempt to retransmit the data.
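
The close/open/dup2() sequence mentioned above can be sketched in portable POSIX C. This is only an illustration of how a recovery function might end up with the same descriptor number it started with; reopen_same_fd() is a hypothetical helper, not part of the HA library, and the examples later in this chapter use ha_reopen() for reopening instead.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not an HA library function): reopen `path` and
 * make the new descriptor take over the number `oldfd`.
 * Returns oldfd on success, or -1 with errno set on failure.
 */
int reopen_same_fd(int oldfd, const char *path, int oflag)
{
    int newfd = open(path, oflag);
    if (newfd < 0)
        return -1;

    /* If oldfd was already closed, open() may hand back the same number. */
    if (newfd == oldfd)
        return oldfd;

    /* dup2() closes oldfd if it is still open, then makes oldfd refer
       to the same open file as newfd. */
    if (dup2(newfd, oldfd) < 0) {
        int saved = errno;
        close(newfd);
        errno = saved;
        return -1;
    }

    close(newfd);   /* drop the temporary descriptor */
    return oldfd;
}
```

Because the caller gets the old number back, any code that stored the descriptor before the failure can keep using it unchanged.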

If at any point the errors returned by MsgSend*() are anything other than EBADF or ESRCH, these errors are propagated back to the client. Note also that if the connection ID isn't an HA-aware connection ID, or if the client hasn't provided a recovery function or that function can't re-obtain the same connection ID, then the error is allowed to propagate back to the client to handle in whatever way it likes.
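
The retry rules described above can be summarized in a short sketch. It is not the HA library's implementation: the types send_fn and recover_fn, the function send_with_recovery(), the max_recoveries bound, and the simulated demo_* connection are all assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical types for this sketch; not the HA library's own. */
typedef int (*send_fn)(int id, const void *msg, size_t len);
typedef int (*recover_fn)(int oldid, void *hdl);

/*
 * Send with recovery. Retries only when the send fails with EBADF or
 * ESRCH, a recovery function is present, and that function hands back
 * the same connection ID. Every other failure propagates to the caller.
 * max_recoveries bounds the loop (an assumption of this sketch).
 */
int send_with_recovery(int id, const void *msg, size_t len,
                       send_fn do_send, recover_fn recover, void *hdl,
                       int max_recoveries)
{
    int recovered = 0;

    for (;;) {
        int rc = do_send(id, msg, len);
        if (rc >= 0)
            return rc;                    /* success */
        if (errno != EBADF && errno != ESRCH)
            return -1;                    /* not a stale-connection error */
        if (recover == NULL || recovered >= max_recoveries)
            return -1;                    /* not HA-aware, or gave up */

        int saved = errno;
        if (recover(id, hdl) != id) {
            errno = saved;                /* report the original error */
            return -1;
        }
        recovered++;
    }
}

/* Simulated connection: stale until demo_recover() runs. */
int demo_stale = 1;
int demo_recoveries = 0;

int demo_send(int id, const void *msg, size_t len)
{
    (void)id;
    (void)msg;
    if (demo_stale) {
        errno = EBADF;
        return -1;
    }
    return (int)len;
}

int demo_recover(int oldid, void *hdl)
{
    (void)hdl;
    demo_stale = 0;
    demo_recoveries++;
    return oldid;                         /* same ID, so a retry is allowed */
}
```

With the simulated connection, a first call to send_with_recovery() using demo_send and demo_recover fails once with EBADF, runs one recovery, and then succeeds on the retry. Passing NULL as the recovery function leaves the EBADF error for the caller to handle.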

Clients can change their recovery functions. And since clients can also pass around “recovery/connection” information (which in turn is passed by the HA library to the recovery function), clients can construct complex recovery mechanisms that can be modified dynamically.

The client-side recovery library lets clients reconstruct the state required to continue the message transmission after reconnecting to either the same server or to a different server. The client is responsible for determining what constitutes the state that must be reconstructed and for performing this appropriately while the recovery function is called.

Other covers and convenience functions

In addition to the cover functions for the standard MsgSend*() calls, the HA library provides clients with three “HA-awareness” functions that let you designate a connection as HA-aware, remove that designation from an already HA-aware connection, or control how an HA-aware connection operates:

HA-awareness functions

ha_attach()
Associate a recovery function with a connection to make it HA-aware.
ha_detach()
Remove a previously specified association between a recovery function and a connection. This makes the connection no longer HA-aware.
ha_connection_ctrl()
Control the operation of an HA-aware connection.
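
To make the idea of associating a recovery function and a handle with a connection more concrete, here is a toy registry. It is a sketch only and says nothing about how the HA library stores its associations: every toy_* name is hypothetical, and the rule that a second attach fails unless a replace flag is passed is an assumption, loosely modelled on the HAREPLACERECOVERYFN flag used in the example later in this chapter.

```c
#include <stddef.h>

#define TOY_MAX_CONN 16

typedef int (*toy_recover_fn)(int oldid, void *hdl);

struct toy_entry {
    int            used;   /* nonzero while the connection is "HA-aware" */
    toy_recover_fn fn;     /* recovery function to call on failure */
    void          *hdl;    /* client data passed back to fn */
};

static struct toy_entry toy_table[TOY_MAX_CONN];

static int toy_valid(int id)
{
    return id >= 0 && id < TOY_MAX_CONN;
}

/* Associate fn/hdl with id; an existing entry is kept unless replace is set. */
int toy_attach(int id, toy_recover_fn fn, void *hdl, int replace)
{
    if (!toy_valid(id) || fn == NULL)
        return -1;
    if (toy_table[id].used && !replace)
        return -1;
    toy_table[id].used = 1;
    toy_table[id].fn = fn;
    toy_table[id].hdl = hdl;
    return 0;
}

/* Remove the association; the connection becomes an ordinary one again. */
int toy_detach(int id)
{
    if (!toy_valid(id) || !toy_table[id].used)
        return -1;
    toy_table[id].used = 0;
    return 0;
}

/* Return the recovery function for id, or NULL if id is not attached. */
toy_recover_fn toy_lookup(int id, void **hdl_out)
{
    if (!toy_valid(id) || !toy_table[id].used)
        return NULL;
    if (hdl_out != NULL)
        *hdl_out = toy_table[id].hdl;
    return toy_table[id].fn;
}

/* Two distinguishable recovery functions; hdl points to an int counter. */
int toy_recover_a(int oldid, void *hdl) { *(int *)hdl += 1;  return oldid; }
int toy_recover_b(int oldid, void *hdl) { *(int *)hdl += 10; return oldid; }
```

Because the handle travels with the function, replacing one or both at run time changes how the next failure on that connection is handled without touching the code that performs the I/O.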

I/O covers

The HA library also provides the following cover functions whose behavior is essentially the same as the original functions being covered, but augmented slightly where the connections are also HA-aware:

ha_open(), ha_open64()
Open a connection and attach it to the HA lib. These functions, in addition to calling the underlying open calls also make the connections HA-aware by calling ha_attach() automatically. As a result, using these calls is equivalent to calling open() or open64() and following that with a call to ha_attach().
ha_creat(), ha_creat64()
Create a connection and attach it to the HA lib. These functions, in addition to calling the underlying creat calls also make the connections HA-aware by calling ha_attach() automatically. As a result, using these calls is equivalent to calling creat() or creat64() and following that with a call to ha_attach().
ha_ConnectAttach(), ha_ConnectAttach_r()
Create a connection using ConnectAttach() and attach it to the HA lib. These functions, in addition to calling the underlying ConnectAttach calls also make the connections HA-aware by calling ha_attach() automatically. As a result, using these calls is equivalent to calling ConnectAttach() or ConnectAttach_r() and following that with a call to ha_attach().
ha_ConnectDetach(), ha_ConnectDetach_r()
Detach an attached fd, then close the connection using ConnectDetach(). These functions, in addition to calling the underlying ConnectDetach calls, also remove the connection's HA-awareness by calling ha_detach() automatically. As a result, using these calls is equivalent to calling ha_detach() and following that with a call to ConnectDetach() or ConnectDetach_r().
ha_fopen()
Open a file stream and attach it to the HA lib. This function, in addition to calling the underlying fopen() call also makes connections HA-aware by calling ha_attach() automatically. As a result, using this call is equivalent to calling fopen() and following that with a call to ha_attach().
ha_fclose()
Detach an attached HA fd for a file stream, then close it. This function, in addition to calling the underlying fclose() call, also removes the connection's HA-awareness by calling ha_detach() automatically. As a result, using this call is equivalent to calling ha_detach() and following that with a call to fclose().
ha_close()
Detach an attached HA fd, then close it. This function, in addition to calling the underlying close() call, also removes the connection's HA-awareness by calling ha_detach() automatically. As a result, using this call is equivalent to calling ha_detach() and following that with a call to close().
ha_dup()
Duplicate an HA connection. This function, in addition to calling the underlying dup() call also makes connections HA-aware by calling ha_attach() automatically. As a result, using this call is equivalent to calling dup() and following that with a call to ha_attach().

Convenience functions

In addition to the covers, the library also provides these two convenience functions that reopen connections for recovery:

ha_reopen()
Reopen a connection while performing recovery.
ha_ReConnectAttach()
Reopen a connection while performing recovery.

Note: For descriptions of all of the HA library functions, see the Client Recovery Library Reference chapter in this guide.

A simple example

Here's a simple example of a client that has a connection open to a server and tries to read data from it. After reading from the descriptor, the client goes off to do something else (possibly causing a delay), and then returns to read again.

During this window of delay, the server might have died and returned, in which case the initial connection to the server (that has died) is now stale.

But since the connection has been made HA-aware, and a recovery function has been associated with it, the connection is able to reestablish itself.



#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <ha/cover.h>

#define SERVER "/path/to/server"

typedef struct handle {
    int nr;
} Handle ;

int recover_conn2(int oldfd, void *hdl)
{
    int newfd;
    Handle *thdl;
    thdl = (Handle *)hdl;
    printf("recovering for fd %d inside function 2\n",oldfd);
    /* re-open the connection */
    newfd = ha_reopen(oldfd, SERVER, O_RDONLY);
    /* perform any other kind of state reconstruction */
    (thdl->nr)++;
    return(newfd);
}

int recover_conn(int oldfd, void *hdl)
{
    int newfd;
    Handle *thdl;
    thdl = (Handle *)hdl;
    printf("recovering for fd %d inside function\n",oldfd);
    /* re-open the connection */
    newfd = ha_reopen(oldfd, SERVER, O_RDONLY);
    /* perform any other kind of state reconstruction */
    (thdl->nr)++;
    return(newfd);
}

int main(int argc, char *argv[])
{
    int status;
    int fd;
    int fd2;
    int fd3;
    Handle hdl;
    char buf[80];

    hdl.nr = 0;
    /* open a connection and make it HA aware */
    fd = ha_open(SERVER, O_RDONLY,recover_conn, (void *)&hdl, 0);
    if (fd < 0) {
        printf("could not open %s\n", SERVER);
        exit(EXIT_FAILURE);
    }

    printf("fd = %d\n",fd);
  /* Dup the FD. the copy will also be HA aware */
    fd2 = ha_dup(fd);

    printf("dup-ped fd2 = %d\n",fd2);
    printf("before sleeping first time\n");

  /*
   Go to sleep... 
   Possibly the SERVER might die and return in this little
   time period.
  */
    sleep(15); 

  /*
   reading from dup-ped fd
   this should work just normally if SERVER has not died.
   But if the SERVER has died and returned, the
   initial read will fail, the recovery function
   will be called to re-establish the connection,
   and the read call will then be re-issued,
   which should succeed now.
  */

    printf("trying to read from %s using fd %d\n",SERVER, fd2);
    status = read(fd2,buf,30);
    if (status < 0)
        printf("error: %s\n",strerror(errno));

  /*
   fd and fd2 are dup-ped fd's
   changing the recovery function for fd2
   From this point forwards, the recovery (if at all)
   will be performed using "recover_conn2" as the recovery
   function.
  */

    status = ha_attach(fd2, recover_conn2, (void *)&hdl, HAREPLACERECOVERYFN);

    ha_close(fd); /* close fd */

  /* open a new connection */
    fd = open(SERVER, O_RDONLY);
    printf("New fd = %d\n",fd);

  /* make it HA aware. */
    status = ha_attach(fd, recover_conn, (void *)&hdl, 0);

    printf("before sleeping again\n");

  /* copy it again */
    fd3 = ha_dup(fd);

  /* go to sleep...possibly another opportunity for the server to fail. */
    sleep(15);

  /*
   get rid of one of the fd's
   we still have a copy in fd3, which still has the
   recovery function associated with it.
  */
    ha_close(fd);

    printf("trying to read from %s using fd %d\n",SERVER, fd3);

  /*
   if it fails, the call will generate a call back to the
   recovery function "recover_conn"
  */
    status = read(fd3,buf,30); 
    if (status < 0)
        printf("error: %s\n",strerror(errno));

    printf("trying to read from %s once more using fd %d\n",SERVER, fd2);

  /*
   if this call fails, recovery will be via the 
   second function "recover_conn2", since we replaced
   the function for fd2.
  */
    status = read(fd2,buf,30); 
    if (status < 0)
        printf("error: %s\n",strerror(errno));

  /* close the fd2, and detach it from the HA lib */
    ha_close(fd2);

  /*
   finally print out our local statistics that we have been
   retaining along the way.
  */
    printf("total recoveries, %d\n",hdl.nr);
    exit(0);
}

State-reconstruction example

In the following example, in addition to reopening the connection to the server, the client also reconstructs the state of the connection by seeking to the current file (connection) offset.

This example also shows how the client can maintain state information that can be used by the recovery functions to return to a previously check-pointed state before the failure, so that the message transmission can continue properly.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <ha/cover.h>

#define REMOTEFILE "/path/to/remote/file"

typedef struct handle {
    int nr;
    int curr_offset;
} Handle ;

int recover_conn(int oldfd, void *hdl)
{
    int newfd;
    Handle *thdl;
    thdl = (Handle *)hdl;
    printf("recovering for fd %d inside function\n",oldfd);
    /* re-open the file */
    newfd = ha_reopen(oldfd, REMOTEFILE, O_RDONLY);
    /* re-construct state, by seeking to the checkpointed offset */
    if (newfd >= 0) {
        if (lseek(newfd, thdl->curr_offset, SEEK_SET) == -1)
            printf("error: %s\n",strerror(errno));
    }
    (thdl->nr)++;
    return(newfd);
}

int main(int argc, char *argv[])
{
    int status;
    int fd;
    int fd3;
    Handle hdl;
    char buf[80];
    int i;

    hdl.nr = 0;
    hdl.curr_offset = 0;
    /* open a connection and make it HA aware */
    fd = ha_open(REMOTEFILE, O_RDONLY, recover_conn,
                 (void *)&hdl, 0);
    if (fd < 0) {
        printf("could not open file\n");
        exit(EXIT_FAILURE);
    }
    printf("trying to read from file using fd %d\n",fd);
    status = read(fd,buf,15);
    if (status < 0)
        printf("error: %s\n",strerror(errno));
    else {
        for (i=0; i < status; i++)
            printf("%c",buf[i]);
        printf("\n");
        /*
         update state of the connection
         this is a kind of checkpointing method.
         we remember state, so that the recovery functions
         have an easier time.
        */
        hdl.curr_offset += status;
    }

    /* Dup the fd. The copy is also HA aware. */
    fd3 = ha_dup(fd);

    /*
     sleep for some arbitrary period
     this could be some other computation
     or some other blocking operation, which gives
     a window within which the server might fail
    */
    printf("before sleeping\n");
    sleep(18);
    printf("after sleeping\n");

    /* read again from the original fd */
    printf("trying to read from file using fd %d\n",fd);

    /*
     if the read initially fails
     it will recover, re-open and seek to the right spot
    */
    status = read(fd,buf,15);
    if (status < 0)
        printf("error: %s\n",strerror(errno));
    else {
        for (i=0; i < status; i++)
            printf("%c",buf[i]);
        printf("\n");
        hdl.curr_offset += status;
    }

    printf("trying to read from file using fd %d\n",fd3);
    /*
     try it again, this time using the dup-ped copy.
     recovery will again happen upon failure,
     automatically re-connecting/seeking etc.
    */
    status = read(fd3,buf,15);
    if (status < 0)
        printf("error: %s\n",strerror(errno));
    else {
        for (i=0; i < status; i++)
            printf("%c",buf[i]);
        printf("\n");
        hdl.curr_offset += status;
    }
    printf("total recoveries, %d\n",hdl.nr);
    /* close both fds, and detach them from the HA lib */
    ha_close(fd);
    ha_close(fd3);
    exit(0);
}
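
The checkpoint-and-seek idea in the example above can also be tried without an HA-aware server. The following sketch uses only plain POSIX calls on an ordinary file; the Checkpoint type and the functions checkpointed_read() and restore_from_checkpoint() are hypothetical names introduced here, and closing the descriptor by hand merely stands in for the connection going stale.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical checkpoint record, similar in spirit to Handle above. */
typedef struct {
    const char *path;        /* what to reopen during recovery */
    off_t       offset;      /* bytes successfully consumed so far */
    int         recoveries;  /* how many times we had to recover */
} Checkpoint;

/* Read, and advance the checkpoint only by what was actually read. */
ssize_t checkpointed_read(int fd, void *buf, size_t n, Checkpoint *cp)
{
    ssize_t got = read(fd, buf, n);
    if (got > 0)
        cp->offset += got;
    return got;
}

/* Reopen the file and seek back to the checkpointed offset.
   Returns the new descriptor, or -1 on failure. */
int restore_from_checkpoint(Checkpoint *cp)
{
    int fd = open(cp->path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (lseek(fd, cp->offset, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    cp->recoveries++;
    return fd;
}
```

The point of updating the offset only after a successful read is that the checkpoint always describes data the client has really received, so a recovery never skips or repeats bytes.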