Using the Client Recovery Library

The client recovery library (the HA library) provides drop-in enhancements for many standard libc I/O operations. Its cover functions automatically recover failed connections when those connections can be recovered in a high availability (HA) scenario.

The goal is to provide an API for high availability I/O that gives clients transparent recovery, especially in an environment where the servers must also be highly available. Recovery is configurable to suit specific client needs; we provide examples of ways to develop more complicated recovery mechanisms.

The main principle of the HA library is to provide drop-in replacements for all the "transmission" functions (e.g., MsgSend*()). The API lets a client choose the specific connections that it would like to make highly available; all other connections operate as ordinary connections.

Normally, when a server that the client is talking to fails, or if there's a transient network fault, the MsgSend*() functions return an error indicating that the connection ID (or file descriptor) is stale or invalid (EBADF).
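The symptom can be reproduced on any POSIX system, without the HA library or MsgSend*(). The following is only an illustrative sketch: stale_write_errno() is a hypothetical helper, and it uses write() on a deliberately closed descriptor as a stand-in for a connection whose server has gone away.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the HA library).
 * Opens /dev/null, closes it to make the descriptor stale, then writes to it.
 * Returns the errno reported by write(), 0 if the write unexpectedly
 * succeeded, or -1 if the initial open() failed.
 */
int stale_write_errno(void)
{
    int fd = open("/dev/null", O_WRONLY);
    if (fd < 0)
        return -1;

    close(fd);                    /* the descriptor is now invalid */

    errno = 0;
    if (write(fd, "x", 1) == -1)
        return errno;             /* EBADF on a closed descriptor */
    return 0;
}
```

A client that uses only the standard calls has to detect this error itself and decide what to do next.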

In an HA-aware scenario, these transient faults are often recovered from almost immediately (on the server end), making the services available again. Unfortunately, clients that use standard I/O calls might not be able to benefit fully from this unless they provide mechanisms to recover from these errors and then retransmit the data, which often involves a nontrivial rework of client programs.

By performing recovery inside the HA library itself, we can automatically take advantage of HA-aware services that restart themselves or are restarted automatically, as well as services that are provided in a transparent cluster or redundant configuration.

Since recovery itself is a connection-specific task, we allow clients to provide recovery mechanisms that will be used to restore connections when they fail. Irrecoverable errors are propagated back reliably so that any client that doesn't wish to recover will get the I/O library semantics that it expects.

The recovery mechanism can be anything ranging from a simple reopen of the connection to a more complex scenario that includes the retransmission/renegotiation of connection-specific information.
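The idea of a cover function with a client-supplied recovery mechanism can be sketched in portable C. This is a minimal illustration under stated assumptions, not the HA library's API: struct ha_conn, recover_fn, conn_write() and reopen_path() are all hypothetical names, write() stands in for the transmission functions, and the recovery shown is the simplest case, a reopen of the same path.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical recovery callback: returns a usable descriptor,
 * or -1 if the connection cannot be recovered. */
typedef int (*recover_fn)(int stale_fd, void *handle);

/* Hypothetical per-connection state (not the HA library's own structure). */
struct ha_conn {
    int        fd;         /* current descriptor                         */
    recover_fn recover;    /* NULL means "ordinary connection"           */
    void      *handle;     /* client data passed to the callback         */
    int        max_tries;  /* upper bound on recovery attempts per call  */
};

/* Cover function for write(): on EBADF, run the recovery callback and
 * retransmit. Any other error, or a failed recovery, is returned to the
 * caller with the usual write() semantics. */
ssize_t conn_write(struct ha_conn *c, const void *buf, size_t len)
{
    for (int attempt = 0; ; attempt++) {
        ssize_t n = write(c->fd, buf, len);
        if (n >= 0 || errno != EBADF)
            return n;                     /* success, or an unrelated error */

        if (c->recover == NULL || attempt >= c->max_tries) {
            errno = EBADF;                /* not HA, or out of attempts */
            return -1;
        }

        int fd = c->recover(c->fd, c->handle);
        if (fd < 0) {
            errno = EBADF;                /* report the original failure */
            return -1;
        }
        c->fd = fd;                       /* recovered: loop and resend */
    }
}

/* Simplest possible recovery: reopen the path passed as the handle. */
int reopen_path(int stale_fd, void *handle)
{
    (void)stale_fd;
    return open((const char *)handle, O_WRONLY);
}
```

A client would fill in a struct ha_conn with reopen_path and the path as the handle for a connection it wants recovered, and leave recover as NULL for an ordinary connection. A more complex callback could also resend connection-specific state before returning the new descriptor.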