Policies

Updated: April 19, 2023

Generally speaking, the HA infrastructure presented here is good, but there's one more thing that we need to talk about. What if a process dies and, when restarted, dies again, and keeps dying? A good HA system will cover that case as well, by providing per-process policies. A policy defines things such as how fast and how often a process may be restarted, and what to do once those limits are exceeded.

While it's a good idea to restart a process when it faults, some processes can be very expensive to restart (perhaps in terms of the amount of CPU the process takes to start up, or the extent to which it ties up other resources).

An overlord process needs to be able to limit how fast and how often a process is restarted. One technique is an exponential back-off algorithm. When the process dies, it's restarted immediately. If it dies again within a certain time window, it's restarted after 200 milliseconds. If it dies yet again within that window, it's restarted after 400 milliseconds, then 800 milliseconds, then 1600 milliseconds, and so on, with the delay doubling each time up to a certain limit. If the limit is exceeded, another policy is invoked that determines what to do about this process. One possibility is to run the previous version of the process, in case the new version has a bug that the older version doesn't. Another might be to raise alarms, or page someone. Other actions are left to your imagination, and depend on the kind of system that you're designing.
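The back-off schedule described above can be sketched in a few lines of Python. This is only an illustration, not part of any HA library: the class name BackoffPolicy, the method on_death, and the default values (in particular the 10-second window and the 1600 ms ceiling) are assumptions made for the example. It also assumes the time window is measured from the previous death, which the text leaves open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackoffPolicy:
    """Hypothetical per-process restart policy; all defaults are illustrative."""
    base_delay_ms: int = 200    # delay before the second restart in a streak
    max_delay_ms: int = 1600    # beyond this delay, escalate instead of restarting
    window_ms: int = 10_000     # assumed window: deaths closer than this form a streak

    def __post_init__(self) -> None:
        self._streak = 0                          # consecutive deaths inside the window
        self._last_death_ms: Optional[int] = None

    def on_death(self, now_ms: int) -> Optional[int]:
        """Return the delay in ms before restarting, or None to escalate."""
        if self._last_death_ms is None or now_ms - self._last_death_ms > self.window_ms:
            self._streak = 0                      # quiet for long enough: start over
        self._last_death_ms = now_ms
        self._streak += 1
        if self._streak == 1:
            return 0                              # first death: restart immediately
        delay = self.base_delay_ms * 2 ** (self._streak - 2)   # 200, 400, 800, ...
        if delay > self.max_delay_ms:
            return None                           # limit exceeded: invoke another policy
        return delay


policy = BackoffPolicy()
for t in (0, 100, 400, 900, 1800, 3500):
    print(t, policy.on_death(t))
```

With these assumed defaults, the loop reports delays of 0, 200, 400, 800, and 1600 milliseconds, then None for the sixth death. A None result is where the overlord would hand off to whatever escalation policy the system defines, such as running the previous version of the process or raising an alarm.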