RK drones on about his home systems again

On my home system, I had a problem with one of the servers periodically dying. There didn't seem to be any particular situation that triggered the problem; once every few weeks the server would simply get hit with a SIGSEGV. I wasn't in a position to fix it, and didn't really have the time to analyze the problem and submit a proper bug report. What I did have time to do, though, was hack together a tiny shell script that functions as an overlord. The script polls once per second to see if the server is up. If the server has died, the script restarts it. Client programs simply reconnect to the server once it's back up. Dead simple: ten lines of shell script, an hour of programming and testing, and the problem is now solved (although "masked" might be a better term).
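The original script isn't reproduced here. As a rough illustration of the same overlord idea, here is a minimal sketch in Python rather than shell. The names `supervise`, `poll_interval`, and `max_restarts` are hypothetical, and unlike the script described above, this variant launches the server itself and polls the child process handle instead of looking the server up by name.

```python
import subprocess
import sys
import time


def supervise(cmd, poll_interval=1.0, max_restarts=None):
    """Keep `cmd` running: poll it and restart it whenever it has exited.

    Returns the number of restarts once `max_restarts` is reached;
    with max_restarts=None (the default) it loops forever.
    """
    restarts = 0
    proc = subprocess.Popen(cmd)
    while True:
        time.sleep(poll_interval)
        status = proc.poll()  # None while the server is still running
        if status is None:
            continue
        # On POSIX, a negative status -N means the process was killed by signal N.
        print(f"server exited with status {status}", file=sys.stderr)
        if max_restarts is not None and restarts >= max_restarts:
            return restarts
        restarts += 1
        proc = subprocess.Popen(cmd)
```

Calling `supervise(["./server"])` (with `./server` standing in for the real server command) would check once per second and restart indefinitely. The `max_restarts` cap exists only so the loop can be exercised in a test; a real overlord would presumably run without it.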

Even though I had a system with a poor MTBF, by keeping the time to repair (MTTR) down to a second or two, I was able to have a system that met my availability requirements.
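To put rough numbers on this, the standard steady-state availability formula can be applied. The figures are illustrative assumptions, not measurements: an MTBF of three weeks (1,814,400 seconds) to stand in for "once every few weeks", and an MTTR of two seconds.

```latex
\[
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
  = \frac{1\,814\,400\ \mathrm{s}}{1\,814\,400\ \mathrm{s} + 2\ \mathrm{s}}
  \approx 0.999\,998\,9
\]
```

Under those assumptions, that works out to roughly 35 seconds of downtime per year, even though the server crashes about 17 times in that year.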

Of course, in a proper production environment, the core dumps from the server would be analyzed, the fault would be added to the regression test suite, and there'd be no extra stock options for the designer of the server. :-)