How do I write a bash script to restart a process if it dies?
LinuxBashCronScriptingLinux Problem Overview
I have a python script that'll be checking a queue and performing an action on each item:
# checkqueue.py
while True:
check_queue()
do_something()
How do I write a bash script that will check if it's running, and if not, start it. Roughly the following pseudo code (or maybe it should do something like ps | grep
?):
# keepalivescript.sh
if processidfile exists:
if processid is running:
exit, all ok
run checkqueue.py
write processid to processidfile
I'll call that from a crontab:
# crontab
*/5 * * * * /path/to/keepalivescript.sh
Linux Solutions
Solution 1 - Linux
Avoid PID-files, crons, or anything else that tries to evaluate processes that aren't their children.
There is a very good reason why in UNIX, you can ONLY wait on your children. Any method (ps parsing, pgrep, storing a PID, ...) that tries to work around that is flawed and has gaping holes in it. Just say no.
Instead you need the process that monitors your process to be the process' parent. What does this mean? It means only the process that starts your process can reliably wait for it to end. In bash, this is absolutely trivial.
until myserver; do
echo "Server 'myserver' crashed with exit code $?. Respawning.." >&2
sleep 1
done
The above piece of bash code runs myserver
in an until
loop. The first line starts myserver
and waits for it to end. When it ends, until
checks its exit status. If the exit status is 0
, it means it ended gracefully (which means you asked it to shut down somehow, and it did so successfully). In that case we don't want to restart it (we just asked it to shut down!). If the exit status is not 0
, until
will run the loop body, which emits an error message on STDERR and restarts the loop (back to line 1) after 1 second.
Why do we wait a second? Because if something's wrong with the startup sequence of myserver
and it crashes immediately, you'll have a very intensive loop of constant restarting and crashing on your hands. The sleep 1
takes away the strain from that.
Now all you need to do is start this bash script (asynchronously, probably), and it will monitor myserver
and restart it as necessary. If you want to start the monitor on boot (making the server "survive" reboots), you can schedule it in your user's cron(1) with an @reboot
rule. Open your cron rules with crontab
:
crontab -e
Then add a rule to start your monitor script:
@reboot /usr/local/bin/myservermonitor
Alternatively; look at inittab(5) and /etc/inittab. You can add a line in there to have myserver
start at a certain init level and be respawned automatically.
Edit.
Let me add some information on why not to use PID files. While they are very popular; they are also very flawed and there's no reason why you wouldn't just do it the correct way.
Consider this:
-
PID recycling (killing the wrong process):
/etc/init.d/foo start
: startfoo
, writefoo
's PID to/var/run/foo.pid
- A while later:
foo
dies somehow. - A while later: any random process that starts (call it
bar
) takes a random PID, imagine it takingfoo
's old PID. - You notice
foo
's gone:/etc/init.d/foo/restart
reads/var/run/foo.pid
, checks to see if it's still alive, findsbar
, thinks it'sfoo
, kills it, starts a newfoo
.
-
PID files go stale. You need over-complicated (or should I say, non-trivial) logic to check whether the PID file is stale, and any such logic is again vulnerable to
1.
. -
What if you don't even have write access or are in a read-only environment?
-
It's pointless overcomplication; see how simple my example above is. No need to complicate that, at all.
By the way; even worse than PID files is parsing ps
! Don't ever do this.
ps
is very unportable. While you find it on almost every UNIX system; its arguments vary greatly if you want non-standard output. And standard output is ONLY for human consumption, not for scripted parsing!- Parsing
ps
leads to a LOT of false positives. Take theps aux | grep PID
example, and now imagine someone starting a process with a number somewhere as argument that happens to be the same as the PID you stared your daemon with! Imagine two people starting an X session and you grepping for X to kill yours. It's just all kinds of bad.
If you don't want to manage the process yourself; there are some perfectly good systems out there that will act as monitor for your processes. Look into http://smarden.org/runit/">runit</a>;, for example.
Solution 2 - Linux
Have a look at monit (http://mmonit.com/monit/). It handles start, stop and restart of your script and can do health checks plus restarts if necessary.
Or do a simple script:
while true
do
/your/script
sleep 1
done
Solution 3 - Linux
The easiest way to do it is using flock on file. In Python script you'd do
lf = open('/tmp/script.lock','w')
if(fcntl.flock(lf, fcntl.LOCK_EX|fcntl.LOCK_NB) != 0):
sys.exit('other instance already running')
lf.write('%d\n'%os.getpid())
lf.flush()
In shell you can actually test if it's running:
if [ `flock -xn /tmp/script.lock -c 'echo 1'` ]; then
echo 'it's not running'
restart.
else
echo -n 'it's already running with PID '
cat /tmp/script.lock
fi
But of course you don't have to test, because if it's already running and you restart it, it'll exit with 'other instance already running'
When process dies, all it's file descriptors are closed and all locks are automatically removed.
Solution 4 - Linux
In-line:
while true; do <your-bash-snippet> && break; done
e.g. #1
while true; do openconnect x.x.x.x:xxxx && break; done
e.g. #2
while true; do docker logs -f container-name; sleep 2; done
Solution 5 - Linux
You should use monit, a standard unix tool that can monitor different things on the system and react accordingly.
From the docs: http://mmonit.com/monit/documentation/monit.html#pid_testing
check process checkqueue.py with pidfile /var/run/checkqueue.pid if changed pid then exec "checkqueue_restart.sh"
You can also configure monit to email you when it does do a restart.
Solution 6 - Linux
if ! test -f $PIDFILE || ! psgrep `cat $PIDFILE`; then
restart_process
# Write PIDFILE
echo $! >$PIDFILE
fi
Solution 7 - Linux
watch "yourcommand"
It will restart the process if/when it stops (after a 2s delay).
watch -n 0.1 "yourcommand"
To restart it after 0.1s instead of the default 2 seconds
watch -e "yourcommand"
To stop restarts if the program exits with an error.
Advantages:
- built-in command
- one line
- easy to use and remember.
Drawbacks:
- Only display the result of the command on the screen once it's finished
Solution 8 - Linux
I'm not sure how portable it is across operating systems, but you might check if your system contains the 'run-one' command, i.e. "man run-one". Specifically, this set of commands includes 'run-one-constantly', which seems to be exactly what is needed.
From man page:
> run-one-constantly COMMAND [ARGS]
Note: obviously this could be called from within your script, but also it removes the need for having a script at all.
Solution 9 - Linux
I've used the following script with great success on numerous servers:
pid=`jps -v | grep $INSTALLATION | awk '{print $1}'`
echo $INSTALLATION found at PID $pid
while [ -e /proc/$pid ]; do sleep 0.1; done
notes:
- It's looking for a java process, so I can use jps, this is much more consistent across distributions than ps
$INSTALLATION
contains enough of the process path that's it's totally unambiguous- Use sleep while waiting for the process to die, avoid hogging resources :)
This script is actually used to shut down a running instance of tomcat, which I want to shut down (and wait for) at the command line, so launching it as a child process simply isn't an option for me.
Solution 10 - Linux
I use this for my npm Process
#!/bin/bash
for (( ; ; ))
do
date +"%T"
echo Start Process
cd /toFolder
sudo process
date +"%T"
echo Crash
sleep 1
done