# Managing Process Failover: Simple Guide

Let's set up process failover in Alpine Linux! This keeps your services running even when something goes wrong. We'll make it easy!

## What is Process Failover?

Process failover is like having a backup plan for your computer programs! When one stops working, another one takes over automatically.

Think of process failover like:

- Having spare batteries ready
- A backup generator during power outages
- A safety net that catches problems

## What You Need

Before we start, you need:

- An Alpine Linux system up and running
- Root or sudo access
- Basic terminal knowledge
- Multiple network interfaces (optional)

## Step 1: Installing Failover Tools

### Setting Up Monitoring Tools

Let's install the tools we need! It's easy!

What we're doing: Install process monitoring and failover software.

```bash
# Update the package list
apk update

# Install monitoring and process tools
apk add supervisor monit keepalived

# Install useful utilities
apk add curl wget jq netcat-openbsd

# Install system tools
apk add procps htop

# Install Python (the sample services later in this guide need it)
apk add python3
```

What this does: Installs tools to watch processes and handle failovers.

Example output:

```
Installing supervisor (4.2.5-r0)
Installing monit (5.33.0-r0)
Installing keepalived (2.2.8-r0)
```

What this means: Your failover tools are ready!
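
If you want to double-check before moving on, `apk` can confirm each package is present (it prints a package's name only if it is installed):

```bash
# Confirm the packages are installed
apk info -e supervisor
apk info -e monit
apk info -e keepalived
```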

## Important Tips

Tip: Test failover systems before you need them!

Warning: Don't restart critical services during busy times!

## Step 2: Setting Up Supervisor

### Configuring Process Supervisor

Now let's set up Supervisor to watch our processes! Don't worry - it's still easy!

What we're doing: Configure Supervisor to restart failed processes automatically.

```bash
# Start the supervisor service and enable it at boot
rc-service supervisord start
rc-update add supervisord

# Create a supervisor config directory
# Note: make sure the [include] section of /etc/supervisord.conf picks up this
# directory (Alpine's default include path may differ, e.g. /etc/supervisor.d/*.ini)
mkdir -p /etc/supervisor/conf.d

# Create a sample service to monitor
cat > /opt/sample-service.py << 'EOF'
#!/usr/bin/env python3
import time
import signal

class SampleService:
    def __init__(self):
        self.running = True
        signal.signal(signal.SIGTERM, self.handle_signal)
        signal.signal(signal.SIGINT, self.handle_signal)

    def handle_signal(self, signum, frame):
        print(f"Received signal {signum}, shutting down...")
        self.running = False

    def run(self):
        # flush=True so heartbeats show up in the log file right away
        print("Sample service started!", flush=True)
        counter = 0
        while self.running:
            counter += 1
            print(f"Service heartbeat: {counter}", flush=True)
            time.sleep(10)
        print("Sample service stopped.", flush=True)

if __name__ == "__main__":
    service = SampleService()
    service.run()
EOF
chmod +x /opt/sample-service.py

# Create a supervisor config for our service
cat > /etc/supervisor/conf.d/sample-service.conf << 'EOF'
[program:sample-service]
command=/opt/sample-service.py
directory=/opt
user=root
autostart=true
autorestart=true
startretries=3
stderr_logfile=/var/log/sample-service.err.log
stdout_logfile=/var/log/sample-service.out.log
EOF

# Reload the supervisor configuration
supervisorctl reread
supervisorctl update
```

Code explanation:

- `supervisor`: Monitors and restarts processes automatically
- `autostart=true`: Starts the service when supervisor starts
- `autorestart=true`: Restarts the service if it crashes
- `startretries=3`: Tries to restart 3 times before giving up

Expected output:

```
sample-service: added process group
```

What this means: Great job! Your process is now monitored!

### Let's Try It!

Time for hands-on practice! This is the fun part!

What we're doing: Test the failover by stopping and starting processes.

```bash
# Check supervisor status
supervisorctl status

# Stop the service manually
supervisorctl stop sample-service

# Start it again
supervisorctl start sample-service

# View the service logs
tail -f /var/log/sample-service.out.log
```

You should see:

```
sample-service    RUNNING   pid 1234, uptime 0:00:05
```

Awesome work!
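
Note that `supervisorctl stop` is a deliberate stop, so supervisor will not bring the service back on its own. To see real failover, kill the process behind supervisor's back and watch `autorestart` do its job (supervisorctl's `pid` subcommand gives you the child's PID):

```bash
# Simulate a crash: kill the process without telling supervisor
kill -9 "$(supervisorctl pid sample-service)"

# Give supervisor a moment, then confirm it restarted the service
sleep 2
supervisorctl status sample-service
```

You should see the service RUNNING again, with a new PID.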

## Quick Summary Table

| What to Do | Command | Result |
|---|---|---|
| Check status | `supervisorctl status` | Shows running processes |
| Restart service | `supervisorctl restart name` | Restarts a specific process |
| View logs | `tail -f /var/log/service.log` | Shows service activity |

## Step 3: Setting Up Advanced Failover

### Creating Health Check Scripts

Let's create scripts that check if services are healthy!

What we're doing: Build smart health monitoring scripts.

```bash
# Create a health check directory
mkdir -p /opt/health-checks

# Create a web service health check
cat > /opt/health-checks/web-check.sh << 'EOF'
#!/bin/sh
# Web service health check (POSIX sh - Alpine does not ship bash by default)
SERVICE_URL="http://localhost:8080/health"
TIMEOUT=10
MAX_FAILURES=3
FAILURE_FILE="/tmp/web-service-failures"

# Check if the service responds
if curl -f -s --max-time "$TIMEOUT" "$SERVICE_URL" > /dev/null 2>&1; then
    # Service is healthy
    echo "OK: web service is healthy"
    rm -f "$FAILURE_FILE"
    exit 0
else
    # Service failed
    echo "FAIL: web service failed health check"

    # Count consecutive failures
    if [ -f "$FAILURE_FILE" ]; then
        FAILURES=$(cat "$FAILURE_FILE")
    else
        FAILURES=0
    fi
    FAILURES=$((FAILURES + 1))
    echo "$FAILURES" > "$FAILURE_FILE"
    echo "Failure count: $FAILURES"

    # Restart after too many failures (adjust the program name to your setup)
    if [ "$FAILURES" -ge "$MAX_FAILURES" ]; then
        echo "Restarting web service due to repeated failures"
        supervisorctl restart web-service
        rm -f "$FAILURE_FILE"
    fi
    exit 1
fi
EOF
chmod +x /opt/health-checks/web-check.sh

# Create a database health check
cat > /opt/health-checks/db-check.sh << 'EOF'
#!/bin/sh
# Database health check
DB_HOST="localhost"
DB_PORT="3306"
TIMEOUT=5

# Check if the database port is responding
if nc -z -w "$TIMEOUT" "$DB_HOST" "$DB_PORT"; then
    echo "OK: database is responding on port $DB_PORT"
    exit 0
else
    echo "FAIL: database is not responding on port $DB_PORT"

    # Try to restart the database service
    # (adjust the service name to your setup, e.g. mariadb)
    echo "Attempting to restart the database"
    rc-service mysql restart
    sleep 5

    # Check again
    if nc -z -w "$TIMEOUT" "$DB_HOST" "$DB_PORT"; then
        echo "OK: database recovered after restart"
        exit 0
    else
        echo "FAIL: database restart failed"
        exit 1
    fi
fi
EOF
chmod +x /opt/health-checks/db-check.sh
```

What this does: Creates smart scripts that can fix problems automatically!
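
If you also want to watch the sample service from Step 2, here is a minimal sketch of a third check in the same style. It assumes the `sample-service` program name from the supervisor config above and simply parses `supervisorctl status` output:

```bash
cat > /opt/health-checks/sample-check.sh << 'EOF'
#!/bin/sh
# Check that the supervised sample-service is in the RUNNING state
if supervisorctl status sample-service | grep -q RUNNING; then
    echo "OK: sample-service is running"
    exit 0
else
    echo "FAIL: sample-service is not running, restarting it"
    supervisorctl restart sample-service
    exit 1
fi
EOF
chmod +x /opt/health-checks/sample-check.sh
```

Any script you drop into /opt/health-checks gets picked up automatically by the monitoring loop below.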

### Setting Up Automated Monitoring

What we're doing: Run health checks automatically with cron.

```bash
# Create the main monitoring script
cat > /opt/monitor-services.sh << 'EOF'
#!/bin/sh
# Main service monitoring script
LOG_FILE="/var/log/service-monitor.log"
DATE=$(date '+%Y-%m-%d %H:%M:%S')

echo "[$DATE] Starting service health checks" >> "$LOG_FILE"

# Run all health checks
for check in /opt/health-checks/*.sh; do
    if [ -x "$check" ]; then
        CHECK_NAME=$(basename "$check" .sh)
        echo "[$DATE] Running $CHECK_NAME" >> "$LOG_FILE"
        if "$check" >> "$LOG_FILE" 2>&1; then
            echo "[$DATE] $CHECK_NAME: PASSED" >> "$LOG_FILE"
        else
            echo "[$DATE] $CHECK_NAME: FAILED" >> "$LOG_FILE"
        fi
    fi
done

echo "[$DATE] Health checks completed" >> "$LOG_FILE"
EOF
chmod +x /opt/monitor-services.sh

# Add to cron (run every 2 minutes)
# Append rather than overwrite, so existing root cron entries survive
echo "*/2 * * * * /opt/monitor-services.sh" >> /etc/crontabs/root

# Make sure the cron daemon is running and starts at boot
rc-service crond start
rc-update add crond

# Test the monitoring script
/opt/monitor-services.sh
```

What this does: Automatically monitors your services every 2 minutes!
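
To confirm the schedule actually took effect, read the crontab back and make sure the daemon is up:

```bash
# Show root's crontab (should include the */2 entry)
cat /etc/crontabs/root

# Confirm the cron daemon is running
rc-service crond status
```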

## Practice Time!

Let's practice what you learned! Try these simple examples:

### Example 1: Creating a Backup Service

What we're doing: Set up a backup process that takes over when the main one fails.

```bash
# Create the primary service
cat > /opt/primary-service.py << 'EOF'
#!/usr/bin/env python3
import time
import socket
import signal

class PrimaryService:
    def __init__(self):
        self.running = True
        self.port = 9001
        signal.signal(signal.SIGTERM, self.stop)

    def stop(self, signum, frame):
        self.running = False

    def run(self):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            sock.bind(('localhost', self.port))
            # listen() so the backup's connect-based check can see us
            sock.listen(1)
            print(f"Primary service listening on port {self.port}", flush=True)
            while self.running:
                time.sleep(1)
        except Exception as e:
            print(f"Primary service error: {e}", flush=True)
        finally:
            sock.close()

if __name__ == "__main__":
    service = PrimaryService()
    service.run()
EOF

# Create the backup service
cat > /opt/backup-service.py << 'EOF'
#!/usr/bin/env python3
import time
import socket
import signal

class BackupService:
    def __init__(self):
        self.running = True
        self.port = 9001
        signal.signal(signal.SIGTERM, self.stop)

    def stop(self, signum, frame):
        self.running = False

    def is_primary_running(self):
        # A successful TCP connect means someone is listening on the port
        try:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.settimeout(2)
            result = sock.connect_ex(('localhost', self.port))
            sock.close()
            return result == 0
        except OSError:
            return False

    def run(self):
        print("Backup service started (standby mode)", flush=True)
        while self.running:
            if not self.is_primary_running():
                print("Primary service down! Taking over...", flush=True)
                try:
                    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                    sock.bind(('localhost', self.port))
                    sock.listen(1)
                    print(f"Backup service now active on port {self.port}", flush=True)
                    # While we hold the port, a connect check would only see
                    # ourselves. Serve for a while, then release the port and
                    # give a restarted primary a chance to reclaim it.
                    for _ in range(30):
                        if not self.running:
                            break
                        time.sleep(1)
                    sock.close()
                    time.sleep(2)
                    if self.is_primary_running():
                        print("Primary service recovered. Going back to standby.", flush=True)
                except Exception as e:
                    print(f"Backup service error: {e}", flush=True)
            time.sleep(5)

if __name__ == "__main__":
    service = BackupService()
    service.run()
EOF

chmod +x /opt/primary-service.py /opt/backup-service.py
```

What this does: Creates primary and backup services that work together! Since only one process can hold the port, the backup periodically releases it so a restarted primary can reclaim it.
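
To actually run the pair, you can put both under the Supervisor setup from Step 2. A minimal sketch, assuming the same config directory as before (the program names here are our own choice):

```bash
cat > /etc/supervisor/conf.d/failover-pair.conf << 'EOF'
[program:primary-service]
command=/opt/primary-service.py
autostart=true
autorestart=true

[program:backup-service]
command=/opt/backup-service.py
autostart=true
autorestart=true
EOF

supervisorctl reread
supervisorctl update
```

Then try `supervisorctl stop primary-service` and watch `supervisorctl tail -f backup-service` to see the takeover happen.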

### Example 2: Network Failover with Keepalived

What we're doing: Set up IP address failover between servers.

```bash
# Configure keepalived for IP failover
cat > /etc/keepalived/keepalived.conf << 'EOF'
global_defs {
    router_id ALPINE_01
}

vrrp_script chk_service {
    script "/opt/health-checks/web-check.sh"
    interval 10
    weight -20
    fall 3
    rise 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 110
    advert_int 1
    authentication {
        auth_type PASS
        # Change this to your own secret
        auth_pass mypassword
    }
    virtual_ipaddress {
        192.168.1.100/24
    }
    track_script {
        chk_service
    }
}
EOF

# Start keepalived and enable it at boot
rc-service keepalived start
rc-update add keepalived

echo "Keepalived configured for IP failover"
```

What this does: Automatically moves IP addresses between servers! On the second server, use the same config but with `state BACKUP` and a lower `priority` (for example 100).
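
A quick way to watch the failover, assuming the `eth0` interface and the `192.168.1.100` address from the config above (the log check assumes syslogd is writing to /var/log/messages, the usual Alpine default when syslog is enabled):

```bash
# On the MASTER, the virtual IP should appear on eth0
ip addr show dev eth0 | grep 192.168.1.100

# Watch keepalived state changes in the system log
tail -f /var/log/messages | grep -i keepalived
```

When the MASTER goes down, the same `ip addr` check on the BACKUP should start showing the virtual IP.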

## Fix Common Problems

### Problem 1: Service won't restart

What happened: The process keeps failing after restart attempts.
How to fix it: Check the logs and fix the underlying issue!

```bash
# Check supervisor logs
supervisorctl tail sample-service

# View detailed logs
tail -f /var/log/sample-service.err.log

# Check system resources
top
df -h
```

### Problem 2: Health checks failing

What happened: Health check scripts report failures incorrectly.
How to fix it: Test and adjust the health check logic!

```bash
# Test a health check manually
/opt/health-checks/web-check.sh

# Check if the service is actually running
ps aux | grep sample-service

# Verify network connectivity
netstat -tlnp | grep :8080
```
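
If a check misbehaves and the cause is not obvious, trace it line by line:

```bash
# Run the health check with shell tracing to see each command as it executes
sh -x /opt/health-checks/web-check.sh
```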

Don't worry! These problems happen to everyone. You're doing great!

## Simple Tips

- Test failover regularly - practice makes perfect
- Monitor logs closely - logs tell you what's happening
- Keep health checks simple - complex checks can fail too
- Have backup plans - always have multiple layers

## Check Everything Works

Let's make sure everything is working:

```bash
# Check supervisor status
supervisorctl status

# Test the health checks
/opt/monitor-services.sh

# View the monitoring logs
tail /var/log/service-monitor.log

# Check keepalived status
rc-service keepalived status
```

Good output:

```
Success! Process failover system is working correctly.
```

## What You Learned

Great job! Now you can:

- Set up automatic process monitoring
- Configure process restart on failure
- Create health check scripts
- Build backup service systems

## What's Next?

Now you can try:

- Adding email alerts for failures
- Exploring monit (installed in Step 1) as an alternative watchdog
- Setting up database failover
- Creating multi-server clusters
- Building load balancing systems

Remember: Every expert was once a beginner. You're doing amazing!

Keep practicing and you'll become an expert too!