Automatic Failover Guide - Dual Dev Server Setup
This guide documents the automatic failover system for dual development servers using Traefik health checks.
Overview
The system maintains two development servers that automatically switch traffic based on availability:
- Primary: 10.3.0.33:3005 (main dev server)
- Secondary: 10.1.0.33:3005 (backup dev server)
Traefik automatically:
- ✅ Health checks both servers every 30 seconds
- ✅ Routes traffic only to healthy servers
- ✅ Removes unhealthy servers from load balancing
- ✅ Restores servers when they come back online
- ✅ Load balances when both are healthy
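The routing behaviour listed above can be sketched as a small pure function. This is only a model of the outcome, not Traefik's internal logic; the names `Backend`, `RoutingDecision`, and `decideRouting` are hypothetical, and the 503 case is taken from the decision tree in the Architecture section.

```typescript
// Hypothetical model of the failover outcome; not Traefik's actual implementation.
type Backend = { url: string; healthy: boolean };

type RoutingDecision =
  | { kind: "load-balance"; targets: string[] }
  | { kind: "single"; target: string }
  | { kind: "unavailable"; status: 503 };

function decideRouting(backends: Backend[]): RoutingDecision {
  // Only servers that passed their last health check are eligible.
  const healthy = backends.filter((b) => b.healthy).map((b) => b.url);
  const [first, ...rest] = healthy;

  if (first === undefined) {
    // No healthy server left: the proxy can only answer with an error.
    return { kind: "unavailable", status: 503 };
  }
  if (rest.length === 0) {
    // Exactly one healthy server: all traffic goes there.
    return { kind: "single", target: first };
  }
  // Two or more healthy servers: spread traffic across them.
  return { kind: "load-balance", targets: healthy };
}

// Example: primary up, secondary down.
const exampleDecision = decideRouting([
  { url: "http://10.3.0.33:3005", healthy: true },
  { url: "http://10.1.0.33:3005", healthy: false },
]);
```

With the primary healthy and the secondary down, `exampleDecision` is the `single` case pointing at `http://10.3.0.33:3005`.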
Architecture
User Request (https://do.dev)
↓
Traefik Proxy (10.3.3.3)
- Health checks every 30s
- Automatic failover
↓
┌─────────────────────────────────────┐
│ Load Balancer Decision Tree         │
├─────────────────────────────────────┤
│ Both Healthy   → Load Balance       │
│ Primary Only   → Route to 10.3.0.33 │
│ Secondary Only → Route to 10.1.0.33 │
│ Both Down      → 503 Error          │
└─────────────────────────────────────┘
↓
Dev Server 1: 10.3.0.33:3005 ✅
Dev Server 2: 10.1.0.33:3005 ❌
Health Check Configuration
Health Check Settings:
- Path: / (root endpoint)
- Interval: 30s (checks every 30 seconds)
- Timeout: 5s (fails if no response in 5s)
- Expected Status: 200 OK
- Method: GET
- Headers: Host: do.dev
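To make the settings above concrete, the sketch below evaluates a single probe against them and estimates how long a failure can go unnoticed. It assumes that one failed probe is enough to mark a server down; `parseDurationMs`, `probePassed`, and `worstCaseDetectionMs` are hypothetical helper names, not part of Traefik or this project.

```typescript
// Hypothetical helpers illustrating the health check settings; not project code.
interface HealthCheckConfig {
  path: string;
  interval: string; // e.g. "30s"
  timeout: string; // e.g. "5s"
  expectStatus: number;
}

// Parses the simple "<n>ms", "<n>s" and "<n>m" durations used in this guide.
function parseDurationMs(value: string): number {
  const match = /^(\d+)(ms|s|m)$/.exec(value);
  if (match === null) {
    throw new Error(`Unsupported duration: ${value}`);
  }
  const amount = Number(match[1]);
  const unit = match[2];
  if (unit === "ms") return amount;
  if (unit === "s") return amount * 1000;
  return amount * 60000;
}

// status is null when no response arrived at all.
interface ProbeResult {
  status: number | null;
  elapsedMs: number;
}

// A probe passes only if the expected status arrived within the timeout.
function probePassed(cfg: HealthCheckConfig, probe: ProbeResult): boolean {
  return (
    probe.status === cfg.expectStatus &&
    probe.elapsedMs <= parseDurationMs(cfg.timeout)
  );
}

// Upper bound on detection delay, assuming a single failed probe marks the
// server down: a full interval until the next probe, plus the timeout.
function worstCaseDetectionMs(cfg: HealthCheckConfig): number {
  return parseDurationMs(cfg.interval) + parseDurationMs(cfg.timeout);
}

const doDevCheck: HealthCheckConfig = {
  path: "/",
  interval: "30s",
  timeout: "5s",
  expectStatus: 200,
};
const detectionMs = worstCaseDetectionMs(doDevCheck); // 35000
```

Under that assumption a failure is noticed within about 35 seconds, which fits the 30-60 second wait used in the failover tests.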
Setup Process
1. Initial Configuration
The upstream servers are configured in Convex with health checks enabled:
// Upstream configuration (already done)
{
  name: "DoDev Upstream",
  addresses: [
    { address: "10.3.0.33", port: 3005, isActive: true },
    { address: "10.1.0.33", port: 3005, isActive: false } // Disabled due to being unreachable
  ],
  healthCheck: {
    path: "/",
    interval: "30s",
    timeout: "5s",
    expectStatus: 200
  }
}
2. Sync Configuration ✅ COMPLETED
Status: ✅ Configuration successfully pushed to Redis on August 2, 2025
The health check configuration has been successfully synchronized:
- Health checks enabled: path /, interval 30s, timeout 5s
- Only healthy server configured: 10.3.0.33:3005
- Unhealthy server excluded: 10.1.0.33:3005 automatically removed from load balancing
- Result: https://do.dev now works without bad gateway errors
How it was completed:
- Used Playwright to diagnose 502 Bad Gateway issue
- Found 10.1.0.33:3005 unreachable, 10.3.0.33:3005 healthy
- Disabled unreachable server in Convex (isActive: false)
- Ran sync script to push health check configuration to Redis
- Verified configuration with health check script
- Confirmed https://do.dev working perfectly
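The sync step above turns the Convex upstream configuration into Traefik keys in Redis. A minimal sketch of that mapping follows, using the key layout shown in the Redis Schema section. The function name `toTraefikKeys` is hypothetical, and two details are assumptions rather than confirmed behaviour of the real sync script: inactive addresses are skipped, and `scheme` and `method` are written as the fixed values `http` and `GET`.

```typescript
// Hypothetical sketch of the upstream -> Redis key mapping; the real sync
// script may differ.
interface UpstreamAddress {
  address: string;
  port: number;
  isActive: boolean;
}

interface Upstream {
  name: string;
  addresses: UpstreamAddress[];
  healthCheck: {
    path: string;
    interval: string;
    timeout: string;
    expectStatus: number;
  };
}

function toTraefikKeys(serviceName: string, upstream: Upstream): Record<string, string> {
  const base = `traefik/http/services/${serviceName}/loadBalancer`;
  const keys: Record<string, string> = {};

  // Assumption: only active addresses are published, numbered from 0.
  upstream.addresses
    .filter((a) => a.isActive)
    .forEach((a, index) => {
      keys[`${base}/servers/${index}/url`] = `http://${a.address}:${a.port}`;
    });

  keys[`${base}/healthCheck/path`] = upstream.healthCheck.path;
  keys[`${base}/healthCheck/interval`] = upstream.healthCheck.interval;
  keys[`${base}/healthCheck/timeout`] = upstream.healthCheck.timeout;
  // Assumption: fixed values, matching the Redis Schema section.
  keys[`${base}/healthCheck/scheme`] = "http";
  keys[`${base}/healthCheck/method`] = "GET";
  return keys;
}

const doDevKeys = toTraefikKeys("service-do-dev", {
  name: "DoDev Upstream",
  addresses: [
    { address: "10.3.0.33", port: 3005, isActive: true },
    { address: "10.1.0.33", port: 3005, isActive: false },
  ],
  healthCheck: { path: "/", interval: "30s", timeout: "5s", expectStatus: 200 },
});
```

With the secondary disabled, `doDevKeys` contains a single `servers/0/url` entry for `http://10.3.0.33:3005` plus the five `healthCheck` keys.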
3. Verify Health Checks
Run the health check verification script:
./scripts/check-health.sh
Expected output:
📋 Health Check Keys in Redis:
✅ traefik/http/services/service-do-dev/loadBalancer/healthCheck/path
✅ traefik/http/services/service-do-dev/loadBalancer/healthCheck/interval
✅ traefik/http/services/service-do-dev/loadBalancer/healthCheck/timeout
📊 do.dev Service Health Check Details:
✅ path: /
✅ interval: 30s
✅ timeout: 5s
✅ scheme: http
✅ method: GET
Testing Failover
Test 1: Normal Operation (Both Healthy)
# Both servers should be receiving traffic
curl -H "Host: do.dev" https://do.dev
# Should work consistently
Test 2: Primary Server Failure
# Stop primary server (10.3.0.33:3005)
ssh 10.3.0.33 "pm2 stop dodev" # or however you stop it
# Wait 30-60 seconds for health check to detect failure
sleep 60
# Traffic should automatically route to secondary
# (requires 10.1.0.33 to be enabled with isActive: true and synced)
curl -H "Host: do.dev" https://do.dev
# Should still work via 10.1.0.33:3005
Test 3: Server Recovery
# Start primary server again
ssh 10.3.0.33 "pm2 start dodev"
# Wait 30-60 seconds for health check to detect recovery
sleep 60
# Traffic should now load balance between both servers
curl -H "Host: do.dev" https://do.dev
# Should work via either server
Monitoring & Troubleshooting
Check Traefik Dashboard
- URL: http://10.3.3.3:8080/dashboard/
- View active routers and services
- Monitor server health status
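Server health can also be read from the Traefik API endpoint used in the troubleshooting commands (`/api/http/services`). The sketch below summarises an already-parsed response. It assumes each service entry carries a `serverStatus` map from server URL to `"UP"` or `"DOWN"` and that names carry a provider suffix such as `@redis`; verify both against the actual output before relying on them. `summarizeServerHealth` is a hypothetical helper.

```typescript
// Hypothetical helper; the response shape below is an assumption about the
// Traefik API output and should be checked against a real response.
interface TraefikServiceInfo {
  name: string;
  serverStatus?: { [url: string]: string };
}

interface HealthSummary {
  up: string[];
  down: string[];
}

function summarizeServerHealth(
  services: TraefikServiceInfo[],
  serviceName: string
): HealthSummary {
  const summary: HealthSummary = { up: [], down: [] };
  for (const svc of services) {
    // Accept both "service-do-dev" and "service-do-dev@<provider>".
    const matches =
      svc.name === serviceName || svc.name.indexOf(serviceName + "@") === 0;
    if (!matches) continue;

    const statuses = svc.serverStatus || {};
    for (const url of Object.keys(statuses)) {
      if (statuses[url] === "UP") {
        summary.up.push(url);
      } else {
        summary.down.push(url);
      }
    }
  }
  return summary;
}

// Made-up sample input for illustration only.
const sampleSummary = summarizeServerHealth(
  [
    {
      name: "service-do-dev@redis",
      serverStatus: {
        "http://10.3.0.33:3005": "UP",
        "http://10.1.0.33:3005": "DOWN",
      },
    },
    { name: "other-service@redis", serverStatus: { "http://10.9.9.9:80": "UP" } },
  ],
  "service-do-dev"
);
```

For the sample input, `sampleSummary.up` holds the primary URL and `sampleSummary.down` holds the secondary URL; the unrelated service is ignored.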
Check Traefik Logs
ssh root@10.3.3.3 "docker logs traefik --tail 20"
Look for:
- ✅ Server status changed - Health check updates
- ❌ middleware does not exist - Configuration errors
- ❌ service does not exist - Missing services
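When scanning longer log output, the three patterns above can be matched mechanically. This is a minimal sketch that only does substring matching on the phrases listed; `classifyLogLine` and `countSignals` are hypothetical names, and real Traefik log lines may wrap these phrases in additional fields.

```typescript
// Hypothetical log triage helper based on the three phrases listed above.
type LogSignal = "health-update" | "config-error" | "missing-service" | "other";

function classifyLogLine(line: string): LogSignal {
  if (line.indexOf("Server status changed") !== -1) return "health-update";
  if (line.indexOf("middleware does not exist") !== -1) return "config-error";
  if (line.indexOf("service does not exist") !== -1) return "missing-service";
  return "other";
}

// Tallies how often each kind of line appears in a batch of log lines.
function countSignals(lines: string[]): Record<LogSignal, number> {
  const counts: Record<LogSignal, number> = {
    "health-update": 0,
    "config-error": 0,
    "missing-service": 0,
    other: 0,
  };
  for (const line of lines) {
    counts[classifyLogLine(line)] += 1;
  }
  return counts;
}
```

A non-zero `config-error` or `missing-service` count points to a sync problem rather than a server outage.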
Check Redis Configuration
# List all health check keys
ssh root@10.3.3.3 "docker exec traefik-redis redis-cli -a 'PASSWORD' KEYS '*healthCheck*'"
# Check specific health check values
ssh root@10.3.3.3 "docker exec traefik-redis redis-cli -a 'PASSWORD' GET 'traefik/http/services/service-do-dev/loadBalancer/healthCheck/path'"
Common Issues & Solutions
Issue: Health checks not working
# Solution: Re-sync configuration
1. Go to dashboard → Sync to Traefik
2. Check Redis has health check keys
3. Restart Traefik if needed: docker restart traefik
Issue: Both servers marked unhealthy
# Solution: Check server status
1. Test servers directly: curl http://10.3.0.33:3005/
2. Check server logs for errors
3. Verify network connectivity
Issue: Failover too slow/fast
# Solution: Adjust health check timing
1. Modify interval/timeout in Convex upstream config
2. Re-sync to apply changes
Development Workflow
Starting Development
- Start your primary dev server: 10.3.0.33:3005
- Traefik automatically detects it's healthy
- Traffic routes to your active server
- Develop normally at https://do.dev
Server Switching
- Start secondary server: 10.1.0.33:3005
- Stop primary server: 10.3.0.33:3005
- Traefik automatically switches traffic
- No manual configuration needed
Best Practices
- Always test both servers before important changes
- Monitor health check logs during critical development
- Use different ports if running both servers locally
- Keep servers in sync with same codebase
Configuration Files
Key Files Modified
- apps/dodev/lib/traefik-redis-client.ts - Added health check storage
- packages/convex-local/convex/proxy.ts - Upstream health check schema
- apps/dodev/scripts/check-health.sh - Health monitoring script
Redis Schema
traefik/http/services/service-do-dev/loadBalancer/
├── servers/0/url = "http://10.3.0.33:3005"
├── servers/1/url = "http://10.1.0.33:3005"
└── healthCheck/
    ├── path = "/"
    ├── interval = "30s"
    ├── timeout = "5s"
    ├── scheme = "http"
    └── method = "GET"
Troubleshooting Commands
# Quick health check
./scripts/check-health.sh
# Test individual servers
curl -H "Host: do.dev" http://10.3.0.33:3005/
curl -H "Host: do.dev" http://10.1.0.33:3005/
# Check Traefik routing
ssh root@10.3.3.3 "curl -s http://localhost:8080/api/http/services"
# Monitor real-time logs
ssh root@10.3.3.3 "docker logs traefik -f"
# Force configuration refresh
# Go to dashboard → Sync to Traefik
✅ IMPLEMENTATION COMPLETE - August 2, 2025
Problem Solved
Issue: https://do.dev was returning 502 Bad Gateway errors due to load balancing between healthy (10.3.0.33:3005) and unreachable (10.1.0.33:3005) servers without health checks.
Solution Implemented
- Diagnosed with Playwright: Identified 10.1.0.33:3005 as completely unreachable while 10.3.0.33:3005 works perfectly
- Fixed Configuration: Disabled unreachable server in Convex database (isActive: false)
- Enabled Health Checks: Pushed complete health check configuration to Redis/Traefik
- Verified Success: https://do.dev now loads perfectly without bad gateway errors
Current Status
- ✅ Health checks active: / endpoint every 30s with 5s timeout
- ✅ Only healthy server: 10.3.0.33:3005 receives all traffic
- ✅ Bad gateway resolved: https://do.dev working perfectly
- ✅ Automatic failover ready: When 10.1.0.33 comes online, set isActive: true and sync
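Re-enabling the secondary comes down to flipping one flag in the upstream configuration and syncing again. A minimal sketch of that change as a pure function is shown below; `setAddressActive` is a hypothetical helper, and how the updated object is written back to Convex is not covered here.

```typescript
// Hypothetical helper: returns a copy of the upstream with one address toggled.
interface UpstreamAddress {
  address: string;
  port: number;
  isActive: boolean;
}

interface Upstream {
  name: string;
  addresses: UpstreamAddress[];
}

function setAddressActive(
  upstream: Upstream,
  address: string,
  isActive: boolean
): Upstream {
  return {
    ...upstream,
    // Copy every entry; only the matching address gets the new flag.
    addresses: upstream.addresses.map((a) =>
      a.address === address ? { ...a, isActive } : { ...a }
    ),
  };
}

const currentUpstream: Upstream = {
  name: "DoDev Upstream",
  addresses: [
    { address: "10.3.0.33", port: 3005, isActive: true },
    { address: "10.1.0.33", port: 3005, isActive: false },
  ],
};

// Once 10.1.0.33 is reachable again:
const reEnabled = setAddressActive(currentUpstream, "10.1.0.33", true);
```

The original object is left untouched, so the change can be reviewed before it is saved and synced to Traefik.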
Key Files Created/Modified
- scripts/simple-sync.js - Direct Redis configuration script
- scripts/check-health.sh - Health monitoring script
- lib/traefik-redis-client.ts - Updated health check storage
- docs/AUTOMATIC_FAILOVER_GUIDE.md - This documentation
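Manual server management has been replaced with automatic failover based on Traefik health checks; the only remaining manual step is re-enabling 10.1.0.33 (isActive: true, then sync) once it is reachable.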