Automatic Failover Guide - Dual Dev Server Setup

This guide documents the automatic failover system for dual development servers using Traefik health checks.

Overview

The system maintains two development servers that automatically switch traffic based on availability:

  • Primary: 10.3.0.33:3005 (main dev server)
  • Secondary: 10.1.0.33:3005 (backup dev server)

Traefik automatically:

  • Health checks both servers every 30 seconds
  • Routes traffic only to healthy servers
  • Removes unhealthy servers from load balancing
  • Restores servers when they come back online
  • Load balances when both are healthy

Architecture

User Request (https://do.dev)

    Traefik Proxy (10.3.3.3)
    - Health checks every 30s
    - Automatic failover

┌─────────────────────────────────┐
│  Load Balancer Decision Tree    │
├─────────────────────────────────┤
│ Both Healthy   → Load Balance   │
│ Primary Only   → 10.3.0.33      │
│ Secondary Only → 10.1.0.33      │
│ Both Down      → 503 Error      │
└─────────────────────────────────┘

   Dev Server 1: 10.3.0.33:3005 ✅
   Dev Server 2: 10.1.0.33:3005 ❌
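
The decision tree above can be written as a small function. This is a minimal sketch that models the table only; it is not how Traefik implements routing internally, and the names `Routing` and `decideRouting` are hypothetical.

```typescript
// Hypothetical model of the decision table; names are illustrative only.
type Routing =
  | { kind: "load-balance"; targets: string[] }
  | { kind: "single"; target: string }
  | { kind: "unavailable"; status: number };

const PRIMARY = "10.3.0.33:3005";
const SECONDARY = "10.1.0.33:3005";

function decideRouting(primaryHealthy: boolean, secondaryHealthy: boolean): Routing {
  if (primaryHealthy && secondaryHealthy) {
    // Both healthy: spread requests across both servers.
    return { kind: "load-balance", targets: [PRIMARY, SECONDARY] };
  }
  if (primaryHealthy) return { kind: "single", target: PRIMARY };
  if (secondaryHealthy) return { kind: "single", target: SECONDARY };
  // Both down: no healthy backend is left, so the client sees a 503.
  return { kind: "unavailable", status: 503 };
}
```

With the statuses shown above (server 1 up, server 2 down), `decideRouting(true, false)` selects the single-target branch for 10.3.0.33:3005.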

Health Check Configuration

Health Check Settings:

  • Path: / (root endpoint)
  • Interval: 30s (checks every 30 seconds)
  • Timeout: 5s (fails if no response in 5s)
  • Expected Status: 200 OK
  • Method: GET
  • Headers: Host: do.dev
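
To make the pass/fail rule behind these settings concrete, here is a rough sketch of how a single probe result could be judged against them. The types and the `isHealthy` function are assumptions made for illustration; Traefik performs this evaluation itself, and the sketch does not send any requests.

```typescript
// Illustrative only: mirrors the settings listed above.
interface HealthCheckSettings {
  path: string;
  intervalMs: number;
  timeoutMs: number;
  expectStatus: number;
  method: string;
  headers: Record<string, string>;
}

const settings: HealthCheckSettings = {
  path: "/",
  intervalMs: 30000,
  timeoutMs: 5000,
  expectStatus: 200,
  method: "GET",
  headers: { Host: "do.dev" },
};

// Outcome of one probe; status is null when no response arrived at all.
interface ProbeResult {
  status: number | null;
  elapsedMs: number;
}

function isHealthy(result: ProbeResult, s: HealthCheckSettings): boolean {
  if (result.status === null) return false; // connection refused or unreachable
  if (result.elapsedMs > s.timeoutMs) return false; // slower than the timeout
  return result.status === s.expectStatus; // anything other than 200 fails
}
```

Under these assumptions a fast 200 passes, while a 502, a response slower than 5 seconds, or no response at all counts as a failed check.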

Setup Process

1. Initial Configuration

The upstream servers are configured in Convex with health checks enabled:

// Upstream configuration (already done)
{
  name: "DoDev Upstream",
  addresses: [
    { address: "10.3.0.33", port: 3005, isActive: true },
    { address: "10.1.0.33", port: 3005, isActive: false } // Disabled due to being unreachable
  ],
  healthCheck: {
    path: "/",
    interval: "30s", 
    timeout: "5s",
    expectStatus: 200
  }
}

2. Sync Configuration ✅ COMPLETED

Status: ✅ Configuration successfully pushed to Redis on August 2, 2025

The health check configuration has been successfully synchronized:

  • Health checks enabled: Path /, interval 30s, timeout 5s
  • Only healthy server configured: 10.3.0.33:3005
  • Unreachable server excluded: 10.1.0.33:3005 is disabled in Convex (isActive: false), so it is not part of load balancing
  • Result: https://do.dev now works without bad gateway errors

How it was completed:

  1. Used Playwright to diagnose 502 Bad Gateway issue
  2. Found 10.1.0.33:3005 unreachable, 10.3.0.33:3005 healthy
  3. Disabled unreachable server in Convex (isActive: false)
  4. Ran sync script to push health check configuration to Redis
  5. Verified configuration with health check script
  6. Confirmed https://do.dev working perfectly

3. Verify Health Checks

Run the health check verification script:

./scripts/check-health.sh

Expected output:

📋 Health Check Keys in Redis:
✅ traefik/http/services/service-do-dev/loadBalancer/healthCheck/path
✅ traefik/http/services/service-do-dev/loadBalancer/healthCheck/interval
✅ traefik/http/services/service-do-dev/loadBalancer/healthCheck/timeout

📊 do.dev Service Health Check Details:
✅ path: /
✅ interval: 30s  
✅ timeout: 5s
✅ scheme: http
✅ method: GET
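
The first part of that output amounts to checking that a few expected keys exist. The sketch below shows the idea as a pure function. It is not the contents of check-health.sh; the function name and the choice of three required fields (taken from the key list above) are assumptions.

```typescript
// Fields taken from the "Health Check Keys in Redis" list above.
const REQUIRED_FIELDS = ["path", "interval", "timeout"];

// Returns the health check keys that are absent from the given key list.
function missingHealthCheckKeys(existingKeys: string[], service: string): string[] {
  const prefix = `traefik/http/services/${service}/loadBalancer/healthCheck/`;
  return REQUIRED_FIELDS
    .map((field) => prefix + field)
    .filter((key) => existingKeys.indexOf(key) === -1);
}

// Example: the timeout key has not been synced yet.
const missing = missingHealthCheckKeys(
  [
    "traefik/http/services/service-do-dev/loadBalancer/healthCheck/path",
    "traefik/http/services/service-do-dev/loadBalancer/healthCheck/interval",
  ],
  "service-do-dev",
);
console.log(missing);
```

An empty result means all three keys are present; a non-empty result suggests the sync has not run or did not complete.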

Testing Failover

Test 1: Normal Operation (Both Healthy)

# Both servers should be receiving traffic
curl -H "Host: do.dev" https://do.dev
# Should work consistently

Test 2: Primary Server Failure

# Stop primary server (10.3.0.33:3005)
ssh 10.3.0.33 "pm2 stop dodev" # or however you stop it

# Wait 30-60 seconds for health check to detect failure
sleep 60

# Traffic should automatically route to secondary
# (requires 10.1.0.33 to be active in the upstream: isActive: true, then re-sync)
curl -H "Host: do.dev" https://do.dev
# Should still work via 10.1.0.33:3005

Test 3: Server Recovery

# Start primary server again
ssh 10.3.0.33 "pm2 start dodev"

# Wait 30-60 seconds for health check to detect recovery  
sleep 60

# Traffic should now load balance between both servers
curl -H "Host: do.dev" https://do.dev
# Should work via either server

Monitoring & Troubleshooting

Check Traefik Dashboard

Check Traefik Logs

ssh root@10.3.3.3 "docker logs traefik --tail 20"

Look for:

  • Server status changed - Health check updates
  • middleware does not exist - Configuration errors
  • service does not exist - Missing services
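
When scanning longer log output, the three phrases above can be matched programmatically. The following is a small sketch under the assumption that a case-insensitive substring match is enough; the category names and `classifyLogLine` are hypothetical, and the real log format may differ.

```typescript
type LogCategory = "health-check-update" | "configuration-error" | "missing-service";

// Phrases from the list above, mapped to what they indicate.
const LOG_PATTERNS: Array<{ needle: string; category: LogCategory }> = [
  { needle: "server status changed", category: "health-check-update" },
  { needle: "middleware does not exist", category: "configuration-error" },
  { needle: "service does not exist", category: "missing-service" },
];

// Returns the category of a log line, or null if it matches none of the phrases.
function classifyLogLine(line: string): LogCategory | null {
  const lower = line.toLowerCase();
  for (const pattern of LOG_PATTERNS) {
    if (lower.indexOf(pattern.needle) !== -1) return pattern.category;
  }
  return null;
}
```

Lines that return null can be skipped, which keeps the review focused on health check transitions and configuration problems.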

Check Redis Configuration

# List all health check keys
ssh root@10.3.3.3 "docker exec traefik-redis redis-cli -a 'PASSWORD' KEYS '*healthCheck*'"

# Check specific health check values
ssh root@10.3.3.3 "docker exec traefik-redis redis-cli -a 'PASSWORD' GET 'traefik/http/services/service-do-dev/loadBalancer/healthCheck/path'"

Common Issues & Solutions

Issue: Health checks not working

# Solution: Re-sync configuration
1. Go to dashboard → Sync to Traefik
2. Check Redis has health check keys
3. Restart Traefik if needed: docker restart traefik

Issue: Both servers marked unhealthy

# Solution: Check server status
1. Test servers directly: curl http://10.3.0.33:3005/
2. Check server logs for errors
3. Verify network connectivity

Issue: Failover too slow/fast

# Solution: Adjust health check timing
1. Modify interval/timeout in Convex upstream config
2. Re-sync to apply changes
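
When tuning these values it helps to estimate how long a failure can go unnoticed. As a rough sketch: a server that fails just after passing a check is not probed again for one interval, and that probe can take up to the timeout to fail, so the worst case is about interval plus timeout. This assumes a single failed check is enough to mark a server unhealthy and that durations use the plain seconds form shown in this guide (such as "30s"); both helper functions are hypothetical.

```typescript
// Parses durations of the form "<integer>s", as used in this guide.
function parseSeconds(duration: string): number {
  const match = /^(\d+)s$/.exec(duration);
  if (!match) throw new Error(`unsupported duration: ${duration}`);
  return parseInt(match[1], 10);
}

// Approximate upper bound on the time between a failure and its detection.
function worstCaseDetectionSeconds(interval: string, timeout: string): number {
  return parseSeconds(interval) + parseSeconds(timeout);
}

console.log(worstCaseDetectionSeconds("30s", "5s")); // 35
```

With the current settings the estimate is 35 seconds, which is consistent with the 30-60 second waits used in the failover tests.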

Development Workflow

Starting Development

  1. Start your primary dev server: 10.3.0.33:3005
  2. Traefik automatically detects it's healthy
  3. Traffic routes to your active server
  4. Develop normally at https://do.dev

Server Switching

  1. Start secondary server: 10.1.0.33:3005
  2. Stop primary server: 10.3.0.33:3005
  3. Traefik automatically switches traffic
  4. No manual configuration needed

Best Practices

  • Always test both servers before important changes
  • Monitor health check logs during critical development
  • Use different ports if running both servers locally
  • Keep servers in sync with same codebase

Configuration Files

Key Files Modified

  • apps/webs/dodev/lib/traefik-redis-client.ts - Added health check storage
  • packages/convex-local/convex/proxy.ts - Upstream health check schema
  • apps/webs/dodev/scripts/check-health.sh - Health monitoring script

Redis Schema

traefik/http/services/service-do-dev/loadBalancer/
├── servers/0/url = "http://10.3.0.33:3005"
├── servers/1/url = "http://10.1.0.33:3005"  
└── healthCheck/
    ├── path = "/"
    ├── interval = "30s"
    ├── timeout = "5s"
    ├── scheme = "http"
    └── method = "GET"
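
To show how the upstream configuration relates to this layout, here is a minimal sketch that flattens an upstream object into key/value pairs. It is not the code in traefik-redis-client.ts; the types, the `toTraefikKeys` name, the default scheme and method, and the rule that inactive addresses are skipped are all assumptions made for illustration.

```typescript
interface UpstreamAddress {
  address: string;
  port: number;
  isActive: boolean;
}

interface Upstream {
  addresses: UpstreamAddress[];
  healthCheck: { path: string; interval: string; timeout: string };
}

// Flattens an upstream into [key, value] pairs following the schema above.
function toTraefikKeys(
  service: string,
  upstream: Upstream,
  scheme: string = "http",
  method: string = "GET",
): Array<[string, string]> {
  const base = `traefik/http/services/${service}/loadBalancer`;
  const pairs: Array<[string, string]> = [];
  // Assumption: only active addresses are written, indexed from 0.
  const active = upstream.addresses.filter((a) => a.isActive);
  active.forEach((a, i) => {
    pairs.push([`${base}/servers/${i}/url`, `${scheme}://${a.address}:${a.port}`]);
  });
  pairs.push([`${base}/healthCheck/path`, upstream.healthCheck.path]);
  pairs.push([`${base}/healthCheck/interval`, upstream.healthCheck.interval]);
  pairs.push([`${base}/healthCheck/timeout`, upstream.healthCheck.timeout]);
  pairs.push([`${base}/healthCheck/scheme`, scheme]);
  pairs.push([`${base}/healthCheck/method`, method]);
  return pairs;
}

const pairs = toTraefikKeys("service-do-dev", {
  addresses: [
    { address: "10.3.0.33", port: 3005, isActive: true },
    { address: "10.1.0.33", port: 3005, isActive: false },
  ],
  healthCheck: { path: "/", interval: "30s", timeout: "5s" },
});
```

With the secondary disabled, this yields a single servers/0/url entry plus the five health check keys; servers/1/url would appear only once both addresses are active.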

Troubleshooting Commands

# Quick health check
./scripts/check-health.sh

# Test individual servers
curl -H "Host: do.dev" http://10.3.0.33:3005/
curl -H "Host: do.dev" http://10.1.0.33:3005/

# Check Traefik routing
ssh root@10.3.3.3 "curl -s http://localhost:8080/api/http/services"

# Monitor real-time logs
ssh root@10.3.3.3 "docker logs traefik -f"

# Force configuration refresh
# Go to dashboard → Sync to Traefik

✅ IMPLEMENTATION COMPLETE - August 2, 2025

Problem Solved

Issue: https://do.dev returned 502 Bad Gateway errors because Traefik, with no health checks configured, load balanced across both the healthy server (10.3.0.33:3005) and the unreachable one (10.1.0.33:3005).

Solution Implemented

  1. Diagnosed with Playwright: Identified 10.1.0.33:3005 as completely unreachable while 10.3.0.33:3005 works perfectly
  2. Fixed Configuration: Disabled unreachable server in Convex database (isActive: false)
  3. Enabled Health Checks: Pushed complete health check configuration to Redis/Traefik
  4. Verified Success: https://do.dev now loads perfectly without bad gateway errors

Current Status

  • Health checks active: / endpoint every 30s with 5s timeout
  • Only healthy server: 10.3.0.33:3005 receives all traffic
  • Bad gateway resolved: https://do.dev working perfectly
  • Automatic failover ready: When 10.1.0.33 comes online, set isActive: true and sync

Key Files Created/Modified

  • scripts/simple-sync.js - Direct Redis configuration script
  • scripts/check-health.sh - Health monitoring script
  • lib/traefik-redis-client.ts - Updated health check storage
  • docs/AUTOMATIC_FAILOVER_GUIDE.md - This documentation

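With health checks in place, Traefik handles failover between the active servers automatically; the only manual step left is re-enabling 10.1.0.33 in Convex (isActive: true, then sync) once it is reachable.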
