Multi-Upstream Failover Architecture

Overview

The system provides automatic failover between two upstream servers (10.1.0.33 and 10.3.0.33) using Caddy's health check and load balancing features.

Architecture Components

1. Database Schema

  • caddy_upstreams: Primary upstream configuration (single address per upstream)
  • caddy_upstream_addresses: Multiple addresses per upstream with priority ordering (see the schema sketch after this list)
    • Each upstream can have multiple IP:port combinations
    • Priority field determines failover order (1 = primary, 2 = secondary, etc.)
    • Health status tracked per address
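
A minimal sketch of how these two tables might be declared in a Convex schema (field names beyond the priority and health fields are assumptions, not the project's actual schema):

import { defineSchema, defineTable } from 'convex/server'
import { v } from 'convex/values'

export default defineSchema({
  // Primary upstream configuration (one logical upstream per row)
  caddy_upstreams: defineTable({
    name: v.string(),
    address: v.string(), // original single-address field
  }),
  // One row per IP:port; priority drives failover order
  caddy_upstream_addresses: defineTable({
    upstreamId: v.id('caddy_upstreams'),
    address: v.string(),  // e.g. "10.1.0.33:3005"
    priority: v.number(), // 1 = primary, 2 = secondary, ...
    healthy: v.boolean(), // health status tracked per address
  }).index('by_upstream', ['upstreamId']),
})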

2. Data Flow

  1. Database Query (getConfigForSync in convex/caddy.ts):
    • Fetches routes and upstreams
    • For each upstream, queries the caddy_upstream_addresses table
    • Adds an addresses array to each upstream object
  2. API Route (/api/caddy/sync/route.ts):
    • Calls getConfigForSync to fetch the configuration
    • Passes the routes, with their upstream data, to CaddyClient
  3. CaddyClient (lib/caddy-client.ts):
    • loadConfiguration() accepts routes with upstream addresses
    • buildCaddyRoute() builds multiple upstreams from the addresses array (sketched after this list)
    • Configures health checks and load balancing
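
A sketch of the transformation buildCaddyRoute() performs when mapping an upstream's addresses onto Caddy's reverse_proxy handler (the helper name and types here are illustrative, not copied from lib/caddy-client.ts):

interface UpstreamAddress {
  address: string  // e.g. "10.1.0.33:3005"
  priority: number // 1 = primary, 2 = secondary, ...
}

// Sort by priority so the "first" selection policy prefers the primary,
// then emit one Caddy upstream entry per address
function buildReverseProxyHandler(addresses: UpstreamAddress[]) {
  const upstreams = [...addresses]
    .sort((a, b) => a.priority - b.priority)
    .map((a) => ({ dial: a.address }))

  return {
    handler: 'reverse_proxy',
    upstreams,
    health_checks: {
      active: { path: '/', interval: '30s', timeout: '5s', expect_status: 200 },
      passive: { fail_duration: '30s', max_fails: 3, unhealthy_status: [502, 503, 504] },
    },
    load_balancing: {
      selection_policy: { policy: 'first' },
      try_duration: '10s',
      try_interval: '250ms',
    },
  }
}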

3. Expected Caddy Configuration

{
  "handle": [{
    "handler": "reverse_proxy",
    "upstreams": [
      { "dial": "10.1.0.33:3005" },
      { "dial": "10.3.0.33:3005" }
    ],
    "health_checks": {
      "active": {
        "path": "/",
        "interval": "30s",
        "timeout": "5s",
        "expect_status": 200
      },
      "passive": {
        "fail_duration": "30s",
        "max_fails": 3,
        "unhealthy_status": [502, 503, 504]
      }
    },
    "load_balancing": {
      "selection_policy": {
        "policy": "first"  // Use first healthy upstream
      },
      "try_duration": "10s",
      "try_interval": "250ms"
    }
  }]
}

Resolution

Root Cause

The data transformation was working correctly, but the CaddyClient was replacing the entire server configuration (which also carries server-level fields such as listen) instead of just the routes, which caused conflicts.

Fix Applied

Modified caddy-client.ts to update only the routes instead of the entire server:

// Instead of PUT to /config/apps/http/servers/srv2
// Use PUT to /config/apps/http/servers/srv2/routes
const httpsResponse = await fetch(`${this.apiUrl}/config/apps/http/servers/srv2/routes`, {
  method: 'PUT',
  headers: this.headers,
  body: JSON.stringify(httpsServer.routes),
})
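
Not part of the applied fix, but a read-back of the same path can confirm what Caddy actually stored after the sync; a minimal sketch, assuming the same apiUrl and headers as in the snippet above:

// Fetch the routes Caddy now holds at that path and sanity-check the count
const readBack = await fetch(`${this.apiUrl}/config/apps/http/servers/srv2/routes`, {
  headers: this.headers,
})
const appliedRoutes = await readBack.json()
if (!Array.isArray(appliedRoutes) || appliedRoutes.length !== httpsServer.routes.length) {
  throw new Error(`Route sync mismatch: sent ${httpsServer.routes.length}, Caddy reports ${appliedRoutes?.length ?? 0}`)
}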

Verification

All routes now have:

  • ✅ Multiple upstream addresses (10.1.0.33 and 10.3.0.33)
  • ✅ Active health checks (30s interval)
  • ✅ Passive health checks (upstream marked unhealthy after 3 failed responses)
  • ✅ Load balancing with "first" policy (failover)
  • ✅ All sites responding with HTTP 200

How Failover Works

  1. Caddy tries the first upstream (10.1.0.33)
  2. If it fails or times out, Caddy marks it unhealthy
  3. Traffic automatically fails over to the second upstream (10.3.0.33)
  4. Active health checks continue every 30s
  5. Traffic returns to the primary once it becomes healthy again (see the status-check sketch after this list)
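
Caddy's admin API exposes the current upstream pool at GET /reverse_proxy/upstreams, which is a convenient way to watch failover happen. A minimal sketch, assuming the admin API listens on its default address (http://localhost:2019, adjust for the actual deployment):

// Print each configured upstream with its current failure and request counters
const adminUrl = process.env.CADDY_ADMIN_URL ?? 'http://localhost:2019'

async function printUpstreamStatus() {
  const res = await fetch(`${adminUrl}/reverse_proxy/upstreams`)
  if (!res.ok) {
    throw new Error(`Admin API returned ${res.status}`)
  }
  const upstreams: Array<{ address: string; fails?: number; num_requests?: number }> = await res.json()
  for (const u of upstreams) {
    console.log(`${u.address}  fails=${u.fails ?? 0}  in-flight=${u.num_requests ?? 0}`)
  }
}

printUpstreamStatus().catch(console.error)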
