See also: Production Deployment for the two-server production topology.

Local Development

docker compose -f docker-compose.dev.yml up
That’s it. Vault is automatically started, configured, and seeded. No manual .env setup needed.

What Happens on Startup

  1. Infrastructure starts: PostgreSQL, Redis, LocalStack (S3), Temporal
  2. Vault starts in dev mode (in-memory, auto-unseal, no TLS)
  3. vault-dev-init.sh runs via the vault-init sidecar:
    • Enables KV v2 and Transit secret engines
    • Creates AppRole hoodcloud-dev with permissive dev policy
    • Seeds master encryption key, JWT RSA keys, ECDSA signing key, X25519 sealed box keypair, dev passwords
    • Creates Transit key hoodcloud-master (aes256-gcm96)
    • Writes role-id and secret-id to /vault/config/ shared volume
  4. Control plane services start, reading credentials from the shared volume
  5. The whole init sequence is idempotent and completes in ~5 seconds
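
The credential hand-off in step 4 can be sanity-checked from the host: the sidecar's output is just two non-empty files on the shared volume. A minimal check, simulated locally here (in the dev stack the directory would be wherever the /vault/config/ volume is mounted):

```shell
# Check that vault-dev-init.sh produced non-empty AppRole credentials.
creds_ready() {
  [ -s "$1/role-id" ] && [ -s "$1/secret-id" ]
}

# Simulated locally; values are placeholders, not real AppRole credentials.
dir=$(mktemp -d)
printf 'example-role-id\n'   > "$dir/role-id"
printf 'example-secret-id\n' > "$dir/secret-id"

creds_ready "$dir" && echo "credentials present"
```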

Vault UI

Dev Notes

  • Data is ephemeral (Vault dev mode uses in-memory storage)
  • No TLS (VAULT_TLS_SKIP_VERIFY=true on all services)
  • The same vault-dev-init.sh is used in E2E tests (tests/e2e/docker-compose.e2e.yml) on port 8201

Single-Server Staging

For staging or internal testing, the entire stack (control plane + observability) runs on a single VM.

Architecture

Prerequisites

  1. Hetzner Cloud VM: cx31 minimum (4 vCPU, 8GB RAM), Ubuntu 22.04
  2. AWS resources: S3 buckets + DynamoDB table (see Production Deployment - AWS Resources)
  3. DNS: api., auth., grafana. subdomains pointing to the VM

Setup

# Server dependencies
apt update && apt upgrade -y
curl -fsSL https://get.docker.com | sh
systemctl enable docker && systemctl start docker
apt install docker-compose-plugin -y

# Clone and configure
git clone https://github.com/hoodrunio/hoodcloud /opt/hoodcloud
cd /opt/hoodcloud/infrastructure/docker
cp .env.production.example .env
# Edit .env with your values (see environment-variables)
Required .env values: see Environment Variables.

Generate Certificates

cd /opt/hoodcloud
./scripts/generate-certs.sh  # ops-agent gRPC certs

Deploy

# Option 1: Script
./scripts/deploy-control-plane.sh
# Options: --build-only, --no-build, --migrate

# Option 2: Manual
cd infrastructure/docker
docker compose build

# Infrastructure
docker compose up -d postgres redis nats
sleep 10
docker compose up -d temporal temporal-init temporal-ui
sleep 15

# Run migrations BEFORE starting application services
docker compose run --rm migrate

# Application
docker compose up -d auth-server api-server agent-gateway orchestrator health-evaluator

# Observability
docker compose up -d prometheus victoria-metrics loki grafana alertmanager
docker compose up -d caddy
Important: Migrations must run before application services start. The cmd/migrate binary is idempotent — safe to run multiple times. Services no longer run migrations at startup.

Verify

curl https://api.hoodcloud.io/health      # {"status":"healthy"}
curl https://auth.hoodcloud.io/health     # {"status":"healthy"}
docker compose ps                          # All services running
Note: /health returns only top-level status. Use /readyz for detailed component health.
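
For scripted deploys, the two curl checks above can be wrapped in a retry loop. A sketch, not part of the deploy scripts; the body match assumes the {"status":"healthy"} response shown above:

```shell
# Returns 0 when the /health body reports healthy.
body_healthy() {
  printf '%s' "$1" | grep -q '"status":"healthy"'
}

# Polls a URL until healthy, or gives up after $tries attempts.
wait_healthy() {
  url=$1; tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    body=$(curl -fsS "$url" 2>/dev/null) && body_healthy "$body" && return 0
    i=$((i + 1)); sleep 2
  done
  return 1
}

body_healthy '{"status":"healthy"}' && echo "would pass"
```

Usage would be e.g. `wait_healthy https://api.hoodcloud.io/health && echo "api up"`.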

Internal Ports

Service       Port   Purpose
PostgreSQL    5432   Database
Redis         6379   Cache
Temporal      7233   Workflow engine
Temporal UI   8088   Workflow dashboard
Prometheus    9090   Metrics
NATS          4222   Event streaming

Day-to-Day Operations

View Logs

docker compose logs -f api-server          # Single service
docker compose logs --tail 100 api-server  # Last 100 lines
docker compose logs -f                      # All services

Restart Services

docker compose restart api-server          # Single service (keeps env)
docker compose up -d api-server            # Recreate (reloads .env)
Important: restart does not reload .env changes. Use up -d to pick up env var updates.

Update Deployment

cd /opt/hoodcloud
git pull

# Run migrations first (idempotent, safe to run even if no new migrations)
cd infrastructure/docker
docker compose run --rm migrate

# Then restart services
./scripts/deploy-control-plane.sh

Release Chain Configs

When chain config files are updated in the hoodcloud-chain-configs repo:
cd hoodcloud-chain-configs
git tag v1.1.0
git push origin v1.1.0
This triggers the GitHub Actions workflow:
  1. Validates config files against schemas
  2. Creates tarball + checksums
  3. Uploads to S3 (s3://hoodcloud-chain-configs/v1.1.0/)
  4. Copies to latest/ for auto-update mode
  5. Creates GitHub release
Auto-reload (default): Services using CHAIN_PROFILE_VERSION=latest poll checksums.txt at CHAINS_CHECK_INTERVAL (default: 1m). Config swaps atomically, no restart needed.
docker compose logs orchestrator | grep "Chain configs loaded"
Pinned version (controlled rollouts):
# In .env
CHAIN_PROFILE_VERSION=v1.1.0
docker compose restart api-server orchestrator health-evaluator

# Rollback
CHAIN_PROFILE_VERSION=v1.0.2
docker compose restart api-server orchestrator health-evaluator
Ops-agent config distribution:
# Update CONFIG_BUNDLE_VERSION in .env, then:
docker compose restart orchestrator
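
The auto-reload mechanics described above amount to "poll checksums.txt, reload when it differs from the last-seen copy." A local simulation of that comparison (temp files stand in for the S3 fetch; the services' actual polling code may differ):

```shell
remote=$(mktemp); seen=$(mktemp)

# True when the freshly fetched checksums differ from the last-seen copy.
configs_changed() {
  ! cmp -s "$1" "$2"
}

echo "abc123  chain-configs.tar.gz" > "$remote"
cp "$remote" "$seen"
configs_changed "$remote" "$seen" && echo changed || echo unchanged   # unchanged

echo "def456  chain-configs.tar.gz" > "$remote"   # a new release landed
configs_changed "$remote" "$seen" && echo changed || echo unchanged   # changed
```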

Database Backup & Restore

# Backup
docker exec hoodcloud-postgres pg_dump -U hoodcloud hoodcloud > backup-$(date +%Y%m%d).sql

# Restore
docker exec -i hoodcloud-postgres psql -U hoodcloud hoodcloud < backup.sql
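
Backups from the command above accumulate; a simple retention pass can prune them. A sketch, not part of the deploy scripts (14 days is an arbitrary choice, and the directory here is a temp stand-in):

```shell
# Rotate backup-YYYYMMDD.sql files, keeping the last 14 days.
backup_dir=$(mktemp -d)   # stand-in for your real backup directory

touch -d '2020-01-01' "$backup_dir/backup-20200101.sql"   # stale backup
touch "$backup_dir/backup-$(date +%Y%m%d).sql"            # today's backup

find "$backup_dir" -name 'backup-*.sql' -mtime +14 -delete
ls "$backup_dir"   # only today's file remains
```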

Secrets Rotation

After updating secrets in Vault:
# Services cache credentials for up to 5 minutes (VAULT_CACHE_TTL)
docker compose restart api-server auth-server orchestrator health-evaluator agent-gateway
See also: Vault Operations for secret rotation and AppRole credential management.

Upgrade Rollout Operations

Create and Start a Rollout

Two modes are available for rollout creation.

Auto mode (recommended): Provide upgrade_id — the control plane loads the upgrade manifest from chain-configs and auto-populates binary coordinates, state_compatibility, config_changes, and manifest_content_hash.
# Auto mode: manifest-based creation
curl -X POST https://api.hoodcloud.io/api/v1/rollouts \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "upgrade_id": "v0.22.0",
    "chain_profile_id": "celestia-mocha",
    "node_type": "full",
    "source_version": "v0.21.0",
    "strategy": "canary",
    "batch_size": 5,
    "canary_size": 2,
    "failure_threshold": 0.2,
    "health_wait_duration": "5m",
    "reason": "Security patch for consensus bug"
  }'
Manual mode: Omit upgrade_id — operator provides all fields explicitly.
# Manual mode: all fields provided
curl -X POST https://api.hoodcloud.io/api/v1/rollouts \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "chain_profile_id": "celestia-mocha",
    "node_type": "full",
    "binary_name": "celestia-node",
    "source_version": "v0.21.0",
    "target_version": "v0.22.0",
    "target_binary_url": "https://...",
    "target_binary_checksum": "sha256:...",
    "strategy": "canary",
    "batch_size": 5,
    "canary_size": 2,
    "failure_threshold": 0.2,
    "health_wait_duration": "5m",
    "reason": "Security patch for consensus bug"
  }'
# Start the rollout
curl -X POST https://api.hoodcloud.io/api/v1/rollouts/{id}/start \
  -H "Authorization: Bearer $TOKEN"
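
How failure_threshold gates a rollout is worth making concrete. Assuming it is the failed/total ratio above which the rollout aborts (the orchestrator's exact comparison may differ), the check reduces to integer arithmetic:

```shell
# Threshold given in percent to stay in integer math (0.2 -> 20).
should_abort() {
  failed=$1; total=$2; threshold_pct=$3
  [ $((failed * 100)) -gt $((total * threshold_pct)) ]
}

should_abort 1 10 20 && echo abort || echo continue   # 10% failed -> continue
should_abort 3 10 20 && echo abort || echo continue   # 30% failed -> abort
```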

Monitor Rollout Progress

# Get rollout status and progress counters
curl https://api.hoodcloud.io/api/v1/rollouts/{id} \
  -H "Authorization: Bearer $TOKEN"

# Per-node status (includes upgrade_phase)
curl https://api.hoodcloud.io/api/v1/rollouts/{id}/nodes \
  -H "Authorization: Bearer $TOKEN"

# Temporal workflow visibility
# Open Temporal UI -> search for workflow ID: rollout-{rollout_id}

Pause, Resume, Cancel

# Pause (completes current batch, then stops)
curl -X POST https://api.hoodcloud.io/api/v1/rollouts/{id}/pause \
  -H "Authorization: Bearer $TOKEN"

# Resume from paused state
curl -X POST https://api.hoodcloud.io/api/v1/rollouts/{id}/resume \
  -H "Authorization: Bearer $TOKEN"

# Cancel (stops new batches, lets in-flight nodes complete)
curl -X POST https://api.hoodcloud.io/api/v1/rollouts/{id}/cancel \
  -H "Authorization: Bearer $TOKEN"

Rollback

# Full rollback (all succeeded nodes reverted)
curl -X POST https://api.hoodcloud.io/api/v1/rollouts/{id}/rollback \
  -H "Authorization: Bearer $TOKEN"

# Retry a single failed node
curl -X POST https://api.hoodcloud.io/api/v1/rollouts/{id}/nodes/{nodeId}/retry \
  -H "Authorization: Bearer $TOKEN"

Multi-Binary Rollout (Rollout Groups)

For chains with multiple binaries (e.g., ethereum-holesky: geth + lighthouse):
# Create rollout group — auto mode (components reference upgrade_id)
curl -X POST https://api.hoodcloud.io/api/v1/rollout-groups \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "chain_profile_id": "ethereum-holesky",
    "node_type": "full",
    "failure_policy": "rollback_all",
    "desired_versions": {"geth": "v1.15.0", "lighthouse": "v5.2.0"},
    "component_order": [
      {"upgrade_id": "geth-v1.15.0", "binary_name": "geth", "node_type": "full", "version": "v1.15.0"},
      {"upgrade_id": "lighthouse-v5.2.0", "binary_name": "lighthouse", "node_type": "full", "version": "v5.2.0"}
    ],
    "strategy": "rolling",
    "batch_size": 5
  }'

# Start group (components upgrade sequentially)
curl -X POST https://api.hoodcloud.io/api/v1/rollout-groups/{id}/start \
  -H "Authorization: Bearer $TOKEN"

# Monitor group status
curl https://api.hoodcloud.io/api/v1/rollout-groups/{id}/status \
  -H "Authorization: Bearer $TOKEN"

Database Queries (Rollout Debugging)

-- Active rollouts
SELECT id, chain_profile_id, status, strategy, total_nodes, succeeded_nodes, failed_nodes
FROM rollouts WHERE status NOT IN ('completed', 'failed', 'canceled', 'rolled_back');

-- Per-node upgrade status for a rollout
SELECT rn.node_id, rn.status, rn.upgrade_phase, rn.error_message, rn.attempt_count
FROM rollout_nodes rn WHERE rn.rollout_id = '{rollout_id}';

-- Nodes stuck in maintenance (potential cleanup targets)
SELECT n.id, n.chain_profile_id, n.state, nhs.updated_at
FROM nodes n JOIN node_health_state nhs ON n.id = nhs.node_id
WHERE n.state = 'maintenance' AND nhs.updated_at < now() - interval '30 minutes';

-- Version drift detection
SELECT n.id, n.chain_profile_id, ncs.binary_version
FROM nodes n LEFT JOIN node_config_state ncs ON n.id = ncs.node_id
WHERE n.state NOT IN ('terminated', 'failed')
ORDER BY n.chain_profile_id, ncs.binary_version;

Rolling Uptime

The UptimeWorker runs inside the health-evaluator process — no separate deployment needed.

How It Works

  • State transitions are logged to node_state_log (append-only) via the StateLogHandler registered on CompositeTransitionHandler
  • The UptimeWorker materializes hourly uptime buckets (node_uptime_hourly) every 5 minutes
  • API endpoints read pre-computed buckets for fast queries

Configuration Defaults

No environment variables are required. Defaults are defined in internal/defaults/defaults.go:
Setting         Default   Description
Interval        5m        How often the worker computes buckets
BatchSize       500       Nodes processed per batch
GracePeriod     10m       Provisioning grace period (counts as uptime)
RetentionDays   90        Hourly buckets older than this are purged

API Endpoints

Both require nodes:read scope + ownership verification:
  • GET /api/v1/nodes/{nodeID}/uptime?window={24h|7d|30d} — Rolling uptime summary
  • GET /api/v1/nodes/{nodeID}/uptime/history?window={24h|7d|30d}&granularity={hourly|daily} — Per-period breakdown

Operational Characteristics

  • Idempotent: Safe to restart. The worker catches up from the last complete bucket.
  • Self-healing: Individual node failures are logged but don’t block other nodes.
  • Auto-purge: Buckets older than 90 days are automatically deleted.
  • No external cursor: Progress is implicit in is_complete flags on hourly buckets.
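
The rolling-uptime figure the API serves can be derived from the hourly buckets' uptime_seconds and downtime_seconds columns. The arithmetic below is illustrative, not the worker's actual implementation:

```shell
# Two-decimal uptime percentage from uptime/downtime seconds, integer math only.
uptime_pct() {
  up=$1; down=$2
  total=$((up + down))
  [ "$total" -eq 0 ] && { echo "n/a"; return; }
  echo $(( (up * 10000) / total )) | awk '{printf "%d.%02d\n", $1/100, $1%100}'
}

uptime_pct 3500 100   # 97.22
```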

Monitoring

The worker logs warnings on per-node failures:
docker compose logs health-evaluator | grep -i "uptime"
Key log patterns:
  • "Failed to compute uptime for node" — Per-node error (non-blocking)
  • "Failed to compute uptime batch" — Batch-level error (continues to next batch)

Database Queries (Debugging)

-- Check recent bucket computation
SELECT node_id, bucket_hour, uptime_seconds, downtime_seconds, is_complete
FROM node_uptime_hourly
ORDER BY computed_at DESC LIMIT 20;

-- State transitions for a node
SELECT state, entered_at, exited_at, trigger
FROM node_state_log
WHERE node_id = '{node_id}'
ORDER BY entered_at DESC LIMIT 20;

-- Bucket retention (rows older than 90 days)
SELECT COUNT(*) FROM node_uptime_hourly
WHERE bucket_hour < now() - interval '90 days';

Manual Bucket Recomputation

If a bug produced incorrect buckets, mark affected buckets for recomputation:
UPDATE node_uptime_hourly
SET is_complete = false
WHERE node_id = '{node_id}'
  AND bucket_hour BETWEEN '{start}' AND '{end}';
The worker recomputes these on its next tick.

Leader Election (Health Evaluator)

The health evaluator uses PostgreSQL advisory lock-based leader election. When running 2+ instances, one acquires the lock and runs evaluation cycles; the others remain on hot standby.

How It Works

  • Leader acquires advisory lock via a dedicated, non-pooled pgx.Conn
  • Lock is held for the lifetime of leader tenure (connection stays open)
  • On leader failure, a standby acquires the lock within 15-30 seconds
  • Leader-gated goroutines: evaluation, subscription cleanup, terraform cleanup, maintenance recovery, S3 backup cleanup, incident dispatcher, rate limiter eviction, uptime worker, policy evaluation (9 of 11 total)
  • NOT leader-gated: outbox worker (uses FOR UPDATE SKIP LOCKED), metrics ingester (idempotent writes)
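
The hold-until-the-holder-exits behavior can be mimicked locally with flock(1) (an analogy only: the evaluator uses a PostgreSQL advisory lock on a dedicated connection, not a file lock):

```shell
lock=$(mktemp)

flock "$lock" -c 'sleep 2' &    # "leader": lock held while the process lives
sleep 0.2
attempt1=$(flock -n "$lock" -c 'echo acquired' || echo blocked)

wait                            # leader exits; the lock is released with it
attempt2=$(flock -n "$lock" -c 'echo acquired' || echo blocked)

echo "while leader alive: $attempt1"
echo "after leader exit:  $attempt2"
```

This mirrors the failover description below: nothing explicitly "hands off" leadership; release is a side effect of the holder's connection (here, process) going away.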

Monitoring

Metric                       Description
health_evaluator_is_leader   1 if this instance is leader, 0 if standby

Alert: If health_evaluator_is_leader == 0 across ALL instances for > 60s, no instance holds the lock. Check PostgreSQL connectivity.

Troubleshooting

No leader for > 60 seconds:
  1. Check PostgreSQL connectivity from all health evaluator instances
  2. Check advisory lock status:
    SELECT * FROM pg_locks WHERE locktype = 'advisory';
    
  3. Check if a stale connection holds the lock:
    SELECT pid, state, query, backend_start
    FROM pg_stat_activity
    WHERE pid IN (SELECT pid FROM pg_locks WHERE locktype = 'advisory');
    
  4. If a stale backend holds the lock, terminate it:
    SELECT pg_terminate_backend(<pid>);
    
Failover behavior: When the leader dies, its PostgreSQL connection closes and the advisory lock is released. A standby acquires the lock on its next retry interval (default 5s). Maximum monitoring gap during failover: one evaluation cycle (15-30s).

Auth Server Rate Limiting

The auth server uses Redis-backed rate limiting (same implementation as the API server). This ensures rate limits are enforced globally across multiple auth server instances.
  • Circuit breaker fallback: If Redis is unavailable, falls back to in-memory rate limiting per instance
  • Configuration: AUTH_RATE_LIMIT_PER_MINUTE (default: 20)
  • Requires: Redis connection (REDIS_HOST)
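
The in-memory fallback can be pictured as a fixed per-minute window counter keyed by client. A hedged sketch mirroring AUTH_RATE_LIMIT_PER_MINUTE=20 (the real Redis-backed limiter may use a different algorithm; the window is pinned here for determinism):

```shell
limit=20
window=demo   # a real limiter would derive this from the current minute

# Increments the caller's counter; fails once the window's budget is spent.
allow() {
  var="count_${1}_${window}"
  eval "n=\${$var:-0}"
  [ "$n" -ge "$limit" ] && return 1
  eval "$var=$((n + 1))"
}

allowed=0; denied=0; i=0
while [ "$i" -lt 25 ]; do
  if allow client1; then allowed=$((allowed + 1)); else denied=$((denied + 1)); fi
  i=$((i + 1))
done
echo "allowed=$allowed denied=$denied"   # allowed=20 denied=5
```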

Runbooks

Cold Start / Disaster Recovery

Full system restart sequence:
  1. Start infrastructure: PostgreSQL, Redis, NATS
  2. Wait for PostgreSQL to be ready (accept connections)
  3. Run cmd/migrate to apply any pending migrations
  4. Start Temporal server, wait for it to be ready
  5. Start application services: auth-server, api-server, agent-gateway, orchestrator, health-evaluator
  6. Verify health endpoints respond
  7. Verify health evaluator leader election (health_evaluator_is_leader metric)
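
The "wait until ready" steps in the sequence above generalize to one retry helper. A sketch, not the deploy script's actual implementation; the demo probe succeeding on its third call stands in for a service coming up:

```shell
# Retries a command up to $tries times, one second apart.
wait_for() {
  tries=$1; shift
  i=0
  until "$@"; do
    i=$((i + 1))
    [ "$i" -ge "$tries" ] && return 1
    sleep 1
  done
}

# Demo: a probe that only succeeds on its third call.
marker=$(mktemp -d)/count
probe() {
  n=$(( $(cat "$marker" 2>/dev/null || echo 0) + 1 ))
  echo "$n" > "$marker"
  [ "$n" -ge 3 ]
}

wait_for 5 probe && echo ready || echo timed-out   # ready
```

In the real sequence the probe would be something like `pg_isready -h localhost` for step 2 or a Temporal health check for step 4.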

Dirty Migration Recovery

If cmd/migrate fails mid-migration, the schema_migrations table is marked dirty=true. Subsequent migration runs will refuse to proceed.
  1. Identify the failed migration:
    SELECT version, dirty FROM schema_migrations;
    
  2. Assess the damage: Check if the migration partially applied:
    -- Check if the table/column from the migration exists
    \d+ <table_name>
    
  3. Fix the state:
    • If the migration fully applied but the dirty flag wasn’t cleared:
      UPDATE schema_migrations SET dirty = false WHERE version = <version>;
      
    • If the migration partially applied, manually complete or revert it, then update the version:
      -- After manual fix:
      UPDATE schema_migrations SET dirty = false WHERE version = <version>;
      
  4. Re-run migrations:
    go run ./cmd/migrate
    

Health Evaluator Leader Troubleshooting

See Leader Election - Troubleshooting above.

PostgreSQL Failover Verification

After a managed PostgreSQL failover event:
  1. Verify all services reconnected:
    docker compose logs --tail 50 api-server | grep -i "connect\|error\|pool"
    
  2. Check connection pool health:
    curl -s http://localhost:8080/readyz | jq .
    
  3. Verify health evaluator re-acquired leader lock after failover
  4. Check for any statement timeout spikes during failover window

Redis Failover Verification

After a Redis Sentinel failover:
  1. Verify command queue is operational (agents receiving commands)
  2. Verify rate limiter circuit breaker recovered:
    docker compose logs api-server | grep -i "circuit"
    
  3. Check auth server rate limiter recovered (same circuit breaker pattern)

Connection Pool Tuning

When db_pool_active_connections > 80% of MaxConns or db_pool_acquire_duration_seconds p99 > 1s:
  1. Identify which service is exhausting its pool
  2. Check if slow queries are holding connections (see Statement Timeout Runbook in Database Guardrails)
  3. Increase DB_MAX_CONNS for the affected service
  4. Verify the total connection budget stays under PostgreSQL max_connections
  5. If total budget exceeds limits, consider deploying PgBouncer
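
Step 4's budget check is plain arithmetic: the per-service DB_MAX_CONNS values must sum to less than the server's max_connections. Numbers below are illustrative, not the deployment's actual settings:

```shell
max_connections=100   # PostgreSQL server-side limit (illustrative)

api=20; auth=10; orchestrator=15; evaluator=15; gateway=10
total=$((api + auth + orchestrator + evaluator + gateway))

echo "budget: $total / $max_connections"
if [ "$total" -lt "$max_connections" ]; then
  echo "within budget"
else
  echo "over budget: lower DB_MAX_CONNS or deploy PgBouncer"
fi
```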

Query Performance Degradation

When p95 SLO is breached:
  1. Identify the degraded service from hoodcloud_db_query_duration_seconds metric
  2. Query pg_stat_statements for the top queries by mean execution time:
    SELECT query, calls, mean_exec_time, total_exec_time
    FROM pg_stat_statements
    ORDER BY mean_exec_time DESC
    LIMIT 20;
    
  3. Run EXPLAIN ANALYZE on the slow query
  4. Check for missing indexes, table bloat, or lock contention
  5. If the health evaluator’s ListSnapshots query is degrading, check node count growth

Troubleshooting

Caddy Certificate Issues

  1. Verify DNS records: dig api.hoodcloud.io
  2. Check ports 80 and 443 are open
  3. Check logs: docker compose logs caddy

Database Connection Issues

docker compose ps postgres
docker exec hoodcloud-api-server nc -zv postgres 5432

Temporal Issues

docker compose logs temporal
# Access Temporal UI via SSH tunnel:
ssh -L 8088:localhost:8088 user@server
# Then open http://localhost:8088

API Server Won’t Start - Missing Binaries Dir

The API server builds ops-agent into /app/binaries/ during Docker build. Rebuild if missing:
docker compose build api-server
docker compose up -d --force-recreate api-server

Services Fail After Vault Restart

Vault seals on every restart. Unseal with 3 of 5 keys before services can authenticate. See Vault Operations.

Database Errors After Secrets Rotation

# Wait for cache TTL (default 5m), or force restart:
docker compose restart api-server auth-server orchestrator health-evaluator agent-gateway

Security Notes

  • All internal services bind to 127.0.0.1 only
  • External access only through Caddy reverse proxy (auto TLS)
  • gRPC uses self-signed CA for ops-agent authentication
  • Security boundaries: see CLAUDE.md