See also: Production Deployment for the two-server production topology.

Local Development

docker compose -f docker-compose.dev.yml up
That’s it. Vault is automatically started, configured, and seeded. No manual .env setup needed.

What Happens on Startup

  1. Infrastructure starts: PostgreSQL, Redis, LocalStack (S3), Temporal
  2. Vault starts in dev mode (in-memory, auto-unseal, no TLS)
  3. vault-dev-init.sh runs via the vault-init sidecar:
    • Enables KV v2 and Transit secret engines
    • Creates AppRole hoodcloud-dev with permissive dev policy
    • Seeds master encryption key, JWT RSA keys, ECDSA signing key, X25519 sealed box keypair, dev passwords
    • Creates Transit key hoodcloud-master (aes256-gcm96)
    • Writes role-id and secret-id to /vault/config/ shared volume
  4. Control plane services start, reading credentials from the shared volume
  5. The whole init sequence is idempotent and completes in ~5 seconds
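
The credential hand-off in step 4 can be sanity-checked from the host: the sidecar's output is just two non-empty files on the shared volume. A minimal check, simulated locally here (in the dev stack the directory would be wherever the /vault/config/ volume is mounted):

```shell
# Check that vault-dev-init.sh produced non-empty AppRole credentials.
creds_ready() {
  [ -s "$1/role-id" ] && [ -s "$1/secret-id" ]
}

# Simulated locally; values are placeholders, not real AppRole credentials.
dir=$(mktemp -d)
printf 'example-role-id\n'   > "$dir/role-id"
printf 'example-secret-id\n' > "$dir/secret-id"

creds_ready "$dir" && echo "credentials present"
```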

Vault UI

Dev Notes

  • Data is ephemeral (Vault dev mode uses in-memory storage)
  • No TLS (VAULT_TLS_SKIP_VERIFY=true on all services)
  • The same vault-dev-init.sh is used in E2E tests (tests/e2e/docker-compose.e2e.yml) on port 8201

Single-Server Staging

For staging or internal testing, the entire stack (control plane + observability) runs on a single VM.

Architecture

Prerequisites

  1. Hetzner Cloud VM: cx31 minimum (4 vCPU, 8GB RAM), Ubuntu 22.04
  2. AWS resources: S3 buckets + DynamoDB table (see Production Deployment - AWS Resources)
  3. DNS: api., auth., grafana. subdomains pointing to the VM

Setup

# Server dependencies
apt update && apt upgrade -y
curl -fsSL https://get.docker.com | sh
systemctl enable docker && systemctl start docker
apt install docker-compose-plugin -y

# Clone and configure
git clone https://github.com/hoodrunio/hoodcloud /opt/hoodcloud
cd /opt/hoodcloud/infrastructure/docker
cp .env.production.example .env
# Edit .env with your values (see environment-variables)
Required .env values: see Environment Variables.

Generate Certificates

cd /opt/hoodcloud
./scripts/generate-certs.sh  # ops-agent gRPC certs

Deploy

# Option 1: Script
./scripts/deploy-control-plane.sh
# Options: --build-only, --no-build, --migrate

# Option 2: Manual
cd infrastructure/docker
docker compose build

# Infrastructure
docker compose up -d postgres redis nats
sleep 10
docker compose up -d temporal temporal-init temporal-ui
sleep 15

# Run migrations BEFORE starting application services
docker compose run --rm migrate

# Application
docker compose up -d auth-server api-server agent-gateway orchestrator health-evaluator

# Observability
docker compose up -d prometheus victoria-metrics loki grafana alertmanager
docker compose up -d caddy
Important: Migrations must run before application services start. The cmd/migrate binary is idempotent — safe to run multiple times. Services no longer run migrations at startup.

Verify

curl https://api.hoodcloud.io/health      # {"status":"healthy"}
curl https://auth.hoodcloud.io/health     # {"status":"healthy"}
docker compose ps                          # All services running
Note: /health returns only top-level status. Use /readyz for detailed component health.
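
For scripted deploys, the two curl checks above can be wrapped in a retry loop. A sketch, not part of the deploy scripts; the body match assumes the {"status":"healthy"} response shown above:

```shell
# Returns 0 when the /health body reports healthy.
body_healthy() {
  printf '%s' "$1" | grep -q '"status":"healthy"'
}

# Polls a URL until healthy, or gives up after $tries attempts.
wait_healthy() {
  url=$1; tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    body=$(curl -fsS "$url" 2>/dev/null) && body_healthy "$body" && return 0
    i=$((i + 1)); sleep 2
  done
  return 1
}

body_healthy '{"status":"healthy"}' && echo "would pass"
```

Usage would be e.g. `wait_healthy https://api.hoodcloud.io/health && echo "api up"`.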

Internal Ports

Service       Port   Purpose
PostgreSQL    5432   Database
Redis         6379   Cache
Temporal      7233   Workflow engine
Temporal UI   8088   Workflow dashboard
Prometheus    9090   Metrics
NATS          4222   Event streaming

Day-to-Day Operations

View Logs

docker compose logs -f api-server          # Single service
docker compose logs --tail 100 api-server  # Last 100 lines
docker compose logs -f                      # All services

Restart Services

docker compose restart api-server          # Single service (keeps env)
docker compose up -d api-server            # Recreate (reloads .env)
Important: restart does not reload .env changes. Use up -d to pick up env var updates.

Update Deployment

cd /opt/hoodcloud
git pull

# Run migrations first (idempotent, safe to run even if no new migrations)
cd infrastructure/docker
docker compose run --rm migrate

# Then restart services
./scripts/deploy-control-plane.sh

Release Chain Configs

When chain config files are updated in the hoodcloud-chain-configs repo:
cd hoodcloud-chain-configs
git tag v1.1.0
git push origin v1.1.0
This triggers the GitHub Actions workflow:
  1. Validates config files against schemas
  2. Creates tarball + checksums
  3. Uploads to S3 (s3://hoodcloud-chain-configs/v1.1.0/)
  4. Copies to latest/ for auto-update mode
  5. Creates GitHub release
Auto-reload (default): Services using CHAIN_PROFILE_VERSION=latest poll checksums.txt at CHAINS_CHECK_INTERVAL (default: 1m). Config swaps atomically, no restart needed.
docker compose logs orchestrator | grep "Chain configs loaded"
Pinned version (controlled rollouts):
# In .env
CHAIN_PROFILE_VERSION=v1.1.0
docker compose restart api-server orchestrator health-evaluator

# Rollback
CHAIN_PROFILE_VERSION=v1.0.2
docker compose restart api-server orchestrator health-evaluator
Ops-agent config distribution:
# Update CONFIG_BUNDLE_VERSION in .env, then:
docker compose restart orchestrator
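
The auto-reload mechanics described above amount to "poll checksums.txt, reload when it differs from the last-seen copy." A local simulation of that comparison (temp files stand in for the S3 fetch; the services' actual polling code may differ):

```shell
remote=$(mktemp); seen=$(mktemp)

# True when the freshly fetched checksums differ from the last-seen copy.
configs_changed() {
  ! cmp -s "$1" "$2"
}

echo "abc123  chain-configs.tar.gz" > "$remote"
cp "$remote" "$seen"
configs_changed "$remote" "$seen" && echo changed || echo unchanged   # unchanged

echo "def456  chain-configs.tar.gz" > "$remote"   # a new release landed
configs_changed "$remote" "$seen" && echo changed || echo unchanged   # changed
```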

Database Backup & Restore

# Backup
docker exec hoodcloud-postgres pg_dump -U hoodcloud hoodcloud > backup-$(date +%Y%m%d).sql

# Restore
docker exec -i hoodcloud-postgres psql -U hoodcloud hoodcloud < backup.sql
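
Backups from the command above accumulate; a simple retention pass can prune them. A sketch, not part of the deploy scripts (14 days is an arbitrary choice, and the directory here is a temp stand-in):

```shell
# Rotate backup-YYYYMMDD.sql files, keeping the last 14 days.
backup_dir=$(mktemp -d)   # stand-in for your real backup directory

touch -d '2020-01-01' "$backup_dir/backup-20200101.sql"   # stale backup
touch "$backup_dir/backup-$(date +%Y%m%d).sql"            # today's backup

find "$backup_dir" -name 'backup-*.sql' -mtime +14 -delete
ls "$backup_dir"   # only today's file remains
```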

Secrets Rotation

After updating secrets in Vault:
# Services cache credentials for up to 5 minutes (VAULT_CACHE_TTL)
docker compose restart api-server auth-server orchestrator health-evaluator agent-gateway
See also: Vault Operations for secret rotation and AppRole credential management.

Upgrade Rollout Operations

Create and Start a Rollout

Two modes are available for rollout creation.

Auto mode (recommended): Provide upgrade_id — the control plane loads the upgrade manifest from chain-configs and auto-populates binary coordinates, state_compatibility, config_changes, and manifest_content_hash.
# Auto mode: manifest-based creation
curl -X POST https://api.hoodcloud.io/api/v1/rollouts \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "upgrade_id": "v0.22.0",
    "chain_profile_id": "celestia-mocha",
    "node_type": "full",
    "source_version": "v0.21.0",
    "strategy": "canary",
    "batch_size": 5,
    "canary_size": 2,
    "failure_threshold": 0.2,
    "health_wait_duration": "5m",
    "reason": "Security patch for consensus bug"
  }'
Manual mode: Omit upgrade_id — operator provides all fields explicitly.
# Manual mode: all fields provided
curl -X POST https://api.hoodcloud.io/api/v1/rollouts \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "chain_profile_id": "celestia-mocha",
    "node_type": "full",
    "binary_name": "celestia-node",
    "source_version": "v0.21.0",
    "target_version": "v0.22.0",
    "target_binary_url": "https://...",
    "target_binary_checksum": "sha256:...",
    "strategy": "canary",
    "batch_size": 5,
    "canary_size": 2,
    "failure_threshold": 0.2,
    "health_wait_duration": "5m",
    "reason": "Security patch for consensus bug"
  }'
# Start the rollout
curl -X POST https://api.hoodcloud.io/api/v1/rollouts/{id}/start \
  -H "Authorization: Bearer $TOKEN"
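
How failure_threshold gates a rollout is worth making concrete. Assuming it is the failed/total ratio above which the rollout aborts (the orchestrator's exact comparison may differ), the check reduces to integer arithmetic:

```shell
# Threshold given in percent to stay in integer math (0.2 -> 20).
should_abort() {
  failed=$1; total=$2; threshold_pct=$3
  [ $((failed * 100)) -gt $((total * threshold_pct)) ]
}

should_abort 1 10 20 && echo abort || echo continue   # 10% failed -> continue
should_abort 3 10 20 && echo abort || echo continue   # 30% failed -> abort
```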

Monitor Rollout Progress

# Get rollout status and progress counters
curl https://api.hoodcloud.io/api/v1/rollouts/{id} \
  -H "Authorization: Bearer $TOKEN"

# Per-node status (includes upgrade_phase)
curl https://api.hoodcloud.io/api/v1/rollouts/{id}/nodes \
  -H "Authorization: Bearer $TOKEN"

# Temporal workflow visibility
# Open Temporal UI -> search for workflow ID: rollout-{rollout_id}

Pause, Resume, Cancel

# Pause (completes current batch, then stops)
curl -X POST https://api.hoodcloud.io/api/v1/rollouts/{id}/pause \
  -H "Authorization: Bearer $TOKEN"

# Resume from paused state
curl -X POST https://api.hoodcloud.io/api/v1/rollouts/{id}/resume \
  -H "Authorization: Bearer $TOKEN"

# Cancel (stops new batches, lets in-flight nodes complete)
curl -X POST https://api.hoodcloud.io/api/v1/rollouts/{id}/cancel \
  -H "Authorization: Bearer $TOKEN"

Rollback

# Full rollback (all succeeded nodes reverted)
curl -X POST https://api.hoodcloud.io/api/v1/rollouts/{id}/rollback \
  -H "Authorization: Bearer $TOKEN"

# Retry a single failed node
curl -X POST https://api.hoodcloud.io/api/v1/rollouts/{id}/nodes/{nodeId}/retry \
  -H "Authorization: Bearer $TOKEN"

Multi-Binary Rollout (Rollout Groups)

For chains with multiple binaries (e.g., ethereum-holesky: geth + lighthouse):
# Create rollout group — auto mode (components reference upgrade_id)
curl -X POST https://api.hoodcloud.io/api/v1/rollout-groups \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "chain_profile_id": "ethereum-holesky",
    "node_type": "full",
    "failure_policy": "rollback_all",
    "desired_versions": {"geth": "v1.15.0", "lighthouse": "v5.2.0"},
    "component_order": [
      {"upgrade_id": "geth-v1.15.0", "binary_name": "geth", "node_type": "full", "version": "v1.15.0"},
      {"upgrade_id": "lighthouse-v5.2.0", "binary_name": "lighthouse", "node_type": "full", "version": "v5.2.0"}
    ],
    "strategy": "rolling",
    "batch_size": 5
  }'

# Start group (components upgrade sequentially)
curl -X POST https://api.hoodcloud.io/api/v1/rollout-groups/{id}/start \
  -H "Authorization: Bearer $TOKEN"

# Monitor group status
curl https://api.hoodcloud.io/api/v1/rollout-groups/{id}/status \
  -H "Authorization: Bearer $TOKEN"

Database Queries (Rollout Debugging)

-- Active rollouts
SELECT id, chain_profile_id, status, strategy, total_nodes, succeeded_nodes, failed_nodes
FROM rollouts WHERE status NOT IN ('completed', 'failed', 'canceled', 'rolled_back');

-- Per-node upgrade status for a rollout
SELECT rn.node_id, rn.status, rn.upgrade_phase, rn.error_message, rn.attempt_count
FROM rollout_nodes rn WHERE rn.rollout_id = '{rollout_id}';

-- Nodes stuck in maintenance (potential cleanup targets)
SELECT n.id, n.chain_profile_id, n.state, nhs.updated_at
FROM nodes n JOIN node_health_state nhs ON n.id = nhs.node_id
WHERE n.state = 'maintenance' AND nhs.updated_at < now() - interval '30 minutes';

-- Version drift detection
SELECT n.id, n.chain_profile_id, ncs.binary_version
FROM nodes n LEFT JOIN node_config_state ncs ON n.id = ncs.node_id
WHERE n.state NOT IN ('terminated', 'failed')
ORDER BY n.chain_profile_id, ncs.binary_version;

Rolling Uptime

The UptimeWorker runs inside the health-evaluator process — no separate deployment needed.

How It Works

  • State transitions are logged to node_state_log (append-only) via the StateLogHandler registered on CompositeTransitionHandler
  • The UptimeWorker materializes hourly uptime buckets (node_uptime_hourly) every 5 minutes
  • API endpoints read pre-computed buckets for fast queries

Configuration Defaults

No environment variables are required. Defaults are defined in internal/defaults/defaults.go:
Setting         Default   Description
Interval        5m        How often the worker computes buckets
BatchSize       500       Nodes processed per batch
GracePeriod     10m       Provisioning grace period (counts as uptime)
RetentionDays   90        Hourly buckets older than this are purged

API Endpoints

Both require nodes:read scope + ownership verification:
  • GET /api/v1/nodes/{nodeID}/uptime?window={24h|7d|30d} — Rolling uptime summary
  • GET /api/v1/nodes/{nodeID}/uptime/history?window={24h|7d|30d}&granularity={hourly|daily} — Per-period breakdown

Operational Characteristics

  • Idempotent: Safe to restart. The worker catches up from the last complete bucket.
  • Self-healing: Individual node failures are logged but don’t block other nodes.
  • Auto-purge: Buckets older than 90 days are automatically deleted.
  • No external cursor: Progress is implicit in is_complete flags on hourly buckets.
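
The rolling-uptime figure the API serves can be derived from the hourly buckets' uptime_seconds and downtime_seconds columns. The arithmetic below is illustrative, not the worker's actual implementation:

```shell
# Two-decimal uptime percentage from uptime/downtime seconds, integer math only.
uptime_pct() {
  up=$1; down=$2
  total=$((up + down))
  [ "$total" -eq 0 ] && { echo "n/a"; return; }
  echo $(( (up * 10000) / total )) | awk '{printf "%d.%02d\n", $1/100, $1%100}'
}

uptime_pct 3500 100   # 97.22
```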

Monitoring

The worker logs warnings on per-node failures:
docker compose logs health-evaluator | grep -i "uptime"
Key log patterns:
  • "Failed to compute uptime for node" — Per-node error (non-blocking)
  • "Failed to compute uptime batch" — Batch-level error (continues to next batch)

Database Queries (Debugging)

-- Check recent bucket computation
SELECT node_id, bucket_hour, uptime_seconds, downtime_seconds, is_complete
FROM node_uptime_hourly
ORDER BY computed_at DESC LIMIT 20;

-- State transitions for a node
SELECT state, entered_at, exited_at, trigger
FROM node_state_log
WHERE node_id = '{node_id}'
ORDER BY entered_at DESC LIMIT 20;

-- Bucket retention (rows older than 90 days)
SELECT COUNT(*) FROM node_uptime_hourly
WHERE bucket_hour < now() - interval '90 days';

Manual Bucket Recomputation

If a bug produced incorrect buckets, mark affected buckets for recomputation:
UPDATE node_uptime_hourly
SET is_complete = false
WHERE node_id = '{node_id}'
  AND bucket_hour BETWEEN '{start}' AND '{end}';
The worker recomputes these on its next tick.

Leader Election (Health Evaluator)

The health evaluator uses PostgreSQL advisory lock-based leader election. When running 2+ instances, one acquires the lock and runs evaluation cycles; the others remain on hot standby.

How It Works

  • Leader acquires advisory lock via a dedicated, non-pooled pgx.Conn
  • Lock is held for the lifetime of leader tenure (connection stays open)
  • On leader failure, a standby acquires the lock within 15-30 seconds
  • Leader-gated goroutines: evaluation, subscription cleanup, terraform cleanup, maintenance recovery, S3 backup cleanup, incident dispatcher, rate limiter eviction, uptime worker, policy evaluation (9 of 11 total)
  • NOT leader-gated: outbox worker (uses FOR UPDATE SKIP LOCKED), metrics ingester (idempotent writes)
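
The hold-until-the-holder-exits behavior can be mimicked locally with flock(1) (an analogy only: the evaluator uses a PostgreSQL advisory lock on a dedicated connection, not a file lock):

```shell
lock=$(mktemp)

flock "$lock" -c 'sleep 2' &    # "leader": lock held while the process lives
sleep 0.2
attempt1=$(flock -n "$lock" -c 'echo acquired' || echo blocked)

wait                            # leader exits; the lock is released with it
attempt2=$(flock -n "$lock" -c 'echo acquired' || echo blocked)

echo "while leader alive: $attempt1"
echo "after leader exit:  $attempt2"
```

This mirrors the failover description below: nothing explicitly "hands off" leadership; release is a side effect of the holder's connection (here, process) going away.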

Monitoring

Metric                       Description
health_evaluator_is_leader   1 if this instance is leader, 0 if standby

Alert: If health_evaluator_is_leader == 0 across ALL instances for > 60s, no instance holds the lock. Check PostgreSQL connectivity.

Troubleshooting

No leader for > 60 seconds:
  1. Check PostgreSQL connectivity from all health evaluator instances
  2. Check advisory lock status:
    SELECT * FROM pg_locks WHERE locktype = 'advisory';
    
  3. Check if a stale connection holds the lock:
    SELECT pid, state, query, backend_start
    FROM pg_stat_activity
    WHERE pid IN (SELECT pid FROM pg_locks WHERE locktype = 'advisory');
    
  4. If a stale backend holds the lock, terminate it:
    SELECT pg_terminate_backend(<pid>);
    
Failover behavior: When the leader dies, its PostgreSQL connection closes and the advisory lock is released. A standby acquires the lock on its next retry interval (default 5s). Maximum monitoring gap during failover: one evaluation cycle (15-30s).

Auth Server Rate Limiting

The auth server uses Redis-backed rate limiting (same implementation as the API server). This ensures rate limits are enforced globally across multiple auth server instances.
  • Circuit breaker fallback: If Redis is unavailable, falls back to in-memory rate limiting per instance
  • Configuration: AUTH_RATE_LIMIT_PER_MINUTE (default: 20)
  • Requires: Redis connection (REDIS_HOST)
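
The in-memory fallback can be pictured as a fixed per-minute window counter keyed by client. A hedged sketch mirroring AUTH_RATE_LIMIT_PER_MINUTE=20 (the real Redis-backed limiter may use a different algorithm; the window is pinned here for determinism):

```shell
limit=20
window=demo   # a real limiter would derive this from the current minute

# Increments the caller's counter; fails once the window's budget is spent.
allow() {
  var="count_${1}_${window}"
  eval "n=\${$var:-0}"
  [ "$n" -ge "$limit" ] && return 1
  eval "$var=$((n + 1))"
}

allowed=0; denied=0; i=0
while [ "$i" -lt 25 ]; do
  if allow client1; then allowed=$((allowed + 1)); else denied=$((denied + 1)); fi
  i=$((i + 1))
done
echo "allowed=$allowed denied=$denied"   # allowed=20 denied=5
```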

Runbooks

Cold Start / Disaster Recovery

Full system restart sequence:
  1. Start infrastructure: PostgreSQL, Redis, NATS
  2. Wait for PostgreSQL to be ready (accept connections)
  3. Run cmd/migrate to apply any pending migrations
  4. Start Temporal server, wait for it to be ready
  5. Start application services: auth-server, api-server, agent-gateway, orchestrator, health-evaluator
  6. Verify health endpoints respond
  7. Verify health evaluator leader election (health_evaluator_is_leader metric)
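
The "wait until ready" steps in the sequence above generalize to one retry helper. A sketch, not the deploy script's actual implementation; the demo probe succeeding on its third call stands in for a service coming up:

```shell
# Retries a command up to $tries times, one second apart.
wait_for() {
  tries=$1; shift
  i=0
  until "$@"; do
    i=$((i + 1))
    [ "$i" -ge "$tries" ] && return 1
    sleep 1
  done
}

# Demo: a probe that only succeeds on its third call.
marker=$(mktemp -d)/count
probe() {
  n=$(( $(cat "$marker" 2>/dev/null || echo 0) + 1 ))
  echo "$n" > "$marker"
  [ "$n" -ge 3 ]
}

wait_for 5 probe && echo ready || echo timed-out   # ready
```

In the real sequence the probe would be something like `pg_isready -h localhost` for step 2 or a Temporal health check for step 4.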

Dirty Migration Recovery

If cmd/migrate fails mid-migration, the schema_migrations table is marked dirty=true. Subsequent migration runs will refuse to proceed.
  1. Identify the failed migration:
    SELECT version, dirty FROM schema_migrations;
    
  2. Assess the damage: Check if the migration partially applied:
    -- Check if the table/column from the migration exists
    \d+ <table_name>
    
  3. Fix the state:
    • If the migration fully applied but the dirty flag wasn’t cleared:
      UPDATE schema_migrations SET dirty = false WHERE version = <version>;
      
    • If the migration partially applied, manually complete or revert it, then update the version:
      -- After manual fix:
      UPDATE schema_migrations SET dirty = false WHERE version = <version>;
      
  4. Re-run migrations:
    go run ./cmd/migrate
    

Health Evaluator Leader Troubleshooting

See Leader Election - Troubleshooting above.

PostgreSQL Failover Verification

After a managed PostgreSQL failover event:
  1. Verify all services reconnected:
    docker compose logs --tail 50 api-server | grep -i "connect\|error\|pool"
    
  2. Check connection pool health:
    curl -s http://localhost:8080/readyz | jq .
    
  3. Verify health evaluator re-acquired leader lock after failover
  4. Check for any statement timeout spikes during failover window

Redis Failover Verification

After a Redis Sentinel failover:
  1. Verify command queue is operational (agents receiving commands)
  2. Verify rate limiter circuit breaker recovered:
    docker compose logs api-server | grep -i "circuit"
    
  3. Check auth server rate limiter recovered (same circuit breaker pattern)

Connection Pool Tuning

When db_pool_active_connections > 80% of MaxConns or db_pool_acquire_duration_seconds p99 > 1s:
  1. Identify which service is exhausting its pool
  2. Check if slow queries are holding connections (see Statement Timeout Runbook in Database Guardrails)
  3. Increase DB_MAX_CONNS for the affected service
  4. Verify the total connection budget stays under PostgreSQL max_connections
  5. If total budget exceeds limits, consider deploying PgBouncer
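
Step 4's budget check is plain arithmetic: the per-service DB_MAX_CONNS values must sum to less than the server's max_connections. Numbers below are illustrative, not the deployment's actual settings:

```shell
max_connections=100   # PostgreSQL server-side limit (illustrative)

api=20; auth=10; orchestrator=15; evaluator=15; gateway=10
total=$((api + auth + orchestrator + evaluator + gateway))

echo "budget: $total / $max_connections"
if [ "$total" -lt "$max_connections" ]; then
  echo "within budget"
else
  echo "over budget: lower DB_MAX_CONNS or deploy PgBouncer"
fi
```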

Query Performance Degradation

When p95 SLO is breached:
  1. Identify the degraded service from hoodcloud_db_query_duration_seconds metric
  2. Query pg_stat_statements for the top queries by mean execution time:
    SELECT query, calls, mean_exec_time, total_exec_time
    FROM pg_stat_statements
    ORDER BY mean_exec_time DESC
    LIMIT 20;
    
  3. Run EXPLAIN ANALYZE on the slow query
  4. Check for missing indexes, table bloat, or lock contention
  5. If the health evaluator’s ListSnapshots query is degrading, check node count growth

Troubleshooting

Caddy Certificate Issues

  1. Verify DNS records: dig api.hoodcloud.io
  2. Check ports 80 and 443 are open
  3. Check logs: docker compose logs caddy

Database Connection Issues

docker compose ps postgres
docker exec hoodcloud-api-server nc -zv postgres 5432

Temporal Issues

docker compose logs temporal
# Access Temporal UI via SSH tunnel:
ssh -L 8088:localhost:8088 user@server
# Then open http://localhost:8088

API Server Won’t Start - Missing Binaries Dir

The API server builds ops-agent into /app/binaries/ during Docker build. Rebuild if missing:
docker compose build api-server
docker compose up -d --force-recreate api-server

Services Fail After Vault Restart

Vault seals on every restart. Unseal with 3 of 5 keys before services can authenticate. See Vault Operations.

Database Errors After Secrets Rotation

# Wait for cache TTL (default 5m), or force restart:
docker compose restart api-server auth-server orchestrator health-evaluator agent-gateway

Security Notes

  • All internal services bind to 127.0.0.1 only
  • External access only through Caddy reverse proxy (auto TLS)
  • gRPC uses self-signed CA for ops-agent authentication
  • Security boundaries: see CLAUDE.md