See also: Production Deployment for the two-server production topology.
Local Development
A `.env` file is required before starting; see Environment Variables for the required values.
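A typical local bring-up, assuming the standard `docker compose` workflow described below (the example-file name is an assumption; adjust to your repo):

```shell
# Create the local .env (filename assumed) and start the full local stack
cp .env.example .env
docker compose up -d

# Confirm all containers are up and healthy
docker compose ps
```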
What Happens on Startup
- Infrastructure starts: PostgreSQL, Redis, LocalStack (S3), Temporal
- Vault starts in dev mode (in-memory, auto-unseal, no TLS)
- `vault-dev-init.sh` runs via the `vault-init` sidecar:
  - Enables KV v2 and Transit secret engines
  - Creates AppRole `hoodcloud-dev` with a permissive dev policy
  - Seeds the master encryption key, JWT RSA keys, ECDSA signing key, X25519 sealed box keypair, and dev passwords
  - Creates Transit key `hoodcloud-master` (aes256-gcm96)
  - Writes role-id and secret-id to the `/vault/config/` shared volume
- Control plane services start, reading credentials from the shared volume
- The init script is idempotent and completes in ~5 seconds
Vault UI
- URL: http://localhost:8200
- Token: `dev-token`
Dev Notes
- Data is ephemeral (Vault dev mode uses in-memory storage)
- No TLS (`VAULT_TLS_SKIP_VERIFY=true` on all services)
- The same `vault-dev-init.sh` is used in E2E tests (`tests/e2e/docker-compose.e2e.yml`) on port 8201
Single-Server Staging
For staging or internal testing, the entire stack (control plane + observability) runs on a single VM.

Architecture
Prerequisites
- Hetzner Cloud VM: cx31 minimum (4 vCPU, 8GB RAM), Ubuntu 22.04
- AWS resources: S3 buckets + DynamoDB table (see Production Deployment - AWS Resources)
- DNS: `api.`, `auth.`, `grafana.` subdomains pointing to the VM
Setup
.env values: see Environment Variables.
Generate Certificates
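The original commands for this step are not shown here. Given that gRPC uses a self-signed CA for ops-agent authentication (see Security Notes), a minimal openssl sketch might look like the following; every file name and CN is illustrative:

```shell
# Create a self-signed CA (names and validity period are illustrative)
openssl req -x509 -newkey rsa:4096 -nodes -days 825 \
  -keyout ca.key -out ca.crt -subj "/CN=hoodcloud-internal-ca"

# Issue a server certificate signed by that CA
openssl req -newkey rsa:4096 -nodes \
  -keyout server.key -out server.csr -subj "/CN=agent-gateway"
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -days 825 -out server.crt
```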
Deploy
Important: Migrations must run before application services start. The `cmd/migrate` binary is idempotent — safe to run multiple times. Services no longer run migrations at startup.
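Following the migration-first rule above, a deploy might look like this; the `migrate` compose service name is an assumption:

```shell
# Run migrations first — cmd/migrate is idempotent (service name assumed)
docker compose run --rm migrate

# Then start (or update) the application services
docker compose up -d
```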
Verify
Note: `/health` returns only top-level status. Use `/readyz` for detailed component health.
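A quick verification pass, assuming the endpoints are served on the `api.` subdomain:

```shell
# Top-level liveness only
curl -fsS https://api.hoodcloud.io/health

# Detailed per-component health
curl -fsS https://api.hoodcloud.io/readyz
```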
Internal Ports
| Service | Port | Purpose |
|---|---|---|
| PostgreSQL | 5432 | Database |
| Redis | 6379 | Cache |
| Temporal | 7233 | Workflow engine |
| Temporal UI | 8088 | Workflow dashboard |
| Prometheus | 9090 | Metrics |
| NATS | 4222 | Event streaming |
Day-to-Day Operations
View Logs
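A sketch of common log invocations (service names taken from this document's compose stack):

```shell
# Follow logs for all services
docker compose logs -f

# Follow one service with recent history
docker compose logs -f --tail=200 api-server
```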
Restart Services
Important: `restart` does not reload `.env` changes. Use `up -d` to pick up env var updates.
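Illustrating the distinction above:

```shell
# Restart the container WITHOUT reloading .env
docker compose restart api-server

# Recreate the container so updated .env values are applied
docker compose up -d api-server
```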
Update Deployment
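A sketch of a routine update, assuming images are pulled from a registry and a `migrate` compose service exists (both assumptions):

```shell
git pull                         # fetch updated compose/config files
docker compose pull              # fetch new images
docker compose run --rm migrate  # migrations first (idempotent); service name assumed
docker compose up -d             # recreate changed services
```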
Release Chain Configs
When chain config files are updated in the `hoodcloud-chain-configs` repo:
- Validates config files against schemas
- Creates tarball + checksums
- Uploads to S3 (`s3://hoodcloud-chain-configs/v1.1.0/`)
- Copies to `latest/` for auto-update mode
- Creates a GitHub release
Services running with `CHAIN_PROFILE_VERSION=latest` poll `checksums.txt` at `CHAINS_CHECK_INTERVAL` (default: 1m). Configs swap atomically; no restart is needed.
Database Backup & Restore
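A minimal backup/restore sketch; the database role and name (`hoodcloud`) and the compose service name (`postgres`) are assumptions:

```shell
# Backup in custom format (supports parallel/selective restore)
docker compose exec postgres pg_dump -U hoodcloud -Fc hoodcloud > backup_$(date +%F).dump

# Restore from a backup file
docker compose exec -T postgres pg_restore -U hoodcloud -d hoodcloud --clean < backup_2025-01-01.dump
```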
Secrets Rotation
After updating secrets in Vault, restart the services that consume them.

See also: Vault Operations for secret rotation and AppRole credential management.
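A sketch of the post-rotation restart, assuming services read credentials at startup (service names from this document):

```shell
# Recreate services so they re-read credentials from Vault / the shared volume
docker compose up -d --force-recreate api-server auth-server orchestrator
```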
Upgrade Rollout Operations
Create and Start a Rollout
Two modes are available for rollout creation:

- Auto mode (recommended): Provide `upgrade_id` — the control plane loads the upgrade manifest from chain-configs and auto-populates binary coordinates, state_compatibility, config_changes, and manifest_content_hash.
- Manual mode: Omit `upgrade_id` — the operator provides all fields explicitly.
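A hypothetical API call for auto mode; the endpoint path, payload shape, and `upgrade_id` value are assumptions, not documented here:

```shell
# Auto mode: only upgrade_id is supplied; manifest fields are auto-populated
curl -X POST https://api.hoodcloud.io/api/v1/rollouts \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"upgrade_id": "ethereum-holesky-v1.14.0"}'
```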
Monitor Rollout Progress
Pause, Resume, Cancel
Rollback
Multi-Binary Rollout (Rollout Groups)
For chains with multiple binaries (e.g., ethereum-holesky: geth + lighthouse):

Database Queries (Rollout Debugging)
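An illustrative inspection query; the table and column names below are assumptions for the sketch, not the documented schema:

```shell
# Inspect recent rollouts (table/column names assumed, not documented here)
docker compose exec postgres psql -U hoodcloud -d hoodcloud \
  -c "SELECT id, status, created_at FROM rollouts ORDER BY created_at DESC LIMIT 10;"
```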
Rolling Uptime
The UptimeWorker runs inside the `health-evaluator` process — no separate deployment is needed.
How It Works
- State transitions are logged to `node_state_log` (append-only) via the `StateLogHandler` registered on `CompositeTransitionHandler`
- The UptimeWorker materializes hourly uptime buckets (`node_uptime_hourly`) every 5 minutes
- API endpoints read pre-computed buckets for fast queries
Configuration Defaults
No environment variables are required. Defaults are defined in `internal/defaults/defaults.go`:
| Setting | Default | Description |
|---|---|---|
| Interval | 5m | How often the worker computes buckets |
| BatchSize | 500 | Nodes processed per batch |
| GracePeriod | 10m | Provisioning grace period (counts as uptime) |
| RetentionDays | 90 | Hourly buckets older than this are purged |
API Endpoints
Both require `nodes:read` scope + ownership verification:

- `GET /api/v1/nodes/{nodeID}/uptime?window={24h|7d|30d}` — Rolling uptime summary
- `GET /api/v1/nodes/{nodeID}/uptime/history?window={24h|7d|30d}&granularity={hourly|daily}` — Per-period breakdown
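Example calls against the endpoints above (the hostname and token variable are assumptions):

```shell
# 7-day rolling uptime summary for one node
curl -fsS -H "Authorization: Bearer $TOKEN" \
  "https://api.hoodcloud.io/api/v1/nodes/$NODE_ID/uptime?window=7d"

# Daily breakdown over 30 days
curl -fsS -H "Authorization: Bearer $TOKEN" \
  "https://api.hoodcloud.io/api/v1/nodes/$NODE_ID/uptime/history?window=30d&granularity=daily"
```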
Operational Characteristics
- Idempotent: Safe to restart. The worker catches up from the last complete bucket.
- Self-healing: Individual node failures are logged but don’t block other nodes.
- Auto-purge: Buckets older than 90 days are automatically deleted.
- No external cursor: Progress is implicit in `is_complete` flags on hourly buckets
Monitoring
The worker logs warnings on per-node failures:

- `"Failed to compute uptime for node"` — Per-node error (non-blocking)
- `"Failed to compute uptime batch"` — Batch-level error (continues to the next batch)
Database Queries (Debugging)
Manual Bucket Recomputation
If a bug produced incorrect buckets, mark affected buckets for recomputation:

Leader Election (Health Evaluator)

The health evaluator uses PostgreSQL advisory lock-based leader election. When running 2+ instances, one acquires the lock and runs evaluation cycles; the others are hot standbys.

How It Works
- The leader acquires the advisory lock via a dedicated, non-pooled `pgx.Conn`
- The lock is held for the lifetime of the leader's tenure (the connection stays open)
- On leader failure, a standby acquires the lock within 15-30 seconds
- Leader-gated goroutines: evaluation, subscription cleanup, terraform cleanup, maintenance recovery, S3 backup cleanup, incident dispatcher, rate limiter eviction, uptime worker, policy evaluation (9 of 11 total)
- NOT leader-gated: outbox worker (uses `FOR UPDATE SKIP LOCKED`), metrics ingester (idempotent writes)
Monitoring
| Metric | Description |
|---|---|
| `health_evaluator_is_leader` | 1 if this instance is leader, 0 if standby |

If `health_evaluator_is_leader == 0` across ALL instances for > 60s, no instance holds the lock. Check PostgreSQL connectivity.
Troubleshooting
No leader for > 60 seconds:

- Check PostgreSQL connectivity from all health evaluator instances
- Check advisory lock status:
- Check if a stale connection holds the lock:
- If a stale backend holds the lock, terminate it:
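The lock checks above can be performed with standard PostgreSQL catalog queries; the connection parameters and role name are assumptions:

```shell
# Advisory lock status: which backend (pid) holds it
docker compose exec postgres psql -U hoodcloud -d hoodcloud -c \
  "SELECT pid, granted, objid FROM pg_locks WHERE locktype = 'advisory';"

# Inspect the backend holding the lock for staleness
docker compose exec postgres psql -U hoodcloud -d hoodcloud -c \
  "SELECT pid, state, backend_start, client_addr FROM pg_stat_activity WHERE pid = <PID>;"

# Terminate a stale backend so a standby can acquire the lock
docker compose exec postgres psql -U hoodcloud -d hoodcloud -c \
  "SELECT pg_terminate_backend(<PID>);"
```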
Auth Server Rate Limiting
The auth server uses Redis-backed rate limiting (the same implementation as the API server). This ensures rate limits are enforced globally across multiple auth server instances.

- Circuit breaker fallback: If Redis is unavailable, falls back to in-memory rate limiting per instance
- Configuration: `AUTH_RATE_LIMIT_PER_MINUTE` (default: 20)
- Requires: Redis connection (`REDIS_HOST`)
Runbooks
Cold Start / Disaster Recovery
Full system restart sequence:

- Start infrastructure: PostgreSQL, Redis, NATS
- Wait for PostgreSQL to be ready (accepting connections)
- Run `cmd/migrate` to apply any pending migrations
- Start the Temporal server and wait for it to be ready
- Start application services: auth-server, api-server, agent-gateway, orchestrator, health-evaluator
- Verify health endpoints respond
- Verify health evaluator leader election (`health_evaluator_is_leader` metric)
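The sequence above, sketched as compose invocations; service names are assumed to match the compose file:

```shell
docker compose up -d postgres redis nats
docker compose exec postgres pg_isready -U hoodcloud   # repeat until ready
docker compose run --rm migrate                        # idempotent; service name assumed
docker compose up -d temporal
docker compose up -d auth-server api-server agent-gateway orchestrator health-evaluator
curl -fsS https://api.hoodcloud.io/health              # verify
```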
Dirty Migration Recovery
If `cmd/migrate` fails mid-migration, the `schema_migrations` table is marked `dirty=true`. Subsequent migration runs will refuse to proceed.
- Identify the failed migration:
- Assess the damage: check whether the migration partially applied:
- Fix the state:
  - If the migration fully applied but the dirty flag wasn't cleared:
  - If the migration partially applied, manually complete or revert it, then update the version:
- Re-run migrations:
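SQL for the common cases, assuming the golang-migrate-style `schema_migrations(version, dirty)` table named above (role/database names assumed):

```shell
# Identify the failed migration and its dirty flag
docker compose exec postgres psql -U hoodcloud -d hoodcloud \
  -c "SELECT version, dirty FROM schema_migrations;"

# Case: migration fully applied, flag not cleared — clear the dirty flag
docker compose exec postgres psql -U hoodcloud -d hoodcloud \
  -c "UPDATE schema_migrations SET dirty = false;"

# Case: partially applied and manually reverted — roll the version back, then clear
docker compose exec postgres psql -U hoodcloud -d hoodcloud \
  -c "UPDATE schema_migrations SET version = <previous_version>, dirty = false;"
```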
Health Evaluator Leader Troubleshooting
See Leader Election - Troubleshooting above.

PostgreSQL Failover Verification
After a managed PostgreSQL failover event:

- Verify all services reconnected:
- Check connection pool health:
- Verify health evaluator re-acquired leader lock after failover
- Check for any statement timeout spikes during failover window
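A sketch of these checks; service, role, and database names are assumed:

```shell
# Look for reconnect/timeout noise around the failover window
docker compose logs --since 30m api-server orchestrator health-evaluator | grep -iE "connection|timeout"

# Confirm every service re-established connections
docker compose exec postgres psql -U hoodcloud -d hoodcloud \
  -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY 1;"
```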
Redis Failover Verification
After a Redis Sentinel failover:

- Verify the command queue is operational (agents receiving commands)
- Verify rate limiter circuit breaker recovered:
- Check auth server rate limiter recovered (same circuit breaker pattern)
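A sketch of the verification; the compose service names and log text are assumptions:

```shell
# Confirm Redis answers again
docker compose exec redis redis-cli ping

# Look for circuit breaker state changes in the auth server logs (log text illustrative)
docker compose logs --since 15m auth-server | grep -i "circuit"
```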
Connection Pool Tuning
When `db_pool_active_connections` > 80% of `MaxConns` or `db_pool_acquire_duration_seconds` p99 > 1s:
- Identify which service is exhausting its pool
- Check if slow queries are holding connections (see Statement Timeout Runbook in Database Guardrails)
- Increase `DB_MAX_CONNS` for the affected service
- Verify the total connection budget stays under PostgreSQL `max_connections`
- If the total budget exceeds limits, consider deploying PgBouncer
Query Performance Degradation
When the p95 SLO is breached:

- Identify the degraded service from the `hoodcloud_db_query_duration_seconds` metric
- Query `pg_stat_statements` for the top queries by mean execution time:
- Run `EXPLAIN ANALYZE` on the slow query
- Check for missing indexes, table bloat, or lock contention
- If the health evaluator's `ListSnapshots` query is degrading, check node count growth
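A sketch of the `pg_stat_statements` query mentioned above; note the column is `mean_exec_time` on PostgreSQL 13+ (`mean_time` on older versions), and role/database names are assumed:

```shell
docker compose exec postgres psql -U hoodcloud -d hoodcloud -c \
  "SELECT mean_exec_time, calls, left(query, 80) AS query
     FROM pg_stat_statements
    ORDER BY mean_exec_time DESC
    LIMIT 10;"
```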
Troubleshooting
Caddy Certificate Issues
- Verify DNS records: `dig api.hoodcloud.io`
- Check that ports 80 and 443 are open
- Check logs: `docker compose logs caddy`
Database Connection Issues
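First-pass checks, with role and service names assumed:

```shell
# Is PostgreSQL accepting connections at all?
docker compose exec postgres pg_isready -U hoodcloud

# Which services are failing to connect?
docker compose logs --tail=100 api-server | grep -i "connect"
```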
Temporal Issues
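Basic checks, assuming the Temporal image ships the `tctl` CLI:

```shell
# Server health
docker compose exec temporal tctl cluster health

# List recent workflow executions
docker compose exec temporal tctl workflow list
```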
API Server Won’t Start - Missing Binaries Dir
The API server builds ops-agent into `/app/binaries/` during the Docker build. Rebuild if the directory is missing:
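A sketch of the rebuild:

```shell
# Rebuild the API server image so /app/binaries/ is repopulated, then recreate it
docker compose build --no-cache api-server
docker compose up -d api-server
```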
Services Fail After Vault Restart
Vault seals on every restart. Unseal with 3 of 5 keys before services can authenticate. See Vault Operations.

Database Errors After Secrets Rotation
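A likely recovery path, assuming services cache database credentials at startup (connection string and service names are illustrative):

```shell
# Confirm the rotated password actually works against PostgreSQL
docker compose exec postgres psql \
  "postgresql://hoodcloud:<new_password>@localhost:5432/hoodcloud" -c "SELECT 1;"

# Recreate services so they pick up the rotated credentials
docker compose up -d --force-recreate api-server auth-server orchestrator health-evaluator
```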
Security Notes
- All internal services bind to `127.0.0.1` only
- External access only through the Caddy reverse proxy (auto TLS)
- gRPC uses a self-signed CA for ops-agent authentication
- Security boundaries: see `CLAUDE.md`