Required Configuration
- All required env vars set (see Environment Variables)
- `SERVER_PUBLIC_URL` set to production domain
- `GRPC_PUBLIC_URL` set to production gRPC endpoint
- Database credentials configured (via Vault)
- Redis credentials configured (via Vault)
- Temporal connection configured
- AWS credentials and S3 buckets configured
- Cloud provider credentials (`HCLOUD_TOKEN`) configured
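The public-URL and cloud-provider settings above, sketched as a dotenv fragment (all values are placeholders, not real endpoints):

```
SERVER_PUBLIC_URL=https://app.example.com
GRPC_PUBLIC_URL=grpc.example.com:9090
HCLOUD_TOKEN=<from Hetzner Cloud console>
```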
Authentication
- JWT RSA keys stored in Vault (`jwt_private_key`, `jwt_public_key` in `app-credentials`)
- `AUTH_ALLOWED_ORIGINS` set to production frontend origins
- `AUTH_PROVIDER` set to `clerk` (default)
- `AUTH_CLERK_SECRET_KEY` stored in Vault (from Clerk dashboard -> API Keys)
- `AUTH_CLERK_AUTHORIZED_PARTY` matches production frontend URL
- `AUTH_CLERK_WEBHOOK_SIGNING_SECRET` set and matches Clerk dashboard
- Clerk webhook endpoint configured: `https://<auth-server>/webhooks/clerk`
- Clerk webhook events: `user.created`, `user.updated`, `user.deleted`
See also: Clerk Setup for full Clerk configuration guide.
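Clerk signs webhooks with Svix-style signatures. Below is a minimal verification sketch, assuming the standard Svix scheme (HMAC-SHA256 over `id.timestamp.payload`, keyed with the base64 part of the `whsec_` secret); verify against Clerk's current documentation before relying on it:

```python
import base64
import hashlib
import hmac

def verify_clerk_webhook(secret: str, msg_id: str, timestamp: str,
                         payload: bytes, signature_header: str) -> bool:
    """Verify a Svix-style webhook signature (scheme assumed, see lead-in).

    secret: AUTH_CLERK_WEBHOOK_SIGNING_SECRET, e.g. "whsec_<base64>".
    signature_header: the "svix-signature" header, e.g. "v1,<base64 sig>".
    """
    key = base64.b64decode(secret.removeprefix("whsec_"))
    signed_content = f"{msg_id}.{timestamp}.".encode() + payload
    expected = base64.b64encode(
        hmac.new(key, signed_content, hashlib.sha256).digest()
    ).decode()
    # The header may carry several space-separated versioned signatures.
    for part in signature_header.split():
        version, _, sig = part.partition(",")
        if version == "v1" and hmac.compare_digest(sig, expected):
            return True
    return False
```

Using a constant-time comparison (`hmac.compare_digest`) matters here; a naive `==` leaks timing information about the expected signature.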
Vault & Secrets
- Vault initialized, unsealed, and accessible
- AppRole credentials deployed to control plane (`/opt/hoodcloud/secrets/vault-secret-id`)
- Vault CA cert deployed (`/opt/hoodcloud/secrets/vault-ca.crt`)
- `VAULT_ADDR`, `VAULT_ROLE_ID`, `VAULT_SECRET_ID_PATH`, `VAULT_TLS_CA_FILE` set in `.env`
- All secrets populated in `secret/hoodcloud/app-credentials`
- `sealed_box_public_key` and `sealed_box_private_key` present and valid (mandatory)
- Master key stored at `secret/hoodcloud/master-key`
- Root token revoked after setup
See also: Vault Operations for setup steps and secret structure.
Security
- All secrets in HashiCorp Vault (not committed to version control)
- No default passwords (`GRAFANA_ADMIN_PASSWORD` must be set explicitly)
- TLS enabled for gRPC (`GRPC_USE_TLS=true`)
- API authentication enabled (`API_AUTH_ENABLED=true`) — disabling requires `DANGEROUS_DISABLE_AUTH=true`
- Rate limiting configured (`API_RATE_LIMIT_ENABLED=true`)
- Auth rate limiting configured (`AUTH_RATE_LIMIT_PER_MINUTE`)
- Config signature verification enabled (`REQUIRE_CONFIG_SIGNATURE=true`)
- Database SSL enabled (`DB_SSL_MODE=require` or `verify-full`)
- `NATS_ACCOUNTS_ENABLED=true` when NATS is enabled in production
- `VAULT_TLS_SKIP_VERIFY` is NOT `true` in production
- Payment service `TLS_ALLOWED_CN` set for client certificate verification
Sealed Box Keypair (Mandatory)
`sealed_box_public_key` and `sealed_box_private_key` must be present in `secret/hoodcloud/app-credentials`. These are base64-encoded 32-byte X25519 keys for NaCl sealed box encryption. api-server and orchestrator fail fast at startup if they are missing or invalid.
See Vault - Generate X25519 Keypair.
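The fail-fast check amounts to verifying each value is valid base64 that decodes to exactly 32 bytes. A sketch of that validation (illustrative Python; the services themselves are not Python):

```python
import base64

def validate_sealed_box_key(value: str) -> bytes:
    """Decode a base64-encoded X25519 key and verify it is exactly 32 bytes."""
    try:
        raw = base64.b64decode(value, validate=True)
    except Exception as exc:
        raise ValueError(f"not valid base64: {exc}")
    if len(raw) != 32:
        raise ValueError(f"expected 32 bytes, got {len(raw)}")
    return raw

# A 32-byte key encodes to a 44-character base64 string.
print(len(validate_sealed_box_key(base64.b64encode(bytes(32)).decode())))  # 32
```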
Incident Notifications (Optional)
The health-evaluator reads notification credentials from Vault (`secret/hoodcloud/app-credentials`):
| Channel | Vault Fields |
|---|---|
| Slack | `incident_slack_webhook_url` |
| Telegram | `incident_telegram_bot_token`, `incident_telegram_chat_id` |
| Email | `incident_email_api_url`, `incident_email_api_key`, `incident_email_from`, `incident_email_to` |
| Webhook | `incident_webhook_url`, `incident_webhook_secret` |
NATS JWT Operator Mode
- Run `nats-jwt-setup` on the control plane server
- Apply the printed Vault commands to store signing seeds and public keys
- Verify Vault has all NATS fields in `secret/hoodcloud/app-credentials`: `nats_operator_signing_seed`, `nats_agent_account_signing_seed`, `nats_ctrl_account_signing_seed`, `nats_agent_account_pub`, `nats_ctrl_account_pub`
- If the payment service uses the same NATS: add `nats_ctrl_signing_seed` and `nats_ctrl_account_pub` to `secret/payment-service/credentials`
- Set env vars in `.env`: `NATS_JWT_ENABLED=true`, `NATS_JWT_CTRL_ACCOUNT_PUB=<from nats-jwt-setup output>`, `NATS_JWT_AGENT_ACCOUNT_PUB=<from nats-jwt-setup output>`, `NATS_JWT_EXTERNAL_URL=tls://nats.hoodcloud.io:4223`
- If migrating from token auth: wipe the NATS data volume (`docker volume rm hoodcloud_nats_data`)
- Restart all services and verify `"NATS JWT credentials created"` in logs
- Verify cross-account metrics flow: `nats stream info HOODCLOUD_METRICS`
See also: NATS JWT Architecture for account structure, credential flows, and operational notes.
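For reference, the `.env` settings from the checklist above collected as a dotenv fragment (the two public keys are placeholders to be filled from `nats-jwt-setup` output):

```
NATS_JWT_ENABLED=true
NATS_JWT_CTRL_ACCOUNT_PUB=<from nats-jwt-setup output>
NATS_JWT_AGENT_ACCOUNT_PUB=<from nats-jwt-setup output>
NATS_JWT_EXTERNAL_URL=tls://nats.hoodcloud.io:4223
```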
NATS TLS (Let’s Encrypt)
- Certbot installed on control plane server (`apt-get install certbot python3-certbot-dns-cloudflare`)
- Cloudflare API token stored in Vault at `secret/infra/certbot/cloudflare` (key: `api_token`, scoped to `Zone:DNS:Edit` + `Zone:Zone:Read`)
- Vault policy `certbot-nats` created (read-only on `secret/data/infra/certbot/cloudflare`)
- Vault AppRole `certbot-nats` created (TTL 5m, max TTL 10m)
- AppRole credentials deployed:
  - RoleID at `/opt/hoodcloud/secrets/certbot-role-id` (mode 0400, root only)
  - SecretID at `/opt/hoodcloud/secrets/certbot-secret-id` (mode 0400, root only)
- Certbot wrapper script at `/opt/hoodcloud/scripts/certbot-renew.sh` (mode 0700) — authenticates via AppRole, no persistent Cloudflare token on disk
- Certbot systemd service overridden to use the wrapper script (prevents `certbot renew` from running without Vault-sourced credentials)
- Initial certificate issued via wrapper script: `/opt/hoodcloud/scripts/certbot-renew.sh`
- Deploy hook at `/opt/hoodcloud/scripts/nats-cert-deploy.sh` is executable and symlinked to `/etc/letsencrypt/renewal-hooks/deploy/nats-reload.sh`
- Certificates present at `/opt/hoodcloud/nats-tls/fullchain.pem` and `privkey.pem`
- NATS container mounts `/opt/hoodcloud/nats-tls:/etc/nats/tls:ro`
- x509-certificate-exporter running and scraping `/opt/hoodcloud/nats-tls/fullchain.pem`
- Prometheus scraping x509-exporter metrics
- Alert rules loaded: `NATSTLSCertExpiringSoon` (14 days), `NATSTLSCertExpiryCritical` (3 days)
- Certificate expiry > 30 days: `openssl x509 -enddate -noout -in /opt/hoodcloud/nats-tls/fullchain.pem`
Payment Service
- `PAYMENT_SERVICE_ENABLED=true`
- `PAYMENT_SERVICE_ADDRESS` set to payment server gRPC endpoint
- mTLS certificates generated and deployed to both servers
- `TLS_ALLOWED_CN` set on payment service
- `DB_SSL_MODE=require` (not `disable`)
- NATS JWT operator mode configured (signing seeds in Vault, account public keys in `.env`)
Database Migrations
Migrations are run via the dedicated `cmd/migrate` binary as a pre-deploy step. Services no longer run migrations at startup.
- Idempotent — safe to run multiple times
- Must run BEFORE starting any service after a code update
- Exits non-zero on failure — deployment pipeline should abort
- Falls back to config-only credentials if Vault is unavailable
| Migration | Description |
|---|---|
| 027_node_health_state | Persistent health state with optimistic locking |
| 028_incidents | Incident records with dedup index |
| 029_notification_outbox | Persistent retry outbox for notifications |
| 030_health_event_outbox | Event outbox for health transition events |
| 031_incident_pipeline_fixes | `is_flapping` and `resolution_debounce` columns |
| 039_rolling_uptime | `node_state_log` and `node_uptime_hourly` tables for rolling uptime |
| XXX_add_node_state_version | `state_version` column on `nodes` table for optimistic locking |
If `cmd/migrate` fails mid-migration, the `schema_migrations` table is marked dirty. See Runbooks - Dirty Migration Recovery.
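The idempotency and fail-fast behavior above can be sketched as follows (illustrative only, using SQLite in place of PostgreSQL; this is not the actual `cmd/migrate` implementation):

```python
import sqlite3
import sys

# Hypothetical migrations: (version, SQL). The real tool reads migration files.
MIGRATIONS = [
    ("027_node_health_state",
     "CREATE TABLE node_health_state (node_id TEXT PRIMARY KEY, state TEXT)"),
    ("028_incidents",
     "CREATE TABLE incidents (id INTEGER PRIMARY KEY, node_id TEXT)"),
]

def migrate(conn):
    """Apply pending migrations; idempotent, so safe to run repeatedly."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations"
        " (version TEXT PRIMARY KEY, dirty INTEGER NOT NULL DEFAULT 0)"
    )
    applied = {v for (v,) in conn.execute("SELECT version FROM schema_migrations")}
    ran = 0
    for version, sql in MIGRATIONS:
        if version in applied:
            continue  # already recorded; skipping is what makes reruns no-ops
        try:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_migrations (version) VALUES (?)",
                         (version,))
            conn.commit()
            ran += 1
        except sqlite3.Error:
            # Mark the failed version dirty, then exit non-zero so the
            # deployment pipeline aborts (see Dirty Migration Recovery).
            conn.execute(
                "INSERT OR REPLACE INTO schema_migrations (version, dirty)"
                " VALUES (?, 1)", (version,))
            conn.commit()
            sys.exit(1)
    return ran

conn = sqlite3.connect(":memory:")
print(migrate(conn))  # first run applies both migrations
print(migrate(conn))  # second run is a no-op
```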
Production Validation
`ValidateProductionReadiness()` checks when `ENVIRONMENT=production`:
- `NATS_ACCOUNTS_ENABLED` must be `true` when NATS is enabled
- `VAULT_TLS_SKIP_VERIFY` must not be `true`
- `DB_SSL_MODE` must not be `disable`
- `PAYMENT_CONSUMER_ENABLED` must be explicitly `true`
- `GRAFANA_ADMIN_PASSWORD` must be set
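A sketch of equivalent validation logic (illustrative Python, not the real implementation; `NATS_ENABLED` is an assumed flag name for "NATS is enabled"):

```python
def validate_production_readiness(env):
    """Return a list of violations when ENVIRONMENT=production (empty = ready)."""
    if env.get("ENVIRONMENT") != "production":
        return []
    errors = []
    # NATS_ENABLED is an assumed flag name, see lead-in.
    if env.get("NATS_ENABLED") == "true" and env.get("NATS_ACCOUNTS_ENABLED") != "true":
        errors.append("NATS_ACCOUNTS_ENABLED must be true when NATS is enabled")
    if env.get("VAULT_TLS_SKIP_VERIFY") == "true":
        errors.append("VAULT_TLS_SKIP_VERIFY must not be true in production")
    if env.get("DB_SSL_MODE") == "disable":
        errors.append("DB_SSL_MODE must not be disable")
    if env.get("PAYMENT_CONSUMER_ENABLED") != "true":
        errors.append("PAYMENT_CONSUMER_ENABLED must be explicitly true")
    if not env.get("GRAFANA_ADMIN_PASSWORD"):
        errors.append("GRAFANA_ADMIN_PASSWORD must be set")
    return errors

print(validate_production_readiness({"ENVIRONMENT": "staging"}))  # []
```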
Rollout Authorization
- Admin API keys provisioned for authorized operators (backend-only, not self-assignable)
- Regular users receive 403 on rollout endpoints (verify with test request)
- `APIKeyScopeAdmin` cannot be created via the API key creation endpoint

The admin scope must be inserted directly into the database or via a backend script; the API key creation endpoint rejects admin scope requests.
Leader Election (Health Evaluator)
- Health evaluator deployed with 2+ instances (one active leader, N-1 standby)
- `health_evaluator_is_leader` metric visible in Prometheus
- Verify only 1 instance runs evaluation cycles (check logs for “Acquired leader lock”)
- Verify standby instances run only outbox worker and metrics ingester
Connection Pool & Statement Timeouts
- Per-service `DB_MAX_CONNS` and `DB_STATEMENT_TIMEOUT` configured (see Environment Variables)
- Total connection budget verified: sum(MaxConns × instances) < PostgreSQL `max_connections`
- `pg_stat_statements` enabled in PostgreSQL config
- `log_min_duration_statement = 500` set in PostgreSQL config
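The budget check is simple arithmetic; a small sketch with purely illustrative per-service numbers:

```python
# Hypothetical per-service pool settings: (DB_MAX_CONNS, instance count).
services = {
    "api-server": (20, 3),
    "orchestrator": (10, 2),
    "health-evaluator": (5, 2),
}
PG_MAX_CONNECTIONS = 200  # from postgresql.conf; keep headroom for maintenance sessions

budget = sum(max_conns * instances for max_conns, instances in services.values())
print(budget)  # 90
assert budget < PG_MAX_CONNECTIONS, "connection budget exceeds PostgreSQL max_connections"
```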
NATS Cluster
- 3-node NATS cluster deployed
- `NATS_STREAM_REPLICAS=3` set for production
- Verify streams created with R=3: `nats stream info HOODCLOUD_EVENTS`
Deployment Order
- Run `cmd/migrate` (pre-deploy step)
- Deploy infrastructure (PostgreSQL HA, Redis HA, NATS 3-node cluster)
- Deploy services (recommended order: 2C → 2E → 2D for incremental protection)
- Verify leader election, connection pools, statement timeouts
Operational
- `LOG_LEVEL` set to `info` or `warn`
- Health evaluation intervals configured
- Subscription cleanup enabled
- NATS event streaming enabled for monitoring
- Observability URLs configured (`LOKI_URL`, `PROMETHEUS_URL`)
- Backup cleanup configured (`AWS_S3_BUCKET` set)
Infrastructure
- IAM policy applied (see `infrastructure/iam/hoodcloud-minimal-policy.json`)
- Terraform state: local backend persistent or S3 backend configured
- If S3 backend: `TERRAFORM_S3_BUCKET`, `TERRAFORM_DYNAMODB_TABLE` set
- Chain profiles available (local dir or S3 bucket)
- Ops agent binaries available for download
- Hetzner Cloud firewall: ports 80, 443, 9090 open
Pre-Deployment File Checks
Verification Commands
Common Deployment Issues
| Issue | Symptom | Fix |
|---|---|---|
| Empty OPS_AGENT_VERSION | Double dash in download URL (`ops-agent--linux-amd64`) | Set `OPS_AGENT_VERSION=0.1.0` |
| Missing binaries dir | `/downloads/` returns 404 | Set `OPS_AGENT_BINARIES_DIR` and verify binary exists |
| runtime.yaml not found | ops-agent “load runtime config” error | Verify `CONFIG_DIR` and config bundle download |
| Port 9090 blocked | Nodes can’t connect to gRPC | Add firewall rule for TCP 9090 |
| Vault auth failure | “authenticate with Vault” error | Check `VAULT_ROLE_ID`, `VAULT_SECRET_ID_PATH`, unseal state |
| Vault secrets missing | “get secret from Vault” error | Verify secrets: `vault kv get secret/hoodcloud/app-credentials` |