Quick-reference checklist before deploying to production. For full setup steps, see Production Deployment.

Required Configuration

  • All required env vars set (see Environment Variables)
  • SERVER_PUBLIC_URL set to production domain
  • GRPC_PUBLIC_URL set to production gRPC endpoint
  • Database credentials configured (via Vault)
  • Redis credentials configured (via Vault)
  • Temporal connection configured
  • AWS credentials and S3 buckets configured
  • Cloud provider credentials (HCLOUD_TOKEN) configured
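A minimal .env fragment illustrating the settings above (all values are placeholders, not real endpoints):

```shell
# Example values only — replace with your production endpoints.
# Secrets such as HCLOUD_TOKEN come from Vault/your secret store, never committed.
SERVER_PUBLIC_URL=https://api.example.com
GRPC_PUBLIC_URL=grpc.example.com:9090
AWS_S3_BUCKET=example-hoodcloud-backups
HCLOUD_TOKEN=<from-secret-store>
```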

Authentication

  • JWT RSA keys stored in Vault (jwt_private_key, jwt_public_key in app-credentials)
  • AUTH_ALLOWED_ORIGINS set to production frontend origins
  • AUTH_PROVIDER set to clerk (default)
  • AUTH_CLERK_SECRET_KEY stored in Vault (from Clerk dashboard -> API Keys)
  • AUTH_CLERK_AUTHORIZED_PARTY matches production frontend URL
  • AUTH_CLERK_WEBHOOK_SIGNING_SECRET set and matches Clerk dashboard
  • Clerk webhook endpoint configured: https://<auth-server>/webhooks/clerk
  • Clerk webhook events: user.created, user.updated, user.deleted
See also: Clerk Setup for full Clerk configuration guide.

Vault & Secrets

  • Vault initialized, unsealed, and accessible
  • AppRole credentials deployed to control plane (/opt/hoodcloud/secrets/vault-secret-id)
  • Vault CA cert deployed (/opt/hoodcloud/secrets/vault-ca.crt)
  • VAULT_ADDR, VAULT_ROLE_ID, VAULT_SECRET_ID_PATH, VAULT_TLS_CA_FILE set in .env
  • All secrets populated in secret/hoodcloud/app-credentials
  • sealed_box_public_key and sealed_box_private_key present and valid (mandatory)
  • Master key stored at secret/hoodcloud/master-key
  • Root token revoked after setup
See also: Vault Operations for setup steps and secret structure.

Security

  • All secrets in HashiCorp Vault (not committed to version control)
  • No default passwords (GRAFANA_ADMIN_PASSWORD must be set explicitly)
  • TLS enabled for gRPC (GRPC_USE_TLS=true)
  • API authentication enabled (API_AUTH_ENABLED=true) — disabling requires DANGEROUS_DISABLE_AUTH=true
  • Rate limiting configured (API_RATE_LIMIT_ENABLED=true)
  • Auth rate limiting configured (AUTH_RATE_LIMIT_PER_MINUTE)
  • Config signature verification enabled (REQUIRE_CONFIG_SIGNATURE=true)
  • Database SSL enabled (DB_SSL_MODE=require or verify-full)
  • NATS_ACCOUNTS_ENABLED=true when NATS is enabled in production
  • VAULT_TLS_SKIP_VERIFY is NOT true in production
  • Payment service TLS_ALLOWED_CN set for client certificate verification

Sealed Box Keypair (Mandatory)

sealed_box_public_key and sealed_box_private_key must be present in secret/hoodcloud/app-credentials. They are base64-encoded 32-byte X25519 keys used for NaCl sealed box encryption. Both api-server and orchestrator fail fast at startup if either key is missing or invalid. See Vault - Generate X25519 Keypair.
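A quick sanity check that a key decodes to exactly 32 bytes. The key below is a generated all-zero placeholder; in production, read the real value from Vault instead:

```shell
# Placeholder: base64 of 32 zero bytes (stand-in for the real key).
# In production:
#   key="$(vault kv get -field=sealed_box_public_key secret/hoodcloud/app-credentials)"
key="$(head -c 32 /dev/zero | base64)"

# A valid sealed-box key must decode to exactly 32 bytes.
len=$(printf '%s' "$key" | base64 -d | wc -c)
if [ "$len" -eq 32 ]; then
    echo "sealed box key length OK"
else
    echo "invalid key length: $len bytes (expected 32)"
fi
```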

Incident Notifications (Optional)

The health-evaluator reads notification credentials from Vault (secret/hoodcloud/app-credentials):
Channel     Vault Fields
Slack       incident_slack_webhook_url
Telegram    incident_telegram_bot_token, incident_telegram_chat_id
Email       incident_email_api_url, incident_email_api_key, incident_email_from, incident_email_to
Webhook     incident_webhook_url, incident_webhook_secret
If none are configured, incidents are tracked in the database only. At least one channel is recommended for production.
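For example, enabling the Slack channel means writing its field into app-credentials (the webhook URL below is a placeholder):

```shell
# KV v2 patch adds the field without overwriting the other keys in the secret
vault kv patch secret/hoodcloud/app-credentials \
  incident_slack_webhook_url="https://hooks.slack.com/services/XXX/YYY/ZZZ"
```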

NATS JWT Operator Mode

  • Run nats-jwt-setup on the control plane server:
    go run cmd/nats-jwt-setup/main.go -output-dir infrastructure/docker/nats/jwt
    
  • Apply the printed Vault commands to store signing seeds and public keys
  • Verify Vault has all NATS fields in secret/hoodcloud/app-credentials:
    • nats_operator_signing_seed, nats_agent_account_signing_seed, nats_ctrl_account_signing_seed
    • nats_agent_account_pub, nats_ctrl_account_pub
  • If payment service uses same NATS: add nats_ctrl_signing_seed and nats_ctrl_account_pub to secret/payment-service/credentials
  • Set env vars in .env:
    • NATS_JWT_ENABLED=true
    • NATS_JWT_CTRL_ACCOUNT_PUB=<from nats-jwt-setup output>
    • NATS_JWT_AGENT_ACCOUNT_PUB=<from nats-jwt-setup output>
    • NATS_JWT_EXTERNAL_URL=tls://nats.hoodcloud.io:4223
  • If migrating from token auth: wipe NATS data volume (docker volume rm hoodcloud_nats_data)
  • Restart all services and verify "NATS JWT credentials created" in logs
  • Verify cross-account metrics flow: nats stream info HOODCLOUD_METRICS
See also: NATS JWT Architecture for account structure, credential flows, and operational notes.

NATS TLS (Let’s Encrypt)

  • Certbot installed on control plane server (apt-get install certbot python3-certbot-dns-cloudflare)
  • Cloudflare API token stored in Vault at secret/infra/certbot/cloudflare (key: api_token, scoped to Zone:DNS:Edit + Zone:Zone:Read)
  • Vault policy certbot-nats created (read-only on secret/data/infra/certbot/cloudflare)
  • Vault AppRole certbot-nats created (TTL 5m, max TTL 10m)
  • AppRole credentials deployed:
    • RoleID at /opt/hoodcloud/secrets/certbot-role-id (mode 0400, root only)
    • SecretID at /opt/hoodcloud/secrets/certbot-secret-id (mode 0400, root only)
  • Certbot wrapper script at /opt/hoodcloud/scripts/certbot-renew.sh (mode 0700) — authenticates via AppRole, no persistent Cloudflare token on disk
  • Certbot systemd service overridden to use wrapper script (prevents certbot renew from running without Vault-sourced credentials):
    mkdir -p /etc/systemd/system/certbot.service.d
    cat > /etc/systemd/system/certbot.service.d/override.conf <<EOF
    [Service]
    ExecStart=
    ExecStart=/opt/hoodcloud/scripts/certbot-renew.sh
    EOF
    systemctl daemon-reload
    
  • Initial certificate issued via wrapper script: /opt/hoodcloud/scripts/certbot-renew.sh
  • Deploy hook at /opt/hoodcloud/scripts/nats-cert-deploy.sh is executable and symlinked to /etc/letsencrypt/renewal-hooks/deploy/nats-reload.sh
  • Certificates present at /opt/hoodcloud/nats-tls/fullchain.pem and privkey.pem
  • NATS container mounts /opt/hoodcloud/nats-tls:/etc/nats/tls:ro
  • x509-certificate-exporter running and scraping /opt/hoodcloud/nats-tls/fullchain.pem
  • Prometheus scraping x509-exporter metrics
  • Alert rules loaded: NATSTLSCertExpiringSoon (14 days), NATSTLSCertExpiryCritical (3 days)
  • Certificate expiry > 30 days: openssl x509 -enddate -noout -in /opt/hoodcloud/nats-tls/fullchain.pem

Payment Service

  • PAYMENT_SERVICE_ENABLED=true
  • PAYMENT_SERVICE_ADDRESS set to payment server gRPC endpoint
  • mTLS certificates generated and deployed to both servers
  • TLS_ALLOWED_CN set on payment service
  • DB_SSL_MODE=require (not disable)
  • NATS JWT operator mode configured (signing seeds in Vault, account public keys in .env)

Database Migrations

Migrations are run via the dedicated cmd/migrate binary as a pre-deploy step. Services no longer run migrations at startup.
# Run migrations before deploying services
go run ./cmd/migrate
# Or use compiled binary:
./migrate
  • Idempotent — safe to run multiple times
  • Must run BEFORE starting any service after a code update
  • Exits non-zero on failure — deployment pipeline should abort
  • Falls back to config-only credentials if Vault is unavailable
Key migrations:
Migration                      Description
027_node_health_state          Persistent health state with optimistic locking
028_incidents                  Incident records with dedup index
029_notification_outbox        Persistent retry outbox for notifications
030_health_event_outbox        Event outbox for health transition events
031_incident_pipeline_fixes    is_flapping and resolution_debounce columns
039_rolling_uptime             node_state_log and node_uptime_hourly tables for rolling uptime
XXX_add_node_state_version     state_version column on nodes table for optimistic locking
Dirty migration recovery: If cmd/migrate fails mid-migration, the schema_migrations table is marked dirty. See Runbooks - Dirty Migration Recovery.
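In a deployment pipeline, the pre-deploy step can be gated on the migration exit code. A minimal sketch (MIGRATE_CMD stands in for the real path to the compiled cmd/migrate binary):

```shell
# Abort the deploy unless migrations succeed (exit-code gate).
# MIGRATE_CMD defaults to ./migrate; override for your layout.
gate_deploy() {
    if ! "${MIGRATE_CMD:-./migrate}"; then
        echo "migrations failed; aborting deploy" >&2
        return 1
    fi
    echo "migrations OK; continuing deploy"
}
```

Because cmd/migrate is idempotent, re-running this gate on a retried pipeline is safe.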

Production Validation

ValidateProductionReadiness() checks when ENVIRONMENT=production:
  • NATS_ACCOUNTS_ENABLED must be true when NATS enabled
  • VAULT_TLS_SKIP_VERIFY must not be true
  • DB_SSL_MODE must not be disable
  • PAYMENT_CONSUMER_ENABLED must be explicitly true
  • GRAFANA_ADMIN_PASSWORD must be set
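The same checks can be mirrored in a pre-flight shell script before the services even start. This is a sketch, not the real ValidateProductionReadiness; variable names are taken from this page, and NATS_ENABLED is an assumption:

```shell
# Pre-flight mirror of the documented production checks (sketch only).
check_production_env() {
    [ "${ENVIRONMENT:-}" = "production" ] || return 0
    rc=0
    if [ "${NATS_ENABLED:-}" = "true" ] && [ "${NATS_ACCOUNTS_ENABLED:-}" != "true" ]; then
        echo "FAIL: NATS_ACCOUNTS_ENABLED must be true when NATS is enabled"; rc=1
    fi
    [ "${VAULT_TLS_SKIP_VERIFY:-}" != "true" ]   || { echo "FAIL: VAULT_TLS_SKIP_VERIFY is true"; rc=1; }
    [ "${DB_SSL_MODE:-}" != "disable" ]          || { echo "FAIL: DB_SSL_MODE is disable"; rc=1; }
    [ "${PAYMENT_CONSUMER_ENABLED:-}" = "true" ] || { echo "FAIL: PAYMENT_CONSUMER_ENABLED not explicitly true"; rc=1; }
    [ -n "${GRAFANA_ADMIN_PASSWORD:-}" ]         || { echo "FAIL: GRAFANA_ADMIN_PASSWORD not set"; rc=1; }
    return $rc
}
```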

Rollout Authorization

  • Admin API keys provisioned for authorized operators (backend-only, not self-assignable)
  • Regular users receive 403 on rollout endpoints (verify with test request)
  • APIKeyScopeAdmin cannot be created via the API key creation endpoint
Creating admin API keys: keys with admin scope must be inserted directly into the database or via a backend script; the API key creation endpoint rejects admin-scope requests.
-- Example: insert admin API key (hash the key first)
INSERT INTO api_keys (id, user_id, name, key_hash, scopes, created_at, updated_at)
VALUES (gen_random_uuid(), '<admin-user-id>', 'admin-key', '<bcrypt-hash>', ARRAY['admin'], now(), now());

Leader Election (Health Evaluator)

  • Health evaluator deployed with 2+ instances (one active leader, N-1 standby)
  • health_evaluator_is_leader metric visible in Prometheus
  • Verify only 1 instance runs evaluation cycles (check logs for “Acquired leader lock”)
  • Verify standby instances run only outbox worker and metrics ingester

Connection Pool & Statement Timeouts

  • Per-service DB_MAX_CONNS and DB_STATEMENT_TIMEOUT configured (see Environment Variables)
  • Total connection budget verified: sum(MaxConns × instances) < PostgreSQL max_connections
  • pg_stat_statements enabled in PostgreSQL config
  • log_min_duration_statement = 500 set in PostgreSQL config
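The budget check is simple arithmetic; a sketch with hypothetical pool sizes and instance counts (substitute your real per-service DB_MAX_CONNS values):

```shell
# sum(MaxConns × instances) across services must stay below max_connections.
max_connections=100          # from postgresql.conf (example value)
api_server=$(( 10 * 2 ))     # DB_MAX_CONNS=10, 2 instances (hypothetical)
health_eval=$(( 5 * 2 ))     # DB_MAX_CONNS=5,  2 instances (hypothetical)
orchestrator=$(( 8 * 1 ))    # DB_MAX_CONNS=8,  1 instance  (hypothetical)

budget=$(( api_server + health_eval + orchestrator ))
if [ "$budget" -lt "$max_connections" ]; then
    echo "connection budget OK: $budget < $max_connections"
else
    echo "over budget: $budget >= $max_connections"
fi
```

Leave headroom below max_connections for superuser sessions and ad-hoc psql connections.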

NATS Cluster

  • 3-node NATS cluster deployed
  • NATS_STREAM_REPLICAS=3 set for production
  • Verify streams created with R=3: nats stream info HOODCLOUD_EVENTS

Deployment Order

  1. Run cmd/migrate (pre-deploy step)
  2. Deploy infrastructure (PostgreSQL HA, Redis HA, NATS 3-node cluster)
  3. Deploy services (recommended order: 2C → 2E → 2D for incremental protection)
  4. Verify leader election, connection pools, statement timeouts

Operational

  • LOG_LEVEL set to info or warn
  • Health evaluation intervals configured
  • Subscription cleanup enabled
  • NATS event streaming enabled for monitoring
  • Observability URLs configured (LOKI_URL, PROMETHEUS_URL)
  • Backup cleanup configured (AWS_S3_BUCKET set)

Infrastructure

  • IAM policy applied (see infrastructure/iam/hoodcloud-minimal-policy.json)
  • Terraform state: local backend persistent or S3 backend configured
  • If S3 backend: TERRAFORM_S3_BUCKET, TERRAFORM_DYNAMODB_TABLE set
  • Chain profiles available (local dir or S3 bucket)
  • Ops agent binaries available for download
  • Hetzner Cloud firewall: ports 80, 443, 9090 open

Pre-Deployment File Checks

# Ops-agent binary exists
ls -la $OPS_AGENT_BINARIES_DIR/ops-agent-${OPS_AGENT_VERSION}-linux-amd64

# Chain configs uploaded (if using S3)
aws s3 ls s3://${CHAINS_S3_BUCKET}/latest/configs.tar.gz

Verification Commands

# Uptime worker running inside health-evaluator
docker compose logs health-evaluator | grep -E "(uptime|UptimeWorker)"

# gRPC connectivity
nc -zv <GRPC_HOST> 9090

# Download endpoint
curl -I https://<SERVER_PUBLIC_URL>/downloads/ops-agent-${OPS_AGENT_VERSION}-linux-amd64

# Service startup
docker compose logs api-server | grep -E "(OPS_AGENT|error|Error)"
docker compose logs orchestrator | grep -E "(OPS_AGENT|error|Error)"
docker compose logs health-evaluator | grep -E "(incident|notification)"

# Vault connectivity
curl -s ${VAULT_ADDR}/v1/sys/health | jq .
vault kv get secret/hoodcloud/app-credentials

# Health checks
curl -s https://api.hoodcloud.io/health | jq .
curl -s https://auth.hoodcloud.io/health | jq .

Common Deployment Issues

Issue                     Symptom                                                Fix
Empty OPS_AGENT_VERSION   Double dash in download URL (ops-agent--linux-amd64)   Set OPS_AGENT_VERSION=0.1.0
Missing binaries dir      /downloads/ returns 404                                Set OPS_AGENT_BINARIES_DIR and verify binary exists
runtime.yaml not found    ops-agent "load runtime config" error                  Verify CONFIG_DIR and config bundle download
Port 9090 blocked         Nodes can't connect to gRPC                            Add firewall rule for TCP 9090
Vault auth failure        "authenticate with Vault" error                        Check VAULT_ROLE_ID, VAULT_SECRET_ID_PATH, unseal state
Vault secrets missing     "get secret from Vault" error                          Verify secrets: vault kv get secret/hoodcloud/app-credentials