Required Configuration
- All required env vars set (see Environment Variables)
- `SERVER_PUBLIC_URL` set to production domain
- `GRPC_PUBLIC_URL` set to production gRPC endpoint
- Database credentials configured (via Vault)
- Redis credentials configured (via Vault)
- Temporal connection configured
- AWS credentials and S3 buckets configured
- Cloud provider credentials (`HCLOUD_TOKEN`) configured
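The public-URL and cloud-provider settings above, sketched as a dotenv fragment (all values are placeholders, not real endpoints):

```
SERVER_PUBLIC_URL=https://app.example.com
GRPC_PUBLIC_URL=grpc.example.com:9090
HCLOUD_TOKEN=<from Hetzner Cloud console>
```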
Authentication
- JWT RSA keys stored in Vault (`jwt_private_key`, `jwt_public_key` in `app-credentials`)
- `AUTH_ALLOWED_ORIGINS` set to production frontend origins
- `AUTH_PROVIDER` set to `clerk` (default)
- `AUTH_CLERK_SECRET_KEY` stored in Vault (from Clerk dashboard -> API Keys)
- `AUTH_CLERK_AUTHORIZED_PARTY` matches production frontend URL
- `AUTH_CLERK_WEBHOOK_SIGNING_SECRET` set and matches Clerk dashboard
- Clerk webhook endpoint configured: `https://<auth-server>/webhooks/clerk`
- Clerk webhook events: `user.created`, `user.updated`, `user.deleted`
See also: Clerk Setup for full Clerk configuration guide.
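Clerk signs webhooks with Svix-style signatures. Below is a minimal verification sketch, assuming the standard Svix scheme (HMAC-SHA256 over `id.timestamp.payload`, keyed with the base64 part of the `whsec_` secret); verify against Clerk's current documentation before relying on it:

```python
import base64
import hashlib
import hmac

def verify_clerk_webhook(secret: str, msg_id: str, timestamp: str,
                         payload: bytes, signature_header: str) -> bool:
    """Verify a Svix-style webhook signature (scheme assumed, see lead-in).

    secret: AUTH_CLERK_WEBHOOK_SIGNING_SECRET, e.g. "whsec_<base64>".
    signature_header: the "svix-signature" header, e.g. "v1,<base64 sig>".
    """
    key = base64.b64decode(secret.removeprefix("whsec_"))
    signed_content = f"{msg_id}.{timestamp}.".encode() + payload
    expected = base64.b64encode(
        hmac.new(key, signed_content, hashlib.sha256).digest()
    ).decode()
    # The header may carry several space-separated versioned signatures.
    for part in signature_header.split():
        version, _, sig = part.partition(",")
        if version == "v1" and hmac.compare_digest(sig, expected):
            return True
    return False
```

Using a constant-time comparison (`hmac.compare_digest`) matters here; a naive `==` leaks timing information about the expected signature.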
Vault & Secrets
- Vault initialized, unsealed, and accessible
- AppRole credentials deployed to control plane (`/opt/hoodcloud/secrets/vault-secret-id`)
- Vault CA cert deployed (`/opt/hoodcloud/secrets/vault-ca.crt`)
- `VAULT_ADDR`, `VAULT_ROLE_ID`, `VAULT_SECRET_ID_PATH`, `VAULT_TLS_CA_FILE` set in `.env`
- All secrets populated in `secret/hoodcloud/app-credentials`
- `sealed_box_public_key` and `sealed_box_private_key` present and valid (mandatory)
- Master key stored at `secret/hoodcloud/master-key`
- Root token revoked after setup
See also: Vault Operations for setup steps and secret structure.
Security
- All secrets in HashiCorp Vault (not committed to version control)
- No default passwords (`GRAFANA_ADMIN_PASSWORD` must be set explicitly)
- TLS enabled for gRPC (`GRPC_USE_TLS=true`)
- API authentication enabled (`API_AUTH_ENABLED=true`) — disabling requires `DANGEROUS_DISABLE_AUTH=true`
- Rate limiting configured (`API_RATE_LIMIT_ENABLED=true`)
- Auth rate limiting configured (`AUTH_RATE_LIMIT_PER_MINUTE`)
- Config signature verification enabled (`REQUIRE_CONFIG_SIGNATURE=true`)
- Database SSL enabled (`DB_SSL_MODE=require` or `verify-full`)
- `NATS_ACCOUNTS_ENABLED=true` when NATS is enabled in production
- `VAULT_TLS_SKIP_VERIFY` is NOT `true` in production
- Payment service `TLS_ALLOWED_CN` set for client certificate verification
Sealed Box Keypair (Mandatory)
`sealed_box_public_key` and `sealed_box_private_key` must be present in `secret/hoodcloud/app-credentials`. These are base64-encoded 32-byte X25519 keys for NaCl sealed box encryption. api-server and orchestrator fail fast at startup if they are missing or invalid.
See Vault - Generate X25519 Keypair.
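The fail-fast check amounts to verifying each value is valid base64 that decodes to exactly 32 bytes. A sketch of that validation (illustrative Python; the services themselves are not Python):

```python
import base64

def validate_sealed_box_key(value: str) -> bytes:
    """Decode a base64-encoded X25519 key and verify it is exactly 32 bytes."""
    try:
        raw = base64.b64decode(value, validate=True)
    except Exception as exc:
        raise ValueError(f"not valid base64: {exc}")
    if len(raw) != 32:
        raise ValueError(f"expected 32 bytes, got {len(raw)}")
    return raw

# A 32-byte key encodes to a 44-character base64 string.
print(len(validate_sealed_box_key(base64.b64encode(bytes(32)).decode())))  # 32
```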
Incident Notifications (Optional)
The health-evaluator reads notification credentials from Vault (`secret/hoodcloud/app-credentials`):
| Channel | Vault Fields |
|---|---|
| Slack | `incident_slack_webhook_url` |
| Telegram | `incident_telegram_bot_token`, `incident_telegram_chat_id` |
| Email | `incident_email_api_url`, `incident_email_api_key`, `incident_email_from`, `incident_email_to` |
| Webhook | `incident_webhook_url`, `incident_webhook_secret` |
NATS JWT Operator Mode
- Run `nats-jwt-setup` on the control plane server
- Apply the printed Vault commands to store signing seeds and public keys
- Verify Vault has all NATS fields in `secret/hoodcloud/app-credentials`: `nats_operator_signing_seed`, `nats_agent_account_signing_seed`, `nats_ctrl_account_signing_seed`, `nats_agent_account_pub`, `nats_ctrl_account_pub`
- If the payment service uses the same NATS: add `nats_ctrl_signing_seed` and `nats_ctrl_account_pub` to `secret/payment-service/credentials`
- Set env vars in `.env`: `NATS_JWT_ENABLED=true`, `NATS_JWT_CTRL_ACCOUNT_PUB=<from nats-jwt-setup output>`, `NATS_JWT_AGENT_ACCOUNT_PUB=<from nats-jwt-setup output>`, `NATS_JWT_EXTERNAL_URL=tls://nats.hoodcloud.io:4223`
- If migrating from token auth: wipe the NATS data volume (`docker volume rm hoodcloud_nats_data`)
- Restart all services and verify `"NATS JWT credentials created"` in logs
- Verify cross-account metrics flow: `nats stream info HOODCLOUD_METRICS`
See also: NATS JWT Architecture for account structure, credential flows, and operational notes.
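For reference, the `.env` settings from the checklist above collected as a dotenv fragment (the two public keys are placeholders to be filled from `nats-jwt-setup` output):

```
NATS_JWT_ENABLED=true
NATS_JWT_CTRL_ACCOUNT_PUB=<from nats-jwt-setup output>
NATS_JWT_AGENT_ACCOUNT_PUB=<from nats-jwt-setup output>
NATS_JWT_EXTERNAL_URL=tls://nats.hoodcloud.io:4223
```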
NATS TLS (Let’s Encrypt)
- Certbot installed on control plane server (`apt-get install certbot python3-certbot-dns-cloudflare`)
- Cloudflare API token stored in Vault at `secret/infra/certbot/cloudflare` (key: `api_token`, scoped to `Zone:DNS:Edit` + `Zone:Zone:Read`)
- Vault policy `certbot-nats` created (read-only on `secret/data/infra/certbot/cloudflare`)
- Vault AppRole `certbot-nats` created (TTL 5m, max TTL 10m)
- AppRole credentials deployed:
  - RoleID at `/opt/hoodcloud/secrets/certbot-role-id` (mode 0400, root only)
  - SecretID at `/opt/hoodcloud/secrets/certbot-secret-id` (mode 0400, root only)
- Certbot wrapper script at `/opt/hoodcloud/scripts/certbot-renew.sh` (mode 0700) — authenticates via AppRole, no persistent Cloudflare token on disk
- Certbot systemd service overridden to use the wrapper script (prevents `certbot renew` from running without Vault-sourced credentials)
- Initial certificate issued via wrapper script: `/opt/hoodcloud/scripts/certbot-renew.sh`
- Deploy hook at `/opt/hoodcloud/scripts/nats-cert-deploy.sh` is executable and symlinked to `/etc/letsencrypt/renewal-hooks/deploy/nats-reload.sh`
- Certificates present at `/opt/hoodcloud/nats-tls/fullchain.pem` and `privkey.pem`
- NATS container mounts `/opt/hoodcloud/nats-tls:/etc/nats/tls:ro`
- x509-certificate-exporter running and scraping `/opt/hoodcloud/nats-tls/fullchain.pem`
- Prometheus scraping x509-exporter metrics
- Alert rules loaded: `NATSTLSCertExpiringSoon` (14 days), `NATSTLSCertExpiryCritical` (3 days)
- Certificate expiry > 30 days: `openssl x509 -enddate -noout -in /opt/hoodcloud/nats-tls/fullchain.pem`
Payment Service
- `PAYMENT_SERVICE_ENABLED=true`
- `PAYMENT_SERVICE_ADDRESS` set to payment server gRPC endpoint
- mTLS certificates generated and deployed to both servers
- `TLS_ALLOWED_CN` set on payment service
- `DB_SSL_MODE=require` (not `disable`)
- NATS JWT operator mode configured (signing seeds in Vault, account public keys in `.env`)
Database Migrations
Migrations are run via the dedicated `cmd/migrate` binary as a pre-deploy step. Services no longer run migrations at startup.
- Idempotent — safe to run multiple times
- Must run BEFORE starting any service after a code update
- Exits non-zero on failure — deployment pipeline should abort
- Falls back to config-only credentials if Vault is unavailable
| Migration | Description |
|---|---|
| 027_node_health_state | Persistent health state with optimistic locking |
| 028_incidents | Incident records with dedup index |
| 029_notification_outbox | Persistent retry outbox for notifications |
| 030_health_event_outbox | Event outbox for health transition events |
| 031_incident_pipeline_fixes | `is_flapping` and `resolution_debounce` columns |
| 039_rolling_uptime | `node_state_log` and `node_uptime_hourly` tables for rolling uptime |
| XXX_add_node_state_version | `state_version` column on `nodes` table for optimistic locking |
If `cmd/migrate` fails mid-migration, the `schema_migrations` table is marked dirty. See Runbooks - Dirty Migration Recovery.
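The idempotency and fail-fast behavior above can be sketched as follows (illustrative only, using SQLite in place of PostgreSQL; this is not the actual `cmd/migrate` implementation):

```python
import sqlite3
import sys

# Hypothetical migrations: (version, SQL). The real tool reads migration files.
MIGRATIONS = [
    ("027_node_health_state",
     "CREATE TABLE node_health_state (node_id TEXT PRIMARY KEY, state TEXT)"),
    ("028_incidents",
     "CREATE TABLE incidents (id INTEGER PRIMARY KEY, node_id TEXT)"),
]

def migrate(conn):
    """Apply pending migrations; idempotent, so safe to run repeatedly."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations"
        " (version TEXT PRIMARY KEY, dirty INTEGER NOT NULL DEFAULT 0)"
    )
    applied = {v for (v,) in conn.execute("SELECT version FROM schema_migrations")}
    ran = 0
    for version, sql in MIGRATIONS:
        if version in applied:
            continue  # already recorded; skipping is what makes reruns no-ops
        try:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_migrations (version) VALUES (?)",
                         (version,))
            conn.commit()
            ran += 1
        except sqlite3.Error:
            # Mark the failed version dirty, then exit non-zero so the
            # deployment pipeline aborts (see Dirty Migration Recovery).
            conn.execute(
                "INSERT OR REPLACE INTO schema_migrations (version, dirty)"
                " VALUES (?, 1)", (version,))
            conn.commit()
            sys.exit(1)
    return ran

conn = sqlite3.connect(":memory:")
print(migrate(conn))  # first run applies both migrations
print(migrate(conn))  # second run is a no-op
```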
Production Validation
`ValidateProductionReadiness()` checks when `ENVIRONMENT=production`:
- `NATS_ACCOUNTS_ENABLED` must be `true` when NATS is enabled
- `VAULT_TLS_SKIP_VERIFY` must not be `true`
- `DB_SSL_MODE` must not be `disable`
- `PAYMENT_CONSUMER_ENABLED` must be explicitly `true`
- `GRAFANA_ADMIN_PASSWORD` must be set
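A sketch of equivalent validation logic (illustrative Python, not the real implementation; `NATS_ENABLED` is an assumed flag name for "NATS is enabled"):

```python
def validate_production_readiness(env):
    """Return a list of violations when ENVIRONMENT=production (empty = ready)."""
    if env.get("ENVIRONMENT") != "production":
        return []
    errors = []
    # NATS_ENABLED is an assumed flag name, see lead-in.
    if env.get("NATS_ENABLED") == "true" and env.get("NATS_ACCOUNTS_ENABLED") != "true":
        errors.append("NATS_ACCOUNTS_ENABLED must be true when NATS is enabled")
    if env.get("VAULT_TLS_SKIP_VERIFY") == "true":
        errors.append("VAULT_TLS_SKIP_VERIFY must not be true in production")
    if env.get("DB_SSL_MODE") == "disable":
        errors.append("DB_SSL_MODE must not be disable")
    if env.get("PAYMENT_CONSUMER_ENABLED") != "true":
        errors.append("PAYMENT_CONSUMER_ENABLED must be explicitly true")
    if not env.get("GRAFANA_ADMIN_PASSWORD"):
        errors.append("GRAFANA_ADMIN_PASSWORD must be set")
    return errors

print(validate_production_readiness({"ENVIRONMENT": "staging"}))  # []
```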
Rollout Authorization
- Admin API keys provisioned for authorized operators (backend-only, not self-assignable)
- Regular users receive 403 on rollout endpoints (verify with test request)
- `APIKeyScopeAdmin` cannot be created via the API key creation endpoint

The admin scope must be inserted directly into the database or via a backend script; the API key creation endpoint rejects admin scope requests.
Leader Election (Health Evaluator)
- Health evaluator deployed with 2+ instances (one active leader, N-1 standby)
- `health_evaluator_is_leader` metric visible in Prometheus
- Verify only 1 instance runs evaluation cycles (check logs for “Acquired leader lock”)
- Verify standby instances run only outbox worker and metrics ingester
Connection Pool & Statement Timeouts
- Per-service `DB_MAX_CONNS` and `DB_STATEMENT_TIMEOUT` configured (see Environment Variables)
- Total connection budget verified: sum(MaxConns × instances) < PostgreSQL `max_connections`
- `pg_stat_statements` enabled in PostgreSQL config
- `log_min_duration_statement = 500` set in PostgreSQL config
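The budget check is simple arithmetic; a small sketch with purely illustrative per-service numbers:

```python
# Hypothetical per-service pool settings: (DB_MAX_CONNS, instance count).
services = {
    "api-server": (20, 3),
    "orchestrator": (10, 2),
    "health-evaluator": (5, 2),
}
PG_MAX_CONNECTIONS = 200  # from postgresql.conf; keep headroom for maintenance sessions

budget = sum(max_conns * instances for max_conns, instances in services.values())
print(budget)  # 90
assert budget < PG_MAX_CONNECTIONS, "connection budget exceeds PostgreSQL max_connections"
```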
NATS Cluster
- 3-node NATS cluster deployed
- `NATS_STREAM_REPLICAS=3` set for production
- Verify streams created with R=3: `nats stream info HOODCLOUD_EVENTS`
Deployment Order
- Run `cmd/migrate` (pre-deploy step)
- Deploy infrastructure (PostgreSQL HA, Redis HA, NATS 3-node cluster)
- Deploy services (recommended order: 2C → 2E → 2D for incremental protection)
- Verify leader election, connection pools, statement timeouts
Operational
- `LOG_LEVEL` set to `info` or `warn`
- Health evaluation intervals configured
- Subscription cleanup enabled
- NATS event streaming enabled for monitoring
- Observability URLs configured (`LOKI_URL`, `PROMETHEUS_URL`)
- Backup cleanup configured (`AWS_S3_BUCKET` set)
Infrastructure
- IAM policy applied (see `infrastructure/iam/hoodcloud-minimal-policy.json`)
- Terraform state: local backend persistent or S3 backend configured
- If S3 backend: `TERRAFORM_S3_BUCKET`, `TERRAFORM_DYNAMODB_TABLE` set
- Chain profiles available (local dir or S3 bucket)
- Ops agent binaries available for download
- Hetzner Cloud firewall: ports 80, 443, 9090 open
Pre-Deployment File Checks
Verification Commands
Common Deployment Issues
| Issue | Symptom | Fix |
|---|---|---|
| Empty OPS_AGENT_VERSION | Double dash in download URL (`ops-agent--linux-amd64`) | Set `OPS_AGENT_VERSION=0.1.0` |
| Missing binaries dir | `/downloads/` returns 404 | Set `OPS_AGENT_BINARIES_DIR` and verify binary exists |
| runtime.yaml not found | ops-agent “load runtime config” error | Verify `CONFIG_DIR` and config bundle download |
| Port 9090 blocked | Nodes can’t connect to gRPC | Add firewall rule for TCP 9090 |
| Vault auth failure | “authenticate with Vault” error | Check `VAULT_ROLE_ID`, `VAULT_SECRET_ID_PATH`, unseal state |
| Vault secrets missing | “get secret from Vault” error | Verify secrets: `vault kv get secret/hoodcloud/app-credentials` |