Last verified: 2026-02-14
Overview
NATS uses JWT operator mode for authentication and authorization. This replaces single-token auth with a cryptographic trust hierarchy in which the NATS server verifies user JWTs against pre-loaded account JWTs, with no round-trips to an external auth service.
Why JWT operator mode:
- Per-node cryptographic identity (each ops-agent has its own NKey)
- Scoped permissions per account (agents can only publish metrics, not consume)
- Cross-account stream isolation (agent metrics flow to control plane via explicit export/import)
- No shared secrets — compromising one node’s credentials doesn’t affect others
Trust Hierarchy
The NATS server is configured with the account JWTs at startup (resolver_preload). When a client connects with a user JWT, the server:
- Extracts IssuerAccount from the user JWT
- Finds the matching account JWT in the preload
- Verifies the user JWT signature against the account’s signing key
- Applies permissions from the user JWT claims
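The four steps above can be sketched with a toy model. This is not the NATS server's implementation or the real JWT encoding: the `userToken` and `account` types, the `signedBytes` layout, and the `demo` helper are all hypothetical, and plain Ed25519 from the standard library stands in for NKeys. It only illustrates why a token whose claims are altered after signing is rejected.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"errors"
	"fmt"
	"strings"
)

// userToken is a simplified stand-in for a user JWT (not the real encoding).
type userToken struct {
	IssuerAccount string   // account this user claims to belong to
	PubAllow      []string // publish permissions carried in the claims
	Signature     []byte   // signature over the two fields above
}

// account is a simplified stand-in for a preloaded account JWT.
type account struct {
	SigningKeys []ed25519.PublicKey
}

// signedBytes is the byte string covered by the signature in this toy model.
func signedBytes(issuer string, pubAllow []string) []byte {
	return []byte(issuer + "|" + strings.Join(pubAllow, ","))
}

// signToken plays the role of the component that signs user JWTs.
func signToken(priv ed25519.PrivateKey, issuer string, pubAllow []string) userToken {
	sig := ed25519.Sign(priv, signedBytes(issuer, pubAllow))
	return userToken{IssuerAccount: issuer, PubAllow: pubAllow, Signature: sig}
}

// verify mirrors the server-side steps and returns the granted permissions.
func verify(preload map[string]account, tok userToken) ([]string, error) {
	// Extract the issuer and find the matching preloaded account.
	acct, ok := preload[tok.IssuerAccount]
	if !ok {
		return nil, errors.New("no preloaded account for issuer")
	}
	// Check the signature against each of the account's signing keys.
	msg := signedBytes(tok.IssuerAccount, tok.PubAllow)
	for _, key := range acct.SigningKeys {
		if ed25519.Verify(key, msg, tok.Signature) {
			return tok.PubAllow, nil // permissions come from the token claims
		}
	}
	return nil, errors.New("signature does not match any account signing key")
}

// demo verifies a valid token, then a copy with widened permissions.
func demo() (validErr, tamperedErr error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return err, nil
	}
	preload := map[string]account{"AGENT": {SigningKeys: []ed25519.PublicKey{pub}}}
	tok := signToken(priv, "AGENT", []string{"metrics.>"})
	_, validErr = verify(preload, tok)
	tok.PubAllow = []string{">"} // tamper without re-signing
	_, tamperedErr = verify(preload, tok)
	return validErr, tamperedErr
}

func main() {
	validErr, tamperedErr := demo()
	fmt.Println("valid token error:", validErr)
	fmt.Println("tampered token error:", tamperedErr)
}
```

Because the check needs only the preloaded account data and the token itself, no call to an external auth service is involved.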
Account Architecture
AGENT Account
Purpose: Ops-agents on node VMs connect under this account.
| Property | Value |
|---|---|
| Name | AGENT |
| Pub permissions | metrics.> |
| Sub permissions | _INBOX.> |
| JetStream | None (no disk/memory allocation) |
| Exports | metrics.> (Stream type) — consumed by CONTROL_PLANE |
| User JWT TTL | 14 days |
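To make the permission patterns in the table concrete, the sketch below applies the standard NATS wildcard rules (`*` matches exactly one token, `>` matches one or more trailing tokens) to a few subjects. The `subjectMatches` helper is hypothetical and written only for illustration; the server performs its own matching, and the example subjects such as `metrics.node42.cpu` are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectMatches reports whether a concrete subject is covered by a
// permission pattern. "*" matches exactly one token; ">" must be the last
// pattern token and matches one or more remaining tokens.
func subjectMatches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			return i == len(p)-1 && len(s) > i
		}
		if i >= len(s) {
			return false
		}
		if tok != "*" && tok != s[i] {
			return false
		}
	}
	return len(p) == len(s)
}

func main() {
	fmt.Println(subjectMatches("metrics.>", "metrics.node42.cpu")) // true
	fmt.Println(subjectMatches("metrics.>", "metrics"))            // false
	fmt.Println(subjectMatches("metrics.>", "events.node42"))      // false
	fmt.Println(subjectMatches("_INBOX.>", "_INBOX.abc.1"))        // true
}
```

Under these rules an agent may publish to anything below `metrics.` but not, for example, to an events subject.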
CONTROL_PLANE Account
Purpose: Internal services (agent-gateway, health-evaluator, API server, payment service) connect under this account.
| Property | Value |
|---|---|
| Name | CONTROL_PLANE |
| Pub permissions | > (full) |
| Sub permissions | > (full) |
| JetStream | Unlimited disk/memory, 10 streams, 50 consumers |
| Imports | agent-metrics from AGENT account (metrics.>) |
| User JWT TTL | None (no expiry for internal services) |
The CONTROL_PLANE account owns the HOODCLOUD_METRICS JetStream stream. Metrics published by agents in the AGENT account flow into this stream via the cross-account import.
NATS clustering: In production, NATS runs as a 3-node cluster with R=3 (3 replicas) for all JetStream streams (HOODCLOUD_EVENTS, HOODCLOUD_METRICS, PAYMENTS). Replicas are configurable via stream config (dev=1, prod=3). Raft consensus ensures no message loss on single-node failure.
SYS Account
Purpose: NATS system monitoring ($SYS subject hierarchy).
| Property | Value |
|---|---|
| Name | SYS |
| Designated as | system_account in nats.conf |
Agent Credential Flow
First Boot: NKey Generation
When the ops-agent starts for the first time, it generates an Ed25519 NKey pair (internal/opsagent/natscreds/manager.go).
The seed file persists across restarts. The public key (prefix U) identifies this agent to NATS.
JWT Request via gRPC
The agent requests a signed user JWT from the agent-gateway.
Proto: proto/agent.proto, with GetNATSCredentialsRequest (fields: node_id, node_token, nats_public_key) and GetNATSCredentialsResponse (fields: nats_url, user_jwt, account, expires_at)
Signer: internal/natsjwt/signer.go — Signer.SignUserJWT(agentPubKey, nodeID) returns (jwt, expiry, error)
NATS Connection
The agent connects using the nats.UserJWT callback pattern, implemented by UserJWTHandler() and SignHandler() in internal/opsagent/natscreds/manager.go.
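A minimal sketch of the callback pattern, using only the standard library so it runs on its own. The `Manager` type, `NewManager`, and `SetJWT` here are assumptions for illustration and are not the contents of manager.go; plain Ed25519 stands in for the NKey pair. The two handler signatures match what `nats.UserJWT` accepts in the nats.go client: `func() (string, error)` and `func([]byte) ([]byte, error)`.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"errors"
	"fmt"
	"sync"
)

// Manager holds the current user JWT and the private key used to sign
// server nonces. Hypothetical type for illustration.
type Manager struct {
	mu   sync.RWMutex
	jwt  string
	priv ed25519.PrivateKey
}

// NewManager generates a fresh Ed25519 key (a stand-in for the NKey pair).
func NewManager() (*Manager, error) {
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return nil, err
	}
	return &Manager{priv: priv}, nil
}

// SetJWT swaps in a renewed JWT; the next connect or reconnect picks it up.
func (m *Manager) SetJWT(jwt string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.jwt = jwt
}

// UserJWTHandler is called by the client library on every (re)connect.
func (m *Manager) UserJWTHandler() (string, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	if m.jwt == "" {
		return "", errors.New("no user JWT available")
	}
	return m.jwt, nil
}

// SignHandler signs the nonce the server sends during the handshake.
func (m *Manager) SignHandler(nonce []byte) ([]byte, error) {
	return ed25519.Sign(m.priv, nonce), nil
}

func main() {
	m, err := NewManager()
	if err != nil {
		panic(err)
	}
	m.SetJWT("jwt-v1")
	first, _ := m.UserJWTHandler()
	m.SetJWT("jwt-v2") // renewal happened while connected
	second, _ := m.UserJWTHandler()
	fmt.Println(first, second) // jwt-v1 jwt-v2

	// With the real client the handlers would be wired up roughly as:
	//   nats.Connect(url, nats.UserJWT(m.UserJWTHandler, m.SignHandler),
	//       nats.MaxReconnects(-1))
}
```

Returning the JWT from a callback instead of passing a fixed string is what lets a renewed JWT take effect on the next reconnect without rebuilding the connection.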
Credential Persistence
After receiving a JWT, the agent writes a combined creds file. On restart, it parses that file with jwt.ParseDecoratedJWT() and extracts the expiry from the JWT claims. If the JWT is still valid, it connects immediately without requesting a new one.
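The startup validity check can be sketched with the standard library alone. This is an illustration under assumptions, not the agent's code: `jwtExpiry`, `stillValid`, and `fakeToken` are hypothetical helpers, the token is assumed to be already extracted from the creds file, and the signature is not verified here (the server does that at connect time).

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// jwtExpiry decodes the claims segment of a JWT and returns its "exp"
// value in Unix seconds. It does not verify the signature.
func jwtExpiry(token string) (int64, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return 0, errors.New("malformed JWT: expected 3 segments")
	}
	raw, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return 0, fmt.Errorf("decode claims: %w", err)
	}
	var claims struct {
		Exp int64 `json:"exp"`
	}
	if err := json.Unmarshal(raw, &claims); err != nil {
		return 0, fmt.Errorf("parse claims: %w", err)
	}
	if claims.Exp == 0 {
		return 0, errors.New("JWT carries no expiry")
	}
	return claims.Exp, nil
}

// stillValid reports whether the token can be reused at time nowUnix.
func stillValid(token string, nowUnix int64) bool {
	exp, err := jwtExpiry(token)
	return err == nil && nowUnix < exp
}

// fakeToken builds an unsigned token with the given expiry, for the demo only.
func fakeToken(exp int64) string {
	enc := base64.RawURLEncoding.EncodeToString
	claims, _ := json.Marshal(map[string]int64{"exp": exp})
	return enc([]byte(`{"typ":"JWT"}`)) + "." + enc(claims) + ".sig"
}

func main() {
	fmt.Println(stillValid(fakeToken(2000), 1000)) // true: expiry is in the future
	fmt.Println(stillValid(fakeToken(500), 1000))  // false: already expired
}
```

A token that fails this check would simply trigger a fresh credentials request before connecting.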
Proactive Renewal
A background goroutine, runJWTRenewal() in cmd/ops-agent/main.go, checks hourly whether the JWT needs renewal.
If the control plane is unreachable during renewal, the agent retries every hour. The 2.8-day renewal window (20% of the 14-day TTL) provides substantial buffer for transient outages.
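The arithmetic behind the renewal window is worth spelling out: 20% of a 14-day TTL is 2.8 days, so renewal begins once the JWT is 11.2 days old. The sketch below works in Unix seconds, as JWT claims do; `renewalWindow` and `needsRenewal` are hypothetical names and not the agent's actual functions.

```go
package main

import (
	"fmt"
	"time"
)

const day = int64(24 * 60 * 60) // seconds per day

// renewalWindow returns 20% of the token lifetime, in seconds.
func renewalWindow(issuedAt, expiresAt int64) int64 {
	return (expiresAt - issuedAt) * 20 / 100
}

// needsRenewal reports whether the remaining lifetime has dropped to or
// below the renewal window. All arguments are Unix seconds.
func needsRenewal(issuedAt, expiresAt, now int64) bool {
	return expiresAt-now <= renewalWindow(issuedAt, expiresAt)
}

func main() {
	issued := int64(0)
	expires := issued + 14*day

	window := renewalWindow(issued, expires)
	fmt.Println(time.Duration(window) * time.Second) // 67h12m0s (2.8 days)

	fmt.Println(needsRenewal(issued, expires, issued+11*day)) // false: 3 days left
	fmt.Println(needsRenewal(issued, expires, issued+12*day)) // true: 2 days left
}
```

With an hourly check, the agent gets roughly 67 attempts before the JWT actually expires.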
JWT Expiry Behavior
When a JWT expires, the NATS server disconnects the client. The nats.UserJWT callback pattern handles this: on reconnect, the JWT handler returns the (now-renewed) JWT. If renewal hasn’t happened yet, the connection fails and retries according to MaxReconnects: -1 (unlimited).
Bootstrap Order
Shutdown
manager.Close() calls userKP.Wipe() to zero the NKey from memory.
Internal Service Auth
Internal services (agent-gateway, health-evaluator, API server) authenticate to NATS using self-signed ephemeral JWTs under the CONTROL_PLANE account.
Package: internal/natsjwt/internal.go, which provides CreateInternalCredentials(signingKeyPEM, accountPubKey)
internal/natsjwt/connect.go — ConnectWithJWT(url, jwt, userKP) and NATSAuthOption(jwt, userKP)
Payment Service
The payment service has its own natsjwt package (payment-service/internal/natsjwt/natsjwt.go) following the same pattern: CreateCredentials() generates an ephemeral user JWT using the CTRL account signing seed.
Infrastructure Setup
nats-jwt-setup Tool
Location: cmd/nats-jwt-setup/main.go
One-time setup tool that generates all JWT infrastructure:
- Operator keypair + JWT
- Operator signing keypair
- AGENT account keypair + signing keypair + JWT (with metrics.> export)
- CONTROL_PLANE account keypair + signing keypair + JWT (with metrics import, JetStream limits)
- SYS account keypair + JWT
- nats.conf with resolver_preload
- operator.jwt file
- JWT files written to {output-dir}/
- nats.conf written to {output-dir}/../nats.conf
- Vault storage commands printed to stdout (seeds never written to files)
- All seeds zeroed after use
NATS Server Configuration
File: infrastructure/docker/nats/nats.conf
MEMORY resolver means account JWTs are embedded in the config. No external resolver needed.
Docker Volumes
Vault Secret Structure
All signing seeds are stored in Vault at the application credentials path:
| Vault Field | JSON Key | Purpose |
|---|---|---|
| Operator signing seed | nats_operator_signing_seed | Signs account JWTs (used only by nats-jwt-setup) |
| AGENT account signing seed | nats_agent_account_signing_seed | Signs user JWTs for ops-agents (used by agent-gateway) |
| CTRL account signing seed | nats_ctrl_account_signing_seed | Signs user JWTs for internal services |
| AGENT account public key | nats_agent_account_pub | Embedded in user JWT IssuerAccount claim |
| CTRL account public key | nats_ctrl_account_pub | Embedded in user JWT IssuerAccount claim |
internal/secrets/credentials.go — NATSOperatorSigningSeed(), NATSAgentAccountSigningSeed(), NATSCtrlAccountSigningSeed(), NATSAgentAccountPub(), NATSCtrlAccountPub()
Config integration: internal/vault/client.go — GetAppCredentials() calls appCreds.SetNATSJWTCredentials() when operator seed is present in Vault.
Environment Variables
| Variable | Service | Purpose |
|---|---|---|
| NATS_JWT_ENABLED | agent-gateway, API server, health-evaluator | Enable JWT operator mode |
| NATS_JWT_AGENT_ACCOUNT_PUB | agent-gateway | AGENT account public key for JWT signing |
| NATS_JWT_CTRL_ACCOUNT_PUB | agent-gateway, API server, health-evaluator | CTRL account public key |
| NATS_JWT_EXTERNAL_URL | agent-gateway | Public NATS URL for ops-agents (tls://...) |
| NATS_AGENT_SIGNING_SEED | agent-gateway (dev only) | Env fallback for AGENT signing seed |
| NATS_CTRL_SIGNING_SEED | payment-service | CTRL account signing seed |
| NATS_CTRL_ACCOUNT_PUB | payment-service | CTRL account public key |
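As an illustration of how a service might validate two of these variables at startup, the sketch below fails fast when JWT mode is enabled without a usable CTRL account public key. The `jwtConfig` type and `loadJWTConfig` function are hypothetical (the real NATSJWTConfig struct is not shown in this document), and treating only the literal value `true` as enabled is an assumption. NKey account public keys begin with `A`, which gives a cheap sanity check.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// jwtConfig is a hypothetical, trimmed-down view of the JWT settings.
type jwtConfig struct {
	Enabled        bool
	CtrlAccountPub string
}

// loadJWTConfig reads the settings through getenv so it can be tested
// without touching the process environment.
func loadJWTConfig(getenv func(string) string) (jwtConfig, error) {
	cfg := jwtConfig{
		Enabled:        getenv("NATS_JWT_ENABLED") == "true", // assumption: only "true" enables
		CtrlAccountPub: getenv("NATS_JWT_CTRL_ACCOUNT_PUB"),
	}
	if !cfg.Enabled {
		return cfg, nil
	}
	if cfg.CtrlAccountPub == "" {
		return cfg, errors.New("NATS_JWT_CTRL_ACCOUNT_PUB is required when NATS_JWT_ENABLED=true")
	}
	if !strings.HasPrefix(cfg.CtrlAccountPub, "A") {
		return cfg, errors.New("NATS_JWT_CTRL_ACCOUNT_PUB does not look like an account public key")
	}
	return cfg, nil
}

func main() {
	cfg, err := loadJWTConfig(os.Getenv)
	fmt.Println("enabled:", cfg.Enabled, "error:", err)
}
```

Passing the lookup function in, instead of calling `os.Getenv` directly, keeps the validation easy to unit test.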
TLS Configuration
Overview
NATS uses a Let’s Encrypt TLS certificate obtained via certbot with DNS-01 challenge (Cloudflare). All clients, both internal services and external ops-agents, trust the certificate via system root CAs. No custom CA distribution is needed.
Key properties:
- Automatic renewal via certbot (every 60–90 days)
- Zero-downtime reload: certbot deploy hook sends SIGHUP to NATS, which reloads TLS config without dropping connections
- No manual certificate transfer between servers
Port Layout
| Port | Purpose | TLS | Audience |
|---|---|---|---|
| 4222 | NATS listener | Yes (Let’s Encrypt) | All clients (internal + external) |
All clients connect using tls:// URLs. There is a single TLS-enabled port for both internal Docker services and external ops-agents.
Certificate Files
| File | Location | Purpose |
|---|---|---|
| fullchain.pem | /opt/hoodcloud/nats-tls/ | Let’s Encrypt certificate chain |
| privkey.pem | /opt/hoodcloud/nats-tls/ | Private key for the certificate |
Certbot stores the originals in /etc/letsencrypt/live/<domain>/. The deploy hook copies them to /opt/hoodcloud/nats-tls/ for NATS to read.
Certificate Issuance and Renewal
Certbot uses the DNS-01 challenge with a Cloudflare API token to prove domain ownership. The token is stored in Vault and retrieved on demand via a wrapper script, so no Cloudflare credentials persist on disk.
- Stored in Vault at secret/infra/certbot/cloudflare (key: api_token)
- Scoped to Zone:DNS:Edit + Zone:Zone:Read (minimum privilege)
- Retrieved at renewal time by the wrapper script via Vault AppRole authentication
- Written to a temporary file (mode 0600), deleted after certbot completes
Vault AppRole (certbot-nats):
- Policy: read-only on secret/data/infra/certbot/cloudflare
- Token TTL: 5 minutes (short-lived, single-use per renewal)
- Credentials: RoleID at /opt/hoodcloud/secrets/certbot-role-id, SecretID at /opt/hoodcloud/secrets/certbot-secret-id (both mode 0400, root only)
Deploy Hook
Location: /opt/hoodcloud/scripts/nats-cert-deploy.sh
The deploy hook:
- Copies renewed certificates to /opt/hoodcloud/nats-tls/
- Sets correct permissions
- Sends SIGHUP to the NATS process, triggering a TLS config reload
- Logs the renewal via syslog (certbot-nats tag)
Client Trust
All clients use system root CAs to verify the NATS server certificate. Let’s Encrypt’s root CA (ISRG Root X1) is included in all modern OS trust stores. No custom CA file, environment variable, or volume mount is needed on any client.
Usage Matrix
| Component | Location | Mount | Env Var | Purpose |
|---|---|---|---|---|
| NATS server | Control plane | /etc/nats/tls/fullchain.pem + privkey.pem | — | TLS listener |
| api-server | Control plane | — | — | System root CAs |
| agent-gateway | Control plane | — | — | System root CAs |
| health-evaluator | Control plane | — | — | System root CAs |
| orchestrator | Control plane | — | — | System root CAs |
| ops-agent | Node VMs | — | — | System root CAs |
| payment-service | Payment host | — | — | System root CAs |
NATS Server TLS Config
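In nats.conf, the TLS settings reference the certificate chain and private key under /etc/nats/tls/ (see the usage matrix).
The certificate files live in /opt/hoodcloud/nats-tls/ on the host. Docker mounts this directory into the NATS container.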
Certificate Monitoring
- x509-certificate-exporter scrapes /opt/hoodcloud/nats-tls/fullchain.pem and exposes expiry as a Prometheus metric
- Alert rules:
  - NATSTLSCertExpiringSoon: warning at 14 days remaining
  - NATSTLSCertExpiryCritical: critical at 3 days remaining
- Certbot renewal logging: Renewals logged via syslog with the certbot-nats tag
Key Rotation
Account Signing Key Rotation
Account signing keys can be rotated without disrupting existing connections:
- Generate new signing keypair
- Add new signing key to account JWT claims (accounts support multiple signing keys)
- Re-encode account JWT with operator signing key
- Update resolver_preload in nats.conf
- Reload NATS server (nats-server --signal reload)
- Update Vault with new signing seed
- Restart agent-gateway to pick up new seed
- After all existing JWTs expire (14 days), remove old signing key
Operator Key Rotation
Operator key rotation requires regenerating all account JWTs and reloading the NATS server config. This is a heavier operation, so plan for maintenance.
User JWT Rotation
User JWTs rotate automatically via the renewal mechanism (20% TTL threshold). No manual intervention needed.
Key Packages
| Package | Purpose |
|---|---|
| internal/natsjwt/signer.go | Signer.SignUserJWT(): signs scoped user JWTs for ops-agents |
| internal/natsjwt/internal.go | CreateInternalCredentials(): ephemeral JWTs for internal services |
| internal/natsjwt/connect.go | ConnectWithJWT(), NATSAuthOption(): NATS connection helpers |
| internal/opsagent/natscreds/manager.go | NKey generation, JWT storage, renewal callbacks |
| internal/grpc/server.go | GetNATSCredentials handler: validates agent, signs JWT |
| cmd/nats-jwt-setup/main.go | One-time infrastructure generation tool |
| internal/configtypes/configtypes.go | NATSJWTConfig struct |
| internal/secrets/credentials.go | Vault accessor methods for NATS seeds/keys |
| payment-service/internal/natsjwt/natsjwt.go | Payment service JWT credentials |
Operational Notes
Lessons from initial JWT operator mode deployment. Follow these to avoid repeat issues on fresh installs.
JetStream Volume Wipe on Mode Switch
Switching from token-based auth to JWT operator mode changes the account hierarchy. Existing JetStream streams created under the old $G (global) account cannot be recovered by the new accounts. Wipe the NATS data volume before starting in JWT mode.
Payment Service Vault Path
The main app reads credentials from secret/hoodcloud/app-credentials. The payment service uses a separate path: secret/payment-service/credentials. When adding NATS JWT seeds to Vault, ensure nats_ctrl_signing_seed is added to both paths if the payment service connects to the same NATS cluster. The nats_ctrl_account_pub (public key) is not a secret; the payment service reads it from the NATS_CTRL_ACCOUNT_PUB environment variable.
Generated Artifacts — Do Not Commit
nats.conf and operator.jwt are generated by cmd/nats-jwt-setup and contain environment-specific account JWTs and public keys. They are gitignored:
- infrastructure/docker/nats/nats.conf (in root .gitignore)
- infrastructure/docker/nats/jwt/* (in jwt/.gitignore)
If git pull overwrites server-generated files, regenerate with nats-jwt-setup or restore from the server backup.
TLS Certificates
See TLS Configuration for Let’s Encrypt setup, deploy hook, usage matrix, and certificate monitoring.
nats-jwt-setup Execution Order
- Run nats-jwt-setup on the server (secrets stay on the server, never transit network)
- Copy the printed Vault commands and apply them to Vault
- Restart services to pick up new credentials from Vault
- Verify NATS connectivity with nats server check connection
The tool prints the Vault kv put commands to stdout; do not discard this output. Seeds are zeroed in memory after printing and are not written to any file.
Troubleshooting
Agent Cannot Connect to NATS
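- Check agent has a seed file: ls -la {dataDir}/nats-user.seed
- Check agent has a creds file: ls -la {dataDir}/nats-user.creds
- Verify the agent-gateway has NATS JWT signing configured: look for "NATS JWT signer initialized" in gateway logs
- Verify the NATS server is running in operator mode: nats.conf should reference the operator JWT and contain a resolver_preload block. Do not use nats-server --signal ldm for this check, because ldm puts the server into lame duck mode (a graceful shutdown) and does not print account information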
JWT Expired / Connection Rejected
- Check agent logs for "NATS JWT nearing expiry, renewing" (renewal should trigger at ~11.2 days)
- If renewal fails, check agent-gateway connectivity
- Verify signing seed in Vault matches the account JWT in nats.conf
Cross-Account Metrics Not Flowing
- Verify AGENT account JWT has export for metrics.> (Stream type)
- Verify CONTROL_PLANE account JWT has matching import
- Check JetStream stream exists: nats stream info HOODCLOUD_METRICS
Internal Service Auth Failure
- Verify NATS_JWT_ENABLED=true and NATS_JWT_CTRL_ACCOUNT_PUB is set
- Verify CTRL signing seed is present in Vault
- Check service logs for "NATS JWT auth configured" at startup
Related Documents
- Workflows — Metrics Flow — Observation system architecture
- Environment Variables — Full env var reference
- Vault — Vault setup and secret structure