Last verified: 2026-02-14

Overview

NATS uses JWT operator mode for authentication and authorization. This replaces single-token auth with a cryptographic trust hierarchy where the NATS server verifies user JWTs against pre-loaded account JWTs, with no round-trips to an external auth service.
Why JWT operator mode:
  • Per-node cryptographic identity (each ops-agent has its own NKey)
  • Scoped permissions per account (agents can only publish metrics, not consume)
  • Cross-account stream isolation (agent metrics flow to control plane via explicit export/import)
  • No shared secrets — compromising one node’s credentials doesn’t affect others

Trust Hierarchy

Operator (root of trust)
├── AGENT Account (ops-agents)
│   └── User JWTs (one per node, signed by AGENT signing key)
├── CONTROL_PLANE Account (internal services)
│   └── User JWTs (one per service instance, self-signed at startup)
└── SYS Account (NATS system monitoring)
Verification flow: The NATS server holds account JWTs in memory (resolver_preload). When a client connects with a user JWT, the server:
  1. Extracts IssuerAccount from the user JWT
  2. Finds the matching account JWT in the preload
  3. Verifies the user JWT signature against the account’s signing key
  4. Applies permissions from the user JWT claims
No external resolver or HTTP call required.
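The verification steps can be sketched in a few lines of Go. This is a simplified model under stated assumptions, not the NATS server's implementation: the account and userToken types, the "AGENT" lookup key, and the demo helper are hypothetical, and the standard library's ed25519 package stands in for NKeys.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"errors"
	"fmt"
)

// account is a simplified stand-in for a preloaded account JWT.
type account struct {
	signingKeys []ed25519.PublicKey
}

// userToken is a simplified stand-in for a user JWT.
type userToken struct {
	issuerAccount string            // which account the user belongs to
	issuer        ed25519.PublicKey // key that signed the token
	claims        []byte            // encoded permissions and expiry
	signature     []byte
}

var (
	errUnknownAccount = errors.New("issuer account is not in the preload")
	errBadSignature   = errors.New("token is not signed by an account signing key")
)

// verify follows steps 1-3 of the flow; step 4 (applying permissions)
// would read tok.claims once verification succeeds.
func verify(preload map[string]account, tok userToken) error {
	acct, ok := preload[tok.issuerAccount] // steps 1 and 2
	if !ok {
		return errUnknownAccount
	}
	for _, key := range acct.signingKeys { // step 3
		if key.Equal(tok.issuer) && ed25519.Verify(key, tok.claims, tok.signature) {
			return nil
		}
	}
	return errBadSignature
}

// demo builds a one-account preload and a token signed by that account.
func demo() (map[string]account, userToken) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	claims := []byte(`{"pub":{"allow":["metrics.>"]},"sub":{"allow":["_INBOX.>"]}}`)
	tok := userToken{
		issuerAccount: "AGENT",
		issuer:        pub,
		claims:        claims,
		signature:     ed25519.Sign(priv, claims),
	}
	return map[string]account{"AGENT": {signingKeys: []ed25519.PublicKey{pub}}}, tok
}

func main() {
	preload, tok := demo()
	fmt.Println("valid token:", verify(preload, tok))
	tok.claims = []byte("tampered")
	fmt.Println("tampered token:", verify(preload, tok))
}
```

Everything verify needs is already in the preload map, which is why no network call appears anywhere in the path.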

Account Architecture

AGENT Account

Purpose: Ops-agents on node VMs connect under this account.
Property | Value
Name | AGENT
Pub permissions | metrics.>
Sub permissions | _INBOX.>
JetStream | None (no disk/memory allocation)
Exports | metrics.> (Stream type), consumed by CONTROL_PLANE
User JWT TTL | 14 days
Agents can only publish metrics and subscribe to reply inboxes. They cannot consume streams, subscribe to other subjects, or access JetStream directly.
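To make the effect of these two permissions concrete, here is a small sketch of NATS-style subject matching. The function names are illustrative, and the matcher assumes well-formed subjects (no empty tokens); the server's own matcher is more thorough.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectMatches reports whether subject is covered by pattern.
// "*" matches exactly one token; ">" matches one or more trailing tokens.
func subjectMatches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			// ">" is only valid as the last token and needs at least one token to match.
			return i == len(p)-1 && len(s) > i
		}
		if i >= len(s) || (tok != "*" && tok != s[i]) {
			return false
		}
	}
	return len(p) == len(s)
}

// The AGENT account's allow lists, as given in the table above.
func agentMayPublish(subject string) bool   { return subjectMatches("metrics.>", subject) }
func agentMaySubscribe(subject string) bool { return subjectMatches("_INBOX.>", subject) }

func main() {
	for _, subj := range []string{"metrics.cpu.node1", "metrics", "events.node.created"} {
		fmt.Printf("publish %-22s allowed=%v\n", subj, agentMayPublish(subj))
	}
	for _, subj := range []string{"_INBOX.abc123", "metrics.cpu.node1"} {
		fmt.Printf("subscribe %-20s allowed=%v\n", subj, agentMaySubscribe(subj))
	}
}
```

Note that metrics.> does not cover the bare subject metrics, because ">" requires at least one further token.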

CONTROL_PLANE Account

Purpose: Internal services (agent-gateway, health-evaluator, API server, payment service) connect under this account.
Property | Value
Name | CONTROL_PLANE
Pub permissions | > (full)
Sub permissions | > (full)
JetStream | Unlimited disk/memory, 10 streams, 50 consumers
Imports | agent-metrics from AGENT account (metrics.>)
User JWT TTL | None (no expiry for internal services)
The CONTROL_PLANE account owns the HOODCLOUD_METRICS JetStream stream. Metrics published by agents in the AGENT account flow into this stream via the cross-account import.
NATS clustering: In production, NATS runs as a 3-node cluster with R=3 (3 replicas) for all JetStream streams (HOODCLOUD_EVENTS, HOODCLOUD_METRICS, PAYMENTS). The replica count is set in the stream config (dev=1, prod=3). With R=3, Raft consensus keeps a quorum of 2 replicas available, so acknowledged messages survive a single-node failure.
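The single-node-failure guarantee follows from majority quorum arithmetic. A minimal sketch (the function names are illustrative):

```go
package main

import "fmt"

// quorum is the number of replicas that must agree on a write.
func quorum(replicas int) int { return replicas/2 + 1 }

// tolerated is how many replicas can fail while a quorum remains.
func tolerated(replicas int) int { return replicas - quorum(replicas) }

func main() {
	for _, r := range []int{1, 3, 5} {
		fmt.Printf("R=%d quorum=%d tolerated failures=%d\n", r, quorum(r), tolerated(r))
	}
}
```

With the dev setting (R=1) no failure is tolerated; with the production setting (R=3) one node can be lost.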

SYS Account

Purpose: NATS system monitoring ($SYS subject hierarchy).
Property | Value
Name | SYS
Designated as | system_account in nats.conf
Standard NATS system account for server monitoring and diagnostics.

Agent Credential Flow

First Boot: NKey Generation

When the ops-agent starts for the first time, it generates an Ed25519 NKey pair:
1. nkeys.CreateUser() → user keypair
2. Write seed to {dataDir}/nats-user.seed (mode 0600)
3. On subsequent boots: load seed from file
Package: internal/opsagent/natscreds/manager.go
The seed file persists across restarts. The public key (prefix U) identifies this agent to NATS.
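The load-or-create behaviour can be sketched with the standard library alone. This is an illustration, not the code in manager.go: randomSeed is a placeholder for nkeys.CreateUser() followed by reading the seed, and loadOrCreateSeed and simulateTwoBoots are hypothetical names.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// loadOrCreateSeed returns the seed stored in dataDir, generating and
// persisting one (mode 0600) when the file does not exist yet.
func loadOrCreateSeed(dataDir string, newSeed func() ([]byte, error)) (seed []byte, created bool, err error) {
	path := filepath.Join(dataDir, "nats-user.seed")
	seed, err = os.ReadFile(path)
	if err == nil {
		return seed, false, nil // subsequent boot: reuse the identity
	}
	if !errors.Is(err, fs.ErrNotExist) {
		return nil, false, err
	}
	if seed, err = newSeed(); err != nil {
		return nil, false, err
	}
	// O_EXCL refuses to overwrite a seed written by a concurrent start.
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o600)
	if err != nil {
		return nil, false, err
	}
	if _, err = f.Write(seed); err != nil {
		f.Close()
		return nil, false, err
	}
	return seed, true, f.Close()
}

// randomSeed is a placeholder generator, not a real NKey seed.
func randomSeed() ([]byte, error) {
	b := make([]byte, 32)
	_, err := rand.Read(b)
	return b, err
}

// simulateTwoBoots runs the logic twice in a fresh temp directory.
func simulateTwoBoots() (firstCreated, secondCreated, sameSeed bool, perm fs.FileMode, err error) {
	dir, err := os.MkdirTemp("", "ops-agent-data")
	if err != nil {
		return
	}
	defer os.RemoveAll(dir)
	first, firstCreated, err := loadOrCreateSeed(dir, randomSeed)
	if err != nil {
		return
	}
	second, secondCreated, err := loadOrCreateSeed(dir, randomSeed)
	if err != nil {
		return
	}
	info, err := os.Stat(filepath.Join(dir, "nats-user.seed"))
	if err != nil {
		return
	}
	return firstCreated, secondCreated, bytes.Equal(first, second), info.Mode().Perm(), nil
}

func main() {
	c1, c2, same, perm, err := simulateTwoBoots()
	fmt.Println(c1, c2, same, perm, err)
}
```

The second call finds the file and returns the same bytes, which is what keeps the agent's public key stable across restarts.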

JWT Request via gRPC

The agent requests a signed user JWT from the agent-gateway:
Agent                          Agent-Gateway
  |                                |
  |-- GetNATSCredentials --------->|
  |   (node_id, node_token,       |
  |    nats_public_key)            |
  |                                |-- Validate node_token
  |                                |-- Validate nats_public_key (IsValidPublicUserKey)
  |                                |-- Sign user JWT with AGENT signing key
  |                                |   (IssuerAccount: AGENT pub key)
  |                                |   (Pub.Allow: "metrics.>")
  |                                |   (Sub.Allow: "_INBOX.>")
  |                                |   (Expires: now + 14 days)
  |<-- (nats_url, user_jwt, ------|
  |     expires_at, account)       |
  |                                |
  |-- nats.Connect(nats_url) ----->| NATS Server
  |   UserJWT callback: return jwt |   verifies JWT against
  |   Sign callback: sign nonce    |   AGENT account JWT
  |<-- Connected ------------------|
Proto: proto/agent.proto, messages GetNATSCredentialsRequest (fields: node_id, node_token, nats_public_key) and GetNATSCredentialsResponse (fields: nats_url, user_jwt, account, expires_at)
Signer: internal/natsjwt/signer.go, Signer.SignUserJWT(agentPubKey, nodeID) returns (jwt, expiry, error)

NATS Connection

The agent connects using the nats.UserJWT callback pattern:
nats.UserJWT(
    manager.UserJWTHandler(),  // returns current JWT string
    manager.SignHandler(),     // signs server nonce with NKey
)
This pattern supports automatic reconnection: NATS calls the JWT handler on every connect and reconnect, so the agent always presents the latest JWT.
Package: internal/opsagent/natscreds/manager.go (UserJWTHandler(), SignHandler())
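One way to see why the callback pattern picks up renewed JWTs is a store whose handlers read the current value at call time. The sketch below is an assumption about the shape, not the contents of manager.go: credStore and newCredStore are hypothetical, ed25519 stands in for the NKey pair, and the two returned closures use the func() (string, error) and func([]byte) ([]byte, error) shapes that nats.UserJWT accepts.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"errors"
	"fmt"
	"sync"
)

// credStore holds the current user JWT and the private key used to sign nonces.
type credStore struct {
	mu   sync.RWMutex
	jwt  string
	priv ed25519.PrivateKey // stand-in for the NKey user keypair
}

func newCredStore() (*credStore, error) {
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return nil, err
	}
	return &credStore{priv: priv}, nil
}

// SetJWT swaps in a renewed JWT; the next connect or reconnect will use it.
func (c *credStore) SetJWT(jwt string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.jwt = jwt
}

// UserJWTHandler returns a callback that reads the JWT when it is invoked,
// not when the connection options were built.
func (c *credStore) UserJWTHandler() func() (string, error) {
	return func() (string, error) {
		c.mu.RLock()
		defer c.mu.RUnlock()
		if c.jwt == "" {
			return "", errors.New("no user JWT available yet")
		}
		return c.jwt, nil
	}
}

// SignHandler returns a callback that signs the server's nonce.
func (c *credStore) SignHandler() func([]byte) ([]byte, error) {
	return func(nonce []byte) ([]byte, error) {
		return ed25519.Sign(c.priv, nonce), nil
	}
}

func main() {
	store, err := newCredStore()
	if err != nil {
		panic(err)
	}
	handler := store.UserJWTHandler()
	store.SetJWT("jwt-v1")
	first, _ := handler()
	store.SetJWT("jwt-v2") // renewal happened between connects
	second, _ := handler()
	fmt.Println(first, second)
}
```

Because the handler is a closure over the store, renewal only has to call SetJWT; no reconnect logic needs to know about it.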

Credential Persistence

After receiving a JWT, the agent writes a combined creds file:
jwt.FormatUserConfig(userJWT, seed) → {dataDir}/nats-user.creds
On restart, the agent loads the existing creds file via jwt.ParseDecoratedJWT() and extracts the expiry from the JWT claims. If the JWT is still valid, it connects immediately without requesting a new one.
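Reading the expiry out of a stored JWT needs only the claims segment. The sketch below assumes the bare JWT string has already been extracted from the creds file and that the expiry is carried in the standard exp claim as Unix seconds; jwtExpiry, usable, and fakeToken are hypothetical helpers, and no signature is checked here because the server does that on connect.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// jwtExpiry decodes the claims segment and returns the "exp" claim.
// ok is false when the token carries no expiry (exp is 0 or absent).
func jwtExpiry(token string) (exp time.Time, ok bool, err error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return time.Time{}, false, errors.New("expected header.claims.signature")
	}
	raw, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return time.Time{}, false, fmt.Errorf("decode claims: %w", err)
	}
	var claims struct {
		Exp int64 `json:"exp"`
	}
	if err := json.Unmarshal(raw, &claims); err != nil {
		return time.Time{}, false, fmt.Errorf("parse claims: %w", err)
	}
	if claims.Exp == 0 {
		return time.Time{}, false, nil
	}
	return time.Unix(claims.Exp, 0), true, nil
}

// usable reports whether a stored JWT can still be presented at time now.
func usable(token string, now time.Time) bool {
	exp, ok, err := jwtExpiry(token)
	if err != nil {
		return false
	}
	return !ok || now.Before(exp)
}

// fakeToken builds an unsigned token with the given exp, for the demo only.
func fakeToken(exp int64) string {
	claims, _ := json.Marshal(map[string]int64{"exp": exp})
	enc := base64.RawURLEncoding.EncodeToString
	return enc([]byte(`{"typ":"JWT"}`)) + "." + enc(claims) + "." + enc([]byte("sig"))
}

func main() {
	tok := fakeToken(time.Now().Add(14 * 24 * time.Hour).Unix())
	exp, ok, err := jwtExpiry(tok)
	fmt.Println(exp, ok, err, usable(tok, time.Now()))
}
```

If usable returns false at startup, the agent would fall through to requesting a fresh JWT instead of connecting with the stored one.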

Proactive Renewal

A background goroutine checks hourly whether the JWT needs renewal:
Renewal trigger: remaining lifetime < 20% of 14-day TTL (~2.8 days)
Check interval: 1 hour
Renewal action: call GetNATSCredentials RPC with current NKey public key
Package: cmd/ops-agent/main.go (runJWTRenewal() goroutine)
If the control plane is unreachable during renewal, the agent retries every hour. The 2.8-day renewal window leaves roughly 67 hourly retries before the JWT expires, which is a substantial buffer for transient outages.
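The renewal trigger is simple duration arithmetic. A minimal sketch, with hypothetical names rather than the ones used in runJWTRenewal():

```go
package main

import (
	"fmt"
	"time"
)

const (
	day        = 24 * time.Hour
	userJWTTTL = 14 * day
)

// renewalWindow is 20% of the TTL: the remaining lifetime below which
// the agent starts asking for a new JWT.
func renewalWindow(ttl time.Duration) time.Duration { return ttl / 5 }

// needsRenewal reports whether a JWT with the given remaining lifetime
// has entered the renewal window.
func needsRenewal(remaining, ttl time.Duration) bool {
	return remaining < renewalWindow(ttl)
}

func main() {
	fmt.Println("renewal window:", renewalWindow(userJWTTTL))
	fmt.Println("renew with 3 days left:", needsRenewal(3*day, userJWTTTL))
	fmt.Println("renew with 2 days left:", needsRenewal(2*day, userJWTTTL))
}
```

For the 14-day TTL the window is 67h12m, i.e. the 2.8 days quoted above, so the first renewal attempt happens about 11.2 days after issuance.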

JWT Expiry Behavior

When a JWT expires, the NATS server disconnects the client. The nats.UserJWT callback pattern handles this: on reconnect, the JWT handler returns the (now-renewed) JWT. If renewal hasn’t happened yet, the connection fails and retries according to MaxReconnects: -1 (unlimited).

Bootstrap Order

register with control plane → fetch DEK → init NKey manager →
fetch JWT (if needed) → init observation runner (with UserJWT callbacks) →
start renewal goroutine

Shutdown

manager.Close() calls userKP.Wipe() to zero the NKey from memory.

Internal Service Auth

Internal services (agent-gateway, health-evaluator, API server) authenticate to NATS using self-signed ephemeral JWTs under the CONTROL_PLANE account.
Package: internal/natsjwt/internal.go (CreateInternalCredentials(signingKeyPEM, accountPubKey))
1. nkeys.CreateUser() → ephemeral user keypair (not persisted)
2. jwt.NewUserClaims(userPub)
   - IssuerAccount: CONTROL_PLANE account pub key
   - Pub.Allow: ">" (full)
   - Sub.Allow: ">" (full)
   - Expires: 0 (no expiry)
3. Encode with CONTROL_PLANE signing key
4. Return (userJWT, userKeyPair)
Each service instance generates fresh credentials at startup. No persistence, no renewal. The signing seed comes from Vault (or env var in dev).
Connection helper: internal/natsjwt/connect.go (ConnectWithJWT(url, jwt, userKP) and NATSAuthOption(jwt, userKP))

Payment Service

The payment service has its own natsjwt package (payment-service/internal/natsjwt/natsjwt.go) following the same pattern — CreateCredentials() generates an ephemeral user JWT using the CTRL account signing seed.

Infrastructure Setup

nats-jwt-setup Tool

Location: cmd/nats-jwt-setup/main.go
One-time setup tool that generates all JWT infrastructure:
go run cmd/nats-jwt-setup/main.go -output-dir infrastructure/docker/nats/jwt
Generates:
  • Operator keypair + JWT
  • Operator signing keypair
  • AGENT account keypair + signing keypair + JWT (with metrics.> export)
  • CONTROL_PLANE account keypair + signing keypair + JWT (with metrics import, JetStream limits)
  • SYS account keypair + JWT
  • nats.conf with resolver_preload
  • operator.jwt file
Output:
  • JWT files written to {output-dir}/
  • nats.conf written to {output-dir}/../nats.conf
  • Vault storage commands printed to stdout (seeds never written to files)
  • All seeds zeroed after use

NATS Server Configuration

File: infrastructure/docker/nats/nats.conf
operator: /etc/nats/jwt/operator.jwt
system_account: <SYS_ACCOUNT_PUB>

resolver: MEMORY
resolver_preload: {
    <AGENT_ACCOUNT_PUB>: <AGENT_JWT>
    <CTRL_ACCOUNT_PUB>: <CTRL_JWT>
    <SYS_ACCOUNT_PUB>: <SYS_JWT>
}

tls {
    cert_file: /etc/nats/tls/fullchain.pem
    key_file: /etc/nats/tls/privkey.pem
    timeout: 5
}
MEMORY resolver means account JWTs are embedded in the config. No external resolver needed.

Docker Volumes

nats:
  image: nats:2.10-alpine
  volumes:
    - ./nats/nats.conf:/etc/nats/nats.conf:ro
    - ./nats/jwt:/etc/nats/jwt:ro          # operator.jwt
    - /opt/hoodcloud/nats-tls:/etc/nats/tls:ro  # TLS certificates (Let's Encrypt)
    - nats_data:/data                       # JetStream storage

Vault Secret Structure

All signing seeds are stored in Vault at the application credentials path:
Vault Field | JSON Key | Purpose
Operator signing seed | nats_operator_signing_seed | Signs account JWTs (used only by nats-jwt-setup)
AGENT account signing seed | nats_agent_account_signing_seed | Signs user JWTs for ops-agents (used by agent-gateway)
CTRL account signing seed | nats_ctrl_account_signing_seed | Signs user JWTs for internal services
AGENT account public key | nats_agent_account_pub | Embedded in user JWT IssuerAccount claim
CTRL account public key | nats_ctrl_account_pub | Embedded in user JWT IssuerAccount claim
Accessor methods: internal/secrets/credentials.go (NATSOperatorSigningSeed(), NATSAgentAccountSigningSeed(), NATSCtrlAccountSigningSeed(), NATSAgentAccountPub(), NATSCtrlAccountPub())
Config integration: internal/vault/client.go, where GetAppCredentials() calls appCreds.SetNATSJWTCredentials() when the operator seed is present in Vault.

Environment Variables

Variable | Service | Purpose
NATS_JWT_ENABLED | agent-gateway, API server, health-evaluator | Enable JWT operator mode
NATS_JWT_AGENT_ACCOUNT_PUB | agent-gateway | AGENT account public key for JWT signing
NATS_JWT_CTRL_ACCOUNT_PUB | agent-gateway, API server, health-evaluator | CTRL account public key
NATS_JWT_EXTERNAL_URL | agent-gateway | Public NATS URL for ops-agents (tls://...)
NATS_AGENT_SIGNING_SEED | agent-gateway (dev only) | Env fallback for AGENT signing seed
NATS_CTRL_SIGNING_SEED | payment-service | CTRL account signing seed
NATS_CTRL_ACCOUNT_PUB | payment-service | CTRL account public key
Seeds should come from Vault in production. Environment variable fallbacks exist only for local development.
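The Vault-first, env-in-dev-only rule could be expressed as a small precedence function. This is purely illustrative: resolveSeed, its parameters, and the devMode flag are assumptions, not names from the codebase.

```go
package main

import (
	"errors"
	"fmt"
)

var errNoSeed = errors.New("no signing seed configured")

// resolveSeed picks the signing seed. A Vault value always wins; the
// environment fallback is honoured only when devMode is set.
func resolveSeed(vaultSeed, envSeed string, devMode bool) (seed, source string, err error) {
	switch {
	case vaultSeed != "":
		return vaultSeed, "vault", nil
	case devMode && envSeed != "":
		return envSeed, "env", nil
	default:
		return "", "", errNoSeed
	}
}

func main() {
	// Placeholder strings only; real seeds should never be printed.
	_, src, err := resolveSeed("", "seed-from-env", true)
	fmt.Println("dev without Vault:", src, err)
	_, src, err = resolveSeed("", "seed-from-env", false)
	fmt.Println("prod without Vault:", src, err)
}
```

Failing closed in production makes a missing Vault entry visible at startup instead of silently using an environment value.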

TLS Configuration

Overview

NATS uses a Let’s Encrypt TLS certificate obtained via certbot with DNS-01 challenge (Cloudflare). All clients, both internal services and external ops-agents, trust the certificate via system root CAs. No custom CA distribution is needed.
Key properties:
  • Automatic renewal via certbot (every 60–90 days)
  • Zero-downtime reload: certbot deploy hook sends SIGHUP to NATS, which reloads TLS config without dropping connections
  • No manual certificate transfer between servers

Port Layout

Port | Purpose | TLS | Audience
4222 | NATS listener | Yes (Let’s Encrypt) | All clients (internal + external)
All connections use tls:// URLs. There is a single TLS-enabled port for both internal Docker services and external ops-agents.

Certificate Files

File | Location | Purpose
fullchain.pem | /opt/hoodcloud/nats-tls/ | Let’s Encrypt certificate chain
privkey.pem | /opt/hoodcloud/nats-tls/ | Private key for the certificate
Certbot writes certificates to /etc/letsencrypt/live/<domain>/. The deploy hook copies them to /opt/hoodcloud/nats-tls/ for NATS to read.

Certificate Issuance and Renewal

Certbot uses the DNS-01 challenge with a Cloudflare API token to prove domain ownership. The token is stored in Vault and retrieved on-demand via a wrapper script — no persistent Cloudflare credentials on disk.
# Issue or renew via the wrapper script (authenticates to Vault via AppRole)
/opt/hoodcloud/scripts/certbot-renew.sh
Cloudflare API token:
  • Stored in Vault at secret/infra/certbot/cloudflare (key: api_token)
  • Scoped to Zone:DNS:Edit + Zone:Zone:Read (minimum privilege)
  • Retrieved at renewal time by the wrapper script via Vault AppRole authentication
  • Written to a temporary file (mode 0600), deleted after certbot completes
Vault AppRole (certbot-nats):
  • Policy: read-only on secret/data/infra/certbot/cloudflare
  • Token TTL: 5 minutes (short-lived, single-use per renewal)
  • Credentials: RoleID at /opt/hoodcloud/secrets/certbot-role-id, SecretID at /opt/hoodcloud/secrets/certbot-secret-id (both mode 0400, root only)
Renewal: Certbot auto-renews certificates before expiry (systemd timer or cron). The wrapper script handles the full lifecycle: AppRole login, token retrieval, certbot execution, credential cleanup, and Vault token revocation.
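The wrapper itself is a shell script, but the "write the token to a private temp file, use it, always delete it" step is easy to get wrong, so here is the pattern sketched in Go. withTempSecret and fileExists are hypothetical helpers; the sketch relies on os.CreateTemp creating files with mode 0600.

```go
package main

import (
	"fmt"
	"os"
)

// withTempSecret writes secret to a private temp file, passes the path to fn,
// and removes the file afterwards, whether or not fn succeeds.
func withTempSecret(secret []byte, fn func(path string) error) (err error) {
	f, err := os.CreateTemp("", "cf-token-*") // created with mode 0600
	if err != nil {
		return err
	}
	path := f.Name()
	defer func() {
		// Report a cleanup failure only if nothing else went wrong.
		if rmErr := os.Remove(path); err == nil {
			err = rmErr
		}
	}()
	if _, err = f.Write(secret); err != nil {
		f.Close()
		return err
	}
	if err = f.Close(); err != nil {
		return err
	}
	return fn(path)
}

// fileExists reports whether path can be stat'ed.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	var used string
	err := withTempSecret([]byte("example-token"), func(path string) error {
		used = path
		data, readErr := os.ReadFile(path) // stand-in for running certbot
		fmt.Printf("during: %d bytes available\n", len(data))
		return readErr
	})
	fmt.Println("after: err =", err, "file still exists =", fileExists(used))
}
```

Putting the removal in a defer is what guarantees the token does not outlive a failed renewal run.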

Deploy Hook

Location: /opt/hoodcloud/scripts/nats-cert-deploy.sh
The deploy hook:
  1. Copies renewed certificates to /opt/hoodcloud/nats-tls/
  2. Sets correct permissions
  3. Sends SIGHUP to the NATS process, triggering a TLS config reload
  4. Logs the renewal via syslog (certbot-nats tag)
NATS reloads TLS configuration on SIGHUP without dropping existing connections. Established connections keep their current TLS session; new connections and reconnects are served the renewed certificate.

Client Trust

All clients use system root CAs to verify the NATS server certificate. Let’s Encrypt’s root CA (ISRG Root X1) is included in all modern OS trust stores. No custom CA file, environment variable, or volume mount is needed on any client.

Usage Matrix

Component | Location | Mount | Env Var | Purpose
NATS server | Control plane | /etc/nats/tls/ (fullchain.pem + privkey.pem) | none | TLS listener
api-server | Control plane | none | none | System root CAs
agent-gateway | Control plane | none | none | System root CAs
health-evaluator | Control plane | none | none | System root CAs
orchestrator | Control plane | none | none | System root CAs
ops-agent | Node VMs | none | none | System root CAs
payment-service | Payment host | none | none | System root CAs

NATS Server TLS Config

In nats.conf:
tls {
    cert_file: /etc/nats/tls/fullchain.pem
    key_file: /etc/nats/tls/privkey.pem
    timeout: 5
}
The deploy hook copies certificates to /opt/hoodcloud/nats-tls/ on the host. Docker mounts this directory into the NATS container:
volumes:
  - /opt/hoodcloud/nats-tls:/etc/nats/tls:ro

Certificate Monitoring

  • x509-certificate-exporter scrapes /opt/hoodcloud/nats-tls/fullchain.pem and exposes expiry as a Prometheus metric
  • Alert rules:
    • NATSTLSCertExpiringSoon — warning at 14 days remaining
    • NATSTLSCertExpiryCritical — critical at 3 days remaining
  • Certbot renewal logging: Renewals logged via syslog with the certbot-nats tag
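The two alert thresholds amount to a classification of the certificate's remaining lifetime. The sketch below shows how the expiry could be read from a PEM bundle and mapped to a severity; it illustrates the logic and is not the exporter's or the alert rules' actual implementation. Whether the boundaries are inclusive is an assumption.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
	"time"
)

const (
	day        = 24 * time.Hour
	warnAt     = 14 * day // NATSTLSCertExpiringSoon
	criticalAt = 3 * day  // NATSTLSCertExpiryCritical
)

// leafNotAfter returns the expiry of the first certificate in a PEM bundle;
// in fullchain.pem that is the server (leaf) certificate.
func leafNotAfter(pemBytes []byte) (time.Time, error) {
	block, _ := pem.Decode(pemBytes)
	if block == nil || block.Type != "CERTIFICATE" {
		return time.Time{}, errors.New("no certificate found in PEM data")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return time.Time{}, err
	}
	return cert.NotAfter, nil
}

// classify maps remaining lifetime to an alert severity.
func classify(remaining time.Duration) string {
	switch {
	case remaining <= criticalAt:
		return "critical"
	case remaining <= warnAt:
		return "warning"
	default:
		return "ok"
	}
}

func main() {
	for _, d := range []time.Duration{30 * day, 10 * day, 2 * day} {
		fmt.Printf("%v remaining: %s\n", d, classify(d))
	}
}
```

In practice the remaining lifetime would be leafNotAfter(...) minus the current time, computed from /opt/hoodcloud/nats-tls/fullchain.pem.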

Key Rotation

Account Signing Key Rotation

Account signing keys can be rotated without disrupting existing connections:
  1. Generate new signing keypair
  2. Add new signing key to account JWT claims (accounts support multiple signing keys)
  3. Re-encode account JWT with operator signing key
  4. Update resolver_preload in nats.conf
  5. Reload NATS server (nats-server --signal reload)
  6. Update Vault with new signing seed
  7. Restart agent-gateway to pick up new seed
  8. After all existing JWTs expire (14 days), remove old signing key
Existing user JWTs signed with the old key remain valid until expiry because the account JWT still contains the old signing key’s public half during the transition.
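The transition period can be modelled with a set of accepted signing keys. This is a toy model with hypothetical types and key names ("OLD", "NEW"), using Unix seconds for time; it only illustrates why step 8 has to wait for the last old-key JWT to expire.

```go
package main

import "fmt"

// userJWT records which signing key issued the token and when it expires.
type userJWT struct {
	signedBy  string
	expiresAt int64 // unix seconds
}

// accountJWT lists the public halves of the currently accepted signing keys.
type accountJWT struct {
	signingKeys map[string]bool
}

// accepts reports whether the account would still honour the token at time now.
func (a accountJWT) accepts(t userJWT, now int64) bool {
	return a.signingKeys[t.signedBy] && now < t.expiresAt
}

// removable reports whether key can be dropped without rejecting any
// unexpired token in outstanding.
func removable(key string, outstanding []userJWT, now int64) bool {
	for _, t := range outstanding {
		if t.signedBy == key && now < t.expiresAt {
			return false
		}
	}
	return true
}

func main() {
	const day = int64(86400)
	old := userJWT{signedBy: "OLD", expiresAt: 14 * day} // issued at day 0
	transition := accountJWT{signingKeys: map[string]bool{"OLD": true, "NEW": true}}
	afterRemoval := accountJWT{signingKeys: map[string]bool{"NEW": true}}

	fmt.Println("day 5, both keys listed:", transition.accepts(old, 5*day))
	fmt.Println("day 5, old key removed early:", afterRemoval.accepts(old, 5*day))
	fmt.Println("old key removable on day 5:", removable("OLD", []userJWT{old}, 5*day))
	fmt.Println("old key removable on day 14:", removable("OLD", []userJWT{old}, 14*day))
}
```

Removing the old key before day 14 would reject agents that still hold valid JWTs, which is the disruption the eight-step procedure avoids.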

Operator Key Rotation

Operator key rotation requires regenerating all account JWTs and reloading the NATS server config. This is a heavier operation — plan for maintenance.

User JWT Rotation

User JWTs rotate automatically via the renewal mechanism (20% TTL threshold). No manual intervention needed.

Key Packages

Package | Purpose
internal/natsjwt/signer.go | Signer.SignUserJWT(): signs scoped user JWTs for ops-agents
internal/natsjwt/internal.go | CreateInternalCredentials(): ephemeral JWTs for internal services
internal/natsjwt/connect.go | ConnectWithJWT(), NATSAuthOption(): NATS connection helpers
internal/opsagent/natscreds/manager.go | NKey generation, JWT storage, renewal callbacks
internal/grpc/server.go | GetNATSCredentials handler: validates agent, signs JWT
cmd/nats-jwt-setup/main.go | One-time infrastructure generation tool
internal/configtypes/configtypes.go | NATSJWTConfig struct
internal/secrets/credentials.go | Vault accessor methods for NATS seeds/keys
payment-service/internal/natsjwt/natsjwt.go | Payment service JWT credentials

Operational Notes

Lessons from initial JWT operator mode deployment. Follow these to avoid repeat issues on fresh installs.

JetStream Volume Wipe on Mode Switch

Switching from token-based auth to JWT operator mode changes the account hierarchy. Existing JetStream streams created under the old $G (global) account cannot be recovered by the new accounts. Wipe the NATS data volume before starting in JWT mode:
docker compose down
docker volume rm hoodcloud_nats_data
docker compose up -d nats
This only applies during the one-time migration from token auth to JWT mode.

Payment Service Vault Path

The main app reads credentials from secret/hoodcloud/app-credentials. The payment service uses a separate path: secret/payment-service/credentials. When adding NATS JWT seeds to Vault, ensure nats_ctrl_signing_seed is added to both paths if the payment service connects to the same NATS cluster. The nats_ctrl_account_pub (public key) is not a secret — the payment service reads it from the NATS_CTRL_ACCOUNT_PUB environment variable.

Generated Artifacts — Do Not Commit

nats.conf and operator.jwt are generated by cmd/nats-jwt-setup and contain environment-specific account JWTs and public keys. They are gitignored:
  • infrastructure/docker/nats/nats.conf — in root .gitignore
  • infrastructure/docker/nats/jwt/* — in jwt/.gitignore
If git pull overwrites server-generated files, regenerate with nats-jwt-setup or restore from the server backup.

TLS Certificates

See TLS Configuration for Let’s Encrypt setup, deploy hook, usage matrix, and certificate monitoring.

nats-jwt-setup Execution Order

  1. Run nats-jwt-setup on the server (secrets stay on the server, never transit network)
  2. Copy the printed Vault commands and apply them to Vault
  3. Restart services to pick up new credentials from Vault
  4. Verify NATS connectivity with nats server check connection
The tool outputs Vault kv put commands to stdout — do not discard this output. Seeds are zeroed in memory after printing and are not written to any file.

Troubleshooting

Agent Cannot Connect to NATS

  1. Check agent has a seed file: ls -la {dataDir}/nats-user.seed
  2. Check agent has a creds file: ls -la {dataDir}/nats-user.creds
  3. Verify the agent-gateway has NATS JWT signing configured: look for "NATS JWT signer initialized" in gateway logs
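  4. Verify the NATS server is running in operator mode: confirm that nats.conf sets operator and resolver_preload, then list the loaded accounts with nats server report accounts (requires SYS account credentials). Do not use nats-server --signal ldm for this check: it puts the server into lame duck mode and shuts it down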

JWT Expired / Connection Rejected

  1. Check agent logs for "NATS JWT nearing expiry, renewing". Renewal should trigger about 11.2 days after issuance, when 2.8 days of the 14-day TTL remain
  2. If renewal fails, check agent-gateway connectivity
  3. Verify signing seed in Vault matches the account JWT in nats.conf

Cross-Account Metrics Not Flowing

  1. Verify AGENT account JWT has export for metrics.> (Stream type)
  2. Verify CONTROL_PLANE account JWT has matching import
  3. Check JetStream stream exists: nats stream info HOODCLOUD_METRICS

Internal Service Auth Failure

  1. Verify NATS_JWT_ENABLED=true and NATS_JWT_CTRL_ACCOUNT_PUB is set
  2. Verify CTRL signing seed is present in Vault
  3. Check service logs for "NATS JWT auth configured" at startup