Last verified: 2026-02-13 | Commit scope: bc0fb41
HTTP Request Flow
Auth middleware: `internal/api/middleware.go`. Both JWT and API key requests are subject to scope enforcement via `RequireScope()`. JWT users without the required scope receive 403 Forbidden. The `admin` scope is backend-granted only — users cannot self-assign it via the API key creation endpoint. Rollout endpoints require `admin` scope.
See also: CLAUDE.md — Server Separation for endpoint lists and auth details.
Subscription Lifecycle
Subscriptions follow a two-phase creation flow: the API creates with `pending_payment`, and the payment consumer activates.
See also: Domain Model — Subscription State Machine for the canonical state machine and status definitions.
Creation:
POST /api/v1/subscriptions (internal/api/handler_subscription.go)
- Check for existing `pending_payment` matching (user, chain, nodeType, duration) — return if found (idempotent)
- Validate chain profile and node type
- Resolve `provider`, `region`, `instance_type` from chain profile + duration mapping
- Create subscription with status `pending_payment`, ExpiresAt placeholder, PaymentID null
- Return HTTP 201 (created) or HTTP 200 (existing returned)
Activation: `internal/consumers/payment_consumer.go` — see NATS Consumers below.
Temporal Workflow Execution
When telemetry is enabled, activities are instrumented via `TracingInterceptor` — each activity creates an OpenTelemetry span.
Provision Node Workflow
File: `internal/workflows/provision.go`
Steps marked with ? are conditional based on provisioning input specs.
Compensation (on any failure): compensate() runs DestroyNode (Terraform destroy + key deletion) and UpdateNodeState -> failed.
Key Activity Details
| Activity | File | Purpose |
|---|---|---|
| ValidateProvisionInputs | internal/activities/provision.go | Checks all required pre_provision inputs are satisfied |
| GenerateNodeKeys | internal/activities/keys.go | Creates DEK + chain keys (only source=generated), encrypts, stores in node_keys, backs up to S3 |
| PrepareUserProvidedKeys | internal/activities/provision.go | Decrypts sealed box ciphertext -> re-encrypts with node DEK (conditional) |
| RunTerraform | internal/activities/provision.go | Provisions VM via Terraform module registry |
| WaitForAgentReady | internal/activities/agent.go | Polls agent_registrations for node_id (5-min timeout) |
| SendConfigureCommand | internal/activities/provision.go | Pushes config command to Redis queue, waits for ACK |
| WaitForCommandCompletion | internal/activities/provision.go | Liveness-based waiting with progress monitoring |
| CaptureChainGeneratedKeys | internal/activities/provision.go | Sends CAPTURE_KEY to agent, stores encrypted key (conditional) |
| InjectKeys | internal/activities/keys.go | Delivers encrypted keys to agent via gRPC |
| StartNode | internal/activities/agent.go | Sends START command, agent runs systemctl start |
| MarkInputsApplied | internal/activities/provision.go | Updates input statuses to applied |
Migrate Node Workflow
File: `internal/workflows/migrate.go`
Compensation (`compensateMigration`): runs only if new infrastructure exists — destroys the new host and restores the node to `degraded` state. The old host is preserved if failure occurs before the new infra is ready.
Terminate Node Workflow
File: `internal/workflows/terminate.go`
Step failures are collected in `PartialFailures[]`; only the final state update is critical.
Command Execution and Progress Monitoring
Async Command Flow
Commands run asynchronously to avoid blocking heartbeats during long operations (e.g., 130GB snapshot downloads).
- Orchestrator pushes command to Redis queue via `SendConfigureCommand`
- Agent receives command in heartbeat response, sends `command_acknowledged` event
- Agent adds command ID to `running_command_ids` in subsequent heartbeats
- Agent executes command, reports progress via `command_progress` events
- Agent sends `command_completed`/`command_failed`, removes from `running_command_ids`
Liveness-Based Waiting
WaitForCommandCompletion (internal/activities/provision.go) uses liveness checks instead of fixed timeouts:
- Verify command ID in agent's `running_command_ids` (retry: 3 attempts, 100ms backoff)
- Read progress from Redis `progress:{commandID}` for Temporal heartbeats
- Detect stalls: `StallCount >= StallMaxThreshold` (20) -> fail
- On completion event -> return. On liveness failure -> grace window -> fail.
Stall Detection
| Side | Threshold | Action |
|---|---|---|
| Agent | 5 consecutive no-progress checks (~2.5min) | Marks Stalled: true in progress report |
| Orchestrator | StallMaxThreshold 20 (~10min at 30s) | Returns false from liveness -> command fails |
Thresholds are configured in `internal/defaults/defaults.go` under `defaults.Commands.*`.
Provisioning Input Framework
Chains declare what inputs they need via `provisioning_inputs:` in chain config YAML. The framework handles schema resolution, validation, storage, and resolution for workflow activities.
Architecture
Package: `internal/provision/input/`
| Component | File | Responsibility |
|---|---|---|
| InputSchemaProvider | schema.go | Resolves InputSpec[] from chain config |
| InputValidator | validator.go | Validates user input against spec (regex, required) |
| InputStore | store.go | Persists text/proof inputs, checks satisfaction |
| InputResolver | resolver.go | Merges text/proof + secret inputs for workflow activities |
Input Types
| Type | Storage | Example |
|---|---|---|
| text | provisioning_inputs (plaintext) | Validator address, network ID |
| proof | provisioning_inputs (plaintext) | Attestation proof |
| secret | Depends on source (see below) | Private key, mnemonic |
Secret Sources
| Source | Storage | Handling |
|---|---|---|
| generated | node_keys (AES-256-GCM encrypted with DEK) | GenerateNodeKeys activity creates them |
| chain_generated | node_keys (AES-256-GCM) | CaptureChainGeneratedKeys reads from disk after CONFIGURE |
| user_provided | provisioning_input_secrets (sealed box ciphertext) | Client encrypts with NaCl sealed box; PrepareUserProvidedKeys decrypts + re-encrypts with DEK |
Input Spec Definition
Declared in chain config YAML under `provisioning_inputs:`:
The `keys:` field auto-converts to an InputSpec with `type: secret`, `source: generated`.
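A hypothetical `provisioning_inputs:` fragment — the `type`/`source` values match the tables above, but the exact schema keys (`pattern`, `required`, input names) are assumptions, not taken from a real chain config:

```yaml
# Illustrative only; consult an actual chain config for the real schema.
provisioning_inputs:
  - name: validator_address
    type: text
    required: true
    pattern: "^0x[0-9a-fA-F]{40}$"   # enforced by InputValidator
  - name: attestation_proof
    type: proof
    required: false
  - name: validator_key
    type: secret
    source: user_provided             # sealed-box encrypted by the client
```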
Sealed Box Encryption
File: `internal/crypto/sealedbox.go`
User-provided secrets are client-side encrypted using NaCl sealed box (X25519 + XSalsa20-Poly1305):
The server's public key is exposed via `CryptoHandler.GetPublicKey()` (`internal/api/handler_crypto.go`).
See also: CLAUDE.md — Provisioning Input Framework for code-level conventions.
CAPTURE_KEY Protocol
Proto: `proto/agent.proto` — COMMAND_TYPE_CAPTURE_KEY (value 8), CaptureKeyPayload { key_name, key_path, key_type }
After CONFIGURE, some chains generate keys on disk. The CaptureChainGeneratedKeys activity:
- Sends `CAPTURE_KEY` command to agent with key path and name
- Agent reads file, encrypts with DEK, returns `encrypted_key` in response metadata
- Agent zeros plaintext via `crypto.ZeroBytes()`
- Activity stores encrypted key in `node_keys` via `keyRepo.Create()`
`validateKeyPath()` prevents path traversal (must be within ChainDataDir).
See also: docs/plans/capture-key-protocol.md for implementation design.
Upgrade Rollout Workflows
Binary and config upgrades use a three-level workflow hierarchy orchestrated by Temporal, with `COMMAND_TYPE_UPGRADE` (value 9) delivered through the existing command queue. This is separate from the CONFIGURE path — upgrades do not re-execute recipes or wipe chain data.
Manifest loading: Rollouts can be created in auto mode (upgrade_id provided) or manual mode. In auto mode, the API handler loads an upgrade manifest from {ConfigDir()}/upgrades/{chain_profile_id}/{upgrade_id}.yaml via manifest.Reader, auto-populating binary details, state compatibility, config changes, and content hash. See Upgrade Rollout — Manifest Loader.
Workflow Hierarchy
| Level | Workflow | ID Pattern | Purpose |
|---|---|---|---|
| Group | RolloutGroupWorkflow | rollout-group-{group_id} | Multi-binary coordination (e.g., geth + lighthouse). Optional — standalone rollouts skip this. |
| Rollout | RolloutWorkflow | rollout-{rollout_id} | Per-component: batching, failure threshold, strategy enforcement, signals |
| Node | UpgradeNodeWorkflow | upgrade-{rollout_id}-{node_id} | Per-node: maintenance transition, command dispatch, health validation |
RolloutGroupWorkflow
Thin orchestrator for multi-binary chains. Iterates components in declared order, launching one `RolloutWorkflow` per component with a health gate between components.
- For each component in `component_order`:
  - Launch `RolloutWorkflow` as child
  - Wait for child completion
  - On failure: dispatch based on `failure_policy` (`partial_ok`, `rollback_all`, `manual`)
  - Health gate: validate all upgraded nodes healthy before next component
- Launch `FinalizeGroup` activity to set terminal status
Signals: `pause`, `resume`, `cancel`, `rollback`, `skip`
RolloutWorkflow
Signals:
| Signal | Behavior |
|---|---|
| pause | Pause after current batch completes |
| resume | Resume from paused state |
| cancel | Stop launching new batches, let in-flight children complete |
| force | Override failure threshold |
| approve_canary | Approve canary batch for breaking migrations (blocks until received) |
| rollback | Full rollback of all succeeded nodes |
Strategies: `rolling` (`batch_size` at a time), `canary` (`canary_size` first, then rolling), `all_at_once` (single batch, for hard forks).
Scaling: Uses continue-as-new after each batch, resetting event history. No node count limit. Progress tracked in DB, not Temporal state.
UpgradeNodeWorkflow
Compensation: The agent handles per-action compensation (reverse-order rollback of completed actions). If agent-side rollback succeeds, node returns to previous state. If rollback fails (FAILED_ROLLBACK), a critical incident is created.
Upgrade Activities
| Activity | Purpose |
|---|---|
PreCheckNode | Verify node exists, state is upgradeable, subscription active |
SendUpgradeCommand | Push COMMAND_TYPE_UPGRADE with UpgradePayload to Redis queue |
WaitForCommandCompletion | Liveness-based waiting with progress monitoring |
UpdateNodeBinaryVersion | Write new version to node_config_state |
WaitForHealthValidation | Poll: heartbeat received, app_health != critical, sync != stalled, no crashloop |
ResolveRolloutTargets | JOIN nodes + node_config_state, assign batch_index, record previous versions |
UpdateRolloutProgress | Update denormalized counters on rollouts table |
FinalizeRollout | Set terminal status, force DB-Temporal sync |
ValidateCanaryHealth | Extended observation (2x health_wait) after canary batch |
ResolveGroupComponents | Resolve component rollouts for group |
FinalizeGroup | Set group terminal status |
MarkNodeUpgradeSucceeded | Update rollout_nodes status |
Activity Profiles
| Profile | Timeout | Retries |
|---|---|---|
ProfileUpgradeDefault | 10 min | 3 |
ProfileHealthValidation | Configurable per rollout | 1 |
Agent Upgrade Execution
The agent executes upgrades via a three-layer architecture:
| Layer | Responsibility |
|---|---|
| Action Primitives (8 actions) | Express WHAT happens — generic, runtime-agnostic |
| Runtime Adapter | Express HOW for each runtime (systemd/Docker) |
| Executor | Orchestrate sequencing, checkpoints, compensation |
Upgrade commands reuse the existing command-queue path (`SendUpgradeCommand` + `WaitForCommandCompletion`). The executor heartbeats Temporal during each action for visibility.
Action sequence: BackupCurrentState → AcquireArtifact → VerifyArtifact → StopNode → InstallArtifact → WriteConfigs (conditional) → ReloadDaemon → StartNode
See also: Extending — Adding a Runtime Adapter for how to add support for new runtimes.
NATS Events (Upgrade Rollout)
New subjects for upgrade rollout events:
| Subject Pattern | Description |
|---|---|
| events.rollout.{rolloutID}.started | Rollout started |
| events.rollout.{rolloutID}.progress | Rollout progress update |
| events.rollout.{rolloutID}.completed | Rollout completed |
| events.node.{nodeID}.upgrade_started | Node upgrade started |
| events.node.{nodeID}.upgrade_completed | Node upgrade completed |
| events.node.{nodeID}.upgrade_phase_changed | Node upgrade phase transition |
Metrics Flow (Observation System)
Collector types: `prometheus`, `http`, `script`, `otlp` — chain-agnostic, configured in `observation.yaml`
CEL policy evaluation (internal/observation/):
- Programs compiled once (`LoadPolicies()`), reused across cycles
- `metric(m, "name")` / `has_metric(m, "name")` — binary bindings
- Ordered verdict evaluation (first match wins): critical > degraded > ok > unknown
- Circuit breaker on Victoria Metrics reads (fail-fast on outage)
- PromQL injection guard validates `chain_profile_id` against strict regex
NATS uses two accounts: AGENT (ops-agents) and CONTROL_PLANE (internal services). Ops-agents receive a signed user JWT from the agent-gateway on first connect. Internal services self-sign ephemeral user JWTs using account signing seeds from Vault. Port 4222 internal, 4223 external (TLS).
See also: NATS JWT Operator Mode for the full authentication architecture, credential flows, and operational procedures.
See also: Health and Incidents for how policy verdicts feed into the incident pipeline.
NATS Consumers
Payment Consumer
File: `internal/consumers/payment_consumer.go`
Subscribes to `payment.completed` on NATS JetStream (durable consumer, manual ACK).
Two-path activation:
- Path 1 (preferred): Activates existing `pending_payment` subscriptions linked during `POST /api/v1/payments`
- Path 2 (fallback): Creates subscriptions from payment line items (backward compatibility)
Error handling:
| Error | Action |
|---|---|
| Unmarshal failure | NAK (requeue) |
| Idempotency hit | ACK (skip) |
| Subscription failure | NAK (requeue, retry) |
| Workflow start failure | NAK (requeue, retry) |
| Malformed message | ACK (prevent infinite retry) |
Idempotency Store
File: `internal/consumers/idempotency_store.go`
Redis-backed (`RedisIdempotencyStore`): key `idempotency:{key}` with configurable TTL. `NoopIdempotencyStore` for testing.
Configuration: See Environment Variables for the full list of PAYMENT_CONSUMER_* variables (enabled, URL, stream name, consumer name, ACK wait, max deliver).
Payment-to-Provisioning Flow
See also: Payment Service for the payment service architecture.
Related Documents
- Overview — System overview and service descriptions
- Domain Model — Canonical state machines and business rules
- Health and Incidents — Health evaluation pipeline
- Payment Service — Payment service architecture
- Extending — Adding workflows, activities