Last verified: 2026-02-13 | Commit scope: bc0fb41

HTTP Request Flow

Auth middleware: internal/api/middleware.go. Both JWT and API key requests are subject to scope enforcement via RequireScope(). JWT users without the required scope receive 403 Forbidden. The admin scope is backend-granted only — users cannot self-assign it via the API key creation endpoint. Rollout endpoints require admin scope.
See also: CLAUDE.md — Server Separation for endpoint lists and auth details.

Subscription Lifecycle

Subscriptions follow a two-phase creation flow: API creates with pending_payment, payment consumer activates.
See also: Domain Model — Subscription State Machine for the canonical state machine and status definitions.
Creation: POST /api/v1/subscriptions (internal/api/handler_subscription.go)
  1. Check for existing pending_payment matching (user, chain, nodeType, duration) — return if found (idempotent)
  2. Validate chain profile and node type
  3. Resolve provider, region, instance_type from chain profile + duration mapping
  4. Create subscription with status pending_payment, ExpiresAt placeholder, PaymentID null
  5. Return HTTP 201 (created) or HTTP 200 (existing returned)
Activation: internal/consumers/payment_consumer.go — see NATS Consumers below.
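The idempotent creation flow above can be sketched as follows. This is a minimal sketch: the `Subscription` fields, the in-memory `store`, and `createSubscription` are illustrative stand-ins for the real handler and repository, not the actual code in handler_subscription.go.

```go
package main

import "fmt"

// Subscription is a pared-down model for illustration only.
type Subscription struct {
	ID, UserID, Chain, NodeType, Duration, Status string
}

// store is keyed by the idempotency tuple (user, chain, nodeType, duration).
type store map[[4]string]*Subscription

// createSubscription returns (sub, created). An existing pending_payment
// subscription for the same tuple is returned unchanged, mirroring the
// HTTP 200-vs-201 behaviour described above.
func createSubscription(s store, user, chain, nodeType, duration string) (*Subscription, bool) {
	key := [4]string{user, chain, nodeType, duration}
	if existing, ok := s[key]; ok && existing.Status == "pending_payment" {
		return existing, false // idempotent hit -> HTTP 200
	}
	sub := &Subscription{
		ID:     fmt.Sprintf("sub-%d", len(s)+1),
		UserID: user, Chain: chain, NodeType: nodeType, Duration: duration,
		Status: "pending_payment", // the payment consumer activates it later
	}
	s[key] = sub
	return sub, true // HTTP 201
}

func main() {
	s := store{}
	_, created := createSubscription(s, "u1", "ethereum", "validator", "30d")
	_, again := createSubscription(s, "u1", "ethereum", "validator", "30d")
	fmt.Println(created, again) // second call returns the existing row
}
```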

Temporal Workflow Execution

When telemetry is enabled, activities are instrumented via TracingInterceptor — each activity creates an OpenTelemetry span.

Provision Node Workflow

File: internal/workflows/provision.go
Steps marked with ? are conditional, based on the provisioning input specs.
Compensation (on any failure): compensate() runs DestroyNode (Terraform destroy + key deletion) and UpdateNodeState -> failed.

Key Activity Details

Activity | File | Purpose
ValidateProvisionInputs | internal/activities/provision.go | Checks all required pre_provision inputs are satisfied
GenerateNodeKeys | internal/activities/keys.go | Creates DEK + chain keys (only source=generated), encrypts, stores in node_keys, backs up to S3
PrepareUserProvidedKeys | internal/activities/provision.go | Decrypts sealed box ciphertext -> re-encrypts with node DEK (conditional)
RunTerraform | internal/activities/provision.go | Provisions VM via Terraform module registry
WaitForAgentReady | internal/activities/agent.go | Polls agent_registrations for node_id (5-min timeout)
SendConfigureCommand | internal/activities/provision.go | Pushes config command to Redis queue, waits for ACK
WaitForCommandCompletion | internal/activities/provision.go | Liveness-based waiting with progress monitoring
CaptureChainGeneratedKeys | internal/activities/provision.go | Sends CAPTURE_KEY to agent, stores encrypted key (conditional)
InjectKeys | internal/activities/keys.go | Delivers encrypted keys to agent via gRPC
StartNode | internal/activities/agent.go | Sends START command; agent runs systemctl start
MarkInputsApplied | internal/activities/provision.go | Updates input statuses to applied

Migrate Node Workflow

File: internal/workflows/migrate.go
UpdateState(maintenance) -> GetDetails -> ProvisionNew -> WaitAgent -> Configure ->
CheckBackup -> RestoreSnapshot -> InjectKeys -> StartNode -> DestroyOldHost -> UpdateHostInfo -> UpdateState(syncing)
Compensation (compensateMigration): Only if new infrastructure exists — destroys new host, restores node to degraded state. Old host preserved if failure occurs before new infra is ready.

Terminate Node Workflow

File: internal/workflows/terminate.go
GetNodeDetails -> UpdateState(terminating) -> SendStop -> DeleteKeys -> DestroyInfra -> VerifyCleanup -> UpdateState(terminated)
Best-effort: Continues even if steps fail. Partial failures accumulate in PartialFailures[]. Only final state update is critical.

Command Execution and Progress Monitoring

Async Command Flow

Commands run asynchronously to avoid blocking heartbeats during long operations (e.g., 130GB snapshot downloads).
  1. Orchestrator pushes command to Redis queue via SendConfigureCommand
  2. Agent receives command in heartbeat response, sends command_acknowledged event
  3. Agent adds command ID to running_command_ids in subsequent heartbeats
  4. Agent executes command, reports progress via command_progress events
  5. Agent sends command_completed/command_failed, removes from running_command_ids
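The agent-side lifecycle of a single command can be sketched as follows. `agentState` and `execute` are illustrative, and progress events (step 4's command_progress) are omitted:

```go
package main

import "fmt"

// agentState tracks which command IDs the agent advertises in its
// heartbeats. The structure is illustrative, not the agent's real type.
type agentState struct {
	runningCommandIDs map[string]bool
}

// execute returns the event stream emitted for one command.
func (a *agentState) execute(cmdID string, run func() error) []string {
	events := []string{"command_acknowledged"} // step 2
	a.runningCommandIDs[cmdID] = true          // step 3: shown in heartbeats
	err := run()                               // step 4: long-running work
	delete(a.runningCommandIDs, cmdID)         // step 5: removed on terminal event
	if err != nil {
		return append(events, "command_failed")
	}
	return append(events, "command_completed")
}

func main() {
	a := &agentState{runningCommandIDs: map[string]bool{}}
	fmt.Println(a.execute("cmd-1", func() error { return nil }))
}
```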

Liveness-Based Waiting

WaitForCommandCompletion (internal/activities/provision.go) uses liveness checks instead of fixed timeouts:
  1. Verify command ID in agent’s running_command_ids (retry: 3 attempts, 100ms backoff)
  2. Read progress from Redis progress:{commandID} for Temporal heartbeats
  3. Detect stalls: StallCount >= StallMaxThreshold (20) -> fail
  4. On completion event -> return. On liveness failure -> grace window -> fail.
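A minimal sketch of this decision logic, with Redis and Temporal replaced by a pre-recorded sequence of observations. `progressCheck`, `waitForCompletion`, and the string return values are illustrative, not the activity's real signature:

```go
package main

import "fmt"

// progressCheck is one polling cycle's observation (illustrative shape).
type progressCheck struct {
	running  bool // command ID present in the agent's running_command_ids
	progress int  // monotonic progress value read from Redis
	done     bool // completion event observed
}

const stallMaxThreshold = 20 // orchestrator default described above

// waitForCompletion walks the checks and returns "completed", "stalled",
// or "liveness_lost", mimicking the decision logic above.
func waitForCompletion(checks []progressCheck) string {
	stallCount, last := 0, -1
	for _, c := range checks {
		if c.done {
			return "completed"
		}
		if !c.running {
			return "liveness_lost" // the real code allows a grace window first
		}
		if c.progress <= last {
			stallCount++ // no forward progress this cycle
			if stallCount >= stallMaxThreshold {
				return "stalled"
			}
		} else {
			stallCount, last = 0, c.progress // progress resets the stall count
		}
	}
	return "stalled" // ran out of observations without completing
}

func main() {
	fmt.Println(waitForCompletion([]progressCheck{
		{running: true, progress: 1},
		{running: true, progress: 2},
		{done: true},
	}))
}
```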

Stall Detection

Side | Threshold | Action
Agent | 5 consecutive no-progress checks (~2.5 min) | Marks Stalled: true in progress report
Orchestrator | StallMaxThreshold 20 (~10 min at 30 s intervals) | Returns false from liveness -> command fails
Configuration: internal/defaults/defaults.go under defaults.Commands.*

Provisioning Input Framework

Chains declare what inputs they need via provisioning_inputs: in chain config YAML. The framework handles schema resolution, validation, storage, and resolution for workflow activities.

Architecture

Package: internal/provision/input/
Component | File | Responsibility
InputSchemaProvider | schema.go | Resolves InputSpec[] from chain config
InputValidator | validator.go | Validates user input against spec (regex, required)
InputStore | store.go | Persists text/proof inputs, checks satisfaction
InputResolver | resolver.go | Merges text/proof + secret inputs for workflow activities

Input Types

Type | Storage | Example
text | provisioning_inputs (plaintext) | Validator address, network ID
proof | provisioning_inputs (plaintext) | Attestation proof
secret | Depends on source (see below) | Private key, mnemonic

Secret Sources

Source | Storage | Handling
generated | node_keys (AES-256-GCM, encrypted with DEK) | GenerateNodeKeys activity creates them
chain_generated | node_keys (AES-256-GCM) | CaptureChainGeneratedKeys reads from disk after CONFIGURE
user_provided | provisioning_input_secrets (sealed box ciphertext) | Client encrypts with NaCl sealed box; PrepareUserProvidedKeys decrypts + re-encrypts with DEK

Input Spec Definition

Declared in chain config YAML under provisioning_inputs:
provisioning_inputs:
  - name: validator_address
    type: text
    required: true
    validation: "^0x[0-9a-fA-F]{40}$"
    phase: pre_provision
  - name: node_key
    type: secret
    source: generated
    key_type: ed25519
    phase: pre_provision
Backward compatibility: Legacy keys: field auto-converts to InputSpec with type: secret, source: generated.
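A sketch of how such specs might be modeled and applied. The `InputSpec` struct, `legacyKeysToSpecs`, and `validateText` are illustrative, not the actual types in internal/provision/input/:

```go
package main

import (
	"fmt"
	"regexp"
)

// InputSpec mirrors the YAML fields above (an illustrative struct).
type InputSpec struct {
	Name       string
	Type       string // text | proof | secret
	Source     string // generated | chain_generated | user_provided
	Required   bool
	Validation string // regex applied to text inputs
	KeyType    string
	Phase      string
}

// legacyKeysToSpecs converts a legacy keys: list into InputSpecs with
// type=secret, source=generated, per the compatibility note above.
func legacyKeysToSpecs(keys []string) []InputSpec {
	specs := make([]InputSpec, 0, len(keys))
	for _, k := range keys {
		specs = append(specs, InputSpec{Name: k, Type: "secret", Source: "generated"})
	}
	return specs
}

// validateText applies the spec's required flag and validation regex.
func validateText(spec InputSpec, value string) error {
	if spec.Required && value == "" {
		return fmt.Errorf("%s: required input missing", spec.Name)
	}
	if spec.Validation == "" {
		return nil
	}
	re, err := regexp.Compile(spec.Validation)
	if err != nil {
		return fmt.Errorf("%s: bad validation regex: %v", spec.Name, err)
	}
	if !re.MatchString(value) {
		return fmt.Errorf("%s: value does not match %s", spec.Name, spec.Validation)
	}
	return nil
}

func main() {
	spec := InputSpec{Name: "validator_address", Type: "text", Required: true,
		Validation: "^0x[0-9a-fA-F]{40}$"}
	fmt.Println(validateText(spec, "0x1234567890abcdef1234567890abcdef12345678"))
}
```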

Sealed Box Encryption

File: internal/crypto/sealedbox.go
User-provided secrets are client-side encrypted using NaCl sealed box (X25519 + XSalsa20-Poly1305):
Client: GET /api/v1/crypto/public-key (no auth) -> X25519 public key
Client: tweetnacl.sealedbox.seal(secret, publicKey) -> ephPub(32) || nonce(24) || ciphertext
Client: POST /api/v1/subscriptions/{id}/inputs with sealed ciphertext
Server: Stores sealed ciphertext in provisioning_input_secrets table
Workflow: PrepareUserProvidedKeys decrypts with server X25519 private key (Vault)
         -> re-encrypts with node DEK -> stores in node_keys
Handler: internal/api/handler_crypto.go — CryptoHandler.GetPublicKey()
See also: CLAUDE.md — Provisioning Input Framework for code-level conventions.
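The blob layout above (ephPub(32) || nonce(24) || ciphertext) can be split mechanically before handing off to a NaCl library for decryption. `splitSealedBox` is a hypothetical helper, not the code in internal/crypto/sealedbox.go:

```go
package main

import (
	"errors"
	"fmt"
)

// splitSealedBox splits the client-produced blob into the three parts
// shown above: ephPub(32) || nonce(24) || ciphertext. Actual decryption
// (X25519 + XSalsa20-Poly1305) is left to a NaCl library.
func splitSealedBox(blob []byte) (ephPub, nonce, ciphertext []byte, err error) {
	const pubLen, nonceLen, tagLen = 32, 24, 16 // tagLen: Poly1305 authenticator
	if len(blob) < pubLen+nonceLen+tagLen {
		return nil, nil, nil, errors.New("sealed box too short")
	}
	return blob[:pubLen], blob[pubLen : pubLen+nonceLen], blob[pubLen+nonceLen:], nil
}

func main() {
	blob := make([]byte, 32+24+16) // smallest valid blob: empty plaintext
	pub, nonce, ct, err := splitSealedBox(blob)
	fmt.Println(len(pub), len(nonce), len(ct), err)
}
```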

CAPTURE_KEY Protocol

Proto: proto/agent.proto — COMMAND_TYPE_CAPTURE_KEY (value 8), CaptureKeyPayload { key_name, key_path, key_type }
After CONFIGURE, some chains generate keys on disk. The CaptureChainGeneratedKeys activity:
  1. Sends CAPTURE_KEY command to agent with key path and name
  2. Agent reads file, encrypts with DEK, returns encrypted_key in response metadata
  3. Agent zeros plaintext via crypto.ZeroBytes()
  4. Activity stores encrypted key in node_keys via keyRepo.Create()
Security: validateKeyPath() prevents path traversal (must be within ChainDataDir).
See also: docs/plans/capture-key-protocol.md for implementation design.
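The traversal guard can be sketched as follows. This is an illustrative reimplementation of the check described above, not the actual validateKeyPath():

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// validateKeyPath rejects key paths that escape the chain data directory,
// preventing an agent from being asked to read arbitrary files.
func validateKeyPath(chainDataDir, keyPath string) error {
	root := filepath.Clean(chainDataDir)
	// Join + Clean collapses any ../ segments in the requested path.
	abs := filepath.Clean(filepath.Join(root, keyPath))
	if abs != root && !strings.HasPrefix(abs, root+string(filepath.Separator)) {
		return fmt.Errorf("key path %q escapes %q", keyPath, chainDataDir)
	}
	return nil
}

func main() {
	fmt.Println(validateKeyPath("/data/chain", "keys/node_key.json"))
	fmt.Println(validateKeyPath("/data/chain", "../../etc/passwd"))
}
```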

Upgrade Rollout Workflows

Binary and config upgrades use a three-level workflow hierarchy orchestrated by Temporal, with COMMAND_TYPE_UPGRADE (value 9) delivered through the existing command queue. This is separate from the CONFIGURE path — upgrades do not re-execute recipes or wipe chain data.
Manifest loading: Rollouts can be created in auto mode (upgrade_id provided) or manual mode. In auto mode, the API handler loads an upgrade manifest from {ConfigDir()}/upgrades/{chain_profile_id}/{upgrade_id}.yaml via manifest.Reader, auto-populating binary details, state compatibility, config changes, and content hash. See Upgrade Rollout — Manifest Loader.

Workflow Hierarchy

Level | Workflow | ID Pattern | Purpose
Group | RolloutGroupWorkflow | rollout-group-{group_id} | Multi-binary coordination (e.g., geth + lighthouse). Optional — standalone rollouts skip this.
Rollout | RolloutWorkflow | rollout-{rollout_id} | Per-component: batching, failure threshold, strategy enforcement, signals
Node | UpgradeNodeWorkflow | upgrade-{rollout_id}-{node_id} | Per-node: maintenance transition, command dispatch, health validation

RolloutGroupWorkflow

Thin orchestrator for multi-binary chains. Iterates components in declared order, launching one RolloutWorkflow per component with a health gate between components.
  1. For each component in component_order:
    • Launch RolloutWorkflow as child
    • Wait for child completion
    • On failure: dispatch based on failure_policy (partial_ok, rollback_all, manual)
    • Health gate: validate all upgraded nodes healthy before next component
  2. FinalizeGroup activity sets terminal status
Signals: pause, resume, cancel, rollback, skip

RolloutWorkflow

Signals:
Signal | Behavior
pause | Pause after current batch completes
resume | Resume from paused state
cancel | Stop launching new batches, let in-flight children complete
force | Override failure threshold
approve_canary | Approve canary batch for breaking migrations (blocks until received)
rollback | Full rollback of all succeeded nodes
Strategies: rolling (batch_size at a time), canary (canary_size first, then rolling), all_at_once (single batch, for hard forks).
Scaling: Uses continue-as-new after each batch, resetting event history. No node count limit. Progress is tracked in the DB, not in Temporal state.
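The three strategies differ only in how nodes are grouped into batches; a sketch (`planBatches` is illustrative — the real workflow tracks batches in the DB and continues-as-new between them):

```go
package main

import "fmt"

// planBatches groups node IDs per the strategies above. canarySize is
// only meaningful for the canary strategy.
func planBatches(strategy string, nodes []string, batchSize, canarySize int) [][]string {
	switch strategy {
	case "all_at_once":
		return [][]string{nodes} // single batch, e.g. for hard forks
	case "canary":
		if canarySize > len(nodes) {
			canarySize = len(nodes)
		}
		// Canary batch first, then fall back to rolling for the rest.
		rest := planBatches("rolling", nodes[canarySize:], batchSize, 0)
		return append([][]string{nodes[:canarySize]}, rest...)
	default: // rolling
		if batchSize < 1 {
			batchSize = 1
		}
		var batches [][]string
		for i := 0; i < len(nodes); i += batchSize {
			end := i + batchSize
			if end > len(nodes) {
				end = len(nodes)
			}
			batches = append(batches, nodes[i:end])
		}
		return batches
	}
}

func main() {
	nodes := []string{"n1", "n2", "n3", "n4", "n5"}
	fmt.Println(planBatches("canary", nodes, 2, 1)) // canary of 1, then batches of 2
}
```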

UpgradeNodeWorkflow

Compensation: The agent handles per-action compensation (reverse-order rollback of completed actions). If agent-side rollback succeeds, node returns to previous state. If rollback fails (FAILED_ROLLBACK), a critical incident is created.

Upgrade Activities

Activity | Purpose
PreCheckNode | Verify node exists, state is upgradeable, subscription active
SendUpgradeCommand | Push COMMAND_TYPE_UPGRADE with UpgradePayload to Redis queue
WaitForCommandCompletion | Liveness-based waiting with progress monitoring
UpdateNodeBinaryVersion | Write new version to node_config_state
WaitForHealthValidation | Poll: heartbeat received, app_health != critical, sync != stalled, no crashloop
ResolveRolloutTargets | JOIN nodes + node_config_state, assign batch_index, record previous versions
UpdateRolloutProgress | Update denormalized counters on rollouts table
FinalizeRollout | Set terminal status, force DB-Temporal sync
ValidateCanaryHealth | Extended observation (2x health_wait) after canary batch
ResolveGroupComponents | Resolve component rollouts for group
FinalizeGroup | Set group terminal status
MarkNodeUpgradeSucceeded | Update rollout_nodes status

Activity Profiles

Profile | Timeout | Retries
ProfileUpgradeDefault | 10 min | 3
ProfileHealthValidation | Configurable per rollout | 1

Agent Upgrade Execution

The agent executes upgrades via a three-layer architecture:
Layer | Responsibility
Action Primitives (8 actions) | Express WHAT happens — generic, runtime-agnostic
Runtime Adapter | Express HOW for each runtime (systemd/Docker)
Executor | Orchestrate sequencing, checkpoints, compensation
All 8 actions run inside a single Temporal activity (SendUpgradeCommand + WaitForCommandCompletion). The executor heartbeats Temporal during each action for visibility.
Action sequence: BackupCurrentState → AcquireArtifact → VerifyArtifact → StopNode → InstallArtifact → WriteConfigs (conditional) → ReloadDaemon → StartNode
See also: Extending — Adding a Runtime Adapter for how to add support for new runtimes.

NATS Events (Upgrade Rollout)

New subjects for upgrade rollout events:
Subject Pattern | Description
events.rollout.{rolloutID}.started | Rollout started
events.rollout.{rolloutID}.progress | Rollout progress update
events.rollout.{rolloutID}.completed | Rollout completed
events.node.{nodeID}.upgrade_started | Node upgrade started
events.node.{nodeID}.upgrade_completed | Node upgrade completed
events.node.{nodeID}.upgrade_phase_changed | Node upgrade phase transition

Metrics Flow (Observation System)

Collector types: prometheus, http, script, otlp — chain-agnostic, configured in observation.yaml.
CEL policy evaluation (internal/observation/):
  • Programs compiled once (LoadPolicies()), reused across cycles
  • metric(m, "name") / has_metric(m, "name") — binary bindings
  • Ordered verdict evaluation (first match wins): critical > degraded > ok > unknown
  • Circuit breaker on Victoria Metrics reads (fail-fast on outage)
  • PromQL injection guard validates chain_profile_id against strict regex
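First-match-wins verdict evaluation can be sketched with the compiled CEL programs stubbed out as Go predicates (`policy` and `evaluate` are illustrative, not the types in internal/observation/):

```go
package main

import "fmt"

// policy pairs a verdict with a predicate over a metrics snapshot.
// Real policies are compiled CEL programs; Go funcs stand in here.
type policy struct {
	verdict string
	match   func(m map[string]float64) bool
}

// evaluate walks policies in severity order (critical > degraded > ok)
// and returns the first matching verdict, defaulting to "unknown".
func evaluate(policies []policy, m map[string]float64) string {
	for _, p := range policies {
		if p.match(m) {
			return p.verdict // first match wins
		}
	}
	return "unknown"
}

func main() {
	policies := []policy{
		{"critical", func(m map[string]float64) bool { return m["peers"] < 1 }},
		{"degraded", func(m map[string]float64) bool { return m["peers"] < 3 }},
		{"ok", func(m map[string]float64) bool { return true }},
	}
	fmt.Println(evaluate(policies, map[string]float64{"peers": 2}))
}
```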
NATS auth: JWT operator mode with two accounts, AGENT (ops-agents) and CONTROL_PLANE (internal services). Ops-agents receive a signed user JWT from the agent-gateway on first connect; internal services self-sign ephemeral user JWTs using account signing seeds from Vault. Port 4222 internal, 4223 external (TLS).
See also: NATS JWT Operator Mode for the full authentication architecture, credential flows, and operational procedures.
See also: Health and Incidents for how policy verdicts feed into the incident pipeline.

NATS Consumers

Payment Consumer

File: internal/consumers/payment_consumer.go
Subscribes to payment.completed on NATS JetStream (durable consumer, manual ACK). Two-path activation:
  • Path 1 (preferred): Activates existing pending_payment subscriptions linked during POST /api/v1/payments
  • Path 2 (fallback): Creates subscriptions from payment line items (backward compatibility)
Error handling:
Error | Action
Unmarshal failure | NAK (requeue)
Idempotency hit | ACK (skip)
Subscription failure | NAK (requeue, retry)
Workflow start failure | NAK (requeue, retry)
Malformed message | ACK (prevent infinite retry)
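The ACK/NAK decisions in the table reduce to a small classifier; a sketch (`ackDecision` and the error-class names are illustrative — the real consumer switches on concrete error values):

```go
package main

import "fmt"

// ackDecision maps a consumer error class to the JetStream action in
// the table above.
func ackDecision(errClass string) string {
	switch errClass {
	case "idempotency_hit", "malformed_message":
		return "ACK" // drop: already processed, or retrying can never succeed
	case "unmarshal_failure", "subscription_failure", "workflow_start_failure":
		return "NAK" // requeue so JetStream redelivers (up to max deliver)
	default:
		return "NAK" // unknown errors are assumed transient
	}
}

func main() {
	fmt.Println(ackDecision("idempotency_hit"), ackDecision("workflow_start_failure"))
}
```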

Idempotency Store

File: internal/consumers/idempotency_store.go
Redis-backed (RedisIdempotencyStore): key idempotency:{key} with configurable TTL. NoopIdempotencyStore is available for testing.
Configuration: See Environment Variables for the full list of PAYMENT_CONSUMER_* variables (enabled, URL, stream name, consumer name, ACK wait, max deliver).

Payment-to-Provisioning Flow

See also: Payment Service for the payment service architecture.