Last verified: 2026-02-13

Overview

The upgrade rollout system performs binary and configuration upgrades on running nodes without reprovisioning. It replaces node binaries and config files in-place while preserving chain data, keys, and node identity. When to use upgrades vs. reprovisioning:
| Scenario | Mechanism |
|---|---|
| New binary version (minor/patch) | Upgrade rollout |
| Config parameter change | Upgrade rollout (with config_files) |
| Breaking state migration | Upgrade rollout (canary + blocking approval) |
| New chain onboarding | Provisioning workflow |
| VM replacement / infra change | Migration workflow |
| Data corruption recovery | Reprovisioning |
Design principles:
  • Chain-agnostic: Same 8 action primitives serve all chains. No chain-specific code in the upgrade engine.
  • Runtime-agnostic: RuntimeAdapter interface abstracts systemd/Docker. Adding a runtime requires only a new adapter.
  • Idempotent: Three-layer idempotency (completed store, checkpoints, system state checks) ensures safe re-entry after crashes.
  • Compensating: Per-action Compensate() in reverse order restores previous state on failure.

Architecture

Manifest Loader

Package: internal/upgrade/manifest/
Rollout creation supports two modes:
  • Auto mode: Provide upgrade_id + chain_profile_id + node_type. The control plane loads the manifest from {ConfigDir()}/upgrades/{chain_profile_id}/{upgrade_id}.yaml and auto-populates binary_name, target_version, target_url, target_checksum, migration_type, recovery_plan, config_changes, and manifest_content_hash.
  • Manual mode: Omit upgrade_id. Operator provides all fields explicitly (existing behavior).
Manifests are YAML files stored in the S3-synced chain config directory. The manifest.Reader interface (Get, List, ContentHash) is implemented by FileReader, which resolves the config directory at call time via chains.Provider.ConfigDir(). No new infrastructure dependency.
# {ConfigDir}/upgrades/{chain_profile_id}/{upgrade_id}.yaml
upgrade_id: v1.5.0-hotfix
chain_profile_id: celestia-mocha
node_type: full
binary_name: celestia-node
target_version: v0.21.0
target_url: https://releases.celestia.org/v0.21.0/celestia-node-linux-amd64
target_checksum: sha256:abc123...
state_compatibility:
  migration_type: compatible
  notes: "State format unchanged"
config_changes:
  - path: /etc/celestia/config.toml
    template: config.toml.tmpl
    overrides:
      max_peers: "100"
manifest_content_hash is computed from the raw YAML bytes (sha256:<hex>), not manually provided. Rollout group component_order entries also support upgrade_id for per-component auto-population.

Agent Three-Layer Architecture

Actions express WHAT happens. Adapters express HOW for each runtime. The executor orchestrates sequencing and compensation. No layer has knowledge of the others’ concerns.

Workflow Hierarchy

Control plane to agent flow: Control plane sends COMMAND_TYPE_UPGRADE (value 9) with an UpgradePayload containing artifact references (URL, checksum, version). No execution instructions travel over the wire. The agent constructs the action sequence locally and injects the correct RuntimeAdapter based on runtime.type from config.

Agent Upgrade Handler

UpgradeAction Interface

File: internal/opsagent/upgrade/action.go
type UpgradeAction interface {
    Name() string
    Execute(ctx context.Context, rt RuntimeAdapter, params ActionParams) error
    IsAlreadyDone(ctx context.Context, rt RuntimeAdapter, params ActionParams) (bool, error)
    Compensate(ctx context.Context, rt RuntimeAdapter, params ActionParams) error
    Timeout() time.Duration
}

RuntimeAdapter Interface

File: internal/opsagent/upgrade/adapter.go
type RuntimeAdapter interface {
    runtime.Runtime  // Start, Stop, Restart, Status, ServiceName

    AcquireArtifact(ctx context.Context, url, checksum string) (ArtifactHandle, error)
    VerifyArtifact(ctx context.Context, handle ArtifactHandle, checksum string) error
    IsArtifactInstalled(ctx context.Context, name, checksum string) (bool, error)
    BackupState(ctx context.Context, backupDir string) error
    InstallArtifact(ctx context.Context, handle ArtifactHandle, target string) error
    WriteConfig(ctx context.Context, path string, content []byte) error
    IsConfigWritten(ctx context.Context, path string, expectedHash string) (bool, error)
    ReloadDaemon(ctx context.Context) error
    RestoreState(ctx context.Context, backupDir string) error
    IsArtifactAcquired(ctx context.Context, url, checksum string) (bool, error)
    IsBackedUp(ctx context.Context, backupDir string) (bool, error)
}

Action Primitives

Eight actions live in internal/opsagent/upgrade/actions/. The same set serves all runtimes:
| Action | Name | Timeout | IsAlreadyDone | Compensate |
|---|---|---|---|---|
| BackupCurrentState | backup_state | 1 min | Backup dir + manifest present | No-op (preserved for investigation) |
| AcquireArtifact | acquire_artifact | 5 min | Temp file/image exists + checksum matches | Clean up acquired artifact |
| VerifyArtifact | verify_artifact | 30 sec | Always false (safe to repeat) | No-op |
| StopNode | stop_node | 60 sec | Service not running | rt.Start() — restart service |
| InstallArtifact | install_artifact | 30 sec | Binary/container at target + checksum matches | rt.RestoreState() — restore from backup |
| WriteConfigs | write_configs | 30 sec | Per-file SHA256 matches expected | Restore configs from backup dir |
| ReloadDaemon | reload_daemon | 10 sec | Always false (safe to repeat) | rt.ReloadDaemon() — safe after config restore |
| StartNode | start_node | 30 sec | Service already running | rt.Stop() |

Executor

File: internal/opsagent/upgrade/executor.go
The executor iterates actions sequentially. For each action:
  1. Check IsAlreadyDone() via adapter (skip if true)
  2. Execute with per-action timeout budget
  3. On failure: call Compensate() on all completed actions in reverse order
  4. Checkpoint completed actions for crash recovery
Total execution budget: ~8.5 min. Temporal activity ScheduleToClose: 10 min. Heartbeat timeout: 30 sec.

Phase Reporting

The executor derives upgrade phase from the current action position:
| Actions | Phase |
|---|---|
| backup_state, acquire_artifact, verify_artifact | PREPARING |
| stop_node | STOPPED |
| install_artifact, write_configs, reload_daemon | MUTATING |
| start_node | STARTING |
| (Temporal-level health validation) | VERIFYING |
Phase transitions are reported via command response metadata and written to rollout_nodes.upgrade_phase.

Three-Layer Idempotency

| Layer | Scope | Mechanism |
|---|---|---|
| 1. Command-level | CompletedUpgradeStore | If upgrade_id already completed, skip entirely |
| 2. Step-level | Checkpoint manager with InputHash + TTL | Track per-action progress; expired in_progress (10 min) falls through |
| 3. State-level | IsAlreadyDone() per action via adapter | Verify actual system state matches expected |

Runtime Adapters

Systemd Adapter

| Method | Behavior |
|---|---|
| AcquireArtifact | HTTP download to {InstallTarget}/.download-{name} (same filesystem for atomic rename) |
| VerifyArtifact | SHA256 file hash comparison |
| IsArtifactInstalled | Binary exists at path + checksum matches |
| BackupState | Copy binary + configs to backup dir, write manifest.json |
| InstallArtifact | os.Rename(downloadPath, installPath) (atomic, same filesystem) |
| WriteConfig | os.WriteFile(path, content, 0644) |
| ReloadDaemon | systemctl daemon-reload |
| RestoreState | Copy files from backup dir to original paths |

Docker Adapter

| Method | Behavior |
|---|---|
| AcquireArtifact | docker pull {image@sha256:digest} |
| VerifyArtifact | Image digest match |
| IsArtifactInstalled | Container with correct image + name exists |
| BackupState | docker inspect -> save config JSON to backup dir |
| InstallArtifact | docker rename {c} {c}.bak + docker create new with cloned config |
| WriteConfig | Write to mounted volume path |
| ReloadDaemon | No-op (return nil) |
| RestoreState | docker rm new + docker rename {c}.bak {c} |
See also: Extending for how to add a new runtime adapter.

Data Model

Tables

rollout_groups — multi-component upgrade coordination:
| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| chain_profile_id | VARCHAR(100) | Target chain |
| status | VARCHAR(20) | Group lifecycle status |
| failure_policy | VARCHAR(20) | partial_ok, rollback_all, manual (required, no default) |
| component_order | JSONB | Ordered array of component definitions |
| desired_versions | JSONB | Target state: {binary_name: version} |
| strategy | VARCHAR(20) | Rollout strategy |

Partial unique index: one active group per chain (WHERE status IN ('pending', 'running', 'paused')).

rollouts — per-component rollout:
| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| group_id | UUID (nullable) | Parent group (null for standalone) |
| component_index | INT | Ordering within group |
| chain_profile_id | VARCHAR(100) | Target chain |
| binary_name | VARCHAR(100) | Binary to upgrade |
| source_version / target_version | VARCHAR(50) | Version transition |
| target_binary_url | TEXT | Resolved download URL |
| target_binary_checksum | TEXT | SHA256 or image digest (required) |
| strategy | VARCHAR(20) | rolling, canary, all_at_once |
| status | VARCHAR(20) | Rollout lifecycle status |
| total_nodes / succeeded_nodes / failed_nodes / pending_nodes | INT | Denormalized progress counters |
| manifest_content_hash | TEXT | SHA256 of manifest at creation time |

Partial unique index: one active rollout per chain.

rollout_nodes — per-node upgrade tracking:
| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| rollout_id | UUID | Parent rollout |
| node_id | UUID | Target node |
| batch_index | INT | Batch assignment |
| status | VARCHAR(20) | Node upgrade status |
| upgrade_phase | VARCHAR(20) | Display state (PREPARING, MUTATING, etc.) |
| previous_version / previous_binary_url / previous_checksum | TEXT | Rollback reference |
| error_message | TEXT | Failure details |
| attempt_count | INT | Retry counter |
Unique constraint: (rollout_id, node_id).

Status Enums

RolloutStatus / RolloutGroupStatus: the group status set mirrors the rollout status set, and RolloutGroupStatus adds partial (some components succeeded, some failed under the partial_ok policy). NodeUpgradeStatus tracks per-node lifecycle; UpgradePhase is the display state on rollout_nodes.

NodeConfigState

Uses the existing node_config_state table. binary_version is written by UpdateNodeBinaryVersion activity after successful upgrade. Initial version written by the provisioning workflow after CONFIGURE.
See also: Domain Model for canonical node and subscription models.

Temporal Workflows

RolloutGroupWorkflow

File: internal/workflows/rollout_group.go
Workflow ID: rollout-group-{group_id}
Search Attributes: GroupID, ChainProfileID
Thin orchestrator for multi-binary chains (e.g., ethereum-holesky: geth + lighthouse). Iterates components in declared order, launching one RolloutWorkflow per component with a health gate between components.
Signals: pause, resume, cancel, rollback, skip
Failure policies:
| Policy | Behavior |
|---|---|
| partial_ok | Failed component stops. Completed components stay upgraded. Group marked partial. |
| rollback_all | Any component failure triggers reverse-order rollback of ALL completed components. |
| manual | Failure pauses group. Operator decides: resume, rollback, or skip. |
Standalone single-component rollouts bypass this workflow entirely.

RolloutWorkflow

File: internal/workflows/rollout.go
Workflow ID: rollout-{rollout_id}
Search Attributes: RolloutID, ChainProfileID, Strategy, GroupID
Signals:
| Signal | Behavior |
|---|---|
| pause | Pause after current batch completes |
| resume | Resume from paused state |
| cancel | Stop launching new batches, let in-flight children complete |
| force | Override failure threshold |
Processing flow:
  1. Activity: ResolveRolloutTargets — filter eligible nodes, assign batch indexes
  2. Per batch: launch child UpgradeNodeWorkflow per node, wait for batch completion
  3. Activity: UpdateRolloutProgress — update counters
  4. Evaluate failure threshold (unless force_continue=true)
  5. Canary: after batch 0, extended observation (2x health_wait), validate canary health
  6. continue-as-new with remaining batches (event history resets per batch)
  7. Final batch: Activity FinalizeRollout (set terminal status, forced DB-Temporal sync)
Cancellation: Stop launching new batches. In-flight child workflows run to natural completion (upgrade or rollback). No forced cancellation of mid-upgrade nodes.

UpgradeNodeWorkflow

File: internal/workflows/upgrade_node.go
Workflow ID: upgrade-{rollout_id}-{node_id}
Search Attributes: NodeID, RolloutID, CorrelationID
Compensation (on failure at steps 3-6):
| Failure Point | Compensation |
|---|---|
| SendUpgrade / WaitForCompletion (agent compensated) | Agent already rolled back via per-action compensation |
| UpdateVersion / HealthValidation (artifact installed) | Agent-side compensation restores from backup |
| Agent reports FAILED_ROLLBACK | Critical incident, set degraded, no further automatic action |

Activity Profiles

| Profile | Timeout | Retries |
|---|---|---|
| ProfileUpgradeDefault | 10 min | 3 |
| ProfileHealthValidation | Configurable duration | 1 |
See also: Workflows for existing Temporal patterns and compensation conventions.

Rollout Strategies

Rolling

  • Upgrade batch_size nodes at a time (sequential batches, parallel within batch)
  • After each batch: check failure threshold
  • If failed / attempted > threshold -> auto-pause
  • Operator decides: resume, retry, cancel, rollback

Canary

  • Batch 0: canary_size random nodes
  • Extended observation: 2x health_wait_duration after canary batch
  • If canary healthy -> proceed with remaining nodes in rolling batches
  • If canary fails -> auto-pause
  • Breaking migrations (migration_type: breaking): canary batch is blocking — rollout cannot proceed until operator sends approve_canary signal

All-at-Once

  • All nodes in single batch (no sequential batching)
  • For hard forks with block height deadlines
  • force_continue=true: complete all nodes regardless of failures
  • SkipValidation=true: skip health wait for deadline pressure

Compensation and Rollback

Per-Action Compensation

On failure at any action, the executor calls Compensate() on all completed actions in reverse order. The ordering guarantees correct restoration:
  1. Configs restored before daemon-reload
  2. Daemon-reload before start
  3. Service started last
This prevents the deadly rollback inversion (old binary + new config). If any Compensate() fails -> FAILED_ROLLBACK terminal state + critical incident. Manual intervention required.

Backup Directory

Each upgrade creates a backup at {data_dir}/.upgrade-backup/{upgrade_id}/:
.upgrade-backup/
  {upgrade_id}/
    manifest.json          # Atomic marker (written last)
    binaries/
      celestia-node        # Previous binary
    configs/
      config.toml          # Previous config files
    container_config.json  # Docker: previous container inspect
Three-way checksum verification: before restoring, verify that the manifest checksum, the backup file checksum, and the original file checksum all match. Mismatch -> abort compensation, FAILED_ROLLBACK + incident.

Rollback Guarantee Tiers

| Tier | Scope | Guarantee | Recovery |
|---|---|---|---|
| Tier 1 | Binary + Config | Full. Previous binary and configs restored via per-action compensation. | Automatic |
| Tier 2 | State-Compatible | Binary + config restored. Old binary can read state written by new binary. | Automatic |
| Tier 3 | State-Incompatible | Binary + config restored, but chain state may be unreadable by old binary. Rollback may cause crash loop. | Reprovisioning |
Tier is declared via state_compatibility.migration_type in the upgrade manifest.
Post-rollback crash loop (Tier 3):
  1. Binary + config restored (Tier 1 always holds)
  2. Node starts -> old binary encounters incompatible state -> crash loop
  3. Health validation detects crash loop
  4. Critical incident created with recovery_plan from manifest
  5. Node set to degraded
  6. Operator-initiated reprovisioning is the recovery path

Full Rollout Rollback

Operator-triggered via API:
  1. Pause rollout (stop new batches)
  2. For each succeeded node: launch UpgradeNodeWorkflow with swapped versions (target = source)
  3. Wait for all rollback workflows
  4. Mark rollout rolled_back
Uses the same UpgradeNodeWorkflow. No separate rollback workflow.

Group Rollback (rollback_all Policy)

When a component fails in a group with rollback_all:
  • Reverse-order rollback of ALL completed components
  • Each component rolled back via RolloutWorkflow with swapped versions
  • Group marked rolled_back

Health Validation

WaitForHealthValidation

Polls for health_wait_duration with relaxed criteria. ALL must pass:
  1. At least one heartbeat received (agent alive)
  2. application_health != critical
  3. sync_status != stalled
  4. Node state not down or failed
  5. No crashloop (fewer than N restarts within health_wait_duration)
Does NOT require sync_status == synced (syncing expected after restart) or application_health == ok (degraded tolerated during stabilization).

Maintenance State Exemption

During upgrade (node in maintenance):
  • Heartbeat evaluator skips maintenance nodes (existing behavior)
  • Policy evaluator skips maintenance nodes
  • No false alarms during upgrade window

MaintenanceCleanup Loop

Periodic check (configurable, default 30 min) for nodes stuck in maintenance state. Catches nodes where the upgrade workflow failed without cleaning up state.
See also: Health and Incidents for the health evaluation pipeline.

Validation Rules

Creation-Time Rules

| # | Rule |
|---|---|
| 1 | checksum must be non-empty |
| 2 | Docker chains: checksum must be sha256: prefixed |
| 3 | Docker chains: resolved URL must contain @sha256: (digest-pinned, no mutable tags) |
| 4 | target_version != source_version |
| 5 | No active rollout for same chain_profile_id (DB unique index) |
| 6 | Manifest content hash computed and stored (auto-computed from YAML in auto mode, provided in manual mode) |
| 7 | Breaking migrations (migration_type: breaking): require recovery_plan, canary strategy, acknowledge_state_risk: true |
| 8 | Group validation: component_index unique within group, order matches component_order |
| 9 | Group creation: failure_policy required (no default), desired_versions non-empty, component_order covers all entries |

State Compatibility Validation

Declared in upgrade manifest (auto-populated from state_compatibility in auto mode, or provided directly in manual mode):
| migration_type | Constraints | Rollback Tier |
|---|---|---|
| none (default) | No additional constraints | Tier 1 |
| compatible | Canary recommended, not required | Tier 2 |
| breaking | Canary required, blocking approval, recovery_plan required, acknowledge_state_risk | Tier 3 |

Config Rendering Pipeline

For upgrades with config changes, the control plane renders config content:
  1. Load base template from hoodcloud-chain-configs/
  2. Build TemplateVars from chain profile + node config (same variables as CONFIGURE)
  3. Apply variable_overrides from upgrade manifest
  4. Render via Go text/template -> full content
  5. Include in UpgradePayload.config_files
Agent receives pre-rendered content. No template rendering on the agent for upgrades.

API Surface

Rollout Endpoints

| Method | Path | Description |
|---|---|---|
| POST | /api/v1/rollouts | Create rollout |
| GET | /api/v1/rollouts | List rollouts |
| GET | /api/v1/rollouts/{id} | Get rollout + progress |
| POST | /api/v1/rollouts/{id}/start | Start pending rollout |
| POST | /api/v1/rollouts/{id}/pause | Pause (Temporal signal) |
| POST | /api/v1/rollouts/{id}/resume | Resume (Temporal signal) |
| POST | /api/v1/rollouts/{id}/cancel | Cancel (Temporal signal) |
| POST | /api/v1/rollouts/{id}/rollback | Full rollback |
| GET | /api/v1/rollouts/{id}/nodes | Per-node status (includes upgrade_phase) |
| POST | /api/v1/rollouts/{id}/nodes/{nodeId}/retry | Retry failed node |

Rollout Group Endpoints

| Method | Path | Description |
|---|---|---|
| POST | /api/v1/rollout-groups | Create group (multi-component) |
| GET | /api/v1/rollout-groups/{id} | Get group + component rollouts |
| POST | /api/v1/rollout-groups/{id}/start | Start group |
| POST | /api/v1/rollout-groups/{id}/cancel | Cancel group |
| GET | /api/v1/rollout-groups/{id}/status | Group status + per-component progress |
Auth: DualAuthMiddleware (JWT + API key). Required scope: admin (backend-granted only — users cannot self-assign this scope). Both JWT and API key users must have admin scope to access rollout endpoints. The CreatedBy field on rollouts records the authenticated user ID for audit trail.

Integration Points

Incident Categories

| Category | Trigger |
|---|---|
| upgrade_failed | Node upgrade AND compensation both fail (FAILED_ROLLBACK) |
| upgrade_rollback_failed | Rollback of a previously succeeded node fails |
| upgrade_stalled | Health validation exceeds duration but node shows some life |
See also: Health and Incidents — Incident Service for existing incident categories.

Notification Events

Rollout-level (not per-node): rollout_started, rollout_completed, rollout_paused, rollout_failed, rollback_initiated, rollback_completed. Reuses the existing dispatcher infrastructure.

NATS Event Subjects

  • Rollout-level: events.rollout.{rolloutID}.started, .progress, .completed
  • Per-node: events.node.{nodeID}.upgrade_started, .upgrade_completed, .upgrade_phase_changed

Prometheus Metrics

| Metric | Type | Labels |
|---|---|---|
| hoodcloud_rollout_total | Counter | chain, strategy, status |
| hoodcloud_rollout_duration_seconds | Histogram | chain, strategy, status |
| hoodcloud_rollout_node_upgrades_total | Counter | chain, status |
| hoodcloud_rollout_node_upgrade_duration_seconds | Histogram | chain, status |
| hoodcloud_rollout_active | Gauge | chain |
| hoodcloud_rollout_progress | Gauge | chain, rollout_id |
| hoodcloud_rollout_node_phase | Gauge | chain, rollout_id, phase |

Operational Guide

Creating a Rollout

Auto mode (recommended): Create an upgrade manifest YAML in {ConfigDir}/upgrades/{chain_profile_id}/, then:
  1. POST /api/v1/rollouts with upgrade_id, chain_profile_id, node_type, source_version, strategy, and batch size
  2. Control plane loads manifest, auto-populates binary details, state compatibility, and config changes
  3. POST /api/v1/rollouts/{id}/start to begin execution
Manual mode: Provide all fields explicitly (no upgrade_id):
  1. POST /api/v1/rollouts with chain_profile_id, node_type, binary_name, source_version, target_version, target_binary_url, target_binary_checksum, strategy, and batch size
  2. POST /api/v1/rollouts/{id}/start to begin execution
For multi-component upgrades (e.g., geth + lighthouse):
  1. POST /api/v1/rollout-groups with desired_versions, component_order (entries may include upgrade_id for auto-population), and failure_policy
  2. POST /api/v1/rollout-groups/{id}/start

Monitoring Progress

  • GET /api/v1/rollouts/{id} — overall progress (counters)
  • GET /api/v1/rollouts/{id}/nodes — per-node status with upgrade_phase
  • GET /api/v1/rollout-groups/{id}/status — per-component progress
  • Temporal UI: workflow status, heartbeat details (action_index, action_name, progress_pct)

Pausing and Resuming

  • POST /api/v1/rollouts/{id}/pause — pauses after current batch completes
  • POST /api/v1/rollouts/{id}/resume — continues from paused state
  • Auto-pause triggers: failure threshold exceeded, canary failure

Handling Failures

  • Individual node failures: agent runs per-action compensation automatically
  • Retry a failed node: POST /api/v1/rollouts/{id}/nodes/{nodeId}/retry
  • FAILED_ROLLBACK nodes: manual intervention required (critical incident created)

Full Rollback

  1. POST /api/v1/rollouts/{id}/rollback
  2. System pauses the rollout, then launches reverse upgrades for all succeeded nodes
  3. Uses the same UpgradeNodeWorkflow with swapped versions

Canary Approval for Breaking Migrations

When migration_type: breaking:
  1. Canary batch executes and completes
  2. Rollout blocks — waiting for operator approval
  3. Operator validates canary nodes manually
  4. Operator sends approve_canary signal to resume
  5. Remaining batches proceed

Chain-Agnostic Verification

Adding a new systemd chain (zero code changes):
  1. Create chain config YAML with runtime: { type: systemd } and binary definition
  2. Provision nodes
  3. Create rollout with new version -> agent injects SystemdAdapter
  4. Executor runs same 8 actions. No code changes.
Adding a new Docker chain (zero code changes):
  1. Create chain config with runtime: { type: docker } and image digest
  2. Provision nodes
  3. Create rollout -> agent injects DockerAdapter
  4. Executor runs same 8 actions. No code changes.
What requires code changes:
  • New imperative runtime (e.g., LXC, Podman): implement RuntimeAdapter. Zero changes to actions, executor, workflows, or API.
See also:
  • Workflows — Temporal workflow patterns, compensation conventions
  • Health and Incidents — Health evaluation pipeline, incident management
  • Domain Model — Node state machine, subscription lifecycle
  • Extending — Adding runtime adapters, notification channels