Last verified: 2026-02-13
Overview
The upgrade rollout system performs binary and configuration upgrades on running nodes without reprovisioning. It replaces node binaries and config files in place while preserving chain data, keys, and node identity.
When to use upgrades vs. reprovisioning:
| Scenario | Mechanism |
|---|---|
| New binary version (minor/patch) | Upgrade rollout |
| Config parameter change | Upgrade rollout (with config_files) |
| Breaking state migration | Upgrade rollout (canary + blocking approval) |
| New chain onboarding | Provisioning workflow |
| VM replacement / infra change | Migration workflow |
| Data corruption recovery | Reprovisioning |
- Chain-agnostic: The same 8 action primitives serve all chains. No chain-specific code in the upgrade engine.
- Runtime-agnostic: The RuntimeAdapter interface abstracts systemd/Docker. Adding a runtime requires only a new adapter.
- Idempotent: Three-layer idempotency (completed store, checkpoints, system state checks) ensures safe re-entry after crashes.
- Compensating: Per-action Compensate() runs in reverse order to restore the previous state on failure.
Architecture
Manifest Loader
Package: internal/upgrade/manifest/
Rollout creation supports two modes:
- Auto mode: Provide upgrade_id + chain_profile_id + node_type. The control plane loads the manifest from {ConfigDir()}/upgrades/{chain_profile_id}/{upgrade_id}.yaml and auto-populates binary_name, target_version, target_url, target_checksum, migration_type, recovery_plan, config_changes, and manifest_content_hash.
- Manual mode: Omit upgrade_id. The operator provides all fields explicitly (existing behavior).
manifest.Reader interface (Get, List, ContentHash) is implemented by FileReader, which resolves the config directory at call time via chains.Provider.ConfigDir(). No new infrastructure dependency.
manifest_content_hash is computed from the raw YAML bytes (sha256:<hex>), not manually provided.
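The hash scheme can be sketched in Go; the helper name contentHash is illustrative (the real code lives in internal/upgrade/manifest/):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// contentHash is a hypothetical helper mirroring the documented scheme:
// SHA-256 over the raw manifest YAML bytes, rendered as "sha256:<hex>".
func contentHash(raw []byte) string {
	sum := sha256.Sum256(raw)
	return fmt.Sprintf("sha256:%x", sum)
}

func main() {
	manifest := []byte("upgrade_id: osmosis-v25\ntarget_version: v25.0.0\n")
	fmt.Println(contentHash(manifest))
}
```

Because the hash covers the raw bytes, any edit to the manifest file — even whitespace — yields a different manifest_content_hash.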
Rollout group component_order entries also support upgrade_id for per-component auto-population.
Agent Three-Layer Architecture
Actions express WHAT happens. Adapters express HOW for each runtime. The executor orchestrates sequencing and compensation. No layer has knowledge of the others' concerns.
Workflow Hierarchy
Control plane to agent flow: The control plane sends COMMAND_TYPE_UPGRADE (value 9) with an UpgradePayload containing artifact references (URL, checksum, version). No execution instructions travel over the wire. The agent constructs the action sequence locally and injects the correct RuntimeAdapter based on runtime.type from config.
Agent Upgrade Handler
UpgradeAction Interface
File: internal/opsagent/upgrade/action.go
RuntimeAdapter Interface
File: internal/opsagent/upgrade/adapter.go
Action Primitives
8 actions in internal/opsagent/upgrade/actions/. The same set serves all runtimes:
| Action | Name | Timeout | IsAlreadyDone | Compensate |
|---|---|---|---|---|
| BackupCurrentState | backup_state | 1 min | Backup dir + manifest present | No-op (preserved for investigation) |
| AcquireArtifact | acquire_artifact | 5 min | Temp file/image exists + checksum matches | Clean up acquired artifact |
| VerifyArtifact | verify_artifact | 30 sec | Always false (safe to repeat) | No-op |
| StopNode | stop_node | 60 sec | Service not running | rt.Start() — restart service |
| InstallArtifact | install_artifact | 30 sec | Binary/container at target + checksum matches | rt.RestoreState() — restore from backup |
| WriteConfigs | write_configs | 30 sec | Per-file SHA256 matches expected | Restore configs from backup dir |
| ReloadDaemon | reload_daemon | 10 sec | Always false (safe to repeat) | rt.ReloadDaemon() — safe after config restore |
| StartNode | start_node | 30 sec | Service already running | rt.Stop() |
Executor
File: internal/opsagent/upgrade/executor.go
The executor iterates actions sequentially. For each action:
- Check IsAlreadyDone() via the adapter (skip if true)
- Execute with a per-action timeout budget
- On failure: call Compensate() on all completed actions in reverse order
- Checkpoint completed actions for crash recovery
ScheduleToClose: 10 min. Heartbeat timeout: 30 sec.
Phase Reporting
The executor derives upgrade phase from the current action position:
| Actions | Phase |
|---|---|
| backup_state, acquire_artifact, verify_artifact | PREPARING |
| stop_node | STOPPED |
| install_artifact, write_configs, reload_daemon | MUTATING |
| start_node | STARTING |
| (Temporal-level health validation) | VERIFYING |
The derived phase is persisted to rollout_nodes.upgrade_phase.
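The table above is a pure lookup from action name to phase; a minimal sketch (the unknown-action fallback to PREPARING is an assumption):

```go
package main

import "fmt"

// phaseFor maps an action name to the display phase, per the phase table.
// Unknown actions default to PREPARING in this sketch (an assumption).
func phaseFor(action string) string {
	switch action {
	case "backup_state", "acquire_artifact", "verify_artifact":
		return "PREPARING"
	case "stop_node":
		return "STOPPED"
	case "install_artifact", "write_configs", "reload_daemon":
		return "MUTATING"
	case "start_node":
		return "STARTING"
	}
	return "PREPARING"
}

func main() {
	fmt.Println(phaseFor("install_artifact")) // → MUTATING
}
```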
Three-Layer Idempotency
| Layer | Scope | Mechanism |
|---|---|---|
| 1. Command-level | CompletedUpgradeStore | If upgrade_id already completed, skip entirely |
| 2. Step-level | Checkpoint manager with InputHash + TTL | Track per-action progress; expired in_progress (10 min) falls through |
| 3. State-level | IsAlreadyDone() per action via adapter | Verify actual system state matches expected |
Runtime Adapters
Systemd Adapter
| Method | Behavior |
|---|---|
| AcquireArtifact | HTTP download to {InstallTarget}/.download-{name} (same filesystem for atomic rename) |
| VerifyArtifact | SHA256 file hash comparison |
| IsArtifactInstalled | Binary exists at path + checksum matches |
| BackupState | Copy binary + configs to backup dir, write manifest.json |
| InstallArtifact | os.Rename(downloadPath, installPath) (atomic, same filesystem) |
| WriteConfig | os.WriteFile(path, content, 0644) |
| ReloadDaemon | systemctl daemon-reload |
| RestoreState | Copy files from backup dir to original paths |
Docker Adapter
| Method | Behavior |
|---|---|
| AcquireArtifact | docker pull {image@sha256:digest} |
| VerifyArtifact | Image digest match |
| IsArtifactInstalled | Container with correct image + name exists |
| BackupState | docker inspect -> save config JSON to backup dir |
| InstallArtifact | docker rename {c} {c}.bak + docker create new with cloned config |
| WriteConfig | Write to mounted volume path |
| ReloadDaemon | No-op (return nil) |
| RestoreState | docker rm new + docker rename {c}.bak {c} |
See also: Extending for how to add a new runtime adapter.
Data Model
Tables
rollout_groups — multi-component upgrade coordination:
| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| chain_profile_id | VARCHAR(100) | Target chain |
| status | VARCHAR(20) | Group lifecycle status |
| failure_policy | VARCHAR(20) | partial_ok, rollback_all, manual (required, no default) |
| component_order | JSONB | Ordered array of component definitions |
| desired_versions | JSONB | Target state: {binary_name: version} |
| strategy | VARCHAR(20) | Rollout strategy |
A partial unique index prevents concurrent active groups for the same chain (WHERE status IN ('pending', 'running', 'paused')).
rollouts — per-component rollout:
| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| group_id | UUID (nullable) | Parent group (null for standalone) |
| component_index | INT | Ordering within group |
| chain_profile_id | VARCHAR(100) | Target chain |
| binary_name | VARCHAR(100) | Binary to upgrade |
| source_version / target_version | VARCHAR(50) | Version transition |
| target_binary_url | TEXT | Resolved download URL |
| target_binary_checksum | TEXT | SHA256 or image digest (required) |
| strategy | VARCHAR(20) | rolling, canary, all_at_once |
| status | VARCHAR(20) | Rollout lifecycle status |
| total_nodes / succeeded_nodes / failed_nodes / pending_nodes | INT | Denormalized progress counters |
| manifest_content_hash | TEXT | SHA256 of manifest at creation time |
rollout_nodes — per-node upgrade tracking:
| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| rollout_id | UUID | Parent rollout |
| node_id | UUID | Target node |
| batch_index | INT | Batch assignment |
| status | VARCHAR(20) | Node upgrade status |
| upgrade_phase | VARCHAR(20) | Display state (PREPARING, MUTATING, etc.) |
| previous_version / previous_binary_url / previous_checksum | TEXT | Rollback reference |
| error_message | TEXT | Failure details |
| attempt_count | INT | Retry counter |
Unique constraint on (rollout_id, node_id).
Status Enums
RolloutStatus / RolloutGroupStatus: RolloutGroupStatus adds partial (some components succeeded, some failed with the partial_ok policy).
NodeUpgradeStatus:
UpgradePhase (display state on rollout_nodes):
NodeConfigState
Uses the existing node_config_state table. binary_version is written by the UpdateNodeBinaryVersion activity after a successful upgrade. The initial version is written by the provisioning workflow after CONFIGURE.
See also: Domain Model for canonical node and subscription models.
Temporal Workflows
RolloutGroupWorkflow
File: internal/workflows/rollout_group.go
Launches one RolloutWorkflow per component, with a health gate between components.
Signals: pause, resume, cancel, rollback, skip
Failure policies:
| Policy | Behavior |
|---|---|
| partial_ok | Failed component stops. Completed components stay upgraded. Group marked partial. |
| rollback_all | Any component failure triggers reverse-order rollback of ALL completed components. |
| manual | Failure pauses group. Operator decides: resume, rollback, or skip. |
RolloutWorkflow
File: internal/workflows/rollout.go
| Signal | Behavior |
|---|---|
| pause | Pause after current batch completes |
| resume | Resume from paused state |
| cancel | Stop launching new batches, let in-flight children complete |
| force | Override failure threshold |
- Activity: ResolveRolloutTargets — filter eligible nodes, assign batch indexes
- Per batch: launch a child UpgradeNodeWorkflow per node, wait for batch completion
- Activity: UpdateRolloutProgress — update counters
- Evaluate failure threshold (unless force_continue=true)
- Canary: after batch 0, extended observation (2x health_wait), validate canary health
- continue-as-new with remaining batches (event history resets per batch)
- Final batch: Activity FinalizeRollout (set terminal status, forced DB-Temporal sync)
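Batch-index assignment in ResolveRolloutTargets reduces to partitioning the eligible node list; a sketch (eligibility filtering and the Temporal plumbing are omitted):

```go
package main

import "fmt"

// assignBatches partitions node IDs into sequential batches of batchSize.
// Illustrative only; the real activity also filters node eligibility and
// persists batch_index on rollout_nodes.
func assignBatches(nodes []string, batchSize int) [][]string {
	var batches [][]string
	for start := 0; start < len(nodes); start += batchSize {
		end := start + batchSize
		if end > len(nodes) {
			end = len(nodes)
		}
		batches = append(batches, nodes[start:end])
	}
	return batches
}

func main() {
	nodes := []string{"n1", "n2", "n3", "n4", "n5"}
	for i, b := range assignBatches(nodes, 2) {
		fmt.Printf("batch %d: %v\n", i, b)
	}
}
```

With batch_size=2 and 5 nodes, this yields three batches, the last holding a single node; batch 0 doubles as the canary batch under the canary strategy.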
UpgradeNodeWorkflow
File: internal/workflows/upgrade_node.go
| Failure Point | Compensation |
|---|---|
| SendUpgrade / WaitForCompletion (agent compensated) | Agent already rolled back via per-action compensation |
| UpdateVersion / HealthValidation (artifact installed) | Agent-side compensation restores from backup |
| Agent reports FAILED_ROLLBACK | Critical incident, set degraded, no further automatic action |
Activity Profiles
| Profile | Timeout | Retries |
|---|---|---|
| ProfileUpgradeDefault | 10 min | 3 |
| ProfileHealthValidation | Configurable duration | 1 |
See also: Workflows for existing Temporal patterns and compensation conventions.
Rollout Strategies
Rolling
- Upgrade batch_size nodes at a time (sequential batches, parallel within batch)
- After each batch: check failure threshold
- If failed / attempted > threshold -> auto-pause
- Operator decides: resume, retry, cancel, rollback
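The auto-pause rule above can be sketched as a single predicate (threshold semantics as documented; the function name is hypothetical):

```go
package main

import "fmt"

// shouldAutoPause implements the documented rule: pause when
// failed / attempted exceeds the configured threshold, unless the
// operator has set force_continue.
func shouldAutoPause(failed, attempted int, threshold float64, forceContinue bool) bool {
	if forceContinue || attempted == 0 {
		return false
	}
	return float64(failed)/float64(attempted) > threshold
}

func main() {
	// 2 failures out of 10 attempts against a 10% threshold.
	fmt.Println(shouldAutoPause(2, 10, 0.1, false))
}
```

Note the strict inequality: a failure ratio exactly at the threshold does not pause in this sketch; whether the real implementation uses > or >= is an assumption worth checking.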
Canary
- Batch 0: canary_size random nodes
- Extended observation: 2x health_wait_duration after canary batch
- If canary healthy -> proceed with remaining nodes in rolling batches
- If canary fails -> auto-pause
- Breaking migrations (migration_type: breaking): canary batch is blocking — rollout cannot proceed until the operator sends the approve_canary signal
All-at-Once
- All nodes in single batch (no sequential batching)
- For hard forks with block height deadlines
- force_continue=true: complete all nodes regardless of failures
- SkipValidation=true: skip health wait for deadline pressure
Compensation and Rollback
Per-Action Compensation
On failure at any action, the executor calls Compensate() on all completed actions in reverse order. The ordering guarantees correct restoration:
- Configs restored before daemon-reload
- Daemon-reload before start
- Service started last
If Compensate() itself fails -> FAILED_ROLLBACK terminal state + critical incident. Manual intervention required.
Backup Directory
Each upgrade creates a backup at {data_dir}/.upgrade-backup/{upgrade_id}/.
If restore from the backup fails -> FAILED_ROLLBACK + incident.
Rollback Guarantee Tiers
| Tier | Scope | Guarantee | Recovery |
|---|---|---|---|
| Tier 1 | Binary + Config | Full. Previous binary and configs restored via per-action compensation. | Automatic |
| Tier 2 | State-Compatible | Binary + config restored. Old binary can read state written by new binary. | Automatic |
| Tier 3 | State-Incompatible | Binary + config restored, but chain state may be unreadable by old binary. Rollback may cause crash loop. | Reprovisioning |
The tier is determined by state_compatibility.migration_type in the upgrade manifest.
Post-rollback crash loop (Tier 3):
- Binary + config restored (Tier 1 always holds)
- Node starts -> old binary encounters incompatible state -> crash loop
- Health validation detects crash loop
- Critical incident created with recovery_plan from the manifest
- Node set to degraded
- Operator-initiated reprovisioning is the recovery path
Full Rollout Rollback
Operator-triggered via API:
- Pause rollout (stop new batches)
- For each succeeded node: launch UpgradeNodeWorkflow with swapped versions (target = source)
- Wait for all rollback workflows
- Mark rollout rolled_back
Rollback reuses the same UpgradeNodeWorkflow. No separate rollback workflow.
Group Rollback (rollback_all Policy)
When a component fails in a group with rollback_all:
- Reverse-order rollback of ALL completed components
- Each component rolled back via RolloutWorkflow with swapped versions
- Group marked rolled_back
Health Validation
WaitForHealthValidation
Polls for health_wait_duration with relaxed criteria. ALL must pass:
- At least one heartbeat received (agent alive)
- application_health != critical
- sync_status != stalled
- Node state not down or failed
- No crash loop (fewer than N restarts within health_wait_duration)
NOT required: sync_status == synced (syncing is expected after restart) or application_health == ok (degraded is tolerated during stabilization).
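The relaxed criteria collapse into one predicate over a health snapshot; the struct fields and value strings here are illustrative, not the real schema:

```go
package main

import "fmt"

// nodeHealth is an illustrative snapshot of the signals the validator polls.
type nodeHealth struct {
	HeartbeatSeen     bool
	ApplicationHealth string // e.g. "ok", "degraded", "critical"
	SyncStatus        string // e.g. "synced", "syncing", "stalled"
	NodeState         string // e.g. "active", "down", "failed"
	RestartsInWindow  int
}

// passesRelaxedCriteria applies the documented rules: agent alive, not
// critical, not stalled, not down/failed, no crash loop. Note that
// "syncing" and "degraded" both pass; full sync and ok health are NOT
// required during post-upgrade stabilization.
func passesRelaxedCriteria(h nodeHealth, maxRestarts int) bool {
	return h.HeartbeatSeen &&
		h.ApplicationHealth != "critical" &&
		h.SyncStatus != "stalled" &&
		h.NodeState != "down" && h.NodeState != "failed" &&
		h.RestartsInWindow < maxRestarts
}

func main() {
	stabilizing := nodeHealth{
		HeartbeatSeen:     true,
		ApplicationHealth: "degraded",
		SyncStatus:        "syncing",
		NodeState:         "active",
		RestartsInWindow:  1,
	}
	fmt.Println(passesRelaxedCriteria(stabilizing, 3))
}
```

A node that is degraded and still syncing passes, which is exactly the point: strict criteria would flag every freshly restarted node as unhealthy.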
Maintenance State Exemption
During upgrade (node in maintenance):
- Heartbeat evaluator skips maintenance nodes (existing behavior)
- Policy evaluator skips maintenance nodes
- No false alarms during upgrade window
MaintenanceCleanup Loop
Periodic check (configurable, default 30 min) for nodes stuck in the maintenance state. Catches nodes where the upgrade workflow failed without cleaning up state.
See also: Health and Incidents for the health evaluation pipeline.
Validation Rules
Creation-Time Rules
| # | Rule |
|---|---|
| 1 | checksum must be non-empty |
| 2 | Docker chains: checksum must be sha256: prefixed |
| 3 | Docker chains: resolved URL must contain @sha256: (digest-pinned, no mutable tags) |
| 4 | target_version != source_version |
| 5 | No active rollout for same chain_profile_id (DB unique index) |
| 6 | Manifest content hash computed and stored (auto-computed from YAML in auto mode, provided in manual mode) |
| 7 | Breaking migrations (migration_type: breaking): require recovery_plan, canary strategy, acknowledge_state_risk: true |
| 8 | Group validation: component_index unique within group, order matches component_order |
| 9 | Group creation: failure_policy required (no default), desired_versions non-empty, component_order covers all entries |
State Compatibility Validation
Declared in the upgrade manifest (auto-populated from state_compatibility in auto mode, or provided directly in manual mode):
| migration_type | Constraints | Rollback Tier |
|---|---|---|
| none (default) | No additional constraints | Tier 1 |
| compatible | Canary recommended, not required | Tier 2 |
| breaking | Canary required, blocking approval, recovery_plan required, acknowledge_state_risk | Tier 3 |
Config Rendering Pipeline
For upgrades with config changes, the control plane renders config content:
- Load base template from hoodcloud-chain-configs/
- Build TemplateVars from chain profile + node config (same variables as CONFIGURE)
- Apply variable_overrides from the upgrade manifest
- Render via Go text/template -> full content
- Include in UpgradePayload.config_files
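The override-then-render step can be sketched with Go's text/template; the variable names and template content are made up for illustration:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderConfig applies the documented pipeline: base variables from the
// chain profile/node config, with manifest variable_overrides layered on
// top, then a text/template render of the base config template.
func renderConfig(base string, vars, overrides map[string]string) (string, error) {
	merged := map[string]string{}
	for k, v := range vars {
		merged[k] = v
	}
	for k, v := range overrides { // manifest overrides win
		merged[k] = v
	}
	tmpl, err := template.New("config").Parse(base)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, merged); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	base := "pruning = \"{{.Pruning}}\"\nmoniker = \"{{.Moniker}}\"\n"
	vars := map[string]string{"Pruning": "default", "Moniker": "node-1"}
	overrides := map[string]string{"Pruning": "custom"}
	out, err := renderConfig(base, vars, overrides)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Because rendering happens on the control plane, the agent receives fully rendered file content in UpgradePayload.config_files and never needs template logic.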
API Surface
Rollout Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /api/v1/rollouts | Create rollout |
| GET | /api/v1/rollouts | List rollouts |
| GET | /api/v1/rollouts/{id} | Get rollout + progress |
| POST | /api/v1/rollouts/{id}/start | Start pending rollout |
| POST | /api/v1/rollouts/{id}/pause | Pause (Temporal signal) |
| POST | /api/v1/rollouts/{id}/resume | Resume (Temporal signal) |
| POST | /api/v1/rollouts/{id}/cancel | Cancel (Temporal signal) |
| POST | /api/v1/rollouts/{id}/rollback | Full rollback |
| GET | /api/v1/rollouts/{id}/nodes | Per-node status (includes upgrade_phase) |
| POST | /api/v1/rollouts/{id}/nodes/{nodeId}/retry | Retry failed node |
Rollout Group Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /api/v1/rollout-groups | Create group (multi-component) |
| GET | /api/v1/rollout-groups/{id} | Get group + component rollouts |
| POST | /api/v1/rollout-groups/{id}/start | Start group |
| POST | /api/v1/rollout-groups/{id}/cancel | Cancel group |
| GET | /api/v1/rollout-groups/{id}/status | Group status + per-component progress |
All rollout endpoints are protected by DualAuthMiddleware (JWT + API key). Required scope: admin (backend-granted only — users cannot self-assign this scope). Both JWT and API key users must have the admin scope to access rollout endpoints. The CreatedBy field on rollouts records the authenticated user ID for the audit trail.
Integration Points
Incident Categories
| Category | Trigger |
|---|---|
| upgrade_failed | Node upgrade AND compensation both fail (FAILED_ROLLBACK) |
| upgrade_rollback_failed | Rollback of a previously succeeded node fails |
| upgrade_stalled | Health validation exceeds duration but node shows some life |
See also: Health and Incidents — Incident Service for existing incident categories.
Notification Events
Rollout-level (not per-node): rollout_started, rollout_completed, rollout_paused, rollout_failed, rollback_initiated, rollback_completed
Reuses existing dispatcher infrastructure.
NATS Event Subjects
- Rollout-level: events.rollout.{rolloutID}.started, .progress, .completed
- Per-node: events.node.{nodeID}.upgrade_started, .upgrade_completed, .upgrade_phase_changed
Prometheus Metrics
| Metric | Type | Labels |
|---|---|---|
| hoodcloud_rollout_total | Counter | chain, strategy, status |
| hoodcloud_rollout_duration_seconds | Histogram | chain, strategy, status |
| hoodcloud_rollout_node_upgrades_total | Counter | chain, status |
| hoodcloud_rollout_node_upgrade_duration_seconds | Histogram | chain, status |
| hoodcloud_rollout_active | Gauge | chain |
| hoodcloud_rollout_progress | Gauge | chain, rollout_id |
| hoodcloud_rollout_node_phase | Gauge | chain, rollout_id, phase |
Operational Guide
Creating a Rollout
Auto mode (recommended): Create an upgrade manifest YAML in {ConfigDir}/upgrades/{chain_profile_id}/, then:
- POST /api/v1/rollouts with upgrade_id, chain_profile_id, node_type, source_version, strategy, and batch size
- Control plane loads the manifest and auto-populates binary details, state compatibility, and config changes
- POST /api/v1/rollouts/{id}/start to begin execution
Manual mode (omit upgrade_id):
- POST /api/v1/rollouts with chain_profile_id, node_type, binary_name, source_version, target_version, target_binary_url, target_binary_checksum, strategy, and batch size
- POST /api/v1/rollouts/{id}/start to begin execution
Multi-component group:
- POST /api/v1/rollout-groups with desired_versions, component_order (entries may include upgrade_id for auto-population), and failure_policy
- POST /api/v1/rollout-groups/{id}/start
Monitoring Progress
- GET /api/v1/rollouts/{id} — overall progress (counters)
- GET /api/v1/rollouts/{id}/nodes — per-node status with upgrade_phase
- GET /api/v1/rollout-groups/{id}/status — per-component progress
- Temporal UI: workflow status, heartbeat details (action_index, action_name, progress_pct)
Pausing and Resuming
- POST /api/v1/rollouts/{id}/pause — pauses after current batch completes
- POST /api/v1/rollouts/{id}/resume — continues from paused state
- Auto-pause triggers: failure threshold exceeded, canary failure
Handling Failures
- Individual node failures: agent runs per-action compensation automatically
- Retry a failed node: POST /api/v1/rollouts/{id}/nodes/{nodeId}/retry
- FAILED_ROLLBACK nodes: manual intervention required (critical incident created)
Full Rollback
- POST /api/v1/rollouts/{id}/rollback
- System pauses the rollout, then launches reverse upgrades for all succeeded nodes
- Uses the same UpgradeNodeWorkflow with swapped versions
Canary Approval for Breaking Migrations
When migration_type: breaking:
- Canary batch executes and completes
- Rollout blocks — waiting for operator approval
- Operator validates canary nodes manually
- Operator sends the approve_canary signal to resume
- Remaining batches proceed
Chain-Agnostic Verification
Adding a new systemd chain (zero code changes):
- Create chain config YAML with runtime: { type: systemd } and a binary definition
- Provision nodes
- Create rollout with new version -> agent injects SystemdAdapter
- Executor runs the same 8 actions. No code changes.
Adding a new Docker chain (zero code changes):
- Create chain config with runtime: { type: docker } and an image digest
- Provision nodes
- Create rollout -> agent injects DockerAdapter
- Executor runs the same 8 actions. No code changes.
New imperative runtime (e.g., LXC, Podman): implement RuntimeAdapter. Zero changes to actions, executor, workflows, or API.
Related Documents
- Workflows — Temporal workflow patterns, compensation conventions
- Health and Incidents — Health evaluation pipeline, incident management
- Domain Model — Node state machine, subscription lifecycle
- Extending — Adding runtime adapters, notification channels