Last verified: 2026-02-13

Overview

The upgrade rollout system performs binary and configuration upgrades on running nodes without reprovisioning. It replaces node binaries and config files in-place while preserving chain data, keys, and node identity. When to use upgrades vs. reprovisioning:
| Scenario | Mechanism |
|---|---|
| New binary version (minor/patch) | Upgrade rollout |
| Config parameter change | Upgrade rollout (with config_files) |
| Breaking state migration | Upgrade rollout (canary + blocking approval) |
| New chain onboarding | Provisioning workflow |
| VM replacement / infra change | Migration workflow |
| Data corruption recovery | Reprovisioning |
Design principles:
  • Chain-agnostic: Same 8 action primitives serve all chains. No chain-specific code in the upgrade engine.
  • Runtime-agnostic: RuntimeAdapter interface abstracts systemd/Docker. Adding a runtime requires only a new adapter.
  • Idempotent: Three-layer idempotency (completed store, checkpoints, system state checks) ensures safe re-entry after crashes.
  • Compensating: Per-action Compensate() in reverse order restores previous state on failure.

Architecture

Manifest Loader

Package: internal/upgrade/manifest/
Rollout creation supports two modes:
  • Auto mode: Provide upgrade_id + chain_profile_id + node_type. The control plane loads the manifest from {ConfigDir()}/upgrades/{chain_profile_id}/{upgrade_id}.yaml and auto-populates binary_name, target_version, target_url, target_checksum, migration_type, recovery_plan, config_changes, and manifest_content_hash.
  • Manual mode: Omit upgrade_id. Operator provides all fields explicitly (existing behavior).
Manifests are YAML files stored in the S3-synced chain config directory. The manifest.Reader interface (Get, List, ContentHash) is implemented by FileReader, which resolves the config directory at call time via chains.Provider.ConfigDir(). No new infrastructure dependency.
# {ConfigDir}/upgrades/{chain_profile_id}/{upgrade_id}.yaml
upgrade_id: v1.5.0-hotfix
chain_profile_id: celestia-mocha
node_type: full
binary_name: celestia-node
target_version: v0.21.0
target_url: https://releases.celestia.org/v0.21.0/celestia-node-linux-amd64
target_checksum: sha256:abc123...
state_compatibility:
  migration_type: compatible
  notes: "State format unchanged"
config_changes:
  - path: /etc/celestia/config.toml
    template: config.toml.tmpl
    overrides:
      max_peers: "100"
manifest_content_hash is computed from the raw YAML bytes (sha256:<hex>), not manually provided. Rollout group component_order entries also support upgrade_id for per-component auto-population.

Agent Three-Layer Architecture

Actions express WHAT happens. Adapters express HOW for each runtime. The executor orchestrates sequencing and compensation. No layer has knowledge of the others’ concerns.

Workflow Hierarchy

Control plane to agent flow: Control plane sends COMMAND_TYPE_UPGRADE (value 9) with an UpgradePayload containing artifact references (URL, checksum, version). No execution instructions travel over the wire. The agent constructs the action sequence locally and injects the correct RuntimeAdapter based on runtime.type from config.

Agent Upgrade Handler

UpgradeAction Interface

File: internal/opsagent/upgrade/action.go
type UpgradeAction interface {
    Name() string
    Execute(ctx context.Context, rt RuntimeAdapter, params ActionParams) error
    IsAlreadyDone(ctx context.Context, rt RuntimeAdapter, params ActionParams) (bool, error)
    Compensate(ctx context.Context, rt RuntimeAdapter, params ActionParams) error
    Timeout() time.Duration
}

RuntimeAdapter Interface

File: internal/opsagent/upgrade/adapter.go
type RuntimeAdapter interface {
    runtime.Runtime  // Start, Stop, Restart, Status, ServiceName

    AcquireArtifact(ctx context.Context, url, checksum string) (ArtifactHandle, error)
    VerifyArtifact(ctx context.Context, handle ArtifactHandle, checksum string) error
    IsArtifactInstalled(ctx context.Context, name, checksum string) (bool, error)
    BackupState(ctx context.Context, backupDir string) error
    InstallArtifact(ctx context.Context, handle ArtifactHandle, target string) error
    WriteConfig(ctx context.Context, path string, content []byte) error
    IsConfigWritten(ctx context.Context, path string, expectedHash string) (bool, error)
    ReloadDaemon(ctx context.Context) error
    RestoreState(ctx context.Context, backupDir string) error
    IsArtifactAcquired(ctx context.Context, url, checksum string) (bool, error)
    IsBackedUp(ctx context.Context, backupDir string) (bool, error)
}

Action Primitives

Eight actions live in internal/opsagent/upgrade/actions/. The same set serves all runtimes:
| Action | Name | Timeout | IsAlreadyDone | Compensate |
|---|---|---|---|---|
| BackupCurrentState | backup_state | 1 min | Backup dir + manifest present | No-op (preserved for investigation) |
| AcquireArtifact | acquire_artifact | 5 min | Temp file/image exists + checksum matches | Clean up acquired artifact |
| VerifyArtifact | verify_artifact | 30 sec | Always false (safe to repeat) | No-op |
| StopNode | stop_node | 60 sec | Service not running | rt.Start() — restart service |
| InstallArtifact | install_artifact | 30 sec | Binary/container at target + checksum matches | rt.RestoreState() — restore from backup |
| WriteConfigs | write_configs | 30 sec | Per-file SHA256 matches expected | Restore configs from backup dir |
| ReloadDaemon | reload_daemon | 10 sec | Always false (safe to repeat) | rt.ReloadDaemon() — safe after config restore |
| StartNode | start_node | 30 sec | Service already running | rt.Stop() |

Executor

File: internal/opsagent/upgrade/executor.go
The executor iterates actions sequentially. For each action:
  1. Check IsAlreadyDone() via adapter (skip if true)
  2. Execute with per-action timeout budget
  3. On failure: call Compensate() on all completed actions in reverse order
  4. Checkpoint completed actions for crash recovery
Total execution budget: ~8.5 min. Temporal activity ScheduleToClose: 10 min. Heartbeat timeout: 30 sec.

Phase Reporting

The executor derives upgrade phase from the current action position:
| Actions | Phase |
|---|---|
| backup_state, acquire_artifact, verify_artifact | PREPARING |
| stop_node | STOPPED |
| install_artifact, write_configs, reload_daemon | MUTATING |
| start_node | STARTING |
| (Temporal-level health validation) | VERIFYING |
Phase transitions are reported via command response metadata and written to rollout_nodes.upgrade_phase.

Three-Layer Idempotency

| Layer | Scope | Mechanism |
|---|---|---|
| 1. Command-level | CompletedUpgradeStore | If upgrade_id already completed, skip entirely |
| 2. Step-level | Checkpoint manager with InputHash + TTL | Track per-action progress; expired in_progress (10 min) falls through |
| 3. State-level | IsAlreadyDone() per action via adapter | Verify actual system state matches expected |

Runtime Adapters

Systemd Adapter

| Method | Behavior |
|---|---|
| AcquireArtifact | HTTP download to {InstallTarget}/.download-{name} (same filesystem for atomic rename) |
| VerifyArtifact | SHA256 file hash comparison |
| IsArtifactInstalled | Binary exists at path + checksum matches |
| BackupState | Copy binary + configs to backup dir, write manifest.json |
| InstallArtifact | os.Rename(downloadPath, installPath) (atomic, same filesystem) |
| WriteConfig | os.WriteFile(path, content, 0644) |
| ReloadDaemon | systemctl daemon-reload |
| RestoreState | Copy files from backup dir to original paths |

Docker Adapter

| Method | Behavior |
|---|---|
| AcquireArtifact | docker pull {image@sha256:digest} |
| VerifyArtifact | Image digest match |
| IsArtifactInstalled | Container with correct image + name exists |
| BackupState | docker inspect -> save config JSON to backup dir |
| InstallArtifact | docker rename {c} {c}.bak + docker create new with cloned config |
| WriteConfig | Write to mounted volume path |
| ReloadDaemon | No-op (return nil) |
| RestoreState | docker rm new + docker rename {c}.bak {c} |
See also: Extending for how to add a new runtime adapter.

Data Model

Tables

rollout_groups — multi-component upgrade coordination:
| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| chain_profile_id | VARCHAR(100) | Target chain |
| status | VARCHAR(20) | Group lifecycle status |
| failure_policy | VARCHAR(20) | partial_ok, rollback_all, manual (required, no default) |
| component_order | JSONB | Ordered array of component definitions |
| desired_versions | JSONB | Target state: {binary_name: version} |
| strategy | VARCHAR(20) | Rollout strategy |

Partial unique index: one active group per chain (WHERE status IN ('pending', 'running', 'paused')).

rollouts — per-component rollout:
| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| group_id | UUID (nullable) | Parent group (null for standalone) |
| component_index | INT | Ordering within group |
| chain_profile_id | VARCHAR(100) | Target chain |
| binary_name | VARCHAR(100) | Binary to upgrade |
| source_version / target_version | VARCHAR(50) | Version transition |
| target_binary_url | TEXT | Resolved download URL |
| target_binary_checksum | TEXT | SHA256 or image digest (required) |
| strategy | VARCHAR(20) | rolling, canary, all_at_once |
| status | VARCHAR(20) | Rollout lifecycle status |
| total_nodes / succeeded_nodes / failed_nodes / pending_nodes | INT | Denormalized progress counters |
| manifest_content_hash | TEXT | SHA256 of manifest at creation time |

Partial unique index: one active rollout per chain.

rollout_nodes — per-node upgrade tracking:
| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| rollout_id | UUID | Parent rollout |
| node_id | UUID | Target node |
| batch_index | INT | Batch assignment |
| status | VARCHAR(20) | Node upgrade status |
| upgrade_phase | VARCHAR(20) | Display state (PREPARING, MUTATING, etc.) |
| previous_version / previous_binary_url / previous_checksum | TEXT | Rollback reference |
| error_message | TEXT | Failure details |
| attempt_count | INT | Retry counter |
Unique constraint: (rollout_id, node_id).

Status Enums

RolloutStatus / RolloutGroupStatus: the group status set mirrors the rollout status set, and RolloutGroupStatus adds partial (some components succeeded, some failed under the partial_ok policy). NodeUpgradeStatus tracks per-node lifecycle; UpgradePhase is the display state on rollout_nodes.

NodeConfigState

Uses the existing node_config_state table. binary_version is written by UpdateNodeBinaryVersion activity after successful upgrade. Initial version written by the provisioning workflow after CONFIGURE.
See also: Domain Model for canonical node and subscription models.

Temporal Workflows

RolloutGroupWorkflow

File: internal/workflows/rollout_group.go
Workflow ID: rollout-group-{group_id}
Search Attributes: GroupID, ChainProfileID
Thin orchestrator for multi-binary chains (e.g., ethereum-holesky: geth + lighthouse). Iterates components in declared order, launching one RolloutWorkflow per component with a health gate between components.
Signals: pause, resume, cancel, rollback, skip
Failure policies:
| Policy | Behavior |
|---|---|
| partial_ok | Failed component stops. Completed components stay upgraded. Group marked partial. |
| rollback_all | Any component failure triggers reverse-order rollback of ALL completed components. |
| manual | Failure pauses group. Operator decides: resume, rollback, or skip. |
Standalone single-component rollouts bypass this workflow entirely.

RolloutWorkflow

File: internal/workflows/rollout.go
Workflow ID: rollout-{rollout_id}
Search Attributes: RolloutID, ChainProfileID, Strategy, GroupID
Signals:
| Signal | Behavior |
|---|---|
| pause | Pause after current batch completes |
| resume | Resume from paused state |
| cancel | Stop launching new batches, let in-flight children complete |
| force | Override failure threshold |
Processing flow:
  1. Activity: ResolveRolloutTargets — filter eligible nodes, assign batch indexes
  2. Per batch: launch child UpgradeNodeWorkflow per node, wait for batch completion
  3. Activity: UpdateRolloutProgress — update counters
  4. Evaluate failure threshold (unless force_continue=true)
  5. Canary: after batch 0, extended observation (2x health_wait), validate canary health
  6. continue-as-new with remaining batches (event history resets per batch)
  7. Final batch: Activity FinalizeRollout (set terminal status, forced DB-Temporal sync)
Cancellation: Stop launching new batches. In-flight child workflows run to natural completion (upgrade or rollback). No forced cancellation of mid-upgrade nodes.

UpgradeNodeWorkflow

File: internal/workflows/upgrade_node.go
Workflow ID: upgrade-{rollout_id}-{node_id}
Search Attributes: NodeID, RolloutID, CorrelationID
Compensation (on failure at steps 3-6):
| Failure Point | Compensation |
|---|---|
| SendUpgrade / WaitForCompletion (agent compensated) | Agent already rolled back via per-action compensation |
| UpdateVersion / HealthValidation (artifact installed) | Agent-side compensation restores from backup |
| Agent reports FAILED_ROLLBACK | Critical incident, set degraded, no further automatic action |

Activity Profiles

| Profile | Timeout | Retries |
|---|---|---|
| ProfileUpgradeDefault | 10 min | 3 |
| ProfileHealthValidation | Configurable duration | 1 |
See also: Workflows for existing Temporal patterns and compensation conventions.

Rollout Strategies

Rolling

  • Upgrade batch_size nodes at a time (sequential batches, parallel within batch)
  • After each batch: check failure threshold
  • If failed / attempted > threshold -> auto-pause
  • Operator decides: resume, retry, cancel, rollback

Canary

  • Batch 0: canary_size random nodes
  • Extended observation: 2x health_wait_duration after canary batch
  • If canary healthy -> proceed with remaining nodes in rolling batches
  • If canary fails -> auto-pause
  • Breaking migrations (migration_type: breaking): canary batch is blocking — rollout cannot proceed until operator sends approve_canary signal

All-at-Once

  • All nodes in single batch (no sequential batching)
  • For hard forks with block height deadlines
  • force_continue=true: complete all nodes regardless of failures
  • SkipValidation=true: skip health wait for deadline pressure

Compensation and Rollback

Per-Action Compensation

On failure at any action, the executor calls Compensate() on all completed actions in reverse order. The ordering guarantees correct restoration:
  1. Configs restored before daemon-reload
  2. Daemon-reload before start
  3. Service started last
This prevents the deadly rollback inversion (old binary + new config). If any Compensate() fails -> FAILED_ROLLBACK terminal state + critical incident. Manual intervention required.

Backup Directory

Each upgrade creates a backup at {data_dir}/.upgrade-backup/{upgrade_id}/:
.upgrade-backup/
  {upgrade_id}/
    manifest.json          # Atomic marker (written last)
    binaries/
      celestia-node        # Previous binary
    configs/
      config.toml          # Previous config files
    container_config.json  # Docker: previous container inspect
Three-way checksum verification: before restoring, verify that the manifest checksum, the backup file checksum, and the original file checksum all match. Mismatch -> abort compensation, FAILED_ROLLBACK + incident.

Rollback Guarantee Tiers

| Tier | Scope | Guarantee | Recovery |
|---|---|---|---|
| Tier 1 | Binary + Config | Full. Previous binary and configs restored via per-action compensation. | Automatic |
| Tier 2 | State-Compatible | Binary + config restored. Old binary can read state written by new binary. | Automatic |
| Tier 3 | State-Incompatible | Binary + config restored, but chain state may be unreadable by old binary. Rollback may cause crash loop. | Reprovisioning |
Tier is declared via state_compatibility.migration_type in the upgrade manifest.
Post-rollback crash loop (Tier 3):
  1. Binary + config restored (Tier 1 always holds)
  2. Node starts -> old binary encounters incompatible state -> crash loop
  3. Health validation detects crash loop
  4. Critical incident created with recovery_plan from manifest
  5. Node set to degraded
  6. Operator-initiated reprovisioning is the recovery path

Full Rollout Rollback

Operator-triggered via API:
  1. Pause rollout (stop new batches)
  2. For each succeeded node: launch UpgradeNodeWorkflow with swapped versions (target = source)
  3. Wait for all rollback workflows
  4. Mark rollout rolled_back
Uses the same UpgradeNodeWorkflow. No separate rollback workflow.

Group Rollback (rollback_all Policy)

When a component fails in a group with rollback_all:
  • Reverse-order rollback of ALL completed components
  • Each component rolled back via RolloutWorkflow with swapped versions
  • Group marked rolled_back

Health Validation

WaitForHealthValidation

Polls for health_wait_duration with relaxed criteria. ALL must pass:
  1. At least one heartbeat received (agent alive)
  2. application_health != critical
  3. sync_status != stalled
  4. Node state not down or failed
  5. No crashloop (fewer than N restarts within health_wait_duration)
Does NOT require sync_status == synced (syncing expected after restart) or application_health == ok (degraded tolerated during stabilization).

Maintenance State Exemption

During upgrade (node in maintenance):
  • Heartbeat evaluator skips maintenance nodes (existing behavior)
  • Policy evaluator skips maintenance nodes
  • No false alarms during upgrade window

MaintenanceCleanup Loop

Periodic check (configurable, default 30 min) for nodes stuck in maintenance state. Catches nodes where the upgrade workflow failed without cleaning up state.
See also: Health and Incidents for the health evaluation pipeline.

Validation Rules

Creation-Time Rules

| # | Rule |
|---|---|
| 1 | checksum must be non-empty |
| 2 | Docker chains: checksum must be sha256: prefixed |
| 3 | Docker chains: resolved URL must contain @sha256: (digest-pinned, no mutable tags) |
| 4 | target_version != source_version |
| 5 | No active rollout for same chain_profile_id (DB unique index) |
| 6 | Manifest content hash computed and stored (auto-computed from YAML in auto mode, provided in manual mode) |
| 7 | Breaking migrations (migration_type: breaking): require recovery_plan, canary strategy, acknowledge_state_risk: true |
| 8 | Group validation: component_index unique within group, order matches component_order |
| 9 | Group creation: failure_policy required (no default), desired_versions non-empty, component_order covers all entries |

State Compatibility Validation

Declared in upgrade manifest (auto-populated from state_compatibility in auto mode, or provided directly in manual mode):
| migration_type | Constraints | Rollback Tier |
|---|---|---|
| none (default) | No additional constraints | Tier 1 |
| compatible | Canary recommended, not required | Tier 2 |
| breaking | Canary required, blocking approval, recovery_plan required, acknowledge_state_risk | Tier 3 |

Config Rendering Pipeline

For upgrades with config changes, the control plane renders config content:
  1. Load base template from hoodcloud-chain-configs/
  2. Build TemplateVars from chain profile + node config (same variables as CONFIGURE)
  3. Apply variable_overrides from upgrade manifest
  4. Render via Go text/template -> full content
  5. Include in UpgradePayload.config_files
Agent receives pre-rendered content. No template rendering on the agent for upgrades.

API Surface

Rollout Endpoints

| Method | Path | Description |
|---|---|---|
| POST | /api/v1/rollouts | Create rollout |
| GET | /api/v1/rollouts | List rollouts |
| GET | /api/v1/rollouts/{id} | Get rollout + progress |
| POST | /api/v1/rollouts/{id}/start | Start pending rollout |
| POST | /api/v1/rollouts/{id}/pause | Pause (Temporal signal) |
| POST | /api/v1/rollouts/{id}/resume | Resume (Temporal signal) |
| POST | /api/v1/rollouts/{id}/cancel | Cancel (Temporal signal) |
| POST | /api/v1/rollouts/{id}/rollback | Full rollback |
| GET | /api/v1/rollouts/{id}/nodes | Per-node status (includes upgrade_phase) |
| POST | /api/v1/rollouts/{id}/nodes/{nodeId}/retry | Retry failed node |

Rollout Group Endpoints

| Method | Path | Description |
|---|---|---|
| POST | /api/v1/rollout-groups | Create group (multi-component) |
| GET | /api/v1/rollout-groups/{id} | Get group + component rollouts |
| POST | /api/v1/rollout-groups/{id}/start | Start group |
| POST | /api/v1/rollout-groups/{id}/cancel | Cancel group |
| GET | /api/v1/rollout-groups/{id}/status | Group status + per-component progress |
Auth: DualAuthMiddleware (JWT + API key). Required scope: admin (backend-granted only — users cannot self-assign this scope). Both JWT and API key users must have admin scope to access rollout endpoints. The CreatedBy field on rollouts records the authenticated user ID for audit trail.

Integration Points

Incident Categories

| Category | Trigger |
|---|---|
| upgrade_failed | Node upgrade AND compensation both fail (FAILED_ROLLBACK) |
| upgrade_rollback_failed | Rollback of a previously succeeded node fails |
| upgrade_stalled | Health validation exceeds duration but node shows some life |
See also: Health and Incidents — Incident Service for existing incident categories.

Notification Events

Rollout-level (not per-node): rollout_started, rollout_completed, rollout_paused, rollout_failed, rollback_initiated, rollback_completed. Reuses the existing dispatcher infrastructure.

NATS Event Subjects

  • Rollout-level: events.rollout.{rolloutID}.started, .progress, .completed
  • Per-node: events.node.{nodeID}.upgrade_started, .upgrade_completed, .upgrade_phase_changed

Prometheus Metrics

| Metric | Type | Labels |
|---|---|---|
| hoodcloud_rollout_total | Counter | chain, strategy, status |
| hoodcloud_rollout_duration_seconds | Histogram | chain, strategy, status |
| hoodcloud_rollout_node_upgrades_total | Counter | chain, status |
| hoodcloud_rollout_node_upgrade_duration_seconds | Histogram | chain, status |
| hoodcloud_rollout_active | Gauge | chain |
| hoodcloud_rollout_progress | Gauge | chain, rollout_id |
| hoodcloud_rollout_node_phase | Gauge | chain, rollout_id, phase |

Operational Guide

Creating a Rollout

Auto mode (recommended): Create an upgrade manifest YAML in {ConfigDir}/upgrades/{chain_profile_id}/, then:
  1. POST /api/v1/rollouts with upgrade_id, chain_profile_id, node_type, source_version, strategy, and batch size
  2. Control plane loads manifest, auto-populates binary details, state compatibility, and config changes
  3. POST /api/v1/rollouts/{id}/start to begin execution
Manual mode: Provide all fields explicitly (no upgrade_id):
  1. POST /api/v1/rollouts with chain_profile_id, node_type, binary_name, source_version, target_version, target_binary_url, target_binary_checksum, strategy, and batch size
  2. POST /api/v1/rollouts/{id}/start to begin execution
For multi-component upgrades (e.g., geth + lighthouse):
  1. POST /api/v1/rollout-groups with desired_versions, component_order (entries may include upgrade_id for auto-population), and failure_policy
  2. POST /api/v1/rollout-groups/{id}/start

Monitoring Progress

  • GET /api/v1/rollouts/{id} — overall progress (counters)
  • GET /api/v1/rollouts/{id}/nodes — per-node status with upgrade_phase
  • GET /api/v1/rollout-groups/{id}/status — per-component progress
  • Temporal UI: workflow status, heartbeat details (action_index, action_name, progress_pct)

Pausing and Resuming

  • POST /api/v1/rollouts/{id}/pause — pauses after current batch completes
  • POST /api/v1/rollouts/{id}/resume — continues from paused state
  • Auto-pause triggers: failure threshold exceeded, canary failure

Handling Failures

  • Individual node failures: agent runs per-action compensation automatically
  • Retry a failed node: POST /api/v1/rollouts/{id}/nodes/{nodeId}/retry
  • FAILED_ROLLBACK nodes: manual intervention required (critical incident created)

Full Rollback

  1. POST /api/v1/rollouts/{id}/rollback
  2. System pauses the rollout, then launches reverse upgrades for all succeeded nodes
  3. Uses the same UpgradeNodeWorkflow with swapped versions

Canary Approval for Breaking Migrations

When migration_type: breaking:
  1. Canary batch executes and completes
  2. Rollout blocks — waiting for operator approval
  3. Operator validates canary nodes manually
  4. Operator sends approve_canary signal to resume
  5. Remaining batches proceed

Chain-Agnostic Verification

Adding a new systemd chain (zero code changes):
  1. Create chain config YAML with runtime: { type: systemd } and binary definition
  2. Provision nodes
  3. Create rollout with new version -> agent injects SystemdAdapter
  4. Executor runs same 8 actions. No code changes.
Adding a new Docker chain (zero code changes):
  1. Create chain config with runtime: { type: docker } and image digest
  2. Provision nodes
  3. Create rollout -> agent injects DockerAdapter
  4. Executor runs same 8 actions. No code changes.
What requires code changes:
  • New imperative runtime (e.g., LXC, Podman): implement RuntimeAdapter. Zero changes to actions, executor, workflows, or API.
See also:
  • Workflows — Temporal workflow patterns, compensation conventions
  • Health and Incidents — Health evaluation pipeline, incident management
  • Domain Model — Node state machine, subscription lifecycle
  • Extending — Adding runtime adapters, notification channels