Pipeline Overview
Health monitoring is a single pipeline: detection -> state transition -> incident creation -> notification -> cleanup. State transitions also feed the rolling uptime system. All components run inside the Health Evaluator service (cmd/health-evaluator/main.go).
Leader Election
The health evaluator uses PostgreSQL advisory lock-based leader election to ensure safe multi-instance deployment. One active leader runs the full pipeline; N-1 standby instances run only multi-instance-safe subsystems. Implementation: internal/health/leader.go (generic, reusable). Bootstrap via internal/app/bootstrap/leader_election.go.
Advisory lock connection: Uses a dedicated, long-lived pgx.Conn (NOT from the connection pool). The lock is held for the lifetime of leader tenure. If a pooled connection were used, the lock would release silently when the connection returns to the pool.
| Subsystem | Leader-gated? | Why |
|---|---|---|
| Heartbeat evaluator | YES | Duplicate evaluation → state races |
| Subscription cleaner | YES | Duplicate termination workflows |
| Terraform cleaner | YES | Duplicate cleanup |
| Maintenance cleaner | YES | Duplicate recovery |
| Backup cleaner | YES | Duplicate cleanup |
| Outbox worker | NO | FOR UPDATE SKIP LOCKED — already multi-instance safe |
| Incident dispatcher | YES | In-memory rate limiter per instance → 2x notifications |
| Rate limiter eviction | YES | Paired with dispatcher |
| Uptime worker | YES | Duplicate writes |
| Metrics ingester | NO | Idempotent writes to metrics DB |
| Policy evaluator | YES | Evaluates same data as evaluator |
Observability: the gauge health_evaluator_is_leader{instance="X"} reports 1 on the leader and 0 on standbys.
NodeHealthMachine
File: internal/health/machine.go
Single gateway for all node state transitions. Enforces the transition table, uses optimistic locking (version column on node_health_state AND state_version column on nodes), and emits events via the health_event_outbox table.
Defense-in-depth: The nodes table has a state_version column for optimistic locking on state transitions. This complements leader election by catching races during the failover window (15-30s) when both old and new leaders may briefly overlap. On conflict (0 rows affected), the node is skipped and re-evaluated on the next cycle.
Methods:
| Method | Caller | Purpose |
|---|---|---|
| Transition() | Temporal activities, NodeService | Single-node transition with validation |
| ForceTransition() | Compensation (workflow cleanup) | Bypasses validation table |
| ApplyDecisions() | Policy evaluator | Batch transitions from CEL verdicts |
| ApplyHeartbeatDecisions() | Heartbeat evaluator | Batch transitions from heartbeat decisions |
| UpdateApplicationHealth() | Policy evaluator | Writes application_health dimension |
| UpdateSyncStatus() | NATS event handler | Writes sync_status dimension |
Events: state transitions emit HealthTransitionEvent (node_id, subscription_id, chain_profile_id, previous/new state, trigger, metadata). Dimension changes emit HealthDimensionEvent (dimension name, old/new value).
Optimistic locking: Conflicting writes return ErrConcurrentModification — retried on the next evaluation cycle. BatchResult tracks applied transitions vs. conflicts.
See also: Domain Model — Node State Machine for the canonical transition table.
Health Event Outbox
File: internal/health/outbox.go
The OutboxWorker polls the health_event_outbox table and dispatches events to registered handlers.
Processing flow:
- Atomic claim: UPDATE SET status='processing' WHERE id IN (SELECT ... FOR UPDATE SKIP LOCKED)
- Dispatch to composite handlers: CompositeTransitionHandler -> incident service, migration handler, state log handler; CompositeDimensionHandler -> incident service
- Mark processed or failed
- Cleanup: remove old processed entries, recover stale processing entries from crashed workers
Two Evaluation Systems
| System | Responsibility | Writes Via | Triggers Migration |
|---|---|---|---|
| Heartbeat Evaluator | Infrastructure liveness (agent reachable?) | ApplyHeartbeatDecisions() | YES (after grace period) |
| Policy Evaluator | Application health (sync lag, disk, etc.) | UpdateApplicationHealth() | NO |
Heartbeat Evaluator
Files: internal/health/heartbeat.go (pure function), internal/health/batch_evaluator.go (batch orchestration)
The batch evaluator is the active implementation (it replaces the sequential evaluator) and uses O(1) DB queries per cycle:
- ListSnapshots() — single 3-way JOIN: nodes + agent_registrations + node_health_state
- EvaluateHeartbeat() per node — pure decision function, no I/O
- ApplyHeartbeatDecisions() — batch state transitions with optimistic locking
- BatchUpdateMutations() — persist failure counters to node_health_state
- Trigger migrations for eligible nodes (deterministic workflow IDs)
Deterministic workflow IDs: migration workflow IDs use migrate-node-{nodeID}-{healthStateVersion} instead of timestamps, ensuring idempotency during leader failover. Termination workflow IDs use terminate-expired-{nodeID}-{subscriptionID}. This prevents duplicate Temporal workflows when two instances briefly overlap.
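A minimal sketch of the deterministic ID scheme (helper names are hypothetical; only the ID formats come from the text above). Because the IDs are derived from stable inputs, two overlapping leaders compute the same ID and Temporal's workflow-ID uniqueness rejects the duplicate start:

```go
package main

import "fmt"

// migrationWorkflowID derives a deterministic Temporal workflow ID from
// the node and its health-state version, so retries and leader failover
// produce the same ID rather than a fresh timestamped one.
func migrationWorkflowID(nodeID string, healthStateVersion int64) string {
	return fmt.Sprintf("migrate-node-%s-%d", nodeID, healthStateVersion)
}

// terminationWorkflowID is deterministic per (node, subscription) pair.
func terminationWorkflowID(nodeID, subscriptionID string) string {
	return fmt.Sprintf("terminate-expired-%s-%s", nodeID, subscriptionID)
}

func main() {
	fmt.Println(migrationWorkflowID("node-42", 17))        // migrate-node-node-42-17
	fmt.Println(terminationWorkflowID("node-42", "sub-9")) // terminate-expired-node-42-sub-9
}
```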
Pure evaluation: EvaluateHeartbeat(snapshot, alive, config, now) returns HeartbeatDecision:
- Transition — state change (if any)
- HealthMutation — bookkeeping (consecutive_failures, down_since, last_heartbeat_check)
- Migration — whether to trigger migration
Decision rules:
- No heartbeat for HeartbeatTimeout (60s) -> increment consecutive_failures
- consecutive_failures >= ConsecutiveFailuresForDown (3) -> transition to DOWN
- DOWN for MigrationGracePeriod (60s) -> trigger migration (if cooldown allows)
- Heartbeat received while DOWN -> recover to HEALTHY/SYNCING, reset counter
Maintenance skip: nodes in maintenance state (set by UpgradeNodeWorkflow) are skipped by both the heartbeat evaluator and the policy evaluator. This prevents false alarms and migration triggers during the upgrade window. After the upgrade completes, the workflow transitions the node to syncing, resuming normal health evaluation.
Scaling: ~5,000 nodes on a single leader instance. Beyond that, partitioning by node ID range would be needed — not a Phase 1 concern. Standby instances provide automatic failover via leader election.
Policy Evaluator
Package: internal/observation/evaluator/
Evaluates CEL policies against Victoria Metrics data. Writes application_health via NodeHealthMachine.UpdateApplicationHealth().
Flow:
- Query Victoria Metrics for active nodes (circuit breaker protected)
- Evaluate compiled CEL programs per chain profile
- First-match-wins verdict ordering: critical > degraded > ok > unknown
- Write verdict to health machine (emits HealthDimensionEvent on change)
Agent-reported sync status (via the NATS event handler) is written through NodeHealthMachine.UpdateSyncStatus(). The state field is written exclusively by the heartbeat evaluator — the agent does not affect it.
See also: Workflows — Metrics Flow for the full observation pipeline.
Incident Service
File: internal/incident/service.go
Creates, updates, escalates, and resolves incidents from health events.
Event Categorization
State transitions:

| Transition | Category | Severity |
|---|---|---|
| Any -> down | node_down | critical |
| provisioning -> failed | provision_failed | critical |
| maintenance -> failed | migration_failed | critical |
| Upgrade + compensation both fail | upgrade_failed | critical |
| Rollback of succeeded node fails | upgrade_rollback_failed | critical |
| Health validation exceeds duration | upgrade_stalled | warning |
Dimension changes:

| Dimension | Value | Category | Severity |
|---|---|---|---|
| application_health | critical | app_critical | critical |
| application_health | degraded | app_degraded | warning |
| sync_status | stalled | sync_stalled | warning |
Incident Lifecycle
- Create/Upsert: dedup by (node_id, category) partial unique index. New occurrence -> create. Existing open incident -> increment occurrence_count, update last_seen_at, merge metadata (JSONB concatenation).
- Escalation: warning incidents escalate to critical after:
  - app_degraded: occurrence_count >= threshold OR duration > escalation_duration
  - sync_stalled: duration > escalation_duration
- Flapping detection: if occurrence_count >= threshold within the time window, the incident is marked is_flapping (persisted to a dedicated column, survives restarts).
- Auto-resolution: when the node recovers (DOWN -> HEALTHY/SYNCING, app_health -> OK, sync_status -> syncing/synced):
  - Increment the resolution_debounce counter (per-incident, scoped to node_id + category)
  - Normal incidents resolve after the debounce threshold
  - Flapping incidents require a doubled threshold
  - Resolution resets the debounce counter
- User resolution: resolveUserID() fetches the owner from the subscription (nullable on failure — incidents are still created without user_id).
Recovery Signals
| Previous State | Recovery Signal | Action |
|---|---|---|
| DOWN | -> HEALTHY or SYNCING | Auto-resolve node_down |
| app_health critical/degraded | -> OK | Auto-resolve app_critical/app_degraded |
| sync_status stalled | -> syncing or synced | Auto-resolve sync_stalled |
Notification Pipeline
Package: internal/incident/notifier/
Dispatcher
File: internal/incident/notifier/dispatcher.go
Async pipeline: enqueue -> collect batch -> dispatch.
- Enqueue: fire-and-forget to a buffered channel (overflow drops with metric)
- Batch collection: wait for the batch window (50ms default) or buffer full
- Correlation: group notifications by ChainProfileID. If count >= CorrelationMinNodes, send a single correlated summary instead of individual notifications
- Mass failure guard: if batch >= MassFailureThreshold, send a single summary
- Rate limiting: apply sliding window limits before sending
- Send: fan out to all registered channels (Slack, Telegram, Email, Webhook)
- Retry: failed sends -> notification outbox with exponential backoff
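The correlation and mass-failure steps can be sketched as a pure batching decision (types, return shape, and the string summaries are illustrative):

```go
package main

import "fmt"

type Notification struct {
	NodeID         string
	ChainProfileID string
}

// planBatch groups a collected batch by chain profile and decides,
// per group, whether to emit one correlated summary or pass the
// notifications through individually. A batch-wide mass failure
// collapses everything into a single summary first.
func planBatch(batch []Notification, correlationMinNodes, massFailureThreshold int) (summaries []string, individual []Notification) {
	if len(batch) >= massFailureThreshold {
		return []string{fmt.Sprintf("mass failure: %d nodes affected", len(batch))}, nil
	}
	byChain := map[string][]Notification{}
	for _, n := range batch {
		byChain[n.ChainProfileID] = append(byChain[n.ChainProfileID], n)
	}
	for chain, group := range byChain {
		if len(group) >= correlationMinNodes {
			summaries = append(summaries, fmt.Sprintf("chain %s: %d nodes affected", chain, len(group)))
		} else {
			individual = append(individual, group...)
		}
	}
	return summaries, individual
}

func main() {
	batch := []Notification{{"n1", "eth"}, {"n2", "eth"}, {"n3", "eth"}, {"n4", "sol"}}
	s, ind := planBatch(batch, 3, 50)
	fmt.Println(len(s), len(ind)) // 1 1 (eth correlated, sol individual)
}
```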
Rate Limiter
File: internal/incident/notifier/ratelimiter.go
Sliding window implementation with three tiers:
| Tier | Scope | Default Window |
|---|---|---|
| Per-node | Limits notifications for a single node | 10 min |
| Per-user | Limits notifications for a single user | 10 min |
| Global | System-wide cap | 10 min |
Notification Channels
All implement contracts.Notifier:
| Channel | File | Transport |
|---|---|---|
| Slack | slack.go | Incoming webhook, Block Kit formatting |
| Telegram | telegram.go | Bot API, HTML parsing |
| Email | email.go | Generic HTTP API (SendGrid/Mailgun/Postmark compatible) |
| Webhook | webhook.go | JSON POST with X-Webhook-Secret header |
Formatting (templates.go): FormatTitle() renders severity emoji + event type + title. FormatBody() includes category, severity, status, chain, node, occurrence count, first_seen. Correlation and mass-failure summaries group incidents by chain.
Notification types: created, escalated, resolved, auto_resolved, flapping, correlated
Rollout notification events (rollout-level, not per-node):
| Event | Trigger |
|---|---|
| rollout_started | Rollout workflow begins execution |
| rollout_completed | All nodes in rollout upgraded successfully |
| rollout_paused | Rollout auto-paused (failure threshold) or manually paused |
| rollout_failed | Rollout failed (unrecoverable) |
| rollback_initiated | Full rollback triggered for a rollout |
| rollback_completed | Full rollback finished |
See also: Extending — Adding a Notification Channel for how to add new channels.
Notification Outbox
File: internal/incident/notifier/outbox.go
Persistent retry for failed sends:
| Setting | Default |
|---|---|
| Retry schedule | 30s, 1m, 5m, 15m, 1h |
| Max attempts | 5 |
| Poll interval | 60s |
| Batch size | 100 |
| Cleanup age | 7 days |
Status lifecycle: pending -> retrying -> sent/failed. The worker polls, claims entries, and tracks next_retry_at for backoff.
Metrics
| Metric | Type | Description |
|---|---|---|
| incident_notifications_dropped_total | Counter | Buffer overflow drops |
| incident_notifications_dispatched_total | Counter | Success/error by channel |
| incident_notifications_exhausted_total | Counter | Retry exhaustion |
Rolling Uptime
Package: internal/uptime/
Computes rolling uptime percentages from node state transitions using an event-driven architecture with periodic materialization.
Architecture
StateLogHandler
File: internal/uptime/state_log_handler.go
Implements contracts.HealthTransitionHandler. Registered with the CompositeTransitionHandler alongside the incident service and migration handler. On each transition:
- Closes the previous state entry (exited_at = event.Timestamp)
- Inserts a new state entry (exited_at = NULL, representing the current state)
Idempotency: a UNIQUE constraint on event_id means duplicate outbox deliveries are safely rejected.
UptimeWorker
File: internal/uptime/worker.go
Background worker running inside the Health Evaluator service. Materializes hourly uptime buckets from node_state_log every 5 minutes.
Per-node processing:
- Find the last complete bucket (or node.created_at if none exist)
- Compute each hour from that point to the current hour
- For each hour, classify state intervals as uptime, downtime, or unknown
- Upsert the bucket via INSERT ... ON CONFLICT DO UPDATE (idempotent)
- Mark past hours as is_complete = true; the current hour remains is_complete = false (recomputed each tick)
- Purge buckets older than 90 days
| Category | States | Rationale |
|---|---|---|
| Uptime | healthy, syncing, degraded, maintenance | Node is operational |
| Uptime (grace) | provisioning | First 10 min after creation |
| Downtime | down, failed | Node is not operational |
| Excluded | terminating, terminated | Not uptime-relevant |
Invariant: for every bucket, uptime_seconds + downtime_seconds + unknown_seconds = 3600.
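The classification table and the 3600-second invariant can be sketched as follows (a simplification: the provisioning grace period is omitted, and terminating/terminated intervals are assumed to be filtered out upstream):

```go
package main

import "fmt"

// classify maps a node state to a bucket column, following the
// categorization table above (grace-period handling omitted).
func classify(state string) string {
	switch state {
	case "healthy", "syncing", "degraded", "maintenance":
		return "uptime"
	case "down", "failed":
		return "downtime"
	case "terminating", "terminated":
		return "excluded"
	default:
		return "unknown"
	}
}

// bucket accumulates one hour of (state -> seconds) intervals.
// Time not covered by any interval counts as unknown, preserving
// up + down + unknown == 3600.
func bucket(intervals map[string]int) (up, down, unknown int) {
	covered := 0
	for state, secs := range intervals {
		switch classify(state) {
		case "uptime":
			up += secs
			covered += secs
		case "downtime":
			down += secs
			covered += secs
		case "unknown":
			unknown += secs
			covered += secs
		}
	}
	unknown += 3600 - covered // gaps in the state log count as unknown
	return
}

func main() {
	up, down, unk := bucket(map[string]int{"healthy": 3000, "down": 300})
	fmt.Println(up, down, unk, up+down+unk == 3600) // 3000 300 300 true
}
```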
Configuration: internal/defaults/defaults.go — UptimeWorkerInterval (5min), UptimeWorkerBatchSize (500), UptimeGracePeriod (10min), UptimeRetentionDays (90d).
UptimeService
File: internal/service/uptime.go
Computes rolling uptime from materialized hourly buckets. Two query modes:
- Summary: SUM(uptime_seconds)/SUM(total_seconds) over the requested window
- History: group buckets by granularity (hourly or daily) for chart data
Window boundaries are clamped to node.created_at (no credit for time before the node existed) and terminated_at (no penalty for time after termination).
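Summary mode is a straight ratio over the window's buckets. A sketch under one assumption: total_seconds is taken here as uptime + downtime + unknown, i.e. unknown time counts in the denominator — the real definition of total_seconds may differ:

```go
package main

import "fmt"

type hourBucket struct {
	uptimeSeconds   int
	downtimeSeconds int
	unknownSeconds  int
}

// rollingUptime computes SUM(uptime)/SUM(total) over materialized
// buckets, returned as a percentage. An empty window yields 0.
func rollingUptime(buckets []hourBucket) float64 {
	var up, total int
	for _, b := range buckets {
		up += b.uptimeSeconds
		total += b.uptimeSeconds + b.downtimeSeconds + b.unknownSeconds
	}
	if total == 0 {
		return 0
	}
	return 100 * float64(up) / float64(total)
}

func main() {
	buckets := []hourBucket{{3600, 0, 0}, {1800, 1800, 0}}
	fmt.Printf("%.1f%%\n", rollingUptime(buckets)) // 75.0%
}
```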
API Endpoints
File: internal/api/handler_uptime.go
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/nodes/{nodeID}/uptime?window={window} | Rolling uptime summary |
| GET | /api/v1/nodes/{nodeID}/uptime/history?window={window}&granularity={granularity} | Per-period breakdown for charts |

- window: 24h, 7d, 30d (default: 30d)
- granularity: hourly, daily (default: hourly for 24h, daily otherwise)
Both endpoints require the nodes:read scope. Ownership is verified via subscription lookup.
Data Tables
| Table | Purpose |
|---|---|
| node_state_log | Append-only log of state transitions with entered_at/exited_at intervals |
| node_uptime_hourly | Materialized hourly buckets (uptime_seconds, downtime_seconds, unknown_seconds) |
See also: Domain Model — Database Schema for table definitions and indexes.
Cleanup Services
Cleanup Matrix
| Scenario | Handler | Trigger |
|---|---|---|
| Provisioning fails | ProvisionNodeWorkflow.compensate() | Step failure |
| Node terminates | TerminateNodeWorkflow | API call |
| Pending payment expires | SubscriptionCleaner | Periodic timer |
| Subscription expires | SubscriptionCleaner (2-phase) | Periodic timer |
| Orphaned terraform state | TerraformCleaner | Periodic timer |
| Orphaned key backups | BackupCleaner | Termination + periodic |
| Expired auth tokens | TokenCleaner | Periodic timer |
| Stuck maintenance nodes | MaintenanceCleanup | Periodic timer (30 min timeout) |
| Migration fails | MigrateNodeWorkflow.compensateMigration() | Step failure |
SubscriptionCleaner
File: internal/health/subscription_cleanup.go
Three-stage periodic process:
| Stage | Finds | Action |
|---|---|---|
| handlePendingPaymentExpired | pending_payment older than TTL (30m) | Delete subscription |
| handleExpiredSubscriptions | active where expires_at < now | Transition to expiring, set 24h grace period |
| handleGracePeriodExpiredSubscriptions | expiring where grace_period_expires_at < now | Terminate nodes, delete backups, transition to terminated |
BackupCleaner
File: internal/health/backup_cleanup.go
Removes encrypted key backups from S3:
- Called by SubscriptionCleaner on node termination
- Periodic CleanupOrphanedBackups() for already-terminated nodes
- Returns nil if S3 is not configured; non-blocking on failures
TerraformCleaner
File: internal/health/terraform_cleanup.go
Removes orphaned Terraform state directories:
- Builds set of active host IDs (non-terminated + recently-terminated nodes)
- Only deletes state for hosts NOT in active set
- Age threshold prevents deletion of recently-stopped hosts
Auth CleanupJob
File: internal/auth/cleanup.go
Removes expired nonces and revoked/expired refresh tokens (default: 1h interval).
MaintenanceCleanup
Periodic loop that detects nodes stuck in maintenance state beyond a configurable timeout (default: 30 minutes). Nodes can become stuck if an upgrade workflow crashes or Temporal loses track of the child workflow.
Behavior: Queries nodes in maintenance state where updated_at exceeds the timeout. Transitions stuck nodes back to their previous healthy state and creates a warning incident.
Configuration: MAINTENANCE_CLEANUP_INTERVAL (check frequency), MAINTENANCE_CLEANUP_TIMEOUT (how long in maintenance before considered stuck).
Workflow Compensation Patterns
| Workflow | Compensation | Cleans Up |
|---|---|---|
| Provision | compensate() | Terraform destroy + key deletion -> state failed |
| Terminate | None (is already cleanup) | Best-effort, partial failures in PartialFailures[] |
| Migrate | compensateMigration() | Destroy NEW infra (if exists) -> state degraded |
Resource Lifecycle
| Resource | Cleaner | Timing |
|---|---|---|
| VM/Infrastructure | DestroyNode activity | Immediate on cleanup |
| Terraform state | TerraformCleaner | ~1h after termination |
| Encryption keys | DeleteNodeKeys activity | Immediate |
| Key backups (S3) | BackupCleaner | Same as termination |
| Pending subscriptions | SubscriptionCleaner | 30m after creation |
| Auth tokens | TokenCleaner | Configured interval |
Post-Upgrade Health Validation
The WaitForHealthValidation activity (called by UpgradeNodeWorkflow) polls node state after an upgrade to confirm the node is healthy. It uses relaxed criteria to avoid false failures during stabilization.
Success criteria (ALL must pass):
| # | Criterion | Rationale |
|---|---|---|
| 1 | At least one heartbeat received | Agent alive after upgrade |
| 2 | application_health != critical | App not critically broken |
| 3 | sync_status != stalled | Node making progress |
| 4 | Node state not down or failed | Infrastructure healthy |
| 5 | No crashloop (< N restarts within health_wait_duration) | Detects state incompatibility after upgrade |
Deliberately not required: sync_status == synced (syncing is expected after a restart) and application_health == ok (degraded is tolerated during stabilization).
Crashloop detection: Counts restarts within the health wait window. If the threshold is exceeded, the upgrade is marked failed and compensation triggers. This is critical for detecting state-incompatible upgrades (Tier 3 rollback scenarios).
Related Documents
- Overview — Health evaluator service description
- Domain Model — Node state machine, incident model
- Workflows — Temporal workflow compensation patterns
- Extending — Adding health event handlers, notification channels