Pipeline Overview
Health monitoring is a single pipeline: detection -> state transition -> incident creation -> notification -> cleanup. State transitions also feed the rolling uptime system. All components run inside the Health Evaluator service (cmd/health-evaluator/main.go).
Leader Election
The health evaluator uses PostgreSQL advisory lock-based leader election to ensure safe multi-instance deployment. One active leader runs the full pipeline; N-1 standby instances run only multi-instance-safe subsystems. Implementation: internal/health/leader.go (generic, reusable). Bootstrap via internal/app/bootstrap/leader_election.go.
Advisory lock connection: Uses a dedicated, long-lived pgx.Conn (NOT from the connection pool). The lock is held for the lifetime of leader tenure. If a pooled connection were used, the lock would release silently when the connection returns to the pool.
| Subsystem | Leader-gated? | Why |
|---|---|---|
| Heartbeat evaluator | YES | Duplicate evaluation → state races |
| Subscription cleaner | YES | Duplicate termination workflows |
| Terraform cleaner | YES | Duplicate cleanup |
| Maintenance cleaner | YES | Duplicate recovery |
| Backup cleaner | YES | Duplicate cleanup |
| Outbox worker | NO | FOR UPDATE SKIP LOCKED — already multi-instance safe |
| Incident dispatcher | YES | In-memory rate limiter per instance → 2x notifications |
| Rate limiter eviction | YES | Paired with dispatcher |
| Uptime worker | YES | Duplicate writes |
| Metrics ingester | NO | Idempotent writes to metrics DB |
| Policy evaluator | YES | Evaluates same data as evaluator |
Observability: the gauge health_evaluator_is_leader{instance="X"} reports 1 on the leader and 0 on standbys.
NodeHealthMachine
File: internal/health/machine.go
Single gateway for all node state transitions. Enforces the transition table, uses optimistic locking (version column on node_health_state AND state_version column on nodes), and emits events via the health_event_outbox table.
Defense-in-depth: The nodes table has a state_version column for optimistic locking on state transitions. This complements leader election by catching races during the failover window (15-30s) when both old and new leaders may briefly overlap. On conflict (0 rows affected), the node is skipped and re-evaluated on the next cycle.
Methods:
| Method | Caller | Purpose |
|---|---|---|
| Transition() | Temporal activities, NodeService | Single-node transition with validation |
| ForceTransition() | Compensation (workflow cleanup) | Bypasses validation table |
| ApplyDecisions() | Policy evaluator | Batch transitions from CEL verdicts |
| ApplyHeartbeatDecisions() | Heartbeat evaluator | Batch transitions from heartbeat decisions |
| UpdateApplicationHealth() | Policy evaluator | Writes application_health dimension |
| UpdateSyncStatus() | NATS event handler | Writes sync_status dimension |
Events: state transitions emit HealthTransitionEvent (node_id, subscription_id, chain_profile_id, previous/new state, trigger, metadata). Dimension changes emit HealthDimensionEvent (dimension name, old/new value).
Optimistic locking: Conflicting writes return ErrConcurrentModification — retried on the next evaluation cycle. BatchResult tracks applied transitions vs. conflicts.
See also: Domain Model — Node State Machine for the canonical transition table.
Health Event Outbox
File: internal/health/outbox.go
The OutboxWorker polls the health_event_outbox table and dispatches events to registered handlers.
Processing flow:
- Atomic claim: UPDATE SET status='processing' WHERE id IN (SELECT ... FOR UPDATE SKIP LOCKED)
- Dispatch to composite handlers: CompositeTransitionHandler -> incident service, migration handler, state log handler; CompositeDimensionHandler -> incident service
- Mark processed or failed
- Cleanup: remove old processed entries, recover stale processing entries from crashed workers
Two Evaluation Systems
| System | Responsibility | Writes Via | Triggers Migration |
|---|---|---|---|
| Heartbeat Evaluator | Infrastructure liveness (agent reachable?) | ApplyHeartbeatDecisions() | YES (after grace period) |
| Policy Evaluator | Application health (sync lag, disk, etc.) | UpdateApplicationHealth() | NO |
Heartbeat Evaluator
Files: internal/health/heartbeat.go (pure function), internal/health/batch_evaluator.go (batch orchestration)
The batch evaluator is the active implementation (it replaces the sequential evaluator) and uses O(1) DB queries per cycle:
- ListSnapshots() — single 3-way JOIN: nodes + agent_registrations + node_health_state
- EvaluateHeartbeat() per node — pure decision function, no I/O
- ApplyHeartbeatDecisions() — batch state transitions with optimistic locking
- BatchUpdateMutations() — persist failure counters to node_health_state
- Trigger migrations for eligible nodes (deterministic workflow IDs)
Deterministic workflow IDs: migration workflow IDs use migrate-node-{nodeID}-{healthStateVersion} instead of timestamps, ensuring idempotency during leader failover. Termination workflow IDs use terminate-expired-{nodeID}-{subscriptionID}. This prevents duplicate Temporal workflows when two instances briefly overlap.
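A minimal sketch of the deterministic ID scheme (helper names are hypothetical; only the ID formats come from the text above). Because the IDs are derived from stable inputs, two overlapping leaders compute the same ID and Temporal's workflow-ID uniqueness rejects the duplicate start:

```go
package main

import "fmt"

// migrationWorkflowID derives a deterministic Temporal workflow ID from
// the node and its health-state version, so retries and leader failover
// produce the same ID rather than a fresh timestamped one.
func migrationWorkflowID(nodeID string, healthStateVersion int64) string {
	return fmt.Sprintf("migrate-node-%s-%d", nodeID, healthStateVersion)
}

// terminationWorkflowID is deterministic per (node, subscription) pair.
func terminationWorkflowID(nodeID, subscriptionID string) string {
	return fmt.Sprintf("terminate-expired-%s-%s", nodeID, subscriptionID)
}

func main() {
	fmt.Println(migrationWorkflowID("node-42", 17))        // migrate-node-node-42-17
	fmt.Println(terminationWorkflowID("node-42", "sub-9")) // terminate-expired-node-42-sub-9
}
```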
Pure evaluation: EvaluateHeartbeat(snapshot, alive, config, now) returns HeartbeatDecision:
- Transition — state change (if any)
- HealthMutation — bookkeeping (consecutive_failures, down_since, last_heartbeat_check)
- Migration — whether to trigger migration
Decision rules:
- No heartbeat for HeartbeatTimeout (60s) -> increment consecutive_failures
- consecutive_failures >= ConsecutiveFailuresForDown (3) -> transition to DOWN
- DOWN for MigrationGracePeriod (60s) -> trigger migration (if cooldown allows)
- Heartbeat received while DOWN -> recover to HEALTHY/SYNCING, reset counter
Maintenance skip: nodes in maintenance state (set by UpgradeNodeWorkflow) are skipped by both the heartbeat evaluator and the policy evaluator. This prevents false alarms and migration triggers during the upgrade window. After the upgrade completes, the workflow transitions the node to syncing, resuming normal health evaluation.
Scaling: ~5,000 nodes on a single leader instance. Beyond that, partitioning by node ID range would be needed — not a Phase 1 concern. Standby instances provide automatic failover via leader election.
Policy Evaluator
Package: internal/observation/evaluator/
Evaluates CEL policies against Victoria Metrics data. Writes application_health via NodeHealthMachine.UpdateApplicationHealth().
Flow:
- Query Victoria Metrics for active nodes (circuit breaker protected)
- Evaluate compiled CEL programs per chain profile
- First-match-wins verdict ordering: critical > degraded > ok > unknown
- Write verdict to health machine (emits HealthDimensionEvent on change)
Agent-reported sync status (via the NATS event handler) is written through NodeHealthMachine.UpdateSyncStatus(). The state field is written exclusively by the heartbeat evaluator — the agent does not affect it.
See also: Workflows — Metrics Flow for the full observation pipeline.
Incident Service
File: internal/incident/service.go
Creates, updates, escalates, and resolves incidents from health events.
Event Categorization
State transitions:

| Transition | Category | Severity |
|---|---|---|
| Any -> down | node_down | critical |
| provisioning -> failed | provision_failed | critical |
| maintenance -> failed | migration_failed | critical |
| Upgrade + compensation both fail | upgrade_failed | critical |
| Rollback of succeeded node fails | upgrade_rollback_failed | critical |
| Health validation exceeds duration | upgrade_stalled | warning |
Dimension changes:

| Dimension | Value | Category | Severity |
|---|---|---|---|
| application_health | critical | app_critical | critical |
| application_health | degraded | app_degraded | warning |
| sync_status | stalled | sync_stalled | warning |
Incident Lifecycle
- Create/Upsert: dedup by (node_id, category) partial unique index. New occurrence -> create. Existing open incident -> increment occurrence_count, update last_seen_at, merge metadata (JSONB concatenation).
- Escalation: warning incidents escalate to critical after:
  - app_degraded: occurrence_count >= threshold OR duration > escalation_duration
  - sync_stalled: duration > escalation_duration
- Flapping detection: if occurrence_count >= threshold within the time window, the incident is marked is_flapping (persisted to a dedicated column, survives restarts).
- Auto-resolution: when the node recovers (DOWN -> HEALTHY/SYNCING, app_health -> OK, sync_status -> syncing/synced):
  - Increment the resolution_debounce counter (per-incident, scoped to node_id + category)
  - Normal incidents resolve after the debounce threshold
  - Flapping incidents require a doubled threshold
  - Resolution resets the debounce counter
- User resolution: resolveUserID() fetches the owner from the subscription (nullable on failure — incidents are still created without user_id).
Recovery Signals
| Previous State | Recovery Signal | Action |
|---|---|---|
| DOWN | -> HEALTHY or SYNCING | Auto-resolve node_down |
| app_health critical/degraded | -> OK | Auto-resolve app_critical/app_degraded |
| sync_status stalled | -> syncing or synced | Auto-resolve sync_stalled |
Notification Pipeline
Package: internal/incident/notifier/
Dispatcher
File: internal/incident/notifier/dispatcher.go
Async pipeline: enqueue -> collect batch -> dispatch.
- Enqueue: fire-and-forget to a buffered channel (overflow drops with metric)
- Batch collection: wait for the batch window (50ms default) or buffer full
- Correlation: group notifications by ChainProfileID. If count >= CorrelationMinNodes, send a single correlated summary instead of individual notifications
- Mass failure guard: if batch >= MassFailureThreshold, send a single summary
- Rate limiting: apply sliding window limits before sending
- Send: fan out to all registered channels (Slack, Telegram, Email, Webhook)
- Retry: failed sends -> notification outbox with exponential backoff
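The correlation and mass-failure steps can be sketched as a pure batching decision (types, return shape, and the string summaries are illustrative):

```go
package main

import "fmt"

type Notification struct {
	NodeID         string
	ChainProfileID string
}

// planBatch groups a collected batch by chain profile and decides,
// per group, whether to emit one correlated summary or pass the
// notifications through individually. A batch-wide mass failure
// collapses everything into a single summary first.
func planBatch(batch []Notification, correlationMinNodes, massFailureThreshold int) (summaries []string, individual []Notification) {
	if len(batch) >= massFailureThreshold {
		return []string{fmt.Sprintf("mass failure: %d nodes affected", len(batch))}, nil
	}
	byChain := map[string][]Notification{}
	for _, n := range batch {
		byChain[n.ChainProfileID] = append(byChain[n.ChainProfileID], n)
	}
	for chain, group := range byChain {
		if len(group) >= correlationMinNodes {
			summaries = append(summaries, fmt.Sprintf("chain %s: %d nodes affected", chain, len(group)))
		} else {
			individual = append(individual, group...)
		}
	}
	return summaries, individual
}

func main() {
	batch := []Notification{{"n1", "eth"}, {"n2", "eth"}, {"n3", "eth"}, {"n4", "sol"}}
	s, ind := planBatch(batch, 3, 50)
	fmt.Println(len(s), len(ind)) // 1 1 (eth correlated, sol individual)
}
```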
Rate Limiter
File: internal/incident/notifier/ratelimiter.go
Sliding window implementation with three tiers:
| Tier | Scope | Default Window |
|---|---|---|
| Per-node | Limits notifications for a single node | 10 min |
| Per-user | Limits notifications for a single user | 10 min |
| Global | System-wide cap | 10 min |
Notification Channels
All implement contracts.Notifier:
| Channel | File | Transport |
|---|---|---|
| Slack | slack.go | Incoming webhook, Block Kit formatting |
| Telegram | telegram.go | Bot API, HTML parsing |
| Email | email.go | Generic HTTP API (SendGrid/Mailgun/Postmark compatible) |
| Webhook | webhook.go | JSON POST with X-Webhook-Secret header |
Formatting (templates.go): FormatTitle() renders severity emoji + event type + title. FormatBody() includes category, severity, status, chain, node, occurrence count, first_seen. Correlation and mass-failure summaries group incidents by chain.
Notification types: created, escalated, resolved, auto_resolved, flapping, correlated
Rollout notification events (rollout-level, not per-node):
| Event | Trigger |
|---|---|
| rollout_started | Rollout workflow begins execution |
| rollout_completed | All nodes in rollout upgraded successfully |
| rollout_paused | Rollout auto-paused (failure threshold) or manually paused |
| rollout_failed | Rollout failed (unrecoverable) |
| rollback_initiated | Full rollback triggered for a rollout |
| rollback_completed | Full rollback finished |
See also: Extending — Adding a Notification Channel for how to add new channels.
Notification Outbox
File: internal/incident/notifier/outbox.go
Persistent retry for failed sends:
| Setting | Default |
|---|---|
| Retry schedule | 30s, 1m, 5m, 15m, 1h |
| Max attempts | 5 |
| Poll interval | 60s |
| Batch size | 100 |
| Cleanup age | 7 days |
Status lifecycle: pending -> retrying -> sent/failed. The worker polls, claims entries, and tracks next_retry_at for backoff.
Metrics
| Metric | Type | Description |
|---|---|---|
| incident_notifications_dropped_total | Counter | Buffer overflow drops |
| incident_notifications_dispatched_total | Counter | Success/error by channel |
| incident_notifications_exhausted_total | Counter | Retry exhaustion |
Rolling Uptime
Package: internal/uptime/
Computes rolling uptime percentages from node state transitions using an event-driven architecture with periodic materialization.
Architecture
StateLogHandler
File: internal/uptime/state_log_handler.go
Implements contracts.HealthTransitionHandler. Registered with the CompositeTransitionHandler alongside the incident service and migration handler. On each transition:
- Closes the previous state entry (exited_at = event.Timestamp)
- Inserts a new state entry (exited_at = NULL, representing the current state)
Idempotency: a UNIQUE constraint on event_id means duplicate outbox deliveries are safely rejected.
UptimeWorker
File: internal/uptime/worker.go
Background worker running inside the Health Evaluator service. Materializes hourly uptime buckets from node_state_log every 5 minutes.
Per-node processing:
- Find the last complete bucket (or node.created_at if none exist)
- Compute each hour from that point to the current hour
- For each hour, classify state intervals as uptime, downtime, or unknown
- Upsert the bucket via INSERT ... ON CONFLICT DO UPDATE (idempotent)
- Mark past hours as is_complete = true; the current hour remains is_complete = false (recomputed each tick)
- Purge buckets older than 90 days
| Category | States | Rationale |
|---|---|---|
| Uptime | healthy, syncing, degraded, maintenance | Node is operational |
| Uptime (grace) | provisioning | First 10 min after creation |
| Downtime | down, failed | Node is not operational |
| Excluded | terminating, terminated | Not uptime-relevant |
Invariant: for every bucket, uptime_seconds + downtime_seconds + unknown_seconds = 3600.
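The classification table and the 3600-second invariant can be sketched as follows (a simplification: the provisioning grace period is omitted, and terminating/terminated intervals are assumed to be filtered out upstream):

```go
package main

import "fmt"

// classify maps a node state to a bucket column, following the
// categorization table above (grace-period handling omitted).
func classify(state string) string {
	switch state {
	case "healthy", "syncing", "degraded", "maintenance":
		return "uptime"
	case "down", "failed":
		return "downtime"
	case "terminating", "terminated":
		return "excluded"
	default:
		return "unknown"
	}
}

// bucket accumulates one hour of (state -> seconds) intervals.
// Time not covered by any interval counts as unknown, preserving
// up + down + unknown == 3600.
func bucket(intervals map[string]int) (up, down, unknown int) {
	covered := 0
	for state, secs := range intervals {
		switch classify(state) {
		case "uptime":
			up += secs
			covered += secs
		case "downtime":
			down += secs
			covered += secs
		case "unknown":
			unknown += secs
			covered += secs
		}
	}
	unknown += 3600 - covered // gaps in the state log count as unknown
	return
}

func main() {
	up, down, unk := bucket(map[string]int{"healthy": 3000, "down": 300})
	fmt.Println(up, down, unk, up+down+unk == 3600) // 3000 300 300 true
}
```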
Configuration: internal/defaults/defaults.go — UptimeWorkerInterval (5min), UptimeWorkerBatchSize (500), UptimeGracePeriod (10min), UptimeRetentionDays (90d).
UptimeService
File: internal/service/uptime.go
Computes rolling uptime from materialized hourly buckets. Two query modes:
- Summary: SUM(uptime_seconds)/SUM(total_seconds) over the requested window
- History: group buckets by granularity (hourly or daily) for chart data
Window boundaries are clamped to node.created_at (no credit for time before the node existed) and terminated_at (no penalty for time after termination).
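Summary mode is a straight ratio over the window's buckets. A sketch under one assumption: total_seconds is taken here as uptime + downtime + unknown, i.e. unknown time counts in the denominator — the real definition of total_seconds may differ:

```go
package main

import "fmt"

type hourBucket struct {
	uptimeSeconds   int
	downtimeSeconds int
	unknownSeconds  int
}

// rollingUptime computes SUM(uptime)/SUM(total) over materialized
// buckets, returned as a percentage. An empty window yields 0.
func rollingUptime(buckets []hourBucket) float64 {
	var up, total int
	for _, b := range buckets {
		up += b.uptimeSeconds
		total += b.uptimeSeconds + b.downtimeSeconds + b.unknownSeconds
	}
	if total == 0 {
		return 0
	}
	return 100 * float64(up) / float64(total)
}

func main() {
	buckets := []hourBucket{{3600, 0, 0}, {1800, 1800, 0}}
	fmt.Printf("%.1f%%\n", rollingUptime(buckets)) // 75.0%
}
```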
API Endpoints
File: internal/api/handler_uptime.go
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/nodes/{nodeID}/uptime?window={window} | Rolling uptime summary |
| GET | /api/v1/nodes/{nodeID}/uptime/history?window={window}&granularity={granularity} | Per-period breakdown for charts |

- window: 24h, 7d, 30d (default: 30d)
- granularity: hourly, daily (default: hourly for 24h, daily otherwise)
Both endpoints require the nodes:read scope. Ownership is verified via subscription lookup.
Data Tables
| Table | Purpose |
|---|---|
| node_state_log | Append-only log of state transitions with entered_at/exited_at intervals |
| node_uptime_hourly | Materialized hourly buckets (uptime_seconds, downtime_seconds, unknown_seconds) |
See also: Domain Model — Database Schema for table definitions and indexes.
Cleanup Services
Cleanup Matrix
| Scenario | Handler | Trigger |
|---|---|---|
| Provisioning fails | ProvisionNodeWorkflow.compensate() | Step failure |
| Node terminates | TerminateNodeWorkflow | API call |
| Pending payment expires | SubscriptionCleaner | Periodic timer |
| Subscription expires | SubscriptionCleaner (2-phase) | Periodic timer |
| Orphaned terraform state | TerraformCleaner | Periodic timer |
| Orphaned key backups | BackupCleaner | Termination + periodic |
| Expired auth tokens | TokenCleaner | Periodic timer |
| Stuck maintenance nodes | MaintenanceCleanup | Periodic timer (30 min timeout) |
| Migration fails | MigrateNodeWorkflow.compensateMigration() | Step failure |
SubscriptionCleaner
File: internal/health/subscription_cleanup.go
Three-stage periodic process:
| Stage | Finds | Action |
|---|---|---|
| handlePendingPaymentExpired | pending_payment older than TTL (30m) | Delete subscription |
| handleExpiredSubscriptions | active where expires_at < now | Transition to expiring, set 24h grace period |
| handleGracePeriodExpiredSubscriptions | expiring where grace_period_expires_at < now | Terminate nodes, delete backups, transition to terminated |
BackupCleaner
File: internal/health/backup_cleanup.go
Removes encrypted key backups from S3:
- Called by SubscriptionCleaner on node termination
- Periodic CleanupOrphanedBackups() for already-terminated nodes
- Returns nil if S3 is not configured; non-blocking on failures
TerraformCleaner
File: internal/health/terraform_cleanup.go
Removes orphaned Terraform state directories:
- Builds set of active host IDs (non-terminated + recently-terminated nodes)
- Only deletes state for hosts NOT in active set
- Age threshold prevents deletion of recently-stopped hosts
Auth CleanupJob
File: internal/auth/cleanup.go
Removes expired nonces and revoked/expired refresh tokens (default: 1h interval).
MaintenanceCleanup
Periodic loop that detects nodes stuck in maintenance state beyond a configurable timeout (default: 30 minutes). Nodes can become stuck if an upgrade workflow crashes or Temporal loses track of the child workflow.
Behavior: Queries nodes in maintenance state where updated_at exceeds the timeout. Transitions stuck nodes back to their previous healthy state and creates a warning incident.
Configuration: MAINTENANCE_CLEANUP_INTERVAL (check frequency), MAINTENANCE_CLEANUP_TIMEOUT (how long in maintenance before considered stuck).
Workflow Compensation Patterns
| Workflow | Compensation | Cleans Up |
|---|---|---|
| Provision | compensate() | Terraform destroy + key deletion -> state failed |
| Terminate | None (is already cleanup) | Best-effort, partial failures in PartialFailures[] |
| Migrate | compensateMigration() | Destroy NEW infra (if exists) -> state degraded |
Resource Lifecycle
| Resource | Cleaner | Timing |
|---|---|---|
| VM/Infrastructure | DestroyNode activity | Immediate on cleanup |
| Terraform state | TerraformCleaner | ~1h after termination |
| Encryption keys | DeleteNodeKeys activity | Immediate |
| Key backups (S3) | BackupCleaner | Same as termination |
| Pending subscriptions | SubscriptionCleaner | 30m after creation |
| Auth tokens | TokenCleaner | Configured interval |
Post-Upgrade Health Validation
The WaitForHealthValidation activity (called by UpgradeNodeWorkflow) polls node state after an upgrade to confirm the node is healthy. It uses relaxed criteria to avoid false failures during stabilization.
Success criteria (ALL must pass):
| # | Criterion | Rationale |
|---|---|---|
| 1 | At least one heartbeat received | Agent alive after upgrade |
| 2 | application_health != critical | App not critically broken |
| 3 | sync_status != stalled | Node making progress |
| 4 | Node state not down or failed | Infrastructure healthy |
| 5 | No crashloop (< N restarts within health_wait_duration) | Detects state incompatibility after upgrade |
Deliberately not required: sync_status == synced (syncing is expected after a restart) and application_health == ok (degraded is tolerated during stabilization).
Crashloop detection: Counts restarts within the health wait window. If the threshold is exceeded, the upgrade is marked failed and compensation triggers. This is critical for detecting state-incompatible upgrades (Tier 3 rollback scenarios).
Related Documents
- Overview — Health evaluator service description
- Domain Model — Node state machine, incident model
- Workflows — Temporal workflow compensation patterns
- Extending — Adding health event handlers, notification channels