
Pipeline Overview

Health monitoring is a single pipeline: detection -> state transition -> incident creation -> notification -> cleanup. State transitions also feed the rolling uptime system. All components run inside the Health Evaluator service (cmd/health-evaluator/main.go).

Leader Election

The health evaluator uses PostgreSQL advisory-lock-based leader election to ensure safe multi-instance deployment. One active leader runs the full pipeline; the N-1 standby instances run only multi-instance-safe subsystems.

Implementation: internal/health/leader.go (generic, reusable). Bootstrap via internal/app/bootstrap/leader_election.go.

Advisory lock connection: uses a dedicated, long-lived pgx.Conn (NOT from the connection pool). The lock is held for the lifetime of the leader's tenure. If a pooled connection were used, the lock would be released silently when the connection returned to the pool.
| Subsystem | Leader-gated? | Why |
| --- | --- | --- |
| Heartbeat evaluator | YES | Duplicate evaluation → state races |
| Subscription cleaner | YES | Duplicate termination workflows |
| Terraform cleaner | YES | Duplicate cleanup |
| Maintenance cleaner | YES | Duplicate recovery |
| Backup cleaner | YES | Duplicate cleanup |
| Outbox worker | NO | FOR UPDATE SKIP LOCKED — already multi-instance safe |
| Incident dispatcher | YES | In-memory rate limiter per instance → 2x notifications |
| Rate limiter eviction | YES | Paired with dispatcher |
| Uptime worker | YES | Duplicate writes |
| Metrics ingester | NO | Idempotent writes to metrics DB |
| Policy evaluator | YES | Evaluates the same data as the evaluator |
Failover: on leader failure, a standby acquires the lock within one evaluation interval (15-30s). Advisory locks have no TTL; they are held until the connection drops.

Observability: the leader emits health_evaluator_is_leader{instance="X"} 1|0.
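The acquisition loop might look like the following sketch. This is illustrative only: the key derivation (FNV-1a over the service name) and the helper names `leaderLockKey` / `runLeaderElection` are assumptions, not the actual implementation, and `database/sql` stands in for the dedicated pgx connection.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"hash/fnv"
	"time"
)

// leaderLockKey maps a service name to a stable 64-bit advisory lock key.
// (Illustrative: the real implementation may derive its key differently.)
func leaderLockKey(name string) int64 {
	h := fnv.New64a()
	h.Write([]byte(name))
	return int64(h.Sum64())
}

// runLeaderElection tries to become leader on a dedicated connection.
// The *sql.Conn must be long-lived: pg_advisory_lock is session-scoped,
// so returning the connection to a pool would silently drop leadership.
func runLeaderElection(ctx context.Context, conn *sql.Conn, name string, interval time.Duration) error {
	key := leaderLockKey(name)
	for {
		var acquired bool
		if err := conn.QueryRowContext(ctx,
			"SELECT pg_try_advisory_lock($1)", key).Scan(&acquired); err != nil {
			return err
		}
		if acquired {
			return nil // hold the lock (and the connection) for the whole tenure
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval): // standby: retry within one evaluation interval
		}
	}
}

func main() {
	// The key is deterministic, so every instance competes for the same lock.
	fmt.Println(leaderLockKey("health-evaluator") == leaderLockKey("health-evaluator"))
}
```

Because `pg_try_advisory_lock` is non-blocking, standbys poll on the evaluation interval rather than parking inside the database, which keeps failover within the 15-30s window described above.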

NodeHealthMachine

File: internal/health/machine.go

The single gateway for all node state transitions. It enforces the transition table, uses optimistic locking (a version column on node_health_state AND a state_version column on nodes), and emits events via the health_event_outbox table.

Defense-in-depth: the state_version column on the nodes table provides optimistic locking on state transitions. This complements leader election by catching races during the failover window (15-30s), when the old and new leaders may briefly overlap. On conflict (0 rows affected), the node is skipped and re-evaluated on the next cycle.

Methods:
| Method | Caller | Purpose |
| --- | --- | --- |
| Transition() | Temporal activities, NodeService | Single-node transition with validation |
| ForceTransition() | Compensation (workflow cleanup) | Bypasses the validation table |
| ApplyDecisions() | Policy evaluator | Batch transitions from CEL verdicts |
| ApplyHeartbeatDecisions() | Heartbeat evaluator | Batch transitions from heartbeat decisions |
| UpdateApplicationHealth() | Policy evaluator | Writes the application_health dimension |
| UpdateSyncStatus() | NATS event handler | Writes the sync_status dimension |
Event emission: state changes emit HealthTransitionEvent (node_id, subscription_id, chain_profile_id, previous/new state, trigger, metadata). Dimension changes emit HealthDimensionEvent (dimension name, old/new value).

Optimistic locking: conflicting writes return ErrConcurrentModification and are retried on the next evaluation cycle. BatchResult tracks applied transitions vs. conflicts.
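A minimal sketch of how a transition-table gate like Transition() can be enforced. The states and edges below are illustrative stand-ins, not the canonical transition table (see Domain Model for that):

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative transition table: each state maps to the states it may
// legally move to. The real table lives in the domain model.
var allowed = map[string][]string{
	"provisioning": {"syncing", "failed"},
	"syncing":      {"healthy", "down", "failed"},
	"healthy":      {"syncing", "degraded", "down", "maintenance"},
	"down":         {"healthy", "syncing", "failed"},
	"maintenance":  {"syncing", "failed"},
}

var ErrInvalidTransition = errors.New("invalid transition")

// validateTransition is the kind of check Transition() applies before
// writing; ForceTransition() is the compensation path that skips it.
func validateTransition(from, to string) error {
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("%w: %s -> %s", ErrInvalidTransition, from, to)
}

func main() {
	fmt.Println(validateTransition("healthy", "down"))       // allowed
	fmt.Println(validateTransition("healthy", "terminated")) // rejected
}
```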
See also: Domain Model — Node State Machine for the canonical transition table.

Health Event Outbox

File: internal/health/outbox.go

The OutboxWorker polls the health_event_outbox table and dispatches events to registered handlers. Processing flow:
  1. Atomic claim: UPDATE SET status='processing' WHERE id IN (SELECT ... FOR UPDATE SKIP LOCKED)
  2. Dispatch to composite handlers:
    • CompositeTransitionHandler -> incident service, migration handler, state log handler
    • CompositeDimensionHandler -> incident service
  3. Mark processed or failed
  4. Cleanup: Remove old processed entries, recover stale processing entries from crashed workers
Guarantees: At-least-once delivery. Handlers must be idempotent. Handler errors are logged but don’t propagate or block other handlers.

Two Evaluation Systems

| System | Responsibility | Writes Via | Triggers Migration |
| --- | --- | --- | --- |
| Heartbeat Evaluator | Infrastructure liveness (agent reachable?) | ApplyHeartbeatDecisions() | YES (after grace period) |
| Policy Evaluator | Application health (sync lag, disk, etc.) | UpdateApplicationHealth() | NO |
Both emit events through the outbox, enabling downstream consumers (incident service) without coupling to evaluation logic.

Heartbeat Evaluator

Files: internal/health/heartbeat.go (pure function), internal/health/batch_evaluator.go (batch orchestration)

The batch evaluator is the active implementation (it replaces the sequential evaluator) and uses O(1) DB queries per cycle:
  1. ListSnapshots() — single 3-way JOIN: nodes + agent_registrations + node_health_state
  2. EvaluateHeartbeat() per node — pure decision function, no I/O
  3. ApplyHeartbeatDecisions() — batch state transitions with optimistic locking
  4. BatchUpdateMutations() — persist failure counters to node_health_state
  5. Trigger migrations for eligible nodes (deterministic workflow IDs)
Deterministic workflow IDs: migration workflow IDs use migrate-node-{nodeID}-{healthStateVersion} instead of timestamps, ensuring idempotency during leader failover. Termination workflow IDs use terminate-expired-{nodeID}-{subscriptionID}. This prevents duplicate Temporal workflows when two instances briefly overlap.

Pure evaluation: EvaluateHeartbeat(snapshot, alive, config, now) returns a HeartbeatDecision:
  • Transition — state change (if any)
  • HealthMutation — bookkeeping (consecutive_failures, down_since, last_heartbeat_check)
  • Migration — whether to trigger migration
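The deterministic ID convention can be sketched in two lines; the helper names here are illustrative, only the format strings come from the convention above:

```go
package main

import "fmt"

// IDs are derived from stable identifiers rather than timestamps, so a
// retry during leader failover produces the same ID and Temporal
// deduplicates the workflow start.
func migrationWorkflowID(nodeID string, healthStateVersion int64) string {
	return fmt.Sprintf("migrate-node-%s-%d", nodeID, healthStateVersion)
}

func terminationWorkflowID(nodeID, subscriptionID string) string {
	return fmt.Sprintf("terminate-expired-%s-%s", nodeID, subscriptionID)
}

func main() {
	// Two briefly-overlapping leaders computing from the same state agree:
	fmt.Println(migrationWorkflowID("n-42", 7) == migrationWorkflowID("n-42", 7))
}
```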
Failure detection flow:
  1. No heartbeat for HeartbeatTimeout (60s) -> increment consecutive_failures
  2. consecutive_failures >= ConsecutiveFailuresForDown (3) -> transition to DOWN
  3. DOWN for MigrationGracePeriod (60s) -> trigger migration (if cooldown allows)
  4. Heartbeat received while DOWN -> recover to HEALTHY/SYNCING, reset counter
Maintenance state during upgrades: nodes in the maintenance state (set by UpgradeNodeWorkflow) are skipped by both the heartbeat evaluator and the policy evaluator. This prevents false alarms and migration triggers during the upgrade window. After the upgrade completes, the workflow transitions the node to syncing, resuming normal health evaluation.

Scaling: ~5,000 nodes on a single leader instance. Beyond that, partitioning by node ID range is needed; not a Phase 1 concern. Standby instances provide automatic failover via leader election.

Policy Evaluator

Package: internal/observation/evaluator/

Evaluates CEL policies against VictoriaMetrics data. Writes application_health via NodeHealthMachine.UpdateApplicationHealth(). Flow:
  1. Query Victoria Metrics for active nodes (circuit breaker protected)
  2. Evaluate compiled CEL programs per chain profile
  3. First-match-wins verdict ordering: critical > degraded > ok > unknown
  4. Write verdict to health machine (emits HealthDimensionEvent on change)
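Step 3's first-match-wins ordering can be sketched with plain predicates standing in for compiled CEL programs. The `policy` type and metric names are illustrative assumptions:

```go
package main

import "fmt"

// policy pairs a verdict with a match predicate (a stand-in for a
// compiled CEL program).
type policy struct {
	Verdict string
	Match   func(metrics map[string]float64) bool
}

// evaluate walks policies pre-sorted by severity (critical > degraded > ok)
// and returns the first match; "unknown" when nothing matches.
func evaluate(policies []policy, metrics map[string]float64) string {
	for _, p := range policies {
		if p.Match(metrics) {
			return p.Verdict
		}
	}
	return "unknown"
}

func main() {
	policies := []policy{
		{"critical", func(m map[string]float64) bool { return m["disk_used_pct"] > 95 }},
		{"degraded", func(m map[string]float64) bool { return m["sync_lag_blocks"] > 100 }},
		{"ok", func(m map[string]float64) bool { return true }},
	}
	// Disk is fine but sync lag is high -> degraded wins before ok.
	fmt.Println(evaluate(policies, map[string]float64{"disk_used_pct": 80, "sync_lag_blocks": 250}))
}
```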
Sync status: Sync events flow from agent -> gRPC -> NATS -> health evaluator -> NodeHealthMachine.UpdateSyncStatus(). The state field is written exclusively by the heartbeat evaluator — the agent does not affect it.
See also: Workflows — Metrics Flow for the full observation pipeline.

Incident Service

File: internal/incident/service.go

Creates, updates, escalates, and resolves incidents from health events.

Event Categorization

State transitions:
| Transition | Category | Severity |
| --- | --- | --- |
| Any -> down | node_down | critical |
| provisioning -> failed | provision_failed | critical |
| maintenance -> failed | migration_failed | critical |
| Upgrade + compensation both fail | upgrade_failed | critical |
| Rollback of succeeded node fails | upgrade_rollback_failed | critical |
| Health validation exceeds duration | upgrade_stalled | warning |
Dimension changes:
| Dimension | Value | Category | Severity |
| --- | --- | --- | --- |
| application_health | critical | app_critical | critical |
| application_health | degraded | app_degraded | warning |
| sync_status | stalled | sync_stalled | warning |

Incident Lifecycle

  1. Create/Upsert: Dedup by (node_id, category) partial unique index. New occurrence -> create. Existing open incident -> increment occurrence_count, update last_seen_at, merge metadata (JSONB concatenation).
  2. Escalation: Warning incidents escalate to critical after:
    • app_degraded: occurrence_count >= threshold OR duration > escalation_duration
    • sync_stalled: duration > escalation_duration
  3. Flapping detection: If occurrence_count >= threshold within time window, incident marked is_flapping (persisted to dedicated column, survives restarts).
  4. Auto-resolution: When node recovers (DOWN -> HEALTHY/SYNCING, app_health -> OK, sync_status -> syncing/synced):
    • Increment resolution_debounce counter (per-incident, scoped to node_id + category)
    • Normal incidents resolve after debounce threshold
    • Flapping incidents require doubled threshold
    • Resolution resets debounce counter
  5. User resolution: resolveUserID() fetches owner from subscription (nullable on failure — incidents still created without user_id).

Recovery Signals

| Previous State | Recovery Signal | Action |
| --- | --- | --- |
| DOWN | -> HEALTHY or SYNCING | Auto-resolve node_down |
| app_health critical/degraded | -> OK | Auto-resolve app_critical/app_degraded |
| sync_status stalled | -> syncing or synced | Auto-resolve sync_stalled |

Notification Pipeline

Package: internal/incident/notifier/

Dispatcher

File: internal/incident/notifier/dispatcher.go

Async pipeline: enqueue -> collect batch -> dispatch.
  1. Enqueue: Fire-and-forget to buffered channel (overflow drops with metric)
  2. Batch collection: Wait for batch window (50ms default) or buffer full
  3. Correlation: Group notifications by ChainProfileID. If count >= CorrelationMinNodes, send single correlated summary instead of individual notifications
  4. Mass failure guard: If batch >= MassFailureThreshold, send single summary
  5. Rate limiting: Apply sliding window limits before sending
  6. Send: Fan out to all registered channels (Slack, Telegram, Email, Webhook)
  7. Retry: Failed sends -> notification outbox with exponential backoff
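The correlation step (3) can be sketched as a grouping pass over the collected batch. The `notification` type and summary wording are illustrative assumptions:

```go
package main

import "fmt"

type notification struct {
	NodeID         string
	ChainProfileID string
}

// correlate groups a batch by chain profile. Any group that reaches
// correlationMinNodes collapses into a single summary instead of
// per-node notifications.
func correlate(batch []notification, correlationMinNodes int) (individual []notification, summaries []string) {
	byChain := map[string][]notification{}
	for _, n := range batch {
		byChain[n.ChainProfileID] = append(byChain[n.ChainProfileID], n)
	}
	for chain, group := range byChain {
		if len(group) >= correlationMinNodes {
			summaries = append(summaries, fmt.Sprintf("%d nodes affected on chain %s", len(group), chain))
			continue
		}
		individual = append(individual, group...)
	}
	return individual, summaries
}

func main() {
	batch := []notification{
		{"n1", "eth-mainnet"}, {"n2", "eth-mainnet"}, {"n3", "eth-mainnet"},
		{"n4", "solana"},
	}
	ind, sums := correlate(batch, 3)
	// Three eth-mainnet failures fold into one summary; solana stays individual.
	fmt.Println(len(ind), len(sums))
}
```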

Rate Limiter

File: internal/incident/notifier/ratelimiter.go

Sliding window implementation with three tiers:
| Tier | Scope | Default Window |
| --- | --- | --- |
| Per-node | Limits notifications for a single node | 10 min |
| Per-user | Limits notifications for a single user | 10 min |
| Global | System-wide cap | 10 min |
A notification is allowed only if all three tiers permit. Thread-safe with mutex. Optional background goroutine evicts stale entries (default: every 5 min).

Notification Channels

All implement contracts.Notifier:
type Notifier interface {
    Send(ctx context.Context, notification Notification) error
    Name() string
}
| Channel | File | Transport |
| --- | --- | --- |
| Slack | slack.go | Incoming webhook, Block Kit formatting |
| Telegram | telegram.go | Bot API, HTML parsing |
| Email | email.go | Generic HTTP API (SendGrid/Mailgun/Postmark compatible) |
| Webhook | webhook.go | JSON POST with X-Webhook-Secret header |
Templates (templates.go): FormatTitle() renders severity emoji + event type + title. FormatBody() includes category, severity, status, chain, node, occurrence count, and first_seen. Correlation and mass-failure summaries group incidents by chain.

Notification types: created, escalated, resolved, auto_resolved, flapping, correlated.

Rollout notification events (rollout-level, not per-node):
| Event | Trigger |
| --- | --- |
| rollout_started | Rollout workflow begins execution |
| rollout_completed | All nodes in rollout upgraded successfully |
| rollout_paused | Rollout auto-paused (failure threshold) or manually paused |
| rollout_failed | Rollout failed (unrecoverable) |
| rollback_initiated | Full rollback triggered for a rollout |
| rollback_completed | Full rollback finished |
Rollout notifications reuse the existing dispatcher infrastructure (rate limiting, retry, all channels).
See also: Extending — Adding a Notification Channel for how to add new channels.

Notification Outbox

File: internal/incident/notifier/outbox.go

Persistent retry for failed sends:
| Setting | Default |
| --- | --- |
| Retry schedule | 30s, 1m, 5m, 15m, 1h |
| Max attempts | 5 |
| Poll interval | 60s |
| Batch size | 100 |
| Cleanup age | 7 days |
Status flow: pending -> retrying -> sent/failed. Worker polls, claims entries, tracks next_retry_at for backoff.

Metrics

| Metric | Type | Description |
| --- | --- | --- |
| incident_notifications_dropped_total | Counter | Buffer overflow drops |
| incident_notifications_dispatched_total | Counter | Success/error by channel |
| incident_notifications_exhausted_total | Counter | Retry exhaustion |

Rolling Uptime

Package: internal/uptime/

Computes rolling uptime percentages from node state transitions, using an event-driven architecture with periodic materialization.

Architecture

HealthTransitionEvent → CompositeTransitionHandler → StateLogHandler → node_state_log (append-only)
        ↓
UptimeWorker (every 5 min) → node_uptime_hourly
        ↓
API → SUM(buckets) → % uptime

StateLogHandler

File: internal/uptime/state_log_handler.go

Implements contracts.HealthTransitionHandler. Registered with the CompositeTransitionHandler alongside the incident service and migration handler. On each transition:
  1. Closes the previous state entry (exited_at = event.Timestamp)
  2. Inserts a new state entry (exited_at = NULL, representing the current state)
Idempotent via UNIQUE constraint on event_id — duplicate outbox deliveries are safely rejected.

UptimeWorker

File: internal/uptime/worker.go

Background worker running inside the Health Evaluator service. Materializes hourly uptime buckets from node_state_log every 5 minutes. Per-node processing:
  1. Find the last complete bucket (or node.created_at if none exist)
  2. Compute each hour from that point to the current hour
  3. For each hour, classify state intervals as uptime, downtime, or unknown
  4. Upsert bucket via INSERT ... ON CONFLICT DO UPDATE (idempotent)
  5. Mark past hours as is_complete = true; current hour remains is_complete = false (recomputed each tick)
  6. Purge buckets older than 90 days
State classification:
| Category | States | Rationale |
| --- | --- | --- |
| Uptime | healthy, syncing, degraded, maintenance | Node is operational |
| Uptime (grace) | provisioning | First 10 min after creation |
| Downtime | down, failed | Node is not operational |
| Excluded | terminating, terminated | Not uptime-relevant |
Invariant: for every complete bucket, uptime_seconds + downtime_seconds + unknown_seconds = 3600.

Configuration: internal/defaults/defaults.go — UptimeWorkerInterval (5 min), UptimeWorkerBatchSize (500), UptimeGracePeriod (10 min), UptimeRetentionDays (90).
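The per-hour classification (step 3 above) can be sketched as follows. The `interval` shape is illustrative, and for brevity this sketch folds both gaps and excluded states into `unknown`:

```go
package main

import "fmt"

// interval is a closed state span within one hour, expressed as second
// offsets into the hour (illustrative shape).
type interval struct {
	State    string
	StartSec int // [0, 3600)
	EndSec   int // exclusive, <= 3600
}

var uptimeStates = map[string]bool{"healthy": true, "syncing": true, "degraded": true, "maintenance": true}
var downtimeStates = map[string]bool{"down": true, "failed": true}

// bucket classifies intervals into the three counters, preserving the
// invariant uptime + downtime + unknown = 3600 for a complete hour.
func bucket(intervals []interval) (uptime, downtime, unknown int) {
	for _, iv := range intervals {
		d := iv.EndSec - iv.StartSec
		switch {
		case uptimeStates[iv.State]:
			uptime += d
		case downtimeStates[iv.State]:
			downtime += d
		}
		// terminating/terminated fall through: excluded from both counters
	}
	unknown = 3600 - uptime - downtime // uncovered gaps + excluded states
	return
}

func main() {
	up, down, unk := bucket([]interval{
		{"healthy", 0, 3000},
		{"down", 3000, 3300},
		// no state data for the final 300s -> unknown
	})
	fmt.Println(up, down, unk, up+down+unk == 3600)
}
```

Because the upsert is `INSERT ... ON CONFLICT DO UPDATE`, recomputing a bucket from the same log rows is idempotent, which is what lets the current hour be recomputed on every tick.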

UptimeService

File: internal/service/uptime.go

Computes rolling uptime from materialized hourly buckets. Two query modes:
  • Summary: SUM(uptime_seconds) / SUM(total_seconds) over the requested window
  • History: Group buckets by granularity (hourly or daily) for chart data
Windows are bounded by node.created_at (no credit for time before the node existed) and terminated_at (no penalty for time after termination).

API Endpoints

File: internal/api/handler_uptime.go
| Method | Path | Description |
| --- | --- | --- |
| GET | /api/v1/nodes/{nodeID}/uptime?window={window} | Rolling uptime summary |
| GET | /api/v1/nodes/{nodeID}/uptime/history?window={window}&granularity={granularity} | Per-period breakdown for charts |
Parameters:
  • window: 24h, 7d, 30d (default: 30d)
  • granularity: hourly, daily (default: hourly for 24h, daily otherwise)
Both endpoints require nodes:read scope. Ownership verified via subscription lookup.

Data Tables

| Table | Purpose |
| --- | --- |
| node_state_log | Append-only log of state transitions with entered_at/exited_at intervals |
| node_uptime_hourly | Materialized hourly buckets (uptime_seconds, downtime_seconds, unknown_seconds) |
See also: Domain Model — Database Schema for table definitions and indexes.

Cleanup Services

Cleanup Matrix

| Scenario | Handler | Trigger |
| --- | --- | --- |
| Provisioning fails | ProvisionNodeWorkflow.compensate() | Step failure |
| Node terminates | TerminateNodeWorkflow | API call |
| Pending payment expires | SubscriptionCleaner | Periodic timer |
| Subscription expires | SubscriptionCleaner (2-phase) | Periodic timer |
| Orphaned terraform state | TerraformCleaner | Periodic timer |
| Orphaned key backups | BackupCleaner | Termination + periodic |
| Expired auth tokens | TokenCleaner | Periodic timer |
| Stuck maintenance nodes | MaintenanceCleanup | Periodic timer (30 min timeout) |
| Migration fails | MigrateNodeWorkflow.compensateMigration() | Step failure |

SubscriptionCleaner

File: internal/health/subscription_cleanup.go

Three-stage periodic process:
| Stage | Finds | Action |
| --- | --- | --- |
| handlePendingPaymentExpired | pending_payment older than TTL (30m) | Delete subscription |
| handleExpiredSubscriptions | active where expires_at < now | Transition to expiring, set 24h grace period |
| handleGracePeriodExpiredSubscriptions | expiring where grace_period_expires_at < now | Terminate nodes, delete backups, transition to terminated |
Error handling: Best-effort logging; individual failures don’t block other subscriptions.

BackupCleaner

File: internal/health/backup_cleanup.go

Removes encrypted key backups from S3:
  • Called by SubscriptionCleaner on node termination
  • Periodic CleanupOrphanedBackups() for already-terminated nodes
  • Returns nil if S3 not configured; non-blocking on failures

TerraformCleaner

File: internal/health/terraform_cleanup.go

Removes orphaned Terraform state directories:
  • Builds set of active host IDs (non-terminated + recently-terminated nodes)
  • Only deletes state for hosts NOT in active set
  • Age threshold prevents deletion of recently-stopped hosts

Auth CleanupJob

File: internal/auth/cleanup.go

Removes expired nonces and revoked/expired refresh tokens (default: 1h interval).

MaintenanceCleanup

Periodic loop that detects nodes stuck in the maintenance state beyond a configurable timeout (default: 30 minutes). Nodes can become stuck if an upgrade workflow crashes or Temporal loses track of the child workflow.

Behavior: queries nodes in maintenance state where updated_at exceeds the timeout, transitions stuck nodes back to their previous healthy state, and creates a warning incident.

Configuration: MAINTENANCE_CLEANUP_INTERVAL (check frequency), MAINTENANCE_CLEANUP_TIMEOUT (how long in maintenance before a node is considered stuck).

Workflow Compensation Patterns

| Workflow | Compensation | Cleans Up |
| --- | --- | --- |
| Provision | compensate() | Terraform destroy + key deletion -> state failed |
| Terminate | None (is already cleanup) | Best-effort; partial failures recorded in PartialFailures[] |
| Migrate | compensateMigration() | Destroys NEW infra (if it exists) -> state degraded |
Key pattern: Migration compensation only destroys newly-created infrastructure. Old host preserved until new host confirmed.

Resource Lifecycle

| Resource | Cleaner | Timing |
| --- | --- | --- |
| VM/Infrastructure | DestroyNode activity | Immediate on cleanup |
| Terraform state | TerraformCleaner | ~1h after termination |
| Encryption keys | DeleteNodeKeys activity | Immediate |
| Key backups (S3) | BackupCleaner | Same as termination |
| Pending subscriptions | SubscriptionCleaner | 30m after creation |
| Auth tokens | TokenCleaner | Configured interval |

Post-Upgrade Health Validation

The WaitForHealthValidation activity (called by UpgradeNodeWorkflow) polls node state after an upgrade to confirm the node is healthy. Uses relaxed criteria to avoid false failures during stabilization. Success criteria (ALL must pass):
| # | Criterion | Rationale |
| --- | --- | --- |
| 1 | At least one heartbeat received | Agent alive after upgrade |
| 2 | application_health != critical | App not critically broken |
| 3 | sync_status != stalled | Node making progress |
| 4 | Node state not down or failed | Infrastructure healthy |
| 5 | No crashloop (< N restarts within health_wait_duration) | Detects state incompatibility after upgrade |
Does not require sync_status == synced (syncing is expected after restart) or application_health == ok (degraded is tolerated during stabilization).

Crashloop detection: counts restarts within the health wait window. If the threshold is exceeded, the upgrade is marked failed and compensation triggers. This is critical for detecting state-incompatible upgrades (Tier 3 rollback scenarios).
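The relaxed criteria combine into a single predicate over the polled snapshot. The struct shape and function name below are illustrative of what the activity checks, not its actual signature:

```go
package main

import "fmt"

// nodeStatus is an illustrative snapshot of what WaitForHealthValidation polls.
type nodeStatus struct {
	HeartbeatsSeen    int
	ApplicationHealth string // ok | degraded | critical
	SyncStatus        string // synced | syncing | stalled
	State             string
	RestartsInWindow  int
}

// upgradeHealthy requires all five criteria from the table, while
// tolerating the post-restart states (syncing, degraded).
func upgradeHealthy(s nodeStatus, crashloopThreshold int) bool {
	return s.HeartbeatsSeen >= 1 && // 1: agent alive after upgrade
		s.ApplicationHealth != "critical" && // 2: degraded is tolerated
		s.SyncStatus != "stalled" && // 3: syncing is expected after restart
		s.State != "down" && s.State != "failed" && // 4: infra healthy
		s.RestartsInWindow < crashloopThreshold // 5: crashloop detection
}

func main() {
	// A freshly upgraded node that is still catching up passes:
	fmt.Println(upgradeHealthy(nodeStatus{HeartbeatsSeen: 3, ApplicationHealth: "degraded", SyncStatus: "syncing", State: "syncing"}, 3))
}
```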
See also:
  • Overview — Health evaluator service description
  • Domain Model — Node state machine, incident model
  • Workflows — Temporal workflow compensation patterns
  • Extending — Adding health event handlers, notification channels