How to Read This Codebase

  1. Domain models: internal/models/node.go, internal/models/subscription.go. Understand state machines and business rules.
  2. Request flow: cmd/api-server/main.go -> internal/app/bootstrap/api_server.go -> internal/api/router.go -> internal/api/middleware.go -> handlers -> services -> repositories.
  3. Workflows: internal/workflows/provision.go, internal/activities/. Understand the Temporal orchestration pattern.
  4. Upgrade rollout: internal/workflows/rollout.go, internal/opsagent/upgrade/. Understand the three-level workflow hierarchy and the agent-side three-layer architecture.
  5. Agent communication: proto/agent.proto, internal/grpc/server.go, cmd/ops-agent/main.go. Understand bidirectional gRPC communication.
  6. Health monitoring: internal/health/leader.go -> internal/health/machine.go -> internal/health/heartbeat.go -> internal/health/outbox.go -> internal/incident/service.go -> internal/incident/notifier/. Uptime: internal/uptime/state_log_handler.go -> internal/uptime/worker.go.
  7. Chain configuration: hoodcloud-chain-configs/chains/. Understand the declarative, chain-agnostic approach.

Key Files

| File | Purpose |
| --- | --- |
| internal/models/node.go | Core domain model — node states, transitions |
| internal/models/subscription.go | Subscription lifecycle states |
| internal/contracts/ | Canonical interfaces (repositories, services) |
| internal/api/middleware.go | Dual auth middleware (JWT + API key) |
| internal/api/router.go | HTTP routing and middleware stack |
| internal/workflows/provision.go | Provisioning workflow (compensation on failure) |
| internal/health/leader.go | PostgreSQL advisory lock leader election (generic) |
| internal/health/machine.go | NodeHealthMachine (state transitions, outbox events) |
| internal/health/heartbeat.go | Pure heartbeat evaluation function |
| internal/incident/service.go | Incident lifecycle (categorize, upsert, escalate, resolve) |
| proto/agent.proto | gRPC protocol definition |
| internal/opsagent/upgrade/action.go | UpgradeAction interface |
| internal/opsagent/upgrade/adapter.go | RuntimeAdapter interface |
| internal/opsagent/upgrade/executor.go | Upgrade executor (sequencing, compensation) |
| internal/opsagent/upgrade/sequences.go | Canonical action sequence |
| internal/opsagent/upgrade/actions/ | 8 action primitive implementations |
| internal/opsagent/upgrade/adapters/ | Systemd and Docker runtime adapters |
| internal/workflows/rollout.go | RolloutWorkflow (per-component batching) |
| internal/workflows/rollout_group.go | RolloutGroupWorkflow (multi-binary coordination) |
| internal/workflows/upgrade_node.go | UpgradeNodeWorkflow (per-node child) |
| internal/activities/upgrade.go | Upgrade activity implementations |
| internal/activities/types.go | Activity input/output types (includes upgrade types) |
| internal/upgrade/manifest/manifest.go | Upgrade manifest types (auto-populated from YAML) |
| internal/upgrade/manifest/reader.go | Manifest file reader (loads from chain config directory) |
| internal/models/rollout.go | Rollout, RolloutGroup, RolloutNode models |
| internal/uptime/state_log_handler.go | StateLogHandler — records transitions to node_state_log |
| internal/uptime/worker.go | UptimeWorker — materializes hourly uptime buckets |
| internal/service/uptime.go | UptimeService — computes rolling uptime from buckets |
| internal/api/handler_uptime.go | UptimeHandler — uptime API endpoints |
| internal/contracts/uptime.go | UptimeRepository + UptimeService interfaces |

Payment Service Key Files

| File | Purpose |
| --- | --- |
| payment-service/internal/adapters/provider.go | Provider interface |
| payment-service/internal/service/payment.go | Payment business logic |
| payment-service/internal/service/pricing.go | Pricing service |
| internal/consumers/payment_consumer.go | Main app NATS consumer |

Common Patterns

Repository Pattern

All database interactions go through repositories defined in internal/contracts/repository.go and implemented in internal/database/. This isolates SQL from business logic and keeps services testable with mocks.

Service Layer

Services in internal/service/ coordinate between repositories and external systems. They encapsulate business rules and orchestrate multi-step operations.

Temporal Workflows

Durable orchestrations live in internal/workflows/. They survive process restarts, retry automatically, and compensate on failure. Activities in internal/activities/ are idempotent.

Declarative Configuration

Chains defined in YAML (hoodcloud-chain-configs/), not code. Observation collectors, health policies, and recipes are all configuration-driven.

Event-Driven Architecture

All state transitions and dimension changes emit events via health_event_outbox:
Evaluator -> NodeHealthMachine -> health_event_outbox -> OutboxWorker -> handlers
Handlers (incident service, migration handler) subscribe without coupling to evaluation logic.

Graceful Degradation

Optional services degrade gracefully — NATS, telemetry, and other optional subsystems fall back to no-op implementations on initialization failure.

Error Handling

Error Types and Handling

| Category | Examples | Handling |
| --- | --- | --- |
| Infrastructure | Terraform provisioning fails, cloud API unavailable | Temporal retry policy (exponential backoff). On exhaustion: compensate() -> state failed |
| Agent Communication | Agent never registers, heartbeat timeout, Redis queue full | Wait with retries (WaitForAgentReady). Mark DOWN on timeout. Trigger migration. |
| Key Management | Key generation fails, S3 backup upload fails | Workflow failure -> compensate (delete partial state) -> state failed. Manual intervention required. |
| Database | Connection pool exhausted, deadlock, constraint violation | pgx retries transient errors. Application retries with backoff. Structured logging. |

Workflow Error Propagation

Activity fails -> Temporal retry policy (3 attempts, exponential backoff)
  -> All retries fail -> workflow receives error
  -> compensate() called
  -> Compensating activities (delete keys, destroy infra, update state to "failed")
  -> Workflow returns error result
  -> Metrics incremented, logs written
Compensation code: internal/workflows/provision.go (compensate()), internal/workflows/migrate.go (compensateMigration()).

HTTP Error Responses

{
  "error": "validation failed",
  "details": "chain_profile_id is required"
}
| Status | Usage |
| --- | --- |
| 400 | Validation errors |
| 401 | Missing or invalid auth |
| 404 | Resource not found |
| 409 | State conflict |
| 429 | Rate limit exceeded |
| 500 | Unexpected errors |

Temporal Retry Policies

| Context | Timeout | Retries | Backoff |
| --- | --- | --- | --- |
| Default activity | 10m start-to-close, 1m heartbeat | 3 attempts | 1s initial, 2.0 coefficient, 1m max |
| Terraform | 15m start-to-close, 2m heartbeat | 3 attempts | 10s initial, 2.0 coefficient, 2m max |
| WaitForAgent | 5m start-to-close | 20 attempts | 5s initial, 1.5 coefficient, 30s max |

Tracing Code Paths

“How does a node get marked as DOWN?”

  1. Health evaluator leader fires every 30s (leader-gated via advisory lock) -> ListSnapshots() (3-way JOIN, includes state_version)
  2. EvaluateHeartbeat() per node — pure function, no I/O. If heartbeat timed out: increment ConsecutiveFailures. If >= ConsecutiveFailuresForDown (3): return transition to DOWN.
  3. NodeHealthMachine.ApplyHeartbeatDecisions() — batch update with optimistic locking on both node_health_state.version and nodes.state_version. Inserts HealthTransitionEvent into health_event_outbox.
  4. OutboxWorker (NOT leader-gated, uses FOR UPDATE SKIP LOCKED) dispatches to CompositeTransitionHandler -> incident service creates node_down incident -> migration handler evaluates grace period and cooldown, triggers migration with deterministic workflow ID (migrate-node-{nodeID}-{healthStateVersion}).

“How does a node get upgraded?”

  1. Operator creates rollout via POST /api/v1/rollouts — either with upgrade_id (auto mode: manifest auto-populates binary details, state compatibility, config changes) or with all fields explicit (manual mode)
  2. Operator starts rollout via POST /api/v1/rollouts/{id}/start
  3. RolloutWorkflow launched in Temporal → ResolveRolloutTargets assigns nodes to batches
  4. Per batch: launches UpgradeNodeWorkflow children in parallel
  5. UpgradeNodeWorkflow: PreCheckNode → UpdateNodeState(maintenance) → SendUpgradeCommand
  6. Agent receives COMMAND_TYPE_UPGRADE (value 9) in heartbeat → executor runs 8 actions via RuntimeAdapter
  7. WaitForCommandCompletion → UpdateNodeBinaryVersion → WaitForHealthValidation → UpdateNodeState(syncing)
  8. On failure: agent runs per-action compensation (reverse order), workflow marks node failed/rolled_back

“How does uptime get computed?”

  1. State transition emitted by NodeHealthMachine -> health_event_outbox
  2. OutboxWorker dispatches to CompositeTransitionHandler -> StateLogHandler.OnTransition() closes previous node_state_log entry, inserts new entry
  3. UptimeWorker (5min cycle) finds last complete bucket per node, computes hourly buckets from node_state_log entries, upserts into node_uptime_hourly
  4. API request: UptimeService queries node_uptime_hourly with SUM(uptime_seconds) / SUM(total_seconds) over requested window

“How does an agent send a heartbeat?”

  1. Agent heartbeat loop (15s) -> sendHeartbeat()
  2. gRPC ControlPlaneService.Heartbeat() with node_id
  3. Control plane updates agent_registrations.last_seen_at, returns pending commands from Redis queue
  4. Agent executes commands, reports results via ReportEvent()

Debugging Tips

Database Migrations

Migrations are handled by the dedicated cmd/migrate binary, not at service startup. Run it as a pre-deploy step:
go run cmd/migrate/main.go          # Apply pending migrations
If a service starts before migrations are applied, it fails on schema mismatches.

Leader Election

# Check which instance is the leader
curl -s http://localhost:9090/metrics | grep health_evaluator_is_leader

# Verify advisory lock is held
psql -c "SELECT * FROM pg_locks WHERE locktype = 'advisory';"

Logging

LOG_LEVEL=debug ./bin/api-server
Structured logs include request IDs, node IDs, timestamps.

Database Guardrails

Per-service connection pools and statement timeouts are configured to prevent resource exhaustion:
| Service | MaxConns | Statement Timeout |
| --- | --- | --- |
| API Server | 10 | 2s |
| Auth Server | 5 | 2s |
| Agent Gateway | 10 | 2s |
| Orchestrator | 10 | 10s |
| Health Evaluator | 5 | 5s |
Observability: A pgx query tracer emits db_query_duration_seconds histograms per service. Statement timeout errors are counted via db_statement_timeout_total.

Database Queries

-- Node and health state
SELECT id, chain_profile_id, state, last_heartbeat FROM nodes;
SELECT node_id, consecutive_failures, down_since, current_state FROM node_health_state;

-- Active incidents
SELECT id, node_id, category, severity, status, occurrence_count
FROM incidents WHERE status NOT IN ('resolved', 'auto_resolved');

-- Pending health events
SELECT id, event_type, status FROM health_event_outbox WHERE status = 'pending';

-- Agent connectivity
SELECT node_id, agent_version, last_seen_at FROM agent_registrations;

-- Uptime state log (recent transitions)
SELECT node_id, state, entered_at, exited_at, trigger
FROM node_state_log WHERE node_id = '<uuid>' ORDER BY entered_at DESC LIMIT 20;

-- Uptime hourly buckets
SELECT bucket_hour, uptime_seconds, downtime_seconds, unknown_seconds, is_complete
FROM node_uptime_hourly WHERE node_id = '<uuid>' ORDER BY bucket_hour DESC LIMIT 24;

Distributed Traces (Tempo)

When OTEL_ENABLED=true:
  1. Open Grafana -> Explore -> Tempo
  2. Search by service name or trace ID
  3. TraceQL queries: {resource.service.name="api-server"}, {status=error}
Tempo is configured with log correlation — click “Logs for this span” to jump to Loki logs.

Temporal Workflows

  1. Open Temporal UI: http://localhost:8233
  2. Search for workflow ID: provision-{uuid}
  3. View execution history (activities, retries, errors)

Redis Command Queue

redis-cli KEYS hoodcloud:commands:*
redis-cli LRANGE hoodcloud:commands:{node_id} 0 -1

NATS Streams

nats stream info HOODCLOUD_METRICS
nats stream info PAYMENTS
nats consumer info PAYMENTS main-app-payment

Victoria Metrics

curl "http://localhost:8428/api/v1/query?query=up{node_id='...'}"

Terraform State

# Local backend
cd /app/terraform-state/{host_id} && terraform show

# S3 backend (TERRAFORM_STATE_BACKEND=s3)
aws s3 ls s3://hoodcloud-terraform-state/nodes/ --recursive

Status Page (Gatus)

http://localhost:8081 — health dashboard for all services, infrastructure, and observability stack.

Running Tests

Unit Tests

go test ./...                          # All
go test ./internal/service/...         # Specific package
go test -v ./internal/workflows/...    # Verbose

Integration Tests

Uses testcontainers (Docker required):
go test -tags=integration ./tests/integration/...
Available suites: repository_test.go (NodeRepository CRUD, state transitions), commandqueue_test.go (ProgressStore, stall detection).

E2E Tests

Requires full Docker Compose environment:
docker compose -f tests/e2e/docker-compose.e2e.yml up -d
go test -tags=e2e ./tests/e2e/...
docker compose -f tests/e2e/docker-compose.e2e.yml down -v

Payment Service Tests

cd payment-service
make test