How to Read This Codebase
Recommended Reading Order
- Domain models: internal/models/node.go, internal/models/subscription.go. Understand state machines and business rules. See: Domain Model
- Request flow: cmd/api-server/main.go -> internal/app/bootstrap/api_server.go -> internal/api/router.go -> internal/api/middleware.go -> handlers -> services -> repositories.
- Workflows: internal/workflows/provision.go, internal/activities/. Understand the Temporal orchestration pattern. See: Workflows
- Upgrade rollout: internal/workflows/rollout.go, internal/opsagent/upgrade/. Understand the three-level workflow hierarchy and the agent-side three-layer architecture.
- Agent communication: proto/agent.proto, internal/grpc/server.go, cmd/ops-agent/main.go. Understand bidirectional gRPC communication.
- Health monitoring: internal/health/leader.go -> internal/health/machine.go -> internal/health/heartbeat.go -> internal/health/outbox.go -> internal/incident/service.go -> internal/incident/notifier/. Uptime: internal/uptime/state_log_handler.go -> internal/uptime/worker.go. See: Health and Incidents
- Chain configuration: hoodcloud-chain-configs/chains/. Understand the declarative, chain-agnostic approach.
Key Files
| File | Purpose |
|---|---|
| internal/models/node.go | Core domain model: node states, transitions |
| internal/models/subscription.go | Subscription lifecycle states |
| internal/contracts/ | Canonical interfaces (repositories, services) |
| internal/api/middleware.go | Dual auth middleware (JWT + API key) |
| internal/api/router.go | HTTP routing and middleware stack |
| internal/workflows/provision.go | Provisioning workflow (compensation on failure) |
| internal/health/leader.go | PostgreSQL advisory lock leader election (generic) |
| internal/health/machine.go | NodeHealthMachine (state transitions, outbox events) |
| internal/health/heartbeat.go | Pure heartbeat evaluation function |
| internal/incident/service.go | Incident lifecycle (categorize, upsert, escalate, resolve) |
| proto/agent.proto | gRPC protocol definition |
| internal/opsagent/upgrade/action.go | UpgradeAction interface |
| internal/opsagent/upgrade/adapter.go | RuntimeAdapter interface |
| internal/opsagent/upgrade/executor.go | Upgrade executor (sequencing, compensation) |
| internal/opsagent/upgrade/sequences.go | Canonical action sequence |
| internal/opsagent/upgrade/actions/ | 8 action primitive implementations |
| internal/opsagent/upgrade/adapters/ | Systemd and Docker runtime adapters |
| internal/workflows/rollout.go | RolloutWorkflow (per-component batching) |
| internal/workflows/rollout_group.go | RolloutGroupWorkflow (multi-binary coordination) |
| internal/workflows/upgrade_node.go | UpgradeNodeWorkflow (per-node child) |
| internal/activities/upgrade.go | Upgrade activity implementations |
| internal/activities/types.go | Activity input/output types (includes upgrade types) |
| internal/upgrade/manifest/manifest.go | Upgrade manifest types (auto-populated from YAML) |
| internal/upgrade/manifest/reader.go | Manifest file reader (loads from chain config directory) |
| internal/models/rollout.go | Rollout, RolloutGroup, RolloutNode models |
| internal/uptime/state_log_handler.go | StateLogHandler: records transitions to node_state_log |
| internal/uptime/worker.go | UptimeWorker: materializes hourly uptime buckets |
| internal/service/uptime.go | UptimeService: computes rolling uptime from buckets |
| internal/api/handler_uptime.go | UptimeHandler: uptime API endpoints |
| internal/contracts/uptime.go | UptimeRepository + UptimeService interfaces |
Payment Service Key Files
| File | Purpose |
|---|---|
| payment-service/internal/adapters/provider.go | Provider interface |
| payment-service/internal/service/payment.go | Payment business logic |
| payment-service/internal/service/pricing.go | Pricing service |
| internal/consumers/payment_consumer.go | Main app NATS consumer |
Common Patterns
Repository Pattern
All database interactions go through repositories defined in internal/contracts/repository.go and implemented in internal/database/. This isolates SQL from business logic and keeps services testable with mocks.
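To make the testability point concrete, here is a minimal sketch of the pattern. The NodeRepository interface, the Node fields, ErrNotFound, and markDown are hypothetical names chosen for illustration; the real contracts in internal/contracts/repository.go will differ, and context parameters are omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a simplified stand-in for the real domain model (hypothetical fields).
type Node struct {
	ID    string
	State string
}

// ErrNotFound is a hypothetical sentinel error for missing rows.
var ErrNotFound = errors.New("node not found")

// NodeRepository is an illustrative contract. Business logic depends on this
// interface and never on SQL directly.
type NodeRepository interface {
	Get(id string) (Node, error)
	UpdateState(id, state string) error
}

// memoryNodeRepo is an in-memory fake that replaces the database in tests.
type memoryNodeRepo struct {
	nodes map[string]Node
}

func newMemoryNodeRepo() *memoryNodeRepo {
	return &memoryNodeRepo{nodes: map[string]Node{}}
}

func (r *memoryNodeRepo) Get(id string) (Node, error) {
	n, ok := r.nodes[id]
	if !ok {
		return Node{}, ErrNotFound
	}
	return n, nil
}

func (r *memoryNodeRepo) UpdateState(id, state string) error {
	n, ok := r.nodes[id]
	if !ok {
		return ErrNotFound
	}
	n.State = state
	r.nodes[id] = n
	return nil
}

// markDown is service-level logic written only against the interface.
func markDown(repo NodeRepository, id string) error {
	if _, err := repo.Get(id); err != nil {
		return fmt.Errorf("load node %s: %w", id, err)
	}
	return repo.UpdateState(id, "down")
}

func main() {
	repo := newMemoryNodeRepo()
	repo.nodes["n1"] = Node{ID: "n1", State: "running"}
	if err := markDown(repo, "n1"); err != nil {
		fmt.Println("error:", err)
		return
	}
	n, _ := repo.Get("n1")
	fmt.Println(n.ID, n.State)
}
```

Because markDown only sees the interface, the same function can run against a PostgreSQL-backed implementation in production and the in-memory fake in unit tests.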
Service Layer
Services in internal/service/ coordinate between repositories and external systems. They encapsulate business rules and orchestrate operations.
Temporal Workflows
Durable orchestrations live in internal/workflows/. They survive process restarts and provide automatic retries and compensation on failure. Activities in internal/activities/ are idempotent.
Declarative Configuration
Chains are defined in YAML (hoodcloud-chain-configs/), not code. Observation collectors, health policies, and recipes are all configuration-driven.
Event-Driven Architecture
All state transitions and dimension changes emit events via health_event_outbox.
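The outbox idea can be sketched in memory as follows. This is an illustration under stated assumptions, not the project's implementation: the Outbox type stands in for the health_event_outbox table, and TransitionEvent, TransitionHandler, HandlerFunc, RecordTransition, and Drain are hypothetical names. In the real system the state update and the outbox insert would share one database transaction.

```go
package main

import "fmt"

// TransitionEvent is a simplified stand-in for an outbox row (hypothetical fields).
type TransitionEvent struct {
	NodeID string
	From   string
	To     string
}

// TransitionHandler is a hypothetical consumer of transition events.
type TransitionHandler interface {
	OnTransition(ev TransitionEvent) error
}

// HandlerFunc adapts a plain function to TransitionHandler.
type HandlerFunc func(ev TransitionEvent) error

func (f HandlerFunc) OnTransition(ev TransitionEvent) error { return f(ev) }

// Outbox stands in for the health_event_outbox table.
type Outbox struct {
	pending []TransitionEvent
}

// RecordTransition applies the state change and enqueues its event in one
// step, mirroring a single database transaction.
func (o *Outbox) RecordTransition(states map[string]string, nodeID, to string) {
	from := states[nodeID]
	states[nodeID] = to
	o.pending = append(o.pending, TransitionEvent{NodeID: nodeID, From: from, To: to})
}

// Drain dispatches each pending event to every handler. Events whose handlers
// all succeed are removed; failed events stay queued for the next pass.
func (o *Outbox) Drain(handlers ...TransitionHandler) int {
	var remaining []TransitionEvent
	delivered := 0
	for _, ev := range o.pending {
		ok := true
		for _, h := range handlers {
			if err := h.OnTransition(ev); err != nil {
				ok = false
				break
			}
		}
		if ok {
			delivered++
		} else {
			remaining = append(remaining, ev)
		}
	}
	o.pending = remaining
	return delivered
}

func main() {
	states := map[string]string{"n1": "healthy"}
	outbox := &Outbox{}
	outbox.RecordTransition(states, "n1", "down")

	logger := HandlerFunc(func(ev TransitionEvent) error {
		fmt.Printf("node %s: %s -> %s\n", ev.NodeID, ev.From, ev.To)
		return nil
	})
	fmt.Println("delivered:", outbox.Drain(logger))
}
```

Since a failed event is retried against all handlers, delivery in this sketch is at-least-once, so handlers should tolerate seeing the same event twice.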
Graceful Degradation
Optional services degrade gracefully: NATS, telemetry, and other optional subsystems fall back to no-op implementations on initialization failure.
Error Handling
Error Types and Handling
| Category | Examples | Handling |
|---|---|---|
| Infrastructure | Terraform provisioning fails, cloud API unavailable | Temporal retry policy (exponential backoff). On exhaustion: compensate() -> state failed |
| Agent Communication | Agent never registers, heartbeat timeout, Redis queue full | Wait with retries (WaitForAgentReady). Mark DOWN on timeout. Trigger migration. |
| Key Management | Key generation fails, S3 backup upload fails | Workflow failure -> compensate (delete partial state) -> state failed. Manual intervention required. |
| Database | Connection pool exhausted, deadlock, constraint violation | pgx retries transient errors. Application retries with backoff. Structured logging. |
Workflow Error Propagation
Compensation logic lives in internal/workflows/provision.go (compensate()) and internal/workflows/migrate.go (compensateMigration()).
HTTP Error Responses
| Status | Usage |
|---|---|
| 400 | Validation errors |
| 401 | Missing or invalid auth |
| 404 | Resource not found |
| 409 | State conflict |
| 429 | Rate limit exceeded |
| 500 | Unexpected errors |
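One common way to keep this mapping in a single place is a function from error to status code. The sketch below assumes hypothetical sentinel errors (ErrValidation, ErrUnauthorized, and so on) and a hypothetical statusFor helper; the handlers in this codebase may classify errors differently.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Hypothetical sentinel errors; the real codebase may model these differently.
var (
	ErrValidation   = errors.New("validation failed")
	ErrUnauthorized = errors.New("missing or invalid auth")
	ErrNotFound     = errors.New("resource not found")
	ErrConflict     = errors.New("state conflict")
	ErrRateLimited  = errors.New("rate limit exceeded")
)

// statusFor maps an error to the HTTP status listed in the table above.
// Anything unrecognized is treated as an unexpected error (500).
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrValidation):
		return http.StatusBadRequest
	case errors.Is(err, ErrUnauthorized):
		return http.StatusUnauthorized
	case errors.Is(err, ErrNotFound):
		return http.StatusNotFound
	case errors.Is(err, ErrConflict):
		return http.StatusConflict
	case errors.Is(err, ErrRateLimited):
		return http.StatusTooManyRequests
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	// Wrapped errors still match because statusFor uses errors.Is.
	wrapped := fmt.Errorf("start rollout: %w", ErrConflict)
	fmt.Println(statusFor(wrapped))
	fmt.Println(statusFor(errors.New("disk full")))
}
```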
Temporal Retry Policies
| Context | Timeout | Retries | Backoff |
|---|---|---|---|
| Default activity | 10m start-to-close, 1m heartbeat | 3 attempts | 1s initial, 2.0 coefficient, 1m max |
| Terraform | 15m start-to-close, 2m heartbeat | 3 attempts | 10s initial, 2.0 coefficient, 2m max |
| WaitForAgent | 5m start-to-close | 20 attempts | 5s initial, 1.5 coefficient, 30s max |
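To see what these numbers mean in practice, the sketch below computes the wait before each retry as initial * coefficient^(retry-1), capped at the maximum interval, which is the usual exponential backoff rule. RetryPolicy and DelayBefore are local illustrative types, not the Temporal SDK's API; only the parameter values come from the table.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// RetryPolicy holds the backoff parameters from the table above.
// It is a local illustration type, not the Temporal SDK struct.
type RetryPolicy struct {
	InitialInterval    time.Duration
	BackoffCoefficient float64
	MaximumInterval    time.Duration
	MaximumAttempts    int
}

// Policies transcribed from the table.
var (
	defaultActivityPolicy = RetryPolicy{
		InitialInterval:    time.Second,
		BackoffCoefficient: 2.0,
		MaximumInterval:    time.Minute,
		MaximumAttempts:    3,
	}
	waitForAgentPolicy = RetryPolicy{
		InitialInterval:    5 * time.Second,
		BackoffCoefficient: 1.5,
		MaximumInterval:    30 * time.Second,
		MaximumAttempts:    20,
	}
)

// DelayBefore returns the wait before the given retry (1 = first retry),
// computed as initial * coefficient^(retry-1) and capped at the maximum.
func (p RetryPolicy) DelayBefore(retry int) time.Duration {
	d := float64(p.InitialInterval) * math.Pow(p.BackoffCoefficient, float64(retry-1))
	if d > float64(p.MaximumInterval) {
		return p.MaximumInterval
	}
	return time.Duration(d)
}

func main() {
	// With N attempts there are at most N-1 retries.
	for retry := 1; retry < waitForAgentPolicy.MaximumAttempts; retry++ {
		fmt.Printf("retry %d after %s\n", retry, waitForAgentPolicy.DelayBefore(retry))
	}
}
```

For WaitForAgent the delay reaches the 30s cap by the sixth retry, so most of the 20 attempts are spaced 30s apart.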
Tracing Code Paths
"How does a node get marked as DOWN?"
- Health evaluator leader fires every 30s (leader-gated via advisory lock) -> ListSnapshots() (3-way JOIN, includes state_version)
- EvaluateHeartbeat() per node: pure function, no I/O. If the heartbeat timed out: increment ConsecutiveFailures. If >= ConsecutiveFailuresForDown (3): return transition to DOWN.
- NodeHealthMachine.ApplyHeartbeatDecisions(): batch update with optimistic locking on both node_health_state.version and nodes.state_version. Inserts HealthTransitionEvent into health_event_outbox.
- OutboxWorker (NOT leader-gated, uses FOR UPDATE SKIP LOCKED) dispatches to CompositeTransitionHandler -> incident service creates node_down incident -> migration handler evaluates grace period and cooldown, triggers migration with deterministic workflow ID (migrate-node-{nodeID}-{healthStateVersion}).
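The evaluation step of this path can be sketched as a pure function. The Snapshot and Decision shapes, the timeout argument, the reset of the failure counter on a fresh heartbeat, and the MigrationWorkflowID helper are assumptions made for illustration; only the threshold of 3 and the workflow ID format come from the steps above.

```go
package main

import (
	"fmt"
	"time"
)

// ConsecutiveFailuresForDown is the threshold named in the trace above.
const ConsecutiveFailuresForDown = 3

// Snapshot is a simplified, hypothetical view of one node's health row.
type Snapshot struct {
	NodeID              string
	State               string
	LastHeartbeat       time.Time
	ConsecutiveFailures int
	HealthStateVersion  int64
}

// Decision is the hypothetical result of evaluating one snapshot.
type Decision struct {
	ConsecutiveFailures int
	TransitionTo        string // empty when no transition is needed
}

// EvaluateHeartbeat is pure: the clock and the timeout are passed in, so the
// function performs no I/O and is trivial to unit test.
func EvaluateHeartbeat(s Snapshot, now time.Time, timeout time.Duration) Decision {
	if now.Sub(s.LastHeartbeat) <= timeout {
		// Assumption: a fresh heartbeat resets the failure counter.
		return Decision{ConsecutiveFailures: 0}
	}
	failures := s.ConsecutiveFailures + 1
	d := Decision{ConsecutiveFailures: failures}
	if failures >= ConsecutiveFailuresForDown && s.State != "down" {
		d.TransitionTo = "down"
	}
	return d
}

// MigrationWorkflowID builds the deterministic ID format
// migrate-node-{nodeID}-{healthStateVersion}.
func MigrationWorkflowID(nodeID string, healthStateVersion int64) string {
	return fmt.Sprintf("migrate-node-%s-%d", nodeID, healthStateVersion)
}

func main() {
	now := time.Now()
	s := Snapshot{
		NodeID:              "node-1",
		State:               "healthy",
		LastHeartbeat:       now.Add(-2 * time.Minute),
		ConsecutiveFailures: 2,
		HealthStateVersion:  41,
	}
	d := EvaluateHeartbeat(s, now, 45*time.Second) // the 45s timeout is an example value
	fmt.Println(d.ConsecutiveFailures, d.TransitionTo)
	fmt.Println(MigrationWorkflowID(s.NodeID, s.HealthStateVersion))
}
```

A deterministic workflow ID means a duplicate trigger for the same node and health state version maps to the same workflow rather than starting a second migration.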
"How does a node get upgraded?"
- Operator creates rollout via POST /api/v1/rollouts: either with upgrade_id (auto mode: manifest auto-populates binary details, state compatibility, config changes) or with all fields explicit (manual mode)
- Operator starts rollout via POST /api/v1/rollouts/{id}/start
- RolloutWorkflow launched in Temporal -> ResolveRolloutTargets assigns nodes to batches
- Per batch: launches UpgradeNodeWorkflow children in parallel
- UpgradeNodeWorkflow: PreCheckNode -> UpdateNodeState(maintenance) -> SendUpgradeCommand
- Agent receives COMMAND_TYPE_UPGRADE (value 9) in heartbeat -> executor runs 8 actions via RuntimeAdapter
- WaitForCommandCompletion -> UpdateNodeBinaryVersion -> WaitForHealthValidation -> UpdateNodeState(syncing)
- On failure: agent runs per-action compensation (reverse order), workflow marks node failed/rolled_back
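The agent-side failure handling (run actions in sequence, then compensate in reverse order) can be sketched as below. The UpgradeAction method set, the Run function, and the toy step type with its action names are hypothetical; the real interface in internal/opsagent/upgrade/action.go may look different. The sketch also assumes that only actions that completed successfully are compensated.

```go
package main

import (
	"errors"
	"fmt"
)

// UpgradeAction is a simplified, hypothetical version of the agent's action interface.
type UpgradeAction interface {
	Name() string
	Execute() error
	Compensate() error
}

// Run executes actions in order. If one fails, every action that already
// completed is compensated in reverse order, and the combined error is returned.
func Run(actions []UpgradeAction) error {
	for i, a := range actions {
		if err := a.Execute(); err != nil {
			for j := i - 1; j >= 0; j-- {
				if cerr := actions[j].Compensate(); cerr != nil {
					err = errors.Join(err, fmt.Errorf("compensate %s: %w", actions[j].Name(), cerr))
				}
			}
			return fmt.Errorf("action %s failed: %w", a.Name(), err)
		}
	}
	return nil
}

// step is a toy action that appends what it did to a shared trace.
type step struct {
	name  string
	fail  bool
	trace *[]string
}

func (s step) Name() string { return s.name }

func (s step) Execute() error {
	if s.fail {
		return errors.New("simulated failure")
	}
	*s.trace = append(*s.trace, "do:"+s.name)
	return nil
}

func (s step) Compensate() error {
	*s.trace = append(*s.trace, "undo:"+s.name)
	return nil
}

func main() {
	var trace []string
	err := Run([]UpgradeAction{
		step{"stop-service", false, &trace},
		step{"swap-binary", false, &trace},
		step{"start-service", true, &trace},
	})
	fmt.Println(err)
	fmt.Println(trace)
}
```

Compensating in reverse order undoes the most recent change first, which matters when later actions depend on the effects of earlier ones.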
"How does uptime get computed?"
- State transition emitted by NodeHealthMachine -> health_event_outbox
- OutboxWorker dispatches to CompositeTransitionHandler -> StateLogHandler.OnTransition() closes previous node_state_log entry, inserts new entry
- UptimeWorker (5min cycle) finds last complete bucket per node, computes hourly buckets from node_state_log entries, upserts into node_uptime_hourly
- API request: UptimeService queries node_uptime_hourly with SUM(uptime_seconds)/SUM(total_seconds) over requested window
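The final aggregation is a ratio of sums rather than an average of per-hour ratios. A minimal sketch of that arithmetic follows; HourlyBucket and UptimePercent are hypothetical names, and only the two column names come from the query above.

```go
package main

import "fmt"

// HourlyBucket mirrors the two node_uptime_hourly columns used by the query.
type HourlyBucket struct {
	UptimeSeconds int64
	TotalSeconds  int64
}

// UptimePercent computes SUM(uptime_seconds)/SUM(total_seconds) as a
// percentage. ok is false when the window contains no observed time.
func UptimePercent(buckets []HourlyBucket) (pct float64, ok bool) {
	var up, total int64
	for _, b := range buckets {
		up += b.UptimeSeconds
		total += b.TotalSeconds
	}
	if total == 0 {
		return 0, false
	}
	return 100 * float64(up) / float64(total), true
}

func main() {
	buckets := []HourlyBucket{
		{UptimeSeconds: 3600, TotalSeconds: 3600}, // fully up
		{UptimeSeconds: 1800, TotalSeconds: 3600}, // half an hour of downtime
		{UptimeSeconds: 600, TotalSeconds: 600},   // partial bucket
	}
	pct, ok := UptimePercent(buckets)
	fmt.Printf("%.2f%% (ok=%v)\n", pct, ok)
}
```

Summing before dividing weights each bucket by how much time it actually covers, so a partial bucket does not count as much as a full hour.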
"How does an agent send a heartbeat?"
- Agent heartbeat loop (15s) -> sendHeartbeat()
- gRPC ControlPlaneService.Heartbeat() with node_id
- Control plane updates agent_registrations.last_seen_at, returns pending commands from Redis queue
- Agent executes commands, reports results via ReportEvent()
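The loop above can be sketched without gRPC by hiding the client behind an interface. Everything here is hypothetical scaffolding for illustration: the Command type, the HeartbeatClient interface, runHeartbeats, and the fake client are not the generated types from proto/agent.proto. The tick channel is injected so the loop can be exercised without waiting 15 seconds.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Command is a hypothetical stand-in for a pending command message.
type Command struct{ ID, Type string }

// HeartbeatClient is a hypothetical stand-in for the generated gRPC client.
type HeartbeatClient interface {
	Heartbeat(ctx context.Context, nodeID string) ([]Command, error)
	ReportEvent(ctx context.Context, commandID, result string) error
}

// runHeartbeats sends one heartbeat per tick until ctx is cancelled or the
// tick channel closes, executing and reporting any commands it receives.
func runHeartbeats(ctx context.Context, c HeartbeatClient, nodeID string,
	ticks <-chan time.Time, exec func(Command) string) {
	for {
		select {
		case <-ctx.Done():
			return
		case _, ok := <-ticks:
			if !ok {
				return
			}
			cmds, err := c.Heartbeat(ctx, nodeID)
			if err != nil {
				continue // a failed heartbeat is simply retried on the next tick
			}
			for _, cmd := range cmds {
				_ = c.ReportEvent(ctx, cmd.ID, exec(cmd))
			}
		}
	}
}

// fakeClient returns its queued commands once and records reported results.
type fakeClient struct {
	queue   []Command
	reports []string
}

func (f *fakeClient) Heartbeat(_ context.Context, _ string) ([]Command, error) {
	cmds := f.queue
	f.queue = nil
	return cmds, nil
}

func (f *fakeClient) ReportEvent(_ context.Context, id, result string) error {
	f.reports = append(f.reports, id+"="+result)
	return nil
}

// demo drives two ticks through the loop and returns what was reported.
func demo() []string {
	client := &fakeClient{queue: []Command{{ID: "c1", Type: "restart"}}}
	ticks := make(chan time.Time, 2)
	ticks <- time.Now()
	ticks <- time.Now()
	close(ticks) // a real agent would use time.NewTicker(15 * time.Second).C instead
	runHeartbeats(context.Background(), client, "node-1", ticks,
		func(Command) string { return "ok" })
	return client.reports
}

func main() {
	fmt.Println(demo())
}
```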
Debugging Tips
Database Migrations
Migrations are handled by the dedicated cmd/migrate binary, not at service startup. Run it as a pre-deploy step.
Leader Election
Logging
Database Guardrails
Per-service connection pools and statement timeouts are configured to prevent resource exhaustion:
| Service | MaxConns | Statement Timeout |
|---|---|---|
| API Server | 10 | 2s |
| Auth Server | 5 | 2s |
| Agent Gateway | 10 | 2s |
| Orchestrator | 10 | 10s |
| Health Evaluator | 5 | 5s |
Query durations are recorded in db_query_duration_seconds histograms per service. Statement timeout errors are counted via db_statement_timeout_total.
Database Queries
Distributed Traces (Tempo)
When OTEL_ENABLED=true:
- Open Grafana -> Explore -> Tempo
- Search by service name or trace ID
- TraceQL queries: {resource.service.name="api-server"}, {status=error}
Temporal Workflows
- Open Temporal UI: http://localhost:8233
- Search for workflow ID: provision-{uuid}
- View execution history (activities, retries, errors)
Redis Command Queue
NATS Streams
Victoria Metrics
Terraform State
Status Page (Gatus)
http://localhost:8081 — health dashboard for all services, infrastructure, and observability stack.
Running Tests
Unit Tests
Integration Tests
Uses testcontainers (Docker required). Test files: repository_test.go (NodeRepository CRUD, state transitions), commandqueue_test.go (ProgressStore, stall detection).
E2E Tests
Requires full Docker Compose environment.
Payment Service Tests
Related Documents
- Overview — System overview, tech stack, service descriptions
- Domain Model — Domain objects, state machines, business rules
- Workflows — HTTP flow, Temporal workflows, provisioning inputs
- Health and Incidents — Health evaluation, incidents, notifications
- Payment Service — Payment architecture
- Extending — Extension points, adding new capabilities
- Environment Variables — Configuration reference
- Deployment and Operations — Local dev, operations