How to Read This Codebase

  1. Domain models: internal/models/node.go, internal/models/subscription.go. Understand state machines and business rules.
  2. Request flow: cmd/api-server/main.go -> internal/app/bootstrap/api_server.go -> internal/api/router.go -> internal/api/middleware.go -> handlers -> services -> repositories.
  3. Workflows: internal/workflows/provision.go, internal/activities/. Understand the Temporal orchestration pattern.
  4. Upgrade rollout: internal/workflows/rollout.go, internal/opsagent/upgrade/. Understand the three-level workflow hierarchy and the agent-side three-layer architecture.
  5. Agent communication: proto/agent.proto, internal/grpc/server.go, cmd/ops-agent/main.go. Understand bidirectional gRPC communication.
  6. Health monitoring: internal/health/leader.go -> internal/health/machine.go -> internal/health/heartbeat.go -> internal/health/outbox.go -> internal/incident/service.go -> internal/incident/notifier/. Uptime: internal/uptime/state_log_handler.go -> internal/uptime/worker.go.
  7. Chain configuration: hoodcloud-chain-configs/chains/. Understand the declarative, chain-agnostic approach.

Key Files

| File | Purpose |
| --- | --- |
| internal/models/node.go | Core domain model — node states, transitions |
| internal/models/subscription.go | Subscription lifecycle states |
| internal/contracts/ | Canonical interfaces (repositories, services) |
| internal/api/middleware.go | Dual auth middleware (JWT + API key) |
| internal/api/router.go | HTTP routing and middleware stack |
| internal/workflows/provision.go | Provisioning workflow (compensation on failure) |
| internal/health/leader.go | PostgreSQL advisory lock leader election (generic) |
| internal/health/machine.go | NodeHealthMachine (state transitions, outbox events) |
| internal/health/heartbeat.go | Pure heartbeat evaluation function |
| internal/incident/service.go | Incident lifecycle (categorize, upsert, escalate, resolve) |
| proto/agent.proto | gRPC protocol definition |
| internal/opsagent/upgrade/action.go | UpgradeAction interface |
| internal/opsagent/upgrade/adapter.go | RuntimeAdapter interface |
| internal/opsagent/upgrade/executor.go | Upgrade executor (sequencing, compensation) |
| internal/opsagent/upgrade/sequences.go | Canonical action sequence |
| internal/opsagent/upgrade/actions/ | 8 action primitive implementations |
| internal/opsagent/upgrade/adapters/ | Systemd and Docker runtime adapters |
| internal/workflows/rollout.go | RolloutWorkflow (per-component batching) |
| internal/workflows/rollout_group.go | RolloutGroupWorkflow (multi-binary coordination) |
| internal/workflows/upgrade_node.go | UpgradeNodeWorkflow (per-node child) |
| internal/activities/upgrade.go | Upgrade activity implementations |
| internal/activities/types.go | Activity input/output types (includes upgrade types) |
| internal/upgrade/manifest/manifest.go | Upgrade manifest types (auto-populated from YAML) |
| internal/upgrade/manifest/reader.go | Manifest file reader (loads from chain config directory) |
| internal/models/rollout.go | Rollout, RolloutGroup, RolloutNode models |
| internal/uptime/state_log_handler.go | StateLogHandler — records transitions to node_state_log |
| internal/uptime/worker.go | UptimeWorker — materializes hourly uptime buckets |
| internal/service/uptime.go | UptimeService — computes rolling uptime from buckets |
| internal/api/handler_uptime.go | UptimeHandler — uptime API endpoints |
| internal/contracts/uptime.go | UptimeRepository + UptimeService interfaces |

Payment Service Key Files

| File | Purpose |
| --- | --- |
| payment-service/internal/adapters/provider.go | Provider interface |
| payment-service/internal/service/payment.go | Payment business logic |
| payment-service/internal/service/pricing.go | Pricing service |
| internal/consumers/payment_consumer.go | Main app NATS consumer |

Common Patterns

Repository Pattern

All database interactions go through repositories defined in internal/contracts/repository.go and implemented in internal/database/. This isolates SQL from business logic and keeps services testable with mocks.

Service Layer

Services in internal/service/ coordinate between repositories and external systems. They encapsulate business rules and orchestrate multi-step operations.

Temporal Workflows

Durable orchestrations live in internal/workflows/. They survive process restarts, retry automatically, and compensate on failure. Activities in internal/activities/ are idempotent.

Declarative Configuration

Chains defined in YAML (hoodcloud-chain-configs/), not code. Observation collectors, health policies, and recipes are all configuration-driven.

Event-Driven Architecture

All state transitions and dimension changes emit events via health_event_outbox:
Evaluator -> NodeHealthMachine -> health_event_outbox -> OutboxWorker -> handlers
Handlers (incident service, migration handler) subscribe without coupling to evaluation logic.

Graceful Degradation

Optional services degrade gracefully — NATS, telemetry, and other optional subsystems fall back to no-op implementations on initialization failure.

Error Handling

Error Types and Handling

| Category | Examples | Handling |
| --- | --- | --- |
| Infrastructure | Terraform provisioning fails, cloud API unavailable | Temporal retry policy (exponential backoff). On exhaustion: compensate() -> state failed |
| Agent Communication | Agent never registers, heartbeat timeout, Redis queue full | Wait with retries (WaitForAgentReady). Mark DOWN on timeout. Trigger migration. |
| Key Management | Key generation fails, S3 backup upload fails | Workflow failure -> compensate (delete partial state) -> state failed. Manual intervention required. |
| Database | Connection pool exhausted, deadlock, constraint violation | pgx retries transient errors. Application retries with backoff. Structured logging. |

Workflow Error Propagation

Activity fails -> Temporal retry policy (3 attempts, exponential backoff)
  -> All retries fail -> workflow receives error
  -> compensate() called
  -> Compensating activities (delete keys, destroy infra, update state to "failed")
  -> Workflow returns error result
  -> Metrics incremented, logs written
Compensation code: internal/workflows/provision.go (compensate()), internal/workflows/migrate.go (compensateMigration()).

HTTP Error Responses

{
  "error": "validation failed",
  "details": "chain_profile_id is required"
}
| Status | Usage |
| --- | --- |
| 400 | Validation errors |
| 401 | Missing or invalid auth |
| 404 | Resource not found |
| 409 | State conflict |
| 429 | Rate limit exceeded |
| 500 | Unexpected errors |

Temporal Retry Policies

| Context | Timeout | Retries | Backoff |
| --- | --- | --- | --- |
| Default activity | 10m start-to-close, 1m heartbeat | 3 attempts | 1s initial, 2.0 coefficient, 1m max |
| Terraform | 15m start-to-close, 2m heartbeat | 3 attempts | 10s initial, 2.0 coefficient, 2m max |
| WaitForAgent | 5m start-to-close | 20 attempts | 5s initial, 1.5 coefficient, 30s max |

Tracing Code Paths

“How does a node get marked as DOWN?”

  1. Health evaluator leader fires every 30s (leader-gated via advisory lock) -> ListSnapshots() (3-way JOIN, includes state_version)
  2. EvaluateHeartbeat() per node — pure function, no I/O. If heartbeat timed out: increment ConsecutiveFailures. If >= ConsecutiveFailuresForDown (3): return transition to DOWN.
  3. NodeHealthMachine.ApplyHeartbeatDecisions() — batch update with optimistic locking on both node_health_state.version and nodes.state_version. Inserts HealthTransitionEvent into health_event_outbox.
  4. OutboxWorker (NOT leader-gated, uses FOR UPDATE SKIP LOCKED) dispatches to CompositeTransitionHandler -> incident service creates node_down incident -> migration handler evaluates grace period and cooldown, triggers migration with deterministic workflow ID (migrate-node-{nodeID}-{healthStateVersion}).

“How does a node get upgraded?”

  1. Operator creates rollout via POST /api/v1/rollouts — either with upgrade_id (auto mode: manifest auto-populates binary details, state compatibility, config changes) or with all fields explicit (manual mode)
  2. Operator starts rollout via POST /api/v1/rollouts/{id}/start
  3. RolloutWorkflow launched in Temporal → ResolveRolloutTargets assigns nodes to batches
  4. Per batch: launches UpgradeNodeWorkflow children in parallel
  5. UpgradeNodeWorkflow: PreCheckNode → UpdateNodeState(maintenance) → SendUpgradeCommand
  6. Agent receives COMMAND_TYPE_UPGRADE (value 9) in heartbeat → executor runs 8 actions via RuntimeAdapter
  7. WaitForCommandCompletion → UpdateNodeBinaryVersion → WaitForHealthValidation → UpdateNodeState(syncing)
  8. On failure: agent runs per-action compensation (reverse order), workflow marks node failed/rolled_back

“How does uptime get computed?”

  1. State transition emitted by NodeHealthMachine -> health_event_outbox
  2. OutboxWorker dispatches to CompositeTransitionHandler -> StateLogHandler.OnTransition() closes previous node_state_log entry, inserts new entry
  3. UptimeWorker (5min cycle) finds last complete bucket per node, computes hourly buckets from node_state_log entries, upserts into node_uptime_hourly
  4. API request: UptimeService queries node_uptime_hourly with SUM(uptime_seconds) / SUM(total_seconds) over requested window

“How does an agent send a heartbeat?”

  1. Agent heartbeat loop (15s) -> sendHeartbeat()
  2. gRPC ControlPlaneService.Heartbeat() with node_id
  3. Control plane updates agent_registrations.last_seen_at, returns pending commands from Redis queue
  4. Agent executes commands, reports results via ReportEvent()

Debugging Tips

Database Migrations

Migrations are handled by the dedicated cmd/migrate binary, not at service startup. Run it as a pre-deploy step:
go run cmd/migrate/main.go          # Apply pending migrations
If a service starts before migrations are applied, it fails on schema mismatches.

Leader Election

# Check which instance is the leader
curl -s http://localhost:9090/metrics | grep health_evaluator_is_leader

# Verify advisory lock is held
psql -c "SELECT * FROM pg_locks WHERE locktype = 'advisory';"

Logging

LOG_LEVEL=debug ./bin/api-server
Structured logs include request IDs, node IDs, timestamps.

Database Guardrails

Per-service connection pools and statement timeouts are configured to prevent resource exhaustion:
| Service | MaxConns | Statement Timeout |
| --- | --- | --- |
| API Server | 10 | 2s |
| Auth Server | 5 | 2s |
| Agent Gateway | 10 | 2s |
| Orchestrator | 10 | 10s |
| Health Evaluator | 5 | 5s |
Observability: A pgx query tracer emits db_query_duration_seconds histograms per service. Statement timeout errors are counted via db_statement_timeout_total.

Database Queries

-- Node and health state
SELECT id, chain_profile_id, state, last_heartbeat FROM nodes;
SELECT node_id, consecutive_failures, down_since, current_state FROM node_health_state;

-- Active incidents
SELECT id, node_id, category, severity, status, occurrence_count
FROM incidents WHERE status NOT IN ('resolved', 'auto_resolved');

-- Pending health events
SELECT id, event_type, status FROM health_event_outbox WHERE status = 'pending';

-- Agent connectivity
SELECT node_id, agent_version, last_seen_at FROM agent_registrations;

-- Uptime state log (recent transitions)
SELECT node_id, state, entered_at, exited_at, trigger
FROM node_state_log WHERE node_id = '<uuid>' ORDER BY entered_at DESC LIMIT 20;

-- Uptime hourly buckets
SELECT bucket_hour, uptime_seconds, downtime_seconds, unknown_seconds, is_complete
FROM node_uptime_hourly WHERE node_id = '<uuid>' ORDER BY bucket_hour DESC LIMIT 24;

Distributed Traces (Tempo)

When OTEL_ENABLED=true:
  1. Open Grafana -> Explore -> Tempo
  2. Search by service name or trace ID
  3. TraceQL queries: {resource.service.name="api-server"}, {status=error}
Tempo is configured with log correlation — click “Logs for this span” to jump to Loki logs.

Temporal Workflows

  1. Open Temporal UI: http://localhost:8233
  2. Search for workflow ID: provision-{uuid}
  3. View execution history (activities, retries, errors)

Redis Command Queue

redis-cli KEYS hoodcloud:commands:*
redis-cli LRANGE hoodcloud:commands:{node_id} 0 -1

NATS Streams

nats stream info HOODCLOUD_METRICS
nats stream info PAYMENTS
nats consumer info PAYMENTS main-app-payment

Victoria Metrics

curl "http://localhost:8428/api/v1/query?query=up{node_id='...'}"

Terraform State

# Local backend
cd /app/terraform-state/{host_id} && terraform show

# S3 backend (TERRAFORM_STATE_BACKEND=s3)
aws s3 ls s3://hoodcloud-terraform-state/nodes/ --recursive

Status Page (Gatus)

http://localhost:8081 — health dashboard for all services, infrastructure, and observability stack.

Running Tests

Unit Tests

go test ./...                          # All
go test ./internal/service/...         # Specific package
go test -v ./internal/workflows/...    # Verbose

Integration Tests

Uses testcontainers (Docker required):
go test -tags=integration ./tests/integration/...
Available suites: repository_test.go (NodeRepository CRUD, state transitions), commandqueue_test.go (ProgressStore, stall detection).

E2E Tests

Requires full Docker Compose environment:
docker compose -f tests/e2e/docker-compose.e2e.yml up -d
go test -tags=e2e ./tests/e2e/...
docker compose -f tests/e2e/docker-compose.e2e.yml down -v

Payment Service Tests

cd payment-service
make test