What is HoodCloud?
HoodCloud is a fully managed blockchain node infrastructure service implemented in Go. It provisions, monitors, and maintains non-validator blockchain nodes (full nodes, archive nodes, indexers) without requiring users to manage infrastructure, software upgrades, or monitoring.
Primary Responsibilities:
- Provision blockchain nodes on cloud infrastructure (Hetzner, OVH, extensible via module registry)
- Generate and securely manage cryptographic key material
- Monitor node health and automatically trigger migrations on failure
- Apply configuration changes and software updates via declarative recipes
- Apply targeted binary and config upgrades via rollout orchestration
- Terminate nodes and clean up resources when subscriptions expire
Out of scope:
- Not a validator service
- Not a public RPC endpoint provider
- Not a self-service platform (admin-controlled provisioning in v1)
- Not highly available by default (single-host nodes in v1)
Application Type
HoodCloud is a distributed control plane system consisting of the seven components below (six services plus a one-shot migration CLI):
| Service | Transport | Port | Purpose |
|---|---|---|---|
| API Server | HTTP | 8080 | Business logic — nodes, subscriptions, chains, payments, key export |
| Auth Server | HTTP | 8081 | Identity, authentication, wallet registration, API keys |
| Agent Gateway | gRPC | 9090 | Ops-agent communication (heartbeat, commands, events) |
| Orchestrator | Temporal worker | — | Durable workflow execution (provision, migrate, terminate, upgrade rollout) |
| Health Evaluator | Background daemon | 9090 (metrics) | Health evaluation, incidents, notifications, cleanup (leader + standby) |
| Migration Runner | One-shot CLI | — | Database schema migrations (cmd/migrate), runs as pre-deploy step |
| Ops Agent | gRPC client | — | On-host agent (one per node VM) |
Technology Stack
| Component | Technology |
|---|---|
| Language | Go 1.25 |
| Workflow Engine | Temporal (durable workflows) |
| Database | PostgreSQL (node state, subscriptions, keys, users; per-service connection pools and statement timeouts) |
| Payment Database | PostgreSQL (separate instance, payment records) |
| Cache/Queue | Redis (command queue, progress storage, idempotency) |
| Event Streaming | NATS JetStream (agent events, metrics transport, payment events; 3-node cluster, R=3 in production) |
| RPC Protocol | gRPC with Protocol Buffers |
| Service-to-Service Auth | mTLS (TLS 1.3, mutual certificate verification) |
| Infrastructure | Terraform (declarative provisioning via module registry) |
| Encryption | AES-256-GCM (node key material), NaCl sealed box (user-provided secrets) |
| Authentication | Clerk (external auth provider), JWT (RS256) |
| Node Hosting | Hetzner, OVH (VPS, Public Cloud, Dedicated) — extensible |
| Cloud Services | AWS (S3), Secrets (HashiCorp Vault), Vault AWS Credential Provider |
| Metrics Storage | Victoria Metrics TSDB |
| Policy Engine | CEL (Common Expression Language) |
| Distributed Tracing | OpenTelemetry SDK -> OTel Collector -> Grafana Tempo |
| Status Page | Gatus (health monitoring dashboard) |
High-Level System Architecture
Provisioning Workflow Sequence
Security Boundaries
Keys exist in plaintext only temporarily for node operation. No permanent key retention: keys and backups are deleted on subscription expiration. No SSH access to nodes; all management goes through the ops-agent. User-provided secrets are client-side encrypted (NaCl sealed box).
See also: CLAUDE.md (Security Boundaries) for the canonical security rules.
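To make the "plaintext only temporarily" rule concrete, here is a minimal sketch of AES-256-GCM sealing and opening with a data-encryption key (DEK), followed by wiping the buffers. It uses only the Go standard library. The function names (`seal`, `open`, `wipe`) and the nonce-prefixed layout are assumptions for illustration, not HoodCloud's actual `internal/crypto/` API.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-256-GCM AEAD from a 32-byte data-encryption key (DEK).
func newGCM(dek []byte) (cipher.AEAD, error) {
	if len(dek) != 32 {
		return nil, errors.New("DEK must be 32 bytes for AES-256")
	}
	block, err := aes.NewCipher(dek)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and prepends the random nonce to the ciphertext.
// The nonce-prefixed layout is an assumption of this sketch.
func seal(dek, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(dek)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open splits off the nonce, then decrypts and authenticates the remainder.
func open(dek, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(dek)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

// wipe overwrites a buffer so key material does not linger after use.
func wipe(b []byte) {
	for i := range b {
		b[i] = 0
	}
}

func main() {
	dek := make([]byte, 32)
	if _, err := rand.Read(dek); err != nil {
		panic(err)
	}
	defer wipe(dek)

	sealed, err := seal(dek, []byte("example node key"))
	if err != nil {
		panic(err)
	}
	plain, err := open(dek, sealed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("recovered %d bytes\n", len(plain))
	wipe(plain) // drop the plaintext as soon as it is no longer needed
}
```

Because GCM is authenticated, `open` returns an error if the ciphertext was modified, so a corrupted backup fails loudly instead of yielding a wrong key.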
Service Descriptions
API Server (cmd/api-server/)
Purpose: Business logic — nodes, subscriptions, chains, payments, key export.
- Entry point: cmd/api-server/main.go (thin, delegates to internal/app/bootstrap/api_server.go)
- Port: 8080 (HTTP)
- Dependencies: PostgreSQL, Redis, Temporal, Vault, NATS (optional for payment consumer)
- Auth: DualAuthMiddleware, combining JWT (primary) and API key (programmatic). See internal/api/middleware.go
- Key packages: internal/api/ (handlers, routing, middleware), internal/service/ (business logic)
Note: Database migrations are handled by the dedicated cmd/migrate binary as a pre-deploy step, not at service startup.
See also: CLAUDE.md — Server Separation for endpoint lists.
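The dual-auth idea described for the API Server (JWT first, API key as the programmatic fallback) can be sketched as a plain `net/http` middleware. This is an illustrative sketch only: the `Validator` type, the `X-API-Key` header name, the `/nodes` path, and the stub validators are assumptions, and real JWT verification (RS256) is replaced by a string comparison.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Validator reports whether a credential is acceptable. Hypothetical stand-in
// for real JWT verification and API-key lookup.
type Validator func(credential string) bool

// dualAuth lets a request through if it carries a valid Bearer token or,
// failing that, a valid API key. The X-API-Key header name is an assumption.
func dualAuth(validJWT, validKey Validator, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tok, ok := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
		if ok && validJWT(tok) {
			next.ServeHTTP(w, r)
			return
		}
		if key := r.Header.Get("X-API-Key"); key != "" && validKey(key) {
			next.ServeHTTP(w, r)
			return
		}
		http.Error(w, "unauthorized", http.StatusUnauthorized)
	})
}

// demoHandler wires the middleware to stub validators for demonstration.
func demoHandler() http.Handler {
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return dualAuth(
		func(t string) bool { return t == "good-jwt" },
		func(k string) bool { return k == "good-key" },
		ok,
	)
}

// probe sends one request with an optional header and returns the status code.
func probe(h http.Handler, header, value string) int {
	req := httptest.NewRequest(http.MethodGet, "/nodes", nil)
	if header != "" {
		req.Header.Set(header, value)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := demoHandler()
	fmt.Println(probe(h, "Authorization", "Bearer good-jwt")) // 200
	fmt.Println(probe(h, "X-API-Key", "good-key"))            // 200
	fmt.Println(probe(h, "", ""))                             // 401
}
```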
Auth Server (cmd/auth-server/)
Purpose: Identity, authentication, wallet registration, API key management.
- Entry point: cmd/auth-server/main.go (thin, delegates to internal/app/bootstrap/auth_server.go)
- Port: 8081 (HTTP), 9094 (metrics)
- Dependencies: PostgreSQL, Vault (JWT keys). Minimal — no Temporal, no Redis
- Key packages: internal/auth/ (service, handlers, JWT), internal/authprovider/ (Clerk adapter)
- Clerk webhook endpoint (POST /webhooks/clerk) for user lifecycle sync
- JWT session management (RS256, 15m access / 7d refresh, atomic rotation)
- Chain-agnostic wallet registration via SignatureVerifierRegistry
- API key CRUD and rotation
- IP-based rate limiting (20 req/min, Redis-backed for global enforcement across instances)
See also: Clerk Setup for operational configuration.
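Chain-agnostic wallet registration suggests a registry that maps a chain name to a verifier implementation. The sketch below shows one plausible shape for that pattern. Only the name SignatureVerifierRegistry comes from this document; the interface, method signatures, and `stubVerifier` are assumptions, and the stub performs no real cryptography.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrUnsupportedChain is returned when no verifier is registered for a chain.
var ErrUnsupportedChain = errors.New("unsupported chain")

// SignatureVerifier checks that signature proves control of address over
// message. The signature of this method is hypothetical.
type SignatureVerifier interface {
	Verify(address, message, signature string) error
}

// Registry dispatches verification to the verifier registered for a chain.
type Registry struct {
	verifiers map[string]SignatureVerifier
}

func NewRegistry() *Registry {
	return &Registry{verifiers: make(map[string]SignatureVerifier)}
}

// Register adds or replaces the verifier for a chain (case-insensitive).
func (r *Registry) Register(chain string, v SignatureVerifier) {
	r.verifiers[strings.ToLower(chain)] = v
}

// Verify looks up the chain's verifier and delegates to it.
func (r *Registry) Verify(chain, address, message, signature string) error {
	v, ok := r.verifiers[strings.ToLower(chain)]
	if !ok {
		return fmt.Errorf("%w: %q", ErrUnsupportedChain, chain)
	}
	return v.Verify(address, message, signature)
}

// stubVerifier is a placeholder that accepts only "signed:"+message.
// A real verifier would check a chain-specific cryptographic signature.
type stubVerifier struct{}

func (stubVerifier) Verify(_, message, signature string) error {
	if signature != "signed:"+message {
		return errors.New("invalid signature")
	}
	return nil
}

// demoRegistry registers the stub for the two chains named in this document.
func demoRegistry() *Registry {
	r := NewRegistry()
	r.Register("ethereum", stubVerifier{})
	r.Register("solana", stubVerifier{})
	return r
}

func main() {
	r := demoRegistry()
	fmt.Println(r.Verify("Ethereum", "0xabc", "link-wallet", "signed:link-wallet"))
	fmt.Println(r.Verify("bitcoin", "bc1", "link-wallet", "signed:link-wallet"))
}
```

With this shape, supporting a new chain means registering one more verifier; the registration handler itself does not change.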
Agent Gateway (cmd/agent-gateway/)
Purpose: gRPC endpoint for ops-agent communication.
- Entry point: cmd/agent-gateway/main.go
- Port: 9090 (gRPC), 9091 (metrics HTTP)
- Dependencies: PostgreSQL, Redis, NATS
- Key packages: internal/grpc/ (server), internal/commandqueue/ (Redis queue + progress)
- Agent registration and heartbeat processing
- Command queue delivery (Redis -> agent via heartbeat response)
- DEK retrieval for key decryption
- Progress tracking for long-running commands (Redis progress:{commandID})
- Event forwarding to NATS
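The delivery model listed above (commands wait in a queue and are handed to the agent in its heartbeat response) can be illustrated with an in-memory stand-in for Redis. The `Queue` type and its methods are hypothetical; only the `progress:{commandID}` key format is taken from this document.

```go
package main

import (
	"fmt"
	"sync"
)

// Command is a unit of work for an ops-agent. Field names are assumptions.
type Command struct {
	ID   string
	Kind string
}

// Queue is an in-memory stand-in for the Redis-backed command queue.
type Queue struct {
	mu       sync.Mutex
	pending  map[string][]Command // keyed by node ID
	progress map[string]int       // keyed by progress:{commandID}
}

func NewQueue() *Queue {
	return &Queue{pending: map[string][]Command{}, progress: map[string]int{}}
}

// Enqueue stores a command until the node's next heartbeat.
func (q *Queue) Enqueue(nodeID string, c Command) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending[nodeID] = append(q.pending[nodeID], c)
}

// Heartbeat drains and returns the commands waiting for the node; the result
// would travel back to the agent in the heartbeat response.
func (q *Queue) Heartbeat(nodeID string) []Command {
	q.mu.Lock()
	defer q.mu.Unlock()
	cmds := q.pending[nodeID]
	delete(q.pending, nodeID)
	return cmds
}

// progressKey builds the progress:{commandID} key.
func progressKey(commandID string) string {
	return "progress:" + commandID
}

// ReportProgress records completion percentage for a long-running command.
func (q *Queue) ReportProgress(commandID string, percent int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.progress[progressKey(commandID)] = percent
}

// Progress returns the last reported percentage, if any.
func (q *Queue) Progress(commandID string) (int, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	p, ok := q.progress[progressKey(commandID)]
	return p, ok
}

func main() {
	q := NewQueue()
	q.Enqueue("node-1", Command{ID: "c1", Kind: "restart"})
	fmt.Println(q.Heartbeat("node-1")) // [{c1 restart}]
	fmt.Println(q.Heartbeat("node-1")) // []
	q.ReportProgress("c1", 40)
	fmt.Println(q.Progress("c1")) // 40 true
}
```

Since the agent pulls commands on its own heartbeat, the node VM needs no inbound connection from the control plane for command delivery in this model.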
Orchestrator (cmd/orchestrator/)
Purpose: Temporal workflow worker that executes provision, migrate, terminate, and upgrade rollout workflows.
- Entry point: cmd/orchestrator/main.go
- Dependencies: PostgreSQL, Redis, Temporal, Vault, Terraform, S3
- Key packages: internal/workflows/ (workflow definitions), internal/activities/ (activity implementations), internal/terraform/ (infrastructure provisioning)
Registered workflows: ProvisionNodeWorkflow, MigrateNodeWorkflow, TerminateNodeWorkflow, RolloutGroupWorkflow, RolloutWorkflow, UpgradeNodeWorkflow
Startup: Load config -> Init telemetry -> Init secrets -> Connect PostgreSQL -> Init repos + crypto -> Init chain config + Terraform -> Connect Redis -> Connect Temporal -> Create worker -> Register workflows + activities -> Start worker -> Wait for signal.
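A startup chain like the one above is often expressed as an ordered list of named steps that stops at the first failure. The following is a sketch of that pattern, not the orchestrator's actual bootstrap code: the `step` type, `runStartup`, and `demoStartup` are hypothetical, the step bodies are no-ops, and the blocking "Wait for signal" phase is left out.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// step is one named phase of startup.
type step struct {
	name string
	run  func(context.Context) error
}

// runStartup executes steps in order and stops at the first failure,
// returning the names of the steps that completed.
func runStartup(ctx context.Context, steps []step) ([]string, error) {
	var done []string
	for _, s := range steps {
		if err := ctx.Err(); err != nil {
			return done, err
		}
		if err := s.run(ctx); err != nil {
			return done, fmt.Errorf("%s: %w", s.name, err)
		}
		done = append(done, s.name)
	}
	return done, nil
}

// orchestratorSteps mirrors the documented order up to "Start worker".
var orchestratorSteps = []string{
	"Load config", "Init telemetry", "Init secrets", "Connect PostgreSQL",
	"Init repos + crypto", "Init chain config + Terraform", "Connect Redis",
	"Connect Temporal", "Create worker", "Register workflows + activities",
	"Start worker",
}

// demoStartup runs no-op steps and simulates a failure at failAt (if set).
func demoStartup(failAt string) ([]string, error) {
	steps := make([]step, 0, len(orchestratorSteps))
	for _, name := range orchestratorSteps {
		name := name // capture per iteration for older Go versions
		steps = append(steps, step{name: name, run: func(context.Context) error {
			if name == failAt {
				return errors.New("dependency unavailable")
			}
			return nil
		}})
	}
	return runStartup(context.Background(), steps)
}

func main() {
	done, err := demoStartup("Connect Redis")
	fmt.Println(len(done), err) // 6 Connect Redis: dependency unavailable
}
```

Wrapping the error with the step name makes a failed boot self-explanatory in logs.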
Health Evaluator (cmd/health-evaluator/)
Purpose: Background daemon for health evaluation, incident management, notifications, and cleanup.
- Entry point: cmd/health-evaluator/main.go
- Port: 9090 (metrics HTTP)
- Dependencies: PostgreSQL, Temporal (for migration triggers), NATS (event subscription), S3 (backup cleanup), Victoria Metrics (metrics queries)
- Key packages: internal/health/ (machine, evaluator, outbox, cleanup), internal/incident/ (service, notifier), internal/observation/ (policy evaluation, metrics ingestion), internal/uptime/ (state log handler, uptime worker)
Leader election uses a PostgreSQL advisory lock (internal/health/leader.go) held on a pgx.Conn (not pooled) for advisory lock persistence. On leader failure, a standby acquires the lock within one evaluation interval (15-30s).
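The leader/standby behaviour described above can be sketched with the lock hidden behind an interface, so the failover logic runs without a database. Everything here is hypothetical (`Locker`, `Elector`, and the in-memory `sessionLocker`). In a PostgreSQL-backed version, `TryLock` would typically issue `SELECT pg_try_advisory_lock($1)` on one long-lived connection, because session-level advisory locks belong to the session that acquired them and are released when that session ends.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Locker abstracts a session-scoped lock such as a PostgreSQL advisory lock.
type Locker interface {
	TryLock(ctx context.Context) (bool, error)
}

// Elector tracks whether this instance currently holds leadership.
// It is intended to be driven from a single goroutine.
type Elector struct {
	locker Locker
	leader bool
}

// Tick is called once per evaluation interval. A standby retries the lock;
// a leader keeps it. A real implementation must also notice a dropped
// connection and step down.
func (e *Elector) Tick(ctx context.Context) error {
	if e.leader {
		return nil
	}
	ok, err := e.locker.TryLock(ctx)
	if err != nil {
		return err
	}
	e.leader = ok
	return nil
}

// RunIfLeader executes fn only on the leader and reports whether it ran.
func (e *Elector) RunIfLeader(fn func()) bool {
	if !e.leader {
		return false
	}
	fn()
	return true
}

// sharedLock and sessionLocker simulate one lock contended by two sessions.
type sharedLock struct {
	mu     sync.Mutex
	holder string
}

type sessionLocker struct {
	lock    *sharedLock
	session string
}

func (s sessionLocker) TryLock(context.Context) (bool, error) {
	s.lock.mu.Lock()
	defer s.lock.mu.Unlock()
	if s.lock.holder == "" || s.lock.holder == s.session {
		s.lock.holder = s.session
		return true, nil
	}
	return false, nil
}

// Close simulates the session ending, which releases the lock.
func (s sessionLocker) Close() {
	s.lock.mu.Lock()
	defer s.lock.mu.Unlock()
	if s.lock.holder == s.session {
		s.lock.holder = ""
	}
}

// demoFailover runs two instances through a leader crash.
func demoFailover() (aLeads, bLeadsBefore, bLeadsAfter bool) {
	ctx := context.Background()
	lock := &sharedLock{}
	a := sessionLocker{lock: lock, session: "a"}
	b := sessionLocker{lock: lock, session: "b"}
	ea, eb := &Elector{locker: a}, &Elector{locker: b}

	_ = ea.Tick(ctx)
	_ = eb.Tick(ctx)
	aLeads, bLeadsBefore = ea.leader, eb.leader

	a.Close() // leader crashes; its session ends
	_ = eb.Tick(ctx)
	return aLeads, bLeadsBefore, eb.leader
}

func main() {
	fmt.Println(demoFailover()) // true false true
}
```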
Subsystems run concurrently:
| Subsystem | Leader-gated? | Description |
|---|---|---|
| Heartbeat evaluator | YES | 30s cycle, batch evaluation via 3-way JOIN snapshot |
| Policy evaluator | YES | 60s cycle, CEL policy evaluation against Victoria Metrics |
| Outbox worker | NO | Polls health_event_outbox, dispatches to handlers (FOR UPDATE SKIP LOCKED — multi-instance safe) |
| Incident pipeline | YES | Incident service + notification dispatcher (Slack, Telegram, Email, Webhook) |
| Uptime worker | YES | 5min cycle, materializes hourly uptime buckets from state transition log |
| Metrics ingester | NO | NATS -> Victoria Metrics (idempotent writes) |
| Subscription cleanup | YES | Expiration, grace period, pending payment TTL |
| Backup cleanup | YES | S3 orphaned backup removal |
| Terraform cleanup | YES | Orphaned state directory removal |
| Maintenance cleanup | YES | Stuck maintenance node recovery |
See also: Health and Incidents for the full pipeline architecture.
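As an illustration of the uptime worker row in the table above, the sketch below turns a time-ordered state transition log into per-hour uptime fractions. It shows one possible calculation, not the logic in internal/uptime/; the `Transition` type and the `hourlyUptime` signature are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// Transition records the node becoming up or down at a point in time.
type Transition struct {
	At time.Time
	Up bool
}

// hourlyUptime returns, for each of `hours` consecutive one-hour buckets
// starting at from, the fraction of the hour the node was up. The log must be
// sorted by time; initialUp is the state when no transition precedes from.
func hourlyUptime(log []Transition, initialUp bool, from time.Time, hours int) []float64 {
	out := make([]float64, hours)
	up := initialUp
	i := 0
	// Replay transitions before the window to find the starting state.
	for i < len(log) && log[i].At.Before(from) {
		up = log[i].Up
		i++
	}
	for h := 0; h < hours; h++ {
		start := from.Add(time.Duration(h) * time.Hour)
		end := start.Add(time.Hour)
		cursor := start
		var upFor time.Duration
		// Consume transitions inside this bucket, crediting time spent up.
		for i < len(log) && log[i].At.Before(end) {
			if up {
				upFor += log[i].At.Sub(cursor)
			}
			cursor = log[i].At
			up = log[i].Up
			i++
		}
		if up {
			upFor += end.Sub(cursor)
		}
		out[h] = float64(upFor) / float64(time.Hour)
	}
	return out
}

// demoBuckets: up at midnight, down at 00:30, up again at 01:15.
func demoBuckets() []float64 {
	from := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	transitions := []Transition{
		{At: from.Add(30 * time.Minute), Up: false},
		{At: from.Add(75 * time.Minute), Up: true},
	}
	return hourlyUptime(transitions, true, from, 3)
}

func main() {
	fmt.Println(demoBuckets()) // [0.5 0.75 1]
}
```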
Ops Agent (cmd/ops-agent/)
Purpose: Lightweight on-host agent for node lifecycle management.
- Entry point: cmd/ops-agent/main.go
- Runs on: Each node VM (installed via cloud-init)
- Communication: gRPC client -> Agent Gateway, NATS publisher for metrics
- Key packages: internal/opsagent/ (agent core, commands, recipes, config, state tracking, observation, upgrade)
- Lifecycle control (start/stop/restart node process via systemd)
- Upgrade execution via three-layer architecture (actions, runtime adapters, executor)
- Configuration application via declarative recipes (hoodcloud-chain-configs/recipes/)
- System and chain metric collection via observation runner
- Sync status tracking (events forwarded via gRPC -> NATS)
- Key injection (encrypted key material decrypted with DEK in memory)
- Progress monitoring for long-running operations (snapshot downloads)
- Self-update mechanism
observation.yaml exists) -> Start gRPC server -> Register with control plane -> Fetch DEK -> Start state tracking + observation + heartbeat loops -> Wait for signal -> Stop node -> Clear DEK -> Shutdown.
Key Packages
| Package | Purpose |
|---|---|
| internal/contracts/ | Canonical interfaces (repositories, services, notifiers) |
| internal/models/ | Domain model types (Node, Subscription, User, Incident, etc.) |
| internal/database/ | PostgreSQL repositories (auth/, ops/, user/ domains) |
| internal/service/ | Business logic services (NodeService, SubscriptionService, etc.) |
| internal/crypto/ | AES-256-GCM encryption, NaCl sealed box |
| internal/vault/ | HashiCorp Vault client (AppRole, Transit, PKI, circuit breaker) |
| internal/chains/ | Chain profile loading (local filesystem or S3 with version polling) |
| internal/upgrade/manifest/ | Upgrade manifest reader (YAML manifests from chain config directory) |
| internal/provision/input/ | Provisioning input framework (schema, validation, storage) |
| internal/wallet/ | Chain-agnostic wallet verification (Ethereum, Solana) |
| internal/consumers/ | NATS consumers (payment events, idempotency) |
| internal/observation/ | Metrics collection, transport, CEL policy evaluation |
| internal/uptime/ | Rolling uptime calculation (state log handler, uptime worker) |
| internal/terraform/ | Terraform execution and self-describing module registry |
| internal/health/leader.go | PostgreSQL advisory lock leader election (generic, reusable) |
| internal/app/bootstrap/leader_election.go | Leader election bootstrap pattern |
| internal/correlation/ | Correlation ID propagation |
| internal/telemetry/ | OpenTelemetry distributed tracing |
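Correlation ID propagation, listed in the table above, is commonly built on `context.Context`. The sketch below shows that general pattern under the assumption that an ID is attached once at the edge and read by downstream code; the function names are hypothetical and do not describe the actual internal/correlation/ API.

```go
package main

import (
	"context"
	"fmt"
)

// ctxKey is an unexported key type, so other packages cannot collide with it.
type ctxKey struct{}

// WithCorrelationID returns a child context carrying the given ID.
func WithCorrelationID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, ctxKey{}, id)
}

// CorrelationID extracts the ID, or returns "" when none was attached.
func CorrelationID(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// logLine prefixes a message with the correlation ID so that log entries from
// different services can be joined on it.
func logLine(ctx context.Context, msg string) string {
	id := CorrelationID(ctx)
	if id == "" {
		id = "-"
	}
	return fmt.Sprintf("[%s] %s", id, msg)
}

// demo attaches an incoming ID at the edge and logs from "downstream" code.
func demo(incoming string) string {
	ctx := context.Background()
	if incoming != "" {
		ctx = WithCorrelationID(ctx, incoming)
	}
	return logLine(ctx, "provision started")
}

func main() {
	fmt.Println(demo("req-42")) // [req-42] provision started
	fmt.Println(demo(""))       // [-] provision started
}
```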
Related Documents
Architecture:
- Domain Model: Domain objects, state machines, business rules, DB schema
- Workflows: Temporal workflows, provisioning inputs, NATS consumers
- NATS JWT Operator Mode: JWT authentication, account structure, credential flows
- Health and Incidents: Health evaluation, incidents, notifications, cleanup
- Payment Service: Isolated payment microservice
- Extending: Extension points, adding chains/providers/channels
- Developer Guide: Reading guide, patterns, debugging
- Environment Variables: Complete env var reference
- Vault: Vault setup, secret structure, operations
- Deployment and Operations: Local dev, day-to-day ops