## Connection Pool Configuration

Each service has a configured connection pool size based on its workload. Set via the `DB_MAX_CONNS` and `DB_MIN_CONNS` environment variables per service.
| Service | MaxConns | MinConns | Rationale |
|---|---|---|---|
| Health Evaluator | 5 | 2 | Single active instance (leader election), periodic batch queries |
| Auth Server | 5 | 2 | Lightweight auth queries |
| API Server | 10 | 2 | User-facing, variable load |
| Agent Gateway | 10 | 2 | Agent heartbeats, moderate load |
| Orchestrator | 10 | 2 | Workflow activities, Terraform operations |
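For example, the Health Evaluator row of the table above translates to the following (how the variables reach the process depends on your deployment; plain shell exports are shown only as an illustration):

```shell
# Pool sizing for the Health Evaluator, per the table above
export DB_MAX_CONNS=5
export DB_MIN_CONNS=2
```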
### Connection Budget Formula
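The formula itself is not present in this copy of the document. A reasonable reconstruction, consistent with the Tuning guidance below: the total budget is the sum of MaxConns across all service replicas, and it must stay under PostgreSQL's `max_connections` minus `superuser_reserved_connections`. A sketch of the check (replica counts and PostgreSQL limits here are assumptions):

```python
# Hedged sketch of the connection budget check. MaxConns values come from the
# table above; replica counts, max_connections, and the reserved count are assumptions.
pools = {
    "health-evaluator": (5, 1),   # (MaxConns, replicas) - single active instance
    "auth-server": (5, 1),
    "api-server": (10, 1),
    "agent-gateway": (10, 1),
    "orchestrator": (10, 1),
}

max_connections = 100    # PostgreSQL default
superuser_reserved = 3   # PostgreSQL default superuser_reserved_connections

budget = sum(max_conns * replicas for max_conns, replicas in pools.values())
assert budget <= max_connections - superuser_reserved, "deploy PgBouncer or shrink pools"
print(budget)  # 40 with the assumed single-replica counts
```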
### Tuning

If `db_pool_active_connections` exceeds 80% of MaxConns, or `db_pool_acquire_duration_seconds` p99 exceeds 1s:

- Check if queries are slow (see Statement Timeout Runbook below)
- Increase `DB_MAX_CONNS` for the affected service
- Recalculate the connection budget to ensure it stays under `max_connections`
- If the total budget exceeds PostgreSQL limits, deploy PgBouncer
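Server-side connection usage can be cross-checked against the per-service MaxConns with a standard catalog query; a sketch (grouping by `application_name` assumes the services set it, which this document does not confirm):

```sql
-- Current connection counts per application, to compare against MaxConns
SELECT application_name, count(*) AS conns
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY application_name
ORDER BY conns DESC;
```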
## Statement Timeouts

Per-service statement timeouts prevent runaway queries from holding connections. Set via the `DB_STATEMENT_TIMEOUT` environment variable.
| Service | Statement Timeout | Rationale |
|---|---|---|
| API Server | 2s | User-facing, queries should be fast |
| Auth Server | 2s | Auth lookups must be fast |
| Agent Gateway | 2s | Agent heartbeats, fast interactions |
| Health Evaluator | 5s | Batch queries over all nodes |
| Orchestrator | 10s | Terraform-related operations can be heavier |
When a query exceeds its timeout, PostgreSQL cancels it with SQLSTATE `57014`. The application logs the timeout and the request fails with an appropriate error.
**Adjusting timeouts:** If legitimate queries are timing out, investigate the root cause before increasing the timeout. See the Statement Timeout Runbook below.
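The timeout itself is a standard PostgreSQL session setting; how the shared setup applies it is not shown in this document, but per connection it amounts to:

```sql
-- Applied on each pooled connection at establishment (mechanism assumed);
-- value taken from DB_STATEMENT_TIMEOUT
SET statement_timeout = '2s';  -- API Server value from the table above
```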
## PostgreSQL Configuration Requirements

### pg_stat_statements (Required)

Enable `pg_stat_statements` via `shared_preload_libraries` in the PostgreSQL configuration (requires a restart), then create the extension in the target database.

### Slow Query Logging
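A minimal `postgresql.conf` fragment covering both requirements might look like this (the slow-query threshold is an assumption; pick one aligned with the SLOs below):

```
# postgresql.conf (sketch)
shared_preload_libraries = 'pg_stat_statements'   # requires restart
log_min_duration_statement = 1000                 # log statements slower than 1s (assumed threshold)
```

After the restart, run `CREATE EXTENSION IF NOT EXISTS pg_stat_statements;` in the target database.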
## Query Performance SLOs (p95)
| Service | p95 SLO | Statement Timeout | Rationale |
|---|---|---|---|
| API Server | < 100ms | 2s | User-facing, latency impacts UX |
| Health Evaluator | < 500ms | 5s | Batch ListSnapshots over all nodes, runs every 15s |
| Orchestrator | < 1000ms | 10s | Terraform-related DB operations, not latency-sensitive |
| Agent Gateway | < 50ms | 2s | Agent heartbeats every 30s, must be fast |
| Auth Server | < 50ms | 2s | Auth lookups (API key validation, token refresh) |
## Application-Side Metrics

The following Prometheus metrics are automatically emitted by all services via the shared database connection setup:

| Metric | Type | Labels | Description |
|---|---|---|---|
| hoodcloud_db_query_duration_seconds | Histogram | service | Per-service query latency (p50, p95, p99) |
| hoodcloud_db_statement_timeout_total | Counter | service | Statement timeout errors (SQLSTATE 57014) per service |
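The shared connection setup is not shown in this document; as an illustration only, per-service latency tracking of this shape can be sketched with the standard library (real services presumably use a Prometheus client; all names below are hypothetical):

```python
import statistics
import time
from collections import defaultdict

# Hypothetical stand-in for hoodcloud_db_query_duration_seconds:
# raw durations bucketed by the "service" label.
query_durations = defaultdict(list)

def timed_query(service, fn, *args):
    """Run fn and record its duration under the given service label."""
    start = time.perf_counter()
    try:
        return fn(*args)
    finally:
        query_durations[service].append(time.perf_counter() - start)

def p95(service):
    """p95 latency for a service, as the histogram quantile would report it."""
    samples = sorted(query_durations[service])
    if len(samples) < 2:
        return samples[0] if samples else 0.0
    return statistics.quantiles(samples, n=100)[94]

# Record a few fake "queries" for the API server and check its SLO
for _ in range(20):
    timed_query("api-server", lambda: sum(range(1000)))
assert p95("api-server") < 0.1  # API Server SLO from the table: p95 < 100ms
```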
## Alert Thresholds
| Alert | Condition | Severity | Response |
|---|---|---|---|
| p95 query latency SLO breach | hoodcloud_db_query_duration_seconds{quantile="0.95"} > SLO for > 5 min | WARNING | Investigate via pg_stat_statements |
| Statement timeout rate spike | rate(hoodcloud_db_statement_timeout_total[5m]) > 0.1 (any service) | HIGH | Follow statement timeout runbook below |
| Slow query rate sustained | pg_slow_query_log_rate > 10/min for > 15 min | WARNING | Review pg_stat_statements top queries |
| Health evaluator cycle exceeds SLO | hoodcloud_evaluation_cycle_duration_seconds{quantile="0.95"} > 0.5 | HIGH | ListSnapshots query degrading |
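As an illustration, the statement-timeout row above could be expressed as a Prometheus alerting rule roughly like this (the rule name, group name, and labels other than the metric are assumptions):

```yaml
groups:
  - name: db-alerts
    rules:
      - alert: DBStatementTimeoutSpike
        expr: rate(hoodcloud_db_statement_timeout_total[5m]) > 0.1
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Statement timeouts spiking for {{ $labels.service }}"
```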
## Statement Timeout Runbook

When statement timeouts trigger frequently:

1. **Identify the query:** Check `hoodcloud_db_statement_timeout_total` metric labels for the service. Cross-reference with `pg_stat_statements` for recently degraded queries.
2. **Check for lock contention:** Health evaluator batch updates and API server reads can contend.
3. **Check for table bloat:** The `nodes` and `node_health_state` tables are high-write and bloat-prone.
4. **Check for missing indexes:** Run `EXPLAIN ANALYZE` on the timed-out query. Verify indexes exist on all JOIN and WHERE columns.
5. **Check for data growth:** If node count has grown significantly, `ListSnapshots` may need pagination or the health evaluator's statement timeout may need adjustment.
6. **Temporary mitigation:** If a single runaway query is causing cascading timeouts, identify and `pg_cancel_backend()` it. Do NOT increase statement timeouts without understanding the root cause.
7. **Escalation:** If timeouts persist after steps 1-6, this may indicate the scale trigger for PgBouncer or a health read replica.
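Steps 2 and 6 can be carried out with standard PostgreSQL catalog queries; a sketch:

```sql
-- Step 2: sessions currently waiting on locks, with their queries
SELECT pid, wait_event_type, wait_event, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

-- Step 6: cancel a runaway query by pid (cancels the query, keeps the connection open)
SELECT pg_cancel_backend(12345);  -- replace 12345 with the offending pid
```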