## Connection Pool Configuration

Each service has a configured connection pool size based on its workload. Set via the `DB_MAX_CONNS` and `DB_MIN_CONNS` environment variables per service.
| Service | MaxConns | MinConns | Rationale |
|---|---|---|---|
| Health Evaluator | 5 | 2 | Single active instance (leader election), periodic batch queries |
| Auth Server | 5 | 2 | Lightweight auth queries |
| API Server | 10 | 2 | User-facing, variable load |
| Agent Gateway | 10 | 2 | Agent heartbeats, moderate load |
| Orchestrator | 10 | 2 | Workflow activities, Terraform operations |
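For example, the Health Evaluator row of the table above translates to the following (how the variables reach the process depends on your deployment; plain shell exports are shown only as an illustration):

```shell
# Pool sizing for the Health Evaluator, per the table above
export DB_MAX_CONNS=5
export DB_MIN_CONNS=2
```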
### Connection Budget Formula
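The formula itself is not present in this copy of the document. A reasonable reconstruction, consistent with the Tuning guidance below: the total budget is the sum of MaxConns across all service replicas, and it must stay under PostgreSQL's `max_connections` minus `superuser_reserved_connections`. A sketch of the check (replica counts and PostgreSQL limits here are assumptions):

```python
# Hedged sketch of the connection budget check. MaxConns values come from the
# table above; replica counts, max_connections, and the reserved count are assumptions.
pools = {
    "health-evaluator": (5, 1),   # (MaxConns, replicas) - single active instance
    "auth-server": (5, 1),
    "api-server": (10, 1),
    "agent-gateway": (10, 1),
    "orchestrator": (10, 1),
}

max_connections = 100    # PostgreSQL default
superuser_reserved = 3   # PostgreSQL default superuser_reserved_connections

budget = sum(max_conns * replicas for max_conns, replicas in pools.values())
assert budget <= max_connections - superuser_reserved, "deploy PgBouncer or shrink pools"
print(budget)  # 40 with the assumed single-replica counts
```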
### Tuning

If `db_pool_active_connections` exceeds 80% of MaxConns, or `db_pool_acquire_duration_seconds` p99 exceeds 1s:

- Check if queries are slow (see Statement Timeout Runbook below)
- Increase `DB_MAX_CONNS` for the affected service
- Recalculate the connection budget to ensure it stays under `max_connections`
- If the total budget exceeds PostgreSQL limits, deploy PgBouncer
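Server-side connection usage can be cross-checked against the per-service MaxConns with a standard catalog query; a sketch (grouping by `application_name` assumes the services set it, which this document does not confirm):

```sql
-- Current connection counts per application, to compare against MaxConns
SELECT application_name, count(*) AS conns
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY application_name
ORDER BY conns DESC;
```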
## Statement Timeouts

Per-service statement timeouts prevent runaway queries from holding connections. Set via the `DB_STATEMENT_TIMEOUT` environment variable.
| Service | Statement Timeout | Rationale |
|---|---|---|
| API Server | 2s | User-facing, queries should be fast |
| Auth Server | 2s | Auth lookups must be fast |
| Agent Gateway | 2s | Agent heartbeats, fast interactions |
| Health Evaluator | 5s | Batch queries over all nodes |
| Orchestrator | 10s | Terraform-related operations can be heavier |
When a query exceeds its timeout, PostgreSQL cancels it with SQLSTATE `57014`. The application logs the timeout and the request fails with an appropriate error.
**Adjusting timeouts:** If legitimate queries are timing out, investigate the root cause before increasing the timeout. See the Statement Timeout Runbook below.
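The timeout itself is a standard PostgreSQL session setting; how the shared setup applies it is not shown in this document, but per connection it amounts to:

```sql
-- Applied on each pooled connection at establishment (mechanism assumed);
-- value taken from DB_STATEMENT_TIMEOUT
SET statement_timeout = '2s';  -- API Server value from the table above
```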
## PostgreSQL Configuration Requirements

### pg_stat_statements (Required)

Enable `pg_stat_statements` via `shared_preload_libraries` in the PostgreSQL configuration (requires a restart), then create the extension in the target database.

### Slow Query Logging
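A minimal `postgresql.conf` fragment covering both requirements might look like this (the slow-query threshold is an assumption; pick one aligned with the SLOs below):

```
# postgresql.conf (sketch)
shared_preload_libraries = 'pg_stat_statements'   # requires restart
log_min_duration_statement = 1000                 # log statements slower than 1s (assumed threshold)
```

After the restart, run `CREATE EXTENSION IF NOT EXISTS pg_stat_statements;` in the target database.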
## Query Performance SLOs (p95)
| Service | p95 SLO | Statement Timeout | Rationale |
|---|---|---|---|
| API Server | < 100ms | 2s | User-facing, latency impacts UX |
| Health Evaluator | < 500ms | 5s | Batch ListSnapshots over all nodes, runs every 15s |
| Orchestrator | < 1000ms | 10s | Terraform-related DB operations, not latency-sensitive |
| Agent Gateway | < 50ms | 2s | Agent heartbeats every 30s, must be fast |
| Auth Server | < 50ms | 2s | Auth lookups (API key validation, token refresh) |
## Application-Side Metrics

The following Prometheus metrics are automatically emitted by all services via the shared database connection setup:

| Metric | Type | Labels | Description |
|---|---|---|---|
| hoodcloud_db_query_duration_seconds | Histogram | service | Per-service query latency (p50, p95, p99) |
| hoodcloud_db_statement_timeout_total | Counter | service | Statement timeout errors (SQLSTATE 57014) per service |
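The shared connection setup is not shown in this document; as an illustration only, per-service latency tracking of this shape can be sketched with the standard library (real services presumably use a Prometheus client; all names below are hypothetical):

```python
import statistics
import time
from collections import defaultdict

# Hypothetical stand-in for hoodcloud_db_query_duration_seconds:
# raw durations bucketed by the "service" label.
query_durations = defaultdict(list)

def timed_query(service, fn, *args):
    """Run fn and record its duration under the given service label."""
    start = time.perf_counter()
    try:
        return fn(*args)
    finally:
        query_durations[service].append(time.perf_counter() - start)

def p95(service):
    """p95 latency for a service, as the histogram quantile would report it."""
    samples = sorted(query_durations[service])
    if len(samples) < 2:
        return samples[0] if samples else 0.0
    return statistics.quantiles(samples, n=100)[94]

# Record a few fake "queries" for the API server and check its SLO
for _ in range(20):
    timed_query("api-server", lambda: sum(range(1000)))
assert p95("api-server") < 0.1  # API Server SLO from the table: p95 < 100ms
```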
## Alert Thresholds
| Alert | Condition | Severity | Response |
|---|---|---|---|
| p95 query latency SLO breach | hoodcloud_db_query_duration_seconds{quantile="0.95"} > SLO for > 5 min | WARNING | Investigate via pg_stat_statements |
| Statement timeout rate spike | rate(hoodcloud_db_statement_timeout_total[5m]) > 0.1 (any service) | HIGH | Follow statement timeout runbook below |
| Slow query rate sustained | pg_slow_query_log_rate > 10/min for > 15 min | WARNING | Review pg_stat_statements top queries |
| Health evaluator cycle exceeds SLO | hoodcloud_evaluation_cycle_duration_seconds{quantile="0.95"} > 0.5 | HIGH | ListSnapshots query degrading |
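As an illustration, the statement-timeout row above could be expressed as a Prometheus alerting rule roughly like this (the rule name, group name, and labels other than the metric are assumptions):

```yaml
groups:
  - name: db-alerts
    rules:
      - alert: DBStatementTimeoutSpike
        expr: rate(hoodcloud_db_statement_timeout_total[5m]) > 0.1
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Statement timeouts spiking for {{ $labels.service }}"
```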
## Statement Timeout Runbook

When statement timeouts trigger frequently:

1. **Identify the query:** Check `hoodcloud_db_statement_timeout_total` metric labels for the service. Cross-reference with `pg_stat_statements` for recently degraded queries.
2. **Check for lock contention:** Health evaluator batch updates and API server reads can contend.
3. **Check for table bloat:** The `nodes` and `node_health_state` tables are high-write and bloat-prone.
4. **Check for missing indexes:** Run `EXPLAIN ANALYZE` on the timed-out query. Verify indexes exist on all JOIN and WHERE columns.
5. **Check for data growth:** If node count has grown significantly, `ListSnapshots` may need pagination or the health evaluator's statement timeout may need adjustment.
6. **Temporary mitigation:** If a single runaway query is causing cascading timeouts, identify and `pg_cancel_backend()` it. Do NOT increase statement timeouts without understanding the root cause.
7. **Escalation:** If timeouts persist after steps 1-6, this may indicate the scale trigger for PgBouncer or a health read replica.
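Steps 2 and 6 can be carried out with standard PostgreSQL catalog queries; a sketch:

```sql
-- Step 2: sessions currently waiting on locks, with their queries
SELECT pid, wait_event_type, wait_event, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

-- Step 6: cancel a runaway query by pid (cancels the query, keeps the connection open)
SELECT pg_cancel_backend(12345);  -- replace 12345 with the offending pid
```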