
Connection Pool Configuration

Each service's connection pool is sized for its workload, configured per service via the DB_MAX_CONNS and DB_MIN_CONNS environment variables.
| Service | MaxConns | MinConns | Rationale |
| --- | --- | --- | --- |
| Health Evaluator | 5 | 2 | Single active instance (leader election), periodic batch queries |
| Auth Server | 5 | 2 | Lightweight auth queries |
| API Server | 10 | 2 | User-facing, variable load |
| Agent Gateway | 10 | 2 | Agent heartbeats, moderate load |
| Orchestrator | 10 | 2 | Workflow activities, Terraform operations |
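As a sketch of how a service might consume these variables, the snippet below reads DB_MAX_CONNS and DB_MIN_CONNS with per-service defaults and validates the bounds. The function name, default values, and validation rule are illustrative assumptions; only the two variable names come from this document.

```python
import os

def pool_config(default_max=10, default_min=2):
    """Read pool sizing from the environment, falling back to defaults.

    DB_MAX_CONNS / DB_MIN_CONNS are the variables named in this document;
    the defaults and the validation rule are illustrative assumptions.
    """
    max_conns = int(os.environ.get("DB_MAX_CONNS", default_max))
    min_conns = int(os.environ.get("DB_MIN_CONNS", default_min))
    if not 0 < min_conns <= max_conns:
        raise ValueError(f"invalid pool bounds: min={min_conns}, max={max_conns}")
    return max_conns, min_conns

# Example: the API Server's settings from the table above
os.environ["DB_MAX_CONNS"] = "10"
os.environ["DB_MIN_CONNS"] = "2"
print(pool_config())  # (10, 2)
```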

Connection Budget Formula

sum(MaxConns × instances) < PostgreSQL max_connections
Example with 2 instances per service:
(5×2) + (5×2) + (10×2) + (10×2) + (10×2) = 80 < 100 (default max_connections)
Example with 3 instances per stateless service (health evaluator stays at 1 active instance):
(5×1) + (5×3) + (10×3) + (10×3) + (10×3) = 110 > 100, which exceeds the default max_connections. Deploy PgBouncer or raise max_connections before scaling to three instances.
Note: the health evaluator uses leader election, so only the single active instance needs connections. The scale trigger for deploying PgBouncer is N = 10 connection pools.
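The budget formula above can be checked mechanically. The sketch below hard-codes the MaxConns values from the pool table; the dictionary keys and function shape are illustrative assumptions.

```python
# MaxConns per service, from the connection pool table above.
POOLS = {
    "health-evaluator": 5,
    "auth-server": 5,
    "api-server": 10,
    "agent-gateway": 10,
    "orchestrator": 10,
}

def connection_budget(instances, max_connections=100):
    """Return (total, within_budget) for sum(MaxConns × instances)."""
    total = sum(POOLS[svc] * n for svc, n in instances.items())
    return total, total < max_connections

# Two instances per service: 80 connections, within the default budget.
total, ok = connection_budget({svc: 2 for svc in POOLS})
print(total, ok)  # 80 True
```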

Tuning

If db_pool_active_connections exceeds 80% of MaxConns or db_pool_acquire_duration_seconds p99 exceeds 1s:
  1. Check if queries are slow (see Statement Timeout Runbook below)
  2. Increase DB_MAX_CONNS for the affected service
  3. Recalculate connection budget to ensure it stays under max_connections
  4. If total budget exceeds PostgreSQL limits, deploy PgBouncer
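The two tuning triggers can be expressed as a single predicate. The thresholds (80% utilization, 1 s p99 acquire latency) come from this section; the function itself is an illustrative sketch, not shipped code.

```python
def needs_tuning(active_conns, max_conns, acquire_p99_s):
    """Tuning trigger: pool utilization above 80% of MaxConns, or
    p99 connection-acquire latency above 1 s. Thresholds are from
    this document; the function shape is an illustrative assumption."""
    return active_conns > 0.8 * max_conns or acquire_p99_s > 1.0

print(needs_tuning(9, 10, 0.2))   # True: 90% of the pool is in use
print(needs_tuning(4, 10, 0.2))   # False: healthy pool
```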

Statement Timeouts

Per-service statement timeouts prevent runaway queries from holding connections. Set via DB_STATEMENT_TIMEOUT environment variable.
| Service | Statement Timeout | Rationale |
| --- | --- | --- |
| API Server | 2s | User-facing, queries should be fast |
| Auth Server | 2s | Auth lookups must be fast |
| Agent Gateway | 2s | Agent heartbeats, fast interactions |
| Health Evaluator | 5s | Batch queries over all nodes |
| Orchestrator | 10s | Terraform-related operations can be heavier |
When a query exceeds its timeout, PostgreSQL cancels it and returns SQLSTATE 57014 (query_canceled). The application logs the timeout and the request fails with an appropriate error.

Adjusting timeouts: if legitimate queries are timing out, investigate the root cause before increasing the timeout. See the Statement Timeout Runbook below.
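One way the application side might distinguish a statement timeout from other database errors is by inspecting the SQLSTATE. Only the code 57014 comes from this document; the error class, attribute name, and label strings below are illustrative assumptions.

```python
STATEMENT_TIMEOUT = "57014"  # PostgreSQL SQLSTATE query_canceled

class DBError(Exception):
    """Illustrative stand-in for a driver error carrying a SQLSTATE."""
    def __init__(self, sqlstate, msg):
        super().__init__(msg)
        self.sqlstate = sqlstate

def classify(err):
    """Map a database error to a log/metric label. A real service would
    increment hoodcloud_db_statement_timeout_total on the first branch."""
    if getattr(err, "sqlstate", None) == STATEMENT_TIMEOUT:
        return "statement_timeout"
    return "db_error"

err = DBError("57014", "canceling statement due to statement timeout")
print(classify(err))  # statement_timeout
```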

PostgreSQL Configuration Requirements

pg_stat_statements (Required)

Enable in PostgreSQL configuration:
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = top
pg_stat_statements.max = 5000
Requires a PostgreSQL restart; coordinate with the managed PostgreSQL deployment.

Slow Query Logging

log_min_duration_statement = 500
Logs all queries exceeding 500ms with full query text. This threshold is lower than statement timeouts (2-10s) to catch degradation before it becomes a timeout.

Query Performance SLOs (p95)

| Service | p95 SLO | Statement Timeout | Rationale |
| --- | --- | --- | --- |
| API Server | < 100ms | 2s | User-facing, latency impacts UX |
| Health Evaluator | < 500ms | 5s | Batch ListSnapshots over all nodes, runs every 15s |
| Orchestrator | < 1000ms | 10s | Terraform-related DB operations, not latency-sensitive |
| Agent Gateway | < 50ms | 2s | Agent heartbeats every 30s, must be fast |
| Auth Server | < 50ms | 2s | Auth lookups (API key validation, token refresh) |
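Evaluating observed latencies against these SLOs is a simple comparison per service. The SLO values below come from the table above; the observed values would come from hoodcloud_db_query_duration_seconds, and the dict-in/list-out shape is an illustrative assumption.

```python
# p95 SLOs from the table above, in seconds.
SLO_P95_S = {
    "api-server": 0.100,
    "health-evaluator": 0.500,
    "orchestrator": 1.000,
    "agent-gateway": 0.050,
    "auth-server": 0.050,
}

def slo_breaches(observed_p95_s):
    """Return services whose observed p95 latency exceeds their SLO."""
    return sorted(svc for svc, p95 in observed_p95_s.items()
                  if p95 > SLO_P95_S[svc])

print(slo_breaches({"api-server": 0.180, "auth-server": 0.020}))
# ['api-server']
```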

Application-Side Metrics

The following Prometheus metrics are automatically emitted by all services via the shared database connection setup:
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| hoodcloud_db_query_duration_seconds | Histogram | service | Per-service query latency (p50, p95, p99) |
| hoodcloud_db_statement_timeout_total | Counter | service | Statement timeout errors (SQLSTATE 57014) per service |

Alert Thresholds

| Alert | Condition | Severity | Response |
| --- | --- | --- | --- |
| p95 query latency SLO breach | hoodcloud_db_query_duration_seconds{quantile="0.95"} > SLO for > 5 min | WARNING | Investigate via pg_stat_statements |
| Statement timeout rate spike | rate(hoodcloud_db_statement_timeout_total[5m]) > 0.1 (any service) | HIGH | Follow statement timeout runbook below |
| Slow query rate sustained | pg_slow_query_log_rate > 10/min for > 15 min | WARNING | Review pg_stat_statements top queries |
| Health evaluator cycle exceeds SLO | hoodcloud_evaluation_cycle_duration_seconds{quantile="0.95"} > 0.5 | HIGH | ListSnapshots query degrading |

Statement Timeout Runbook

When statement timeouts trigger frequently:
  1. Identify the query: Check hoodcloud_db_statement_timeout_total metric labels for the service. Cross-reference with pg_stat_statements for recently degraded queries.
  2. Check for lock contention:
    SELECT * FROM pg_stat_activity WHERE wait_event_type = 'Lock';
    
    Health evaluator batch updates and API server reads can contend.
  3. Check for table bloat:
    SELECT schemaname, relname, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables ORDER BY n_dead_tup DESC;
    
    The nodes and node_health_state tables are high-write and bloat-prone.
  4. Check for missing indexes: Run EXPLAIN ANALYZE on the timed-out query. Verify indexes exist on all JOIN and WHERE columns.
  5. Check for data growth: If node count has grown significantly, ListSnapshots may need pagination or the health evaluator’s statement timeout may need adjustment.
  6. Temporary mitigation: If a single runaway query is causing cascading timeouts, identify and pg_cancel_backend() it. Do NOT increase statement timeouts without understanding root cause.
  7. Escalation: If timeouts persist after steps 1-6, this may indicate the scale trigger for PgBouncer or a health read replica.
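The triage in step 6 amounts to filtering pg_stat_activity for long-running active backends before passing their PIDs to pg_cancel_backend(). The sketch below does that filtering over rows represented as dicts; the row shape and the 30-second threshold are illustrative assumptions.

```python
from datetime import datetime, timedelta

def runaway_backends(activity_rows, now, threshold=timedelta(seconds=30)):
    """Given rows shaped like pg_stat_activity (dicts with pid, state,
    query_start), return PIDs of active queries running longer than the
    threshold. These are candidates for pg_cancel_backend()."""
    return [r["pid"] for r in activity_rows
            if r["state"] == "active" and now - r["query_start"] > threshold]

now = datetime(2025, 1, 1, 12, 0, 0)
rows = [
    {"pid": 101, "state": "active", "query_start": now - timedelta(minutes=5)},
    {"pid": 102, "state": "idle",   "query_start": now - timedelta(minutes=5)},
    {"pid": 103, "state": "active", "query_start": now - timedelta(seconds=2)},
]
print(runaway_backends(rows, now))  # [101]
```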