Postmortem: 2026-05-12 API Outage
Overview
On 2026-05-12, the public API returned 5xx errors for approximately 45 minutes, fully blocking checkout. The root cause was a new endpoint that bypassed the shared database connection pool and exhausted all available Postgres connections under load.
See also: Engineering Overview · ADR-014: Postgres as Primary Datastore · Runbook: Roll Back a Bad Deploy
Impact
- Duration: ~45 minutes of 5xx responses on the public API
- User-facing effect: Checkout was completely blocked during the window
Timeline
- 2026-05-12T14:02 Deploy shipped; new endpoint introduced ad-hoc DB connections
- 2026-05-12T14:10 DB connection pool saturated
- 2026-05-12T14:10 API begins timing out (5xx responses start)
- 2026-05-12T14:47 Rollback triggered
- 2026-05-12T14:55 Service recovered
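The intervals implied by the timeline can be checked with a few lines of arithmetic. The sketch below copies the timestamps from the entries above; the event key names and the `minutes_between` helper are illustrative, not part of the incident record, and the record does not state a timezone, so the values are treated as naive local times.

```python
from datetime import datetime

# Timestamps copied from the timeline; key names are illustrative.
TIMELINE = {
    "deploy_shipped": "2026-05-12T14:02",
    "pool_saturated": "2026-05-12T14:10",
    "errors_started": "2026-05-12T14:10",
    "rollback_triggered": "2026-05-12T14:47",
    "service_recovered": "2026-05-12T14:55",
}


def minutes_between(start_key: str, end_key: str) -> int:
    """Whole minutes elapsed between two timeline events."""
    start = datetime.fromisoformat(TIMELINE[start_key])
    end = datetime.fromisoformat(TIMELINE[end_key])
    return int((end - start).total_seconds() // 60)


if __name__ == "__main__":
    print("deploy -> first 5xx:  ", minutes_between("deploy_shipped", "errors_started"))
    print("first 5xx -> rollback:", minutes_between("errors_started", "rollback_triggered"))
    print("rollback -> recovery: ", minutes_between("rollback_triggered", "service_recovered"))
    print("total outage:         ", minutes_between("errors_started", "service_recovered"))
```

Under these timestamps, errors began 8 minutes after the deploy, the rollback was triggered 37 minutes into the outage, and recovery followed 8 minutes later, which matches the stated total of roughly 45 minutes.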
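Incident Flow
Deploy shipped -> ad-hoc DB connections opened per request -> Postgres connections exhausted -> API timeouts (5xx) -> rollback triggered -> service recovered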
Root Cause
A new endpoint opened a fresh Postgres connection per request instead of drawing from the shared connection pool. Under load, this exhausted all available Postgres connections, causing the API to time out across the board.
For context on why Postgres is the primary datastore and how connection discipline is expected, see ADR-014: Postgres as Primary Datastore.
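The failure mode described in this section can be illustrated with a small simulation. This is a minimal sketch and not the service's actual code: `FakeServer`, `SharedPool`, and the limit of 5 connections are hypothetical stand-ins for Postgres, the shared pool, and its connection limit. It only shows why per-request connections fail once concurrent requests outnumber the limit, while a fixed-size pool makes requests wait instead.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeServer:
    """Hypothetical stand-in for Postgres: refuses connections past a limit."""

    def __init__(self, limit: int):
        self.limit = limit
        self.open = 0
        self._lock = threading.Lock()

    def connect(self):
        with self._lock:
            if self.open >= self.limit:
                raise ConnectionError("too many connections")
            self.open += 1
        return object()  # opaque connection handle

    def close(self, conn) -> None:
        with self._lock:
            self.open -= 1


class SharedPool:
    """Fixed-size pool: callers wait for a free connection, never open new ones."""

    def __init__(self, server: FakeServer, size: int):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(server.connect())

    def acquire(self, timeout: float = 5.0):
        return self._idle.get(timeout=timeout)

    def release(self, conn) -> None:
        self._idle.put(conn)


def simulate_ad_hoc(server: FakeServer, in_flight: int) -> int:
    """Each in-flight request opens its own connection; returns how many were refused."""
    held, refused = [], 0
    for _ in range(in_flight):
        try:
            held.append(server.connect())
        except ConnectionError:
            refused += 1
    for conn in held:
        server.close(conn)
    return refused


def simulate_pooled(server: FakeServer, pool_size: int, in_flight: int) -> int:
    """Concurrent requests share one pool; returns peak open connections on the server."""
    pool = SharedPool(server, pool_size)

    def handle_request(_):
        conn = pool.acquire()
        try:
            time.sleep(0.01)  # pretend to run a query
            return server.open
        finally:
            pool.release(conn)

    with ThreadPoolExecutor(max_workers=in_flight) as executor:
        return max(executor.map(handle_request, range(in_flight)))


if __name__ == "__main__":
    print("ad-hoc refused:", simulate_ad_hoc(FakeServer(limit=5), in_flight=8))
    print("pooled peak:   ", simulate_pooled(FakeServer(limit=5), pool_size=3, in_flight=8))
```

With 8 concurrent requests against a limit of 5, the ad-hoc path has 3 connections refused, while the pooled path never holds more than its 3 pooled connections and the remaining requests wait their turn.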
Follow-ups
- Add a connection-count alert at 80% of max
- Add a lint rule to forbid ad-hoc connections outside the shared pool
- Load-test new endpoints before release
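The first two follow-ups above can be sketched as follows. The incident record does not name the service's language, database driver, or file layout, so everything specific here is an assumption made for illustration: the pool module path `db/pool.py`, the choice of `psycopg2.connect` and `psycopg.connect` as the forbidden calls, and the function names. How the active connection count is collected is left out.

```python
import ast

ALERT_PERCENT = 80  # from the follow-up: alert at 80% of max


def connection_alert(active: int, max_connections: int) -> bool:
    """True once active connections reach 80% of the configured maximum."""
    # Integer arithmetic avoids floating-point edge cases at the boundary.
    return active * 100 >= ALERT_PERCENT * max_connections


# Assumption: only this module is allowed to open connections directly.
POOL_MODULE = "db/pool.py"
# Assumption: driver calls that open an ad-hoc connection.
FORBIDDEN_CALLS = {"psycopg2.connect", "psycopg.connect"}


def _dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_ad_hoc_connections(source: str, filename: str) -> list:
    """Line numbers of forbidden connect calls in files other than the pool module."""
    if filename == POOL_MODULE:
        return []
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Call) and _dotted_name(node.func) in FORBIDDEN_CALLS:
            hits.append(node.lineno)
    return sorted(hits)


if __name__ == "__main__":
    sample = (
        "import psycopg2\n"
        "\n"
        "def handler():\n"
        "    conn = psycopg2.connect('dbname=app')\n"
        "    return conn\n"
    )
    print(connection_alert(active=80, max_connections=100))
    print(find_ad_hoc_connections(sample, "api/orders.py"))
```

This check only matches fully qualified calls. An aliased import such as `from psycopg2 import connect` would slip past it, so a production rule would need to track imports as well.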
Recovery Procedure Used
The incident was resolved by rolling back the offending deploy. For the standard step-by-step rollback procedure, see Runbook: Roll Back a Bad Deploy.
Note: Because no database schema migration was involved in this incident, a straightforward app-only rollback was safe. If a schema migration had been applied, a more careful process would have been required per the runbook.
Source
Full incident record: Engineering — Incidents & Decisions.md