Postmortem: 2026-05-12 API Outage

Overview

On 2026-05-12, the public API returned 5xx errors for approximately 45 minutes, fully blocking checkout. The root cause was a new endpoint that bypassed the shared database connection pool and, under load, exhausted all available Postgres connections. (Source: Engineering — Incidents & Decisions.md)

See also: Engineering Overview · ADR-014: Postgres as Primary Datastore · Runbook: Roll Back a Bad Deploy


Impact

  • Duration: ~45 minutes of 5xx responses on the public API
  • User-facing effect: Checkout was completely blocked during the window

Timeline

  1. 2026-05-12T14:02
    Deploy shipped: new endpoint introduced ad-hoc DB connections
  2. 2026-05-12T14:10
    DB connection pool saturated
  3. 2026-05-12T14:10
    API begins timing out (5xx responses start)
  4. 2026-05-12T14:47
    Rollback triggered
  5. 2026-05-12T14:55
    Service recovered
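
The intervals implied by the timeline can be checked with a little arithmetic. The sketch below is illustrative only: the timestamps come from the timeline, while the event keys and the `minutes_between` helper are names made up for this example.

```python
from datetime import datetime

# Timestamps copied from the timeline; the dictionary keys are illustrative names.
EVENTS = {
    "deploy_shipped": "2026-05-12T14:02",
    "errors_started": "2026-05-12T14:10",
    "rollback_triggered": "2026-05-12T14:47",
    "service_recovered": "2026-05-12T14:55",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes elapsed between two named events."""
    delta = datetime.fromisoformat(EVENTS[end]) - datetime.fromisoformat(EVENTS[start])
    return int(delta.total_seconds() // 60)


INTERVALS = {
    "deploy_to_errors": minutes_between("deploy_shipped", "errors_started"),
    "errors_to_rollback": minutes_between("errors_started", "rollback_triggered"),
    "rollback_to_recovery": minutes_between("rollback_triggered", "service_recovered"),
    "total_outage": minutes_between("errors_started", "service_recovered"),
}

if __name__ == "__main__":
    for name, minutes in INTERVALS.items():
        print(f"{name}: {minutes} min")
```

The total from the first 5xx responses to recovery works out to 45 minutes, matching the duration stated under Impact. Most of that window (37 minutes) falls between the start of errors and the rollback being triggered.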

[Figure: Incident Flow]


Root Cause

A new endpoint opened a fresh Postgres connection per request instead of drawing from the shared connection pool. Under load, this exhausted all available Postgres connections, causing the API to time out across the board.
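
The failure mode can be illustrated with a small simulation. This is a sketch under stated assumptions, not the service's actual code: `FakeServer`, `SharedPool`, the limit of 5 connections, and the pool size of 4 are all hypothetical stand-ins, and no real database driver is involved. It contrasts requests that each open their own connection with requests that borrow from a fixed-size pool.

```python
import queue
import threading
from contextlib import contextmanager

MAX_SERVER_CONNECTIONS = 5  # hypothetical stand-in for the Postgres connection limit


class FakeServer:
    """Counts open connections and refuses any beyond the limit."""

    def __init__(self, limit):
        self.limit = limit
        self.open = 0
        self.refused = 0

    def connect(self):
        if self.open >= self.limit:
            self.refused += 1
            raise ConnectionError("too many connections")
        self.open += 1
        return self.open  # connection id


class SharedPool:
    """Opens a fixed number of connections up front and lends them out."""

    def __init__(self, server, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(server.connect())

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks until a connection is returned
        try:
            yield conn
        finally:
            self._idle.put(conn)


def simulate_ad_hoc(in_flight):
    """Each in-flight request opens (and holds) its own connection."""
    server = FakeServer(MAX_SERVER_CONNECTIONS)
    for _ in range(in_flight):
        try:
            server.connect()
        except ConnectionError:
            pass  # this request would surface as a timeout or 5xx
    return server.open, server.refused


def simulate_pooled(in_flight, pool_size=4):
    """Requests share a pool sized below the server limit and wait their turn."""
    server = FakeServer(MAX_SERVER_CONNECTIONS)
    pool = SharedPool(server, pool_size)
    completed = []

    def handle_request():
        with pool.connection():
            completed.append(1)

    threads = [threading.Thread(target=handle_request) for _ in range(in_flight)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return server.open, server.refused, len(completed)
```

With 20 concurrent requests, `simulate_ad_hoc(20)` leaves the fake server at its limit with 15 refused connections, while `simulate_pooled(20)` completes all 20 requests on 4 connections with none refused. In the pooled case, excess requests queue for a connection instead of consuming server slots.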

For context on why Postgres is the primary datastore and how connection discipline is expected, see ADR-014: Postgres as Primary Datastore.


Follow-ups

  • Add a connection-count alert at 80% of max
  • Add a lint rule to forbid ad-hoc connections outside the shared pool
  • Load-test new endpoints before release
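
The first two follow-ups could look roughly like the sketch below. Everything named here is an assumption made for illustration: the source does not say which language or driver the service uses, so `psycopg2.connect` is only an example of a forbidden call, and `db/pool.py` is a hypothetical path for the one module allowed to open connections.

```python
import ast

ALERT_PERCENT = 80  # from the follow-up: alert at 80% of max

# Assumptions for illustration: driver call to forbid, and the module allowed to make it.
FORBIDDEN_CALLS = {("psycopg2", "connect")}
ALLOWED_MODULES = {"db/pool.py"}


def alert_threshold(max_connections: int) -> int:
    """Smallest connection count at or above ALERT_PERCENT of the maximum."""
    return (max_connections * ALERT_PERCENT + 99) // 100  # integer ceiling


def should_alert(current: int, max_connections: int) -> bool:
    """True once the current connection count reaches the alert threshold."""
    return current >= alert_threshold(max_connections)


def find_ad_hoc_connections(source: str, filename: str) -> list:
    """Return line numbers of direct driver connect() calls outside the pool module."""
    if filename in ALLOWED_MODULES:
        return []
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and (node.func.value.id, node.func.attr) in FORBIDDEN_CALLS
        ):
            hits.append(node.lineno)
    return sorted(hits)


SAMPLE = (
    "import psycopg2\n"
    "\n"
    "def handler():\n"
    "    conn = psycopg2.connect('dbname=app')\n"
    "    return conn\n"
)
```

Calling `find_ad_hoc_connections(SAMPLE, "api/new_endpoint.py")` flags line 4 of the sample, while the same source under `db/pool.py` passes. For the alert, Postgres exposes the current session count through the `pg_stat_activity` view and the limit through the `max_connections` setting. How those values reach the alerting system is left open here.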

Recovery Procedure Used

The incident was resolved by rolling back the offending deploy. For the standard step-by-step rollback procedure, see Runbook: Roll Back a Bad Deploy.

Note: Because no database schema migration was involved in this incident, a straightforward app-only rollback was safe. If a schema migration had been applied, a more careful process would have been required per the runbook.

Source: Engineering — Incidents & Decisions.md