Engineering & Runbooks
Engineering — Incidents & Decisions.md
Confidence: high · Edited by: Cairni · just now · AI-generated · v1
Overview
This source file contains three sections of engineering operational knowledge: an incident postmortem, an architecture decision record (ADR), and an operational runbook. (Source: Engineering — Incidents & Decisions.md)
Sections at a glance
- Postmortem (1): 2026-05-12 API outage (45 min, checkout blocked; root cause: DB connection leak)
- ADR (1): ADR-014, Postgres as primary datastore (MongoDB and DynamoDB rejected)
- Runbook (1): Roll back a bad deploy (5-step procedure with schema-change caveat)
Postmortem — 2026-05-12 API Outage
- Impact: ~45 minutes of 5xx errors on the public API; checkout was blocked.
- Root cause: a new endpoint opened a DB connection per request outside the shared pool, exhausting Postgres connections under load.
- Key timeline points: deploy at 14:02 → pool saturated at 14:10 → API timeouts → rollback at 14:47 → recovery by 14:55.
- Open follow-ups:
  - Add a connection-count alert at 80% of max
  - Lint rule to forbid ad-hoc connections outside the pool
  - Load-test new endpoints before release
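
The first two follow-ups can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the team's actual tooling: the function names `connection_alert` and `find_adhoc_connections` are hypothetical, the 80% threshold is taken from the follow-up list, and the lint check assumes ad-hoc connections are opened by calling `psycopg2.connect` (the source does not name the database driver).

```python
import ast


def connection_alert(active: int, max_connections: int, threshold: float = 0.8) -> bool:
    """Return True when active connections reach the alert threshold.

    The 0.8 default mirrors the "80% of max" follow-up; the function
    itself is a hypothetical helper, not an existing API.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections >= threshold


def find_adhoc_connections(
    source: str, forbidden: tuple[str, ...] = ("psycopg2.connect",)
) -> list[int]:
    """Return the line numbers of calls to a forbidden connect function.

    Assumption: connections opened outside the shared pool show up as
    direct calls such as psycopg2.connect(...). A real rule would also
    exempt the module that builds the pool itself.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and ast.unparse(node.func) in forbidden:
            hits.append(node.lineno)
    return sorted(hits)
```

A check like `find_adhoc_connections` could run in CI against changed files, while `connection_alert` would be evaluated against whatever connection metric the monitoring system exposes.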
ADR-014 — Postgres as Primary Datastore
| Item | Detail |
|---|---|
| Decision | Use managed Postgres for relational data (users, notebooks, billing) |
| Rejected: MongoDB | The workload needs transactions and joins |
| Rejected: DynamoDB | Operational lock-in; weak ad-hoc queries |
| Consequence (+) | Strong consistency and SQL flexibility |
| Consequence (−) | Manual schema migrations via Alembic |
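
To illustrate why transactions and joins weighed against MongoDB, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for managed Postgres so that it runs anywhere; the table and column names (`users`, `notebooks`, `owner_id`, `title`) are hypothetical, loosely based on the entities named in the decision.

```python
import sqlite3


def demo() -> list[tuple[str, int]]:
    # sqlite3 is only a stand-in here; the ADR's datastore is managed Postgres.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE notebooks ("
        "id INTEGER PRIMARY KEY, "
        "owner_id INTEGER NOT NULL REFERENCES users(id), "
        "title TEXT NOT NULL)"
    )

    # Transaction: the user and both notebooks are committed together.
    with conn:
        conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")
        conn.executemany(
            "INSERT INTO notebooks (owner_id, title) VALUES (?, ?)",
            [(1, "Draft"), (1, "Notes")],
        )

    # A failing transaction rolls back as a unit, so user 2 is never stored.
    try:
        with conn:
            conn.execute("INSERT INTO users (id, email) VALUES (2, 'b@example.com')")
            conn.execute("INSERT INTO notebooks (owner_id, title) VALUES (2, NULL)")
    except sqlite3.IntegrityError:
        pass

    # Join: count notebooks per user in a single query.
    rows = conn.execute(
        "SELECT u.email, COUNT(n.id) "
        "FROM users u LEFT JOIN notebooks n ON n.owner_id = u.id "
        "GROUP BY u.id ORDER BY u.id"
    ).fetchall()
    conn.close()
    return rows
```

The second block fails on the `NOT NULL` constraint, and the rollback removes the partially inserted user, which is the all-or-nothing behavior the decision relies on for billing-style data.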
Runbook — Roll Back a Bad Deploy
High-level flow; full step-by-step detail belongs in a dedicated runbook page.
⚠️ If the database schema changed, do not roll back the app alone: check the migration first.
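
The schema-change caveat can be expressed as a small pre-rollback guard. This is a sketch rather than the runbook's actual procedure: `RollbackPlan` and `rollback_plan` are hypothetical names, and the sketch assumes each release records the Alembic revision it expects, which the source does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackPlan:
    app_only: bool  # True when rolling back the app alone is considered safe
    reason: str


def rollback_plan(db_revision: str, target_release_revision: str) -> RollbackPlan:
    """Decide whether the app can be rolled back without touching the schema.

    Both arguments are migration revision identifiers (hypothetical inputs):
    the revision the database is at now, and the revision the release being
    rolled back to was built against.
    """
    if db_revision == target_release_revision:
        return RollbackPlan(True, "schema unchanged since target release")
    return RollbackPlan(
        False,
        f"database is at {db_revision}, target release expects "
        f"{target_release_revision}; check the migration first",
    )
```

In practice the current database revision could be read with `alembic current`; how the target release's revision is stored is left open here.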
Key Entities Mentioned
- Postgres — primary datastore (managed)
- Alembic — schema migration tool
- CI pipeline — used to trigger deploys and rollbacks
- #incidents — Slack channel for rollback notifications