Engineering — Incidents & Decisions.md

High confidence · Source summary · Edited by: Cairni · just now · AI-generated · v1

Overview

This source file contains three sections of engineering operational knowledge: an incident postmortem, an architecture decision record, and an operational runbook.


Sections at a glance

  • Postmortem: 2026-05-12 API outage (45 min, checkout blocked; root cause: DB connection leak)
  • ADR: ADR-014, Postgres as primary datastore (MongoDB and DynamoDB rejected)
  • Runbook: Roll back a bad deploy (5-step procedure with schema-change caveat)

Postmortem — 2026-05-12 API Outage

  • Impact: ~45 minutes of 5xx errors on the public API; checkout was blocked.
  • Root cause: a new endpoint opened a DB connection per request outside the shared pool, exhausting Postgres connections under load.
  • Key timeline points: deploy at 14:02 → pool saturated at 14:10 → API timeouts → rollback at 14:47 → recovery by 14:55.
  • Open follow-ups:
      • Add a connection-count alert at 80% of max connections.
      • Add a lint rule that forbids ad-hoc connections outside the pool.
      • Load-test new endpoints before release.
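
The lint-rule follow-up could be prototyped with Python's standard `ast` module. The sketch below is only an illustration under stated assumptions: the pool module path `app/db/pool.py` and the list of driver calls are hypothetical, since the source does not say which driver or project layout is in use.

```python
import ast

# Hypothetical policy: only this module may open raw connections.
POOL_MODULE = "app/db/pool.py"
# Driver calls treated as ad-hoc connections (assumed; adjust to the real driver).
FORBIDDEN_CALLS = {"psycopg2.connect", "psycopg.connect", "asyncpg.connect"}


def _dotted_name(node: ast.AST) -> str:
    """Return 'a.b.c' for a Name/Attribute chain, or '' for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return ""


def find_adhoc_connections(source: str, filename: str) -> list[tuple[int, str]]:
    """Return (line, call) pairs for forbidden connect calls outside the pool module."""
    if filename == POOL_MODULE:
        return []
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name in FORBIDDEN_CALLS:
                hits.append((node.lineno, name))
    return sorted(hits)
```

This check only matches fully qualified calls such as `psycopg2.connect(...)`. An aliased import (`from psycopg2 import connect`) would slip past it, so a production rule would also need to track imports.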

ADR-014 — Postgres as Primary Datastore

  • Decision: Use managed Postgres for relational data (users, notebooks, billing).
  • Rejected (MongoDB): the data needs transactions and joins.
  • Rejected (DynamoDB): operational lock-in; weak ad-hoc queries.
  • Consequence (+): Strong consistency and SQL flexibility.
  • Consequence (−): Manual schema migrations via Alembic.
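
To make the "transactions and joins" requirement concrete, here is a small sketch of both in SQL. It uses Python's built-in `sqlite3` as a stand-in so that it runs anywhere; it is not the Postgres setup the ADR describes, and the table columns are assumptions for illustration only.

```python
import sqlite3


def demo() -> list[tuple[str, int]]:
    """Show an atomic multi-row write and a join, using an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
        CREATE TABLE notebooks (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id),
            title TEXT NOT NULL
        );
        INSERT INTO users (id, email) VALUES (1, 'a@example.com'), (2, 'b@example.com');
    """)

    # Transaction 1: succeeds and is committed when the block exits.
    with conn:
        conn.execute("INSERT INTO notebooks (user_id, title) VALUES (1, 'Runbooks')")

    # Transaction 2: the second insert violates NOT NULL, so the whole
    # transaction (including the first insert) is rolled back.
    try:
        with conn:
            conn.execute("INSERT INTO notebooks (user_id, title) VALUES (2, 'Drafts')")
            conn.execute("INSERT INTO notebooks (user_id, title) VALUES (2, NULL)")
    except sqlite3.IntegrityError:
        pass

    # Join: notebook count per user, including users with no notebooks.
    rows = conn.execute("""
        SELECT u.email, COUNT(n.id)
        FROM users u LEFT JOIN notebooks n ON n.user_id = u.id
        GROUP BY u.id
        ORDER BY u.id
    """).fetchall()
    conn.close()
    return rows
```

Calling `demo()` returns one row per user. The second user ends up with zero notebooks because the failed transaction left no partial write behind.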

Runbook — Roll Back a Bad Deploy

High-level flow; full step-by-step detail belongs in a dedicated runbook page.

⚠️ If the database schema changed, do not roll back the app alone. Check the migration first.
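
The schema-change caveat can be expressed as a simple pre-rollback guard. This is a minimal sketch, not the team's actual tooling: the `Release` record and its `schema_revision` field are hypothetical, and it assumes each release records the migration revision it expects (for Alembic, the revision id).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Hypothetical release record; both fields are illustrative."""
    version: str
    schema_revision: str  # migration revision this release expects (assumed field)


def rollback_decision(current: Release, target: Release) -> tuple[bool, str]:
    """Return (safe_to_roll_back_app_alone, reason)."""
    if current.schema_revision == target.schema_revision:
        return True, "schema unchanged; app-only rollback is reasonable"
    return False, (
        f"schema changed ({target.schema_revision} -> {current.schema_revision}); "
        "check the migration before rolling back the app"
    )
```

A rollback script could call `rollback_decision` first and refuse to continue when the first element is `False`, leaving the migration review to a human.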

Key Entities Mentioned

  • Postgres: primary datastore (managed)
  • Alembic: schema migration tool
  • CI pipeline: triggers deploys and rollbacks
  • #incidents: Slack channel for rollback notifications