Engineering & Runbooks

Engineering Overview

Confidence: High · Concept · Edited by: Cairni · just now · AI-generated · v1

This knowledge base is the canonical reference for the engineering team's institutional memory: it records how the system is built, why critical decisions were made, what has gone wrong in production, and exactly how to recover when things break. It spans three interlocking document types — Architecture Decision Records (ADRs), incident postmortems, and operational runbooks — all drawn from Engineering — Incidents & Decisions.md. Reading this page gives you a working mental model of the whole; every section links out to the deeper document where you need it.

System Architecture

The production system exposes a public API that serves end-user clients, including the checkout flow. All application code interacts with the database exclusively through a shared DB connection pool; opening ad-hoc, per-request connections is forbidden by policy, and a lint rule to enforce this statically is an open follow-up from the 2026-05-12 outage. The primary datastore is a managed Postgres instance, which holds the relational data: users, notebooks, and billing records. Schema evolution is handled through Alembic migrations applied directly to Postgres. This architecture reflects the decision captured in ADR-014: Postgres as Primary Datastore, and the importance of the shared connection pool was demonstrated by the incident described in Postmortem: 2026-05-12 API Outage. Engineering — Incidents & Decisions.md

The architectural constraint at the heart of this design, that the connection pool is the *only* sanctioned path from application code to Postgres, is not merely a best practice. It is a hard-won rule: violating it under load can exhaust the database's available connections and take down the API, as happened on 2026-05-12. Engineering — Incidents & Decisions.md
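The pool-only rule can be illustrated with a minimal sketch. It is not the team's actual pool implementation, which the source does not describe; `SharedPool`, `fake_connect`, and the `max_size` of 2 are hypothetical names and values chosen for illustration, and the stand-in factory returns plain objects rather than real Postgres connections.

```python
import queue
from contextlib import contextmanager


class SharedPool:
    """Fixed-size pool: callers borrow a connection and always return it."""

    def __init__(self, connect, max_size):
        # `connect` is any zero-argument factory returning a connection object.
        self._idle = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=1.0):
        # Waits up to `timeout` seconds, then raises queue.Empty, so excess
        # demand fails inside the application instead of opening more
        # connections against the database.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)


# Stand-in factory for the demo; a real one would open a Postgres connection.
opened = []


def fake_connect():
    opened.append(object())
    return opened[-1]


pool = SharedPool(fake_connect, max_size=2)
for _ in range(100):                 # simulate 100 sequential requests
    with pool.connection() as conn:
        pass                         # queries would run on `conn` here
print(len(opened))                   # prints 2: never more than max_size
```

The point of the design is that the number of database connections is bounded by the pool size regardless of request volume, whereas a per-request connect grows with traffic.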

Architecture Decision Records

ADRs are the team's mechanism for recording *why* the system is built the way it is, so that future engineers don't relitigate settled questions without context. The currently documented decision is ADR-014: Postgres as Primary Datastore. Engineering — Incidents & Decisions.md

When the team evaluated datastores for relational data, three candidates were considered. MongoDB was rejected because the workload requires multi-table transactions and joins, which the team judged MongoDB unable to provide reliably. DynamoDB was rejected for two reasons: it introduces significant operational lock-in to a single cloud vendor, and its ad-hoc query capabilities are too weak for the team's needs. Managed Postgres was chosen because it delivers ACID transactions, full SQL flexibility including arbitrary joins, and strong consistency guarantees, all of which the users, notebooks, and billing data models require. The accepted cost is that the schema is explicit, so every schema change must be written and applied as an Alembic migration. Engineering — Incidents & Decisions.md

This decision has downstream consequences for how incidents unfold. Because the system is relational and schema-managed, any deploy that includes a database migration requires special care during a rollback — you cannot simply redeploy an older commit if the schema has already been advanced. This caveat is a first-class concern in the Runbook: Roll Back a Bad Deploy.

Incident Postmortems

Postmortems capture what happened when something broke, so the team can learn from it and prevent recurrence. The documented incident is the Postmortem: 2026-05-12 API Outage. Engineering — Incidents & Decisions.md

On 2026-05-12, a deploy at 14:02 introduced a new endpoint that opened a fresh Postgres connection for every incoming HTTP request, bypassing the shared pool entirely. Under real production load, this exhausted the available database connections within eight minutes: by 14:10 no connections remained, and the API began returning 5xx errors. Checkout was completely blocked. The team triggered a rollback at 14:47, and the service recovered by 14:55, for a total outage of approximately 45 minutes, counted from the first errors at 14:10. Engineering — Incidents & Decisions.md
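The intervals in this timeline can be checked with a few lines of arithmetic. The timestamps come from the postmortem; the assumption that the outage is counted from the first 5xx errors (14:10) rather than from the deploy (14:02) is an inference from the 45-minute figure, and the helper names are illustrative.

```python
from datetime import datetime


def at(hhmm):
    """Parse a wall-clock time on the day of the incident."""
    return datetime.strptime(f"2026-05-12 {hhmm}", "%Y-%m-%d %H:%M")


def minutes(start, end):
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


deploy = at("14:02")        # bad deploy goes out
errors_start = at("14:10")  # connections exhausted, 5xx errors begin
rollback = at("14:47")      # rollback triggered
recovered = at("14:55")     # service recovered

time_to_exhaustion = minutes(deploy, errors_start)    # 8
time_to_rollback = minutes(errors_start, rollback)    # 37
rollback_to_recovery = minutes(rollback, recovered)   # 8
outage = minutes(errors_start, recovered)             # 45
```

Of the 45 minutes, 37 elapsed before the rollback was triggered; the rollback itself took effect within 8.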

The incident generated three concrete follow-up action items, all of which remain open: adding a connection-count alert that fires at 80% of the pool's maximum capacity, introducing a lint rule that statically forbids ad-hoc connection opens outside the pool, and requiring load testing of new endpoints before any production release. These follow-ups represent the team's commitment to preventing the same class of failure from recurring. Engineering — Incidents & Decisions.md
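The first follow-up, the connection-count alert, reduces to a simple threshold check. The sketch below assumes the active connection count and the pool maximum are already available from monitoring; `should_alert` and the example pool maximum of 20 are hypothetical, and only the 80% threshold comes from the source.

```python
ALERT_PERCENT = 80  # from the follow-up item: alert at 80% of the pool maximum


def should_alert(active_connections, pool_max, percent=ALERT_PERCENT):
    """Return True once usage reaches `percent` of the pool maximum."""
    if pool_max <= 0:
        raise ValueError("pool_max must be positive")
    # Integer comparison avoids floating-point rounding at the boundary.
    return active_connections * 100 >= pool_max * percent


# With a hypothetical pool maximum of 20, the alert fires at 16 connections.
print(should_alert(15, 20))  # False
print(should_alert(16, 20))  # True
```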

The outage also directly validates the architectural constraint that accompanies ADR-014: Postgres as Primary Datastore. The connection pool is load-bearing infrastructure, not a suggestion.
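One way the lint-rule follow-up from this postmortem could look is a small static check over the syntax tree. This is a sketch under stated assumptions, not the team's rule: the source does not name the database driver, so `psycopg2.connect` is assumed as the forbidden call, and `app/db/pool.py` and `app/api/new_endpoint.py` are hypothetical paths.

```python
import ast

FORBIDDEN_CALLS = {"psycopg2.connect"}  # assumption: the driver in use
ALLOWED_FILES = {"app/db/pool.py"}      # hypothetical pool module


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_adhoc_connects(source, filename):
    """List (filename, line) for forbidden connect calls outside the pool."""
    if filename in ALLOWED_FILES:
        return []
    violations = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Call) and dotted_name(node.func) in FORBIDDEN_CALLS:
            violations.append((filename, node.lineno))
    return violations


endpoint_source = (
    "import psycopg2\n"
    "\n"
    "def handler(request):\n"
    "    conn = psycopg2.connect(DSN)\n"
    "    return conn\n"
)
print(find_adhoc_connects(endpoint_source, "app/api/new_endpoint.py"))
# [('app/api/new_endpoint.py', 4)]
```

This sketch only matches the fully qualified call; an aliased import such as `from psycopg2 import connect` would slip through and would need additional import tracking.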

Operational Runbooks

Runbooks are the team's step-by-step playbooks for high-stakes operations, written so that an on-call engineer can execute them correctly under pressure without needing to reason from first principles. The documented procedure is the Runbook: Roll Back a Bad Deploy. Engineering — Incidents & Decisions.md

The rollback procedure has five steps. First, confirm the regression by checking the dashboard for an elevated error rate or latency spike — do not act until the problem is confirmed. Second, find the last green pipeline on the main branch in CI; this is the target commit for the rollback. Third, trigger the rollback deploy by running the deploy job pinned to that last-known-good commit. Fourth, verify recovery by watching the error rate return to baseline within five minutes of the rollback completing. Fifth, post in #incidents with the rollback commit hash and a one-line description of the cause. Engineering — Incidents & Decisions.md
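Steps two and four of the procedure above lend themselves to a small sketch. The `Pipeline` record, its status values, and the commit names are hypothetical; a real version would read pipeline history from the CI system and error rates from the dashboard, neither of which the source specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    commit: str
    branch: str
    status: str  # hypothetical values: "success" or "failed"


def last_green_commit(pipelines, branch="main"):
    """Given pipelines ordered newest first, return the newest green commit."""
    for p in pipelines:
        if p.branch == branch and p.status == "success":
            return p.commit
    return None


def has_recovered(error_rates, baseline):
    """True if the most recent error-rate sample is at or below baseline."""
    return bool(error_rates) and error_rates[-1] <= baseline


history = [
    Pipeline("c3", "main", "failed"),
    Pipeline("f1", "feature-x", "success"),  # green, but not on main
    Pipeline("b2", "main", "success"),
    Pipeline("a1", "main", "success"),
]
print(last_green_commit(history))               # b2
print(has_recovered([0.31, 0.12, 0.01], 0.01))  # True
```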

The critical edge case in the runbook is the database migration check. Before triggering the rollback, the engineer must determine whether the bad deploy included a schema migration. If it did, rolling back the application code alone may leave the schema in an incompatible state — the migration must be examined and potentially reversed first. This is the exact scenario that makes Postgres's explicit Alembic migration model a double-edged sword: it provides full control, but also full responsibility. Engineering — Incidents & Decisions.md
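The migration check described above can be approximated by looking at which files changed between the rollback target and the bad deploy. This is a sketch, not the runbook's own tooling: it assumes migrations live under `alembic/versions/` (the actual directory depends on how Alembic was initialised) and that the list of changed paths has already been obtained, for example from `git diff --name-only`. The function names and file names are illustrative.

```python
from pathlib import PurePosixPath

MIGRATIONS_DIR = PurePosixPath("alembic/versions")  # assumption: repo layout


def migrations_in(changed_files, migrations_dir=MIGRATIONS_DIR):
    """Return the changed files that are Python files under migrations_dir."""
    found = []
    for name in changed_files:
        path = PurePosixPath(name)
        if path.suffix == ".py" and migrations_dir in path.parents:
            found.append(name)
    return found


def rollback_plan(changed_files):
    """Decide whether a code-only rollback is safe to trigger immediately."""
    migrations = migrations_in(changed_files)
    if migrations:
        # The schema may already be ahead of the target commit.
        return "review-migrations-first", migrations
    return "code-rollback-only", []


changed = [
    "app/api/new_endpoint.py",
    "alembic/versions/3f2a_add_notebook_column.py",
]
print(rollback_plan(changed))
# ('review-migrations-first', ['alembic/versions/3f2a_add_notebook_column.py'])
```

A file-path check only flags that a migration exists; whether it must be reversed before the rollback still requires reading the migration itself.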

The rollback runbook is the procedure that was applied, or should have been applied, during the incident documented in Postmortem: 2026-05-12 API Outage, where the rollback was triggered at 14:47 and recovery was confirmed at 14:55.

How These Documents Relate

The three document types in this knowledge base form a closed loop. ADRs establish the architectural constraints the system must respect. Postmortems record what happens when those constraints are violated or when the system encounters conditions it was not designed to handle. Runbooks encode the operational responses that postmortems prove are necessary. Together, they represent a living record of the engineering team's collective learning. Engineering — Incidents & Decisions.md

The source of truth for all current content is Engineering — Incidents & Decisions.md. As new incidents occur, new decisions are made, and new procedures are documented, they should be added to that source file and reflected in the linked pages: Postmortem: 2026-05-12 API Outage, ADR-014: Postgres as Primary Datastore, and Runbook: Roll Back a Bad Deploy.

Document Index

Type	Document	One-line Summary
ADR	ADR-014: Postgres as Primary Datastore	Managed Postgres chosen over MongoDB and DynamoDB for relational data
Postmortem	Postmortem: 2026-05-12 API Outage	45-min checkout outage caused by DB connection exhaustion from an endpoint that bypassed the shared pool
Runbook	Runbook: Roll Back a Bad Deploy	5-step procedure to confirm, execute, and verify a production rollback
Source	Engineering — Incidents & Decisions.md	Raw source file containing all three documents above