Engineering Overview
What this knowledge base covers
This wiki captures the engineering team's institutional knowledge: architecture decisions, incident postmortems, and operational runbooks. Use it as the starting point to understand how the system is built, why key decisions were made, and how to respond when things go wrong. (Source: Engineering — Incidents & Decisions.md)
System architecture
The source material describes a public API backed by a managed Postgres primary datastore. A shared connection pool sits between application code and the database and is a critical intermediary: every database access is expected to go through it. (Source: Engineering — Incidents & Decisions.md)
Key constraint: application code must use the shared connection pool. Opening ad-hoc per-request DB connections is explicitly forbidden, because this pattern caused the 2026-05-12 API outage. (Source: Engineering — Incidents & Decisions.md)
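To make the constraint concrete, here is a minimal sketch of what "use the shared pool" can look like in application code. It is not the team's actual implementation: `ConnectionPool`, `FakeConnection`, `POOL`, and `handle_request` are hypothetical names, and the fake connection stands in for a real Postgres driver so the example runs with the standard library alone. The point it illustrates is that the number of open connections is bounded by the pool size no matter how many requests arrive.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical bounded pool: exactly max_size connections are created."""

    def __init__(self, connect, max_size):
        self._idle = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks until a connection is free; raises queue.Empty on timeout
        # instead of opening an extra connection.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._idle.put(conn)


class FakeConnection:
    """Stand-in for a real database connection (assumption for this sketch)."""

    opened = 0  # counts how many connections were ever created

    def __init__(self):
        FakeConnection.opened += 1

    def execute(self, sql):
        return f"ran: {sql}"


# One pool shared by the whole process, created once at startup.
POOL = ConnectionPool(FakeConnection, max_size=2)


def handle_request(order_id):
    # Borrow a pooled connection rather than connecting per request.
    with POOL.connection() as conn:
        return conn.execute("SELECT 1")
```

In this sketch, a caller that finds the pool empty waits and then fails with a timeout, which keeps load on the database bounded. An ad-hoc connection in the request path has no such bound, which is the failure mode the postmortem describes.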
Key documents
Architecture Decision Records (ADRs)
| ADR | Decision | Status |
|---|---|---|
| ADR-014 | Use Postgres as the primary datastore for relational data | Decided |
Postmortems
| Incident | Impact | Root Cause |
|---|---|---|
| 2026-05-12 API Outage | ~45 min of 5xx, checkout blocked | DB connection pool exhaustion from ad-hoc connections |
Runbooks
| Runbook | Purpose |
|---|---|
| Roll Back a Bad Deploy | Step-by-step procedure to revert a bad production deploy |
Outstanding follow-ups
The items below were opened after the 2026-05-12 outage and remain tracked in that postmortem. (Source: Engineering — Incidents & Decisions.md)
- Add a connection-count alert at 80% of max
- Add a lint rule that forbids ad-hoc connections outside the pool
- Load-test new endpoints before release
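The first two follow-ups can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tracked implementation: `pool_usage_alert` and `find_adhoc_connects` are hypothetical names, and `psycopg2.connect` is used only as an example of a call to forbid, since the source does not say which database driver is in use.

```python
import ast


def pool_usage_alert(active, max_connections, threshold_percent=80):
    """Return True when active connections reach the alert threshold."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    # Integer arithmetic avoids floating-point surprises at exactly 80%.
    return active * 100 >= threshold_percent * max_connections


# Assumption: the driver call to forbid. Adjust to the real driver in use.
FORBIDDEN_CALLS = frozenset({"psycopg2.connect"})


def find_adhoc_connects(source, forbidden=FORBIDDEN_CALLS):
    """Return sorted line numbers of forbidden connect calls in source code.

    Limitation of this sketch: it matches the dotted call name literally,
    so `from psycopg2 import connect; connect(...)` is not detected.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and ast.unparse(node.func) in forbidden:
            hits.append(node.lineno)
    return sorted(hits)
```

A real lint rule would run a check like `find_adhoc_connects` over every file except the module that owns the pool, and fail the build on any hit.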
Source material
This knowledge base is compiled from Engineering — Incidents & Decisions.md.