Engineering & Runbooks

Engineering Overview

High confidence · Concept · Edited by: Cairni · just now · AI-generated v1

What this knowledge base covers

This wiki captures the engineering team's institutional knowledge: architecture decisions, incident postmortems, and operational runbooks. Use it as the starting point to understand how the system is built, why key decisions were made, and how to respond when things go wrong. (Source: Engineering — Incidents & Decisions.md)


System architecture

The architecture diagram (not reproduced here) reflects what the source material describes: a public API backed by a managed Postgres primary datastore, with a shared connection pool as a critical intermediary between application code and the database. (Source: Engineering — Incidents & Decisions.md)

Key constraint: application code must use the shared connection pool. Opening ad-hoc per-request DB connections is explicitly forbidden, because this pattern caused the 2026-05-12 API outage. (Source: Engineering — Incidents & Decisions.md)
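
The constraint above can be sketched as follows. This is a minimal, stdlib-only illustration and not the team's actual pool: `SharedPool`, `FakeConnection`, `handle_request`, and `max_size` are hypothetical names, and a real service would more likely rely on its database driver's pooling support. It shows only the general idea that a bounded pool caps the number of open connections regardless of request volume, whereas per-request connections grow with traffic.

```python
import queue
from contextlib import contextmanager


class FakeConnection:
    """Stand-in for a real database connection (hypothetical)."""

    opened = 0  # how many connections have ever been created

    def __init__(self):
        FakeConnection.opened += 1

    def execute(self, sql):
        return f"ran: {sql}"


class SharedPool:
    """Bounded pool: exactly max_size connections are created up front."""

    def __init__(self, factory, max_size):
        self._idle = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=1.0):
        # Wait for an idle connection instead of opening a new one.
        # Raises queue.Empty if none is returned within `timeout` seconds.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._idle.put(conn)


# One pool for the whole process (assumed size; tune to the database limit).
pool = SharedPool(FakeConnection, max_size=2)


def handle_request(sql):
    # Borrow from the shared pool rather than connecting per request.
    with pool.connection() as conn:
        return conn.execute(sql)
```

With this shape, a burst of requests waits for a free connection rather than pushing the database past its connection limit, which is the failure mode the postmortem describes.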

Key documents

Architecture Decision Records (ADRs)

ADR | Decision | Status
ADR-014 | Use Postgres as the primary datastore for relational data | Decided

Postmortems

Incident | Impact | Root Cause
2026-05-12 API Outage | ~45 min of 5xx, checkout blocked | DB connection pool exhaustion from ad-hoc connections

Runbooks

Runbook | Purpose
Roll Back a Bad Deploy | Step-by-step procedure to revert a bad production deploy

Outstanding follow-ups

The items below were opened after the 2026-05-12 outage and remain tracked in that postmortem. (Source: Engineering — Incidents & Decisions.md)

  • Add a connection-count alert at 80% of max
  • Lint rule to forbid ad-hoc connections outside the pool
  • Load-test new endpoints before release
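
The first two follow-ups could be approached along these lines. This is a rough sketch under stated assumptions, not the tracked implementation: `pool_alert_needed`, `find_adhoc_connections`, and `FORBIDDEN_CALLS` are hypothetical names, and treating `psycopg2.connect` as the ad-hoc entry point is an assumption, since the source does not name the driver in use.

```python
import ast


def pool_alert_needed(active, max_connections, threshold=0.80):
    """Return True when connection usage is at or above the alert threshold."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections >= threshold


# Assumed driver entry points that count as "ad-hoc" connections.
FORBIDDEN_CALLS = {("psycopg2", "connect")}


def find_adhoc_connections(source):
    """Return line numbers of direct `module.connect(...)` calls in `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and (node.func.value.id, node.func.attr) in FORBIDDEN_CALLS
        ):
            hits.append(node.lineno)
    return sorted(hits)
```

A check like `find_adhoc_connections` would run over every file except the module that owns the pool. It only catches the literal `psycopg2.connect(...)` form, so aliased imports would need extra handling.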

Source material

This knowledge base is compiled from Engineering — Incidents & Decisions.md.