RAG (Retrieval-Augmented Generation)
Overview
Retrieval-Augmented Generation (RAG) is the dominant pattern for using large language models over document collections. By 2026, approximately 85% of enterprise AI applications use it. llm-wiki.en.md
The core mechanism is straightforward:
1. Chunk: source documents are split into smaller pieces.
2. Embed: each chunk is converted into a vector and stored in a vector database.
3. Retrieve: at query time, the closest matching chunks are fetched.
4. Generate: the LLM synthesizes an answer from the retrieved chunks.
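The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the bag-of-words `embed` function is a stand-in assumption for a real embedding model, the in-memory list stands in for a vector database, and the function names (`chunk`, `embed`, `build_index`, `retrieve`, `build_prompt`) are hypothetical. The final generation call is left out; the sketch stops at the prompt that would be sent to the LLM.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Step 1: split a document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Step 2: toy 'embedding' (word counts). A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(documents: list[str], max_words: int = 40) -> list[tuple[str, Counter]]:
    """Steps 1 and 2 combined: an in-memory list standing in for a vector database."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc, max_words)]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Step 3: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Step 4 (input side): assemble the prompt an LLM would answer from."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "RAG retrieves chunks at query time and the model answers from them.",
    "A compiled wiki does its work at ingest and keeps pages current.",
]
index = build_index(docs)
question = "When does RAG retrieve chunks?"
print(build_prompt(question, retrieve(index, question, k=1)))
```

Note that `retrieve` and `build_prompt` run again for every question, which is the query-time cost the next section describes.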
When the Work Happens
RAG's defining characteristic is that the heavy lifting occurs at query time. Every time a question is asked, the system retrieves relevant chunks and the model reconstructs an answer from scratch. Knowledge is re-derived on every query rather than compiled and kept current. This is the primary distinction when comparing it to the LLM Wiki pattern. llm-wiki.en.md
Scale
RAG's key strength is scale — it can handle millions of documents comfortably, far beyond what an index-first approach like the LLM Wiki can manage. llm-wiki.en.md
Known Failure Modes
RAG has well-documented failure modes in production:
- Confident wrong answers from poor sources: if the retrieved chunks are low quality, the model generates plausible-sounding but incorrect answers.
- Silent contradictions: conflicting chunks from different sources are retrieved side by side with no reconciliation.
- Low production rate: analyses report that 40–60% of RAG implementations never reach production, and only a fraction show measurable ROI. The cause is almost always knowledge-base quality rather than retrieval tuning. llm-wiki.en.md
RAG vs. the LLM Wiki
For a detailed side-by-side comparison, see LLM Wiki vs. RAG. The table below summarizes the key trade-offs: llm-wiki.en.md
| | LLM Wiki (compiled) | RAG (retrieved) |
|---|---|---|
| When work happens | At ingest (compile once) | At query (retrieve every time) |
| Knowledge over time | Compounds — pages get richer | Static — re-derived each query |
| Output | Human-readable, interlinked pages | Opaque chunks reassembled per answer |
| Contradictions | Surfaced and reconciled during ingest | Silently retrieved side by side |
| Setup | A folder of Markdown + a schema file | Embeddings + vector DB + pipeline |
| Scale ceiling | Hundreds–~1,000 pages comfortably | Millions of documents |
Hybrid Architecture
RAG and the LLM Wiki are not mutually exclusive. For large corpora, a realistic architecture combines both: a compiled wiki for hot, frequently-accessed context plus a RAG layer for broad retrieval over the long tail. llm-wiki.en.md
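One way to picture the hybrid is a small router that tries the compiled wiki first and falls back to retrieval. The sketch below is an assumption about how such routing could look, not a description of any particular implementation: `hybrid_context`, the title-overlap heuristic, and the `min_overlap` threshold are all hypothetical, and `rag_retrieve` is a placeholder for whatever retrieval function the RAG layer exposes.

```python
from typing import Callable


def hybrid_context(
    query: str,
    wiki_pages: dict[str, str],
    rag_retrieve: Callable[[str], list[str]],
    min_overlap: int = 2,
) -> tuple[str, list[str]]:
    """Return (source, context): a compiled wiki page if one matches well, else RAG chunks."""
    query_terms = set(query.lower().split())
    best_title, best_score = None, 0
    for title in wiki_pages:
        # Hypothetical heuristic: count query words that appear in the page title.
        score = len(query_terms & set(title.lower().split()))
        if score > best_score:
            best_title, best_score = title, score
    if best_title is not None and best_score >= min_overlap:
        return "wiki", [wiki_pages[best_title]]   # hot path: compiled page
    return "rag", rag_retrieve(query)             # long tail: broad retrieval


wiki = {"vector database setup": "Compiled page on vector database setup."}
fake_rag = lambda q: [f"chunk for: {q}"]  # stand-in for a real retrieval layer

print(hybrid_context("vector database setup steps", wiki, fake_rag)[0])
print(hybrid_context("quarterly revenue 2019", wiki, fake_rag)[0])
```

A real router would likely use something stronger than title overlap, but the shape stays the same: a cheap check against compiled pages, then retrieval only when that check misses.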
At personal scale, an index.md catalog is sufficient for hundreds of pages without any embeddings or vector database. A local search engine like qmd becomes useful only as the wiki grows large. llm-wiki.en.md
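To show why no embeddings are needed at this scale, here is a rough sketch of index-first lookup over a catalog file. The entry format (`- [Title](path.md): summary`) is an assumption made for illustration, since the actual layout of index.md is not specified here, and `parse_index` and `find_pages` are hypothetical helper names.

```python
import re

# Assumed entry format: "- [Title](path.md): one-line summary"
ENTRY = re.compile(r"^- \[(?P<title>[^\]]+)\]\((?P<path>[^)]+)\)(?::\s*(?P<summary>.*))?$")


def parse_index(index_text: str) -> list[dict[str, str]]:
    """Read catalog entries from index.md text, skipping lines that are not entries."""
    entries = []
    for line in index_text.splitlines():
        m = ENTRY.match(line.strip())
        if m:
            entries.append({
                "title": m.group("title"),
                "path": m.group("path"),
                "summary": m.group("summary") or "",
            })
    return entries


def find_pages(entries: list[dict[str, str]], query: str, limit: int = 3) -> list[str]:
    """Rank pages by how many query words appear in their title or summary."""
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    scored = []
    for e in entries:
        words = set(re.findall(r"[a-z0-9]+", f"{e['title']} {e['summary']}".lower()))
        score = len(terms & words)
        if score:
            scored.append((score, e["path"]))
    scored.sort(key=lambda s: -s[0])  # stable sort keeps catalog order among ties
    return [path for _, path in scored[:limit]]


index_md = """# Index
- [RAG](rag.md): retrieval over chunks at query time
- [qmd](qmd.md): local search engine for Markdown
"""
print(find_pages(parse_index(index_md), "local search"))
```

A linear scan like this stays fast for hundreds of entries; it is only when the catalog itself becomes too large to scan or to fit in context that a search engine earns its place.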
Related Pages
- LLM Wiki — the compiled alternative to RAG
- LLM Wiki vs. RAG — detailed comparison
- qmd — on-device hybrid search engine for Markdown, useful when index-first navigation is no longer sufficient
- Ingest / Query / Lint Workflow — the three operations that replace per-query retrieval
- Cairni — a managed service built on the LLM Wiki pattern