Research Deep-Dive

LLM Agents — Research Wiki Overview

Confidence: High · Concept · Edited by: Cairni · just now · AI-generated · v1

What Makes an LLM Agent Actually Work?

The central question driving this research is deceptively simple: *what actually makes an LLM "agent" work in practice, beyond a single prompt?* A language model responding to a one-shot query is impressive, but it is not an agent. An agent must reason across multiple steps, reach beyond its training data to interact with the world, retain information across turns, and recover from mistakes — all without a human intervening at every step. This wiki collects, organizes, and cross-links research notes exploring how that is achieved, where current approaches break down, and what remains genuinely unresolved. The full map of findings lives on LLM Agents — Research Overview; the pages linked throughout this article go deeper on each sub-topic. The unresolved tensions are catalogued on Open Questions. Research — LLM Agents.md


From Single Prompt to Agent: The Conceptual Leap

Before any of the patterns described below make sense, it helps to understand what separates an agent from a standard language model call. A single prompt extracts a response from the model's parametric knowledge — whatever was baked in during training. An agent, by contrast, executes a *loop*: it reasons about what it needs, takes an action (calling a tool, querying a database, running code), observes the result, and then reasons again in light of that observation. This loop can repeat many times before a final answer is produced.
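The loop described above can be sketched in a few lines. This is a minimal illustration, not an implementation taken from the research notes: `run_agent`, `scripted_decide`, and the `lookup` tool are hypothetical names, and the "model" is a scripted function standing in for a real language model call.

```python
from typing import Callable


def run_agent(decide: Callable[[list[str]], tuple[str, str]],
              tools: dict[str, Callable[[str], str]],
              question: str,
              max_steps: int = 5) -> str:
    """Reason, act, observe, and repeat until the policy says 'finish'."""
    history = [f"question: {question}"]
    for _ in range(max_steps):
        action, argument = decide(history)         # reason about what is needed
        if action == "finish":
            return argument                         # final answer
        observation = tools[action](argument)       # take an action
        history.append(f"{action}({argument}) -> {observation}")  # observe
    return "stopped: step budget exhausted"


def scripted_decide(history: list[str]) -> tuple[str, str]:
    # Hypothetical stand-in for a model: look something up once,
    # then answer with whatever the tool returned.
    if len(history) == 1:
        return ("lookup", "capital of France")
    return ("finish", history[-1].split(" -> ")[1])


# Hypothetical tool backed by a tiny in-memory table.
tools = {"lookup": lambda query: {"capital of France": "Paris"}.get(query, "unknown")}

answer = run_agent(scripted_decide, tools, "What is the capital of France?")
print(answer)
```

The `max_steps` budget is an assumption added for safety in the sketch; it caps how many times the loop can repeat before giving up.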

The intellectual foundation for this loop is Chain-of-Thought Reasoning: the technique of prompting a model to emit explicit intermediate reasoning steps before a final answer. Chain-of-thought (CoT) was among the first demonstrations that large models could perform multi-step problem solving within a single model call. It is the direct precursor to everything else in this wiki: every more sophisticated agent pattern either extends CoT, compensates for its limitations, or builds new structure around it. The key limitation CoT leaves unaddressed is that its reasoning chain is entirely internal. The model has no mechanism to check intermediate conclusions against the real world, which opens the door to confident-sounding hallucinations. Research — LLM Agents.md
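As a rough illustration of what "explicit intermediate reasoning steps before a final answer" looks like in practice, the sketch below builds a CoT-style prompt and separates the steps from the answer. The prompt wording, the `Answer:` convention, and both function names are assumptions made for this example; the response is hand-written, not real model output.

```python
def build_cot_prompt(question: str) -> str:
    """Ask for visible reasoning steps followed by a clearly marked answer."""
    return (
        f"Question: {question}\n"
        "Work through the problem step by step, one step per line.\n"
        "Then give the result on a final line starting with 'Answer:'."
    )


def split_reasoning(response: str) -> tuple[list[str], str]:
    """Separate the intermediate steps from the final answer line."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    steps = [line for line in lines if not line.startswith("Answer:")]
    answers = [line[len("Answer:"):].strip() for line in lines if line.startswith("Answer:")]
    return steps, (answers[-1] if answers else "")


# Hand-written response standing in for what a model might return.
sample_response = """Step 1: 3 boxes hold 4 pens each, so 3 * 4 = 12 pens.
Step 2: 2 pens are removed, so 12 - 2 = 10 pens.
Answer: 10"""

steps, final = split_reasoning(sample_response)
print(len(steps), final)
```

Nothing in this flow checks the steps against anything outside the response itself, which is the isolation problem the paragraph describes.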


Grounding Reasoning in Reality: The ReAct Pattern

The most prominent pattern to emerge from the literature for addressing CoT's isolation problem is the ReAct Pattern. ReAct (Reason + Act) interleaves thinking steps with actual tool calls: the model reasons about what it needs, calls a tool, reads the observation returned, and then continues reasoning — in a repeating cycle until the goal is reached. This grounds each reasoning step in real-world feedback rather than pure parametric inference, meaning errors introduced by hallucination or stale training data can, in principle, be caught and corrected mid-run.

The difference from pure CoT is structural. CoT produces a chain of thoughts; ReAct produces a chain of (thought → action → observation) triples. Each observation is a reality check. This tight coupling between inference and the external world is what gives ReAct much of its practical value, and it is also why Tool Use & Function Calling is so central to the overall agent picture — without reliable tools, the grounding that makes ReAct valuable simply collapses.
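The triple structure can be made concrete with a small data type. In this sketch the `Step` class, the `render_trace` function, and the `calculator` action strings are all hypothetical, and the trace is scripted by hand rather than produced by a model.

```python
from dataclasses import dataclass


@dataclass
class Step:
    thought: str       # what the model reasons it needs next
    action: str        # the tool call it chooses
    observation: str   # what the tool returned (the reality check)


def render_trace(question: str, steps: list[Step]) -> str:
    """Flatten the triples into the text the model would see on its next turn."""
    lines = [f"Question: {question}"]
    for step in steps:
        lines.append(f"Thought: {step.thought}")
        lines.append(f"Action: {step.action}")
        lines.append(f"Observation: {step.observation}")
    return "\n".join(lines)


# Scripted example trace: every observation comes from a tool, not from the model.
trace = [
    Step("I should multiply first.", "calculator('17 * 23')", "391"),
    Step("Now add 9 to the product.", "calculator('391 + 9')", "400"),
]

prompt = render_trace("What is 17 * 23 plus 9?", trace)
print(prompt)
```

A pure CoT trace would contain only the `Thought:` lines; the `Observation:` lines are the part that depends on reliable tools.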

ReAct is not without failure modes, however. The research notes specifically flag the looping problem: the model can get stuck repeating a failing action rather than backtracking or trying a different approach. This is one of the Open Questions left unresolved in the current notes — no mitigation strategy is documented yet. Research — LLM Agents.md
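To make the looping failure concrete, the sketch below only detects the symptom: the same action issued several times in a row. It is not a mitigation drawn from the research notes, which record none; the `repeated_tail` name and the default threshold of three are assumptions chosen for illustration.

```python
def repeated_tail(actions: list[str], threshold: int = 3) -> bool:
    """Return True when the last `threshold` actions are all identical."""
    if len(actions) < threshold:
        return False
    tail = actions[-threshold:]
    return all(action == tail[0] for action in tail)


# A run that keeps retrying the same failing call.
stuck_run = ["search('q')", "open('page')", "open('page')", "open('page')"]
# A run that keeps varying its actions.
healthy_run = ["search('q')", "open('page')", "search('q2')"]

print(repeated_tail(stuck_run), repeated_tail(healthy_run))
```

A check like this says nothing about what the agent should do once a loop is detected, which is the part the notes leave open.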


Extending the Agent Beyond Training Data: Tool Use

Tool Use & Function Calling is the mechanism that lets an agent interact with the world rather than merely describe it. By exposing external functions — web search, code execution, database queries, third-party APIs — the agent can retrieve live information, perform computation, and trigger real-world effects that no amount of parametric knowledge could substitute for.

The core challenge is reliability, and it manifests in two distinct ways. First, *tool selection*: given a set of available tools, the model must pick the right one. Research notes flag a clear degradation effect — the more tools that are available, the worse the model's selection accuracy becomes. Second, *argument formation*: even when the correct tool is chosen, the model may pass malformed arguments, causing calls to fail silently or return garbage. One proposed mitigation is structured or forced tool schemas, which constrain the output format the model must produce and are reported to cut argument errors sharply. The precise threshold at which tool-set size begins to hurt selection accuracy is, however, not yet characterized — another entry on Open Questions. Research — LLM Agents.md
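The argument-formation problem can be illustrated with a small check that compares a proposed tool call against a declared schema before executing it. Note that this is validation after generation, a simpler relative of the forced schemas mentioned above, which constrain the model's output as it is produced. The schema format (argument name mapped to a Python type), `WEATHER_SCHEMA`, and `validate_arguments` are all hypothetical.

```python
# Hypothetical schema format: argument name -> required Python type.
WEATHER_SCHEMA: dict[str, type] = {"city": str, "days": int}


def validate_arguments(schema: dict[str, type], arguments: dict[str, object]) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    errors = []
    for name, expected in schema.items():
        if name not in arguments:
            errors.append(f"missing argument: {name}")
        elif not isinstance(arguments[name], expected):
            actual = type(arguments[name]).__name__
            errors.append(f"{name} should be {expected.__name__}, got {actual}")
    for name in arguments:
        if name not in schema:
            errors.append(f"unexpected argument: {name}")
    return errors


good_call = {"city": "Oslo", "days": 3}
bad_call = {"city": "Oslo", "days": "3"}   # number passed as a string

print(validate_arguments(WEATHER_SCHEMA, good_call))
print(validate_arguments(WEATHER_SCHEMA, bad_call))
```

Rejecting a malformed call with an explicit error message turns a silent failure into an observation the agent can react to.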


Remembering Across Turns: The Memory Problem

If tool use is the hardest *engineering* problem in building agents, Agent Memory is widely considered the hardest *conceptual* problem. The challenge is fundamental: a language model's context window is finite, but real tasks accumulate information across many turns.

The research notes draw a clean two-tier distinction. Short-term memory is the active context window — fast, exact, but strictly size-limited. Everything the agent can directly "see" during a turn lives here. Long-term memory is information persisted outside the context and retrieved on demand, and this is where the real design complexity lives.
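The size limit on short-term memory can be sketched as a bounded buffer that evicts the oldest turns once a budget is exceeded. The `ContextWindow` class is hypothetical, and counting tokens by splitting on whitespace is a deliberate simplification; real tokenizers count differently.

```python
from collections import deque


class ContextWindow:
    """Keeps the most recent turns that fit inside a fixed token budget."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.turns: deque[str] = deque()

    @staticmethod
    def count_tokens(text: str) -> int:
        # Assumption: one whitespace-separated word counts as one token.
        return len(text.split())

    def add(self, turn: str) -> list[str]:
        """Append a turn and return whatever had to be evicted to make room."""
        self.turns.append(turn)
        evicted = []
        while (sum(self.count_tokens(t) for t in self.turns) > self.max_tokens
               and len(self.turns) > 1):
            evicted.append(self.turns.popleft())   # oldest turn goes first
        return evicted


window = ContextWindow(max_tokens=5)
first_evicted = window.add("a b c")
second_evicted = window.add("d e f")
print(first_evicted, second_evicted, list(window.turns))
```

The evicted turns are exactly the information that a long-term store would need to capture if it is to remain available later.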

In practice, long-term agent memory almost universally reduces to Retrieval-Augmented Generation (RAG): relevant chunks from an external store are retrieved via vector similarity search, optionally summarized or reranked, and injected into the context window before each model call. The research notes make a pointed observation about this: most systems marketed as having "agent memory" are, when examined closely, just a retrieval + summarization pipeline. The apparent sophistication of recall is often reducible to a well-tuned RAG stack rather than any deeper form of persistent understanding.
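The retrieve-then-inject pipeline can be sketched with toy data. The three-dimensional vectors below are hand-picked stand-ins for real embeddings, and `STORE`, `retrieve`, and `build_prompt` are hypothetical names; a real system would use an embedding model and a vector database, and might add the summarization or reranking step mentioned above.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy external store: (chunk text, hand-picked vector standing in for an embedding).
STORE = [
    ("The user prefers metric units.", [0.9, 0.1, 0.0]),
    ("The deploy script lives in tools/deploy.sh.", [0.0, 0.2, 0.9]),
    ("The user's timezone is UTC+1.", [0.8, 0.3, 0.1]),
]


def retrieve(query_vector: list[float], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query."""
    ranked = sorted(STORE, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, query_vector: list[float]) -> str:
    """Inject the retrieved chunks into the context ahead of the question."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(query_vector))
    return f"Relevant notes:\n{context}\n\nQuestion: {question}"


print(build_prompt("How should I format distances for this user?", [1.0, 0.2, 0.0]))
```

Everything the agent "remembers" here is whatever the similarity ranking happens to surface, which is why the notes describe such systems as retrieval pipelines rather than deeper persistent understanding.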

The Embeddings Contradiction

The most significant unresolved tension in the entire notebook sits here. The research notes explicitly flag a direct conflict between two positions on RAG-based memory:

  • Position A: Naive vector search misses temporal and structural relationships — the order in which events happened, and hierarchical dependencies between facts, are invisible to a flat embedding lookup that treats all stored content as equally related and equally recent. On this view, retrieval + summarization is necessary but not sufficient.
  • Position B: A cited paper argues that embeddings alone are sufficient for practical agent memory.

These two claims are genuinely contradictory and remain unresolved. The full conflict is documented on Agent Memory, and it is flagged as an active open question on Open Questions. Any practitioner building long-term memory into an agent system should treat this as an unresolved design decision rather than settled practice. Research — LLM Agents.md
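The scenario Position A worries about can be illustrated with a contrived example. The vectors and timestamps below are hand-picked to trigger the problem, so the sketch shows what the claim means rather than providing evidence for it, and it does not settle the contradiction. All names, including the `min_similarity` cutoff, are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Two memories about the same fact, written at different times.
MEMORIES = [
    {"text": "User lives in Berlin.", "vector": [0.7, 0.7], "written_at": 1},
    {"text": "User moved to Lisbon.", "vector": [0.6, 0.8], "written_at": 2},
]


def top_by_similarity(query: list[float]) -> str:
    """Flat lookup: the timestamp plays no part in the ranking."""
    return max(MEMORIES, key=lambda m: cosine(query, m["vector"]))["text"]


def top_by_recency_among_similar(query: list[float], min_similarity: float = 0.9) -> str:
    """Same candidates, but the most recently written one wins."""
    candidates = [m for m in MEMORIES if cosine(query, m["vector"]) >= min_similarity]
    return max(candidates, key=lambda m: m["written_at"])["text"]


query = [0.7, 0.7]   # stand-in vector for "where does the user live?"
print(top_by_similarity(query))
print(top_by_recency_among_similar(query))
```

Whether real embeddings and real workloads behave like this contrived case is precisely the point on which the two positions disagree.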


How Agents Are Structured: Planner vs. Reactive

Beneath the specific patterns above lies a more fundamental architectural question: *how should an agent sequence its reasoning and actions overall?* The research notes identify two broad camps, compared in depth on Planner vs Reactive Agent Architectures.

Explicit planners decompose a goal into sub-tasks upfront, before any tools are called or observations are made. The resulting plan is produced in one step and then executed sequentially. This approach is more predictable and auditable — you can inspect the plan before execution begins — and it tends to work well for well-defined, stable tasks. Its critical weakness is brittleness: if an early step produces unexpected output, the rest of the plan may be invalidated, and the agent may not recover gracefully.
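The plan-then-execute structure, and its brittleness, can be sketched as follows. The three-step plan, the `state` dictionary, and the function names are hypothetical; in a real planner the plan would come from a model rather than being hard-coded.

```python
from typing import Callable

# A plan is an ordered list of named steps, each transforming a state dict.
PlanStep = tuple[str, Callable[[dict], dict]]


def make_plan() -> list[PlanStep]:
    """Produce the whole plan upfront, before anything is executed."""
    return [
        ("fetch", lambda state: {**state, "rows": state["source"].get("rows")}),
        ("total", lambda state: {**state, "total": sum(state["rows"])}),
        ("report", lambda state: {**state, "report": f"total={state['total']}"}),
    ]


def execute(plan: list[PlanStep], state: dict) -> dict:
    """Run the steps in order; there is no replanning if one of them breaks."""
    for name, step in plan:
        try:
            state = step(state)
        except Exception as error:
            return {**state, "failed_at": name, "error": type(error).__name__}
    return state


plan = make_plan()
print([name for name, _ in plan])   # the plan can be inspected before it runs

ok = execute(plan, {"source": {"rows": [1, 2, 3]}})
broken = execute(plan, {"source": {}})   # "fetch" unexpectedly yields no rows
print(ok["report"], broken["failed_at"], broken["error"])
```

In the second run the "fetch" step succeeds but returns something the plan did not anticipate, and the later steps have no way to adapt.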

Reactive loops, typified by the ReAct Pattern, do not plan ahead at all. They interleave reasoning and action at each step, adapting dynamically to whatever the environment returns. This makes them much more resilient to unexpected mid-task changes. The corresponding weakness is that without structured goal decomposition, a reactive agent can wander: it may lose track of the original objective or cycle through repetitive actions without converging.

| Dimension | Explicit Planner | Reactive Loop |
| --- | --- | --- |
| Structure | Decompose, then execute | Interleave reasoning and action at each step |
| Predictability | High | Low |
| Adaptability | Low (brittle to mid-task changes) | High (adjusts to new observations) |
| Primary failure mode | Plan invalidated by unexpected early output | Wandering or looping without convergence |
| Best suited for | Well-defined, stable tasks | Open-ended, dynamic tasks |

The research notes do not resolve which architecture to prefer — and this is explicitly listed on Open Questions as the most important unresolved design question. The answer likely depends heavily on task structure and environment stability, but no principled guidance for making that call is yet documented. Research — LLM Agents.md


How the Patterns Relate to Each Other

It is easy to treat these sub-topics as independent, but they form an interconnected system. Chain-of-Thought Reasoning is the reasoning primitive that everything else builds on. ReAct Pattern extends CoT by anchoring reasoning steps to real observations returned by Tool Use & Function Calling. Retrieval-Augmented Generation (RAG) is what makes Agent Memory scale beyond the context window — it feeds long-term knowledge back into the short-term context that the ReAct loop operates over. And the Planner vs Reactive Agent Architectures debate is the meta-level question about how to orchestrate all of the above.


Open Questions and What to Resolve Next

The research notes are honest about what they do not settle. Open Questions collects these systematically, but the four most consequential gaps are worth naming directly here:

  1. Planner vs. reactive: when does each win? No principled guidance exists in the current notes for choosing between the two architectures given a concrete task.
  2. Embeddings alone vs. retrieval + summarization for memory. An active contradiction between sources that any serious memory implementation must confront.
  3. Preventing ReAct loops. The looping failure mode is documented but no mitigation strategy is recorded.
  4. Tool-set size and selection accuracy. The degradation effect is noted, but the threshold and tradeoff curve are uncharacterized.

These gaps define the frontier of what this research collection has not yet resolved. They are the natural starting points for the next round of reading. Research — LLM Agents.md


Navigating This Wiki

This page is the hub. Each link below leads to a dedicated page that goes deeper on one sub-topic, concept, or cluster of findings: