Agentic design patterns: the spectrum from a single prompt to a multi-agent system
Agentic design patterns are the reusable ways to arrange model calls, retrieval, tools, and control logic around an LLM so it behaves like a production system. Andrew Ng grouped the agentic ones into four: reflection, tool use, planning, and multi-agent collaboration. In practice they sit on a wider spectrum, from a single model call to a coordinating crew, where each step up buys more autonomy at a higher cost. The skill is picking the simplest pattern that clears your eval, and moving right only when the task forces it.

The short version
- AI architecture patterns are reusable ways to arrange model calls, retrieval, tools, and control logic around an LLM. They escalate by autonomy and cost: direct prompt to RAG to tool use to single agent to multi-agent.
- The load-bearing distinction is workflows vs agents. A workflow orchestrates LLMs and tools along predefined code paths; an agent lets the model dynamically direct its own steps and tool use (Anthropic, 2024).
- Anthropic’s headline guidance: find the simplest solution and add complexity only when it demonstrably improves outcomes. The most successful implementations used simple, composable patterns, not heavy frameworks.
- Multi-agent is the most powerful pattern and the most expensive. Anthropic’s research system beat a single-agent baseline by 90.2% but burned about 15x the tokens of a normal chat, so gate it to high-value work.
- Cut cost and latency with prompt caching, routing to smaller models, parallelizing independent calls, and capping agent loops. On OpenAI, caching is automatic for prompts over 1,024 tokens and takes up to about 90% off cached input; on Anthropic, it cuts cost by up to about 90% on long, reused prompts.
What agentic design patterns are, and why they matter
Agentic design patterns are reusable, composable ways of arranging LLM calls, tools, data, and control logic around a model to turn a raw text generator into a reliable production system. They are the design-patterns layer of AI engineering, the equivalent of established software patterns but for orchestrating models, retrieval, tools, and feedback loops. Andrew Ng's widely cited framing names four agentic patterns (reflection, tool use, planning, and multi-agent collaboration) that an agent layers as needed [11]. They sit inside a wider spectrum of rising autonomy and cost that runs from a direct prompt through RAG and tool use to single and multi-agent systems, and the engineering skill is moving right only when the task demands it.
The single most important distinction underneath all of them is workflows versus agents. Anthropic defines a workflow as a system where LLMs and tools are orchestrated through predefined code paths, and an agent as a system where the LLM dynamically directs its own processes and tool usage, keeping control over how it accomplishes a task [1]. A workflow is a path you draw; an agent is a path the model draws at runtime. Almost every choice on this page reduces to deciding which of those two you actually need.
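A minimal sketch can make the workflow-versus-agent split concrete. Everything named here is an assumption for illustration: `LLM` is a hypothetical stand-in for whatever model call you use, and the `tool: argument` / `DONE: answer` reply format is invented for the example, not any provider's protocol.

```python
from typing import Callable

# Hypothetical stand-in for a model call: prompt text in, text out.
LLM = Callable[[str], str]


def run_workflow(llm: LLM, ticket: str) -> str:
    """Workflow: the code fixes the path (summarize, then draft a reply)."""
    summary = llm(f"Summarize: {ticket}")
    return llm(f"Draft a reply to: {summary}")


def run_agent(llm: LLM, tools: dict[str, Callable[[str], str]],
              task: str, max_steps: int = 5) -> str:
    """Agent: the model picks each next step at runtime, within a step cap."""
    context = task
    for _ in range(max_steps):
        # Assumed reply format: "tool_name: argument" or "DONE: final answer".
        decision = llm(context)
        name, _, arg = decision.partition(": ")
        if name == "DONE":
            return arg
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name!r}"
        # Feed the observation back so the next decision can use it.
        context += f"\n{decision}\n-> {observation}"
    return "stopped: step budget exhausted"
```

In the first function the number and order of model calls is known before it runs; in the second, only the cap is.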
"The most successful implementations weren’t using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns."
That quote is the spine of the whole topic. Anthropic’s headline guidance is to find the simplest solution possible and increase complexity only when it demonstrably improves outcomes, which sometimes means not building an agent at all [1]. Patterns matter because the right choice is usually the simpler one, and the gap between a system that ships and one that quietly burns tokens is almost always a pattern picked one notch too far up the spectrum.
The core patterns, and when to use each
The core AI architecture patterns layer onto one atomic unit, the augmented LLM, and escalate from there. In rough order of rising autonomy: direct prompt and prompt chaining, RAG, tool use, routing, parallelization, fine-tuning, the evaluator-optimizer loop, single agents, and multi-agent orchestration, wrapped by cross-cutting layers for guardrails, caching, and resilience. Each adds capability at a cost, so you climb the ladder one rung at a time.
The atomic unit is the augmented LLM: a model given retrieval, tools, and memory that it decides when to invoke. Anthropic advises investing in tailoring those capabilities to your use case and giving the model a clear, well-documented interface to them [1]. Everything below is a way of arranging augmented LLMs.
It helps to see how Ng's four agentic design patterns line up with the spectrum below. Reflection is the evaluator-optimizer loop, where the model critiques and revises its own output. Tool use is function calling, the bridge from answering to acting. Planning is what a single agent does when it decides its own steps at runtime rather than following a fixed path. Multi-agent collaboration is the orchestrator-workers shape, a lead model delegating to specialists. The other rungs here, prompt chaining, RAG, routing, and parallelization, are the workflow scaffolding those agentic patterns plug into. Naming them this way matters because the most common mistake is reaching for planning or multi-agent collaboration when reflection or plain tool use would have shipped.
Direct prompt and prompt chaining
A single well-crafted model call, with no retrieval or tools, is the default and covers most production AI features. Prompt chaining is the workflow extension: decompose the task into fixed sequential steps, each consuming the prior output, with optional programmatic gates between them. Anthropic frames it as ideal when a task cleanly decomposes into fixed subtasks, trading latency for higher accuracy [1].
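As a rough sketch of prompt chaining with a programmatic gate, assuming a hypothetical `llm` callable and an invented outline-then-article task (neither comes from the cited sources):

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical model call: prompt in, text out


def chain_with_gate(llm: LLM, topic: str, max_outline_lines: int = 8) -> str:
    """Two fixed steps with a code-only check between them."""
    outline = llm(f"Write an outline for an article on: {topic}")

    # Programmatic gate: plain code, no model call. Reject empty or
    # runaway outlines before spending tokens on the second step.
    lines = [ln for ln in outline.splitlines() if ln.strip()]
    if not lines or len(lines) > max_outline_lines:
        raise ValueError("outline failed the gate")

    return llm("Write the article from this outline:\n" + "\n".join(lines))
```

The gate is where the accuracy gain comes from: a bad intermediate result stops the chain instead of flowing into the next prompt.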
RAG and tool use
RAG, retrieval-augmented generation, pulls relevant documents from an external store at query time and injects them into the prompt so the model answers grounded in your data. Microsoft’s framing is that you update the knowledge base by adding a document to the index, with no model change, and that the pattern naturally produces attributable, citable answers [5]. AWS treats it as the standard grounding pattern for current or proprietary knowledge [6]. When the retrieval step itself needs reasoning, it becomes agentic RAG, where a retrieval agent chooses among retrievers and plans multi-step lookups.
Tool use, or function calling, gives the model a schema of callable functions and lets it decide when to call them and with what arguments; the runtime executes and returns the result [2]. It is the bridge from answering to acting. Anthropic’s tool-design principle is worth printing on the wall: invest in the agent-computer interface as much as you would a human one, give the model enough tokens to think, and document tools thoroughly [1].
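Both halves can be sketched in a few lines. This is a toy under stated assumptions: the word-overlap `retrieve` stands in for a real vector or keyword index, and the `{"name": ..., "arguments": ...}` JSON shape is a simplified illustration of a tool call, not any provider's exact wire format.

```python
import json
from typing import Callable


def retrieve(query: str, docs: dict[str, str], k: int = 2) -> list[tuple[str, str]]:
    """Toy retriever: rank documents by words shared with the query."""
    q = set(query.lower().split())
    ranked = sorted(docs.items(),
                    key=lambda kv: len(q & set(kv[1].lower().split())),
                    reverse=True)
    return ranked[:k]


def grounded_prompt(query: str, docs: dict[str, str]) -> str:
    """RAG: inject the retrieved text, tagged with ids the answer can cite."""
    sources = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieve(query, docs))
    return (f"Answer using only these sources and cite their ids.\n"
            f"{sources}\n\nQuestion: {query}")


def dispatch(tool_call_json: str, tools: dict[str, Callable[..., object]]) -> str:
    """Tool use, runtime side: parse the model's call, run it, return text."""
    call = json.loads(tool_call_json)
    fn = tools.get(call["name"])
    if fn is None:
        return json.dumps({"error": f"unknown tool {call['name']}"})
    return json.dumps({"result": fn(**call["arguments"])})
```

Updating knowledge means adding an entry to `docs`; adding a capability means adding an entry to `tools`. Neither touches the model.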
Routing and parallelization
Routing classifies the input and sends it down a specialized path, prompt, or model. Anthropic notes routing works well when distinct categories are better handled separately, and its canonical example sends simple queries to a cheap, fast model and hard ones to a more capable model [1]. That cost-tiering idea, cheapest model first and escalate only when quality is insufficient, is what we call a cascade. The term is our own framing; the citable primitive is Anthropic’s routing.
Parallelization runs several calls at once and aggregates them, in two flavors: sectioning splits independent subtasks to run concurrently, and voting runs the same task several times for confidence [1].
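The three ideas fit in one short sketch. The `classify`, `good_enough`, and model callables are hypothetical placeholders you would supply; the thread pool is one reasonable way to run independent calls concurrently, not the only one.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

LLM = Callable[[str], str]  # hypothetical model call


def route(query: str, classify: Callable[[str], str], models: dict[str, LLM]) -> str:
    """Routing: classify first, then send to the matching path."""
    label = classify(query)
    # Unknown labels fall through to the capable model.
    return models.get(label, models["hard"])(query)


def cascade(query: str, cheap: LLM, strong: LLM,
            good_enough: Callable[[str], bool]) -> str:
    """Cascade: cheapest model first, escalate only if the draft fails a check."""
    draft = cheap(query)
    return draft if good_enough(draft) else strong(query)


def vote(query: str, llm: LLM, n: int = 3) -> str:
    """Voting: run the same task n times in parallel, keep the majority answer."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(llm, [query] * n))
    return Counter(answers).most_common(1)[0][0]
```

Sectioning uses the same `pool.map` call as `vote`, but over a list of different subtasks instead of `n` copies of one.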
Fine-tuning and the evaluator-optimizer loop
Fine-tuning bakes behavior into the weights through training, so a format, tone, or narrow skill holds without runtime prompting. The durable rule across Microsoft and AWS guidance is to use RAG for knowledge that changes and fine-tuning for behavior that is stable; many systems do both, and the full tradeoff is in our guide on retrieval patterns.
The evaluator-optimizer loop pairs a generator LLM with a second LLM that scores the output against criteria and returns feedback, iterating until it is good enough. Anthropic recommends it when evaluation criteria are clear and iterative refinement provides measurable value, as in literary translation or multi-round search [1].
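A bare-bones version of the evaluator-optimizer loop might look like the following. The `generate` and `evaluate` callables are hypothetical wrappers around two model calls, and the 0-to-1 score with a 0.8 threshold is an assumed convention for the sketch.

```python
from typing import Callable

Generate = Callable[[str, str], str]                # (task, feedback) -> draft
Evaluate = Callable[[str, str], tuple[float, str]]  # (task, draft) -> (score, feedback)


def evaluator_optimizer(generate: Generate, evaluate: Evaluate, task: str,
                        threshold: float = 0.8, max_rounds: int = 3) -> tuple[str, float]:
    """Generate, score, feed the critique back; stop at the threshold or the cap."""
    feedback = ""
    best_draft, best_score = "", -1.0
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        score, feedback = evaluate(task, draft)
        # Keep the best draft seen so far in case a later round regresses.
        if score > best_score:
            best_draft, best_score = draft, score
        if score >= threshold:
            break
    return best_draft, best_score
```

The round cap matters as much as the threshold: without it, an evaluator that never passes anything turns the loop into an open-ended bill.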
Single agent and multi-agent
A single agent plans and operates in a loop, using tools based on environmental feedback and running until it judges the task done or hits a checkpoint. Anthropic positions it for open-ended problems where you cannot predict the number of steps or hardcode a fixed path, with the caveat that it carries higher cost and a risk of compounding errors, so it needs sandboxing and guardrails [1].
A multi-agent system, in the orchestrator-workers shape, uses a lead LLM to decompose a task at runtime, delegate subtasks to worker LLMs in parallel, and synthesize the result. The deeper treatment is in our guide on multi-agent systems. It is the most powerful pattern and the most expensive, which the table below makes concrete.
| Metric | Multi-agent result |
|---|---|
| Quality lift over single-agent baseline (internal research eval) | +90.2% |
| Token usage vs a normal chat interaction | about 15x |
| Share of performance variance explained by token usage | about 80% |
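The orchestrator-workers shape behind those numbers can be sketched as follows. The `lead` and `worker` callables are hypothetical model wrappers, and the one-subtask-per-line plan format is an assumption made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

LLM = Callable[[str], str]  # hypothetical model call


def orchestrate(lead: LLM, worker: LLM, task: str, max_workers: int = 4) -> str:
    """Lead model plans at runtime, workers run in parallel, lead synthesizes."""
    plan = lead(f"List the subtasks, one per line, for: {task}")
    # Cap the fan-out so one plan cannot spawn an unbounded number of workers.
    subtasks = [s.strip() for s in plan.splitlines() if s.strip()][:max_workers]

    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        findings = list(pool.map(worker, subtasks))

    notes = "\n".join(f"- {s}: {f}" for s, f in zip(subtasks, findings))
    return lead(f"Synthesize an answer to '{task}' from:\n{notes}")
```

Each worker is a full model call with its own context, so the fan-out is also where the token bill grows.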
The cost caveat is as important as the headline. In the same study, token usage alone explained about 80% of the performance variance, with the number of tool calls and the model choice as the other two explanatory factors (the three together reaching roughly 95%), and the multi-agent system used roughly 15x the tokens of a normal chat [3]. The lesson is blunt: multi-agent only pays off for high-value tasks that can absorb the token multiplier. The market is already learning it the hard way: Gartner predicts more than 40% of agentic AI projects will be canceled by the end of 2027 because of escalating cost, unclear value, and weak risk controls, which is the same failure the simplest-thing-first rule exists to prevent [12].
Around all of these patterns sit three cross-cutting layers. Guardrails inspect inputs for injection, PII, and toxicity and outputs for policy and leakage, using products like NVIDIA NeMo Guardrails, OpenAI’s Moderation API, or Azure AI Content Safety [7]. Human-in-the-loop review inserts approval at high-risk or irreversible steps. Caching and resilience keep cost and reliability in check, which the production section below covers.
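To show where the guardrail and human-in-the-loop layers sit in the call path, here is a deliberately naive sketch. The regex and phrase list are illustrative placeholders, nowhere near real PII or injection detection, and the function names are hypothetical; a production system would call one of the guardrail products named above at the same two points.

```python
import re
from typing import Callable

# Illustrative only: real detectors are far more thorough than these.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
INJECTION_HINTS = ("ignore previous instructions", "reveal your system prompt")


def screen_input(text: str) -> list[str]:
    """Input guardrail: return a list of issues found before the model sees the text."""
    issues = []
    if EMAIL.search(text):
        issues.append("pii:email")
    lowered = text.lower()
    if any(hint in lowered for hint in INJECTION_HINTS):
        issues.append("possible_injection")
    return issues


def guarded_tool_call(name: str, run: Callable[[], str], irreversible: set[str],
                      approve: Callable[[str], bool]) -> str:
    """Human-in-the-loop gate: irreversible tools run only after approval."""
    if name in irreversible and not approve(name):
        return "blocked: awaiting human approval"
    return run()
```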
How to choose a pattern: a decision framework
Choose by what the task needs and escalate only when forced, which mirrors Anthropic’s simplest-thing-first guidance. Three questions settle most cases. One, does the model already know this? If not, reach for RAG or tools. Two, can you predict the steps? If yes, build a workflow; if no, build an agent. Three, does the added autonomy demonstrably beat the simpler version on your eval? If it does not, do not ship it.
The table below maps a need to the pattern that meets it. Read it top to bottom as the same escalation ladder: knowledge and action sit low, reliability loops in the middle, full agency at the top, and the cross-cutting layers apply throughout. Building this decision into a real product is the work our AI application development team does, where the pattern gets chosen against an eval before any prompt is written.
| If you need | Reach for | Autonomy |
|---|---|---|
| Fresh or proprietary knowledge, with citations | RAG (agentic RAG if retrieval needs reasoning) | None |
| To take actions, hit APIs, run code | Tool use / function calling | Low |
| Consistent domain behavior, tone, or format | Fine-tuning, often with RAG | None |
| Different inputs down different paths | Routing; cascade when the goal is cost | Low |
| Reliability and quality on hard outputs | Evaluator-optimizer, voting, human-in-the-loop | Low |
| Open-ended, unpredictable step count | Single agent, then multi-agent only if one stalls | High |
| Speed on independent subtasks | Parallelization (sectioning) | Low |
| Lower cost and latency | Caching, routing to small models | n/a |
| Safety, compliance, abuse resistance | Guardrails layer plus human-in-the-loop | n/a |
| Production resilience | Fallback plus circuit breaker | n/a |
The order matters more than any single row. Prompt and few-shot come first because they are cheapest, then RAG when the gap is knowledge, then tools when the model needs to act, then a workflow when the steps are predictable, then an agent when they are not. You earn each rung by proving on an eval that it beats the rung below.
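The climb can be written down as a small selection rule. The ladder names mirror the order above; the scores and the 0.8 target in the usage line are made-up numbers for illustration, and `eval_scores` is assumed to come from your own eval set.

```python
from typing import Optional

# Cheapest rung first, matching the escalation order in the text.
LADDER = ["direct prompt", "RAG", "tool use", "workflow", "agent"]


def simplest_passing(eval_scores: dict[str, float], target: float) -> Optional[str]:
    """Return the lowest rung whose eval score clears the target, if any."""
    for rung in LADDER:
        score = eval_scores.get(rung)  # rungs you have not evaluated are skipped
        if score is not None and score >= target:
            return rung
    return None


# Hypothetical scores: RAG clears the bar, so the agent is never considered.
choice = simplest_passing({"direct prompt": 0.62, "RAG": 0.81, "agent": 0.90}, target=0.8)
```

A higher score further up the ladder does not win by itself; it only matters when nothing below it clears the target.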
Production concerns: evals, cost, latency, reliability
Four concerns decide whether a pattern survives contact with production: evaluation, cost, latency, and reliability. Build a task-specific eval set before you pick a pattern, because you cannot tell whether added complexity helps without one. Control cost and latency with caching and routing. Hold reliability with validation gates, retries, fallbacks, and a circuit breaker. None of these is optional once real traffic arrives.
Start with evals, because the whole decision framework rests on them. Anthropic recommends optimizing a simple prompt with comprehensive evaluation before adding any agentic steps; you cannot manage what you do not measure [1]. Observability is the runtime half: trace every model and tool call, and track tokens, latency, cache-hit rate, and cost per feature. Since token usage explains about 80% of multi-agent performance variance, token telemetry is the highest-signal metric you have, and the vendor-neutral standard for emitting it is the OpenTelemetry GenAI semantic conventions [8].
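As a sketch of the per-feature rollup described above, the following aggregates call records into the four tracked numbers. The `CallRecord` fields and the price parameters are assumptions for illustration (they are not the OpenTelemetry attribute names), and the cost line ignores any cached-token discount to stay simple.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    feature: str
    input_tokens: int
    cached_input_tokens: int  # portion of input_tokens served from cache
    output_tokens: int
    latency_ms: float


def summarize(records: list[CallRecord], usd_per_1k_input: float,
              usd_per_1k_output: float) -> dict[str, dict[str, float]]:
    """Roll up tokens, cache-hit rate, latency, and cost per feature."""
    totals = defaultdict(lambda: {"calls": 0, "input": 0, "cached": 0,
                                  "output": 0, "latency_ms": 0.0})
    for r in records:
        t = totals[r.feature]
        t["calls"] += 1
        t["input"] += r.input_tokens
        t["cached"] += r.cached_input_tokens
        t["output"] += r.output_tokens
        t["latency_ms"] += r.latency_ms

    report = {}
    for feature, t in totals.items():
        report[feature] = {
            "tokens": t["input"] + t["output"],
            "cache_hit_rate": t["cached"] / t["input"] if t["input"] else 0.0,
            "avg_latency_ms": t["latency_ms"] / t["calls"],
            # Simplification: prices every input token at the uncached rate.
            "cost_usd": (t["input"] / 1000 * usd_per_1k_input
                         + t["output"] / 1000 * usd_per_1k_output),
        }
    return report
```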
Cost and latency move together. Prompt caching is the safest lever: OpenAI's prompt-caching guide reports automatic caching for prompts over 1,024 tokens with no code change, cutting input token cost up to about 90% and latency up to about 80% [4], and Anthropic reports prompt caching can cut cost up to about 90% and latency up to about 85% on long, reused prompts [9]. The engineering tip that makes caching pay is to keep the shared prefix stable and put dynamic content, like retrieved chunks and chat history, last, so more of the prompt stays cache-eligible. Beyond caching, route trivial queries to small models, parallelize independent calls, stream responses, and set per-run token and cycle budgets.
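The prefix-ordering tip is easy to see by comparing two ways of assembling the same prompt. The `SYSTEM` and `POLICY` strings are invented placeholders, and the character-level prefix measure is only a proxy: providers match on tokens, not characters.

```python
import os

# Hypothetical static content shared by every request.
SYSTEM = "You are a support assistant for Acme. Follow the policy below.\n"
POLICY = "Policy: refunds within 30 days; escalate billing disputes.\n"


def cache_friendly(chunks: list[str], question: str) -> str:
    """Stable prefix first, per-request content last."""
    return SYSTEM + POLICY + "Context:\n" + "\n".join(chunks) + f"\nQuestion: {question}"


def cache_hostile(chunks: list[str], question: str) -> str:
    """Per-request content first, so the prompts diverge almost immediately."""
    return "Context:\n" + "\n".join(chunks) + "\n" + SYSTEM + POLICY + f"Question: {question}"


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading text, a rough proxy for cache-eligible input."""
    return len(os.path.commonprefix([a, b]))
```

Two `cache_friendly` requests with different chunks still share all of `SYSTEM + POLICY`; two `cache_hostile` requests stop matching as soon as the first retrieved chunk differs.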
Reliability is a layered defense against unreliable provider calls. Retries with exponential backoff handle transient failures. Fallbacks reroute to an alternate model or provider, then to a smaller or retrieval-only path, and finally to a human, when retries will not help. A circuit breaker stops calling a failing provider for a cooldown after repeated failures, then half-opens to test recovery, which is canonical software engineering applied to model gateways [10]. Anthropic’s own multi-agent post flags the absence of per-run caps and circuit breakers as a production risk, since the 15x baseline can compound far higher without them [3].
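The three layers compose into a single call wrapper. This is a simplified sketch: `primary` and `fallback` are hypothetical model callables, the thresholds are arbitrary defaults, and the half-open state here lets trial calls through after the cooldown rather than strictly one at a time.

```python
import time
from typing import Callable, Optional

LLM = Callable[[str], str]  # hypothetical model call that may raise


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures, self.cooldown_s, self.clock = max_failures, cooldown_s, clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: calls flow normally
        # Half-open: once the cooldown has passed, let a trial call through.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # open, or re-open after a failed trial


def call_with_resilience(primary: LLM, fallback: LLM, prompt: str,
                         breaker: CircuitBreaker, retries: int = 2,
                         base_delay_s: float = 0.5,
                         sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry with exponential backoff, then fall back; skip primary while open."""
    if breaker.allow():
        for attempt in range(retries + 1):
            try:
                result = primary(prompt)
                breaker.record(True)
                return result
            except Exception:
                breaker.record(False)
                if attempt == retries or not breaker.allow():
                    break
                sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return fallback(prompt)
```

In practice `fallback` can itself be another `call_with_resilience` around a second provider, which gives the chain of alternates described above.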
Anti-patterns to avoid
The recurring failures are predictable. Most of them are some version of adding autonomy the task did not ask for, or shipping without the eval that would have caught the regression. Naming them makes them easier to refuse in a design review.
- Complexity for its own sake. More agents do not mean better results; every added agent adds latency, cost, and debugging difficulty. Build the workflow unless the task is genuinely open-ended [1].
- Reaching for a framework before the primitives. Frameworks add layers of abstraction that obscure the underlying prompts and responses. Anthropic suggests starting with the API directly and reducing abstraction as you go to production [1].
- No evals. Choosing a pattern by vibes means you cannot know whether the added complexity demonstrably improves outcomes [1].
- No cost or loop guardrails. Unbounded agent loops and recursively spawning subagents are how a 15x baseline compounds into a runaway bill [3].
- Wrong tool for the knowledge axis. Fine-tuning for facts that change, or RAG for behavior that should be trained in, mismatches the pattern to the problem.
- Weak tool and interface design. Vague tool schemas, missing docs, and formats unlike natural text make the model misuse the tools it is given [1].
- Trusting an agent with irreversible actions. No guardrails, no sandbox, and no human-in-the-loop on high-stakes steps is how a single bad tool call becomes an incident.
- Cache-hostile prompts. Putting dynamic content at the front of the prompt destroys the cache-hit rate and quietly inflates every bill.
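For the loop and subagent anti-pattern in particular, a per-run budget is a small amount of code. The class below is a hypothetical sketch: the limits are arbitrary, and token counts are assumed to come from whatever usage numbers your provider returns.

```python
from typing import Optional


class BudgetExceeded(RuntimeError):
    """Raised when a run hits its token, step, or subagent-depth cap."""


class RunBudget:
    def __init__(self, max_tokens: int, max_steps: int, max_depth: int = 2,
                 _state: Optional[dict] = None, _depth: int = 0):
        self.max_tokens, self.max_steps, self.max_depth = max_tokens, max_steps, max_depth
        # Parent and child budgets share one counter, so subagents cannot
        # reset the meter by spawning.
        self._state = _state if _state is not None else {"tokens": 0, "steps": 0}
        self.depth = _depth

    def charge(self, tokens: int) -> None:
        """Call once per model or tool step with the tokens it used."""
        self._state["tokens"] += tokens
        self._state["steps"] += 1
        if self._state["tokens"] > self.max_tokens or self._state["steps"] > self.max_steps:
            raise BudgetExceeded(f"run budget exhausted: {self._state}")

    def child(self) -> "RunBudget":
        """Budget for a spawned subagent; refuses to nest past max_depth."""
        if self.depth + 1 > self.max_depth:
            raise BudgetExceeded("subagent depth cap reached")
        return RunBudget(self.max_tokens, self.max_steps, self.max_depth,
                         self._state, self.depth + 1)
```

Catching `BudgetExceeded` at the top of the run gives one place to log the overrun and return a partial result instead of a surprise invoice.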
Sources
1. Anthropic, Building Effective Agents (2024).
2. OpenAI, Function calling guide (2026).
3. Anthropic Engineering, How we built our multi-agent research system (2025).
4. OpenAI, Prompt caching guide (2026).
5. Microsoft Learn, Retrieval-Augmented Generation in Azure AI Search (2026).
6. AWS, What is Retrieval-Augmented Generation? (2026).
7. NVIDIA, NeMo Guardrails (2026).
8. OpenTelemetry, Semantic conventions for generative AI (2026).
9. Anthropic, Prompt caching with Claude (2024).
10. Martin Fowler, CircuitBreaker (2014).
11. Andrew Ng / DeepLearning.AI, Agentic Design Patterns (2024).
12. Gartner, Over 40% of Agentic AI Projects Will Be Canceled by End of 2027 (2025).
