Agentic design patterns: the spectrum from a single prompt to a multi-agent system

Agentic design patterns are the reusable ways to arrange model calls, retrieval, tools, and control logic around an LLM so it behaves like a production system. Andrew Ng's framing names four: reflection, tool use, planning, and multi-agent collaboration. In practice they sit on a wider spectrum, from a single model call to a coordinating crew, where each step up buys more autonomy at higher cost. The skill is picking the simplest pattern that clears your eval, and moving right only when the task forces it.

By Kanika Mathur, Head of Service Delivery

Reviewed by Resourcifi engineering. Published Mar 29, 2026. Updated Mar 29, 2026. 11 min read.

Key takeaways

The short version

AI architecture patterns are reusable ways to arrange model calls, retrieval, tools, and control logic around an LLM. They escalate by autonomy and cost: direct prompt to RAG to tool use to single agent to multi-agent.
The load-bearing distinction is workflows vs agents. A workflow orchestrates LLMs and tools along predefined code paths; an agent lets the model dynamically direct its own steps and tool use (Anthropic, 2024).
Anthropic’s headline guidance: find the simplest solution and add complexity only when it demonstrably improves outcomes. The most successful implementations used simple, composable patterns, not heavy frameworks.
Multi-agent is the most powerful pattern and the most expensive. Anthropic’s research system beat a single-agent baseline by 90.2% but burned about 15x the tokens of a normal chat, so gate it to high-value work.
Cut cost and latency with prompt caching (on OpenAI, automatic for prompts over 1,024 tokens, with up to about 90% off cached input; on Anthropic, up to about 90% cost reduction on long reused prompts), routing to smaller models, parallelizing independent calls, and capping agent loops.

What agentic design patterns are, and why they matter

Agentic design patterns are reusable, composable ways of arranging LLM calls, tools, data, and control logic around a model to turn a raw text generator into a reliable production system. They are the design-patterns layer of AI engineering, the equivalent of established software patterns but for orchestrating models, retrieval, tools, and feedback loops. Andrew Ng's widely cited framing names four agentic patterns, reflection, tool use, planning, and multi-agent collaboration, which an agent layers as needed.¹¹ They sit inside a wider spectrum of rising autonomy and cost that runs from a direct prompt through RAG and tool use to single and multi-agent systems, and the engineering skill is moving right only when the task demands it.

The single most important distinction underneath all of them is workflows versus agents. Anthropic defines a workflow as a system where LLMs and tools are orchestrated through predefined code paths, and an agent as a system where the LLM dynamically directs its own processes and tool usage, keeping control over how it accomplishes a task.¹ A workflow is a path you draw; an agent is a path the model draws at runtime. Almost every choice on this page reduces to deciding which of those two you actually need.

"The most successful implementations weren’t using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns."

Anthropic, Building Effective Agents (2024)

That quote is the spine of the whole topic. Anthropic’s headline guidance is to find the simplest solution possible and increase complexity only when it demonstrably improves outcomes, which sometimes means not building an agent at all.¹ Patterns matter because the right choice is usually the simpler one, and the gap between a system that ships and one that quietly burns tokens is almost always a pattern picked one notch too far up the spectrum.

The core patterns, and when to use each

The core AI architecture patterns layer onto one atomic unit, the augmented LLM, and escalate from there. In rough order of rising autonomy: direct prompt and prompt chaining, RAG, tool use, routing, parallelization, fine-tuning, the evaluator-optimizer loop, single agents, and multi-agent orchestration, wrapped by cross-cutting layers for guardrails, caching, and resilience. Each adds capability at a cost, so you climb the ladder one rung at a time.

The atomic unit is the augmented LLM: a model given retrieval, tools, and memory that it decides when to invoke. Anthropic advises investing in tailoring those capabilities to your use case and giving the model a clear, well-documented interface to them.¹ Everything below is a way of arranging augmented LLMs.

It helps to see how Ng's four agentic design patterns line up with the spectrum below. Reflection is the evaluator-optimizer loop, where the model critiques and revises its own output. Tool use is function calling, the bridge from answering to acting. Planning is what a single agent does when it decides its own steps at runtime rather than following a fixed path. Multi-agent collaboration is the orchestrator-workers shape, a lead model delegating to specialists. The other rungs here, prompt chaining, RAG, routing, and parallelization, are the workflow scaffolding those agentic patterns plug into. Naming them this way matters because the most common mistake is reaching for planning or multi-agent collaboration when reflection or plain tool use would have shipped.

Direct prompt and prompt chaining

A single well-crafted model call, with no retrieval or tools, is the default and covers most production AI features. Prompt chaining is the workflow extension: decompose the task into fixed sequential steps, each consuming the prior output, with optional programmatic gates between them. Anthropic frames it as ideal when a task cleanly decomposes into fixed subtasks, trading latency for higher accuracy.¹
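As a minimal sketch of the chaining idea above: two fixed steps with a programmatic gate between them. The `fake_llm` stub, the `OUTLINE:`/`DRAFT:` prefixes, and the three-section rule are all assumptions made for illustration; a real chain would call your model client in place of the stub.

```python
from typing import Callable

# Hypothetical stand-in for a model call; swap in a real client.
LLM = Callable[[str], str]


def fake_llm(prompt: str) -> str:
    """Deterministic stub so the sketch runs without an API key."""
    if prompt.startswith("OUTLINE:"):
        return "1. Problem\n2. Approach\n3. Result"
    if prompt.startswith("DRAFT:"):
        body = prompt[len("DRAFT:"):].replace("\n", "; ").strip()
        return "Draft covering: " + body
    return ""


def outline_gate(outline: str) -> bool:
    """Programmatic gate: require at least three numbered sections."""
    numbered = [line for line in outline.splitlines() if line[:1].isdigit()]
    return len(numbered) >= 3


def run_chain(topic: str, llm: LLM) -> str:
    """Step 1 writes an outline, the gate checks it, step 2 drafts from it."""
    outline = llm(f"OUTLINE: {topic}")
    if not outline_gate(outline):
        # Failing fast here is cheaper than drafting from a bad outline.
        raise ValueError("outline failed the gate; stopping before the draft step")
    return llm(f"DRAFT: {outline}")


result = run_chain("prompt caching", fake_llm)
```

The gate is plain code, not another model call, which is what keeps the path predictable: a bad intermediate output stops the chain instead of flowing into the next step.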

RAG and tool use

RAG, retrieval-augmented generation, pulls relevant documents from an external store at query time and injects them into the prompt so the model answers grounded in your data. Microsoft’s framing is that you update the knowledge base by adding a document to the index, with no model change, and that the pattern naturally produces attributable, citable answers.⁵ AWS treats it as the standard grounding pattern for current or proprietary knowledge.⁶ When the retrieval step itself needs reasoning, it becomes agentic RAG, where a retrieval agent chooses among retrievers and plans multi-step lookups.

Tool use, or function calling, gives the model a schema of callable functions and lets it decide when to call them and with what arguments; the runtime executes and returns the result.² It is the bridge from answering to acting. Anthropic’s tool-design principle is worth printing on the wall: invest in the agent-computer interface as much as you would a human one, give the model enough tokens to think, and document tools thoroughly.¹
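The runtime half of tool use can be sketched as follows, assuming the model emits a JSON object with `name` and `arguments`. The schema shape here is illustrative and is not any vendor's exact format; `get_order_status` and its order data are hypothetical.

```python
import json
from typing import Any, Callable


# Hypothetical tool: a real one would call an API or a database.
def get_order_status(order_id: str) -> dict[str, Any]:
    orders = {"A100": "shipped", "A200": "processing"}
    return {"order_id": order_id, "status": orders.get(order_id, "unknown")}


# The documented interface the model would see (illustrative shape only).
TOOL_SCHEMAS = [{
    "name": "get_order_status",
    "description": "Look up the fulfilment status of one order by its ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string", "description": "e.g. A100"}},
        "required": ["order_id"],
    },
}]

TOOL_REGISTRY: dict[str, Callable[..., dict[str, Any]]] = {
    "get_order_status": get_order_status,
}


def execute_tool_call(call_json: str) -> dict[str, Any]:
    """Validate and run one model-requested call; report errors as data."""
    try:
        call = json.loads(call_json)
    except json.JSONDecodeError:
        return {"error": "tool call was not valid JSON"}
    if not isinstance(call, dict) or not isinstance(call.get("arguments", {}), dict):
        return {"error": "tool call must be a JSON object with object arguments"}
    name, args = call.get("name"), call.get("arguments", {})
    schema = next((s for s in TOOL_SCHEMAS if s["name"] == name), None)
    if schema is None or name not in TOOL_REGISTRY:
        return {"error": f"unknown tool: {name}"}
    params = schema["parameters"]
    missing = [k for k in params["required"] if k not in args]
    if missing:
        return {"error": f"missing arguments: {missing}"}
    unexpected = [k for k in args if k not in params["properties"]]
    if unexpected:
        return {"error": f"unexpected arguments: {unexpected}"}
    return TOOL_REGISTRY[name](**args)
```

Returning errors as structured data, instead of raising, lets the model see what went wrong and correct its next call.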

Routing and parallelization

Routing classifies the input and sends it down a specialized path, prompt, or model. Anthropic notes routing works well when distinct categories are better handled separately, and its canonical example sends simple queries to a cheap, fast model and hard ones to a more capable model.¹ That cost-tiering idea, cheapest model first and escalate only when quality is insufficient, is what we call a cascade; treat the word as our framing and the citable primitive as Anthropic’s routing. Parallelization runs several calls at once and aggregates them, in two flavors: sectioning splits independent subtasks to run concurrently, and voting runs the same task several times for confidence.¹
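A small sketch of the cascade and the voting flavor described above. The two stub models and the `"unsure"` abstention signal are assumptions for illustration; a production cascade would need its own definition of "quality is insufficient".

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

LLM = Callable[[str], str]


# Hypothetical stubs standing in for a cheap model and a more capable one.
def small_model(prompt: str) -> str:
    return "refund" if "refund" in prompt.lower() else "unsure"


def large_model(prompt: str) -> str:
    return "billing" if "invoice" in prompt.lower() else "general"


def cascade(prompt: str, cheap: LLM, strong: LLM) -> tuple[str, str]:
    """Try the cheap model first; escalate only when it abstains."""
    answer = cheap(prompt)
    if answer != "unsure":
        return answer, "small"
    return strong(prompt), "large"


def vote(prompt: str, llm: LLM, n: int = 3) -> str:
    """Voting: run the same task n times concurrently, keep the majority answer."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(llm, [prompt] * n))
    return Counter(answers).most_common(1)[0][0]


label, tier = cascade("Where is my invoice?", small_model, large_model)
```

Returning the tier alongside the answer makes it easy to track how often the cascade escalates, which is the number that decides whether the cheap tier is paying for itself.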

Fine-tuning and the evaluator-optimizer loop

Fine-tuning bakes behavior into the weights through training, so a format, tone, or narrow skill holds without runtime prompting. The durable rule across Microsoft and AWS guidance is to use RAG for knowledge that changes and fine-tuning for behavior that is stable; many systems do both, and the full tradeoff is in our guide on retrieval patterns. The evaluator-optimizer loop pairs a generator LLM with a second LLM that scores the output against criteria and returns feedback, iterating until it is good enough. Anthropic recommends it when evaluation criteria are clear and iterative refinement provides measurable value, as in literary translation or multi-round search.¹
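The evaluator-optimizer loop can be sketched as below. Both `generate` and `evaluate` are hypothetical stubs, and the single "cite a source" criterion is an assumption chosen to keep the example deterministic; in practice each would be a model call with explicit criteria.

```python
from typing import Callable, Optional

Generator = Callable[[str, Optional[str]], str]
Evaluator = Callable[[str], tuple[bool, str]]


# Hypothetical generator: revises its draft when feedback asks for a source.
def generate(task: str, feedback: Optional[str]) -> str:
    draft = f"Summary of {task}."
    if feedback and "source" in feedback:
        draft += " Source: internal wiki."
    return draft


# Hypothetical evaluator: one clear criterion, returned with feedback.
def evaluate(draft: str) -> tuple[bool, str]:
    if "Source:" not in draft:
        return False, "cite a source"
    return True, "ok"


def refine(task: str, generator: Generator, evaluator: Evaluator,
           max_rounds: int = 3) -> tuple[str, int]:
    """Generate, score, and revise until the draft passes or rounds run out."""
    feedback: Optional[str] = None
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = generator(task, feedback)
        passed, feedback = evaluator(draft)
        if passed:
            return draft, round_no
    # Out of rounds: return the last draft and let the caller decide.
    return draft, max_rounds


final_draft, rounds_used = refine("Q3 report", generate, evaluate)
```

The `max_rounds` cap matters as much as the loop itself, since an evaluator that never passes would otherwise iterate, and bill, indefinitely.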

Single agent and multi-agent

A single agent plans and operates in a loop, using tools based on environmental feedback and running until it judges the task done or hits a checkpoint. Anthropic positions it for open-ended problems where you cannot predict the number of steps or hardcode a fixed path, with the caveat that it carries higher cost and a risk of compounding errors, so it needs sandboxing and guardrails.¹

A multi-agent system, in the orchestrator-workers shape, uses a lead LLM to decompose a task at runtime, delegate subtasks to worker LLMs in parallel, and synthesize the result. It is the most powerful pattern and the most expensive, which the figures below make concrete. The deeper treatment is in our guide on multi-agent systems.
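The single-agent loop described above can be sketched in a few lines. The `scripted_policy` function stands in for the model's next-step decision and the `search` tool is a stub; both are assumptions for illustration. The part worth copying is the hard step cap.

```python
from typing import Callable

Policy = Callable[[list[str]], dict]


# Hypothetical policy: a real agent would ask the model for the next action.
def scripted_policy(history: list[str]) -> dict:
    if not any(item.startswith("search:") for item in history):
        return {"action": "search", "input": "refund policy"}
    return {"action": "finish", "input": "Refunds are allowed within 30 days."}


# Hypothetical tool set keyed by action name.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: f"search: top hit for '{query}'",
}


def run_agent(policy: Policy, tools: dict[str, Callable[[str], str]],
              max_steps: int = 5) -> dict:
    """Loop: decide, act, observe. Stop on 'finish' or at the step cap."""
    history: list[str] = []
    for step in range(1, max_steps + 1):
        decision = policy(history)
        if decision["action"] == "finish":
            return {"answer": decision["input"], "steps": step, "stopped": "done"}
        tool = tools.get(decision["action"])
        if tool is None:
            observation = f"error: no tool named {decision['action']}"
        else:
            observation = tool(decision["input"])
        history.append(observation)
    # The cap is the guardrail: an agent that never finishes still terminates.
    return {"answer": None, "steps": max_steps, "stopped": "step_cap"}


outcome = run_agent(scripted_policy, TOOLS)
```

The same loop with a workflow would have the steps written out in code; here the policy chooses them at runtime, which is exactly why the cap and the unknown-tool branch are needed.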

Multi-agent pays in quality and in tokens

Two numbers from one Anthropic study, side by side. A multi-agent research system beat the single-agent baseline by 90.2% on an internal eval, and it consumed about 15x the tokens of a normal chat to do it. Read them together: the lift is real, and so is the bill.

Data behind this chart
Metric	Multi-agent result
Quality lift over single-agent baseline (internal research eval)	+90.2%
Token usage vs a normal chat interaction	about 15x
Share of performance variance explained by token usage	about 80%

Source: Anthropic Engineering, How we built our multi-agent research system (2025). Figures are from that study and are not a general guarantee.

The cost caveat is as important as the headline. In the same study, token usage alone explained about 80% of the performance variance, with the number of tool calls and the model choice as the two other explanatory factors (the three together reaching roughly 95%), and the multi-agent system used roughly 15x the tokens of a normal chat.³ The lesson is blunt: multi-agent only pays off for high-value tasks that can absorb the token multiplier. The market is already learning it the hard way: Gartner predicts more than 40% of agentic AI projects will be canceled by the end of 2027 because of escalating cost, unclear value, and weak risk controls, which is the same failure the simplest-thing-first rule exists to prevent.¹²

Around all of these patterns sit three cross-cutting layers. Guardrails inspect inputs for injection, PII, and toxicity and outputs for policy and leakage, using products like NVIDIA NeMo Guardrails, OpenAI’s Moderation API, or Azure AI Content Safety.⁷ Human-in-the-loop review inserts approval at high-risk or irreversible steps. Caching and resilience keep cost and reliability in check, which the production section below covers.

How to choose a pattern: a decision framework

Choose by what the task needs and escalate only when forced, which mirrors Anthropic’s simplest-thing-first guidance. Three questions settle most cases. One, does the model already know this? If not, reach for RAG or tools. Two, can you predict the steps? If yes, build a workflow; if no, build an agent. Three, does the added autonomy demonstrably beat the simpler version on your eval? If it does not, do not ship it.
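One way to read the three questions above is as a small function. This is a sketch of that reading, not a rule from any of the cited sources; the `TaskProfile` fields and the returned labels are names invented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Answers to the three questions (field names are illustrative)."""
    model_already_knows: bool
    steps_predictable: bool
    agent_beats_simpler_on_eval: bool = False


def choose_pattern(task: TaskProfile) -> list[str]:
    """Return the rungs to build, starting from the cheapest."""
    plan = ["direct prompt"]
    # Question one: a knowledge or action gap calls for RAG or tools.
    if not task.model_already_knows:
        plan.append("RAG or tools")
    # Question two: predictable steps mean a workflow, not an agent.
    if task.steps_predictable:
        plan.append("workflow")
    # Question three: ship the agent only if it wins on the eval.
    elif task.agent_beats_simpler_on_eval:
        plan.append("agent")
    else:
        plan.append("keep the simpler version; the agent did not clear the eval")
    return plan


faq_bot = choose_pattern(TaskProfile(model_already_knows=False, steps_predictable=True))
```

Note that the eval result is an input, not something the function can compute: the third question can only be answered by running both versions against your own test set.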

The table below maps a need to the pattern that meets it. Read it top to bottom as the same escalation ladder: knowledge and action sit low, reliability loops in the middle, full agency at the top, and the cross-cutting layers apply throughout. Building this decision into a real product is the work our AI application development team does, where the pattern gets chosen against an eval before any prompt is written.

Match the need to the pattern
If you need	Reach for	Autonomy
Fresh or proprietary knowledge, with citations	RAG (agentic RAG if retrieval needs reasoning)	None
To take actions, hit APIs, run code	Tool use / function calling	Low
Consistent domain behavior, tone, or format	Fine-tuning, often with RAG	None
Different inputs down different paths	Routing; cascade when the goal is cost	Low
Reliability and quality on hard outputs	Evaluator-optimizer, voting, human-in-the-loop	Low
Open-ended, unpredictable step count	Single agent, then multi-agent only if one stalls	High
Speed on independent subtasks	Parallelization (sectioning)	Low
Lower cost and latency	Caching, routing to small models	n/a
Safety, compliance, abuse resistance	Guardrails layer plus human-in-the-loop	n/a
Production resilience	Fallback plus circuit breaker	n/a

The order matters more than any single row. A direct prompt with few-shot examples comes first because it is cheapest, then RAG when the gap is knowledge, then tools when the model needs to act, then a workflow when the steps are predictable, then an agent when they are not. You earn each rung by proving on an eval that it beats the rung below.

Production concerns: evals, cost, latency, reliability

Four concerns decide whether a pattern survives contact with production: evaluation, cost, latency, and reliability. Build a task-specific eval set before you pick a pattern, because you cannot tell whether added complexity helps without one. Control cost and latency with caching and routing. Hold reliability with validation gates, retries, fallbacks, and a circuit breaker. None of these is optional once real traffic arrives.

Start with evals, because the whole decision framework rests on them. Anthropic recommends optimizing a simple prompt with comprehensive evaluation before adding any agentic steps; you cannot manage what you do not measure.¹ Observability is the runtime half: trace every model and tool call, and track tokens, latency, cache-hit rate, and cost per feature. Since token usage explains about 80% of multi-agent performance variance, token telemetry is the highest-signal metric you have, and the vendor-neutral standard for emitting it is the OpenTelemetry GenAI semantic conventions.⁸

Cost and latency move together. Prompt caching is the safest lever: OpenAI's prompt-caching guide reports automatic caching for prompts over 1,024 tokens with no code change, cutting input token cost up to about 90% and latency up to about 80%,⁴ and Anthropic reports prompt caching can cut cost up to about 90% and latency up to about 85% on long, reused prompts.⁹ The engineering tip that makes caching pay is to keep the shared prefix stable and put dynamic content, like retrieved chunks and chat history, last, so more of the prompt stays cache-eligible. Beyond caching, route trivial queries to small models, parallelize independent calls, stream responses, and set per-run token and cycle budgets.
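The prefix-ordering tip can be illustrated with a toy prompt builder. Providers cache on token prefixes, so the character-level comparison below is only a rough proxy, and the prompt text, section labels, and helper names are all assumptions for illustration.

```python
# Stable content: identical on every request, so it can form a cached prefix.
SYSTEM = "You are a support assistant. Answer only from the provided context."
TOOL_DOCS = "Tools: get_order_status(order_id) returns the status of one order."


def build_prompt(chunks: list[str], history: list[str], question: str,
                 stable_first: bool = True) -> str:
    """Assemble a prompt with stable parts first (or last, for comparison)."""
    stable = [SYSTEM, TOOL_DOCS]
    dynamic = [
        "Context:\n" + "\n".join(chunks),
        "History:\n" + "\n".join(history),
        "Question: " + question,
    ]
    parts = stable + dynamic if stable_first else dynamic + stable
    return "\n\n".join(parts)


def shared_prefix_len(a: str, b: str) -> int:
    """Count leading characters two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


request_1 = (["Refunds: 30 days."], [], "Can I return this?")
request_2 = (["Shipping: 3-5 days."], ["user: hi"], "When will it arrive?")

friendly = shared_prefix_len(build_prompt(*request_1), build_prompt(*request_2))
hostile = shared_prefix_len(
    build_prompt(*request_1, stable_first=False),
    build_prompt(*request_2, stable_first=False),
)
```

With stable content first, the two requests share the whole system and tool section; with dynamic content first, they diverge almost immediately and nothing after that point is reusable.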

Reliability is a layered defense against unreliable provider calls. Retries with exponential backoff handle transient failures. Fallbacks reroute to an alternate model or provider, then to a smaller or retrieval-only path, and finally to a human, when retries will not help. A circuit breaker stops calling a failing provider for a cooldown after repeated failures, then half-opens to test recovery, which is canonical software engineering applied to model gateways.¹⁰ Anthropic’s own multi-agent post flags the absence of per-run caps and circuit breakers as a production risk, since the 15x baseline can compound far higher without them.³
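The three layers above can be combined in one small sketch. The thresholds, delays, and the `primary`/`fallback` callables are assumptions for illustration, and the broad `except Exception` should be narrowed to transient errors in real code.

```python
import time
from typing import Callable, Optional

LLM = Callable[[str], str]


class CircuitBreaker:
    """Opens after repeated failures, then half-opens once the cooldown passes."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: after the cooldown, allow a trial call to test recovery.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial


def call_with_resilience(prompt: str, primary: LLM, fallback: LLM,
                         breaker: CircuitBreaker, retries: int = 2,
                         base_delay_s: float = 0.5,
                         sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry the primary with exponential backoff, then use the fallback."""
    if breaker.allow():
        for attempt in range(retries + 1):
            try:
                result = primary(prompt)
                breaker.record(True)
                return result
            except Exception:  # sketch only: catch transient errors specifically
                breaker.record(False)
                if attempt < retries and breaker.allow():
                    sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
                else:
                    break
    return fallback(prompt)
```

The `clock` and `sleep` parameters are injectable so the behavior can be tested without real waiting; while the breaker is open, calls skip the primary entirely and go straight to the fallback.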

Anti-patterns to avoid

The recurring failures are predictable. Most of them are some version of adding autonomy the task did not ask for, or shipping without the eval that would have caught the regression. Naming them makes them easier to refuse in a design review.

Complexity for its own sake. More agents do not mean better results; every added agent adds latency, cost, and debugging difficulty. Build the workflow unless the task is genuinely open-ended.¹
Reaching for a framework before the primitives. Frameworks add layers of abstraction that obscure the underlying prompts and responses. Anthropic suggests starting with the API directly and reducing abstraction as you go to production.¹
No evals. Choosing a pattern by vibes means you cannot know whether the added complexity demonstrably improves outcomes.¹
No cost or loop guardrails. Unbounded agent loops and recursively spawning subagents are how a 15x baseline compounds into a runaway bill.³
Wrong tool for the knowledge axis. Fine-tuning for facts that change, or RAG for behavior that should be trained in, mismatches the pattern to the problem.
Weak tool and interface design. Vague tool schemas, missing docs, and formats unlike natural text make the model misuse the tools it is given.¹
Trusting an agent with irreversible actions. No guardrails, no sandbox, and no human-in-the-loop on high-stakes steps is how a single bad tool call becomes an incident.
Cache-hostile prompts. Putting dynamic content at the front of the prompt destroys the cache-hit rate and quietly inflates every bill.

Frequently asked

Agentic design patterns questions

What are agentic design patterns?

Agentic design patterns are reusable, composable ways to arrange model calls, retrieval, tools, and control logic around an LLM so it works as a reliable production system. Andrew Ng's widely cited framing names four: reflection, where the model critiques and revises its own output; tool use, where it calls functions to act; planning, where it decides its own steps; and multi-agent collaboration, where specialized agents divide the work. They sit on a wider spectrum of rising autonomy and cost, from a direct prompt through RAG to a multi-agent system, and the skill is picking the simplest pattern that clears your eval and moving up only when the task forces it.

What is the difference between an AI workflow and an AI agent?

A workflow orchestrates LLMs and tools along predefined code paths that you draw in advance. An agent lets the LLM dynamically direct its own steps and tool use, keeping control over how it accomplishes the task, with the path determined at runtime. This is Anthropic’s definition, and it is the load-bearing distinction in the whole topic: build a workflow when you can predict the steps, and an agent only when you cannot.

RAG or fine-tuning: which should I use?

Use RAG for knowledge that is fresh, proprietary, or changing and needs citations, since you update it by editing an index with no retraining. Use fine-tuning for behavior that is stable, like a consistent tone, format, or narrow skill baked into the weights. Many production systems use both: fine-tune for domain fluency and layer RAG for current, citable facts. The axis is knowledge versus behavior.

When should I use a multi-agent architecture?

Use multi-agent only for complex, open-ended tasks where you cannot predict the subtasks and the value of the result justifies the cost. Anthropic measured a multi-agent research system beating a single-agent baseline by 90.2%, but it used about 15x the tokens of a normal chat, and token usage alone explained roughly 80% of the performance variance. Reach for a single agent first and add agents only if one stalls.

How do I reduce the cost and latency of an LLM application?

Start with prompt caching: OpenAI applies automatic caching for prompts over 1,024 tokens, cutting cached-input cost up to about 90%, and Anthropic reports up to about 90% cost reduction and up to about 85% latency reduction on long reused prompts. Keep the shared prefix stable and put dynamic content last so more of the prompt is cache-eligible. Then route trivial queries to smaller models, parallelize independent calls, add semantic caching, and cap agent loops with per-run token budgets.

Kanika Mathur

Head of Service Delivery, Resourcifi

Kanika Mathur is Head of Service Delivery at Resourcifi, where her engineering pods choose the pattern before they write the prompt and defend that choice against a task-specific eval. She has watched teams reach for a multi-agent build when a chained prompt would have shipped in a week, and she writes these guides to make the simpler call easier to argue for.