Multi-agent systems: topologies, tradeoffs, and when they actually help
"Multi-agent systems" is the loudest phrase in AI engineering right now, yet most of the time a single agent or a plain workflow is the better call. This guide defines what a multi-agent system is, lays out the topologies you can build, and is honest about the cost, latency, and failure modes that decide whether splitting the work pays off.

The short version
- A multi-agent system is an architecture where several autonomous agents, each with its own role, instructions, and tools, coordinate to complete a goal that is hard for one agent alone. In LLM systems each agent is usually a language model given a scoped objective and the ability to use tools and talk to other agents.
- The common topologies are orchestrator-worker, sequential, hierarchical, parallel, debate, and network. Orchestrator-worker, where a lead agent delegates to parallel workers, is the most common production pattern.
- The honest headline is cost. Anthropic reports that agents use about 4x the tokens of a chat, and multi-agent systems about 15x, and that token usage alone explained around 80% of performance variance in one of their evals.
- Multi-agent does not always win. It underperforms on tightly interdependent tasks (Anthropic names coding), adds latency through coordination, and compounds errors. A UC Berkeley study found many failures are design failures, not model failures.
- The decision rule from Anthropic: find the simplest solution that works and add agents only when they demonstrably improve outcomes. Reach for multi-agent when the task is high-value, parallelizable, and exceeds one context window. Otherwise a single agent or a deterministic workflow is cheaper and easier to operate.
What a multi-agent system is, versus a single agent
A multi-agent system is an architecture in which multiple autonomous agents, each with its own role, instructions, and tools, coordinate to accomplish a goal that would be hard for any single agent to complete alone. In the LLM era each agent is typically a language model given a scoped objective, a toolset, and the ability to communicate with other agents or a coordinator. Multi-agent systems are a specialization of AI agents, so the agent fundamentals carry over before any coordination is added.
The term predates LLMs. In classical distributed AI, Michael Wooldridge defines an agent as a computer system situated in an environment that is capable of flexible autonomous action to meet its design objectives, and a multi-agent system as a collection of such interacting agents that coordinate toward goals [5]. The LLM version keeps that shape and swaps in language models as the reasoning core.
The contrast that matters for a buyer is simple. A single agent is one LLM in a loop: it reasons, calls tools, observes the result, and repeats until the task is done, with all state in one context window. A multi-agent system distributes the work across several agents. Anthropic frames the canonical structure as orchestrator-worker, where a lead agent coordinates the process while delegating to specialized subagents that operate in parallel [1]. Single-agent keeps everything in one reasoning thread, which is simpler, cheaper, and easier to debug. Multi-agent buys parallelism and separation of concerns at the cost of more tokens, more coordination, and more failure surface.
One nuance avoids a common overclaim. Anthropic separates workflows, where LLMs and tools are orchestrated through predefined code paths, from agents, where the model dynamically directs its own process and tool use [2]. Many systems sold as multi-agent are really multi-step workflows, which is often the better and more predictable choice. Multiple LLM calls do not by themselves make a system agentic. Teams that want the deterministic version of this often start with AI workflow automation and add autonomous agents only when the workflow ceiling is reached.
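As a minimal sketch of the single-agent loop described above (reason, call a tool, observe, repeat), the following keeps all state in one context list. The `fake_model` stub, the `TOOLS` table, and `run_single_agent` are hypothetical names for illustration, not part of any real library; a real system would replace `fake_model` with an LLM call.

```python
from typing import Callable

# Hypothetical stand-in for an LLM call: it asks for a tool once,
# then answers as soon as an observation is present in the context.
def fake_model(context: list[str]) -> dict:
    observations = [c for c in context if c.startswith("observation: ")]
    if not observations:
        return {"type": "tool", "name": "lookup", "arg": "capital of France"}
    return {"type": "final", "text": observations[-1].removeprefix("observation: ")}

# Hypothetical tool table with a single toy tool.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup": lambda query: "Paris" if "France" in query else "unknown",
}

def run_single_agent(task: str, model=fake_model, max_steps: int = 5) -> str:
    context = [f"task: {task}"]  # all state lives in this one context
    for _ in range(max_steps):
        step = model(context)                      # reason
        if step["type"] == "final":
            return step["text"]
        result = TOOLS[step["name"]](step["arg"])  # call a tool
        context.append(f"observation: {result}")   # observe, then repeat
    raise RuntimeError("step budget exhausted before a final answer")

answer = run_single_agent("What is the capital of France?")
```

The step budget is a design choice in this sketch: a loop with no cap is one of the easier ways for an agent to burn tokens without finishing.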
Architectures and topologies
The common multi-agent topologies are orchestrator-worker (a supervisor delegates to parallel workers), sequential or pipeline (agents run in a fixed order), hierarchical (supervisors of supervisors), parallel (independent agents by section or vote), debate or critic (agents critique each other to improve reasoning), and network (decentralized many-to-many handoffs). A useful way to group them: orchestrator-worker, sequential, and hierarchical are the controlled family where you keep determinism, while debate and network are the emergent family where the model keeps control.
Each topology trades coordination, parallelism, and cost differently. The table below is the practical comparison, drawn from Anthropic's pattern writeups, the LangGraph multi-agent concepts, and the multi-agent debate research [1][2][7][3].
| Topology | Coordination | Parallelism | Cost and latency | Best for | Watch-out |
|---|---|---|---|---|---|
| Orchestrator-worker | Central lead delegates | High | High | Breadth-first research, routing to specialists | Lead-agent prompt quality decides everything |
| Sequential / pipeline | Fixed order, each consumes the last output | None | Low | Decomposable, predictable processes | Errors propagate forward |
| Hierarchical | Supervisors of supervisors | High | Highest | Large task trees, organization-shaped problems | Latency compounds at every layer |
| Parallel (section or vote) | Independent, then aggregate | Highest | High | Independent subtasks, self-consistency | Only helps if subtasks are truly independent |
| Debate / critic | Agents critique across rounds | Medium | High | Hard reasoning, factuality-sensitive answers | Large token cost for modest gains |
| Network / decentralized | Many-to-many handoffs, no central lead | Variable | Hard to bound | Open-ended negotiation, unknown path | Loops and miscoordination, hardest to debug |
Two of these deserve a note. Orchestrator-worker is what Anthropic's research feature runs: a lead agent plans, spins up three to five subagents in parallel, and a separate pass handles citations and synthesis [1]. Debate has the clearest research backing for quality gains: Du et al. show that having multiple agents propose and critique answers across a few rounds improves factual accuracy and arithmetic reasoning over a single-agent baseline, and that adding agents or rounds helps further [3]. The caveat is that debate spends a lot of tokens for those gains, so it is justified mainly on genuinely hard reasoning.
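The orchestrator-worker shape can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Anthropic's implementation: `plan`, `worker`, and `orchestrate` are hypothetical names, the fixed three-way decomposition stands in for a model-driven plan, and `asyncio.sleep(0)` stands in for real model and tool calls.

```python
import asyncio

def plan(goal: str) -> list[str]:
    # Hypothetical: a real lead agent would ask a model to decompose the goal.
    return [f"{goal}: background", f"{goal}: recent work", f"{goal}: open problems"]

async def worker(subtask: str) -> str:
    # Each worker would run in its own context window in a real system.
    await asyncio.sleep(0)  # placeholder for model and tool calls
    return f"findings for {subtask}"

async def orchestrate(goal: str) -> str:
    subtasks = plan(goal)
    # Workers run concurrently; gather returns results in subtask order.
    results = await asyncio.gather(*(worker(s) for s in subtasks))
    # Synthesis pass: here just a join, in practice another model call.
    return "\n".join(results)

report = asyncio.run(orchestrate("battery recycling"))
```

The parallel fan-out only pays off when the subtasks are independent, which is the same watch-out the table lists for the parallel topology.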
How agents coordinate
Agents coordinate through four primitives: roles (each agent gets a scoped persona, objective, and tools), handoffs (one agent transfers control of the task to another), agents-as-tools (a manager calls a specialist like a function and keeps control), and shared state or message passing (agents read and write a common state object, or talk in messages). The teachable distinction is handoffs versus agents-as-tools: a handoff passes the mic, agents-as-tools keeps it.
The OpenAI Agents SDK makes that last distinction concrete. A handoff is exposed to the model as a tool call such as transfer_to_refund_agent, after which the receiving agent owns the next part of the interaction; use it when routing itself is part of the workflow. Agents-as-tools lets a manager agent call a specialist that returns a result without taking over the user-facing conversation, which fits a bounded subtask [6]. For shared state, LangGraph models the whole system as a graph of nodes that pass a common state object between them, with a supervisor maintaining global state and dispatching subtasks [7]. For message passing, AutoGen's building block is a conversable agent that initiates and replies to messages, though centralizing that through a group chat manager can become a coordination bottleneck [9].
The single most actionable coordination lesson is about delegation quality. Anthropic found the orchestrator must give each subagent an objective, an output format, guidance on the tools and sources to use, and clear task boundaries, otherwise subagents duplicate work or leave gaps [1]. Vague delegation is the fastest way to turn a multi-agent system into an expensive way to get a worse answer. Building these coordination layers in production is exactly the work our AI agent development team does.
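The difference between a handoff and agents-as-tools comes down to who owns the conversation afterwards. The sketch below models that in plain Python rather than with the OpenAI Agents SDK itself; `Conversation`, `refund_specialist`, and the two handler functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Conversation:
    owner: str                              # which agent currently talks to the user
    log: list[str] = field(default_factory=list)

def refund_specialist(request: str) -> str:
    # Hypothetical specialist; a real one would be a model with its own tools.
    return f"refund approved for {request}"

def handle_with_handoff(conv: Conversation, request: str) -> Conversation:
    # Handoff: control transfers, and the specialist owns what comes next.
    conv.log.append("transfer_to_refund_agent")
    conv.owner = "refund_agent"
    conv.log.append(refund_specialist(request))
    return conv

def handle_with_agent_as_tool(conv: Conversation, request: str) -> Conversation:
    # Agent-as-tool: the manager calls the specialist like a function
    # and keeps ownership of the user-facing conversation.
    result = refund_specialist(request)
    conv.log.append(f"manager relays: {result}")
    return conv

handed_off = handle_with_handoff(Conversation(owner="triage_agent"), "order 123")
kept = handle_with_agent_as_tool(Conversation(owner="triage_agent"), "order 123")
```

After the handoff the owner is the refund agent; after the tool-style call it is still the triage agent, which is the whole distinction.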
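One way to enforce the four delegation elements listed above is to make them required fields and refuse to dispatch a brief that leaves any blank. This is a sketch, not Anthropic's method: the `DelegationBrief` class, its field names, and the rendered prompt layout are all assumptions for illustration.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class DelegationBrief:
    objective: str          # what the subagent must find or produce
    output_format: str      # the shape the orchestrator expects back
    tools_and_sources: str  # guidance on which tools and sources to use
    boundaries: str         # what is out of scope, to avoid duplicated work

    def missing(self) -> list[str]:
        # Names of fields that are empty or whitespace only.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        gaps = self.missing()
        if gaps:
            raise ValueError("vague delegation, missing: " + ", ".join(gaps))
        return (
            f"Objective: {self.objective}\n"
            f"Output format: {self.output_format}\n"
            f"Tools and sources: {self.tools_and_sources}\n"
            f"Boundaries: {self.boundaries}"
        )

brief = DelegationBrief(
    objective="List the three largest EU battery recyclers by capacity",
    output_format="Bullet list with one source per item",
    tools_and_sources="Web search; prefer company filings over press coverage",
    boundaries="EU only; do not cover pricing",
)
prompt = brief.render()
```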
When multi-agent helps, and when it does not
Multi-agent systems help on valuable tasks that involve heavy parallelization, information that exceeds a single context window, and interfacing with many complex tools. They do not help on tightly interdependent work, low-value tasks where the roughly 15x token cost is not justified, or anything latency-sensitive. The decision rule from Anthropic is to find the simplest solution that works and add agents only when they demonstrably improve outcomes.
Start with the honest cost figure, because it reframes the whole decision. Anthropic states plainly that agents typically use about 4x more tokens than chat interactions, and multi-agent systems use about 15x more tokens than chats [1]. In one browsing-agent eval they found that token usage by itself explained about 80% of the performance variance, meaning a large share of multi-agent's advantage is simply spending more compute on the problem. On a fixed budget, that changes the math.
| Dimension | Single agent | Multi-agent system |
|---|---|---|
| Token cost | Baseline, about 4x a chat for an agent | About 15x a chat |
| Latency | Lower | Higher, from coordination overhead |
| Context | One window | Distributed across agents |
| Best tasks | Linear, interdependent, low to mid value | Parallelizable, breadth-first, high value |
| Debuggability | Easier, one trace | Harder, distributed traces |
| Error surface | Contained | Compounds across agents |
| Coding suitability | Better today | Underperforms, tasks are interdependent |
| When to choose | The default | Only when it demonstrably improves outcomes |
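To make the 4x and 15x multipliers concrete, the sketch below turns them into a per-task cost. Only the multipliers come from the Anthropic figures cited above; the chat baseline of 2,000 tokens and the price of 5.00 USD per million tokens are placeholder assumptions, so substitute your own numbers.

```python
# Multipliers relative to a plain chat, from the figures cited in the text.
MULTIPLIERS = {"chat": 1, "single_agent": 4, "multi_agent": 15}

# Assumptions for illustration only, not measured values.
CHAT_TOKENS = 2_000          # tokens in a typical chat interaction
PRICE_PER_MILLION = 5.00     # USD per million tokens, blended

def cost_per_task(kind: str) -> float:
    tokens = CHAT_TOKENS * MULTIPLIERS[kind]
    return tokens * PRICE_PER_MILLION / 1_000_000

def tasks_per_budget(kind: str, budget_usd: float) -> int:
    # How many tasks a fixed budget buys under each architecture.
    return int(budget_usd / cost_per_task(kind))

single = cost_per_task("single_agent")
multi = cost_per_task("multi_agent")
ratio = multi / single
```

Under these assumptions the multi-agent run costs 3.75 times the single-agent run, so the same budget covers fewer than a third as many tasks.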
The cases where multi-agent wins are specific: breadth-first work that requires exploring many independent paths at once, context that exceeds one window so each agent gets its own budget, and tasks that touch many heterogeneous tools where specialization keeps each agent's tool surface manageable. On its own internal research evaluation, Anthropic reports a multi-agent system outperformed a single-agent Claude Opus 4 baseline by 90.2%, though that is an internal eval and not a universal benchmark [1].
The cases where it loses are just as specific. Tightly coupled tasks that share context or have many dependencies between agents underperform, and Anthropic names coding directly because subtasks are highly interdependent and current systems cannot coordinate the real-time changes well enough yet [1]. Agentic systems trade latency and cost for performance, and every added agent adds a place for errors to start and propagate [2]. The practical rule is to prefer the simplest thing that works: reach for multi-agent when the task is high-value, parallelizable, and exceeds one context window, and reach for a single agent or a deterministic workflow for everything else.
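The practical rule above can be written down as a small decision function. This is a sketch of the rule as stated in this guide, with hypothetical field names; the ordering of the checks (workflow first, then the disqualifiers, then the three multi-agent conditions) is an assumption, not something the sources prescribe.

```python
from dataclasses import dataclass

@dataclass
class Task:
    high_value: bool
    parallelizable: bool
    exceeds_one_context_window: bool
    latency_sensitive: bool = False
    tightly_interdependent: bool = False
    steps_known_in_advance: bool = False

def choose_architecture(task: Task) -> str:
    # A predictable process is best served by predefined code paths.
    if task.steps_known_in_advance:
        return "deterministic workflow"
    # Coupled or latency-sensitive work is where multi-agent loses.
    if task.tightly_interdependent or task.latency_sensitive:
        return "single agent"
    # Multi-agent only when all three conditions hold.
    if task.high_value and task.parallelizable and task.exceeds_one_context_window:
        return "multi-agent"
    return "single agent"

research = choose_architecture(Task(True, True, True))
coding = choose_architecture(Task(True, True, True, tightly_interdependent=True))
```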
Frameworks and challenges
The widely used 2026 frameworks are LangGraph (a graph with shared state for maximum control), CrewAI (role-based crews plus deterministic flows), AutoGen (conversational message passing), and the OpenAI Agents SDK (lightweight handoffs and agents-as-tools). They are convenience layers over the same underlying topologies, so choose by how much control you need and how your team models the problem, not by the framework name. The biggest challenge is not the model but the system design around it.
Each framework has a native mental model. LangGraph gives you explicit control of control-flow and context as a graph of nodes with shared state, suited to production systems that need determinism [7]. CrewAI is built on roles: you define each agent's role, goal, and backstory, assemble them into a crew, and use flows when you want a predictable pipeline [8]. AutoGen centers on agents that converse and negotiate toward a result [9]. The OpenAI Agents SDK offers the lightest primitives, with the clean handoff versus agents-as-tools split and built-in tracing [6]. Because all of them implement the same underlying topologies, the framework is a convenience layer; the architecture is what actually determines outcomes.
The challenges are well documented, and most are design problems. A UC Berkeley-led study of more than 200 tasks across seven popular frameworks built the first Multi-Agent System Failure Taxonomy, with 14 failure modes in three categories: system and specification design, inter-agent misalignment, and task verification or termination. Its headline finding is that many failures stem from poor system design rather than model performance, with agents operating on incorrect assumptions, ignoring peer input, or failing to verify outputs [4]. Beyond that taxonomy, the recurring practical challenges are the 15x token cost, latency and coordination overhead, error propagation and loops in decentralized topologies, and the observability burden of following distributed traces, which is why frameworks ship tracing and explicit graph state in the first place.
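Loops and missing termination conditions are among the failures that system design can address directly. The sketch below shows one possible guard for a decentralized handoff chain: a hop budget plus detection of a repeated handoff edge. The `run_with_guards` function, the `route` callback, and the `"done"` sentinel are hypothetical, and the approach is a generic safeguard rather than anything taken from the cited taxonomy.

```python
from typing import Callable

def run_with_guards(route: Callable[[str], str], start: str, max_hops: int = 6) -> list[str]:
    """Follow handoffs from `start` until an agent returns "done".

    Raises RuntimeError on a repeated handoff edge or an exhausted hop budget.
    """
    path = [start]
    seen_edges: set[tuple[str, str]] = set()
    current = start
    while current != "done":
        if len(path) - 1 >= max_hops:
            raise RuntimeError(f"hop budget of {max_hops} exhausted: {path}")
        nxt = route(current)  # hypothetical: the agent decides who goes next
        edge = (current, nxt)
        if edge in seen_edges:
            raise RuntimeError(f"loop detected on handoff {current} -> {nxt}")
        seen_edges.add(edge)
        path.append(nxt)
        current = nxt
    return path

# A chain that terminates, and one that bounces between two agents.
healthy = {"triage": "billing", "billing": "done"}
looping = {"sales": "support", "support": "sales"}

trace = run_with_guards(healthy.__getitem__, "triage")
```

Recording the path also gives you a single trace to inspect, which helps with the observability burden mentioned above.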
Multi-agent systems questions
What is a multi-agent system in AI?
What is the difference between single-agent and multi-agent systems?
What are the main multi-agent architectures or topologies?
When should you not use a multi-agent system?
What frameworks are used to build multi-agent systems?
Sources
1. Anthropic, How We Built Our Multi-Agent Research System (2025).
2. Anthropic, Building Effective Agents (2024).
3. Du, Li, Torralba, Tenenbaum, and Mordatch, Improving Factuality and Reasoning in Language Models through Multiagent Debate (2023).
4. Cemri, Pan, Yang et al., Why Do Multi-Agent LLM Systems Fail? (2025).
5. Michael Wooldridge, An Introduction to MultiAgent Systems, 2nd ed. (2009).
6. OpenAI, Agents SDK, Orchestrating Multiple Agents (2025).
7. LangChain, LangGraph Multi-Agent Concepts (2025).
8. CrewAI, CrewAI Documentation (2025).
9. Microsoft, AutoGen Documentation (2025).
