How to make an AI assistant (copilot): the eight engineering steps, start to ship
To make an AI assistant that is genuinely useful in a product, you build a copilot: a context-grounded, human-supervised AI that can retrieve from your data and act through your API, with the user reviewing every step. This guide covers the eight engineering steps to build one, from scoping a single use case through RAG, tool calling, guardrails, evals, and telemetry, plus the architecture underneath and the hard parts that decide whether it ships safe.

The short version
- A copilot sits between a chatbot and an agent. A chatbot runs a scripted, single-domain conversation; a copilot is grounded in the user's working context and can act through the app's API, with the human in control of each turn; an agent is goal-driven and, when designed for it, autonomous.
- Building one is eight steps: scope one in-context job, choose the model, add RAG, wire tool calling, design the UX surface, add guardrails, build evals, ship with telemetry. Each is a decision more than a line of code.
- The orchestrator is the engine: build the prompt, call the model, and if it returns a tool-use signal, execute the tool, feed the result back, and repeat until a final answer. Microsoft, OpenAI, and Anthropic document the same loop.
- Grounding is the trust mechanism. Answer only from retrieved context, attach the source, and measure it: Microsoft calls this groundedness, the RAGAS framework calls it faithfulness. Build the eval set before you ship and gate the release on it.
- A copilot is the safe first build because the human reviews each turn. Gate high-impact write actions behind confirmation, keep tool schemas strict, and graduate to an autonomous agent only once evals and guardrails hold.
What an AI copilot is, versus a chatbot and an agent
An AI copilot is an AI-powered assistant embedded in the flow of work that uses the user's own context and your product data to retrieve, draft, summarize, and take steps through the app's API, with the human in control of every turn. It sits between two things it is often confused with: a chatbot, which runs a mostly scripted conversation over a single domain, and an agent, which is goal-driven and can run multi-step and act on its own. The defining trait of a copilot is that it is grounded in your data and human-supervised, which sets it apart from a fixed script on one side and full autonomy on the other.
The cleanest way to hold the three apart is by who, or what, is driving. A chatbot captures a message, identifies intent with NLP, and replies through the same channel; Nielsen Norman Group notes that chatbots struggle when users deviate from expected linear flows and need an escape route to a human.5 Microsoft frames a copilot as enhancing workflows using enterprise data, which is the grounding that separates it from a script.1 An agent, in Microsoft's words, can range from simple prompt-and-response to fully autonomous, and OpenAI defines agents as systems that independently accomplish tasks on your behalf.12
The axis that matters is autonomy. Microsoft's declarative agents are user-initiated, while its custom-engine path can trigger actions automatically without direct user input.1 A copilot deliberately stays on the human-supervised end of that axis, which is exactly why it is the safe first build for most products. For the autonomous end of the spectrum, follow the AI agents guide; for the in-product, multi-tenant version of this build, see the AI copilot for SaaS guide.
How to make an AI assistant: the eight engineering steps
Building an AI assistant (copilot) is eight engineering decisions: scope one narrow, in-context job; choose the model; add RAG over your docs and user data; wire tool and function calling onto your product API; design the UX surface; add guardrails; build an eval set; and ship with telemetry. The order matters because each step constrains the next, and the early scoping decision is the one that most often decides whether the assistant is useful.
Read the table as the spine of the whole build. The first four steps make the copilot capable; the last four make it trustworthy and observable enough to release.
| # | Step | What you decide or do | Primary source |
|---|---|---|---|
| 1 | Scope the use case | Pick one narrow, high-frequency job the user already does in your app; start with a single agent and add complexity only when evals justify it. Conversation starters beat an open ask-me-anything. | OpenAI; NN/g |
| 2 | Choose the model | Build on the most capable model to set a quality baseline, then route simpler turns to smaller models only where evals show no quality loss. | OpenAI; Microsoft |
| 3 | RAG over your data | Ground answers in retrieved product docs and user data (chunk, embed, store, retrieve top results, inject) instead of model memory, so answers stay current and citable. | Microsoft |
| 4 | Tool / function calling | Expose the app API as tools, each with a name, description, and input schema; group them as read (data), write (action), and orchestration. This is what turns a Q&A box into a copilot that does things. | OpenAI; Anthropic |
| 5 | UX surface | Choose side panel, inline, or command bar before writing prompts; the surface dictates how much context you pass and how the user reviews or undoes output. | Microsoft; NN/g |
| 6 | Guardrails | Layer input and output classifiers, PII filters, strict schemas, tool-risk ratings, and human-in-the-loop confirmation on high-impact writes. | OpenAI; Anthropic |
| 7 | Evals | Build the dataset before shipping and gate the release on groundedness, relevance, and faithfulness, plus tool-call accuracy and task completion. | Microsoft; RAGAS |
| 8 | Telemetry | Add tracing on OpenTelemetry, then monitor token consumption, latency, error rates, and quality scores with alerts on quality and safety thresholds. | Microsoft |
Two of these steps carry most of the engineering weight. Step 4 is the line between a chat box and a copilot: OpenAI describes function calling as letting the model interface with external systems and decide when it needs to call a tool, and Anthropic's loop has the model return a tool-use stop signal, your code execute the tool, and the result feed back as a tool result.43 Step 6 is where safety lives: OpenAI prescribes guardrails at every stage, from input filtering and tool use to human-in-the-loop intervention, and Anthropic's strict mode keeps tool calls matching your schema exactly.23 Wiring those onto a product API is the core of what our AI copilot development team does.
The reference architecture underneath
A grounded, tool-using copilot decomposes into seven parts: a UX layer in the host app, an orchestrator that runs the tool-calling loop, a foundation model for reasoning, a retrieval layer for RAG, the tools bound to your API, a guardrail layer wrapping the loop, and observability spanning the whole path. The orchestrator is the engine: it builds the prompt, calls the model, and when the model asks for a tool, executes it, feeds the result back, and repeats until a final answer.
Microsoft gives the component vocabulary: the orchestrator is the central engine that manages how the system interacts with knowledge, skills, and autonomy, and the foundation model powers reasoning, language understanding, and response generation.1 The retrieval layer returns grounded context with source pointers so answers can carry citations; the tool layer holds read, write, and orchestration tools, each with a schema and a risk rating.42 The guardrail layer and observability are not bolted on at the end; they wrap every layer above.
The orchestration loop itself is the same across vendors. Microsoft, OpenAI, and Anthropic all describe a request-and-response cycle where the model can pause to call a tool, receive the tool's output, and continue reasoning until it produces a final answer.143 The key build decision is buy versus build the orchestration: Microsoft's declarative path reuses the platform's orchestrator and model and inherits its compliance posture, while a custom-engine build brings your own orchestrator and model, supports autonomy and external channels, and puts the compliance burden on you.1 Most product copilots embedded in a SaaS application sit on the custom-engine side.
UX surfaces and grounding
A copilot appears in one of three canonical placements: a side panel docked beside the workspace, inline suggestions where the user is already working, or a command bar invoked on demand. Whichever you pick, the cross-cutting rules hold: offer both clickable options and free text, restate what the copilot understood, let users review and undo before and after an action, give an escape route to a human, and ground every substantive answer in a retrieved source you can cite.
The side panel is what Microsoft 365 Copilot ships, with users selecting an assistant from a pane and conversation starters reducing the articulation barrier that NN/g describes, where users struggle to phrase a prompt that yields a useful result.67 Inline suggestions meet the user at the point of work and suit generation and edits; a command bar invoked with a shortcut suits on-demand actions and keeps the UI uncluttered. The inline and command-bar labels are common practice rather than a single vendor taxonomy, so treat them as engineering convention, while the side-panel-with-starters pattern is documented.
Grounding is the part that earns trust. The practical rule is to retrieve, answer only from the retrieved context, attach the source links, and then measure how well the answer is supported. Microsoft measures this as groundedness; the open RAGAS framework measures the same property as faithfulness.89 NN/g's cross-cutting guidance reinforces the rest: keep both input mechanisms present, be explicit about what the copilot can and cannot do, and embed it in the existing tool so the user is not forced to switch apps.57
The hard parts to get right
Four problems decide whether a copilot ships well: accuracy and hallucination, latency against quality, permissions and data security, and proving it delivers value. None is solved by a better prompt. Each has a concrete mitigation: ground and measure for accuracy, route by eval-verified model fit for latency, least-privilege plus tool-risk gating for security, and instrumented task-lift metrics for value.
On accuracy, ungrounded generation drifts into inaccurate or poorly grounded answers; the fix is RAG plus measured groundedness and relevance on Microsoft's side and faithfulness, answer relevancy, and context precision and recall in RAGAS, with releases gated on thresholds.89 On latency, the most capable model is the slowest and costliest, so route simpler turns to smaller models only where evals hold, and monitor latency and token consumption in production.28
Security is the part a copilot makes uniquely sharp, because it acts with the user's data and, through action tools, on their behalf. Enforce least-privilege, validate arguments with strict schemas, gate high-impact writes behind confirmation, and own the compliance posture, which Microsoft makes explicit for custom-engine builds.132 Finally, prove value with telemetry, not vanity: instrument usage, task completion, tool-call accuracy, and deflection, and run continuous and scheduled production evals to catch drift, because message volume is not the same as real task lift.85 One more design note: Anthropic observes that more-capable models are likelier to ask a clarifying question than to invent a missing value, so design for clarification on under-specified requests.3
How to build an AI copilot, answered
How do you make an AI assistant?
What is an AI copilot, and how is it different from a chatbot?
What are the steps to build an AI copilot?
Do I need RAG, or can the model just know my product?
How do you keep an AI copilot accurate and safe in production?
Should I build a copilot or an AI agent first?
Sources
- Microsoft Learn, Agents for Microsoft 365 Copilot, overview (updated 2026), including the declarative versus custom-engine distinction and the orchestrator, model, knowledge, and action components.
- OpenAI, A practical guide to building agents (2025), on model selection, tool types, and layered guardrails.
- Anthropic, Tool use overview (2026), on tool definitions, the tool-use to tool-result loop, strict tool use, and clarifying-question behavior.
- OpenAI, Function calling guide (2026), on the request-and-response loop and tool-definition best practices.
- Nielsen Norman Group, The User Experience of Chatbots (2018), on chatbots as limited-task conversational UIs, dual input, and escape-to-human.
- Microsoft Learn, Declarative agents for Microsoft 365 Copilot (updated 2026), on knowledge sources, plugins, and the side-panel pattern.
- Nielsen Norman Group, AI as a UX Assistant (2023), on the articulation barrier and integration into existing tools.
- Microsoft Learn, Observability in generative AI (updated 2026), on the eval lifecycle, evaluator categories, OpenTelemetry tracing, and monitoring.
- RAGAS, RAG evaluation framework documentation (2024 to 2026), on faithfulness, answer relevancy, and context precision and recall.
Building AI
AI Copilots for SaaS: Build vs Buy Guide
AI copilot vs AI agent for SaaS: a copilot assists, an agent acts. How an in-app copilot works, the RAG and multi-tenant...
Read guide →
Building AI
How to Add AI to Your SaaS Product: A Production-First Playbook
Learn how to build an AI SaaS product: the build-order playbook (prompt, RAG, fine-tune, agents), multi-tenant isolation...
Read guide →
Building AI
How to Build a Domain-Specific LLM
How to build a domain-specific LLM: RAG for facts, LoRA fine-tuning for behavior. Practical guide with compute costs from...
Read guide →
Building AI
How to Build a RAG System
Learn how to implement RAG with a seven-stage pipeline guide covering chunking, embeddings, retrieval, and evaluation. Bu...
Read guide →
Building AI
How to Build an AI SaaS Product
How to build a SaaS product with AI: the 5-phase build path, stack, margin reality, and pricing models. Trusted by 200+ e...
Read guide →
Building AI
How to Train a Custom Model
How to train an AI model: when to train vs. use an API, the 7-stage workflow, classical ML vs LLM fine-tuning, and the pi...
Read guide →
Agents & RAG
Agentic RAG: When to Use It and How to Build It
Agentic RAG explained: how it differs from naive and advanced RAG, the key patterns like corrective RAG and self-RAG, the...
Read guide →
Agents & RAG
AI Agent for Fintech: Risk, Compliance, Ops, Customer
AI agents in finance: fraud, AML, KYC and servicing use cases, how to build with money-movement guardrails and human appr...
Read guide →
Agents & RAG
AI Agent for Healthcare: Use Cases, Governance & Implementation
AI agents in healthcare: the use cases that pay off first, how to build one HIPAA-safe on FHIR with clinician review, and...
Read guide →
