How to deploy AI models: the nine-gate checklist before you ship
Deploying AI models to production is less a modeling problem than an operational one. A working demo is not a deployment, and the gap between a proof of concept that impresses in a meeting and a system that survives real users, real data, and real adversaries is where most AI projects die. This is the pre-launch gate list that closes that gap, with each of the nine gates anchored to a named standard you can audit against.

The short version
- A working demo is not a deployment. RAND (2024) found more than 80% of AI projects fail, roughly twice the non-AI IT rate, and the root causes are operational and governance gaps, not weak models.
- The checklist below is nine pre-launch gates, ordered launch-blocking first: evals, guardrails, an OWASP LLM Top 10 security review, monitoring and drift, canary and rollback, latency and cost budgets, human-in-the-loop, compliance, and hallucination observability.
- Each gate is anchored to a named standard you can audit against: OWASP LLM Top 10 2025, NIST AI RMF (AI 600-1), and first-party platform docs from Anthropic, OpenAI, Google Cloud, Google SRE, and AWS.
- Two of the named industry failure modes, "inadequate risk controls" (Gartner) and "inadequate infrastructure to deploy and manage" (RAND), are exactly what a pre-launch checklist exists to catch.
- A checklist is necessary, never sufficient. It gets you safely to production; it does not make the use case worth shipping, and guardrails reduce risk without eliminating it.
Why AI deployment fails before it reaches production
AI deployment fails far more often at the operational gate than at the modeling step. RAND found more than 80% of AI projects fail, about twice the rate of non-AI IT projects, and traced the cause to mis-framed problems, poor data, and inadequate infrastructure to deploy and manage, not to models that were not smart enough.1 The honest frame for this page: the checklist below is the thing that closes the proof-of-concept-to-production gap, because each gate is one of those failure modes wearing a different hat.
The abandonment data points the same way. Gartner predicted at least 30% of generative AI projects would be abandoned after proof of concept by the end of 2025, naming poor data quality, inadequate risk controls, escalating costs, and unclear business value.2 S&P Global Market Intelligence then found the share of companies abandoning most of their AI initiatives rose from 17% in 2024 to 42% in 2025, with the average organization scrapping roughly 46% of its proofs of concept before they reach production.3 Two of these named root causes, inadequate risk controls and inadequate infrastructure to deploy and manage, are precisely what a pre-launch checklist exists to catch. The journey that ends at this gate is covered in our PoC to production guide.
One caution before these numbers get repeated elsewhere: present them as adoption and abandonment figures, attributed and dated. The often-quoted "87% of AI projects fail" line began as a 2019 media figure about data-science pilots and is not a current peer-reviewed study, so this page leads with RAND, Gartner, and S&P instead.
| Metric | Value |
|---|---|
| Abandoning most AI initiatives (2024) | 17% |
| Abandoning most AI initiatives (2025) | 42% |
| PoCs scrapped before production (avg) | ~46% |
How to deploy AI models: the nine-gate pre-launch checklist
To deploy AI models to production safely, clear nine gates, ordered launch-blocking first: evals with a written acceptance threshold, input and output guardrails, an OWASP LLM Top 10 security review, monitoring and drift detection, a canary and rollback plan, latency and cost budgets, human-in-the-loop for high-stakes actions, a compliance and governance mapping, and observability for hallucination. Run down the table before any go-no-go call; each row names what "done" looks like and the standard it anchors to.
| Gate | What to verify | Source / standard |
|---|---|---|
| 1. Evals and thresholds | A held-out eval set with a written pass bar the build must clear, re-run on every model or prompt change | OpenAI Evals; Anthropic test-and-evaluate |
| 2. Guardrails | Input and output filters live: PII detect or redact, toxic-content block, output-format validation before anything hits a downstream system | AWS Bedrock Guardrails; OWASP LLM02, LLM05 (2025) |
| 3. Security review | Walked the 2025 list; prompt injection, sensitive-info disclosure, improper output handling, and excessive agency explicitly mitigated | OWASP Top 10 for LLM Applications 2025 |
| 4. Monitoring and drift | Online monitoring of inputs, outputs, and quality with alerts; training-serving skew and inference drift watched against a baseline | Google Cloud Vertex AI Model Monitoring |
| 5. Canary and rollback | Progressive rollout gated on SLOs, with automated rollback wired in before the first user sees the change | Google SRE, Canarying Releases |
| 6. Latency and cost budgets | A per-request latency SLO and a per-request cost ceiling, defined and load-tested before launch | Google SRE (SLOs); OWASP LLM10 (2025) |
| 7. Human-in-the-loop | High-impact or irreversible actions require human confirmation; agent autonomy and tool access scoped to least privilege | OWASP LLM06; NIST AI 600-1 |
| 8. Compliance and governance | Mapped to a framework; EU AI Act obligations checked against the phased timeline if EU users are in scope | NIST AI RMF (AI 600-1); EU AI Act timeline |
| 9. Hallucination observability | A path to detect, log, and review wrong-but-confident outputs; "I do not know" is allowed; critical outputs validated | Anthropic, Reduce hallucinations; NIST AI 600-1 |
The rest of this guide expands each gate into what "done" actually requires, with the source link beside it. If you would rather run these gates with a team that has shipped them before, that is the work our AI deployment practice does.
Gates 1 and 2: evals, acceptance thresholds, and guardrails
Evals test outputs against criteria you specify, with an explicit pass bar that gates the launch. Guardrails are the input and output filters, PII detection, toxic-content blocking, and output-format validation, that run before anything reaches a user or a downstream system. Together they turn a "vibes-based" go-no-go into a measurable one.
Writing evals to understand how an application performs against your expectations, especially when upgrading or trying new models, is an essential part of building reliable applications. Define the objective, a dataset of test cases, the metrics, and the explicit success criteria, then gate the launch on clearing them. Use ground-truth checks where answers are verifiable and model-graded scoring (LLM-as-judge) for subjective quality.4 The deeper treatment lives in our AI evaluation and evals guide.
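A minimal sketch of what "gate the launch on clearing them" can look like in code. It covers only the ground-truth case (exact match on verifiable answers), not model-graded scoring. The names `EvalCase`, `run_eval_gate`, and `fake_model`, the test cases, and the 0.9 pass bar are all illustrative assumptions, not part of any vendor's eval API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str  # ground-truth answer for a verifiable case


def exact_match(output: str, expected: str) -> bool:
    """Ground-truth check: case- and whitespace-insensitive equality."""
    return output.strip().lower() == expected.strip().lower()


def run_eval_gate(
    model: Callable[[str], str], cases: list[EvalCase], pass_bar: float
) -> tuple[bool, float]:
    """Run every case and return (passed, pass_rate) against a written pass bar."""
    if not cases:
        raise ValueError("eval set is empty; an empty set cannot gate a launch")
    hits = sum(exact_match(model(c.prompt), c.expected) for c in cases)
    rate = hits / len(cases)
    return rate >= pass_bar, rate


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "I do not know")


CASES = [
    EvalCase("capital of France?", "paris"),
    EvalCase("2 + 2?", "4"),
    EvalCase("capital of Peru?", "Lima"),
    EvalCase("3 * 3?", "9"),
]

passed, rate = run_eval_gate(fake_model, CASES, pass_bar=0.9)
print(f"pass rate {rate:.2f}, launch gate {'open' if passed else 'closed'}")
```

Wiring a function like this into CI, so it re-runs on every model or prompt change, is what makes the pass bar a gate and not a one-off measurement.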
Guardrails run on both the prompt and the output. AWS Bedrock Guardrails is a useful first-party reference: content filters cover predefined harmful categories with configurable strength, sensitive-information filters detect PII and either block or mask it (for example replacing a value with {NAME} or {EMAIL}) on input and output, and denied-topic rules block subjects you name.5 Output handling is its own OWASP risk, LLM05 Improper Output Handling: validate and sanitize a generation before it reaches a browser, shell, SQL query, or another system, because unchecked output is an injection vector.6
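To make the input and output sides concrete, here is a small sketch using only the Python standard library. It is not the Bedrock Guardrails API: the regular expressions are deliberately simple illustrations that will miss many real PII formats, and the function names and the `{EMAIL}` / `{PHONE}` mask tokens are assumptions modeled on the masking behavior described above.

```python
import html
import json
import re

# Illustrative patterns only; production PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace detected emails and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("{EMAIL}", text)
    return PHONE_RE.sub("{PHONE}", text)


def validate_output(raw: str, required_keys: set[str]) -> dict:
    """Parse model output as JSON and enforce the expected shape before use."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict) or set(data) != required_keys:
        raise ValueError(f"unexpected output shape: {data!r}")
    return data


def render_for_browser(text: str) -> str:
    """Escape a generation before it is placed into HTML."""
    return html.escape(text)


print(mask_pii("Mail jane@example.com or call +1 415-555-0100"))
print(render_for_browser("<script>alert(1)</script>"))
print(validate_output('{"answer": "42", "sources": []}', {"answer", "sources"}))
```

The same idea extends to other sinks: parameterized queries for SQL and argument lists in place of shell strings, so a generation is always treated as data and never as code.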
Gate 3: a security review against the OWASP LLM Top 10 (2025)
The authoritative pre-launch security gate is the OWASP Top 10 for LLM Applications 2025. Walk the list and confirm each named risk is mitigated before launch: prompt injection, sensitive-information disclosure, supply chain, data and model poisoning, improper output handling, excessive agency, system-prompt leakage, vector and embedding weaknesses, misinformation, and unbounded consumption.
The 2025 list (second edition, released November 2024) is the current, standards-body view of how LLM applications get attacked: LLM01 Prompt Injection, LLM02 Sensitive Information Disclosure, LLM03 Supply Chain, LLM04 Data and Model Poisoning, LLM05 Improper Output Handling, LLM06 Excessive Agency, LLM07 System Prompt Leakage, LLM08 Vector and Embedding Weaknesses, LLM09 Misinformation, and LLM10 Unbounded Consumption.6 Treat it as a checklist within the checklist: for each entry, write down the concrete mitigation and who verified it. The full depth, with mitigations per category, is in our AI security best practices guide.
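One way to keep that "checklist within the checklist" honest is a small register that blocks the go decision until every entry has a written mitigation and a named verifier. The risk IDs and names below come from the list above; the `Mitigation` type, the `unresolved` helper, and the sample entries are hypothetical.

```python
from dataclasses import dataclass

OWASP_LLM_2025 = {
    "LLM01": "Prompt Injection",
    "LLM02": "Sensitive Information Disclosure",
    "LLM03": "Supply Chain",
    "LLM04": "Data and Model Poisoning",
    "LLM05": "Improper Output Handling",
    "LLM06": "Excessive Agency",
    "LLM07": "System Prompt Leakage",
    "LLM08": "Vector and Embedding Weaknesses",
    "LLM09": "Misinformation",
    "LLM10": "Unbounded Consumption",
}


@dataclass(frozen=True)
class Mitigation:
    control: str      # the concrete mitigation, in one sentence
    verified_by: str  # who checked it


def unresolved(register: dict[str, Mitigation]) -> list[str]:
    """Return risk IDs that lack an entry, a control, or a verifier."""
    missing = []
    for risk_id in OWASP_LLM_2025:
        entry = register.get(risk_id)
        if entry is None or not entry.control.strip() or not entry.verified_by.strip():
            missing.append(risk_id)
    return missing


# Hypothetical, partially completed register.
register = {
    "LLM01": Mitigation("Untrusted content is delimited and never given tool access", "security lead"),
    "LLM05": Mitigation("Outputs are schema-validated and HTML-escaped", "backend lead"),
    "LLM06": Mitigation("Refund tool requires human approval", ""),  # no verifier yet
}

open_items = unresolved(register)
print("launch blocked by:", [f"{i} {OWASP_LLM_2025[i]}" for i in open_items])
```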
Gates 4, 5, and 6: monitoring, rollout, and budgets
After launch, watch three things against a baseline you capture at go-live: training-serving skew (production inputs differ from training data), inference drift (input distributions move over time), and output quality. Roll out progressively behind a canary with automated rollback, and load-test against a latency SLO and a per-request cost ceiling before any of it reaches full traffic.
MLOps advocates automation and monitoring at every step. Google Cloud Vertex AI Model Monitoring watches models for training-serving skew and inference drift and alerts when incoming inference data skews too far from the training baseline, routing notifications through email, Cloud Logging, or Cloud Monitoring.7 The catch is that drift detection assumes a baseline: monitoring can only show change relative to a reference, and without a clean baseline captured at launch there is nothing reliable to compare against. Capture the baseline as part of the launch itself.
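As an illustration of why the baseline matters, the sketch below compares a live sample of one numeric feature against a stored launch baseline using the Population Stability Index (PSI). PSI is a swapped-in, commonly used drift statistic chosen here for simplicity; this is not a description of how Vertex AI Model Monitoring computes its scores. The 0.2 alert threshold is a frequently cited rule of thumb and should be treated as an assumption to tune.

```python
import math
from bisect import bisect_right


def psi(baseline: list[float], live: list[float], bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `live` against `baseline`.

    Bins are equal-width over the baseline range; live values outside that
    range fall into the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline has no spread; cannot build bins")
    width = (hi - lo) / bins
    edges = [lo + width * i for i in range(1, bins)]  # interior bin edges

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    expected, actual = shares(baseline), shares(live)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb cutoff

launch_baseline = [i / 100 for i in range(100)]    # captured at go-live
live_window = [0.5 + i / 200 for i in range(100)]  # inputs have shifted upward

score = psi(launch_baseline, live_window)
print(f"PSI = {score:.2f}, alert = {score > ALERT_THRESHOLD}")
```

Without `launch_baseline` stored at go-live, there is no first argument to pass, which is the practical meaning of "drift detection assumes a baseline."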
Canarying is a partial and time-limited deployment of a change and its evaluation: ship to a small slice of production, send it a small share of traffic, compare against the control, and widen only if it stays healthy. Google SRE's lesson is that manual graph-watching is not reliable enough, so automate the canary analysis and the rollback, and "rollback early, rollback often."8 A staged pattern such as 5%, 20%, 50%, then 100% gated on SLOs is common practice rather than a single canonical figure. For budgets, define a latency SLO and a per-request cost ceiling and load-test against them, because AI systems fail softly by getting slow and expensive before they error; OWASP names this failure mode directly as LLM10 Unbounded Consumption.6 The cost-budget mechanics, setting max_tokens, rate-limiting per user, and metering spend, are in our AI cost optimization guide.
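The staged pattern and the budget gate can be sketched together: each stage is promoted only if observed metrics stay inside the SLO and the cost ceiling, and the first breach triggers rollback. The stage percentages follow the common pattern mentioned above. The `Slo` limits, the metric values, and the `observe` callback (which in a real system would shift traffic and read from monitoring) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

STAGES = (5, 20, 50, 100)  # percent of traffic per stage


@dataclass(frozen=True)
class Slo:
    max_error_rate: float
    max_p95_latency_ms: float
    max_cost_per_request_usd: float


@dataclass(frozen=True)
class StageMetrics:
    error_rate: float
    p95_latency_ms: float
    cost_per_request_usd: float


def violations(m: StageMetrics, slo: Slo) -> list[str]:
    """Name every budget the canary exceeded at this stage."""
    found = []
    if m.error_rate > slo.max_error_rate:
        found.append("error_rate")
    if m.p95_latency_ms > slo.max_p95_latency_ms:
        found.append("p95_latency_ms")
    if m.cost_per_request_usd > slo.max_cost_per_request_usd:
        found.append("cost_per_request_usd")
    return found


def run_rollout(observe: Callable[[int], StageMetrics], slo: Slo) -> tuple[str, int, list[str]]:
    """Widen stage by stage; stop and roll back at the first SLO breach."""
    for pct in STAGES:
        broken = violations(observe(pct), slo)
        if broken:
            return "rolled_back", pct, broken
    return "promoted", STAGES[-1], []


SLO = Slo(max_error_rate=0.01, max_p95_latency_ms=1500.0, max_cost_per_request_usd=0.02)


def fake_observe(pct: int) -> StageMetrics:
    """Hypothetical metrics: latency degrades once half the traffic arrives."""
    if pct >= 50:
        return StageMetrics(0.003, 2100.0, 0.012)
    return StageMetrics(0.002, 900.0, 0.011)


result = run_rollout(fake_observe, SLO)
print(result)
```

Putting the cost ceiling in the same gate as latency and errors means a change that is correct but expensive fails the canary just as a slow one does.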
Gates 7, 8, and 9: human-in-the-loop, compliance, and hallucination
The more autonomy and tool access you grant, the larger the blast radius, so high-stakes or irreversible actions need a human confirmation step and least-privilege tooling. Map the deployment to a recognized framework before go-live, check the EU AI Act phased timeline if EU users are in scope, and stand up a path to catch wrong-but-confident output, which is the LLM-specific production failure.
OWASP LLM06 Excessive Agency is the risk of an LLM-based system granted enough autonomy and tool access to take damaging actions; the mitigation is least-privilege tools plus human approval on impactful actions.6 NIST AI 600-1 treats Human-AI Configuration as a named risk area, the points of human interaction that can introduce harm, with suggested management actions.9 Calibrate by blast radius: low-stakes, reversible actions can run autonomously, while irreversible ones wait for a person.
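Calibrating by blast radius can be expressed as a thin dispatch layer in front of an agent's tools. In this sketch an allowlist enforces least privilege and any tool tagged high-risk refuses to run without an approver's yes. The tool names, the `Risk` levels, and the `approver` callback are hypothetical; a real approver would be a review queue or confirmation UI, not a function argument.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Risk(Enum):
    LOW = "low"    # reversible, low stakes: may run autonomously
    HIGH = "high"  # irreversible or high impact: needs a person


@dataclass(frozen=True)
class Tool:
    name: str
    risk: Risk
    run: Callable[[dict], str]


class ApprovalRequired(Exception):
    """Raised when a high-risk action has no human approval."""


def execute(
    tool_name: str,
    args: dict,
    allowed: dict[str, Tool],
    approver: Optional[Callable[[str, dict], bool]] = None,
) -> str:
    tool = allowed.get(tool_name)
    if tool is None:
        # Least privilege: anything not explicitly granted is denied.
        raise PermissionError(f"tool not granted to this agent: {tool_name}")
    if tool.risk is Risk.HIGH and (approver is None or not approver(tool_name, args)):
        raise ApprovalRequired(f"{tool_name} needs human confirmation")
    return tool.run(args)


TOOLS = {
    "lookup_order": Tool("lookup_order", Risk.LOW, lambda a: f"order {a['id']}: shipped"),
    "issue_refund": Tool("issue_refund", Risk.HIGH, lambda a: f"refunded {a['amount']}"),
}

print(execute("lookup_order", {"id": "A1"}, TOOLS))
print(execute("issue_refund", {"amount": 40}, TOOLS, approver=lambda name, args: True))
```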
For governance, map the deployment to a recognized framework before launch. The NIST AI RMF Generative AI Profile (NIST AI 600-1, July 2024) identifies twelve generative-AI risk areas, including Data Privacy, Information Security, Confabulation, and Human-AI Configuration, and provides more than 200 suggested actions organized by RMF function.9 If EU users are in scope, check the EU AI Act's phased timeline: general-purpose-model obligations applied from 2 August 2025, and the bulk of high-risk obligations apply from 2 August 2026.10 Treat these as published, phased dates and check the current phase for your tier and region: the high-risk regime phases in from August 2026 rather than being in force across all of 2026. This is not legal advice.
The ninth gate is observability for hallucination, which NIST names Confabulation. First-party mitigations from Anthropic: explicitly allow the model to say "I do not know," require word-for-word quote extraction before answering over long documents, restrict to provided context, and use best-of-N consistency checks. Anthropic's own caveat belongs on this page: these techniques reduce but do not eliminate hallucinations, and critical information should always be validated.11 Stand up a path to detect, log, and review confidently-wrong outputs before they reach users unchecked, and link it back to the eval suite so each real failure refreshes the test set.
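The quote-extraction technique lends itself to a cheap automated check: if the model is asked to return supporting quotes with its answer, any quote that does not appear verbatim in the source can be logged and routed to review. The sketch below assumes the application already collects an answer plus a list of quotes. The function names and the record shape are hypothetical, and a verbatim match shows only that the quote exists in the source, not that the answer follows from it.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_quotes(quotes: list[str], source: str) -> list[str]:
    """Return quotes that are empty or not found verbatim in the source."""
    haystack = normalize(source)
    return [q for q in quotes if not normalize(q) or normalize(q) not in haystack]


def review_record(answer: str, quotes: list[str], source: str) -> dict:
    """Build a log entry; flag answers with missing or unsupported quotes."""
    missing = unsupported_quotes(quotes, source)
    return {
        "answer": answer,
        "unsupported_quotes": missing,
        "needs_review": bool(missing) or not quotes,
    }


SOURCE = """Refunds are available within 30 days of purchase.
Shipping fees are non-refundable."""

grounded = review_record(
    "You can get a refund within 30 days.",
    ["Refunds are available within 30 days of purchase."],
    SOURCE,
)
confabulated = review_record(
    "Refunds include shipping fees.",
    ["Shipping fees are fully refundable."],
    SOURCE,
)
print(grounded["needs_review"], confabulated["needs_review"])
```

Records flagged `needs_review` are natural candidates for new eval cases, which is one way to connect this path back to the eval suite.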
The limits of a deployment checklist
A checklist is necessary, never sufficient. Passing every gate reduces deployment risk; it does not guarantee business value, and a green dashboard does not mean zero risk. The honest framing is that the checklist gets you safely to production, while whether the use case is worth shipping is a separate question.
- MIT NANDA found that 95% of enterprises see no measurable profit-and-loss return from generative AI pilots, largely because systems do not retain feedback, adapt to context, or improve over time, a product and learning problem a launch checklist does not fix.12
- Guardrails and hallucination controls reduce risk without eliminating it. Anthropic states plainly that its techniques do not eliminate hallucinations and that critical outputs must still be validated.11
- Evals can be gamed or go stale. A pass bar is only as good as the eval set; an unrepresentative or aging set ships regressions with a green light, so treat the suite as a living asset refreshed from real production failures.
- Acceptance thresholds are a trade-off between quality, cost, and latency. A stricter bar (more human-in-the-loop, more validation) costs latency and money; the right threshold is per-use-case, set by stakes and not by a universal number.
- Compliance is jurisdiction- and date-specific. The EU AI Act phases in, and obligations differ by risk tier and by whether a system predates key dates, so check the current phase for your tier and region.10
None of this is an argument against the checklist. It is an argument for reading a fully green checklist as "safe to ship," which is the question it answers, instead of "worth shipping" or "risk-free," which it does not. The broader discipline this page belongs to is laid out in our production-first AI cornerstone.
Sources
1. RAND, The Root Causes of Failure for AI Projects and How They Can Succeed (2024). More than 80% of AI projects fail, about twice the non-AI IT rate.
2. Gartner, At Least 30% of Generative AI Projects Will Be Abandoned After Proof of Concept by End of 2025 (2024).
3. S&P Global Market Intelligence, Generative AI shows rapid growth but yields mixed results (Oct 2025). Abandonment of most AI initiatives rose 17% to 42%; n=1,006.
4. OpenAI, Working with evals and Evaluation best practices (2026).
5. AWS, Detect and filter harmful content with Amazon Bedrock Guardrails and Remove PII with sensitive information filters (2026).
6. OWASP Gen AI Security Project, OWASP Top 10 for LLM Applications 2025; official PDF (v2025).
7. Google Cloud, Monitor feature skew and drift (Vertex AI Model Monitoring) and MLOps: Continuous delivery and automation pipelines in ML (2026).
8. Google SRE, Canarying Releases (SRE Workbook) and Reliable Product Launches; Google Cloud CRE Life Lessons, How release canaries can save your bacon (source of "rollback early, rollback often").
9. NIST, AI RMF: Generative AI Profile (NIST AI 600-1) (2024); AI Risk Management Framework overview.
10. EU AI Act, Implementation Timeline (2026). GPAI obligations from 2 Aug 2025; bulk of high-risk obligations from 2 Aug 2026.
11. Anthropic, Reduce hallucinations (2026). Techniques reduce but do not eliminate hallucinations; critical outputs must be validated.
12. MIT NANDA, The GenAI Divide: State of AI in Business 2025. 95% of enterprises see no measurable P&L return from GenAI pilots.
