AI cost optimization: the levers that actually move the bill
An LLM bill is a token meter with a handful of dials behind it. AI cost optimization is the discipline of finding which dials your workload is actually paying for, then turning them down without dropping below your quality bar. This guide walks the levers in order of effort-to-payoff, the five numbers to instrument, and where to stop.

The short version
- An LLM bill reduces to one formula: cost-per-request = (input tokens x input rate) + (output tokens x output rate), summed over every call a single user action triggers, including retries and agent sub-calls. Most teams under-count the sub-calls.
- Output is the expensive half. On Claude Sonnet 4.6 output runs 5x the input rate ($15 vs $3 per million tokens), so capping and compressing generations is one of the highest-payoff moves.
- Prompt caching is the cheapest win: Anthropic bills cache reads at 0.1x base input, a 90% discount on the reused portion, and it stacks with the 50% Batch discount.
- Routing simple queries to a small model first is the biggest structural lever. A tuned cascade matched GPT-4 quality at up to 98% lower cost in the FrugalGPT study; production routing commonly lands 50% to 70%.
- Per-token cost is falling roughly 10x a year (a16z), yet total spend keeps rising because usage grows faster. Optimize to a quality floor, then minimize cost beneath it, and do not architect around today’s prices.
Where AI cost optimization starts: the five drivers of an LLM bill
AI cost optimization starts with one fact: an LLM bill is a token meter. Five drivers move it. Input tokens are everything you send (system prompt, tool schemas, retrieved context, history, the user message). Output tokens are what the model generates, and they are the expensive half. Model tier swings the rate roughly fivefold within the same vendor. Request volume scales the whole thing linearly. And context size plus hidden multipliers (retries, agent turns, reasoning tokens, tool-use overhead) quietly inflate the first two on every call.
The asymmetry between input and output is the single most useful thing to internalize. Output is generated one token at a time, so providers price it well above input. On Anthropic, Claude Sonnet 4.6 is $3 per million input tokens against $15 output, a 5:1 ratio, and Opus 4.8 is $5 in against $25 out [1]. Secondary syntheses put the gap at 4x to 8x across providers [2]. Model tier is the other big swing: on the same vendor, Haiku 4.5 ($1 in / $5 out) sits roughly 5x below the Opus frontier tier on both halves [1].
The drivers that catch teams out are the hidden multipliers. Failed calls, re-prompts and multi-turn agent loops each multiply token spend, and reasoning ("thinking") tokens bill as output. Tool use adds its own tax: the tools parameter plus the auto-injected tool system prompt add roughly 290 to 810 input tokens to every call on Claude, with individual tools costing more on top, and server-side web search billing at $10 per 1,000 searches [1]. The practical formula to carry into every design review is cost-per-request = (input tokens x input rate) + (output tokens x output rate), summed over every call one user action triggers. The sub-calls are where the surprise lives.
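The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a billing implementation: the `Call` type and the token counts are hypothetical, and the rates are the Sonnet 4.6 figures quoted in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One model call triggered by a user action (first attempt, retry, or agent sub-call)."""
    input_tokens: int
    output_tokens: int
    input_rate: float   # dollars per million input tokens
    output_rate: float  # dollars per million output tokens


def call_cost(call: Call) -> float:
    """(input tokens x input rate) + (output tokens x output rate), in dollars."""
    return (call.input_tokens * call.input_rate
            + call.output_tokens * call.output_rate) / 1_000_000


def request_cost(calls: list[Call]) -> float:
    """Sum over every call one user action triggers, sub-calls included."""
    return sum(call_cost(c) for c in calls)


# Hypothetical user action: one main call plus two agent sub-calls.
SONNET = {"input_rate": 3.0, "output_rate": 15.0}  # $/M tokens, from the pricing above
calls = [
    Call(input_tokens=2_000, output_tokens=500, **SONNET),
    Call(input_tokens=1_500, output_tokens=200, **SONNET),
    Call(input_tokens=1_500, output_tokens=200, **SONNET),
]

print(f"main call only:  ${call_cost(calls[0]):.4f}")
print(f"whole request:   ${request_cost(calls):.4f}")
```

In this made-up case the two sub-calls slightly more than double the cost of the main call, which is the under-counting the formula is meant to expose.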
The optimization levers, ranked by payoff
The levers, roughly ordered by effort-to-payoff: prompt caching for repeated context, model routing to send each query to the cheapest model that can handle it, RAG to shrink context instead of stuffing whole documents, output-length control, batching for latency-tolerant work, semantic caching for near-duplicate queries, and distillation or fine-tuning for high-volume steady-state traffic. Each one moves a specific number on the bill, so pick by which driver dominates your workload.
Start with prompt caching because it is the cheapest reversible win. Reusing an already-processed prefix (a stable system prompt, a big document, RAG context) avoids reprocessing it every call. Anthropic bills cache reads at 0.1x base input, a 90% discount on the cached portion, breaking even after a single read on the 5-minute cache, and it stacks with the Batch discount so a cached batch request can land near 5% of standard cost [1]. OpenAI's caching is automatic above 1,024 tokens and launched at a 50% cached-input discount [3]. Real-workload reports range widely: one team cut total LLM cost 59% with caching, and a practitioner went from $720 to $72 a month [4]. Treat those as anecdotes that bracket the upside, never as benchmarks.
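The break-even claim is easy to check with arithmetic. The sketch below assumes a 1.25x cache-write premium on the 5-minute cache, which is not stated in this guide and should be confirmed against the current price sheet; the 0.1x read rate and 50% Batch discount are the figures quoted above. It also assumes every call lands inside the cache lifetime.

```python
BASE_INPUT_RATE = 3.0    # $/M input tokens, Sonnet 4.6 rate quoted in this guide
CACHE_READ_MULT = 0.10   # from this guide: cache reads bill at 0.1x base input
CACHE_WRITE_MULT = 1.25  # ASSUMPTION: 5-minute cache-write premium, verify before relying on it
BATCH_MULT = 0.50        # from this guide: flat 50% Batch discount


def prefix_cost(prefix_tokens: int, calls: int, cached: bool) -> float:
    """Dollar cost of sending the same prefix on `calls` requests.

    Assumes all calls fall within the cache lifetime, so only the first one pays the write.
    """
    if calls <= 0:
        return 0.0
    one_pass = prefix_tokens / 1_000_000 * BASE_INPUT_RATE
    if not cached:
        return one_pass * calls
    return one_pass * (CACHE_WRITE_MULT + CACHE_READ_MULT * (calls - 1))


for n in (1, 2, 100):
    plain = prefix_cost(10_000, n, cached=False)
    cached = prefix_cost(10_000, n, cached=True)
    print(f"{n:>3} calls: uncached ${plain:.4f}  cached ${cached:.4f}")

# Stacking: a cache read inside a batch request bills at 0.1 x 0.5 of the base input rate.
print(f"cached + batch multiplier: {CACHE_READ_MULT * BATCH_MULT:.2f}")
```

Under these assumptions a single call is slightly more expensive with caching (it pays the write premium), and the second call already puts caching ahead, which matches the "breaks even after a single read" claim.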
The biggest structural lever is model routing. Send each query to the cheapest model that can clear the bar and escalate only on low confidence. The peer-reviewed FrugalGPT cascade matched GPT-4 quality at up to 98% lower cost, or beat its accuracy by 4 points at equal cost [5]. That 98% is a research ceiling on a tuned threshold; production routing more commonly lands 50% to 70%, because an estimated 60% to 80% of queries are simple enough for a small model [6]. Cutting context with RAG is the next lever: a clinical-NLP study fed GPT-4o full text at about 172M tokens ($430) against chunk-based retrieval at about 13.2M tokens (roughly $33), a 90%-plus cut at comparable quality [7]. Leaner context can also raise quality, since models weaken past about half their context window. This routing, caching and retrieval layer is the work our AI deployment team builds into production.
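A cascade of the kind described above can be sketched as a loop over tiers, cheapest first. This is an illustration of the general pattern rather than FrugalGPT's method: FrugalGPT trains a separate scorer, whereas here each model is a hypothetical callable that returns its own confidence, and `small_model` and `large_model` are stubs standing in for real API calls.

```python
from typing import Callable, Sequence

# Hypothetical interface: a model takes a query and returns (answer, confidence in [0, 1]).
Model = Callable[[str], tuple[str, float]]


def cascade(query: str, tiers: Sequence[tuple[str, Model]], threshold: float) -> tuple[str, str]:
    """Try tiers cheapest-first and return (tier_name, answer) from the first confident one.

    The last tier is the fallback: its answer is accepted whatever its confidence.
    """
    if not tiers:
        raise ValueError("cascade needs at least one tier")
    for name, model in tiers[:-1]:
        answer, confidence = model(query)
        if confidence >= threshold:
            return name, answer
    name, model = tiers[-1]
    answer, _ = model(query)
    return name, answer


# Stub models for illustration only: the small one is "confident" on short queries.
def small_model(query: str) -> tuple[str, float]:
    confidence = 0.95 if len(query.split()) <= 5 else 0.40
    return f"small: {query}", confidence


def large_model(query: str) -> tuple[str, float]:
    return f"large: {query}", 0.99


tiers = [("small", small_model), ("large", large_model)]
print(cascade("reset my password", tiers, threshold=0.8))
print(cascade("compare these two contracts clause by clause please", tiers, threshold=0.8))
```

Note that an escalated query pays for both models, so the saving depends on how often the cheap tier is confident and right.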
| Lever | What it cuts | Headline saving |
|---|---|---|
| Prompt caching | Repeated input | ~90% on cached tokens (Anthropic) |
| Model routing | Blended model rate | up to 98% (FrugalGPT ceiling) |
| RAG vs context-stuffing | Input tokens | ~90% (clinical study) |
| Semantic caching | Duplicate calls | ~69% fewer calls (arXiv) |
| Batch API | Input + output | 50% flat (Anthropic, OpenAI) |
The remaining levers fill in the rest of the table. Output-length control attacks the expensive half directly: set max_tokens, request structured output, ask for the answer instead of the essay, and strip reasoning where it adds nothing. One illustrative trim of a 500-token template to a 50-token one saved about $0.0045 a query, which is $4,500 per million queries [8]. Batching gives a flat 50% on both input and output for async jobs under 24 hours on Anthropic and OpenAI [1]. Semantic caching embeds the query and returns a stored answer on a vector-similarity hit: the peer-reviewed GPT Semantic Cache removed up to 68.8% of API calls at over 97% positive-hit accuracy with a starting threshold near 0.8 [9]. Distillation and fine-tuning move steady-state traffic to a smaller model, reported at 5x to 30x cheaper inference, and a fine-tuned model needs no long few-shot prompt, which shortens input on every call [10]. Rate limits and budget guards are the always-on safety net: they stop a runaway loop from burning the month, though the saving depends entirely on the incident prevented.
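The semantic-caching idea can be sketched with a linear scan and cosine similarity. This is a toy under stated assumptions, not the GPT Semantic Cache implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and a production cache would use a vector index instead of scanning every entry. The 0.8 threshold is the starting point mentioned above.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Return a stored answer when a new query is similar enough to a cached one."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.8):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        query_vec = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


# Hypothetical embedding for illustration: word counts over a tiny fixed vocabulary.
VOCAB = ["reset", "password", "refund", "order", "how", "my"]


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


cache = SemanticCache(toy_embed, threshold=0.8)
cache.put("how do I reset my password", "Use the reset link on the sign-in page.")
print(cache.get("reset my password"))  # similar enough: served from cache, no model call
print(cache.get("refund my order"))    # too different: miss, so call the model
```

The threshold is the quality dial: lower it and more calls are avoided, but the risk of returning an answer to a different question goes up.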
The build: instrument five numbers, then engineer toward a budget
Measure before you optimize, because you cannot cut what you do not meter. Providers expose a usage object (input, output, cache-read, cache-creation, server-tool-use) on every response, so instrument token usage per request, per feature and per user, and establish a cost-per-request baseline first. Then set a cost budget per request and engineer toward it.
Reduce every decision to five numbers and watch them on a dashboard: input tokens per request, output tokens per request, calls per request (retries plus agent turns), cache-hit rate, and blended dollars per request. Those five fully determine spend, and each lever moves one of them. Caching, RAG and fine-tuning pull number one. Output control pulls number two. Rate limiting and agent design pull number three. Prompt and semantic caching pull number four. Routing and smaller models pull number five. Optimize the number, not the vibe.
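One way to compute the five numbers from call-level logs is sketched below. The `CallLog` schema is hypothetical, not a provider's usage object: here `input_tokens` is assumed to include tokens served from cache, and cache-hit rate is defined as the token-weighted share of input read from cache, which is only one of several reasonable definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallLog:
    """One logged model call. Hypothetical schema for illustration."""
    request_id: str         # the user action this call belongs to
    input_tokens: int       # total input tokens, including any served from cache
    output_tokens: int
    cache_read_tokens: int  # portion of input_tokens read from the prompt cache
    cost_usd: float


def five_numbers(logs: list[CallLog]) -> dict[str, float]:
    """Aggregate call logs into the five per-request dashboard numbers."""
    request_count = len({log.request_id for log in logs})
    if request_count == 0:
        raise ValueError("no logs to aggregate")
    total_input = sum(log.input_tokens for log in logs)
    total_output = sum(log.output_tokens for log in logs)
    total_cached = sum(log.cache_read_tokens for log in logs)
    total_cost = sum(log.cost_usd for log in logs)
    return {
        "input_tokens_per_request": total_input / request_count,
        "output_tokens_per_request": total_output / request_count,
        "calls_per_request": len(logs) / request_count,
        "cache_hit_rate": total_cached / total_input if total_input else 0.0,
        "dollars_per_request": total_cost / request_count,
    }


# Made-up logs: request "a" needed a retry, request "b" did not.
logs = [
    CallLog("a", input_tokens=1_000, output_tokens=200, cache_read_tokens=800, cost_usd=0.004),
    CallLog("a", input_tokens=500, output_tokens=100, cache_read_tokens=0, cost_usd=0.003),
    CallLog("b", input_tokens=1_500, output_tokens=300, cache_read_tokens=1_200, cost_usd=0.005),
]
for name, value in five_numbers(logs).items():
    print(f"{name}: {value:.4f}")
```

Grouping the same logs by feature or user before aggregating gives the per-feature and per-user views.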
A budget per request makes the work concrete: decide what a single user action is allowed to cost, then build toward it. Anthropic's own worked example runs about 10,000 support tickets at roughly 3,700 tokens each on Haiku 4.5 for about $37 total, near $0.0037 a ticket, which is a usable anchor for a support-reply budget [1]. Sequence the levers by effort: cheap reversible wins first (prompt caching, output caps), then routing and RAG, then distillation and fine-tuning, with rate limits running underneath the whole time. Scoping that budget and the cost model behind it is exactly the conversation our AI consulting engagements open with.
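The support-ticket anchor and a budget check can be written down directly. This is a small sketch: the observed costs are made up, and `budget_report` is a hypothetical helper, not part of any SDK.

```python
# Anchor from the worked example above: about $37 for about 10,000 tickets.
TOTAL_USD = 37.0
TICKETS = 10_000
budget_per_ticket = TOTAL_USD / TICKETS
print(f"budget per ticket: ${budget_per_ticket:.4f}")


def budget_report(costs: list[float], budget: float) -> dict[str, float]:
    """Compare observed per-request costs against a per-request budget."""
    if not costs:
        raise ValueError("no costs to report on")
    over = [cost for cost in costs if cost > budget]
    return {
        "mean_cost": sum(costs) / len(costs),
        "max_cost": max(costs),
        "share_over_budget": len(over) / len(costs),
    }


# Made-up per-ticket costs from a day of traffic.
observed = [0.0030, 0.0035, 0.0040, 0.0090]
print(budget_report(observed, budget_per_ticket))
```

The share over budget and the maximum are usually more telling than the mean, because retries and long agent loops show up in the tail first.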
Where to stop: cost is one axis, not the only one
The right model is the smallest one that still passes your eval, not the cheapest on the price sheet. Cost trades against accuracy, latency and reliability, so optimize to a quality floor and then minimize cost beneath it. Squeezing the last 5% of cost often spends more in eval, maintenance and incident recovery than it saves.
| Lever | What it trades | Guardrail |
|---|---|---|
| Routing / cascade | Tail accuracy for price | Hold a validation set; a bad confidence threshold ships wrong answers cheaply |
| RAG / context trimming | Recall for fewer input tokens | Over-aggressive retrieval drops the chunk the answer needed; tune recall |
| Smaller / distilled model | Headroom for per-call rate | Keep an eval that proves it still clears the bar after a base-model upgrade |
| Output capping | Completeness for output tokens | Cap to the task, then watch for truncated or degraded answers |
Two guardrails are worth stating plainly. Cascades and routing trade tail accuracy for price, and FrugalGPT's headline number assumes a confidence threshold tuned on a validation set, so a careless threshold ships wrong answers cheaply [5]. Shrinking context can raise quality because models weaken past roughly half their window, but over-aggressive retrieval drops the chunk the answer needed [7]. The honest framing is to set the quality bar first and treat cost as the thing you trim underneath it. Over-optimization is its own failure mode.
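Tuning the threshold on a validation set can be sketched as a small search: for each candidate threshold, replay the validation examples through a two-tier cascade, then keep the cheapest threshold whose accuracy still meets the quality floor. The `Example` type, the relative costs and the data are all hypothetical, and this is a simplified stand-in for the tuning FrugalGPT describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Example:
    """One labelled validation query. Hypothetical schema for illustration."""
    confidence: float    # small model's confidence on this query
    small_correct: bool  # was the small model's answer right?
    large_correct: bool  # was the large model's answer right?


def evaluate(examples: list[Example], threshold: float,
             small_cost: float, large_cost: float) -> tuple[float, float]:
    """Return (accuracy, mean cost per query) for a two-tier cascade at this threshold."""
    if not examples:
        raise ValueError("validation set is empty")
    correct, cost = 0, 0.0
    for ex in examples:
        cost += small_cost  # the small model always runs first
        if ex.confidence >= threshold:
            correct += int(ex.small_correct)
        else:
            cost += large_cost  # escalation pays for both models
            correct += int(ex.large_correct)
    return correct / len(examples), cost / len(examples)


def cheapest_threshold(examples: list[Example], candidates: list[float], quality_floor: float,
                       small_cost: float, large_cost: float) -> Optional[tuple[float, float, float]]:
    """Return (threshold, mean_cost, accuracy) for the cheapest candidate meeting the floor."""
    best = None
    for threshold in candidates:
        accuracy, cost = evaluate(examples, threshold, small_cost, large_cost)
        if accuracy >= quality_floor and (best is None or cost < best[1]):
            best = (threshold, cost, accuracy)
    return best


# Made-up validation set; costs are in relative units (large model = 10x small).
validation = [
    Example(0.95, True, True),
    Example(0.90, True, True),
    Example(0.70, False, True),  # confidently wrong: the case a careless threshold ships
    Example(0.60, True, True),
    Example(0.30, False, True),
]
candidates = [0.5, 0.65, 0.8, 1.0]
print(cheapest_threshold(validation, candidates, quality_floor=0.9, small_cost=1, large_cost=10))
print(cheapest_threshold(validation, candidates, quality_floor=0.8, small_cost=1, large_cost=10))
```

Setting the floor first and searching for cost beneath it is the same ordering the section argues for: quality is the constraint, cost is the objective.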
The falling-cost trend, and why optimization still matters
Per-token cost for an LLM of equivalent performance is falling about 10x a year (the trend a16z calls LLMflation), yet total inference spend keeps rising because usage grows faster than price drops. So do not architect around today's prices, and do not assume cheaper tokens mean a cheaper bill. Optimization stays a permanent discipline precisely because falling prices invite more usage.
The numbers are stark. A GPT-3-quality model (MMLU 42) cost $60 per million tokens in November 2021; by late 2024 the cheapest equivalent, Llama 3.2 3B, cost $0.06, a roughly 1,000x drop in three years [11]. The higher tier that has existed only since GPT-4 fell about 62x over the shorter span since that launch. a16z frames this as faster than PC-era compute or dotcom-era bandwidth declines, where each 10x unlocks use cases that were previously uneconomic [11]. The paradox is that aggregate spend still climbs, because volume, agentic multi-call patterns and reasoning tokens push it up faster than the per-token price falls.
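The arithmetic behind both halves of that paragraph fits in a few lines. The prices are the ones quoted above; the 25x volume growth is a made-up figure chosen only to illustrate the paradox.

```python
def annual_decline_factor(start_price: float, end_price: float, years: float) -> float:
    """Constant yearly factor that takes start_price down to end_price over `years`."""
    return (start_price / end_price) ** (1 / years)


def spend_multiplier(price_decline: float, volume_growth: float) -> float:
    """Change in total spend when price falls by one factor and volume grows by another."""
    return volume_growth / price_decline


# $60 to $0.06 per million tokens over three years, as quoted above.
factor = annual_decline_factor(60.0, 0.06, 3)
print(f"implied yearly price decline: about {factor:.1f}x")

# Hypothetical: price falls 10x in a year while token volume grows 25x.
print(f"spend multiplier: {spend_multiplier(10, 25):.1f}x")
```

A 1,000x drop over three years works out to roughly 10x a year, and any volume growth above that factor means the bill rises even as the unit price collapses.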
Two consequences follow for how you build. Do not over-engineer around current prices, because the use case you cannot afford this year is plausibly affordable next year; build the capability and treat unit cost as a moving target downward. But cheaper per token does not mean a cheaper bill, since LLMflation invites exactly the volume that pushes spend up. That tension is the bridge to how you price an AI feature when its cost of goods is a falling-but-volatile token bill, covered in our SaaS AI pricing guide.
AI cost optimization questions
What is AI cost optimization?
Why are my LLM costs so high?
How much can prompt caching actually save?
Does using a smaller model hurt quality?
Will AI just get cheap enough that optimization stops mattering?
Sources
1. Anthropic, Claude API Pricing (2026). Per-token rates, cache-read 0.1x discount, Batch 50%, tool-use overhead and the support-ticket worked example.
2. CodeAnt, Input vs Output vs Reasoning Tokens (2026). Secondary synthesis of the 4x to 8x output-to-input ratio across providers.
3. OpenAI, Prompt Caching in the API (2024). Automatic above 1,024 tokens; documented 50% launch discount on cached input.
4. ProjectDiscovery, How we cut LLM cost with prompt caching (59%); and Du’An Lightfoot, $720 to $72 a month. Practitioner anecdotes that bracket the upside, never benchmarks.
5. Chen, Zaharia and Zou, FrugalGPT, arXiv:2305.05176 (2023). Up to 98% cost reduction at matched quality with a tuned LLM cascade.
6. Pristren, Model Routing Guide (2026); and DigitalApplied, LLM Model Routing 2026. Typical 50% to 70% in production. Vendor blogs, directional.
7. Less Context, Same Performance: a RAG framework, arXiv:2505.20320 (2025). Full-text 172M tokens ($430) vs chunk-based RAG 13.2M tokens (~$33).
8. Dynamic Template Selection for Output Token Reduction, arXiv:2511.20683 (2025). Illustrative output-trim arithmetic of about $4,500 per million queries.
9. Regmi and Pun, GPT Semantic Cache, arXiv:2411.05276 (2024). Up to 68.8% fewer API calls at over 97% positive-hit accuracy.
10. TensorZero, Distillation for 5x to 30x cheaper inference; and Humanloop, Model distillation. Vendor figures, directional.
11. Guido Appenzeller / a16z, Welcome to LLMflation (2024). Roughly 10x per year; $60 to $0.06 per million tokens for a GPT-3-class model in three years.
