AI cost optimization: the levers that actually move the bill

An LLM bill is a token meter with a handful of dials behind it. AI cost optimization is the discipline of finding which dials your workload is actually paying for, then turning them down without dropping below your quality bar. This guide walks the levers in order of effort-to-payoff, the five numbers to instrument, and where to stop.

By Kanika Mathur, Head of Service Delivery

Reviewed by Resourcifi engineeringPublished Feb 20, 2026Updated Feb 20, 202611 min read

Cost

Key takeaways

The short version

An LLM bill reduces to one formula: cost-per-request = (input tokens x input rate) + (output tokens x output rate), summed over every call a single user action triggers, including retries and agent sub-calls. Most teams under-count the sub-calls.
Output is the expensive half. On Claude Sonnet 4.6 output runs 5x the input rate ($15 vs $3 per million tokens), so capping and compressing generations is one of the highest-payoff moves.
Prompt caching is the cheapest win: Anthropic bills cache reads at 0.1x base input, a 90% discount on the reused portion, and it stacks with the 50% Batch discount.
Routing simple queries to a small model first is the biggest structural lever. A tuned cascade matched GPT-4 quality at up to 98% lower cost in the FrugalGPT study; production routing commonly lands 50% to 70%.
Per-token cost is falling roughly 10x a year (a16z), yet total spend keeps rising because usage grows faster. Optimize to a quality floor, then minimize cost beneath it, and do not architect around today’s prices.

Where AI cost optimization starts: the five drivers of an LLM bill

AI cost optimization starts with one fact: an LLM bill is a token meter. Five drivers move it. Input tokens are everything you send (system prompt, tool schemas, retrieved context, history, the user message). Output tokens are what the model generates, and they are the expensive half. Model tier swings the rate by an order of magnitude for the same vendor. Request volume scales the whole thing linearly. And context size plus hidden multipliers (retries, agent turns, reasoning tokens, tool-use overhead) quietly inflate the first two on every call.

The asymmetry between input and output is the single most useful thing to internalize. Output is generated one token at a time, so providers price it well above input. On Anthropic, Claude Sonnet 4.6 is $3 per million input tokens against $15 output, a 5:1 ratio, and Opus 4.8 is $5 in against $25 out.¹ Secondary syntheses put the gap at 4x to 8x across providers.² Model tier is the other big swing: on the same vendor, Haiku 4.5 ($1 in / $5 out) sits roughly 5x below the Opus frontier tier on both halves.¹

The drivers that catch teams out are the hidden multipliers. Failed calls, re-prompts and multi-turn agent loops each multiply token spend, and reasoning ("thinking") tokens bill as output. Tool use adds its own tax: the tools parameter plus the auto-injected tool system prompt add roughly 290 to 810 input tokens to every call on Claude, with individual tools costing more on top, and server-side web search billing at $10 per 1,000 searches.¹ The practical formula to carry into every design review is cost-per-request = (input tokens x input rate) + (output tokens x output rate), summed over every call one user action triggers. The sub-calls are where the surprise lives.

The optimization levers, ranked by payoff

The levers, roughly ordered by effort-to-payoff: prompt caching for repeated context, model routing to send each query to the cheapest model that can handle it, RAG to shrink context instead of stuffing whole documents, output-length control, batching for latency-tolerant work, semantic caching for near-duplicate queries, and distillation or fine-tuning for high-volume steady-state traffic. Each one moves a specific number on the bill, so pick by which driver dominates your workload.

Start with prompt caching because it is the cheapest reversible win. Reusing an already-processed prefix (a stable system prompt, a big document, RAG context) avoids reprocessing it every call. Anthropic bills cache reads at 0.1x base input, a 90% discount on the cached portion, breaking even after a single read on the 5-minute cache, and it stacks with the Batch discount so a cached batch request can land near 5% of standard cost.¹ OpenAI's caching is automatic above 1,024 tokens and launched at a 50% cached-input discount.³ Real-workload reports range widely: one team cut total LLM cost 59% with caching, and a practitioner went from $720 to $72 a month.⁴ Treat those as anecdotes that bracket the upside, never as benchmarks.

The biggest structural lever is model routing. Send each query to the cheapest model that can clear the bar and escalate only on low confidence. The peer-reviewed FrugalGPT cascade matched GPT-4 quality at up to 98% lower cost, or beat its accuracy by 4 points at equal cost.⁵ That 98% is a research ceiling on a tuned threshold; production routing more commonly lands 50% to 70%, because an estimated 60% to 80% of queries are simple enough for a small model.⁶ Cutting context with RAG is the next lever: a clinical-NLP study fed GPT-4o full text at about 172M tokens ($430) against chunk-based retrieval at about 13.2M tokens (roughly $33), a 90%-plus cut at comparable quality.⁷ Leaner context can also raise quality, since models weaken past about half their context window. This routing, caching and retrieval layer is the work our AI deployment team builds into production.

What each lever cuts, with the figure behind it

Headline savings by lever. The verified figures come from provider pricing and peer-reviewed studies; the routing 98% is a tuned research ceiling, so read it as a best case rather than a default.

Data behind this chart
Lever	What it cuts	Headline saving
Prompt caching	Repeated input	~90% on cached tokens (Anthropic)
Model routing	Blended model rate	up to 98% (FrugalGPT ceiling)
RAG vs context-stuffing	Input tokens	~90% (clinical study)
Semantic caching	Duplicate calls	~69% fewer calls (arXiv)
Batch API	Input + output	50% flat (Anthropic, OpenAI)

Sources: Anthropic pricing docs (2026); Chen, Zaharia and Zou, FrugalGPT (2023); arXiv:2505.20320 (2025); Regmi and Pun, GPT Semantic Cache (2024); OpenAI pricing (2026). Figures come from different workloads, so treat them as directional.

The remaining levers fill in the rest of the table. Output-length control attacks the expensive half directly: set max_tokens, request structured output, ask for the answer instead of the essay, and strip reasoning where it adds nothing. One illustrative trim of a 500-token template to a 50-token one saved about $0.0045 a query, which is $4,500 per million queries.⁸ Batching gives a flat 50% on both input and output for async jobs under 24 hours on Anthropic and OpenAI.¹ Semantic caching embeds the query and returns a stored answer on a vector-similarity hit: the peer-reviewed GPT Semantic Cache removed up to 68.8% of API calls at over 97% positive-hit accuracy with a starting threshold near 0.8.⁹ Distillation and fine-tuning move steady-state traffic to a smaller model, reported at 5x to 30x cheaper inference, and a fine-tuned model needs no long few-shot prompt, which shortens input on every call.¹⁰ Rate limits and budget guards are the always-on safety net: they stop a runaway loop from burning the month, though the saving depends entirely on the incident prevented.

The build: instrument five numbers, then engineer toward a budget

Measure before you optimize, because you cannot cut what you do not meter. Providers expose a usage object (input, output, cache-read, cache-creation, server-tool-use) on every response, so instrument token usage per request, per feature and per user, and establish a cost-per-request baseline first. Then set a cost budget per request and engineer toward it.

Reduce every decision to five numbers and watch them on a dashboard: input tokens per request, output tokens per request, calls per request (retries plus agent turns), cache-hit rate, and blended dollars per request. Those five fully determine spend, and each lever moves one of them. Caching, RAG and fine-tuning pull number one. Output control pulls number two. Rate limiting and agent design pull number three. Prompt and semantic caching pull number four. Routing and smaller models pull number five. Optimize the number, not the vibe.

A budget per request makes the work concrete: decide what a single user action is allowed to cost, then build toward it. Anthropic's own worked example runs about 10,000 support tickets at roughly 3,700 tokens each on Haiku 4.5 for about $37 total, near $0.0037 a ticket, which is a usable anchor for a support-reply budget.¹ Sequence the levers by effort: cheap reversible wins first (prompt caching, output caps), then routing and RAG, then distillation and fine-tuning, with rate limits running underneath the whole time. Scoping that budget and the cost model behind it is exactly the conversation our AI consulting engagements open with.

Where to stop: cost is one axis, not the only one

The right model is the smallest one that still passes your eval, not the cheapest on the price sheet. Cost trades against accuracy, latency and reliability, so optimize to a quality floor and then minimize cost beneath it. Squeezing the last 5% of cost often spends more in eval, maintenance and incident recovery than it saves.

The quality tradeoff each lever carries
Lever	What it trades	Guardrail
Routing / cascade	Tail accuracy for price	Hold a validation set; a bad confidence threshold ships wrong answers cheaply
RAG / context trimming	Recall for fewer input tokens	Over-aggressive retrieval drops the chunk the answer needed; tune recall
Smaller / distilled model	Headroom for per-call rate	Keep an eval that proves it still clears the bar after a base-model upgrade
Output capping	Completeness for output tokens	Cap to the task, then watch for truncated or degraded answers

Two guardrails are worth stating plainly. Cascades and routing trade tail accuracy for price, and FrugalGPT's headline number assumes a confidence threshold tuned on a validation set, so a careless threshold ships wrong answers cheaply.⁵ Shrinking context can raise quality because models weaken past roughly half their window, but over-aggressive retrieval drops the chunk the answer needed.⁷ The honest framing is to set the quality bar first and treat cost as the thing you trim underneath it. Over-optimization is its own failure mode.

The falling-cost trend, and why optimization still matters

Per-token cost for an LLM of equivalent performance is falling about 10x a year, a16z's LLMflation, yet total inference spend keeps rising because usage grows faster than price drops. So do not architect around today's prices, and do not assume cheaper tokens mean a cheaper bill. Optimization stays a permanent discipline precisely because falling prices invite more usage.

The numbers are stark. A GPT-3-quality model (MMLU 42) cost $60 per million tokens in November 2021; by late 2024 the cheapest equivalent, Llama 3.2 3B, cost $0.06, a roughly 1,000x drop in three years.¹¹ The higher tier that has existed since GPT-4 fell about 62x over the same window. a16z frames this as faster than PC-era compute or dotcom-era bandwidth declines, where each 10x unlocks use cases that were previously uneconomic.¹¹ The paradox is that aggregate spend still climbs, because volume, agentic multi-call patterns and reasoning tokens push it up faster than the per-token price falls.

Two consequences follow for how you build. Do not over-engineer around current prices, because the use case you cannot afford this year is plausibly affordable next year; build the capability and treat unit cost as a moving target downward. But cheaper per token does not mean a cheaper bill, since LLMflation invites exactly the volume that pushes spend up. That tension is the bridge to how you price an AI feature when its cost of goods is a falling-but-volatile token bill, covered in our SaaS AI pricing guide.

Frequently asked

AI cost optimization questions

What is AI cost optimization?

It is the practice of reducing the token and inference spend of an AI system, through caching, model routing, output control, RAG, batching and fine-tuning, while holding quality and latency above a defined bar. It starts with measuring cost-per-request before you ever pick a cheaper model. The discipline is to fix a quality bar first and only then chase the cheapest way to hold it.

Why are my LLM costs so high?

Usually one of a few causes: paying frontier-tier rates for queries a smaller model could handle, reprocessing the same long context every call with no caching, stuffing whole documents instead of retrieving relevant chunks, and verbose uncapped outputs (output runs about 4x to 5x input cost). Hidden multipliers from retries, agent turns and reasoning tokens are the most commonly under-counted cause.

How much can prompt caching actually save?

Anthropic bills cache reads at 0.1x base input, a 90% discount on the reused portion, breaking even after a single read on the short cache. Real-workload reports range from about 59% to about 90% total bill reduction, though those are vendor and practitioner anecdotes. Actual savings depend on how much of your prompt is reused and your cache-hit rate.

Does using a smaller model hurt quality?

Not if you route by difficulty. Research shows a small-model-first cascade can match GPT-4 quality at up to 98% lower cost (FrugalGPT) because most production queries are simple. The rule is to use the cheapest model that clears your eval bar and escalate the rest, while holding a validation set so a bad confidence threshold does not quietly ship wrong answers.

Will AI just get cheap enough that optimization stops mattering?

Per-token cost is falling about 10x a year per a16z, but total spend is rising because usage and agentic call-counts grow faster than prices drop. Cheaper tokens invite more tokens, so optimization stays a permanent engineering discipline. The practical advice is to build the capability without over-engineering around current prices, since unit cost is a moving target downward.

Kanika Mathur

Head of Service Delivery, Resourcifi

Kanika Mathur is Head of Service Delivery at Resourcifi, where her pods instrument token usage and tune the caching, routing and retrieval layers that decide what an AI feature costs to run in production. She spends most of her review time arguing teams out of chasing the last 5% of token cost and into measuring cost-per-request before they touch a single model setting.

Resourcifi on LinkedIn →

Sources

Anthropic, Claude API Pricing (2026). Per-token rates, cache-read 0.1x discount, Batch 50%, tool-use overhead and the support-ticket worked example.
CodeAnt, Input vs Output vs Reasoning Tokens (2026). Secondary synthesis of the 4x to 8x output-to-input ratio across providers.
OpenAI, Prompt Caching in the API (2024). Automatic above 1,024 tokens; documented 50% launch discount on cached input.
ProjectDiscovery, How we cut LLM cost with prompt caching (59%); and Du’An Lightfoot, $720 to $72 a month. Practitioner anecdotes that bracket the upside, never benchmarks.
Chen, Zaharia and Zou, FrugalGPT, arXiv:2305.05176 (2023). Up to 98% cost reduction at matched quality with a tuned LLM cascade.
Pristren, Model Routing Guide (2026); and DigitalApplied, LLM Model Routing 2026. Typical 50% to 70% in production. Vendor blogs, directional.
Less Context, Same Performance: a RAG framework, arXiv:2505.20320 (2025). Full-text 172M tokens ($430) vs chunk-based RAG 13.2M tokens (~$33).
Dynamic Template Selection for Output Token Reduction, arXiv:2511.20683 (2025). Illustrative output-trim arithmetic of about $4,500 per million queries.
Regmi and Pun, GPT Semantic Cache, arXiv:2411.05276 (2024). Up to 68.8% fewer API calls at over 97% positive-hit accuracy.
TensorZero, Distillation for 5x to 30x cheaper inference; and Humanloop, Model distillation. Vendor figures, directional.
Guido Appenzeller / a16z, Welcome to LLMflation (2024). Roughly 10x per year; $60 to $0.06 per million tokens for a GPT-3-class model in three years.