How much does AI cost? An AI TCO calculator for the real price of owning a feature

How much does AI cost? Not the per-token price you screenshot, which is the smallest line on the bill, but the fully loaded yearly cost of owning an AI feature in production. This AI TCO calculator models all of it, well beyond one API call: inference, standing infrastructure, the one-time build, and the ongoing engineering everyone forgets. Edit the numbers, compare build against buy, and read the breakeven.

By Kanika Mathur, Head of Service Delivery

Reviewed by Resourcifi engineering | Published Apr 26, 2026 | Updated Apr 26, 2026 | 10 min read


Key takeaways

The short version

AI TCO is the fully loaded yearly cost of owning an AI feature, well past the price of one API call. It has five parts: inference, standing infrastructure, one-time build, ongoing engineering, and governance.
The token price is the smallest line. Output is the expensive half (about 5x input on Claude Sonnet 4.6), and the costs that do not fall with token price (build labor, FTE upkeep, infra floors) dominate the bill.
In the worked default below, a build-it feature runs about $115k in year one and roughly $59k a year at steady state once the one-time build drops off.
Per-token cost falls roughly 10x a year (a16z LLMflation), so model an editable rate. Cheaper tokens invite more usage, which is why total spend keeps climbing.
Cost is a named project killer: RAND found over 80% of AI projects fail, and Gartner flags escalating cost as a driver behind abandoned GenAI and agentic projects. A TCO with no cap is a budget with no ceiling.

How much does AI cost? What goes into AI TCO

How much does AI cost in practice? The honest answer is its total cost of ownership, not a per-token rate. AI TCO is the fully loaded yearly cost of building and running an AI feature in production, well beyond the sticker price of one API call. It has five parts. Inference is input tokens plus output tokens, multiplied by volume and by every call one user action triggers. Infrastructure is the standing bill (vector database, hosting, observability). Build is the one-time engineering to ship it. Ongoing engineering is the upkeep to keep it in production. Governance, the evaluation and monitoring that standards bodies now treat as obligations, lives inside that upkeep line.

The reason a single per-token figure misleads is that it ignores three of those five parts entirely, and it under-prices a fourth. Output tokens are the expensive half: on Claude Sonnet 4.6 the rate is $3 per million input against $15 output, a 5:1 ratio.¹ One user action can trigger several calls once you add retries and agent turns. And the costs that genuinely matter over a feature's life, the build, the standing infra floor, and the engineer-hours to maintain it, do not appear on any pricing page.

Two facts justify modeling all five rather than estimating tokens. Most AI projects never reach production: RAND found over 80% of AI projects fail, roughly twice the rate of non-AI IT projects.⁴ And cost is a named cause of death: Gartner predicts that at least 30% of generative AI projects will be abandoned after proof of concept, and that over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating cost in both.⁵ A feature that pencils out on tokens but blows up on infra and maintenance is exactly the one that gets killed. This is the cost model our AI deployment team builds before code, and it is one of the five numbers that gate a launch in our production-first method.

The calculator

Enter your numbers and the AI TCO calculator recomputes live: monthly inference cost, annual run cost, build-path year-one TCO, buy-path year-one TCO, and the breakeven month. The defaults below are illustrative example values rather than benchmarks. Token rates, the base wage, and the vector-database floor are externally sourced (see the footnote); build effort, maintenance percentage, request volume, and vendor prices are placeholders you must replace with your own.

AI TCO calculator

Edit any field. Outputs update on every keystroke. All figures are representative; replace the placeholders with your own numbers before trusting the result.

Inputs

Usage: monthly request volume; average input tokens per request; average output tokens per request; calls per request (retries + agent turns).

Model rate ($ per 1M tokens): model input rate; model output rate.

Infrastructure: infra / vector DB ($ per month).

Build path (one-time + upkeep): build effort (engineer-months); blended rate ($ per engineer-month); ongoing engineering (% of one FTE).

Buy path: pricing model; seat price ($ per seat per month); seats; resolution price ($ each); vendor setup (one-time $).

Outputs

Monthly inference cost; annual run cost (inference + infra); build path, year-1 TCO; build path, year-2+ steady state per year; buy path, year-1 TCO; build vs buy breakeven.

Externally sourced defaults: token rates Claude Sonnet 4.6 $3 in / $15 out (Anthropic, 2026)¹; blended eng rate derived from the US median software-developer wage of $133,080 (BLS OOH, May 2024) grossed up about 1.25x for loaded cost⁸; infra floor near the Pinecone Standard $50/mo minimum plus illustrative hosting (Pinecone, 2026)². Build effort, FTE percentage, request volume, and vendor prices are illustrative placeholders rather than market rates. Figures are representative; the output is only as good as your inputs.
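The blended-rate derivation in that footnote can be checked in a few lines. This is a minimal sketch, assuming the gross-up is applied to the annual wage and then divided by twelve; the function and constant names are illustrative and not taken from the calculator itself.

```python
# Assumption: names are hypothetical; only the two numbers come from the footnote.
MEDIAN_WAGE_USD = 133_080       # BLS OOH median annual wage, May 2024
LOADED_COST_MULTIPLIER = 1.25   # gross-up for loaded cost (not a BLS figure)


def blended_monthly_rate(annual_wage: float, multiplier: float) -> float:
    """Loaded cost of one engineer-month, in dollars."""
    return annual_wage * multiplier / 12


rate = blended_monthly_rate(MEDIAN_WAGE_USD, LOADED_COST_MULTIPLIER)
print(f"${rate:,.2f} per engineer-month")  # prints $13,862.50 per engineer-month
```

The worked default ($56,000 for 4 engineer-months) implies $14,000 per engineer-month, which looks like this figure rounded up.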

How the math works

Six formulas drive the whole model. Price input and output separately, because output is the expensive half and a single blended rate misleads. Then add the standing infra floor, the one-time build, and the ongoing engineering, and compare that against a buy path over the same horizon.

Cost per request is calls x (input tokens x input rate + output tokens x output rate), with rates in dollars per token. On the Sonnet 4.6 default (3,000 in, 700 out, one call) that is 3000 x $3/1M plus 700 x $15/1M, which is $0.009 plus $0.0105, or about $0.0195 a request.¹ Monthly inference is that figure times request volume: at 100,000 requests, about $1,950 a month. Annual run cost adds the infra floor and annualizes: ($1,950 plus $200) x 12, about $25,800 a year. Build-path year-one TCO is the one-time build plus the annual run plus the ongoing engineering: $56,000 plus $25,800 plus $33,600, about $115,400. Year two onward drops the build, leaving about $59,400 a year.
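The build-path arithmetic above can be sketched as a handful of small functions. This is an illustrative sketch under stated assumptions: the field and function names are hypothetical, and the defaults are the worked-default values quoted in the text (with $14,000 per engineer-month implied by $56,000 over 4 engineer-months).

```python
from dataclasses import dataclass


@dataclass
class BuildInputs:
    """Worked-default inputs from the text; field names are hypothetical."""
    monthly_requests: int = 100_000
    input_tokens: int = 3_000         # average input tokens per request
    output_tokens: int = 700          # average output tokens per request
    calls_per_request: float = 1.0    # retries + agent turns
    input_rate_per_m: float = 3.0     # $ per 1M input tokens
    output_rate_per_m: float = 15.0   # $ per 1M output tokens
    infra_monthly: float = 200.0      # $ per month
    build_eng_months: float = 4.0
    blended_rate: float = 14_000.0    # $ per engineer-month
    ongoing_fte: float = 0.20         # fraction of one FTE


def cost_per_request(p: BuildInputs) -> float:
    # Price input and output separately, then multiply by calls.
    per_call = (p.input_tokens * p.input_rate_per_m
                + p.output_tokens * p.output_rate_per_m) / 1_000_000
    return p.calls_per_request * per_call


def monthly_inference(p: BuildInputs) -> float:
    return cost_per_request(p) * p.monthly_requests


def annual_run(p: BuildInputs) -> float:
    # Inference plus the standing infra floor, annualized.
    return (monthly_inference(p) + p.infra_monthly) * 12


def build_once(p: BuildInputs) -> float:
    return p.build_eng_months * p.blended_rate


def ongoing_engineering_annual(p: BuildInputs) -> float:
    return p.ongoing_fte * p.blended_rate * 12


def year_one_tco(p: BuildInputs) -> float:
    return build_once(p) + annual_run(p) + ongoing_engineering_annual(p)


def steady_state_annual(p: BuildInputs) -> float:
    # Year two onward: the one-time build drops off.
    return annual_run(p) + ongoing_engineering_annual(p)


defaults = BuildInputs()
print(f"per request:  ${cost_per_request(defaults):.4f}")
print(f"monthly inf.: ${monthly_inference(defaults):,.0f}")
print(f"annual run:   ${annual_run(defaults):,.0f}")
print(f"year-1 TCO:   ${year_one_tco(defaults):,.0f}")
print(f"steady state: ${steady_state_annual(defaults):,.0f}")
```

With the defaults this reproduces the figures in the paragraph above. Setting calls_per_request to 3 triples the inference line while leaving every other line unchanged, which is one way to see why retries and agent turns matter more than the headline rate.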

The buy path is the vendor fee over the same horizon. Per seat it is seat price x seats x 12; per resolution it is the resolution price x resolution volume, which is the line that explodes at scale, since 1.2M resolutions a year at $1 each is $1.2M before setup. Breakeven solves for the month where cumulative build cost drops below cumulative buy cost: (build_once minus buy_setup) / (buy_monthly minus build_monthly_run), where build_monthly_run is build's full monthly run-rate (inference, infra, and ongoing engineering). When the denominator is zero or negative, buy stays cheaper and there is no crossover within the modeled horizon. Always read year one and year-two steady state side by side, because the one-time build distorts any single-year comparison; that distortion is the most common TCO mistake.
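The breakeven rule can be sketched as follows. This is a minimal sketch, assuming build_monthly_run covers inference, infra, and ongoing engineering ($59,400 / 12 = $4,950 in the worked default), and assuming the same $10,000 vendor setup applies to both pricing models. The function names are hypothetical.

```python
from typing import Optional


def buy_monthly_per_seat(seat_price: float, seats: int) -> float:
    return seat_price * seats


def buy_monthly_per_resolution(price_each: float, monthly_volume: int) -> float:
    return price_each * monthly_volume


def breakeven_month(build_once: float, buy_setup: float,
                    buy_monthly: float, build_monthly_run: float) -> Optional[float]:
    """Month at which cumulative build cost falls below cumulative buy cost.

    Returns None when buy's monthly fee never exceeds build's run-rate,
    so there is no crossover.
    """
    monthly_saving = buy_monthly - build_monthly_run
    if monthly_saving <= 0:
        return None
    # If build's one-time cost is already below vendor setup, build is
    # cheaper from the start, so clamp at month 0.
    return max((build_once - buy_setup) / monthly_saving, 0.0)


BUILD_ONCE = 56_000.0
BUY_SETUP = 10_000.0          # assumed identical for both pricing models
BUILD_MONTHLY_RUN = 4_950.0   # $59,400 steady state / 12

per_seat = breakeven_month(BUILD_ONCE, BUY_SETUP,
                           buy_monthly_per_seat(30.0, 50), BUILD_MONTHLY_RUN)
per_resolution = breakeven_month(BUILD_ONCE, BUY_SETUP,
                                 buy_monthly_per_resolution(1.0, 100_000),
                                 BUILD_MONTHLY_RUN)
print("per seat:", per_seat)
print("per resolution:", per_resolution)
```

On the worked default the per-seat path never crosses over (the result is None), while $1 per resolution at 100,000 requests a month crosses over within the first month.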

Reading build vs buy

Model both paths as TCO over the same horizon. Build carries a large one-time cost and a low monthly run-rate; buy carries little setup but a recurring fee that, at volume, can exceed build's steady state. Per-resolution pricing in particular scales straight up with usage. Compare year one against year-two steady state, find the breakeven month, and decide on the durable number rather than the first-year sticker.

Where the build-path money goes, and how buy compares

The four build-path cost lines from the worked default (year one), set against the per-seat buy path. Example inputs; your numbers will move the totals.

Build vs buy, the worked default (representative)
Cost line	Build path	Buy path (per seat)
One-time (build or vendor setup)	$56,000	$10,000
Inference / yr	$23,400	included
Infrastructure / yr	$2,400	included
Ongoing engineering / yr	$33,600	vendor-run
Recurring vendor fee / yr	n/a	$18,000
Year-1 TCO	$115,400	$28,000
Year-2+ steady state / yr	$59,400	$18,000

Worked default inputs (100,000 requests/mo, Sonnet 4.6 rates, 4 eng-months, 20% FTE upkeep, 50 seats at $30). Example values for illustration only. Buy "included" means the vendor absorbs that line in its fee. See the build vs buy guide for the decision this feeds.

The table makes the trap visible. In year one, buy looks far cheaper ($28k against $115k) because build front-loads a one-time cost. But at steady state the gap narrows and depends entirely on volume: the per-seat buy fee is fixed, while a per-resolution fee scales straight up with usage and can pass build's run-rate quickly. The right comparison is the steady-state line plus a breakeven month, and the real swing factor is volume, which is why the calculator lets you vary it. This is the number that the build vs buy decision turns on.
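To see how volume swings the steady-state comparison, the sketch below sweeps monthly volume and compares build's run-rate against both buy pricing models. It ignores the one-time build and vendor setup, and it assumes the worked-default figures: $0.0195 per request, $200 a month infra, $2,800 a month ongoing engineering ($33,600 / 12), a fixed $1,500 a month per-seat fee, and the illustrative $1 per resolution. The function names are hypothetical.

```python
from typing import Optional

COST_PER_REQUEST = 0.0195      # $ per request on the worked default
FIXED_BUILD_MONTHLY = 3_000.0  # $200 infra + $2,800 ongoing engineering
PER_SEAT_MONTHLY = 1_500.0     # 50 seats x $30
PRICE_PER_RESOLUTION = 1.0     # illustrative


def build_monthly_run(volume: int) -> float:
    return volume * COST_PER_REQUEST + FIXED_BUILD_MONTHLY


def per_resolution_monthly(volume: int) -> float:
    return volume * PRICE_PER_RESOLUTION


def crossover_volume(price_each: float, cost_per_request: float,
                     fixed_monthly: float) -> Optional[float]:
    """Monthly volume above which a per-resolution fee exceeds build's run-rate."""
    margin = price_each - cost_per_request
    if margin <= 0:
        return None  # the vendor is cheaper per unit, so buy never overtakes
    return fixed_monthly / margin


for volume in (1_000, 3_000, 10_000, 100_000):
    print(f"{volume:>7,} req/mo  build ${build_monthly_run(volume):>9,.0f}"
          f"  per-seat ${PER_SEAT_MONTHLY:>7,.0f}"
          f"  per-resolution ${per_resolution_monthly(volume):>9,.0f}")

threshold = crossover_volume(PRICE_PER_RESOLUTION, COST_PER_REQUEST,
                             FIXED_BUILD_MONTHLY)
print(f"per-resolution passes build's run-rate near {threshold:,.0f} req/mo")
```

Under these assumptions the per-resolution fee overtakes build's run-rate at a little over 3,000 requests a month, while the fixed per-seat fee stays below build's run-rate at every volume.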

Why token price is not your cost

Per-token cost for an equivalent-performance model falls roughly 10x a year, a trend a16z calls LLMflation. So architecting your cost model around this month's price is a trap. The parts of TCO that do not fall, the build labor, the FTE upkeep, and the infra floor, increasingly dominate the bill. A TCO model stays useful precisely because it isolates the durable costs from the one collapsing line.

The token price is collapsing, which is exactly why TCO matters

Cost to run a GPT-3-class model (MMLU about 42) per million tokens, from a16z's LLMflation analysis. A roughly 1,000x drop in three years. The lines that do not fall this way are the ones a TCO model exists to surface.

Inference price for an equivalent model (a16z)
Date	$ / 1M tokens
Nov 2021 (GPT-3 class)	$60.00
Late 2024 (cheapest equivalent)	$0.06
Trend	about 10x cheaper / yr

Source: Guido Appenzeller / a16z, Welcome to LLMflation (2024).³ Figures describe a model of equivalent performance, so they track capability over time and are not one product's list-price history.

Two consequences follow. Make the rate field editable and re-run it, because the inference line you compute today over-states next year's. And remember that a falling unit price does not mean a falling bill: cheaper tokens invite more usage and more agent calls, so request volume and calls per request, not the headline rate, are the swing factors. Pulling those levers down without dropping below your quality bar is the subject of our AI cost optimization guide.
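Both consequences can be illustrated with a small projection. This is a sketch under stated assumptions: the 10x annual price decline is the a16z trend quoted above, the usage-growth multipliers are hypothetical placeholders, and the starting point is the worked default's $23,400 a year of inference.

```python
from typing import List


def project_inference(year_one_cost: float, price_decline_factor: float,
                      usage_growth_factor: float, years: int) -> List[float]:
    """Annual inference cost when price falls and usage grows each year.

    price_decline_factor: how many times cheaper a token gets per year.
    usage_growth_factor: how many times total token volume grows per year
    (request volume x calls per request); a hypothetical input.
    """
    costs = []
    cost = year_one_cost
    for _ in range(years):
        costs.append(cost)
        cost = cost * usage_growth_factor / price_decline_factor
    return costs


YEAR_ONE_INFERENCE = 23_400.0  # worked default, $ per year

# Usage triples each year: the inference line still shrinks.
modest = project_inference(YEAR_ONE_INFERENCE, 10.0, 3.0, years=3)
# Usage grows 12x each year (more agent calls): the bill climbs despite 10x cheaper tokens.
heavy = project_inference(YEAR_ONE_INFERENCE, 10.0, 12.0, years=3)

for label, series in (("3x usage/yr", modest), ("12x usage/yr", heavy)):
    print(label, [f"${c:,.0f}" for c in series])
```

The bill only falls when usage grows more slowly than price declines, which is why request volume and calls per request are worth modeling as carefully as the rate.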

Limitations and cost risk

This calculator is a model, and its output is only as good as your inputs. Most of the defaults are illustrative placeholders, the token line trends down while labor and infra do not, vector-database cost rises faster than linearly at scale, and an uncapped feature carries a real "denial of wallet" risk. Treat the result as a starting frame, then replace every placeholder with your own measured numbers.

Defaults are not benchmarks. Only the token rates, the base wage, and the infra floor are externally sourced and dated. Build effort, FTE percentage, vendor prices, and request volume are placeholders. Replace them before you trust the output.
The token line trends down, labor and infra do not. LLMflation means a build TCO computed today over-states year-two inference.³ Apply an annual decline assumption, or at least caveat the inference line.
Infra cost is superlinear and bursty. A vector-database minimum sits near $50/mo before any usage, but production bills scale with vectors and concurrent load.² Re-run the model at your real corpus size.
"Denial of wallet" is a budget risk, not only a security one. OWASP lists Unbounded Consumption (LLM10:2025): uncontrolled inference, retries and abuse can generate "unsustainable financial costs."⁶ Pair the calculator with rate limits and budget guards.
Governance is part of ownership cost. NIST's voluntary AI Risk Management Framework and its Generative AI Profile treat evaluation, monitoring and incident response as ongoing risk-management practices.⁷ Those hours live in the maintenance line, which is why omitting ongoing engineering understates TCO.
Quality is the missing axis. TCO says nothing about whether the cheapest model clears your eval bar. The right model is the cheapest one that passes, addressed in AI cost optimization and the build vs buy guide.
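As one way to pair the calculator with a budget guard, the sketch below refuses a call when its worst-case cost would push spend past a cap. It is a minimal, in-memory illustration: the class and method names are hypothetical, the rates are the Sonnet 4.6 defaults used above, and a production version would also need persistence, a monthly reset, and per-user rate limits.

```python
class BudgetGuard:
    """Tracks spend against a cap and blocks calls that would exceed it."""

    def __init__(self, cap_usd: float,
                 input_rate_per_m: float = 3.0,
                 output_rate_per_m: float = 15.0) -> None:
        self.cap_usd = cap_usd
        self.input_rate_per_m = input_rate_per_m
        self.output_rate_per_m = output_rate_per_m
        self.spent_usd = 0.0

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * self.input_rate_per_m
                + output_tokens * self.output_rate_per_m) / 1_000_000

    def allow(self, input_tokens: int, max_output_tokens: int) -> bool:
        # Check against the worst case: the full output-token limit.
        projected = self.spent_usd + self.cost(input_tokens, max_output_tokens)
        return projected <= self.cap_usd

    def record(self, input_tokens: int, output_tokens: int) -> None:
        # Call after the model responds, with the actual token counts.
        self.spent_usd += self.cost(input_tokens, output_tokens)


guard = BudgetGuard(cap_usd=0.05)
allowed = 0
for _ in range(5):
    if guard.allow(input_tokens=3_000, max_output_tokens=700):
        guard.record(input_tokens=3_000, output_tokens=700)
        allowed += 1
print(f"{allowed} of 5 calls allowed, ${guard.spent_usd:.4f} spent")
```

With a $0.05 cap and the default request shape, two calls pass and the rest are refused; in practice the cap would come from the monthly inference figure the calculator produces, plus whatever headroom you decide to allow.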

Frequently asked

AI TCO questions

What is the total cost of ownership (TCO) of an AI feature?

AI TCO is the fully loaded yearly cost of building and running an AI feature in production, well beyond the per-token API price. It has five parts: inference (input plus output tokens, times volume, times calls per request), standing infrastructure (vector database, hosting, observability), one-time build labor, ongoing engineering to keep it in production, and governance and evaluation overhead. The token price is usually the smallest of the five.

How do you calculate the cost of an LLM feature per request?

Cost per request equals calls per request times (input tokens times input rate plus output tokens times output rate), with rates in dollars per token. Price input and output separately, because output is the expensive half, about 5x input on Claude Sonnet 4.6 ($3 in against $15 out). On the default of 3,000 input and 700 output tokens in one call, that works out to about $0.0195 a request. Remember that one user action can trigger several calls once you add retries and agent turns.

Should I build or buy an AI feature?

Model both as TCO over the same horizon. Build carries a large one-time cost (about $56k in the default) plus a low monthly run-rate; buy carries little setup but a recurring fee that, at high volume, can exceed the steady state of building, with per-resolution pricing in particular scaling straight up with usage. Compare year-one and year-two steady state, and find the breakeven month rather than deciding on the first-year sticker.

Why is my AI feature more expensive than the token price suggested?

Because the token price is the smallest line. The usual culprits are output tokens (the expensive half), extra calls per request from retries and agent loops, the standing infra floor (a vector-database minimum is about $50/mo before any usage), the one-time build, and the most underestimated of all, the ongoing engineering to keep it in production. Uncontrolled usage can also trigger a denial-of-wallet cost spike, which OWASP lists as Unbounded Consumption.

Will falling AI prices make my cost model obsolete?

The inference line keeps shrinking, since per-token cost falls roughly 10x a year for equal performance (a16z LLMflation), but the parts of TCO that do not fall, build labor, FTE maintenance and infra floors, increasingly dominate. So a TCO model stays useful precisely because it isolates the durable costs from the one collapsing line. Model the rate as editable and re-run it as prices move.

Kanika Mathur

Head of Service Delivery, Resourcifi

Kanika Mathur runs service delivery at Resourcifi, where her pods cost-model AI features before a line of code is written and then defend that budget through launch. She built this calculator out of the same spreadsheet she walks every client through, after watching too many projects price a token and forget the four bigger lines underneath it.


Sources

1. Anthropic, Claude API Pricing (2026). Per-token rates ($3 input / $15 output on Sonnet 4.6) and the 5:1 output ratio.
2. Pinecone, Pricing (2026). Standard tier $50/mo floor and per-vector storage; the basis for the infrastructure default and the superlinear-at-scale caveat.
3. Guido Appenzeller / a16z, Welcome to LLMflation (2024). Roughly 10x per year; $60.00 to $0.06 per million tokens for a GPT-3-class model in three years.
4. Ryseff, De Bruhl and Newberry, The Root Causes of Failure for Artificial Intelligence Projects, RAND Corporation (2024). Over 80% of AI projects fail, about twice the rate of non-AI IT projects.
5. Gartner, 30% of generative AI projects abandoned after proof of concept (2024); and over 40% of agentic AI projects canceled by end of 2027 (2025). Escalating cost named in both.
6. OWASP GenAI Security Project, LLM10:2025 Unbounded Consumption (2025). Uncontrolled inference can generate unsustainable financial costs, the denial-of-wallet risk.
7. NIST, AI Risk Management Framework (AI RMF 1.0) and the Generative AI Profile (NIST AI 600-1, 2024). Evaluation, monitoring and incident response as ongoing risk-management practices inside the maintenance line.
8. U.S. Bureau of Labor Statistics, Occupational Outlook Handbook: Software Developers, QA Analysts, and Testers (median wage $133,080, May 2024). Base for the blended engineer rate, grossed up about 1.25x for loaded cost (the gross-up is standard practice and is not a BLS figure).