LLM evaluation and AI evals: the discipline that decides whether your model ships

LLM evaluation is the practice of measuring, with structured tests called evals, whether an AI system actually does its job. It exists because language models are non-deterministic, so a single good demo proves nothing. This guide covers what evals are, the full taxonomy, how to build an eval suite, LLM-as-a-judge and its biases, offline versus production monitoring, and the mistakes that quietly sink teams.

By Kanika Mathur, Head of Service Delivery

Reviewed by Resourcifi engineering | Published Mar 14, 2026 | Updated Mar 14, 2026 | 12 min read


Key takeaways

The short version

An eval is a structured test: feed an AI a defined input, then apply grading logic (code-based, an LLM judge, or human review) to measure whether the output clears a bar. Evals exist because LLMs are non-deterministic, so one demo proves nothing.
Per Anthropic, the inability to measure model performance is the single biggest blocker of production LLM use cases. Evals turn "it seems to work" into "it passes," which is what makes a system shippable.
The taxonomy spans code-based, model-based, and human graders, plus reference-based, agentic, regression, adversarial, RAG-specific, and safety evals. Match the grader to the task instead of forcing one global metric.
LLM-as-a-judge is scalable and cheap, and reached over 80% agreement with humans in 2023, but it carries position, verbosity, and self-preference biases. Calibrate it against human labels before you trust it.
Mature teams run a layered "Swiss cheese" approach: offline evals for fast iteration, production monitoring for ground truth, and periodic human review for calibration. No single layer catches everything.

What LLM evaluation is, and why evals decide shippability

LLM evaluation means testing an AI system with structured tests called evals: you feed a defined input to a model or LLM application, then apply grading logic (code-based, an LLM judge, or human review) to measure whether the output meets a bar. Evals exist precisely because LLMs are non-deterministic. As OpenAI puts it, evals are a way to test your AI system despite its variability, so a single happy-path demo tells you almost nothing about production behavior.¹²

This is not a nice-to-have. Anthropic's solutions team puts it bluntly: the inability for teams to measure the performance of their models is the biggest blocker of production use cases for LLMs.¹ Manual spot-checking is fine in early prototyping, but once an agent reaches production and starts scaling, building without evals breaks down, and teams fall into reactive loops where issues only appear in front of users and fixing one failure creates the next. Evals also become the highest-bandwidth channel between product and research teams, the thing that lets a team upgrade to a new model in days instead of re-validating by feel.¹

That is the link to production-first AI: demos are easy, production is the hard part, and evals are the mechanism that gets a system from one to the other. A model is shippable when it clears a measurable bar you can defend, regress against, and re-run on every change, rather than when it merely looks good once in a notebook. We treat the eval harness as a deliverable that de-risks the prototype-to-production handoff, built alongside the system. Resourcifi has shipped LLM features since the technology matured, and the handover includes a regression suite, not merely a trained model.

The types of evals: a working taxonomy

There are two useful axes. By grading mechanism there are three graders: code-based (deterministic), model-based (LLM-as-a-judge), and human. By what they test, the common families are reference-based, task or end-to-end (agentic), regression, adversarial or red-team, RAG-specific, and safety or responsible-AI evals. Real suites mix several, matching the grader to each task.

The three graders form the spine.¹ Code-based graders use string match, regex, schema or JSON validity, unit tests, and outcome checks; they are fast, cheap, objective, and reproducible, but brittle to valid variations in phrasing. Model-based graders apply a rubric through a stronger LLM; they are flexible and capture nuance, but they are non-deterministic and must be calibrated against human judgment, covered in detail below. Human graders are the gold standard for quality and the way you calibrate the model graders, but they are expensive, slow, and need expert access at scale.
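To make the code-based grader concrete, here is a minimal Python sketch of three deterministic checks: exact match, regex, and JSON validity with required keys. The function names and the sample strings are hypothetical, chosen for illustration; none of them come from a specific eval framework.

```python
import json
import re


def grade_exact(output: str, expected: str) -> bool:
    """Exact match after trimming whitespace and lowercasing."""
    return output.strip().lower() == expected.strip().lower()


def grade_regex(output: str, pattern: str) -> bool:
    """Pass if the pattern appears anywhere in the output."""
    return re.search(pattern, output) is not None


def grade_json_fields(output: str, required: dict[str, type]) -> bool:
    """Pass if the output parses as a JSON object with the required keys and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in required.items())


# Brittleness in action: a trailing period is a valid variation,
# but the exact-match grader rejects it.
print(grade_exact("Paris", "paris"))    # True
print(grade_exact("Paris.", "Paris"))   # False
```

The second call shows why these graders suit classification and structured output better than free text: the check is cheap and reproducible, but any valid rephrasing fails it.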

The three graders: a tradeoff triangle

Relative profile of the three grading mechanisms across cost, speed, and the nuance each can capture. Code graders are cheapest and fastest but capture the least; human graders are the reverse; LLM judges sit between.

Directional profile of the three graders
Grader	Speed	Cost efficiency	Nuance captured
Code-based	High	High	Low
LLM-as-a-judge	Medium	Medium	Medium-high
Human	Low	Low	Highest

Profile derived from Anthropic, Demystifying evals for AI agents (2025). Ratings are a directional read of the stated tradeoffs, not measured benchmark units.

The second axis, what an eval tests, gives the families a real suite is built from. The comparison table below is the reference; the families map to distinct failure modes you want to catch separately.

Eval types compared
Eval type	What it grades	Grader	Caveat	Best for
Reference-based	Output vs a known correct answer	Code (exact match, F1, ROUGE/BLEU, semantic sim)	Brittle to valid phrasings; surface metrics drift from quality	Classification, extraction, structured output
LLM-as-a-judge	Open-ended quality vs a rubric	Stronger LLM with a rubric	Position, verbosity, self-preference bias; calibrate to humans	Summaries, chat, subjective tasks at scale
Human eval	Output vs expert judgment	Human raters	Expensive, slow, needs experts	Calibration, high stakes, final sign-off
Task / agentic	Whether the task outcome succeeded	Code or LLM judge on the outcome	Complex setup; isolate environments; grade outcomes	Agents, multi-step workflows
Regression	Whether known-good behavior broke	Usually code	Needs disciplined upkeep; targets near 100% pass	CI on every change
Adversarial / red-team	Safety under attack and correct refusals	Code, human, and LLM judge	Needs balanced should and should-not sets	Safety-critical, public-facing
RAG-specific	Retrieval and grounding quality	RAGAS metrics	Some metrics need ground truth	RAG and knowledge assistants
Safety / responsible-AI	Toxicity, bias, fairness, PII	Classifiers plus human	Context-dependent thresholds	Compliance, enterprise deploys

Two families deserve a note. For RAG, the RAGAS framework gives four core metrics: faithfulness (is the answer factually consistent with the retrieved context, the anti-hallucination check) and answer relevancy grade the generator, while context precision (are the relevant chunks ranked high) and context recall (did retrieval cover everything relevant) grade the retriever.⁵ For safety, Stanford's HELM is the canonical reminder to measure more than accuracy: it scores seven categories, accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.⁶ One distinction matters for buyers: standardized harnesses like HELM and EleutherAI's lm-evaluation-harness, with its 60-plus academic benchmarks, measure a foundation model's general capability for cross-model comparison.⁶⁷ They are not a substitute for task-specific evals of your own application.
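The two retriever-side ideas can be illustrated with a simplified, ID-based sketch in Python. It assumes you already have ground-truth labels for which chunks are relevant. It is not the RAGAS library's API or its exact computation, which typically relies on LLM-assisted judgments. Here recall is the share of relevant chunks that were retrieved, and precision is approximated with average precision, which rewards ranking relevant chunks high. The chunk IDs are made up.

```python
def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of the relevant chunks that retrieval actually returned."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)


def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Average precision: mean of precision@k taken at each relevant hit."""
    hits = 0
    precisions = []
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


retrieved = ["c1", "c9", "c2"]      # ranked retrieval result (hypothetical IDs)
relevant = {"c1", "c2", "c3"}       # ground-truth relevant chunks

print(round(context_recall(retrieved, relevant), 3))     # 0.667
print(round(context_precision(retrieved, relevant), 3))  # 0.833
```

Recall is penalized because c3 was never retrieved; precision is penalized because the irrelevant c9 outranked c2.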

How to build an eval suite

Build the suite as a pipeline: start with a small golden dataset of 20 to 50 real tasks, choose a grader per task, wire the suite into CI so it runs on every change, and grow the set by turning each production failure into a new case. The headline technique is eval-driven development: write the eval before the capability, then build to it.

Step 1, the golden dataset. Anthropic recommends starting with 20 to 50 tasks; OpenAI says start with real questions paired with expert-approved answers.¹² Source them from reality: the manual checks you run during development, common things end users try, bug trackers, support queues, and production logs, which is why you should log everything from day one. The quality bar for a task has three parts: two domain experts would independently reach the same pass or fail verdict, every task ships with a reference solution proving it is solvable, and everything the grader checks is clear from the task description. Include negatives, the cases where the behavior should not happen, so the set stays balanced.
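One way to hold a golden dataset to that bar is to give each task an explicit record and lint the set before it is used. The Python sketch below is an assumption about how such a record might look; the `EvalTask` fields, the `validate_dataset` helper, and the `source` labels are hypothetical rather than part of any framework named in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalTask:
    task_id: str
    prompt: str
    reference_solution: str       # proves the task is solvable
    pass_criteria: str            # what the grader checks, in plain language
    should_succeed: bool = True   # False marks a negative case
    source: str = "dev-check"     # e.g. "support-queue" or "prod-log"


def validate_dataset(tasks: list[EvalTask], min_size: int = 20) -> list[str]:
    """Return a list of human-readable problems; an empty list means the set is usable."""
    problems: list[str] = []
    if len(tasks) < min_size:
        problems.append(f"only {len(tasks)} tasks; suggested starting point is {min_size} to 50")
    ids = [task.task_id for task in tasks]
    if len(ids) != len(set(ids)):
        problems.append("duplicate task_id values")
    for task in tasks:
        if not task.reference_solution.strip():
            problems.append(f"{task.task_id}: missing reference solution")
        if not task.pass_criteria.strip():
            problems.append(f"{task.task_id}: missing pass criteria")
    if tasks and all(task.should_succeed for task in tasks):
        problems.append("no negative cases; the set is unbalanced")
    return problems
```

The check cannot tell whether two experts would agree on a verdict; that part of the bar still needs a human review of each task.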

Step 2, choose graders per task. Use deterministic graders where you can, an LLM judge where you must, and humans to calibrate. Use partial credit for multi-component tasks instead of forcing one global metric.¹
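Partial credit can be as simple as a weighted share of component checks. The sketch below is illustrative Python; the component names and weights are invented for the example and would come from your own task definition.

```python
def partial_credit(checks: dict[str, bool], weights: dict[str, float]) -> float:
    """Weighted share of component checks that passed, between 0.0 and 1.0."""
    if set(checks) != set(weights):
        raise ValueError("checks and weights must name the same components")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    earned = sum(weights[name] for name, passed in checks.items() if passed)
    return earned / total


# Hypothetical support-bot task with three graded components.
checks = {"valid_json": True, "correct_intent": True, "cited_source": False}
weights = {"valid_json": 1.0, "correct_intent": 2.0, "cited_source": 1.0}

print(partial_credit(checks, weights))  # 0.75
```

Keeping the per-component results alongside the score tells you which part of a multi-step task regressed, which a single pass or fail flag hides.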

Step 3, continuous evaluation in CI. OpenAI's guidance is to run evals on every change; Anthropic's automated evals run on every commit, are fully reproducible, and isolate each trial in a clean environment to avoid correlated flakiness.¹² Treat evals as core infrastructure. This continuous-evaluation and monitoring layer is squarely the work of our AI deployment practice.

Step 4, grow the set. Evaluation is a continuous process, a journey and not a destination, so every new production failure becomes a new eval case.²

The technique that ties it together is eval-driven development: build evals to define planned capabilities before the agent can fulfill them, then iterate until it performs well.¹ Run two kinds of evals in parallel. Capability evals start at a low pass rate and target what the system cannot yet do, so they measure headroom. Regression evals sit near a 100% pass rate and guard what already works, so a drop signals something broke. Writing the grader first also forces a useful question: if you cannot define what passing looks like, the product spec is too vague. Deciding what to measure and shaping that strategy is where teams bring in our AI consulting practice.
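The split between the two kinds of evals maps naturally onto a CI gate: regression results can fail the build, while capability results are only reported. The following Python sketch assumes each suite has already produced a list of pass or fail booleans; `ci_gate` and the `regression_floor` parameter are hypothetical names, and the default floor of 100% is an assumption you may want to relax.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of passing cases; an empty suite counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def ci_gate(regression: list[bool], capability: list[bool],
            regression_floor: float = 1.0) -> int:
    """Return a process exit code: 0 to pass the build, 1 to fail it."""
    regression_rate = pass_rate(regression)
    capability_rate = pass_rate(capability)
    print(f"regression pass rate: {regression_rate:.0%}")
    print(f"capability pass rate: {capability_rate:.0%} (informational)")
    # Only the regression suite blocks a merge. A low capability score is
    # expected, because those evals target what the system cannot do yet.
    return 0 if regression_rate >= regression_floor else 1


exit_code = ci_gate(regression=[True, True, True], capability=[True, False, False])
print(exit_code)  # 0
```

In a real pipeline the return value would be handed to `sys.exit` so the CI job fails on a regression. Note that an empty regression suite fails the gate by design, which surfaces the "no regression set" mistake early.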

LLM-as-a-judge: the mechanism and its biases

LLM-as-a-judge uses a model, usually a stronger one, with an explicit rubric to grade another model's output: scoring on dimensions, comparing two responses pairwise, or passing or failing against criteria. It is scalable and far cheaper than human eval, and strong judges reached over 80% agreement with human preferences in 2023, about the rate at which humans agree with each other. But it carries real biases, so the judge must itself be evaluated.

The load-bearing evidence comes from Zheng and colleagues, who found a strong LLM judge reached over 80% agreement with human preferences on MT-Bench and Chatbot Arena, roughly human-to-human agreement.³ G-Eval frames the method as structured critique: give the judge an explicit rubric, fluency, coherence, consistency, and have it reason before it scores.⁴ OpenAI's best practice is to grade with a different, stronger model than the one that produced the answer, and to validate judge-to-human agreement before trusting it.²
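Validating judge-to-human agreement comes down to comparing two label lists over the same examples. The Python sketch below computes the raw agreement rate, the kind of figure behind the 80% number. It also computes Cohen's kappa, which this guide does not mention; it is added here as a common chance-corrected companion, on the assumption that raw agreement alone can look good when one label dominates. The labels are invented.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of examples where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


judge = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
human = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]

print(agreement_rate(judge, human))          # 0.8
print(round(cohens_kappa(judge, human), 3))  # 0.583
```

Here the judge matches the human on 8 of 10 examples, but after correcting for chance the agreement is closer to 0.58, which is a more cautious basis for deciding whether to trust it.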

The biases are well documented and they are what separate a real eval from a guess. Position bias favors an answer by its slot, first or last, in a pairwise comparison. Verbosity bias prefers longer answers even when they are not better. Self-preference bias favors text from the judge's own model family. Judges can also be weak at math and logic-heavy grading.³ A 2024 survey, Justice or Prejudice, catalogues the broader set systematically, and G-Eval's own authors caution that LLM evaluators do not always correlate with human judgment.⁴⁸ The mitigations follow directly: randomize or swap answer order and average both, control for length, use an explicit rubric with reasoning, prefer a judge from a different family, and calibrate against a human-labeled gold set while tracking agreement. LLM-as-a-judge is a force multiplier, and it earns trust only after it clears its own human-agreement check.
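The swap-and-average mitigation for position bias can be sketched in a few lines of Python. The judge is modeled as any callable that returns the probability that the first-listed answer is better. `first_slot_biased_judge` is a deliberately broken stand-in used to show the effect, not a real model call, and all names here are hypothetical.

```python
from typing import Callable

# (question, first_answer, second_answer) -> probability the first answer is better
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, question: str, a: str, b: str) -> float:
    """Probability that answer A is better, averaged over both presentation orders."""
    forward = judge(question, a, b)           # A shown first
    backward = 1.0 - judge(question, b, a)    # B shown first, flipped back to A's view
    return (forward + backward) / 2


def first_slot_biased_judge(question: str, first: str, second: str) -> float:
    """Stub judge that always favors whatever is in the first slot."""
    return 0.7


score = debiased_preference(first_slot_biased_judge, "What is 2 + 2?", "4", "four")
print(round(score, 2))  # 0.5
```

A judge driven purely by position nets out to 0.5 once both orders are averaged, so any remaining preference reflects the answers rather than the slot. Verbosity and self-preference bias need their own controls; this only addresses ordering.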

Offline evals vs production monitoring

Offline evals run automatically on a fixed golden dataset before deploy: fast, fully reproducible, no user impact, and the home of regression and capability evals. Online evaluation watches real production traffic with sampled LLM-judge scoring, guardrails, and drift detection to catch what synthetic tests miss, but it is reactive: problems reach users first. Mature teams run both, plus periodic human review.

Offline is where you iterate quickly and prove a change before it ships. Online monitoring reveals real user behavior at scale and catches issues synthetic evals never anticipated, through live judge scoring on a traffic sample, guardrails for toxicity, PII, and format, drift detection, user feedback signals, A/B tests on real outcomes like task completion and retention, and manual transcript review that builds intuition without scaling.¹ The structural point, Anthropic's Swiss-cheese model, is that no single layer catches everything, so the best teams combine automated evals for fast iteration, production monitoring for ground truth, and periodic human review for calibration.¹ Standing up that online layer, the sampling, guardrails, drift detection, and observability around a live model, is exactly what our AI deployment team builds, while the upstream question of what to measure maps to AI consulting.
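Three pieces of the online layer are small enough to sketch in Python: deterministic traffic sampling for judge scoring, a simple guardrail, and a drift check on judge scores. Everything here is an illustrative assumption: the function names, the email-only PII pattern (real PII detection needs far more than one regex), and the mean-comparison drift rule with its `max_drop` threshold.

```python
import hashlib
import re
from statistics import mean


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests for judge scoring."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < rate * 2**64


EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def passes_pii_guardrail(output: str) -> bool:
    """Pass if the output contains nothing that looks like an email address."""
    return EMAIL_PATTERN.search(output) is None


def score_drifted(baseline: list[float], recent: list[float],
                  max_drop: float = 0.05) -> bool:
    """Flag drift when the recent mean judge score falls below the baseline mean."""
    return mean(baseline) - mean(recent) > max_drop


print(passes_pii_guardrail("Your order has shipped."))          # True
print(passes_pii_guardrail("Contact jane@example.com today."))  # False
print(score_drifted([0.90, 0.92], [0.80, 0.78]))                # True
```

Hashing the request ID instead of calling a random number generator means the same request is always in or out of the sample, which makes a flagged transcript easy to find again during manual review.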

Common mistakes that quietly sink teams

The recurring failures are vibe-based evals, no regression set, gaming the metric, over-specifying the process instead of the outcome, trusting an uncalibrated LLM judge, contaminated test environments, leaning only on academic benchmarks, and delaying evals until later. Each one lets a system look fine while it silently degrades.

Vibe-based evals. OpenAI names informal judgment in place of structured testing as an anti-pattern, and it is the most common one.²
No regression set. Without one, every prompt tweak or model upgrade is a blind bet, and a decline in score that would have flagged a break goes unnoticed.¹
Gaming the metric (Goodhart). When a measure becomes the target it stops being a good measure. Two concrete forms: eval saturation, where an agent passes every solvable task so progress turns invisible, and optimizing a proxy like ROUGE that drifts from real quality.¹
Over-specifying the process. Checking that an agent followed an exact tool-call sequence is too rigid and produces brittle tests, since agents regularly find valid approaches the eval designer never anticipated. Grade the outcome.¹
Trusting an uncalibrated judge. Deploying LLM-as-a-judge without checking human agreement inherits its position, verbosity, and self-preference biases.³
Contaminated environments. Non-isolated trials produce correlated failures from infrastructure flakiness, which read as agent failures and waste your debugging time.¹
Relying only on academic benchmarks. OpenAI flags reliance on overly generic academic metrics (it names perplexity and BLEU) as an anti-pattern; a high MMLU score, likewise, says nothing about whether your support bot is correct.²
Delaying evals. Teams put evals off, when the right move is small-scale testing right away with a handful of examples.¹

Frequently asked

AI evals questions

What are AI evals?

AI evals (evaluations) are structured tests for AI systems: you feed defined inputs to a model or LLM application and apply grading logic (code-based, an LLM judge, or human review) to measure whether the output meets a bar. Because LLMs are non-deterministic, evals replace one-off demos with repeatable, measurable pass-or-fail criteria, which is what makes a feature defensible to ship.

What is the difference between LLM evaluation and a benchmark like MMLU?

Benchmarks such as MMLU, HellaSwag, HELM, and lm-evaluation-harness measure a foundation model’s general capability for cross-model comparison. Application evals measure whether your specific system, with your prompts, retrieval, and tools, does your task correctly. A high benchmark score does not guarantee your feature works, so you still need task-specific evals on top.

How do you build an eval suite?

Start with a small golden dataset: Anthropic suggests 20 to 50 real tasks drawn from logs, support tickets, and development testing, each with a clear pass-or-fail standard two experts would agree on. Pick a grader per task (deterministic where possible, an LLM judge where needed, humans to calibrate), wire the suite into CI so it runs on every change, and grow it by turning each production failure into a new case.

Is LLM-as-a-judge reliable?

It can be. Strong LLM judges reached over 80% agreement with human preferences in 2023 (Zheng et al.), about the rate at which humans agree with each other. But judges carry position, verbosity, and self-preference biases, so you must calibrate the judge against human labels, randomize answer order, and control for length before trusting it. Treat the judge as something you evaluate in its own right rather than a free pass.

What is the difference between offline and online evaluation?

Offline evals run automatically on a fixed golden dataset before deploy: fast, reproducible, and ideal for regression testing. Online evaluation monitors real production traffic with sampled LLM-judge scoring, guardrails, and drift detection to catch what synthetic tests miss. Mature teams run both, plus periodic human review. This is the layered Swiss-cheese model, where no single layer catches everything.

Kanika Mathur

Head of Service Delivery, Resourcifi

Kanika Mathur is Head of Service Delivery at Resourcifi, where her engineering pods treat the eval harness as part of the build instead of a phase bolted on after a model already looks good in a notebook. She has watched too many promising prototypes stall at the production line because no one could say, with numbers, whether a change made the system better or worse. This guide is her argument for measuring first.
