AI & LLMs · May 12, 2026 · 6 min read

Building LLM Evals That Actually Catch Regressions

Most teams write LLM evals once, watch them pass, and ship blind. Here's how we structure eval suites that fail loudly when a prompt tweak or model swap quietly breaks production.


Most teams write LLM evals the way they write smoke tests: once, at the start, just enough to feel responsible. Then a prompt gets tweaked, a model gets bumped from claude-3-5-sonnet to whatever Anthropic ships next, and nobody notices the silent 8% drop in citation accuracy until a customer flags it. This piece is about building eval suites that fail loudly when that happens.

Why most eval suites are theatre

Walk into a typical AI product team and the eval setup usually looks like this: a notebook with 20 hand-picked prompts, a spreadsheet of "good" answers, and a thumbs-up/thumbs-down review from whoever was on rotation that sprint. It produces a number. The number goes up. Everyone feels good.

The problems with that setup are mundane but fatal:

  • The 20 prompts were chosen by the person who wrote the system prompt, so they encode the same blind spots.
  • "Correctness" is judged by another LLM with no calibration against human labels.
  • There's no versioning, so you can't compare today's run to last week's.
  • Failures are aggregated into a single score, which hides the fact that one category (say, multi-hop questions) collapsed while everything else held.

A real eval suite has to do four things: cover the surface area, version everything, slice failures by category, and stay cheap enough to run on every PR.

Start with a failure taxonomy, not a dataset

Before you write a single test case, write down how your system can fail. For a RAG-backed support assistant we shipped last year, the taxonomy looked roughly like this:

  • Retrieval miss: correct doc exists in the index, top-k didn't surface it.
  • Retrieval hit, generation miss: doc was retrieved, model ignored it.
  • Hallucinated citation: model cited a chunk that doesn't support the claim.
  • Stale answer: model used training data instead of retrieved context.
  • Refusal regression: model refused something it used to handle.
  • Format break: JSON output failed schema validation.
  • Tone drift: answer technically correct but off-brand.

Each category gets its own dataset, its own metric, and its own threshold. When claude-sonnet-4 drops citation accuracy by 12% but bumps tone scores by 3%, you want to see that tradeoff explicitly, not averaged into a meaningless aggregate.
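One way to keep that explicit is a small per-category config that both the runner and the CI comparison read. A sketch; the metric names and numbers below are illustrative, not recommendations:

# Illustrative per-category config: each failure mode gets its own metric,
# absolute pass bar, and maximum allowed regression against the baseline run.
CATEGORY_CONFIG = {
    "retrieval": {"metric": "recall_at_k",        "pass_at": 0.90, "max_regression": 0.03},
    "citation":  {"metric": "judge_faithfulness", "pass_at": 0.85, "max_regression": 0.03},
    "format":    {"metric": "schema_valid_rate",  "pass_at": 0.99, "max_regression": 0.01},
    "refusal":   {"metric": "handled_rate",       "pass_at": 0.95, "max_regression": 0.02},
    "tone":      {"metric": "judge_rubric_score", "pass_at": 0.80, "max_regression": 0.05},
}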

A practical rule for dataset size

In our experience, you need at least 30 – 50 examples per failure category to get a stable signal, and 100+ if you want to detect changes smaller than 5%. Below 30, run-to-run variance from sampling temperature alone will swamp any real regression.
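A quick back-of-the-envelope check makes those numbers concrete. Treat each case as an independent pass/fail: with a true pass rate around 0.8, the standard error of the measured rate is sqrt(p(1-p)/n).

import math

# Standard error of an observed pass rate p over n independent cases
def stderr(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

print(f"{stderr(0.8, 30):.3f}")   # 0.073 -> run-to-run swings of roughly ±7 points
print(f"{stderr(0.8, 100):.3f}")  # 0.040 -> ±4 points, enough to trust a ~5% shift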

Build the harness before you scale the dataset

A common mistake: spend three weeks labelling 500 examples, then realise your runner can't replay them deterministically. Build the harness first with 20 examples, then grow the dataset.

Here's the minimal shape we use, written in Python but the structure ports anywhere:

from dataclasses import dataclass
from typing import Callable, Literal

@dataclass
class EvalCase:
    id: str          # stable across runs: this is what you diff on
    category: Literal["retrieval", "citation", "format", "refusal", "tone"]
    input: dict      # whatever your system takes: query, conversation, doc ids
    expected: dict   # gold answer, schema, expected chunk ids, etc.
    metadata: dict   # source, difficulty, added_at, author

@dataclass
class EvalResult:
    case_id: str
    passed: bool
    score: float
    output: str
    latency_ms: int
    cost_usd: float
    judge_reasoning: str | None  # None for code-based judges

def run_suite(
    cases: list[EvalCase],
    system: Callable[[dict, int], dict],   # (input, seed) -> output payload
    judges: dict[str, Callable],           # one judge per category
    seed: int = 42,
) -> list[EvalResult]:
    results = []
    for case in cases:
        # fixed seed so reruns are comparable; the system-under-test decides
        # how to apply it (sampling seed, cache key, etc.)
        output = system(case.input, seed)
        judge = judges[case.category]
        score, reasoning = judge(case, output)
        results.append(EvalResult(
            case_id=case.id,
            passed=score >= 0.8,   # default bar; per-category thresholds can override
            score=score,
            output=output["text"],
            latency_ms=output["latency_ms"],
            cost_usd=output["cost_usd"],
            judge_reasoning=reasoning,
        ))
    return results

A few things to notice. Every case has a stable id — that's how you diff runs. Every result captures latency and cost, because a 2% quality bump that triples your bill is not a win. And judges are pluggable per category, because the right way to score a JSON format check is different from the right way to score tone.

Pick the right judge for each metric

LLM-as-judge is fashionable and overused. It's the right tool for fuzzy things (tone, helpfulness, citation faithfulness) and the wrong tool for things you can check deterministically.

Metric                      Judge type
JSON schema valid           Code (jsonschema)
Tool call arguments         Code (assert equals)
Retrieved chunk in top-k    Code (set membership)
Citation supports claim     LLM judge, calibrated
Tone matches brand voice    LLM judge with rubric
Answer factually correct    Hybrid: code for known-answer, LLM for open-ended
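
For the deterministic rows, a judge is a handful of lines. A sketch of the format and retrieval judges, assuming the jsonschema package and a few fields (expected["schema"], expected["doc_ids"], output["retrieved_ids"]) that your own case and output shapes may name differently:

import json
from jsonschema import ValidationError, validate

# Format judge: pure code, pass/fail on schema validation.
def judge_format(case, output) -> tuple[float, str | None]:
    try:
        validate(json.loads(output["text"]), case.expected["schema"])
        return 1.0, None
    except (json.JSONDecodeError, ValidationError) as exc:
        return 0.0, str(exc)

# Retrieval judge: did any of the expected chunks make it into the retrieved top-k?
def judge_retrieval(case, output) -> tuple[float, str | None]:
    retrieved = set(output["retrieved_ids"])
    expected = set(case.expected["doc_ids"])
    return (1.0 if retrieved & expected else 0.0), None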

When you do use an LLM judge, do not use the same model family you're evaluating. If you're testing GPT-4-class outputs, judge with Claude or Gemini. Otherwise you encode the model's own biases into the scoring. Anthropic's docs on building evals make this point too, and OpenAI's evals cookbook converges on the same advice.
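Mechanically the LLM judge is not much more code than the deterministic ones. A sketch of a citation-faithfulness judge that deliberately uses a different model family than the Claude-backed system under test; the model id, the output["cited_chunk"] field, and the rubric wording are all illustrative:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Citation-faithfulness judge from a different model family than the system under test.
def judge_citation(case, output) -> tuple[float, str]:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: pin whichever cross-family judge you've calibrated
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Grade the following support answer.\n"
                f"Claim: {output['text']}\n"
                f"Cited chunk: {output['cited_chunk']}\n"
                "Does the chunk fully support the claim? Reply 'verdict: pass' or "
                "'verdict: fail' on the first line, then one sentence of reasoning."
            ),
        }],
    )
    reply = response.choices[0].message.content
    return (1.0 if reply.lower().startswith("verdict: pass") else 0.0), reply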

Calibrate your judge against humans

Before you trust a judge in CI, label 50 – 100 outputs by hand and check agreement. We aim for at least 85% agreement with the human label on a held-out set. If the judge disagrees more than that, the rubric is too vague. Tighten it, add examples, and re-test.
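The check itself is trivial once the labels exist; a sketch, assuming you've stored the human verdict and the judge's verdict for the same held-out cases:

# labels maps case_id -> (human_pass, judge_pass) for the held-out set.
def agreement_rate(labels: dict[str, tuple[bool, bool]]) -> float:
    matches = sum(1 for human, judge in labels.values() if human == judge)
    return matches / len(labels)

def check_judge(labels: dict[str, tuple[bool, bool]], min_agreement: float = 0.85) -> None:
    rate = agreement_rate(labels)
    if rate < min_agreement:
        raise ValueError(f"judge agrees with humans on only {rate:.0%}: tighten the rubric")

Raw agreement is a blunt instrument: if your pass rate is heavily skewed, check Cohen's kappa as well so the judge isn't just predicting the majority class.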

Run evals on every PR, not just before releases

The whole point is to catch regressions early. That means evals run in CI on every change to:

  • Prompts (yes, treat them as code — version, review, diff)
  • RAG config (chunk size, top-k, reranker on/off)
  • Model name or version pin
  • Tool schemas
  • Temperature, max_tokens, anything in the request

For cost control, we split into two tiers. A smoke suite of ~40 examples runs on every PR, takes under two minutes, and costs cents. A full suite of 400 – 800 examples runs nightly on main and before any production deploy. The smoke suite is sampled to cover every failure category at least 5 times so categorical regressions still surface.
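The stratified sampling for the smoke tier is the only part that needs any thought; a sketch, reusing the EvalCase shape from the harness above:

import random
from collections import defaultdict

# Smoke suite: at least `per_category` cases from every failure category,
# topped up at random to the overall budget.
def sample_smoke_suite(cases: list[EvalCase], per_category: int = 5,
                       budget: int = 40, seed: int = 42) -> list[EvalCase]:
    rng = random.Random(seed)
    by_category: dict[str, list[EvalCase]] = defaultdict(list)
    for case in cases:
        by_category[case.category].append(case)
    picked: list[EvalCase] = []
    for bucket in by_category.values():
        picked.extend(rng.sample(bucket, min(per_category, len(bucket))))
    picked_ids = {c.id for c in picked}
    remaining = [c for c in cases if c.id not in picked_ids]
    picked.extend(rng.sample(remaining, max(0, min(budget - len(picked), len(remaining)))))
    return picked

The CI wiring is then two steps: run the suite against the PR, diff it against main.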

# .github/workflows/evals.yml (sketch)
name: llm-evals
on: [pull_request]
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run smoke evals
        run: uv run python -m evals.run --suite smoke --baseline main
      - name: Fail on regression
        run: uv run python -m evals.compare --threshold 0.03

The --baseline main flag matters. You're not asking "did we pass an absolute bar?" — you're asking "did this PR make any category worse than main?" A 3% per-category regression threshold is a reasonable default; tighten it as your suite stabilises.
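What evals.compare does internally is just a per-category diff against the baseline's pass rates. The module name comes from the workflow above; the body below is a sketch:

# current and baseline map category -> pass rate for the same suite.
def compare_to_baseline(current: dict[str, float], baseline: dict[str, float],
                        threshold: float = 0.03) -> None:
    regressions = {
        category: baseline[category] - rate
        for category, rate in current.items()
        if category in baseline and baseline[category] - rate > threshold
    }
    if regressions:
        details = ", ".join(f"{c} -{d:.1%}" for c, d in regressions.items())
        raise SystemExit(f"per-category regression vs main: {details}")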

Watch the things that aren't accuracy

Quality is the headline metric, but the silent killers are usually elsewhere:

  • Cost per request: a prompt rewrite that adds 400 tokens of examples can quietly double your monthly bill.
  • P95 latency: a reranker that adds 800ms is fine for batch, fatal for chat.
  • Refusal rate: model upgrades sometimes make the model more cautious. Track it per category.
  • Output length: longer is not better. We've seen Gemini 2.x models drift toward 1.5x output length after minor version bumps, which breaks downstream parsers.

Log all of these alongside scores and chart them over time. A dashboard with seven sparklines beats a single "quality score" every day of the week.
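All of these fall out of the same EvalResult list the quality scores come from, so the rollup is a few lines; a sketch:

import statistics

# Operational rollup from a suite run: chart these next to the quality scores.
def operational_metrics(results: list[EvalResult]) -> dict[str, float]:
    latencies = sorted(r.latency_ms for r in results)
    p95_index = int(0.95 * (len(latencies) - 1))
    return {
        "cost_per_request_usd": statistics.mean(r.cost_usd for r in results),
        "p95_latency_ms": float(latencies[p95_index]),
        "mean_output_chars": statistics.mean(len(r.output) for r in results),
    }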

A war story: the silent Claude swap

We migrated a document-extraction pipeline from one Claude version to a newer one. Aggregate accuracy went up 1.2%. Ship it, right?

The per-category breakdown told a different story. Extraction accuracy on invoices climbed 4%. Extraction on handwritten forms dropped 9%. We'd never have caught it without category slicing, because handwritten forms were 15% of the dataset and got buried in the average. We rolled back, added a router that sent handwritten forms to the previous version, and shipped the upgrade for the other 85% of traffic.

That router still exists. It pays for itself every month.
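For what it's worth, the router is nothing clever; a sketch of the shape, with the document types and model pins as placeholders:

# Per-document-type model routing: hold back the categories that regressed.
MODEL_BY_DOC_TYPE = {
    "invoice": "claude-sonnet-4",             # upgraded path
    "handwritten_form": "claude-3-5-sonnet",  # held back until extraction recovers
}
DEFAULT_MODEL = "claude-sonnet-4"

def pick_model(doc_type: str) -> str:
    return MODEL_BY_DOC_TYPE.get(doc_type, DEFAULT_MODEL)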

Where we'd start

If you have nothing today, do this in order over two weeks:

  1. Write your failure taxonomy on a whiteboard. Six to eight categories.
  2. Hand-label 20 examples per category. Yes, by hand.
  3. Build the harness from the snippet above. Make it deterministic with a seed.
  4. Wire one code-based judge and one LLM judge. Calibrate the LLM one against your hand labels.
  5. Add the smoke suite to CI with a per-category regression threshold.
  6. Only then grow the dataset to 50+ per category.

The team that does steps 1 – 5 with 120 examples will catch more regressions than the team that skips to step 6 with 2,000. We've watched both happen. If you want help wiring this into an existing product, that's the kind of thing our AI engineering team does day-to-day.

#AI #LLMs #Evals #Engineering
