Evaluating LLM Agents: Building Eval Harnesses That Catch Real Regressions
Most LLM eval setups measure the wrong things and miss the regressions that actually break production. Here's how we build harnesses that catch silent failures before users do.

Most LLM eval setups we inherit from clients fall into two buckets: a spreadsheet of 30 prompts someone graded by hand six months ago, or a dashboard full of BLEU and ROUGE scores that nobody trusts. Neither catches the regressions that actually wake you up at 2am — the ones where an agent silently stops calling a tool, or a RAG pipeline starts hallucinating citations because someone bumped a chunk size.
This is the eval setup we wish more teams had before they shipped agents to production.
Why most LLM evals miss the failures that matter
The core problem: traditional NLP metrics measure surface similarity, but agent failures are usually structural. A response can score high on semantic similarity and still be wrong because it skipped a tool call, returned the right answer to the wrong question, or cited a document it never retrieved.
When we audit eval suites, the common gaps look like this:
- No trace-level assertions. Teams grade final outputs but never check the agent's intermediate steps.
- Static golden sets that rot. The 50 prompts written at launch don't reflect what users actually ask three months in.
- Judge models marking their own homework. Using GPT-4 to grade GPT-4 outputs hides whole classes of failure.
- No regression gating. Evals run nightly in a dashboard nobody opens, instead of blocking a deploy.
The fix isn't more metrics. It's evals that look like integration tests, not academic benchmarks.
The four layers of an eval harness that actually works
We structure agent evals in four layers, each catching a different failure mode.
Layer 1: Unit evals on prompts and tools
These are deterministic where possible. For each tool the agent can call, you have a small set of inputs with known correct outputs. For each prompt template, you have snapshot tests that fail loudly when someone edits the template without thinking.
This layer catches: broken tool schemas, prompt drift from "helpful" refactors, JSON parsing regressions.
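As a rough illustration of this layer, the sketch below pairs a snapshot check on a rendered prompt with a deterministic test for a toy tool. The template text, `render_prompt`, `check_snapshot`, and `parse_order_id` are all hypothetical names invented for this example; a real setup would likely persist snapshots to a file in the repo rather than an in-memory dict.

```python
import re
from typing import Optional

# Hypothetical prompt template, used only for illustration.
REFUND_PROMPT = (
    "You are a support agent. Always call get_order before discussing refunds.\n"
    "User: {user_message}"
)


def render_prompt(user_message: str) -> str:
    """Fill the template with a user message."""
    return REFUND_PROMPT.format(user_message=user_message)


def check_snapshot(name: str, rendered: str, snapshots: dict) -> bool:
    """Return True if `rendered` matches the stored snapshot.

    The first time a name is seen, the text is recorded and accepted.
    """
    if name not in snapshots:
        snapshots[name] = rendered
        return True
    return snapshots[name] == rendered


def parse_order_id(text: str) -> Optional[str]:
    """Toy deterministic tool: pull an order ID such as 'A1234' out of free text."""
    match = re.search(r"#([A-Z]\d{4})\b", text)
    return match.group(1) if match else None
```

Any edit to the template changes the rendered text, so the snapshot check fails until someone deliberately updates the stored copy, which is the "fail loudly" behavior described above.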
Layer 2: Trace-based assertions on full agent runs
This is the layer most teams skip, and it's the one that pays off most. Instead of grading the final answer, you assert on the trace: which tools were called, in what order, with what arguments.
def test_refund_agent_checks_order_before_refunding(agent, trace_recorder):
    result = agent.run(
        "I want a refund for order #A1234",
        recorder=trace_recorder,
    )
    tool_calls = trace_recorder.tool_calls()

    # Must look up the order before issuing a refund
    assert tool_calls[0].name == "get_order"
    assert tool_calls[0].args["order_id"] == "A1234"

    # Must not refund without checking eligibility
    refund_calls = [c for c in tool_calls if c.name == "issue_refund"]
    if refund_calls:
        eligibility_calls = [c for c in tool_calls if c.name == "check_refund_eligibility"]
        assert eligibility_calls, "Refund issued without eligibility check"
        assert eligibility_calls[0].timestamp < refund_calls[0].timestamp
These tests don't care about wording. They care about whether the agent did the right thing. They survive prompt rewrites, model swaps, and temperature changes — which is exactly what you want from a regression test.
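The test above assumes a recorder object that exposes `tool_calls()`, where each call has a `name`, `args`, and `timestamp`. One hypothetical shape for such a recorder is sketched below; the `ToolCall` and `TraceRecorder` classes and the `wrap` method are assumptions made for illustration, not the API of any particular framework.

```python
import itertools
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any]
    timestamp: int  # monotonic sequence number, not wall-clock time


class TraceRecorder:
    """Records every tool invocation in the order it happened."""

    def __init__(self) -> None:
        self._calls: List[ToolCall] = []
        self._clock = itertools.count()

    def wrap(self, tool_name: str, fn: Callable[..., Any]) -> Callable[..., Any]:
        """Return a version of `fn` that logs its keyword arguments before running."""
        def wrapped(**kwargs: Any) -> Any:
            self._calls.append(
                ToolCall(name=tool_name, args=dict(kwargs), timestamp=next(self._clock))
            )
            return fn(**kwargs)
        return wrapped

    def tool_calls(self) -> List[ToolCall]:
        """Return a copy of the recorded calls, oldest first."""
        return list(self._calls)
```

Using a sequence counter instead of wall-clock time keeps ordering assertions deterministic even when two calls land in the same millisecond.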
Layer 3: Judge-model evals on output quality
This is where you grade subjective qualities: tone, completeness, faithfulness to retrieved context. The non-obvious rule: the judge should be a different model family than the one being evaluated, and ideally a stronger one.
If you're evaluating a Claude Haiku agent, judge with a GPT-4-class or Gemini model rather than another Claude model. If you're evaluating Gemini Flash, judge with something outside the Gemini family. Same-family judges share blind spots. Self-preference bias in LLM judges is a known risk, and we've seen it bite teams who used a model to grade itself.
Keep judge prompts narrow. Don't ask "is this a good answer?" Ask:
- "Does this response cite at least one source from the provided context? Yes/No."
- "Does this response contradict any statement in the provided context? Yes/No, with quote."
- "On a 1–3 scale, how directly does this answer address the user's literal question?"
Narrow binary or 3-point questions are far more reliable than open-ended scoring. In our experience, judge models don't separate adjacent points on a 5-point Likert scale consistently, so the extra granularity is mostly noise.
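A narrow binary judge can be sketched as a small wrapper that builds one focused prompt and refuses to guess when the reply is not a clean verdict. Everything here is an assumption for illustration: `call_judge` stands in for whatever client function sends a prompt to your judge model and returns its text, and the prompt layout is just one plausible choice.

```python
from typing import Callable, Optional

CITATION_QUESTION = (
    "Does this response cite at least one source from the provided context? "
    "Answer with exactly one word: Yes or No."
)


def build_judge_prompt(question: str, context: str, response: str) -> str:
    """Assemble a single-question judge prompt."""
    return (
        f"{question}\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"<response>\n{response}\n</response>"
    )


def parse_yes_no(raw: str) -> Optional[bool]:
    """Map the judge's reply to True/False, or None if it is not a clean verdict."""
    verdict = raw.strip().strip(".").lower()
    if verdict == "yes":
        return True
    if verdict == "no":
        return False
    return None


def judge_binary(
    question: str,
    context: str,
    response: str,
    call_judge: Callable[[str], str],  # hypothetical: sends a prompt, returns text
) -> Optional[bool]:
    """Ask one narrow question and return a strict boolean verdict (or None)."""
    return parse_yes_no(call_judge(build_judge_prompt(question, context, response)))
```

Returning `None` for anything other than a bare "Yes" or "No" makes unparseable judge replies visible as their own bucket instead of silently counting them as passes or failures.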
Layer 4: Production replay and shadow evals
Once you're live, your best eval set is real traffic. Sample production traces (with PII scrubbed), replay them against candidate prompts or models, and diff the behavior.
This is what catches the drift no synthetic eval ever will: new question patterns, new edge cases, new ways users misspell your product name. We typically sample 1–5% of production traces into an evals store, tag them, and add the interesting ones to the permanent golden set.
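The sampling and diffing steps could look roughly like the sketch below. It assumes each trace has a stable string ID and that a replay yields an ordered list of tool names per trace; `should_sample`, `diff_replay`, and `ReplayDiff` are hypothetical names, and PII scrubbing is deliberately left out.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically pick roughly `rate` of traces, based on a hash of the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


@dataclass(frozen=True)
class ReplayDiff:
    trace_id: str
    baseline_tools: Tuple[str, ...]
    candidate_tools: Tuple[str, ...]


def diff_replay(
    baseline: Dict[str, Sequence[str]],
    candidate: Dict[str, Sequence[str]],
) -> List[ReplayDiff]:
    """Return the traces whose tool-call sequence changed between two runs.

    Both arguments map trace_id -> ordered tool names. Traces missing from
    the candidate run are skipped rather than reported as changed.
    """
    diffs = []
    for trace_id, base_tools in baseline.items():
        cand_tools = candidate.get(trace_id)
        if cand_tools is None:
            continue
        if list(base_tools) != list(cand_tools):
            diffs.append(ReplayDiff(trace_id, tuple(base_tools), tuple(cand_tools)))
    return diffs
```

Hash-based sampling means the same trace is always in or out of the sample, so reruns of the replay compare like with like.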
Building your golden set without losing your mind
The golden set is the spine of the whole system. A few rules we've learned the hard way:
Stratify, don't randomize. Your golden set should over-represent failure-prone cases: ambiguous queries, multi-step tasks, edge-case tool combinations, adversarial inputs. If 80% of your set is happy-path questions, you're measuring how good your agent is at the easy stuff.
Version it like code. Golden sets live in the repo, with a changelog. When you add cases from production, the PR explains why. When you remove cases, the PR explains why harder.
Tag every case. Tags like tool:refund, intent:complaint, difficulty:hard, regression:2026-03-bug-491 let you slice eval results meaningfully. Aggregate pass rates lie. Per-tag pass rates tell you where the regression actually landed.
Cap the size, then raise the bar. A 200-case set you run on every PR beats a 2,000-case set you run once a week. Start small, keep iteration fast, expand only when you have a real reason.
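To make the point about per-tag pass rates concrete, here is a minimal sketch that slices results by tag. It assumes each eval result is a `(tags, passed)` pair; the function name and the sample data are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pass_rates_by_tag(results: Iterable[Tuple[List[str], bool]]) -> Dict[str, float]:
    """Compute the pass rate for every tag that appears in the results."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for tags, passed in results:
        for tag in tags:
            totals[tag] += 1
            passes[tag] += int(passed)
    return {tag: passes[tag] / totals[tag] for tag in totals}


# Illustrative data: three of four cases pass, but every failure sits in one slice.
results = [
    (["tool:refund", "difficulty:hard"], False),
    (["tool:refund", "difficulty:easy"], True),
    (["intent:complaint", "difficulty:easy"], True),
    (["intent:complaint", "difficulty:easy"], True),
]
rates = pass_rates_by_tag(results)
```

In this toy data the aggregate pass rate is 75%, while `difficulty:hard` sits at 0% and `tool:refund` at 50%, which is the kind of gap an aggregate number hides.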
Wiring evals into CI without bankrupting yourself
Running the full suite on every commit gets expensive fast. We tier it:
- On every PR: Layer 1 (unit) plus a 20–40 case smoke subset of Layer 2. Should finish in under five minutes and cost cents, not dollars.
- On merge to main: Full Layer 2 trace assertions plus Layer 3 judge evals on the full golden set.
- Nightly: Layer 4 production replay on a fresh sample.
- Before any model or prompt change: Full suite, with a required pass-rate delta.
For the gating logic, don't require a 100% pass rate: judge models are noisy, and you'll spend your life chasing flakes. Instead, gate on regression: did this PR drop the pass rate by more than X percentage points, overall or on any single tag? If yes, block. If no, ship.
# .github/workflows/llm-evals.yml (sketch)
eval_gate:
  smoke:
    cases: 30
    max_regression_pct: 0
    required: true
  full:
    cases: 200
    max_regression_pct: 3
    per_tag_max_regression_pct: 10
    required_on: [main, release/*]
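The check behind a config like this might be sketched as follows. It assumes the two thresholds are percentage points of pass rate, and that each run is summarized as `{"overall": rate, "tags": {tag: rate}}` with rates between 0 and 1; the function name and that data shape are assumptions, not part of any specific eval framework.

```python
from typing import Dict, List

EPSILON = 1e-9  # tolerance so float rounding does not trip a threshold


def regression_gate(
    baseline: Dict,
    candidate: Dict,
    max_regression_pct: float,
    per_tag_max_regression_pct: float,
) -> List[str]:
    """Return a list of gate failures; an empty list means the change may ship."""
    failures = []

    overall_drop = (baseline["overall"] - candidate["overall"]) * 100
    if overall_drop > max_regression_pct + EPSILON:
        failures.append(f"overall: dropped {overall_drop:.1f} points")

    for tag, base_rate in baseline["tags"].items():
        cand_rate = candidate["tags"].get(tag)
        if cand_rate is None:
            # A tag that vanished from the run is treated as a failure, not a pass.
            failures.append(f"{tag}: no results in candidate run")
            continue
        drop = (base_rate - cand_rate) * 100
        if drop > per_tag_max_regression_pct + EPSILON:
            failures.append(f"{tag}: dropped {drop:.1f} points")

    return failures
```

Improvements produce a negative drop and never fail the gate, so the check only blocks on movement in the wrong direction.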
Cache aggressively. If a prompt template hasn't changed and the model version hasn't changed, you can reuse the previous run's outputs. Most eval frameworks support content-hash-based caching now, and it's the difference between a $5 CI run and a $200 one.
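A content-hash cache key can be sketched in a few lines. This is an illustration under assumptions: the key fields, `eval_cache_key`, and `cached_run` are hypothetical, and reusing outputs like this only makes sense when generation is configured to be as deterministic as your provider allows.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Optional


def eval_cache_key(
    prompt_template: str,
    model_version: str,
    case_input: str,
    params: Optional[Dict[str, Any]] = None,
) -> str:
    """Hash everything that can change the output into one stable key."""
    payload = json.dumps(
        {
            "template": prompt_template,
            "model": model_version,
            "input": case_input,
            "params": params or {},
        },
        sort_keys=True,  # key order must not affect the hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_run(key: str, cache: Dict[str, Any], run_fn: Callable[[], Any]) -> Any:
    """Return the cached output for `key`, calling `run_fn` only on a miss."""
    if key not in cache:
        cache[key] = run_fn()
    return cache[key]
```

Because the template text and model version are part of the key, editing either one invalidates the affected entries automatically.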
Common traps we still see in 2026
A few patterns that look reasonable and aren't:
- Grading on exact string match for anything generative. You'll either rewrite the golden set every week or accept false failures forever.
- One giant judge prompt that grades ten dimensions at once. Split it. Each dimension gets its own call. Cheaper to debug, more reliable scores.
- Treating eval pass rate as a product KPI. It's a regression signal, not a quality measure. A 95% pass rate on a weak golden set means nothing.
- No human review loop. Even with great judge models, you need a human eyeballing a sample of judge decisions weekly. Judges drift too, especially after model updates from the vendor.
- Evaluating only the agent, never the retrieval. For RAG systems, retrieval quality (recall@k, citation precision) needs its own eval track. The generator can only be as good as what you hand it.
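For the retrieval track mentioned in the last item, the two metrics can be sketched as below. Definitions of citation precision vary; this version assumes it means the fraction of cited document IDs that were actually in the retrieved set, which targets the "cited a document it never retrieved" failure. Function names and ID formats are illustrative.

```python
from typing import Optional, Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Sequence[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def citation_precision(
    cited_ids: Sequence[str], retrieved_ids: Sequence[str]
) -> Optional[float]:
    """Fraction of cited documents that were actually retrieved.

    Returns None when the response cites nothing, since precision is undefined.
    """
    cited = set(cited_ids)
    if not cited:
        return None
    return len(cited & set(retrieved_ids)) / len(cited)
```

Tracking these separately from generator-side evals shows whether a bad answer came from missing context or from the model mishandling good context.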
Where we'd start
If you're staring at a half-broken eval setup on Monday, do this in order:
1. Pick your top three failure modes from the last quarter of incidents. Write trace-based assertions for each. That's your Layer 2 starter kit.
2. Build a 50-case stratified golden set, tagged by intent and difficulty. Version it in the repo.
3. Wire a smoke subset into PR checks with a hard zero-regression gate. Get the team used to the muscle memory of "eval failed, look at trace."
4. Add a judge-model layer for the one or two quality dimensions that matter most for your product. Use a different model family than your agent.
5. Set up production trace sampling so next quarter's golden set writes itself.
Evals aren't glamorous, but they're the thing that lets you change models, refactor prompts, and ship agent updates without holding your breath. If you want a hand setting this up for your stack, our team does this work as part of our AI engineering services.