Evaluating Agents in CI: Building Regression Tests That Catch Real Failures
Unit tests don't catch agent regressions. Here's how to build an eval harness that runs in CI, fails fast on real breakages, and doesn't bankrupt you on token spend.
Most teams ship LLM agents the same way they ship prototypes: a few manual prompts in a notebook, a thumbs-up from the PM, merge to main. Then a model update lands, a tool signature shifts, and a customer-facing agent starts confidently booking the wrong meeting. The fix isn't more prompt engineering — it's treating agent behaviour like any other regression surface and wiring evals into CI.
This is a write-up of what we've found works (and what doesn't) when you actually try to do that on a real product timeline.
Why traditional tests miss agent regressions
A unit test asserts that add(2, 2) == 4. An agent test has to assert that a non-deterministic policy, calling non-deterministic tools, produced a reasonable outcome across a distribution of inputs. Three things break the usual testing playbook:
- Non-determinism. Even at temperature=0, providers occasionally return different tokens for the same prompt. Anthropic and OpenAI both document this in their API references; determinism is best-effort, not guaranteed.
- Drift. Models get silently updated. gpt-4o in March is not gpt-4o in October. Pinning to dated snapshots helps but doesn't eliminate it.
- Compound failure surface. An agent has prompts, tools, retrieval, memory, and routing. A regression in any one shows up as a weird final answer.
So "does the test pass" becomes "does the agent still behave within an acceptable envelope." That's a different shape of test.
The four layers of an agent eval suite
We split eval suites into four layers, and we run them at different points in the pipeline because they have wildly different cost profiles.
1. Deterministic assertions
These are cheap, fast, and should run on every PR. No model calls. You're testing the scaffolding around the model:
- Tool schemas validate against JSON Schema
- Prompt templates render without missing variables
- Retrieval returns the expected document IDs for known queries (using a frozen embedding index)
- Output parsers handle malformed JSON gracefully
If any of these break, you don't need a model to tell you something is wrong.
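Two of the checks above can be sketched with the standard library alone. The template text, its variable names, and the `parse_agent_output` helper are hypothetical stand-ins; the point is only that neither check needs a model call.

```python
import json
from string import Template

# Hypothetical prompt template; a real suite would load templates from the repo.
SUPPORT_PROMPT = Template("You are a support agent for $product. Customer says: $message")


def render_prompt(template: Template, **variables: str) -> str:
    """Render a template; substitute() raises KeyError if a placeholder is unfilled."""
    return template.substitute(**variables)


def parse_agent_output(raw: str) -> dict:
    """Parse model output as a JSON object, returning an error record instead of raising."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"invalid JSON: {exc.msg}"}
    if not isinstance(parsed, dict):
        return {"ok": False, "error": "expected a JSON object"}
    return {"ok": True, "data": parsed}
```

A test for the template asserts that rendering with a missing variable raises, and a test for the parser feeds it truncated JSON and checks for the error record rather than a crash.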
2. Golden trace replays
For each critical user journey, you capture a real production trace — the full sequence of model calls, tool invocations, and final output. In CI, you replay the inputs and compare the new output against the golden one using a judge.
This is where most of the signal lives. A typical suite has 30–80 traces covering happy paths, known edge cases, and previously fixed bugs.
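As a rough illustration, a golden trace could be stored as a small JSON record and loaded into typed objects. The field names here are assumptions for the sketch, not a fixed format.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict
    response: dict


@dataclass
class GoldenTrace:
    id: str
    model_snapshot: str  # model ID the trace was captured against
    user_input: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_output: str = ""


def load_trace(raw: str) -> GoldenTrace:
    """Build a GoldenTrace from its JSON form, rebuilding nested tool calls."""
    data = json.loads(raw)
    calls = [ToolCall(**call) for call in data.pop("tool_calls", [])]
    return GoldenTrace(tool_calls=calls, **data)
```

Keeping the tool calls and their recorded responses inside the trace means the same file can feed both the replay and the comparison step.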
3. LLM-as-judge scoring
For open-ended outputs (summaries, draft emails, support replies), you can't string-match. You use a stronger model to score the candidate output against rubric criteria. Anthropic's docs on building evals and OpenAI's Evals framework both push this pattern, and it works — with caveats we'll get to.
4. Adversarial and safety probes
A small set of inputs designed to break the agent: prompt injections, ambiguous instructions, out-of-scope requests, jailbreak attempts. These don't need to pass perfectly, but you want to know when the failure mode changes.
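One way to notice a changed failure mode is to label each probe's outcome and diff the labels against a recorded baseline. This is a sketch: the labels ("refused", "complied", "deflected") and the probe IDs are hypothetical, and how you assign a label (a judge, a regex, a human) is left open.

```python
def failure_mode_changes(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """List probes whose outcome label differs from the recorded baseline."""
    changes = []
    for probe_id in sorted(baseline.keys() | current.keys()):
        before = baseline.get(probe_id, "missing")
        after = current.get(probe_id, "missing")
        if before != after:
            changes.append(f"{probe_id}: {before} -> {after}")
    return changes


baseline = {"inject-01": "refused", "scope-02": "deflected"}
current = {"inject-01": "complied", "scope-02": "deflected"}
print(failure_mode_changes(baseline, current))
```

A probe that was already failing and still fails the same way produces no output; only the shifts are reported.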
A concrete CI setup
Here's a stripped-down version of what we actually run. The harness is a Python script invoked from GitHub Actions:
import asyncio
import json
import os
import sys
from pathlib import Path

from anthropic import AsyncAnthropic

# `agent` is the application's own agent object (import not shown). It is
# assumed to expose an async invoke() returning an object with .output and .cost.

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
JUDGE_MODEL = "claude-sonnet-4-5"
AGENT_MODEL = os.environ["AGENT_MODEL"]  # model under test; wiring into the agent not shown


async def run_case(case: dict) -> dict:
    """Run one case through the agent, then score the result with the judge."""
    result = await agent.invoke(case["input"])
    score = await judge(case, result)
    return {
        "id": case["id"],
        "passed": score["score"] >= case.get("threshold", 0.8),
        "score": score["score"],
        "reason": score["reason"],
        "cost_usd": result.cost,
    }


async def judge(case, result):
    """Ask the judge model for a JSON verdict on the agent's output."""
    prompt = f"""Rubric: {case['rubric']}
Expected behaviour: {case['expected']}
Actual output: {result.output}
Return JSON: {{"score": 0.0-1.0, "reason": "..."}}"""
    resp = await client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=400,
        messages=[{"role": "user", "content": prompt}],
    )
    # json.loads raises if the judge wraps its JSON in prose.
    return json.loads(resp.content[0].text)


async def main():
    cases = [json.loads(p.read_text()) for p in Path("evals/cases").glob("*.json")]
    results = await asyncio.gather(*(run_case(c) for c in cases))
    failed = [r for r in results if not r["passed"]]
    total_cost = sum(r["cost_usd"] for r in results)
    print(f"Passed: {len(results) - len(failed)}/{len(results)} | Cost: ${total_cost:.2f}")
    if failed:
        for failure in failed:
            print(f"FAIL {failure['id']}: {failure['reason']}")
        sys.exit(1)  # non-zero exit fails the CI job


asyncio.run(main())
A few things worth calling out:
- Cases are flat JSON files, one per scenario, version-controlled with the code. PMs can read and edit them.
- The judge model is different from, and stronger than, the agent model. Using the same model to judge itself is a known failure pattern: it tends to rate its own outputs generously.
- Cost is tracked per run and surfaced in the job summary. A surprise 10x in eval cost usually means someone added a long-context case without thinking.
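For reference, a case file might look like the JSON below. The keys match what the harness reads (`id`, `input`, `rubric`, `expected`, and an optional `threshold`); the content itself and the `validate_case` helper are hypothetical.

```python
import json

REQUIRED_KEYS = {"id", "input", "rubric", "expected"}

# Hypothetical contents of a file such as evals/cases/support-refund-001.json
EXAMPLE_CASE = """
{
  "id": "support-refund-001",
  "input": "I was charged twice for my subscription. Can you fix it?",
  "rubric": "Acknowledges the double charge, offers a refund, invents no policy.",
  "expected": "Agent looks up the account and offers to refund the duplicate charge.",
  "threshold": 0.8
}
"""


def validate_case(case: dict) -> list[str]:
    """Return a list of problems with a case; an empty list means it is usable."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - case.keys())]
    threshold = case.get("threshold", 0.8)
    if not isinstance(threshold, (int, float)) or not 0.0 <= threshold <= 1.0:
        problems.append("threshold must be a number between 0 and 1")
    return problems
```

Running a validator like this as a deterministic check keeps a hand-edited case from crashing the harness halfway through a paid run.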
Cost guardrails that actually hold
A serious eval suite can easily cost $5–$20 per CI run if you're not careful. On a busy repo that adds up fast. Three guardrails we won't ship without:
- Tiered runs. Layer 1 (deterministic) on every push. Layers 2–3 on PRs touching agent/, prompts/, or retrieval/. Full suite including adversarial on nightly main.
- Prompt caching on the judge. The judge prompt is mostly static rubric text. Anthropic's prompt caching can cut judge costs by 60–80% in our experience when the rubric is the cacheable prefix.
- A hard budget. The harness aborts if projected cost exceeds a ceiling. Better a failed CI job than a $400 surprise.
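The three guardrails can be sketched in a few lines each. Everything named here is an assumption for illustration: the event names, the `layers_for`, `judge_request`, and `Budget` helpers, and the projection rule (mean cost so far times total cases). The `cache_control` field follows Anthropic's prompt caching format; a prefix shorter than the model's minimum cacheable length is simply not cached.

```python
AGENT_PREFIXES = ("agent/", "prompts/", "retrieval/")


def layers_for(event: str, changed_files: list[str]) -> list[int]:
    """Map a CI event (hypothetical names: push, pull_request, nightly) to eval layers."""
    if event == "nightly":
        return [1, 2, 3, 4]
    touches_agent = any(path.startswith(AGENT_PREFIXES) for path in changed_files)
    if event == "pull_request" and touches_agent:
        return [1, 2, 3]
    return [1]


def judge_request(model: str, rubric: str, case_text: str) -> dict:
    """Build kwargs for messages.create with the static rubric as a cacheable prefix."""
    return {
        "model": model,
        "max_tokens": 400,
        "system": [{"type": "text", "text": rubric, "cache_control": {"type": "ephemeral"}}],
        "messages": [{"role": "user", "content": case_text}],
    }


class BudgetExceeded(RuntimeError):
    pass


class Budget:
    """Abort the run once the projected total cost passes a ceiling."""

    def __init__(self, ceiling_usd: float, total_cases: int):
        self.ceiling_usd = ceiling_usd
        self.total_cases = total_cases
        self.spent_usd = 0.0
        self.completed = 0

    def record(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd
        self.completed += 1
        # Projection: average cost per finished case, scaled to the whole suite.
        projected = self.spent_usd / self.completed * self.total_cases
        if projected > self.ceiling_usd:
            raise BudgetExceeded(f"projected ${projected:.2f} exceeds ${self.ceiling_usd:.2f}")
```

A budget check like this only saves money if cases run sequentially or in bounded batches; when every case is launched at once, the spend has already happened by the time the first cost is recorded.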
Where evals flake, and what to do about it
The single biggest reason teams abandon eval suites is flakiness. A test that fails 1 in 5 runs for no reason erodes trust faster than no test at all.
Judge variance
LLM judges are noisy: the same output can score differently from one call to the next. When a first score lands within 0.1 of the threshold, run the judge two more times and take the median of the three. It roughly triples judge cost on those borderline cases but eliminates most flake.
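A minimal sketch of the median rule, assuming `judge_once` is any zero-argument async callable that returns a score between 0 and 1 (a hypothetical wrapper around your judge call):

```python
import asyncio
from statistics import median
from typing import Awaitable, Callable


async def stable_score(
    judge_once: Callable[[], Awaitable[float]],
    threshold: float,
    margin: float = 0.1,
) -> float:
    """Score once; if the result is borderline, re-judge twice and return the median."""
    first = await judge_once()
    if abs(first - threshold) > margin:
        return first  # clearly above or below the threshold: one call is enough
    extra = await asyncio.gather(judge_once(), judge_once())
    return median([first, *extra])
```

Clear passes and clear fails still cost one judge call, so the extra spend is confined to the cases where variance could flip the verdict.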
Threshold tuning
Don't pick 0.8 because it sounds right. For each case, run the current agent 10 times, record the score distribution, and set the threshold at roughly the 10th percentile. Now a fail genuinely means "this is worse than nearly every recent baseline run," not "the judge had a bad day."
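As a sketch, the per-case threshold can be derived with `statistics.quantiles` from the standard library. The scores below are made up, and the exact percentile method is a choice rather than a requirement.

```python
from statistics import quantiles


def baseline_threshold(scores: list[float]) -> float:
    """Return the approximate 10th percentile of a case's baseline judge scores."""
    if len(scores) < 2:
        raise ValueError("need at least two baseline scores")
    # n=10 yields nine cut points; the first is the 10th percentile.
    return quantiles(scores, n=10)[0]


# Hypothetical scores from ten runs of the current agent on one case.
baseline_scores = [0.90, 0.85, 0.70, 0.88, 0.92, 0.80, 0.95, 0.87, 0.90, 0.83]
threshold = baseline_threshold(baseline_scores)
```

With only ten samples the result sits just above the lowest observed score, which matches the intent: one unlucky run should not fail the case, but a run below everything seen recently should.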
Tool and network flake
Mock external tools in CI. Replay recorded responses. The eval is testing the agent's reasoning, not whether Stripe's sandbox is up. Keep a separate, smaller integration suite for live-tool checks.
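Replaying recorded responses can be as simple as a lookup keyed by tool name and arguments. The `ReplayTools` class and the recording format are hypothetical; how the stub is injected into the agent depends on your framework.

```python
import json


class ReplayMiss(KeyError):
    """Raised when the agent makes a tool call that was never recorded."""


class ReplayTools:
    def __init__(self, recordings: list[dict]):
        self._responses = {
            self._key(rec["tool"], rec["arguments"]): rec["response"] for rec in recordings
        }

    @staticmethod
    def _key(tool: str, arguments: dict) -> str:
        # sort_keys makes the key independent of argument order.
        return f"{tool}:{json.dumps(arguments, sort_keys=True)}"

    def call(self, tool: str, arguments: dict) -> dict:
        key = self._key(tool, arguments)
        if key not in self._responses:
            raise ReplayMiss(key)
        return self._responses[key]
```

Raising on an unrecorded call is deliberate: an agent that starts calling a tool with new arguments is itself a behaviour change worth surfacing.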
What to put in version control
This matters more than people think. Treat the eval suite as a first-class artifact:
- Eval cases (evals/cases/*.json)
- Rubrics, separately from cases so they can be reused
- Golden traces with the model snapshot ID they were captured against
- A BASELINE.md recording current pass rates per model, refreshed on every release
When a model snapshot is deprecated (and it will be — see OpenAI's and Anthropic's model lifecycle pages), you have a known-good reference point for the migration.
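One way to use that reference point during a migration is to compare per-journey pass rates for the candidate model against the recorded baseline. This is a sketch: the journey names, the numbers, and the 2-point tolerance are all hypothetical.

```python
def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 2.0,
) -> dict[str, float]:
    """Return journeys whose pass rate fell by more than `tolerance` percentage points."""
    drops = {}
    for journey, before in baseline.items():
        after = candidate.get(journey)
        if after is None:
            continue  # journey not run against the candidate model
        drop = before - after
        if drop > tolerance:
            drops[journey] = round(drop, 1)
    return drops


baseline_rates = {"customer_support": 94.0, "meeting_booking": 90.0}
candidate_rates = {"customer_support": 88.0, "meeting_booking": 89.5}
print(regressions(baseline_rates, candidate_rates))
```

A small tolerance keeps ordinary run-to-run noise from blocking an upgrade while still flagging a real drop on a single journey.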
When evals tell you to not upgrade
This is the payoff. A new model drops and the marketing benchmarks look great, but your suite shows a 6-point drop on the customer support journey because the new model is more verbose and ignores a formatting instruction in your system prompt. Without the suite, you'd have shipped it and found out from a customer.
We've held back at least three model upgrades on client projects in the last year because the eval suite flagged regressions the vendor benchmarks didn't capture. That's the entire business case in one sentence.
Where we'd start
If you're staring at an agent in production with no evals, don't try to build all four layers at once. Do this:
- Pick the one user journey that would hurt most if it broke. Capture 10 real traces.
- Write a one-paragraph rubric for what "good" looks like on that journey.
- Wire up a judge with a stronger model than the agent uses. Run nightly, not on every PR yet.
- Watch it for two weeks. Tune thresholds. Kill flaky cases.
- Only then expand coverage and move it onto PR triggers.
The failure mode isn't "no evals." It's "ambitious eval suite that nobody trusts." Start narrow, earn the trust, then widen. If you want help wiring this into an existing product, our team does this kind of work — see our AI services for the shape of engagements we take on.
