Eval Harnesses for AI Agents: Stop Shipping by Vibes
If you can't measure your agent, you can't trust it. This is the evaluation harness I use to keep agents honest in production.

Most agents ship with a demo and a prayer. That works until the first customer asks a real question.
I don't trust vibes. I trust evidence. If an agent can't pass a repeatable eval, it doesn't ship.
The fastest way to lose trust is inconsistency. Evals are how I make “it worked yesterday” mean something.
Every agent needs a contract. Not a prompt. A contract.
Contract: A short list of what the agent must do, what it must never do, and how success is measured.
I keep contracts short and testable. If I can't write a concrete check for it, it's not a contract, it's a wish.
| Contract element | Example | Failure mode it prevents |
|---|---|---|
| Input expectations | “Accepts plain English requests, rejects JSON” | Silent prompt injection via tools |
| Output guarantees | “Always returns a single action + rationale” | Un-actionable responses |
| Safety constraints | “Never calls payments.refund without approval” | Unauthorized transactions |
| Latency budget | “p95 < 4s for tool calls” | Timeouts and retries |
I write the contract before I write the prompt. It keeps me honest and it makes evals trivial to design.
Rule of thumb: If a requirement needs three clauses, split it into a new agent or a new tool. Clarity beats cleverness.
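A contract like the one above can be expressed directly as code. This is a minimal sketch, assuming a simple `AgentOutput` shape; the field and check names are illustrative, not a real API:

```typescript
// Hypothetical output shape for one agent turn (names are assumptions).
type AgentOutput = {
  action: string;
  rationale: string;
  toolCalls: string[];
  latencyMs: number;
};

type Check = { name: string; pass: (out: AgentOutput) => boolean };

// Each contract element maps to exactly one concrete, deterministic check.
const contract: Check[] = [
  {
    name: "single action + rationale",
    pass: (o) => o.action.length > 0 && o.rationale.length > 0,
  },
  {
    name: "never calls payments.refund",
    pass: (o) => !o.toolCalls.includes("payments.refund"),
  },
  {
    name: "latency budget",
    pass: (o) => o.latencyMs < 4000,
  },
];

// Returns the names of violated contract elements; empty array means pass.
function checkContract(out: AgentOutput): string[] {
  return contract.filter((c) => !c.pass(out)).map((c) => c.name);
}
```

If a requirement can't be written as a `pass` function like these, it goes back to the drawing board until it can.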
Demos are curated. Goldens are brutal. I want inputs that break the system, not flatter it.
Goldens: A versioned dataset of requests with expected behaviors, labeled by outcome.
I pull goldens from real logs, support tickets, and incident reports. I add adversarial cases on purpose: confusing phrasing, partial info, conflicting instructions.
```yaml
# eval-suite.yaml
suite: agent-core-v1
cases:
  - id: refund-after-30-days
    input: "Refund my last order from November"
    expected: "route_to_human"
    reason: "policy constraint"
  - id: bug-report-triage
    input: "App crashes on launch on iOS 17"
    expected: "create_ticket"
    required_fields: ["device", "steps", "severity"]
  - id: pricing-question
    input: "What's the enterprise price for 200 seats?"
    expected: "quote_request"
```
I keep 50 to 200 of these per agent. New bugs become new goldens. This is the flywheel.
Bias check: If the suite only contains happy paths, you're training yourself to ignore the real world.
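The runner for a suite like this can stay very small. A sketch, with `runAgent` as a stand-in for the real (likely async) agent call:

```typescript
// One golden case: an input and the behavior we expect back.
type Golden = { id: string; input: string; expected: string };

// Runs each golden through the agent and buckets ids by outcome.
// `runAgent` is a placeholder for your actual agent invocation.
function runGoldens(
  goldens: Golden[],
  runAgent: (input: string) => string,
): { passed: string[]; failed: string[] } {
  const passed: string[] = [];
  const failed: string[] = [];
  for (const g of goldens) {
    (runAgent(g.input) === g.expected ? passed : failed).push(g.id);
  }
  return { passed, failed };
}
```

The failed ids are the interesting part: each one is either a regression to fix or a golden to re-label.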
An eval harness that isn't in CI is a suggestion. Put it in the build.
Gate: Fail the deploy if the agent regresses beyond a threshold.
```yaml
# .github/workflows/agent-evals.yml
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pnpm install
      - run: pnpm eval:agent-core
      - run: pnpm eval:agent-tools
```
This is the same mindset as unit tests. The model can change, the prompt can change, but the behavior can't.
I also run canaries on model upgrades. If a new model improves creativity but fails routing, it doesn't ship. I separate “interesting” from “acceptable.”
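The gate itself can be one function. A sketch, assuming pass rates as fractions and a regression budget of two percentage points (the threshold is an assumption, tune it per agent):

```typescript
// Deploy gate: allow the release only if the current pass rate
// hasn't regressed past the budget relative to the baseline.
function gate(baseline: number, current: number, maxDrop = 0.02): boolean {
  return baseline - current <= maxDrop;
}
```

In CI this becomes an exit code: the eval script compares against the last known good release and fails the job when `gate` returns false.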
One judge is a single point of failure. I mix deterministic checks, LLM judges, and human samples.
Judge mix: Each judge covers a different blind spot.
| Judge | Good at | Bad at |
|---|---|---|
| Deterministic rules | Schema, tool calls, forbidden actions | Nuance and tone |
| LLM judge | Reasoning quality, relevance | Consistency without calibration |
| Human review | Edge cases, trust, ambiguity | Scale and speed |
I calibrate the LLM judge with a small set of human-labeled cases. If the judge can't agree with humans on 20 examples, it doesn't get to decide for 200.
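Calibration is just an agreement rate against human labels. A minimal sketch, with a 90% agreement bar as an assumed threshold:

```typescript
// Fraction of cases where the LLM judge's label matches the human label.
// Assumes both arrays are aligned by index over the same cases.
function judgeAgreement(human: string[], judge: string[]): number {
  const agree = human.filter((label, i) => label === judge[i]).length;
  return agree / human.length;
}

// The judge only gets voting rights once it clears the agreement bar.
function judgeIsCalibrated(human: string[], judge: string[], minAgreement = 0.9): boolean {
  return judgeAgreement(human, judge) >= minAgreement;
}
```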
Averages hide failures. I track slices and regressions by category.
Regression scorecard: Compare current results to the last known good release.
| Slice | Baseline pass rate | Current pass rate | Delta |
|---|---|---|---|
| Tool routing | 92% | 86% | -6% |
| Safety constraints | 98% | 98% | 0% |
| Long-form reasoning | 90% | 88% | -2% |
| Latency budget | 95% | 79% | -16% |
If one slice drops, I don't ship. I wrote about this same observability mindset in observability for AI agents.
Why slices matter: A small drop in a high-risk slice (refunds, permissions, data access) is worse than a big drop in a low-risk slice (tone, style).
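That risk weighting can be encoded directly in the gate. A sketch, assuming a zero-tolerance budget for high-risk slices and a looser five-point budget for low-risk ones (both thresholds are assumptions):

```typescript
// Per-slice result: baseline and current pass rates as fractions.
type SliceResult = { slice: string; baseline: number; current: number; highRisk: boolean };

// Returns the slices that should block the release.
// High-risk slices block on any drop; low-risk slices get a 5-point budget.
function blockingSlices(results: SliceResult[]): string[] {
  return results
    .filter((r) => r.baseline - r.current > (r.highRisk ? 0.0 : 0.05))
    .map((r) => r.slice);
}
```

Run against the scorecard above, a 6-point drop in a high-risk slice blocks the ship even while a 2-point drop in a low-risk slice passes.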
Full eval suites can be expensive. If it's too expensive, people skip it. So I tier them.
Tiers: A small suite runs daily on every commit, a medium suite on PRs, and the full suite on releases.

- Daily suite: 40 cases (fast, cheap)
- PR suite: 120 cases (realistic)
- Release suite: 300+ cases (full coverage)
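Tier selection can live in the eval entrypoint. A sketch, with trigger and suite names assumed for illustration:

```typescript
// Hypothetical CI triggers; map each to the suite it should run.
type Trigger = "commit" | "pull_request" | "release";

function suiteFor(trigger: Trigger): string {
  switch (trigger) {
    case "commit":
      return "daily-40"; // small and cheap: runs on every commit
    case "pull_request":
      return "pr-120"; // realistic coverage before merge
    case "release":
      return "release-300"; // full suite gates the ship
  }
}
```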
This keeps costs predictable, which matters just as much as accuracy (see the cost problem in AI).
I also cache tool outputs and reuse them across runs. If the tool call is deterministic, the eval should be too.
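The cache can be a plain memoizer around the tool call. A sketch for deterministic, single-string-argument tools; real tools would need a serialized key for structured arguments:

```typescript
// Wraps a deterministic tool so repeated eval runs hit the cache, not the tool.
function cachedTool<T>(tool: (arg: string) => T): (arg: string) => T {
  const cache = new Map<string, T>();
  return (arg) => {
    if (!cache.has(arg)) cache.set(arg, tool(arg));
    return cache.get(arg)!;
  };
}
```

Persist the map to disk between runs and identical eval cases cost nothing on re-run.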
Cheap evals are the only ones you will actually run every day.
A model without evals is just a demo with a deadline.