Eval Harnesses for AI Agents: Stop Shipping by Vibes
If you can't measure your agent, you can't trust it. This is the evaluation harness I use to keep agents honest in production.

Most agents ship with a demo and a prayer. That works until the first customer asks a real question.
I don't trust vibes. I trust evidence. If an agent can't pass a repeatable eval, it doesn't ship.
The fastest way to lose trust is inconsistency. Evals are how I make “it worked yesterday” mean something.
Every agent needs a contract. Not a prompt. A contract.
Contract: A short list of what the agent must do, what it must never do, and how success is measured.
I keep contracts short and testable. If I can't write a concrete check for it, it's not a contract, it's a wish.
| Contract element | Example | Failure mode it prevents |
|---|---|---|
| Input expectations | “Accepts plain English requests, rejects JSON” | Silent prompt injection via tools |
| Output guarantees | “Always returns a single action + rationale” | Un-actionable responses |
| Safety constraints | “Never calls payments.refund without approval” | Unauthorized transactions |
| Latency budget | “p95 < 4s for tool calls” | Timeouts and retries |
I write the contract before I write the prompt. It keeps me honest and it makes evals trivial to design.
Rule of thumb: If a requirement needs three clauses, split it into a new agent or a new tool. Clarity beats cleverness.
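A contract like the one above can be expressed directly as code. This is a minimal sketch, assuming a simple `AgentOutput` shape; the field and check names are illustrative, not a real API:

```typescript
// Hypothetical output shape for one agent turn (names are assumptions).
type AgentOutput = {
  action: string;
  rationale: string;
  toolCalls: string[];
  latencyMs: number;
};

type Check = { name: string; pass: (out: AgentOutput) => boolean };

// Each contract element maps to exactly one concrete, deterministic check.
const contract: Check[] = [
  {
    name: "single action + rationale",
    pass: (o) => o.action.length > 0 && o.rationale.length > 0,
  },
  {
    name: "never calls payments.refund",
    pass: (o) => !o.toolCalls.includes("payments.refund"),
  },
  {
    name: "latency budget",
    pass: (o) => o.latencyMs < 4000,
  },
];

// Returns the names of violated contract elements; empty array means pass.
function checkContract(out: AgentOutput): string[] {
  return contract.filter((c) => !c.pass(out)).map((c) => c.name);
}
```

If a requirement can't be written as a `pass` function like these, it goes back to the drawing board until it can.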
Demos are curated. Goldens are brutal. I want inputs that break the system, not flatter it.
Goldens: A versioned dataset of requests with expected behaviors, labeled by outcome.
I pull goldens from real logs, support tickets, and incident reports. I add adversarial cases on purpose: confusing phrasing, partial info, conflicting instructions.
```yaml
# eval-suite.yaml
suite: agent-core-v1
cases:
  - id: refund-after-30-days
    input: "Refund my last order from November"
    expected: "route_to_human"
    reason: "policy constraint"
  - id: bug-report-triage
    input: "App crashes on launch on iOS 17"
    expected: "create_ticket"
    required_fields: ["device", "steps", "severity"]
  - id: pricing-question
    input: "What's the enterprise price for 200 seats?"
    expected: "quote_request"
```
I keep 50 to 200 of these per agent. New bugs become new goldens. This is the flywheel.
Bias check: If the suite only contains happy paths, you're training yourself to ignore the real world.
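The runner for a suite like this can stay very small. A sketch, with `runAgent` as a stand-in for the real (likely async) agent call:

```typescript
// One golden case: an input and the behavior we expect back.
type Golden = { id: string; input: string; expected: string };

// Runs each golden through the agent and buckets ids by outcome.
// `runAgent` is a placeholder for your actual agent invocation.
function runGoldens(
  goldens: Golden[],
  runAgent: (input: string) => string,
): { passed: string[]; failed: string[] } {
  const passed: string[] = [];
  const failed: string[] = [];
  for (const g of goldens) {
    (runAgent(g.input) === g.expected ? passed : failed).push(g.id);
  }
  return { passed, failed };
}
```

The failed ids are the interesting part: each one is either a regression to fix or a golden to re-label.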
An eval harness that isn't in CI is a suggestion. Put it in the build.
Gate: Fail the deploy if the agent regresses beyond a threshold.
```yaml
# .github/workflows/agent-evals.yml
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pnpm install
      - run: pnpm eval:agent-core
      - run: pnpm eval:agent-tools
```
This is the same mindset as unit tests. The model can change, the prompt can change, but the behavior can't.
I also run canaries on model upgrades. If a new model improves creativity but fails routing, it doesn't ship. I separate “interesting” from “acceptable.”
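The gate itself can be one function. A sketch, assuming pass rates as fractions and a regression budget of two percentage points (the threshold is an assumption, tune it per agent):

```typescript
// Deploy gate: allow the release only if the current pass rate
// hasn't regressed past the budget relative to the baseline.
function gate(baseline: number, current: number, maxDrop = 0.02): boolean {
  return baseline - current <= maxDrop;
}
```

In CI this becomes an exit code: the eval script compares against the last known good release and fails the job when `gate` returns false.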
One judge is a single point of failure. I mix deterministic checks, LLM judges, and human samples.
Judge mix: Each judge covers a different blind spot.
| Judge | Good at | Bad at |
|---|---|---|
| Deterministic rules | Schema, tool calls, forbidden actions | Nuance and tone |
| LLM judge | Reasoning quality, relevance | Consistency without calibration |
| Human review | Edge cases, trust, ambiguity | Scale and speed |
I calibrate the LLM judge with a small set of human-labeled cases. If the judge can't agree with humans on 20 examples, it doesn't get to decide for 200.
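Calibration is just an agreement rate against human labels. A minimal sketch, with a 90% agreement bar as an assumed threshold:

```typescript
// Fraction of cases where the LLM judge's label matches the human label.
// Assumes both arrays are aligned by index over the same cases.
function judgeAgreement(human: string[], judge: string[]): number {
  const agree = human.filter((label, i) => label === judge[i]).length;
  return agree / human.length;
}

// The judge only gets voting rights once it clears the agreement bar.
function judgeIsCalibrated(human: string[], judge: string[], minAgreement = 0.9): boolean {
  return judgeAgreement(human, judge) >= minAgreement;
}
```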
Averages hide failures. I track slices and regressions by category.
Regression scorecard: Compare current results to the last known good release.
| Slice | Baseline pass rate | Current pass rate | Delta |
|---|---|---|---|
| Tool routing | 92% | 86% | -6% |
| Safety constraints | 98% | 98% | 0% |
| Long-form reasoning | 90% | 88% | -2% |
| Latency budget | 95% | 79% | -16% |
If one slice drops, I don't ship. I wrote about this same observability mindset in observability for AI agents.
Why slices matter: A small drop in a high-risk slice (refunds, permissions, data access) is worse than a big drop in a low-risk slice (tone, style).
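That risk weighting can be encoded directly in the gate. A sketch, assuming a zero-tolerance budget for high-risk slices and a looser five-point budget for low-risk ones (both thresholds are assumptions):

```typescript
// Per-slice result: baseline and current pass rates as fractions.
type SliceResult = { slice: string; baseline: number; current: number; highRisk: boolean };

// Returns the slices that should block the release.
// High-risk slices block on any drop; low-risk slices get a 5-point budget.
function blockingSlices(results: SliceResult[]): string[] {
  return results
    .filter((r) => r.baseline - r.current > (r.highRisk ? 0.0 : 0.05))
    .map((r) => r.slice);
}
```

Run against the scorecard above, a 6-point drop in a high-risk slice blocks the ship even while a 2-point drop in a low-risk slice passes.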
Full eval suites can be expensive. If it's too expensive, people skip it. So I tier them.
Tiers: A small suite runs daily on every commit, a medium suite on PRs, and the full suite on releases.

- Daily suite: 40 cases (fast, cheap)
- PR suite: 120 cases (realistic)
- Release suite: 300+ cases (full coverage)
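Tier selection can live in the eval entrypoint. A sketch, with trigger and suite names assumed for illustration:

```typescript
// Hypothetical CI triggers; map each to the suite it should run.
type Trigger = "commit" | "pull_request" | "release";

function suiteFor(trigger: Trigger): string {
  switch (trigger) {
    case "commit":
      return "daily-40"; // small and cheap: runs on every commit
    case "pull_request":
      return "pr-120"; // realistic coverage before merge
    case "release":
      return "release-300"; // full suite gates the ship
  }
}
```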
This keeps costs predictable, which matters just as much as accuracy (see the cost problem in AI).
I also cache tool outputs and reuse them across runs. If the tool call is deterministic, the eval should be too.
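The cache can be a plain memoizer around the tool call. A sketch for deterministic, single-string-argument tools; real tools would need a serialized key for structured arguments:

```typescript
// Wraps a deterministic tool so repeated eval runs hit the cache, not the tool.
function cachedTool<T>(tool: (arg: string) => T): (arg: string) => T {
  const cache = new Map<string, T>();
  return (arg) => {
    if (!cache.has(arg)) cache.set(arg, tool(arg));
    return cache.get(arg)!;
  };
}
```

Persist the map to disk between runs and identical eval cases cost nothing on re-run.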
Cheap evals are the only ones you will actually run every day.
A model without evals is just a demo with a deadline.