Mehmet Erturk | Harness Engineering Went Mainstream. Local Is Where It Gets Hard

[Image: a small on-device model held steady by a surrounding harness structure]

In February 2026, OpenAI revealed that a team of three engineers had shipped a million-line production codebase — roughly 1,500 merged PRs over five months — without hand-writing a single line. The same month, Mitchell Hashimoto gave the practice a name, threads about "harness engineering" went viral, and there's now a free twelve-lecture course on it. The discipline most AI engineers had never heard of suddenly has a name, a curriculum, and a hype cycle.

I'm glad it does, because I've spent a lot of time doing it in the regime the curriculum doesn't cover: not a frontier model in a 200k-token context with a vendor on call, but a 3B model on an iPhone with 4096 tokens and nobody to call. I built Otto, an on-device AI accountant, and the harness around its model is most of the engineering. This post is the guide I wish had existed when I started.

What a harness actually is

The course's framing is correct and worth repeating: a harness doesn't make the model smarter. It builds a closed loop around the model — rules going in, verification coming out — so that a capable-but-unreliable component becomes a reliable system.

The model stays exactly as flawed as it was. The system around it absorbs the flaws. That's the whole discipline. I made the same argument about why most AI agents fail in production before the term existed: nobody ships raw model output and survives.

What the new curriculum gets right is naming the agent pathologies — overreach, under-finish, premature victory declarations, context loss across sessions. What it can't cover yet is what happens when you take the same discipline local.

The constraint table

Everything in the mainstream harness conversation assumes resources a local harness doesn't have. Line them up and it stops looking like the same problem:

Constraint	Cloud coding agent	Local on-device model
Context window	200k+ tokens	~4096 tokens
Output control	Constrained decoding, schemas	Raw text completion (mostly)
Retry cost	API dollars	Latency, battery, thermals
Model quality	Frontier, improves quietly	0.6–3B, quirks per checkpoint
Observability	Vendor dashboards, traces	Whatever you build yourself
Failure audience	A developer reading a diff	A user mid-conversation

That last row is the one that changes your priorities. A coding agent's failure is reviewed before it ships. A local assistant's failure is the product experience, live, with no human in between. The harness isn't an optimization layer — it's the only thing standing between the model and the user.

So here are the five rules that survived contact with production.

Rule 1: Seal every fallback

Every path from your dispatch logic to an unconstrained model reply is a bug you haven't met yet. Mine produced a seven-step markdown essay about a "5-star Starbucks" when a user typed 5$ starbucks — a catch block fell through to a bare session with zero instructions, and the model did what models do: completed text.

The fix is structural, not promptual. Don't guard the fallback. Delete it. Every error path — unknown tool, missing args, parse failure, context overflow — must resolve to a calm, fixed message:

unknown tool      → "couldn't pick a clean action"
executor throw    → classified error → calm copy
parse failure     → calm copy
context overflow  → calm copy

If there is no path to unconstrained output, unconstrained output cannot happen. The full post-mortem is in the on-device tool-calling deep dive.
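A minimal sketch of that shape, in Python for brevity and not Otto's actual code. Every name here (`dispatch`, `CALM_COPY`, `TOOLS`, the two exception classes) is hypothetical, and only the first copy string comes from the table above. What matters is that no branch of the function hands the user raw model text:

```python
from typing import Callable, Dict

# Hypothetical fixed copy. Only the first string appears in the post.
CALM_COPY = {
    "unknown_tool": "couldn't pick a clean action",
    "executor_error": "something went wrong on my side; nothing was changed",
    "parse_failure": "I didn't catch that; could you rephrase?",
    "context_overflow": "that was too much for me to handle in one go",
}

class ParseFailure(Exception):
    """Hypothetical: the router's output could not be parsed."""

class ContextOverflow(Exception):
    """Hypothetical: the prompt did not fit the context window."""

def add_expense(args: dict) -> str:
    # Raises KeyError on missing args; dispatch turns that into calm copy.
    return f"saved {args['amount']} at {args['merchant']}"

TOOLS: Dict[str, Callable[[dict], str]] = {"add_expense": add_expense}

def dispatch(route: Callable[[str], dict], user_text: str) -> str:
    """Return a tool result or fixed copy. There is no model fallback."""
    try:
        call = route(user_text)
    except ContextOverflow:
        return CALM_COPY["context_overflow"]
    except Exception:  # ParseFailure and anything unforeseen
        return CALM_COPY["parse_failure"]
    if not isinstance(call, dict):
        return CALM_COPY["parse_failure"]
    tool = TOOLS.get(call.get("tool"))
    if tool is None:
        return CALM_COPY["unknown_tool"]
    try:
        return tool(call.get("args") or {})
    except Exception:
        return CALM_COPY["executor_error"]
```

The broad `except Exception` clauses are deliberate in this sketch: a narrow catch is exactly where an unforeseen error would find its way back to a bare session.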

Rule 2: Make the model pick, not compute

A 3B model cannot reliably do date arithmetic. It cannot reliably carry a UUID across turns. It will hallucinate both, confidently. Stop asking.

Every place I needed computation, I replaced it with selection. Dates become a pre-resolved lookup table baked into the prompt each turn — the model echoes a row instead of calculating:

'yesterday'   → 2026-06-06
'last week'   → 2026-05-31
'1 month ago' → 2026-05-07

Free-text fields become enums where the value set is closed. Row references become natural-language queries that deterministic code resolves against the ledger. The pattern generalizes: shrink every open-ended generation into a multiple-choice question. Small models are decent pickers and terrible calculators.
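As a sketch of the date table, here is one way to pre-resolve the rows in code and then refuse any date the model did not copy from them. This is Python for brevity, not Otto's implementation. The function names are hypothetical, "last week" meaning seven days back is inferred from the example rows, and the month clamping rule is my assumption:

```python
import calendar
from datetime import date, timedelta
from typing import Dict, Optional

def months_ago(today: date, n: int) -> date:
    """Same day-of-month n months back, clamped to that month's last day."""
    month_index = today.year * 12 + (today.month - 1) - n
    year, month0 = divmod(month_index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(today.day, last_day))

def date_table(today: date) -> Dict[str, str]:
    """Resolve every supported phrase up front, in deterministic code."""
    return {
        "today": today.isoformat(),
        "yesterday": (today - timedelta(days=1)).isoformat(),
        "last week": (today - timedelta(days=7)).isoformat(),
        "1 month ago": months_ago(today, 1).isoformat(),
    }

def render_for_prompt(table: Dict[str, str]) -> str:
    """Format the rows for inclusion in the routing prompt."""
    return "\n".join(f"'{phrase}' -> {iso}" for phrase, iso in table.items())

def resolve_date(model_output: str, table: Dict[str, str]) -> Optional[str]:
    """Accept the model's date only if it is one of the rows it was shown."""
    value = model_output.strip()
    return value if value in table.values() else None
```

With `date(2026, 6, 7)` as today, `date_table` reproduces the three rows above. `resolve_date` is what turns the table from a hint into a constraint: a date the model invented is rejected instead of being written to the ledger.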

Rule 3: Deterministic decisions, tolerant parsing

Two rules that sound contradictory and aren't. Be strict about what the model decides. Be lenient about how it formats the decision.

Strict: routing runs at temperature 0, always. I learned this when routing at 0.6 sent the same input to three different tools across three runs — which made my eval harness measure noise instead of regressions. Determinism isn't a style preference; it's what makes evaluation possible at all.
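One cheap way to check that property is to route each utterance several times and flag any that disagree with themselves. A rough sketch in Python, where `route` is a hypothetical callable that returns the chosen tool name:

```python
from collections import Counter
from typing import Callable, Dict, Iterable

def unstable_routes(
    route: Callable[[str], str],
    utterances: Iterable[str],
    runs: int = 3,
) -> Dict[str, Counter]:
    """Map each utterance that routed to more than one tool to its tally."""
    unstable: Dict[str, Counter] = {}
    for text in utterances:
        tally = Counter(route(text) for _ in range(runs))
        if len(tally) > 1:
            unstable[text] = tally
    return unstable
```

An empty result does not prove the router is deterministic, but a non-empty one means any pass/fail number from the eval is partly noise.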

Tolerant: small models wrap JSON in markdown fences, prepend pleasantries, trail commentary. Parse defensively — extract the first balanced object, quote-aware, and validate the tool name against a registry so a hallucinated tool is rejected outright rather than dispatched.
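Here is a minimal sketch of that parser in Python. It is not Otto's code; `KNOWN_TOOLS` and the `tool` / `args` keys are hypothetical stand-ins for whatever the real registry and call format look like. The scanner tracks string and escape state so that braces inside quoted values do not affect the depth count:

```python
import json
from typing import Optional

KNOWN_TOOLS = {"add_expense", "top_merchants"}  # hypothetical registry

def first_json_object(text: str) -> Optional[dict]:
    """Return the first balanced {...} in text that parses as JSON."""
    start = text.find("{")
    while start != -1:
        depth = 0
        in_string = False
        escaped = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # balanced but invalid; try the next "{"
        start = text.find("{", start + 1)
    return None

def parse_tool_call(raw: str) -> Optional[dict]:
    """Extract a call and reject any tool name the registry does not know."""
    call = first_json_object(raw)
    if call is None or call.get("tool") not in KNOWN_TOOLS:
        return None
    return call
```

A `None` result should go to fixed copy, never to a second, looser attempt at interpreting the reply.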

And keep one source of truth. My tool registry generates the prompt catalog, the dispatch table, and the eval coverage from a single registration. Parity tests assert that nothing drifts. Two definitions of a tool are one too many.
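A sketch of what a single registration can look like, again in Python and with hypothetical names throughout (`register`, `REGISTRY`, `prompt_catalog`, `dispatch_table`, and both example tools):

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    run: Callable[[dict], str]

REGISTRY: Dict[str, Tool] = {}

def register(name: str, description: str):
    """Decorator: the only place a tool is ever declared."""
    def wrap(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        if name in REGISTRY:
            raise ValueError(f"duplicate tool: {name}")
        REGISTRY[name] = Tool(name, description, fn)
        return fn
    return wrap

@register("add_expense", "Save a new expense")
def add_expense(args: dict) -> str:
    return f"saved {args['amount']}"

@register("top_merchants", "List the biggest merchants in a category")
def top_merchants(args: dict) -> str:
    return f"top merchants for {args['category']}"

def prompt_catalog() -> str:
    """Tool list shown to the model, derived from the registry."""
    return "\n".join(f"- {t.name}: {t.description}" for t in REGISTRY.values())

def dispatch_table() -> Dict[str, Callable[[dict], str]]:
    """Name-to-executor map, derived from the same registry."""
    return {name: tool.run for name, tool in REGISTRY.items()}
```

Because both views are computed rather than written by hand, a parity test reduces to checking that every name in the catalog is a key in the dispatch table, and the same `REGISTRY` keys can drive the eval's coverage check.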

Rule 4: Typed state beats transcripts

The mainstream answer to multi-turn context is "replay the conversation." With 4096 tokens, that's a non-starter — and it was never a good idea anyway. Prose history is the most expensive, least reliable way to carry state.

Instead, each turn emits a small typed struct carrying only what a follow-up needs:

ConversationFocus:
  lastReadKind:  .topMerchants
  categorySlug:  coffee
  periodDays:    7
  lastSavedTxnID: <uuid>

When the user says "where?" after "how much on coffee this week?", deterministic code reads that struct and routes the follow-up — no second model call. This is the local cousin of what the harness curriculum calls the repo as system of record: durable state lives in structure, not in the context window. Same instinct as deterministic workflows — let code do what code is good at.
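To make that concrete, here is a rough Python rendering of the struct and a follow-up router. The field names mirror the struct above; everything else is my assumption for illustration, including `ReadKind.CATEGORY_TOTAL`, the `top_merchants` tool name, and the rule that "where?" maps to a merchant breakdown:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ReadKind(Enum):
    TOP_MERCHANTS = "topMerchants"
    CATEGORY_TOTAL = "categoryTotal"  # hypothetical second kind

@dataclass(frozen=True)
class ConversationFocus:
    last_read_kind: Optional[ReadKind] = None
    category_slug: Optional[str] = None
    period_days: Optional[int] = None
    last_saved_txn_id: Optional[str] = None

def route_follow_up(text: str, focus: ConversationFocus) -> Optional[dict]:
    """Resolve a terse follow-up from typed state; None means no match."""
    word = text.strip().lower().rstrip("?")
    if word == "where" and focus.category_slug and focus.period_days:
        return {
            "tool": "top_merchants",
            "args": {"category": focus.category_slug, "days": focus.period_days},
        }
    return None
```

After "how much on coffee this week?" the focus would hold `category_slug="coffee"` and `period_days=7`, so "where?" resolves without touching the model. When the function returns `None`, the turn goes through normal routing.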

Rule 5: The eval is the harness's harness

A harness without an eval is a vibe with extra steps. Every production failure becomes a pinned golden fixture — my first four are marked isRegression: true and fail the build forever if they reroute. Every registered tool must have at least one passing fixture, enforced by test, so coverage can't silently lag the roster.
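A minimal sketch of both checks in one pass, in Python. `Fixture`, `run_eval`, and the `route` callable are hypothetical; `is_regression` mirrors the `isRegression` flag mentioned above:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set

@dataclass(frozen=True)
class Fixture:
    utterance: str
    expected_tool: str
    is_regression: bool = False  # pinned from a real production failure

def run_eval(
    route: Callable[[str], str],
    fixtures: Iterable[Fixture],
    registered_tools: Set[str],
) -> List[str]:
    """Return readable failures; an empty list means the build may pass."""
    failures: List[str] = []
    passing_tools: Set[str] = set()
    for fx in fixtures:
        got = route(fx.utterance)
        if got == fx.expected_tool:
            passing_tools.add(got)
        else:
            tag = "REGRESSION" if fx.is_regression else "miss"
            failures.append(
                f"{tag}: {fx.utterance!r} -> {got}, expected {fx.expected_tool}"
            )
    # Coverage: every registered tool needs at least one passing fixture.
    for tool in sorted(registered_tools - passing_tools):
        failures.append(f"uncovered: {tool} has no passing fixture")
    return failures
```

Note that coverage is computed from passing fixtures, not from fixtures that merely exist, so a tool whose only fixture is failing counts as uncovered.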

The uncomfortable part: the eval caught my worst bug before users did. Adding tools pushed the routing prompt to roughly 4742 tokens, 646 past the hard 4096-token limit, and every single turn started throwing exceededContextWindowSize. Not degraded. Dead. On a frontier model you'd never notice the prompt growing; locally, the ceiling is close enough to hit by accident. The eval is the only smoke detector you get.
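A budget check like the following would turn that failure into a build error instead of a runtime one. It is a sketch under assumptions: the token count has to come from the model's real tokenizer, and `reserve_for_turn` (room for the user's message and the reply) is a hypothetical parameter, not something from Otto:

```python
CONTEXT_LIMIT = 4096  # the hard limit quoted above

def check_prompt_budget(
    prompt_tokens: int,
    reserve_for_turn: int,
    limit: int = CONTEXT_LIMIT,
) -> int:
    """Return remaining headroom; raise if prompt plus reserve cannot fit."""
    headroom = limit - prompt_tokens - reserve_for_turn
    if headroom < 0:
        raise AssertionError(
            f"routing prompt is {prompt_tokens} tokens; with a reserve of "
            f"{reserve_for_turn} it overshoots the {limit} limit by {-headroom}"
        )
    return headroom
```

Run against the numbers in this section, a 4742-token prompt fails even with a reserve of zero, which is the same condition the device reported on every turn.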

Still unsolved

I want to be precise about where this stands, because "harness engineering" is being sold as a method and it's still mostly a direction.

The long tail eats golden sets. Real phrasing outruns any fixture list. Every test session surfaces an utterance that routes wrong, and the eval grows faster than the model improves.

Model churn is monthly. The best small function-calling model changes constantly. I keep weights swappable behind a spec so the harness survives the churn, but picking the default per device class is still guesswork.

Schema is fragile at this scale. Widening a structured-output schema to handle one more edge case can break core tool selection on a 3B model. The harness compensates with deterministic post-route repair — which is a patch on the problem, not a solution to it.

Memory is shallow. One turn of typed focus works. "The trip to Japan last month" doesn't fit in any of this yet.

The twelve lectures map the failure modes of coding agents well. The local failure modes don't have a curriculum yet — this post is my running attempt at one. The progress is real. The finish line is not visible.

Harness engineering got a name in 2026. It hasn't earned the word "solved" — and locally, it hasn't earned "stable" either.