
From Hard to Harnessed: Making On-Device LLM Tool-Calling Actually Work

A 3B model on your phone should log "5$ starbucks" as a five-dollar transaction attributed to Starbucks. Mine wrote a coffee-shop review instead. Here is the harness that fixed it.


I typed 5$ starbucks into my finance app. It should have logged a five-dollar transaction attributed to the Starbucks merchant. Instead it wrote a seven-step markdown essay about a "5-star Starbucks" and where to find one nearby.

That was the on-device model doing exactly what a language model does: completing text. No server, no GPT-4, no fallback to the cloud. Just a 3B model on an iPhone, and a prompt with no constraints. The whole product promise was fully on-device: Apple Foundation Models and downloadable MLX weights, no servers that touch the ledger. Which meant every reliability problem was mine to solve, on a model an order of magnitude weaker than what everyone benchmarks against.

This is the journey from that essay bug to a tool-calling harness that holds. I built it inside Otto, a private on-device AI accountant for iPhone. Every pattern here ships in production.

The unconstrained-fallback trap

Here are four real inputs on a fresh install, and what the model did with each:

| Input | Expected | Observed |
| --- | --- | --- |
| i spent 5$ starbucks | save $5 Starbucks | chatty coffee recommendation |
| add 5$ for starbucks to todays cost | save $5 today | "Could you provide the current cost?" |
| 5 usd | ask which merchant | "looks like you have a typo" |
| 5$ starbucks | save $5 Starbucks | 7-step "5-star Starbucks" essay |

The smoking gun was the architecture, not the model. When structured routing threw (bad schema, context pressure), the catch block fell through to a bare session with zero instructions:

// the bug: a naked session completing raw text
let session = LanguageModelSession()
let reply = try await session.respond(to: userInput)

A naked session does not know it is a finance app. It sees 5$ starbucks and writes a product review. The fix is not a better prompt on the fallback. The fix is deleting the fallback. There must be no path from tool dispatch to an unconstrained base-model reply.

func dispatch(_ envelope: ToolCallEnvelope, context: ToolContext) async -> ToolOutcome {
    guard let tool = tool(named: envelope.name) else {
        return .message(.plainBubble(kicker: "otto", text: Self.unknownText))
    }
    do {
        return try await tool.run(envelope.args, context: context).asOutcome()
    } catch {
        let failure = ToolExecutionError.classify(error, tool: envelope.name)
        return .message(.plainBubble(kicker: "otto", text: failure.userText))
    }
}

Every error path (unknown tool, missing args, executor throw, parse failure, model unavailable) resolves to a calm, fixed bubble. The essay path was structurally removed, not guarded. This is the same lesson from why most AI agents fail in production: the failure mode you don't have a path to is the one that can't bite you.
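To make "calm, fixed bubble" concrete, here is a minimal sketch of how an error classifier like the one dispatch calls could look. The case names and the copy are illustrative assumptions; only the names ToolExecutionError, classify, and userText appear in the dispatch code above.

```swift
// Hypothetical cases: only classify(_:tool:) and userText are named in the post.
enum ToolExecutionError: Error, Equatable {
    case missingArgument(String)
    case parseFailure
    case modelUnavailable
    case executorFailed(tool: String)

    /// Collapse any thrown error into one of the known cases.
    static func classify(_ error: Error, tool: String) -> ToolExecutionError {
        if let known = error as? ToolExecutionError { return known }
        if error is DecodingError { return .parseFailure }
        return .executorFailed(tool: tool)
    }

    /// Fixed copy per case. Nothing here is generated by the model.
    var userText: String {
        switch self {
        case .missingArgument(let name): return "I need the \(name) to do that."
        case .parseFailure:              return "I didn't catch that. Try rephrasing?"
        case .modelUnavailable:          return "The on-device model isn't available right now."
        case .executorFailed:            return "That didn't go through. Nothing was changed."
        }
    }
}
```

The point of the shape is that every branch returns a string written at build time, so there is no code path on which a failure can turn into free-form model text.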

One registry, two runtimes

The app runs on two backends. Apple Foundation Models when the OS provides it, a downloaded MLX model (function-calling weights from HuggingFace) otherwise. The old code had a separate dispatch switch per runtime. Adding one tool meant editing three files in two engines, and the two engines drifted.

The fix: a single runtime-neutral contract. Tools register once. Both the MLX prompt catalog and the dispatch table derive from the same registry.

@MainActor
protocol OttoTool {
    var name: String { get }            // snake_case, identical across runtimes
    var summary: String { get }         // feeds the MLX prompt catalog
    var argHints: [ToolArgHint] { get }
    func run(_ args: ToolArgs, context: ToolContext) async throws -> ToolResult
}

Add a tool now: write one struct, add one line to the registry. The Apple FM schema, the MLX catalog, analytics keys, and dispatch all stay in sync because they read from the same source.
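As a rough illustration of "read from the same source", the sketch below derives both a prompt catalog and a dispatch lookup from one list. It is deliberately simplified: ToolSpec is a synchronous stand-in for the OttoTool protocol above, and the promptCatalog format is an assumption.

```swift
// Simplified stand-in for the post's OttoTool: synchronous, string args.
struct ToolSpec {
    let name: String
    let summary: String
    let run: ([String: String]) -> String
}

struct ToolRegistry {
    private(set) var tools: [ToolSpec] = []

    mutating func register(_ tool: ToolSpec) { tools.append(tool) }

    /// Prompt catalog, derived from the registered tools.
    var promptCatalog: String {
        tools.map { "- \($0.name): \($0.summary)" }.joined(separator: "\n")
    }

    /// Dispatch, derived from the same list, so it cannot drift from the catalog.
    func dispatch(name: String, args: [String: String]) -> String? {
        tools.first { $0.name == name }?.run(args)
    }
}

var registry = ToolRegistry()
registry.register(ToolSpec(name: "save_transaction", summary: "Log a purchase") { args in
    "saved \(args["amount"] ?? "?") at \(args["merchant"] ?? "?")"
})
registry.register(ToolSpec(name: "chat", summary: "Answer a general question") { _ in "ok" })
```

An unregistered name returns nil from dispatch, which the caller can map to the fixed "unknown tool" bubble rather than to any model-generated reply.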

Runtime selection happens per call, not once at launch. A user who finishes downloading weights mid-session, or enables Apple Intelligence, should not stay on the weaker engine until relaunch.

flowchart TD
    A[chat turn] --> B{MLX installed<br/>+ confirmed?}
    B -->|yes| C[MLX runtime]
    B -->|no| D{Apple FM<br/>available?}
    D -->|yes| E[Apple FM]
    D -->|no| F[visible offline bubble]
    C --> G[ToolRegistry.dispatch]
    E --> G
    class B,D decision
    class C,E worker
    class G success
    class F reject

The fallback is a visible offline bubble, never a silent degrade. On-device chat is stateless per turn, so switching engines between turns inside one conversation is safe.
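The flowchart reduces to a small pure function evaluated on every turn. This is a sketch under assumptions: RuntimeState and its three flags are hypothetical names for whatever the app actually checks.

```swift
enum ChatRuntime: Equatable {
    case mlx
    case appleFM
    case offline   // rendered as a visible bubble, never a silent degrade
}

// Hypothetical snapshot of device state, re-read on every chat turn.
struct RuntimeState {
    var mlxInstalled: Bool
    var mlxConfirmed: Bool
    var appleFMAvailable: Bool
}

func selectRuntime(_ state: RuntimeState) -> ChatRuntime {
    if state.mlxInstalled && state.mlxConfirmed { return .mlx }
    if state.appleFMAvailable { return .appleFM }
    return .offline
}
```

Because the function holds no state of its own, a download that finishes between two turns changes the result on the very next call.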

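Model choice itself is RAM-aware, not generation-aware. The iPhone 15 and 15 Pro share a generation but ship with 6 GB and 8 GB of RAM. Picking by generation would hand a 3.4 GB-peak model to a 6 GB phone and OOM-crash it. Budgeting 55% of physical RAM gives that phone about 3.3 GB, so the model is filtered out.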

static func runnable(ramBytes: Int64) -> [MLXModelSpec] {
    let budget = Int64(Double(ramBytes) * 0.55)   // headroom for app + OS
    return catalog.filter { $0.approxPeakRamBytes <= budget }
}

Tool-calling reliability is a first-class field on each model spec. A purpose-built function-calling model outranks a stronger general-chat model, because an agent lives or dies on routing, not prose.
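Here is the 6 GB versus 8 GB case worked through, combined with ranking by tool-calling reliability. The two catalog entries are made up for illustration, and treating a lower toolCallRank as better is an assumption.

```swift
struct MLXModelSpec {
    let id: String
    let approxPeakRamBytes: Int64
    let toolCallRank: Int   // assumption: lower is better
}

let gib: Double = 1_073_741_824

// Hypothetical catalog entries, for illustration only.
let catalog = [
    MLXModelSpec(id: "general-chat-large", approxPeakRamBytes: Int64(3.4 * gib), toolCallRank: 2),
    MLXModelSpec(id: "function-calling-small", approxPeakRamBytes: Int64(2.0 * gib), toolCallRank: 1),
]

func runnable(ramBytes: Int64) -> [MLXModelSpec] {
    let budget = Int64(Double(ramBytes) * 0.55)   // same 55% budget as above
    return catalog
        .filter { $0.approxPeakRamBytes <= budget }
        .sorted { $0.toolCallRank < $1.toolCallRank }
}

// 6 GiB * 0.55 = 3.3 GiB, so the 3.4 GiB model is excluded.
let sixGB = runnable(ramBytes: Int64(6 * gib)).map(\.id)
// 8 GiB * 0.55 = 4.4 GiB, so both fit and the function-calling model still ranks first.
let eightGB = runnable(ramBytes: Int64(8 * gib)).map(\.id)
```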

Deterministic routing, tolerant parsing

Two rules that sound contradictory but aren't: be strict about what the model decides, lenient about how it formats the answer.

Strict intent. Routing runs at temperature 0 (MLX) and greedy sampling (Apple FM). The same input must route to the same tool every time, or the eval harness is measuring noise. I learned this the hard way. MLX routing at temperature 0.6 made the golden eval meaningless, because the same 5$ starbucks routed three different ways across three runs. I wrote about why you can't ship agents by vibes; this is that, on hardware you don't control.
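One cheap way to catch this before trusting an eval is to route each golden input several times and flag any input whose route changes. This is a sketch, not the actual eval harness: Router is a hypothetical stand-in for a call into either runtime.

```swift
// A router is anything that maps an utterance to a tool name.
typealias Router = (String) -> String

/// Runs each input several times and returns the inputs whose route was not stable.
func flakyInputs(_ inputs: [String], runs: Int, router: Router) -> [String] {
    inputs.filter { input in
        let routes = Set((0..<runs).map { _ in router(input) })
        return routes.count > 1
    }
}

// A stub that always routes the same way reports no flaky inputs.
let stable: Router = { _ in "save_transaction" }
let flaky = flakyInputs(["5$ starbucks", "5 usd"], runs: 3, router: stable)
```

If this list is non-empty, the golden eval is measuring sampling noise rather than routing quality.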

Lenient format. Small models wrap JSON in markdown fences, prepend "Sure! Here you go:", and trail commentary. The MLX parser therefore extracts the first balanced object with a quote-aware brace counter, so a } inside a string value can't close the object early.

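// Returns the first balanced {...} object in `trimmed`, or nil if none closes.
// (Function name is illustrative.)
func firstBalancedObject(in trimmed: String) -> String? {
    guard let startIdx = trimmed.firstIndex(of: "{") else { return nil }
    var depth = 0, inString = false, escape = false
    var i = startIdx
    while i < trimmed.endIndex {
        let ch = trimmed[i]
        if escape { escape = false }                       // character after a backslash is literal
        else if inString && ch == "\\" { escape = true }   // escapes only matter inside strings
        else if ch == "\"" { inString.toggle() }
        else if !inString {
            if ch == "{" { depth += 1 }
            if ch == "}" { depth -= 1; if depth == 0 { return String(trimmed[startIdx...i]) } }
        }
        i = trimmed.index(after: i)   // advance; the loop cannot terminate without this
    }
    return nil
}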

One trap worth naming: do not set keyDecodingStrategy = .convertFromSnakeCase on the decoder when you also declare snake_case CodingKeys. The strategy rewrites date_iso to dateIso before your CodingKeys are matched, so the declared key never hits: optional fields silently decode as nil and required ones throw keyNotFound. Use explicit CodingKeys on their own instead. That one cost me an afternoon.
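The trap is easy to reproduce in a few lines. SaveArgs below is a hypothetical argument struct; the decoder calls are standard Foundation.

```swift
import Foundation

struct SaveArgs: Decodable {
    let amount: Double
    let dateISO: String?

    // Explicit mapping: the JSON key is exactly "date_iso".
    enum CodingKeys: String, CodingKey {
        case amount
        case dateISO = "date_iso"
    }
}

let json = Data(#"{"amount": 5, "date_iso": "2026-06-03"}"#.utf8)

// Default strategy: CodingKeys are matched against the raw JSON keys.
let plain = try JSONDecoder().decode(SaveArgs.self, from: json)

// With .convertFromSnakeCase the JSON key becomes "dateIso" before lookup,
// so the declared "date_iso" key finds nothing and the optional decodes as nil.
let converting = JSONDecoder()
converting.keyDecodingStrategy = .convertFromSnakeCase
let lossy = try converting.decode(SaveArgs.self, from: json)
```

No error is thrown in the second decode because the field is optional, which is what makes this hard to spot.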

The 4096-token wall

On-device context is roughly 4096 tokens. Tiny. As the tool roster grew past a dozen tools with per-field guidance, the route prompt hit ~4742 tokens and every turn started throwing exceededContextWindowSize. The eval caught it before it shipped.

Two structural fixes, not "trim the prompt a bit."

Narrow the roster per turn. A cheap string heuristic runs on the raw message before the AI call and shows the model only the plausible tools for that turn. It does not pick the tool. That's still the model's job. It just shrinks the menu.

Candidate tools for this turn: save_transaction, amend_last_transaction, chat

The full registry is always used for dispatch; only the prompt sees the narrowed set. Tools can keep growing without the per-turn prompt growing with them.
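A narrowing heuristic of this kind can be very small. The sketch below is an assumption about its shape, with made-up cue lists; the tool names are the ones used in this post.

```swift
import Foundation

/// Returns the tools worth showing the model for this message. It never picks one.
func candidateTools(for message: String) -> [String] {
    let text = message.lowercased()
    var tools: [String] = []

    // Any digit or a dollar sign suggests a spend is being logged.
    let hasDigit = text.contains { $0.isNumber }
    if hasDigit || text.contains("$") { tools.append("save_transaction") }

    let correctionCues = ["actually", "i meant", "make it"]
    if correctionCues.contains(where: { text.contains($0) }) {
        tools.append("amend_last_transaction")
    }

    let readCues = ["how much", "where", "top"]
    if readCues.contains(where: { text.contains($0) }) {
        tools.append("top_merchants")
    }

    tools.append("chat")   // always offered as the catch-all
    return tools
}
```

The heuristic is allowed to over-include. A wrong extra candidate costs a few prompt tokens, while a missing one would take the right answer off the menu.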

Typed follow-up memory, not transcript replay. Replaying prose history into a 4096-token window is a non-starter. Instead, a tiny serializable struct carries only what a follow-up needs:

ConversationFocus:
  lastReadKind:   .topMerchants
  categorySlug:   coffee
  periodDays:     7
  dateWindow:     2026-05-27 … 2026-06-03
  lastSavedTxnID: <uuid>

When the user says "where?" after "how much on coffee this week?", deterministic code reads that focus and routes to top_merchants with the category and period pre-filled, with no second AI call. This is the same instinct as building a chat system that's really a workflow engine: keep state typed and let code do what code is good at.
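A minimal sketch of that follow-up path, assuming a trimmed-down focus struct and a simplified envelope. The field names follow the listing above; ReadKind's cases, the argument keys, and resolveFollowUp itself are hypothetical.

```swift
import Foundation

enum ReadKind: String, Codable { case topMerchants, categoryTotal }

struct ConversationFocus: Codable, Equatable {
    var lastReadKind: ReadKind?
    var categorySlug: String?
    var periodDays: Int?
    var lastSavedTxnID: UUID?
}

// Simplified stand-in for the post's ToolCallEnvelope.
struct ToolCallEnvelope: Equatable {
    let name: String
    let args: [String: String]
}

/// Deterministic follow-up routing. Returns nil when the model should handle the turn.
func resolveFollowUp(_ message: String, focus: ConversationFocus) -> ToolCallEnvelope? {
    let text = message.lowercased().trimmingCharacters(in: .whitespacesAndNewlines)
    guard text == "where?" || text == "where",
          let category = focus.categorySlug,
          let days = focus.periodDays else { return nil }
    return ToolCallEnvelope(name: "top_merchants",
                            args: ["category": category, "period_days": String(days)])
}

let focus = ConversationFocus(lastReadKind: .categoryTotal, categorySlug: "coffee",
                              periodDays: 7, lastSavedTxnID: nil)
let followUp = resolveFollowUp("Where?", focus: focus)
```

Because the struct is Codable, it can be persisted between turns as a few dozen bytes instead of a transcript.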

Resolve-and-act, not UUIDs

"Update the coffee from Tuesday" is impossible if the model has to hold a row ID. Small models can't carry a UUID through a conversation, and you don't want one surfacing in chat anyway.

So edit and delete tools take a natural-language transaction_query, run a fuzzy search over the ledger, and branch on the match count:

let candidates = try executor.resolveTransactions(query: q, limit: 5)
if candidates.count == 1 {
    return .outcome(applyEdit(id: candidates[0].id, changes: changes))
}
// ambiguous: return tappable chips, each carrying the pending edit + its target id
let chips = candidates.map { tx in
    ChipAction(label: "\(tx.merchant) · \(tx.date)",
               intent: .applyEdit,
               payload: changes.with(targetId: tx.id))
}

One match, act. Many matches, render disambiguation chips where each chip carries the edit as payload. The user taps once. No UUID in chat, no multi-turn disambiguation loop the small model would fumble.

Dates get the same treatment. Instead of asking a 3B model to do calendar arithmetic, which it hallucinates, the prompt bakes in a lookup table and the model just echoes the right row:

'yesterday'   → 2026-06-02
'last week'   → 2026-05-27
'1 month ago' → 2026-05-03

Don't make a weak model compute. Make it pick.
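The table itself is cheap to generate with Calendar before each turn. The sketch below assumes a function named dateLookupTable and a fixed set of three phrases; "today" is pinned to 2026-06-03 so the rows match the ones above.

```swift
import Foundation

/// Builds the relative-date rows that get pasted into the prompt.
func dateLookupTable(today: Date, calendar: Calendar) -> [(phrase: String, iso: String)] {
    func iso(_ date: Date) -> String {
        let c = calendar.dateComponents([.year, .month, .day], from: date)
        return String(format: "%04d-%02d-%02d", c.year!, c.month!, c.day!)
    }
    let offsets: [(phrase: String, unit: Calendar.Component, value: Int)] = [
        (phrase: "yesterday", unit: .day, value: -1),
        (phrase: "last week", unit: .day, value: -7),
        (phrase: "1 month ago", unit: .month, value: -1),
    ]
    var rows: [(phrase: String, iso: String)] = []
    for offset in offsets {
        if let date = calendar.date(byAdding: offset.unit, value: offset.value, to: today) {
            rows.append((phrase: offset.phrase, iso: iso(date)))
        }
    }
    return rows
}

var utc = Calendar(identifier: .gregorian)
utc.timeZone = TimeZone(secondsFromGMT: 0)!
let today = utc.date(from: DateComponents(year: 2026, month: 6, day: 3))!
let rows = dateLookupTable(today: today, calendar: utc)
```

In a real app the calendar and time zone would come from the user's locale rather than UTC. They are fixed here only to keep the example reproducible.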

Post-route repair

The model chooses the tool. Deterministic code cleans up after it. Two distinct jobs, kept separate.

Arg enrichment fills slots the model left empty, never overrides what it provided. Pure string-in, dict-out, fully unit-testable, runs on both runtimes. It resolves relative dates with real Calendar math, maps currency vocabulary ("bucks" → USD, "lira" → TRY), and matches category keywords while deliberately excluding brand names, which belong in the merchant slot.
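A sketch of the fill-only rule for one slot, currency. The vocabulary table and the enrich signature are assumptions. Matching on whole words is what keeps "starbucks" from being read as "bucks".

```swift
// Hypothetical vocabulary table; the post names "bucks" → USD and "lira" → TRY.
let currencyVocabulary: [(word: String, code: String)] = [
    (word: "bucks", code: "USD"),
    (word: "lira", code: "TRY"),
]

/// Fills only the slots the model left empty. Values the model provided always win.
func enrich(args: [String: String], utterance: String) -> [String: String] {
    var out = args
    // Whole-word matching, so a brand like "starbucks" never counts as "bucks".
    let words = Set(utterance.lowercased()
        .split(whereSeparator: { !$0.isLetter })
        .map(String.init))

    if out["currency"] == nil,
       let hit = currencyVocabulary.first(where: { words.contains($0.word) }) {
        out["currency"] = hit.code
    }
    return out
}
```

Because it is a pure function from strings to a dictionary, it can be unit-tested without loading either runtime.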

Route normalization repairs high-confidence slips. A correction cue ("actually, make it 7") demotes a save into an amend. A save verb plus a money signal on a read-shaped route promotes it back to a save. After any switch, args are rebuilt from the raw utterance so stale values from the wrong route can't leak through.
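The two repairs described above might be sketched like this. The cue lists, the Route type, and the rebuildArgs callback are all hypothetical; the actual rules are likely more careful about word boundaries.

```swift
import Foundation

struct Route: Equatable {
    var tool: String
    var args: [String: String]
}

// Hypothetical cue lists, for illustration only.
let correctionCues = ["actually", "i meant", "make it"]
let saveVerbs = ["spent", "paid", "bought", "add"]
let readTools: Set<String> = ["top_merchants"]

func normalize(_ route: Route, utterance: String,
               rebuildArgs: (String, String) -> [String: String]) -> Route {
    let text = utterance.lowercased()
    let hasCorrection = correctionCues.contains { text.contains($0) }
    let hasSaveVerb = saveVerbs.contains { text.contains($0) }
    let hasMoney = text.contains { $0.isNumber }

    var tool = route.tool
    if tool == "save_transaction" && hasCorrection {
        tool = "amend_last_transaction"          // demote: this is a correction
    } else if readTools.contains(tool) && hasSaveVerb && hasMoney {
        tool = "save_transaction"                // promote: this is a new spend
    }
    guard tool != route.tool else { return route }
    // After a switch, discard the old args and rebuild from the raw utterance.
    return Route(tool: tool, args: rebuildArgs(tool, utterance))
}
```

When no rule fires the route comes back untouched, so the normalizer only ever acts on the slips it was written for.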

| Layer | Decides | Owns |
| --- | --- | --- |
| Model | which tool, core args | intent |
| Normalizer | obvious slips | tool correction |
| Enricher | missing args | slot filling |
| Dispatch | execution + errors | safety |

No layer does another's job. The model is never asked to be deterministic; the deterministic code is never asked to reason. That boundary is the whole trick, and it's why I keep coming back to deterministic workflows around the reasoning core instead of asking the model to be reliable by itself.

The payoff: no cloud, no per-token bill, no data leaving the phone, and a model a tenth the size routing reliably. The cost problem disappears when inference is free and local. The reliability problem doesn't. You just have to engineer it instead of buying it.

What still breaks

I want to be honest about where this sits, because "reliable" is doing a lot of work in that last sentence. The harness holds for the common path. The tail is still rough.

Long-tail phrasing. The eval is a few dozen golden cases. Real people phrase spending in ways I never wrote down: slang, two purchases in one sentence, mid-sentence corrections, languages I didn't seed the currency table for. Every TestFlight session surfaces a phrasing that routes wrong, and each one becomes a new golden case. The eval grows faster than the model improves.

Model churn. The on-device function-calling landscape moves monthly. A model that wins on tool-call accuracy this month gets beaten next month by something smaller, or something that quantizes better, or something with a longer context that lets me stop fighting the 4096-token wall. I keep the model behind a spec with a toolCallRank field precisely so I can swap weights without touching the harness, but picking the right default per device class is still guesswork backed by a thin eval.

The two-engine drift. Apple FM and MLX share the registry and the routing guidance, but they fail differently. FM throws on context pressure; MLX emits malformed JSON. Keeping both honest against one eval means every fix gets tested twice, and sometimes a fix for one regresses the other. I'm not convinced two runtimes is the right long-term answer. It might collapse to one as Apple's model matures or as MLX weights get good enough to always win.

Memory is shallow. ConversationFocus carries one turn of context well. It does not carry "the trip to Japan last month" or "my usual grocery store." Real accountant-grade memory needs something closer to a typed, queryable profile the model can read on demand, and that's a different post, one I haven't earned the right to write yet because I haven't built it.

So this is a snapshot, not a finished system. I'm still hunting for a better harness-and-model setup, and I suspect the version I ship in six months will make parts of this post look quaint.

If you want to watch that happen, or break it for me, Otto is heading into early TestFlight. There's a waitlist at ottokeep.com. I'd genuinely rather find the phrasing that routes wrong now, from real ledgers, than after launch. Sign up, throw your weirdest spending sentence at it, and tell me where it fumbles.


A weak model with a strong harness beats a strong model with none. I'm still building the harness.