Local AI Agent Loops: The Part I Had to Build From Scratch
A local tool-calling harness is not enough once users ask for more than one thing in one message. You need an agent loop you can actually trust.

I thought the hard part was making a small on-device model call tools.
It wasn't. The hard part was making the system keep going after the first tool call without losing state, repeating work, blowing the context window, leaking raw model prose, or turning a finance app into a chatbot with ledger access.
In the last two posts I wrote about the first layer: making on-device LLM tool-calling work, then why local harness engineering is harsher than the cloud version. Since then, Otto's codebase moved from "route one message to one tool" to a real local agent loop across Apple Foundation Models and MLX models.
That sounds like a small implementation detail. It is not. It changes the shape of the whole system.
The single-shot wall
The first version of the harness did one thing well:
    user message -> route -> one tool call -> render result -> stop
That fixed the "5$ starbucks" failure. The model stopped writing coffee-shop essays and started emitting structured tool calls. Good.
Then the next real product sentence showed up:
    save 4.50 coffee at Blue Bottle and show my week
That is not one action. It is two:
- Save the transaction
- Render the weekly heatmap
A single-shot router can only pick one. If it saves the coffee, it drops the report. If it shows the week, it ignores the new transaction. If it tries to be clever and fold both into one tool, the harness starts lying about what the tools actually do.
The obvious answer is "use the model provider's agent loop." Locally, that answer breaks down fast.
| Runtime | What I wanted | What I had |
|---|---|---|
| Apple Foundation Models | Native tool loop | Structured ChatRoute near the 4096-token ceiling |
| MLX | Constrained JSON tool calls | Raw text generation plus tolerant parsing |
| Both | Same tools, same failures, same UI | Different model APIs and different failure shapes |
The target architecture was a native observe-and-continue loop. The practical
version I shipped is lower level: a shared AgentRunner that owns the loop
invariants while each runtime keeps its native routing contract.
That is the recurring theme in local AI work. The clean abstraction is usually one layer above what the device can afford.
One runner, two model contracts
The runner does not know about Apple Foundation Models. It does not know about MLX. It knows about one normalized shape:
    struct Step {
        let envelope: ToolCallEnvelope
        let more: Bool
    }
Each runtime provides a next(completed) closure. The runner handles everything
after that:
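    for iteration in 1...maxIterations {
        // Ask the runtime for the next step; nil means it found no route.
        guard let step = try await next(completed) else {
            if completed.isEmpty {
                // Nothing ran yet: ask for a rephrase instead of showing model prose.
                emitUnknownCopy()
            }
            // Work that already executed stands.
            return completed
        }
        let envelope = ToolRouteNormalizer.normalize(
            step.envelope,
            rawMessage: rawMessage,
            focus: currentFocus,
            now: now
        )
        let outcome = await registry.dispatch(envelope, context: context)
        // Update state before the next route so later steps see the change.
        currentFocus = updateFocus(from: outcome)
        emit(outcome)
        completed.append(summary(of: outcome))
        guard step.more else { return completed }
    }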
The exact code is more careful, but those are the rules:
- Max eight steps. A multi-intent message cannot loop forever. If the model keeps saying "more", Otto emits calm copy and asks the user to split the rest.
- No re-running completed work. If step two fails after step one saved a transaction, step one stands. The system does not retry the whole turn and save the transaction twice.
- One dispatch path. Every route becomes a ToolCallEnvelope, then goes through the same ToolRegistry.dispatch, the same LocalToolExecutor, and the same failure copy.
- One focus pipeline. Conversation state updates after each tool result, so the second step can see what the first step changed.
That last point matters. An agent loop is not just "call the model again." It is observe, update state, route again from the new state.
    flowchart TD
        A[User message] --> B[Runtime route step]
        B --> C[Normalize envelope]
        C --> D[Dispatch local tool]
        D --> E[Render message]
        E --> F[Update typed focus]
        F --> G{More actions?}
        G -->|yes| H[Progress or remaining-action prompt]
        H --> B
        G -->|no| I[Finish turn]
The model is still unreliable. The loop is the part that must not be.
MLX can say more
MLX was easier to extend because Otto already owns the prompt format. The model emits a JSON envelope. I added a top-level "more" field:
    {
      "tool": "save_transaction",
      "args": {
        "merchant": "Blue Bottle",
        "amount": "4.50",
        "currency": "USD"
      },
      "more": true
    }
When "more" is true, the next prompt includes a small progress block:

    Actions ALREADY COMPLETED for this user message (do not repeat them):
    - save_transaction: saved $4.50 at Blue Bottle

Then the runtime resends the route prompt. MLX is stateless here. There is no durable chat session carrying tool results for me. The prompt has to carry enough progress to stop duplicate actions without consuming the entire context window.
That is why the summaries are hard-bounded. A tool result becomes a compact, model-readable line, not a transcript replay. Transcript replay is how you turn a 4096-token model into a time bomb.
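
A minimal sketch of what a hard-bounded progress block could look like. This is an illustration, not Otto's actual code: the CompletedAction type, the character-based cap, and the default limits are all assumptions, and a real budget would count tokens rather than characters.

```swift
import Foundation

// Hypothetical summary of one completed step; not Otto's actual type.
struct CompletedAction {
    let tool: String
    let summary: String
}

// Builds the "already completed" block with hard caps on both the
// per-line length and the number of lines, so prompt growth is bounded
// no matter how verbose a tool result is.
func progressBlock(
    for completed: [CompletedAction],
    maxLineLength: Int = 80,
    maxLines: Int = 8
) -> String {
    guard !completed.isEmpty else { return "" }
    let header = "Actions ALREADY COMPLETED for this user message (do not repeat them):"
    // Keep only the most recent lines if the cap is exceeded.
    let lines = completed.suffix(maxLines).map { action -> String in
        let line = "- \(action.tool): \(action.summary)"
        guard line.count > maxLineLength else { return line }
        // Truncate by characters and mark the cut with "...".
        return String(line.prefix(max(0, maxLineLength - 3))) + "..."
    }
    return ([header] + lines).joined(separator: "\n")
}
```

The point of the two caps is that the worst-case size of the block is known ahead of time, which is what lets it coexist with a fixed route prompt.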
The MLX path also got a repair step. If parsing fails, it asks once more with an explicit repair prompt. If that still fails on the first step, the turn fails into the existing safe copy. If it fails after work already executed, the loop ends and keeps the completed work.
That distinction is important. "All or nothing" sounds clean until the "all" contains a real ledger mutation.
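
The parse, repair, and fail-differently rules can be sketched roughly as follows. Everything here is illustrative: Envelope, StepResult, and routeStep are hypothetical names, the generate closure stands in for the model call, and args is assumed to be a flat string map as in the JSON example above.

```swift
import Foundation

// Hypothetical wire shape mirroring the JSON envelope shown earlier.
struct Envelope: Decodable, Equatable {
    let tool: String
    let args: [String: String]
    let more: Bool

    enum CodingKeys: String, CodingKey { case tool, args, more }

    init(from decoder: Decoder) throws {
        let c = try decoder.container(keyedBy: CodingKeys.self)
        tool = try c.decode(String.self, forKey: .tool)
        args = try c.decodeIfPresent([String: String].self, forKey: .args) ?? [:]
        // A missing "more" is treated as "stop", the conservative default.
        more = try c.decodeIfPresent(Bool.self, forKey: .more) ?? false
    }
}

// Tolerant parse: ignore any prose around the outermost {...} span.
func parseEnvelope(_ raw: String) -> Envelope? {
    guard let start = raw.firstIndex(of: "{"),
          let end = raw.lastIndex(of: "}"),
          start < end else { return nil }
    let json = Data(raw[start...end].utf8)
    return try? JSONDecoder().decode(Envelope.self, from: json)
}

enum StepResult: Equatable {
    case routed(Envelope)
    case failTurn   // nothing executed yet: fall back to the safe copy
    case endLoop    // keep completed work, stop routing
}

// One normal attempt, one repair attempt, then fail according to
// whether any work has already executed in this turn.
func routeStep(
    completedCount: Int,
    generate: (_ isRepair: Bool) -> String
) -> StepResult {
    if let envelope = parseEnvelope(generate(false)) { return .routed(envelope) }
    if let envelope = parseEnvelope(generate(true)) { return .routed(envelope) }
    return completedCount == 0 ? .failTurn : .endLoop
}
```

Note that the failure branch never touches completed work; it only decides what the user sees next.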
Apple FM could not afford one more field
Apple Foundation Models had the stranger constraint. Otto uses a typed
@Generable ChatRoute for routing. It is useful because the model is
constrained into a Swift shape instead of free-form JSON.
But the schema is already at the edge. In the worst case, the route prompt plus schema plus context block sits around six tokens under the on-device 4096-token ceiling. Adding a "more: Bool" field to ChatRoute overflows the context. The eval caught that.
So the continue signal could not live in the main route schema.
The workaround is a tiny second-pass probe:
    @Generable
    struct RemainingActionsProbe {
        var requestedActions: [String]
    }
It only runs when the raw message has a multi-intent cue like "and", "then", "also", "plus", or a semicolon. Single-action turns pay zero extra model calls.
The probe decomposes the original message:

    "save 4.50 coffee at Blue Bottle and show my week"
    -> ["save 4.50 coffee at Blue Bottle", "show my week"]

    "coffee and bagel $12 at Joe's"
    -> ["coffee and bagel $12 at Joe's"]

That second example is why a dumb string split would be wrong. The word "and" can mean two actions, or it can be part of one purchase.
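
The cue gate itself can stay cheap and deterministic, because it only decides whether the probe runs and never splits anything. A sketch, assuming whole-word matching on the cue list named above (the function name is hypothetical):

```swift
import Foundation

// Cheap pre-check: does the message contain a multi-intent cue at all?
// A true result only means "run the probe"; the probe decides whether
// the message really contains more than one action.
func hasMultiIntentCue(_ message: String) -> Bool {
    if message.contains(";") { return true }
    let cues: Set<String> = ["and", "then", "also", "plus"]
    // Split on non-alphanumerics so "and" matches only as a whole word.
    let words = message.lowercased()
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
    return words.contains { cues.contains($0) }
}
```

Both example messages above pass this gate, which is expected: the gate is allowed to over-trigger, and the probe is what keeps the coffee-and-bagel purchase as a single action.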
The Apple path routes the first step from the full contextual message. Later steps route the decomposed action verbatim:
    User message: show my week
That was more reliable than resending the full message plus a progress block. The second route stayed close to the single-intent conditions already pinned by the routing eval.
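
In rough terms, the per-step input selection on the Apple path looks like the sketch below. The function name and signature are hypothetical, and the "User message:" prefix is taken from the example above; the real route prompt carries more context than this.

```swift
import Foundation

// Chooses what to route for a given step index.
// Step 0 uses the full original message; later steps use the probe's
// decomposed action verbatim. Returns nil when nothing remains.
func routeInput(
    forStep index: Int,
    rawMessage: String,
    requestedActions: [String]
) -> String? {
    if index == 0 { return "User message: \(rawMessage)" }
    guard index < requestedActions.count else { return nil }
    return "User message: \(requestedActions[index])"
}
```

When the probe did not run, requestedActions is empty, so the second call returns nil and the turn ends after one step.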
This is the part people miss when they talk about "just use structured output." Structured output does not remove context limits. Sometimes the schema itself is the thing that breaks the model.
The harness got more deterministic
The loop forced other parts of the harness to become stricter.
A multi-step agent magnifies every small ambiguity. If the first step picks the wrong tool, the second step compounds the mistake. If arg recovery is sloppy, the user sees two wrong cards instead of one. If model switching is stale, you debug a runtime that is not actually running anymore.
So the recent changes are less glamorous than "agent loop", but they are the work that makes the loop usable:
| Problem | Fix |
|---|---|
| Prompt roster too large | Per-turn ToolCandidateSelector shows only plausible tools |
| Date drift east of UTC | Shared local ISO-day formatter for prompt anchors and parsing |
| Missing amount or merchant args | Runtime-neutral ToolArgEnricher fills only empty fields |
| "add 5$ two days ago" becoming 52 | Amount parser isolates the money-looking token first |
| User activates a different MLX model | ensureReady checks active slug, not only readiness |
| First local-model turn stalls | prewarm() loads weights before the first message |
| Backgrounded app holds 3GB of weights | MLX runtime unloads on background or memory warning |
| Qwen thinking flag leaks to other templates | enable_thinking sent only for Qwen-family specs |
None of these make a good demo. All of them prevent weird product failures.
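
To make one row of that table concrete, here is a sketch of an amount parser that isolates the money-looking token before parsing anything. It is an assumption about the approach, not Otto's parser: the function name, the patterns, and their priority order are all illustrative.

```swift
import Foundation

// Finds the first money-looking token ("$4.50", "5$", "12") and parses
// only that token, so other numbers or number words in the message
// cannot be merged into the amount.
func parseAmount(from message: String) -> Decimal? {
    // Prefer tokens that carry a currency sign; fall back to a bare number.
    let patterns = [
        #"\$(\d+(?:\.\d{1,2})?)"#,   // sign first, e.g. $4.50
        #"(\d+(?:\.\d{1,2})?)\$"#,   // sign last, e.g. 5$
        #"(\d+(?:\.\d{1,2})?)"#      // bare number, e.g. 12
    ]
    let fullRange = NSRange(message.startIndex..., in: message)
    for pattern in patterns {
        guard let regex = try? NSRegularExpression(pattern: pattern),
              let match = regex.firstMatch(in: message, range: fullRange),
              let tokenRange = Range(match.range(at: 1), in: message)
        else { continue }
        // Fixed locale so "." is always the decimal separator.
        return Decimal(string: String(message[tokenRange]),
                       locale: Locale(identifier: "en_US_POSIX"))
    }
    return nil
}
```

Because the sign-carrying patterns run first, a message such as "add 5$ 2 days ago" still resolves to 5 rather than picking up the day count.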
This is where local agent work starts looking less like prompt engineering and more like runtime engineering. You are managing memory, cancellation, schema size, time zones, model-specific template variables, and ledger mutation semantics at the same time.
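
The time-zone item is a good example of how small these fixes are in code. A day string formatted in UTC can land on the previous calendar day for a user east of UTC shortly after local midnight, so the prompt anchor and the parser need to agree on one local formatter. A sketch, with a hypothetical function name:

```swift
import Foundation

// Formats a date as yyyy-MM-dd in the given time zone (normally the
// user's), so prompt anchors and parsing agree on what "today" is.
func localISODay(_ date: Date, timeZone: TimeZone = .current) -> String {
    let formatter = DateFormatter()
    formatter.calendar = Calendar(identifier: .gregorian)
    formatter.locale = Locale(identifier: "en_US_POSIX")
    formatter.timeZone = timeZone
    formatter.dateFormat = "yyyy-MM-dd"
    return formatter.string(from: date)
}
```

For the instant 2024-03-09T16:30:00Z, this yields 2024-03-09 in UTC but 2024-03-10 in Asia/Tokyo, which is exactly the one-day gap that shows up as drift.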
The model catalog became part of the harness
I also refreshed the MLX model catalog. That sounds separate from the agent loop, but it is not.
An on-device agent is only as good as the model's ability to pick tools. General chat quality is secondary. A model that writes nicer prose but misses function calls is worse for Otto than a smaller model that reliably emits the right envelope.
The catalog now tracks that explicitly:
    let toolCallRank: Int
    let approxPeakRamBytes: Int64
    let minSupportedDevice: DeviceTier
    let disablesThinking: Bool
The product decision is not "best model." It is "best legally shippable tool caller for this device's RAM."
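
As a sketch of that decision rule: filter by license and RAM first, then rank by tool calling. The ModelSpec type here is a simplified stand-in, licenseAllowsPaidUse is an assumed flag that does not appear in the fields above, and "lower rank is better" is also an assumption.

```swift
import Foundation

// Hypothetical, simplified catalog entry.
struct ModelSpec {
    let slug: String
    let toolCallRank: Int            // assumed: lower is better
    let approxPeakRamBytes: Int64
    let licenseAllowsPaidUse: Bool   // assumed extra flag
}

// Picks the best-ranked tool caller that is shippable and fits in RAM.
func bestModel(in catalog: [ModelSpec], ramBudgetBytes: Int64) -> ModelSpec? {
    catalog
        .filter { $0.licenseAllowsPaidUse && $0.approxPeakRamBytes <= ramBudgetBytes }
        .min { $0.toolCallRank < $1.toolCallRank }
}
```

The ordering matters: a model that ranks first on tool calls but fails the license or RAM filter never reaches the comparison.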
That removed some attractive candidates. Hammer was strong at tool calls, but the license was not usable for a paid finance product. Some Gemma paths were not clear enough for this use case. Phi-3.5 was retired from the user catalog because it was not tool-call-first enough.
The newer catalog adds candidates like Qwen3 4B, Phi-4 Mini, and Granite 4.0 H 1B, but the ranking is still provisional until the on-device bench says they survive Otto's actual prompts.
Local AI is not just "download weights." It is product policy, legal policy, hardware policy, and eval policy in one table.
Evals are now the harness's harness
The loop has its own tests now. Not just "does this function return something", but the invariants that protect user data:
- Two-step turn executes in order. The runner emits both messages and labels the outcome as tool:step_one+step_two.
- Single-step turn stops. No extra model call for normal messages.
- Iteration cap emits calm copy. No runaway loops.
- No route on first step asks for rephrase. No raw model prose.
- No route after first step keeps completed work. No fake rollback, no duplicate mutation.
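
These invariants are easy to pin against a toy version of the loop. The sketch below is synchronous and deliberately stripped down; ToyStep, ToyEnd, and runToyLoop are hypothetical names, and the real AgentRunner is async and does far more per step.

```swift
import Foundation

// Toy stand-ins for the runner's types.
struct ToyStep {
    let tool: String
    let more: Bool
}

enum ToyEnd: Equatable {
    case finished     // the last step said there is nothing more
    case noRoute      // the runtime returned no step
    case capReached   // the model kept saying "more"
}

// Bounded loop: completed work is only ever appended, never re-run.
func runToyLoop(
    maxIterations: Int = 8,
    next: (_ completed: [String]) -> ToyStep?
) -> (completed: [String], end: ToyEnd) {
    var completed: [String] = []
    for _ in 0..<maxIterations {
        guard let step = next(completed) else {
            return (completed, ToyEnd.noRoute)
        }
        completed.append(step.tool)   // stands in for dispatch + emit
        guard step.more else {
            return (completed, ToyEnd.finished)
        }
    }
    return (completed, ToyEnd.capReached)
}

// A model that never stops asking for more still ends after eight steps.
let runaway = runToyLoop { _ in ToyStep(tool: "show_week", more: true) }
assert(runaway.completed.count == 8)
assert(runaway.end == .capReached)
```

Each bullet above maps to one scripted next closure and one pair of assertions on the completed list and the end state.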
The Apple FM path also has a live scenario eval for the real model: the probe must split "save 4.50 coffee at Blue Bottle and show my week" into two actions, but keep "coffee and bagel $12 at Joe's" as one action. Then a full chat turn must save the coffee and render the week.
That test requires an Apple Intelligence host, so it is availability-gated. But the point is clear: for local models, evals are not an academic layer. They are the only way to notice when a beta OS, a model checkpoint, a schema field, or a prompt line silently changes behavior.
The earlier routing eval caught the 4096-token wall. The new loop evals catch the next class of failure: did the agent continue, stop, or repeat at the right time?
What is still open
This is not the final architecture.
Apple's native Tool protocol loop is still the cleaner long-term direction
for the Apple FM path, but migrating to it changes the wire contract and needs a
fresh device eval. MLX constrained decoding or grammar masking is still future
work; tolerant JSON parsing plus repair is a soft guarantee, not a hard one.
The typed focus model works for immediate follow-ups. It does not solve deep memory. "Where did I spend money on the Japan trip last month?" needs a richer state layer than one recent read context.
And the long tail is still undefeated. Every eval set is a net with holes. Users will always phrase things in ways the fixtures did not cover.
But the architecture crossed an important line. Otto is no longer a local model wrapped in tool dispatch. It is a local agent runtime with bounded iteration, sealed fallbacks, shared dispatch, typed focus, per-runtime continuation strategy, and evals that pin the behavior.
That is the part I did not expect to build when I started with "5$ starbucks".
A local agent is not a small chatbot. It is a runtime you own.