Observability for AI Agent Systems Is Not Just Logging
Logs tell you what happened. Traces tell you why. Here's how to build observability that actually works for multi-agent AI systems.

I wrote a manifesto for better logging a few years ago. Everything in it still holds. But when I started building multi-agent AI systems, I realized that logging — even good logging — isn't enough.
An AI agent workflow isn't a request-response cycle. It's a tree. A single user action can trigger an orchestrator that spawns three agents, each making multiple LLM calls, some with tool use, some with memory retrieval, some waiting for human approval. One of those agents might retry with a different provider after a circuit breaker trips.
`logger.info("Processing request")` doesn't cut it here.
Traditional observability has three pillars: logs, metrics, traces. AI agent systems need all three, but the semantics change.
A REST API returns the same response for the same input. An LLM doesn't. Same prompt, same model, different output. This means you can't just log "called LLM" — you need to capture the full context: prompt, model, temperature, token counts, and the actual response.
Without this, debugging is guesswork. "The agent made a bad decision" tells you nothing. "The agent received this 4,000-token context, generated this response with 0.7 temperature, and chose path B over path A" tells you everything.
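Capturing that context can be as simple as flattening it into span attributes. A minimal sketch, with my own illustrative names (`LLMCallRecord` and `llmCallAttributes` are not a real library API):

```typescript
// Illustrative sketch: flatten the full context of one LLM call into a span
// attribute record. Field and key names here are assumptions, not a standard.
interface LLMCallRecord {
  provider: string;
  model: string;
  temperature: number;
  inputTokens: number;
  outputTokens: number;
  response: string;
}

function llmCallAttributes(call: LLMCallRecord): Record<string, string | number> {
  return {
    "llm.provider": call.provider,
    "llm.model": call.model,
    "llm.temperature": call.temperature,
    "llm.input_tokens": call.inputTokens,
    "llm.output_tokens": call.outputTokens,
    // Keep only a short preview inline; full responses go to sampled storage.
    "llm.response_preview": call.response.slice(0, 200),
  };
}
```

With attributes like these on every LLM span, "the agent made a bad decision" becomes a query instead of a mystery.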
A database query takes 5ms or 50ms. An LLM call takes 500ms or 15,000ms. The variance is enormous, and it's not a bug — it's the nature of generative models.
Every LLM call has a cost that depends on input and output token counts. Two identical-looking requests can cost $0.01 and $0.50 depending on context size. If you're not tracking this per-span, you're flying blind on both performance and spend.
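One way to make cost a first-class span attribute is to compute it at emission time from the token counts. A sketch, assuming made-up per-million-token prices (the `PRICES` table below is illustrative, not a real provider price sheet):

```typescript
// Illustrative per-million-token prices; real prices vary by provider and model.
const PRICES: Record<string, { inputPerMTok: number; outputPerMTok: number }> = {
  "claude-sonnet": { inputPerMTok: 3.0, outputPerMTok: 15.0 },
};

// Integer microdollars avoid floating-point drift when summing many spans.
function costMicrodollars(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[model];
  if (!p) throw new Error(`no price entry for ${model}`);
  const dollars =
    (inputTokens / 1e6) * p.inputPerMTok + (outputTokens / 1e6) * p.outputPerMTok;
  return Math.round(dollars * 1e6);
}
```

Storing the cost as an integer on the span is a deliberate choice: summing floats across millions of spans accumulates rounding error; summing integer microdollars does not.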
Traditional services have predictable call graphs. Agent systems don't. A router node might send execution down one of five paths based on LLM classification. A parallel node spawns concurrent branches that join later. A human gate pauses execution indefinitely.
Your trace tree isn't a straight line. It's a directed acyclic graph with conditional branches, parallel fan-outs, and variable-length paths.
In my system, every workflow execution gets a single trace. Every node in the workflow gets a span. Every LLM call within a node gets a child span. This gives me a complete picture:
```
Trace: execution_abc123
├── Span: trigger_node (12ms)
├── Span: llm_classify (2,340ms)
│   ├── LLM Call: anthropic/claude-sonnet (2,280ms)
│   │   ├── input_tokens: 3,200
│   │   ├── output_tokens: 45
│   │   ├── cost: $0.034
│   │   └── classification: "support_request"
│   └── Decision: route → support_branch
├── Span: memory_retrieve (89ms)
│   ├── query: "customer history for user_789"
│   └── results: 3 documents, 2,100 tokens
├── Span: llm_respond (4,120ms)
│   ├── LLM Call: anthropic/claude-sonnet (4,050ms)
│   │   ├── input_tokens: 5,300
│   │   ├── output_tokens: 890
│   │   └── cost: $0.083
│   └── tools_called: [lookup_order, check_status]
├── Span: human_gate (00:03:22)
│   ├── gate_type: approval
│   ├── timeout: 10m
│   ├── decision: approved
│   └── decided_by: user_456
└── Span: action_send_email (230ms)
    └── status: success

Total: 00:03:29 | Cost: $0.117 | Tokens: 11,535
```
This trace tells me everything. Where the time went (human gate: 3 minutes, LLM calls: 6.4 seconds). Where the money went (response generation: 71% of cost). What decisions were made and why.
At minimum, every LLM call span needs:
| Attribute | Why |
|---|---|
| `llm.provider` | Which provider handled it |
| `llm.model` | Which model was used |
| `llm.input_tokens` | Input size (cost + context tracking) |
| `llm.output_tokens` | Output size (cost tracking) |
| `llm.cost` | Calculated cost in microdollars |
| `llm.temperature` | Reproducibility debugging |
| `llm.latency_ms` | Performance tracking |
| `llm.status` | success, error, timeout, circuit_break |
| `llm.tools_called` | What tools the model invoked |
Optional but valuable:
| Attribute | Why |
|---|---|
| `llm.prompt_hash` | Detect prompt drift without storing full prompts |
| `llm.cache_hit` | Prompt caching effectiveness |
| `llm.retry_count` | How many attempts before success |
| `llm.fallback_from` | If this was a fallback, which provider failed |
Storing full prompts and responses in traces is expensive. I hash prompts and store full content only on errors or when sampling. For debugging, the hash lets me correlate with the prompt version in my codebase.
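The hashing itself is cheap. A minimal sketch using Node's built-in `crypto` module (the truncation to 16 hex characters is my own convention, chosen only to keep attributes compact):

```typescript
import { createHash } from "node:crypto";

// A stable hash of the prompt text: enough to detect drift and correlate a
// trace with the prompt version in source control, without storing the prompt.
function promptHash(prompt: string): string {
  return createHash("sha256").update(prompt, "utf8").digest("hex").slice(0, 16);
}
```

The same function runs at build time over the prompts in the codebase, so a trace's `llm.prompt_hash` can be matched to an exact prompt version.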
LLM spans are only part of the picture. Each workflow node needs its own instrumentation:
Router nodes — Log the classification result, confidence score, and which branch was taken. When a workflow produces bad output, the first question is usually "why did it go down this path?"
Parallel nodes — Track fan-out count, individual branch durations, and join timing. If one branch takes 10x longer than the others, you need to see that immediately.
Memory nodes — Log the query, number of results, total token count retrieved, and relevance scores. Context quality drives output quality. Bad retrieval is the silent killer of agent performance.
Human gates — Track wait time, timeout configuration, who decided, and what they decided. This is your audit trail.
Action nodes — Log the external system called, request/response status, and any side effects. These are your irreversible operations.
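As with LLM spans, each node type reduces to a small attribute payload. A sketch for two of the node types above; the key names are my own, not a standard schema:

```typescript
// Hypothetical attribute builders for router and memory nodes.
function routerAttributes(classification: string, confidence: number, branch: string) {
  return {
    "router.classification": classification, // what the LLM decided
    "router.confidence": confidence,         // how sure it was
    "router.branch_taken": branch,           // which path execution took
  };
}

function memoryAttributes(query: string, resultCount: number, tokenCount: number, topScore: number) {
  return {
    "memory.query": query,               // what was asked of the store
    "memory.result_count": resultCount,  // how many documents came back
    "memory.token_count": tokenCount,    // how much context was injected
    "memory.top_relevance": topScore,    // best relevance score retrieved
  };
}
```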
Traces give you the micro view. Metrics give you the macro view. For AI agent systems, the standard RED metrics (Rate, Errors, Duration) need extension:
```
execution_cost_total (histogram)
  labels: workflow_type, provider, model

execution_token_count (histogram)
  labels: workflow_type, direction (input/output)

provider_cost_rate (gauge)
  labels: provider
  unit: dollars_per_minute
```
Cost-per-execution trending up means something changed. Maybe prompts grew. Maybe the model changed. Maybe a loop isn't terminating early enough. Without the metric, you won't notice until the invoice arrives.
```
execution_completion_rate (counter)
  labels: workflow_type, outcome (success/error/timeout/budget_exhausted)

node_error_rate (counter)
  labels: workflow_type, node_type, error_type

human_gate_override_rate (counter)
  labels: workflow_type, gate_type
  // high override rate = your AI decisions need work
```
The human gate override rate is underrated. If humans reject 40% of AI decisions at a particular gate, that's not an approval problem — it's a model quality problem. The metric surfaces what anecdotes hide.
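Turning the counter into a signal is a one-liner per gate. A sketch; the 0.25 threshold below is an arbitrary example, not a recommendation:

```typescript
interface GateStats { approved: number; overridden: number }

// Fraction of decisions where the human rejected the AI's recommendation.
function overrideRate(stats: GateStats): number {
  const total = stats.approved + stats.overridden;
  return total === 0 ? 0 : stats.overridden / total;
}

// Gates whose override rate exceeds a (here arbitrary) threshold.
function gatesNeedingAttention(gates: Record<string, GateStats>, threshold = 0.25): string[] {
  return Object.entries(gates)
    .filter(([, s]) => overrideRate(s) > threshold)
    .map(([name]) => name);
}
```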
```
execution_duration (histogram)
  labels: workflow_type
  buckets: 1s, 5s, 15s, 30s, 60s, 300s

llm_call_duration (histogram)
  labels: provider, model
  buckets: 500ms, 1s, 2s, 5s, 10s, 30s

human_gate_wait_duration (histogram)
  labels: gate_type
  buckets: 10s, 30s, 60s, 300s, 600s, 3600s
```
AI latency has a bimodal distribution. Most requests are fast. Some are very slow. P50 looks fine. P99 is terrible. If you're only tracking averages, you're missing the experience of your most frustrated users.
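A small nearest-rank percentile sketch makes the point concrete: with nine fast calls and one slow one, the median looks healthy while the tail does not.

```typescript
// Nearest-rank percentile over a latency sample (milliseconds).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Nine fast calls, one very slow one: a shape typical of LLM latency.
const latencies = [500, 520, 540, 560, 580, 600, 620, 640, 660, 15000];
```

Here `percentile(latencies, 50)` is 580ms, which looks fine, while `percentile(latencies, 99)` is the full 15 seconds that one in ten users actually experienced.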
The trace ID is the thread that ties everything together. My system propagates it through:
```mermaid
flowchart TD
    A[HTTP Request] --> B[API Handler]
    B --> C[Execution Engine]
    C --> D[Workflow Node]
    D --> E[LLM Call]
    E --> F[Provider API]
    D --> G[Tool Execution]
    G --> H[Database Query]
    G --> I[External API]

    class C special
    class D,E worker
    class F,G,H,I default
```
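At the boundaries between those layers, the standard way to carry the trace ID is the W3C Trace Context `traceparent` header (`version-traceid-spanid-flags`). A sketch of building and parsing it:

```typescript
// Build a W3C Trace Context traceparent header: version-traceid-spanid-flags.
function buildTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

// Parse an incoming traceparent header; returns null if malformed.
function parseTraceparent(header: string): { traceId: string; spanId: string } | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/.exec(header);
  return m ? { traceId: m[1], spanId: m[2] } : null;
}
```

In practice the OpenTelemetry SDK does this propagation for you; the sketch just shows what travels on the wire.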
When something goes wrong, I start with the trace ID from the error log and walk the entire execution tree. No grep. No guessing. No "which log file is this in?"
This is where OpenTelemetry earns its keep. Standard trace propagation, standard span attributes, standard exporters. I don't want to build tracing infrastructure. I want to instrument my system and send data to whatever backend I'm using.
I've built elaborate dashboards. Most of them went unused. Here's what I actually look at daily:
Execution health — Completion rate, error rate, timeout rate. One chart, last 24 hours. If the lines move, something changed.
Cost burn rate — Dollars per hour, broken down by workflow type. With trend line. If the trend line slopes up and no new features shipped, investigate.
P95 latency by workflow — Not average. P95. Broken down by workflow type. If a workflow that's usually 5 seconds is now 15 seconds, something regressed.
Provider health — Circuit breaker state, error rate, latency per provider. If one provider is degrading, I want to know before my users do.
Active human gates — How many executions are waiting for human approval right now. If this number spikes, either something is broken or humans are overwhelmed.
Everything else is on-demand. I query traces when debugging specific issues. I don't need a dashboard for things I only look at during incidents.
One pattern I've moved toward: emitting structured events instead of log lines.
```typescript
// Instead of this:
logger.info("LLM call completed", {
  model: "claude-sonnet",
  tokens: 5300,
  duration: 4050,
});

// Emit this:
emit_event<LLMCallCompleted>({
  trace_id,
  span_id,
  provider: "anthropic",
  model: "claude-sonnet-4-20250514",
  input_tokens: 5300,
  output_tokens: 890,
  cost_microdollars: 83000,
  duration_ms: 4050,
  cache_hit: false,
  tools_called: ["lookup_order", "check_status"],
});
```
The structured event is typed, validated, and can be consumed by traces, metrics, and logs simultaneously. One emission point, three observability signals. No drift between what your logs say and what your metrics show.
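The fan-out itself can be a single function. A sketch with in-memory stand-ins for the three sinks (a real system would hand these to a logger, a metrics registry, and a span recorder):

```typescript
// A trimmed version of the event; field names follow the example above.
interface LLMCallCompleted {
  trace_id: string;
  provider: string;
  model: string;
  cost_microdollars: number;
  duration_ms: number;
}

// In-memory stand-ins for the three observability sinks.
const logs: string[] = [];
const metrics: Record<string, number> = {};
const spanAttrs: Record<string, unknown>[] = [];

// One emission point feeds all three signals, so they can never drift apart.
function emitEvent(e: LLMCallCompleted): void {
  logs.push(`llm_call_completed trace=${e.trace_id} model=${e.model}`); // log
  metrics["llm_cost_microdollars_total"] =
    (metrics["llm_cost_microdollars_total"] ?? 0) + e.cost_microdollars; // metric
  spanAttrs.push({ "llm.provider": e.provider, "llm.duration_ms": e.duration_ms }); // trace
}
```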
Too much data, too little signal. I started by recording everything — full prompts, full responses, every intermediate state. Storage costs exploded. More importantly, I couldn't find anything because every search returned thousands of results.
The fix: sample heavily for full content capture (1-5% of requests). Record full content for all errors. Use hashes and token counts for everything else.
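That policy fits in a few lines. A sketch; the `rng` parameter is injected so the decision stays testable, and the sample rate is whatever fraction you choose:

```typescript
type Capture = "full" | "hash_only";

// Full content for every error, a small random sample otherwise,
// hashes and token counts for everything else.
function captureDecision(isError: boolean, sampleRate: number, rng: () => number): Capture {
  if (isError) return "full"; // errors always keep full prompt/response
  return rng() < sampleRate ? "full" : "hash_only";
}
```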
Ignoring the cost dimension. My first observability setup tracked latency and errors but not cost. I could tell you a workflow was fast and reliable, but not that it was costing 3x what it should. Cost is a first-class observability dimension for AI systems, not an afterthought.
Not correlating human decisions. Human gates produce some of the most valuable observability data: what did the AI suggest, what did the human choose, and why did they differ? I initially treated human gates as opaque wait points. Now every gate records both the AI recommendation and the human decision, so I can track agreement rates and identify patterns where the AI consistently gets it wrong.
Logs tell you what happened. Metrics tell you how much. Traces tell you why.
For AI agent systems, you need all three — and you need them aware of the things that make AI different: non-deterministic outputs, variable costs, branching execution paths, and humans in the loop.
The system that's just logging is the system you can't debug when it matters.