Feb 22, 2026 · 9 min read

Observability for AI Agent Systems Is Not Just Logging

Logs tell you what happened. Traces tell you why. Here's how to build observability that actually works for multi-agent AI systems.

I wrote a manifesto for better logging a few years ago. Everything in it still holds. But when I started building multi-agent AI systems, I realized logging — even good logging — isn't enough.

An AI agent workflow isn't a request-response cycle. It's a tree. A single user action can trigger an orchestrator that spawns three agents, each making multiple LLM calls, some with tool use, some with memory retrieval, some waiting for human approval. One of those agents might retry with a different provider after a circuit breaker trips.

logger.info("Processing request") doesn't cut it here.

What Makes AI Observability Different

Traditional observability has three pillars: logs, metrics, traces. AI agent systems need all three, but the semantics change.

Non-Deterministic Outputs

A REST API returns the same response for the same input. An LLM doesn't. Same prompt, same model, different output. This means you can't just log "called LLM" — you need to capture the full context: prompt, model, temperature, token counts, and the actual response.

Without this, debugging is guesswork. "The agent made a bad decision" tells you nothing. "The agent received this 4,000-token context, generated this response with 0.7 temperature, and chose path B over path A" tells you everything.

Variable Latency and Cost

A database query takes 5ms or 50ms. An LLM call takes 500ms or 15,000ms. The variance is enormous, and it's not a bug — it's the nature of generative models.

Every LLM call has a cost that depends on input and output token counts. Two identical-looking requests can cost $0.01 and $0.50 depending on context size. If you're not tracking this per-span, you're flying blind on both performance and spend.
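The per-call math is simple enough to sketch in a few lines of TypeScript. The pricing table below is made up for illustration — it is not real provider rates — and the microdollar unit keeps the value an integer so it sums cleanly across spans:

```typescript
// Hypothetical per-million-token pricing (illustrative values, not real rates).
const PRICING: Record<string, { inputPerMTok: number; outputPerMTok: number }> = {
  "anthropic/claude-sonnet": { inputPerMTok: 3.0, outputPerMTok: 15.0 },
};

// Cost in microdollars: integers sum safely across spans, no float drift.
function costMicrodollars(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICING[model];
  if (!p) throw new Error(`no pricing for ${model}`);
  const dollars =
    (inputTokens / 1_000_000) * p.inputPerMTok +
    (outputTokens / 1_000_000) * p.outputPerMTok;
  return Math.round(dollars * 1_000_000);
}
```

Run this at span-close time and the two identical-looking requests stop being identical: the one that dragged in a 50,000-token context shows up immediately.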

Branching Execution Paths

Traditional services have predictable call graphs. Agent systems don't. A router node might send execution down one of five paths based on LLM classification. A parallel node spawns concurrent branches that join later. A human gate pauses execution indefinitely.

Your trace tree isn't a straight line. It's a directed acyclic graph with conditional branches, parallel fan-outs, and variable-length paths.

The Trace Is the Unit of Work

In my system, every workflow execution gets a single trace. Every node in the workflow gets a span. Every LLM call within a node gets a child span. This gives me a complete picture:

Trace: execution_abc123
├── Span: trigger_node (12ms)
├── Span: llm_classify (2,340ms)
│   ├── LLM Call: anthropic/claude-sonnet (2,280ms)
│   │   ├── input_tokens: 3,200
│   │   ├── output_tokens: 45
│   │   ├── cost: $0.034
│   │   └── classification: "support_request"
│   └── Decision: route → support_branch
├── Span: memory_retrieve (89ms)
│   ├── query: "customer history for user_789"
│   └── results: 3 documents, 2,100 tokens
├── Span: llm_respond (4,120ms)
│   ├── LLM Call: anthropic/claude-sonnet (4,050ms)
│   │   ├── input_tokens: 5,300
│   │   ├── output_tokens: 890
│   │   └── cost: $0.083
│   └── tools_called: [lookup_order, check_status]
├── Span: human_gate (00:03:22)
│   ├── gate_type: approval
│   ├── timeout: 10m
│   ├── decision: approved
│   └── decided_by: user_456
└── Span: action_send_email (230ms)
    └── status: success

Total: 00:03:29 | Cost: $0.117 | Tokens: 9,435

This trace tells me everything. Where the time went (human gate: 3 minutes, LLM calls: 6.4 seconds). Where the money went (response generation: 71% of cost). What decisions were made and why.

What to Capture on Every LLM Span

At minimum, every LLM call span needs:

Attribute            Why
llm.provider         Which provider handled it
llm.model            Which model was used
llm.input_tokens     Input size (cost + context tracking)
llm.output_tokens    Output size (cost tracking)
llm.cost             Calculated cost in microdollars
llm.temperature      Reproducibility debugging
llm.latency_ms       Performance tracking
llm.status           success, error, timeout, circuit_break
llm.tools_called     What tools the model invoked

Optional but valuable:

Attribute            Why
llm.prompt_hash      Detect prompt drift without storing full prompts
llm.cache_hit        Prompt caching effectiveness
llm.retry_count      How many attempts before success
llm.fallback_from    If this was a fallback, which provider failed

Storing full prompts and responses in traces is expensive. I hash prompts and store full content only on errors or when sampling. For debugging, the hash lets me correlate with the prompt version in my codebase.
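Here's one way to do the hashing, sketched in TypeScript with Node's built-in crypto. The 16-character truncation is my own choice, not a standard — it's short enough for span attributes and long enough that collisions are a non-issue at this scale:

```typescript
import { createHash } from "node:crypto";

// Hash the prompt so a span can be correlated with a prompt version
// in the codebase without storing the full text in every trace.
function promptHash(prompt: string): string {
  return createHash("sha256").update(prompt, "utf8").digest("hex").slice(0, 16);
}
```

Set `llm.prompt_hash` from this on every LLM span; when the hash changes across deploys, you know the prompt drifted even though you never stored it.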

Workflow-Level Spans

LLM spans are only part of the picture. Each workflow node needs its own instrumentation:

Router nodes — Log the classification result, confidence score, and which branch was taken. When a workflow produces bad output, the first question is usually "why did it go down this path?"

Parallel nodes — Track fan-out count, individual branch durations, and join timing. If one branch takes 10x longer than the others, you need to see that immediately.

Memory nodes — Log the query, number of results, total token count retrieved, and relevance scores. Context quality drives output quality. Bad retrieval is the silent killer of agent performance.

Human gates — Track wait time, timeout configuration, who decided, and what they decided. This is your audit trail.

Action nodes — Log the external system called, request/response status, and any side effects. These are your irreversible operations.
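As a sketch of what the router-node case might look like — the `Span` shape and the `router.*` attribute names below are stand-ins for whatever tracing SDK you use, not a real API:

```typescript
// Minimal span record, standing in for a real tracing SDK's span object.
interface Span {
  name: string;
  attributes: Record<string, string | number | boolean>;
  ended: boolean;
  end(): void;
}

function makeSpan(name: string): Span {
  const span: Span = {
    name,
    attributes: {},
    ended: false,
    end() { span.ended = true; },
  };
  return span;
}

// Record why the workflow went down this path: classification,
// confidence, and the branch actually taken.
function instrumentRouterNode(
  span: Span,
  classification: string,
  confidence: number,
  branch: string,
): void {
  span.attributes["router.classification"] = classification;
  span.attributes["router.confidence"] = confidence;
  span.attributes["router.branch_taken"] = branch;
  span.end();
}
```

The same pattern applies to the other node types — swap the attribute set for fan-out counts, retrieval scores, or gate decisions.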

Metrics That Matter

Traces give you the micro view. Metrics give you the macro view. For AI agent systems, the standard RED metrics (Rate, Errors, Duration) need extension:

Cost Metrics

execution_cost_total (histogram)
  labels: workflow_type, provider, model

execution_token_count (histogram)
  labels: workflow_type, direction (input/output)

provider_cost_rate (gauge)
  labels: provider
  unit: dollars_per_minute

Cost-per-execution trending up means something changed. Maybe prompts grew. Maybe the model changed. Maybe a loop isn't terminating early enough. Without the metric, you won't notice until the invoice arrives.

Quality Metrics

execution_completion_rate (counter)
  labels: workflow_type, outcome (success/error/timeout/budget_exhausted)

node_error_rate (counter)
  labels: workflow_type, node_type, error_type

human_gate_override_rate (counter)
  labels: workflow_type, gate_type
  // high override rate = your AI decisions need work

The human gate override rate is underrated. If humans reject 40% of AI decisions at a particular gate, that's not an approval problem — it's a model quality problem. The metric surfaces what anecdotes hide.

Latency Metrics

execution_duration (histogram)
  labels: workflow_type
  buckets: 1s, 5s, 15s, 30s, 60s, 300s

llm_call_duration (histogram)
  labels: provider, model
  buckets: 500ms, 1s, 2s, 5s, 10s, 30s

human_gate_wait_duration (histogram)
  labels: gate_type
  buckets: 10s, 30s, 60s, 300s, 600s, 3600s

AI latency has a bimodal distribution. Most requests are fast. Some are very slow. P50 looks fine. P99 is terrible. If you're only tracking averages, you're missing the experience of your most frustrated users.
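A nearest-rank percentile over raw samples makes the point concrete — with a bimodal sample set, P50 and P99 tell completely different stories:

```typescript
// Nearest-rank percentile over raw latency samples (in ms).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}
```

Feed it 95 requests at 800ms and 5 at 20 seconds: P50 comes back 800, P99 comes back 20,000. The average (about 1.76s) describes nobody's experience.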

Correlating Across the Stack

The trace ID is the thread that ties everything together. My system propagates it through:

flowchart TD
    A[HTTP Request] --> B[API Handler]
    B --> C[Execution Engine]
    C --> D[Workflow Node]
    D --> E[LLM Call]
    E --> F[Provider API]
    D --> G[Tool Execution]
    G --> H[Database Query]
    G --> I[External API]


When something goes wrong, I start with the trace ID from the error log and walk the entire execution tree. No grep. No guessing. No "which log file is this in?"

This is where OpenTelemetry earns its keep. Standard trace propagation, standard span attributes, standard exporters. I don't want to build tracing infrastructure. I want to instrument my system and send data to whatever backend I'm using.
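The propagation idea in miniature — this is not the OpenTelemetry API, just the mechanism it builds on. In Node, `AsyncLocalStorage` carries the trace ID across async boundaries so every span, tool call, and log line inside an execution can pick it up without threading a parameter through every function:

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

// Ambient trace context, visible to everything running inside withTrace().
const traceContext = new AsyncLocalStorage<{ traceId: string }>();

function withTrace<T>(fn: () => T): T {
  return traceContext.run({ traceId: randomUUID() }, fn);
}

// Any log line or span anywhere in the call tree can attach the trace ID.
function currentTraceId(): string | undefined {
  return traceContext.getStore()?.traceId;
}
```

OpenTelemetry's context API does the same job, plus cross-process propagation via HTTP headers — which is exactly why it's worth adopting instead of hand-rolling this.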

The Dashboard I Actually Use

I've built elaborate dashboards. Most of them went unused. Here's what I actually look at daily:

Execution health — Completion rate, error rate, timeout rate. One chart, last 24 hours. If the lines move, something changed.

Cost burn rate — Dollars per hour, broken down by workflow type. With trend line. If the trend line slopes up and no new features shipped, investigate.

P95 latency by workflow — Not average. P95. Broken down by workflow type. If a workflow that's usually 5 seconds is now 15 seconds, something regressed.

Provider health — Circuit breaker state, error rate, latency per provider. If one provider is degrading, I want to know before my users do.

Active human gates — How many executions are waiting for human approval right now. If this number spikes, either something is broken or humans are overwhelmed.

Everything else is on-demand. I query traces when debugging specific issues. I don't need a dashboard for things I only look at during incidents.

Structured Events Over Log Lines

One pattern I've moved toward: emitting structured events instead of log lines.

// Instead of this:
logger.info("LLM call completed", {
    model: "claude-sonnet",
    tokens: 5300,
    duration: 4050
});

// Emit this:
emit_event(LLMCallCompleted {
    trace_id,
    span_id,
    provider: "anthropic",
    model: "claude-sonnet-4-20250514",
    input_tokens: 3200,
    output_tokens: 890,
    cost_microdollars: 83000,
    duration_ms: 4050,
    cache_hit: false,
    tools_called: ["lookup_order", "check_status"],
});

The structured event is typed, validated, and can be consumed by traces, metrics, and logs simultaneously. One emission point, three observability signals. No drift between what your logs say and what your metrics show.
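A minimal fan-out sketch in TypeScript — the event shape mirrors the one above (field names adapted to camelCase), and the consumer registry stands in for a real pipeline feeding a trace exporter, a metrics client, and a log sink:

```typescript
// One typed event, consumed by traces, metrics, and logs from a single emission point.
interface LLMCallCompleted {
  traceId: string;
  spanId: string;
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  costMicrodollars: number;
  durationMs: number;
}

type Consumer = (e: LLMCallCompleted) => void;
const consumers: Consumer[] = [];

function onEvent(c: Consumer): void {
  consumers.push(c);
}

function emitEvent(e: LLMCallCompleted): void {
  for (const c of consumers) c(e);
}
```

Register one consumer per signal and they can never disagree, because they all saw the same event.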

What I Got Wrong Initially

Too much data, too little signal. I started by recording everything — full prompts, full responses, every intermediate state. Storage costs exploded. More importantly, I couldn't find anything because every search returned thousands of results.

The fix: sample heavily for full content capture (1-5% of requests). Record full content for all errors. Use hashes and token counts for everything else.
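One way to implement that policy is a deterministic sample keyed on the trace ID, so every span in one execution makes the same capture decision. The hash below is a toy stand-in — any stable hash works:

```typescript
// Capture full content for every error; otherwise a small deterministic sample.
// Keying on the trace ID keeps the decision consistent across all spans
// in one execution (this convention is mine, not a standard).
function captureFullContent(traceId: string, isError: boolean, sampleRate = 0.02): boolean {
  if (isError) return true;
  let h = 0;
  for (const ch of traceId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return (h % 10_000) / 10_000 < sampleRate;
}
```

Errors are always captured in full; everything else falls back to hashes and token counts.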

Ignoring the cost dimension. My first observability setup tracked latency and errors but not cost. I could tell you a workflow was fast and reliable, but not that it was costing 3x what it should. Cost is a first-class observability dimension for AI systems, not an afterthought.

Not correlating human decisions. Human gates produce some of the most valuable observability data: what did the AI suggest, what did the human choose, and why did they differ? I initially treated human gates as opaque wait points. Now every gate records both the AI recommendation and the human decision, so I can track agreement rates and identify patterns where the AI consistently gets it wrong.


Logs tell you what happened. Metrics tell you how much. Traces tell you why.

For AI agent systems, you need all three — and you need them aware of the things that make AI different: non-deterministic outputs, variable costs, branching execution paths, and humans in the loop.

The system that's just logging is the system you can't debug when it matters.
