Dec 3, 2024 · 5 min read

The Cost Problem in AI Nobody Talks About

AI demos are free. AI at scale is expensive. Here's how I think about building cost-predictable agent systems.


Every AI startup demo shows the same thing: amazing capabilities, magical results, the future is here.

What they don't show: the AWS bill.

I've seen teams burn through their entire cloud budget in a week because an agent loop ran away. I've seen startups whose unit economics don't work because LLM costs eat all their margin.

AI cost management isn't a nice-to-have. It's survival.

Why AI Costs Are Different

Traditional software costs are predictable. You know how much a database query costs. You know how much a compute instance costs per hour.

AI costs are variable in terrifying ways:

Input size varies. A short prompt costs 1/100th of a long conversation with context.

Output length varies. The model decides how much to generate.

Loop iterations vary. ReAct patterns might take 3 steps or 30.

Failure modes are expensive. When things go wrong, they often go wrong expensively.

You can't just measure "cost per request" because two identical-looking requests might differ in cost by 100x.

The Math That Matters

Let's make it concrete.

GPT-4 Turbo (as of late 2024):

  • Input: ~$10 per million tokens
  • Output: ~$30 per million tokens

Sounds cheap until you do the math:

Simple query (1K input, 500 output):

  • Input: $0.01
  • Output: $0.015
  • Total: $0.025

Agent with context (10K input, 2K output, 5 iterations):

  • Per iteration: ~$0.16
  • 5 iterations: $0.80
  • Total: $0.80 per agent run (and that assumes a flat 10K context; in practice context grows each iteration, so this is optimistic)

Runaway loop (10K input, 2K output, 50 iterations):

  • Total: $8.00 per failed execution

Now multiply by users. A thousand users hitting a runaway loop? $8,000 gone in minutes.
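The arithmetic above fits in a few lines. Here's a back-of-envelope cost model using the late-2024 GPT-4 Turbo rates quoted earlier; swap in the prices for whatever model you actually run.

```python
INPUT_PRICE_PER_M = 10.00   # USD per million input tokens (GPT-4 Turbo, late 2024)
OUTPUT_PRICE_PER_M = 30.00  # USD per million output tokens

def run_cost(input_tokens: int, output_tokens: int, iterations: int = 1) -> float:
    """Estimated cost of a run, assuming fixed-size iterations."""
    per_iteration = (
        input_tokens * INPUT_PRICE_PER_M / 1_000_000
        + output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    )
    return per_iteration * iterations

print(run_cost(1_000, 500))         # simple query: ~$0.025
print(run_cost(10_000, 2_000, 5))   # agent run: ~$0.80
print(run_cost(10_000, 2_000, 50))  # runaway loop: ~$8.00
```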

The Failure Modes

1. Unbounded Loops

Agent calls tool. Tool returns data. Agent needs more data. Calls tool again. Repeat forever.

I've seen agents make 200+ tool calls on a single query before anyone noticed. Each call adds tokens. The context grows. Costs explode.

2. Context Accumulation

Every message in a conversation gets included. Every tool result gets appended. Context windows fill up.

Long conversation + many tool calls = massive input tokens on every subsequent call.

3. Retry Storms

API call fails. Retry. Fails again. Retry. Now with more context because you're explaining the failure.

Error handling that adds tokens is error handling that costs money.

4. Model Overkill

Using GPT-4 for tasks that GPT-3.5 could handle. Using 128K context when 8K would suffice.

Bigger isn't always better. It's always more expensive.

How I Control Costs

1. Hard Limits

Not soft limits. Not warnings. Hard stops that kill execution.

max_iterations_per_execution: 25
max_tool_calls_per_step: 10
max_tokens_per_execution: 50000
max_cost_per_execution: $2.00
execution_timeout: 300 seconds

When limits are hit, execution stops gracefully. The customer gets an explanation. You don't go bankrupt.
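A minimal sketch of what that enforcement looks like wrapped around an agent loop. `call_model` is a stand-in for your real LLM call; here it's assumed to return a done-flag plus the tokens and cost it consumed.

```python
import time

class BudgetExceeded(Exception):
    """Raised when a hard limit is hit; the caller explains it to the user."""

LIMITS = {
    "max_iterations": 25,
    "max_tokens": 50_000,
    "max_cost_usd": 2.00,
    "timeout_seconds": 300,
}

def run_agent(task, call_model, limits=LIMITS):
    start = time.monotonic()
    tokens_used = 0
    cost_used = 0.0
    for iteration in range(limits["max_iterations"]):
        if time.monotonic() - start > limits["timeout_seconds"]:
            raise BudgetExceeded("timeout")
        done, tokens, cost = call_model(task)
        tokens_used += tokens
        cost_used += cost
        if tokens_used > limits["max_tokens"]:
            raise BudgetExceeded("token limit")
        if cost_used > limits["max_cost_usd"]:
            raise BudgetExceeded("cost limit")
        if done:
            return {"iterations": iteration + 1,
                    "tokens": tokens_used, "cost": cost_used}
    raise BudgetExceeded("iteration limit")
```

The point is that every limit is checked inside the loop, on every iteration, so a runaway execution dies early instead of at the monthly invoice.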

2. Budget per Workflow

Different workflows have different cost ceilings.

  • Quick lookup: $0.10 max
  • Complex research: $1.00 max
  • Full analysis: $5.00 max

Track spending per execution. Alert when approaching limits. Stop when exceeded.

3. Model Routing

Not every task needs the most capable model.

Classification tasks → Haiku/GPT-3.5
Simple generation → Sonnet/GPT-4o mini
Complex reasoning → Opus/GPT-4

Route based on task complexity. Use the smallest model that produces acceptable results.
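A router can be as dumb as a lookup table keyed on task type; the tier names below are illustrative, not bindings to specific API model IDs.

```python
def route_model(task_type: str) -> str:
    """Pick the cheapest model tier that handles the task type."""
    routes = {
        "classification": "haiku",       # cheapest tier (Haiku / GPT-3.5-class)
        "simple_generation": "sonnet",   # mid tier
        "complex_reasoning": "opus",     # top tier, only when needed
    }
    return routes.get(task_type, "sonnet")  # default to the mid tier
```

Smarter versions classify the task with a cheap model first, but even a static table like this cuts spend if most of your traffic is routine.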

4. Context Management

Don't dump everything into context. Be selective.

  • Rolling summaries instead of full history
  • Relevant facts only, not entire knowledge base
  • Compressed tool results when full output isn't needed

Less context = lower cost = faster response.
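One way to implement the rolling-summary idea: keep the last few messages verbatim and collapse everything older into a summary. `summarize` here is a placeholder for a call to a cheap model.

```python
def build_context(messages, summarize, keep_last=6):
    """Replace old history with a summary, keep recent messages verbatim."""
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "system", "content": summarize(older)}
    return [summary] + recent
```

Ten messages in, seven out: one summary plus the six most recent. The input-token savings compound on every subsequent call.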

5. Caching

Same question twice? Same answer. Don't re-run the model.

Cache by:

  • Exact input match (simple, effective for repeated queries)
  • Semantic similarity (more complex, catches paraphrases)
  • Time-bounded (some queries need fresh data)

Caching isn't just performance. It's cost control.
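The exact-match variant with a TTL is a few lines; semantic caching needs embeddings and is a bigger project.

```python
import hashlib
import time

class ResponseCache:
    """Exact-match response cache with time-bounded entries."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode()).hexdigest()

    def get(self, prompt):
        entry = self._store.get(self._key(prompt))
        if entry is None:
            return None
        answer, stored_at = entry
        if time.time() - stored_at > self.ttl:
            return None  # stale: force a fresh model call
        return answer

    def put(self, prompt, answer):
        self._store[self._key(prompt)] = (answer, time.time())
```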

6. Pre-flight Estimation

Before running an expensive operation, estimate the cost:

Estimated cost: ~$0.75
Your budget remaining: $2.00
Proceed? [Y/n]

Let users make informed decisions. Let systems auto-reject requests that would exceed budgets.
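The auto-reject path is just the cost model from earlier compared against the remaining budget. Token estimates and prices here are illustrative.

```python
def preflight(est_input_tokens, est_output_tokens, budget_remaining,
              in_price=10.0, out_price=30.0):
    """Estimate cost in USD and approve only if it fits the budget."""
    estimate = (est_input_tokens * in_price
                + est_output_tokens * out_price) / 1_000_000
    return {"estimate": round(estimate, 4),
            "approved": estimate <= budget_remaining}
```

For interactive use, surface the estimate and ask; for automated pipelines, reject outright when `approved` is false.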

Tracking and Visibility

You can't optimize what you don't measure.

Every execution should track:

  • Input tokens (per model call)
  • Output tokens (per model call)
  • Model used
  • Total cost
  • Cost breakdown by phase

Aggregate to:

  • Cost per user
  • Cost per workflow type
  • Cost per hour/day/month

Dashboard this. Alert on anomalies. Review weekly.
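The tracking schema above maps naturally onto a per-call record plus a group-by. A sketch:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CallRecord:
    user: str
    workflow: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

def cost_by(records, field):
    """Aggregate total cost by any record field (user, workflow, model)."""
    totals = defaultdict(float)
    for r in records:
        totals[getattr(r, field)] += r.cost_usd
    return dict(totals)
```

In production this lives in a database, but the shape is the same: one row per model call, aggregations on top.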

The Unit Economics Test

Before launching any AI feature, answer:

  1. What's the average cost per use?
  2. What's the worst-case cost per use?
  3. What's the revenue/value per use?
  4. Does (3) exceed (1) with comfortable margin?
  5. Can you survive if (2) happens at scale?

If you can't answer these, you're not ready for production.

Real Numbers From My Systems

From actual production:

Workflow Type  | Avg Cost | P95 Cost | P99 Cost
Simple lookup  | $0.03    | $0.08    | $0.15
Complex query  | $0.25    | $0.60    | $1.20
Full analysis  | $1.50    | $3.00    | $5.00

The P99 is what matters for budgeting. Some queries will be expensive. Plan for it.

What Changes Everything

Two things transform AI economics:

1. Local models: Running Llama or Mistral locally changes the math completely. No per-token cost. Just compute cost.

For many tasks, local models are good enough. The trade-off shifts from "cost per query" to "capability per query."

2. Fine-tuning: A fine-tuned smaller model often beats a general-purpose larger model for specific tasks. At lower cost.

The future is likely hybrid: local models for routine tasks, API models for complex reasoning.


AI costs aren't a problem to solve later. They're a constraint to design around from day one. Build with hard limits, measure everything, route intelligently.

The teams that figure out cost-efficient AI will win. The teams that ignore it will run out of money.
