The Cost Problem in AI Nobody Talks About
AI demos are free. AI at scale is expensive. Here's how I think about building cost-predictable agent systems.

Every AI startup demo shows the same thing: amazing capabilities, magical results, the future is here.
What they don't show: the AWS bill.
I've seen teams burn through their entire cloud budget in a week because an agent loop ran away. I've seen startups whose unit economics don't work because LLM costs eat all their margin.
AI cost management isn't a nice-to-have. It's survival.
Traditional software costs are predictable. You know how much a database query costs. You know how much a compute instance costs per hour.
AI costs are variable in terrifying ways:
Input size varies. A short prompt costs 1/100th of a long conversation with context.
Output length varies. The model decides how much to generate.
Loop iterations vary. ReAct patterns might take 3 steps or 30.
Failure modes are expensive. When things go wrong, they often go wrong expensively.
You can't just measure "cost per request" because two identical-looking requests can differ in cost by a factor of 100.
Let's make it concrete.
GPT-4 Turbo (as of late 2024): roughly $0.01 per 1K input tokens and $0.03 per 1K output tokens.
Sounds cheap until you do the math:
Simple query (1K input, 500 output): about $0.03.
Agent with context (10K input, 2K output per iteration, 5 iterations): about $0.80.
Runaway loop (10K input, 2K output per iteration, 50 iterations): about $8.00.
Now multiply by users. A thousand users hitting a runaway loop? $8,000 gone in minutes.
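To make the arithmetic explicit, here's a minimal sketch of the math behind those three numbers (the prices are the illustrative GPT-4 Turbo rates above, not current rates):

```python
# Back-of-the-envelope cost math for the three examples above.
# Prices are illustrative (GPT-4 Turbo, late 2024); check your provider's current rates.
INPUT_PER_1K = 0.01
OUTPUT_PER_1K = 0.03

def execution_cost(input_tokens: int, output_tokens: int, iterations: int = 1) -> float:
    per_iteration = (input_tokens / 1000) * INPUT_PER_1K + (output_tokens / 1000) * OUTPUT_PER_1K
    return per_iteration * iterations

print(execution_cost(1_000, 500))         # simple query     -> ~$0.03
print(execution_cost(10_000, 2_000, 5))   # agent w/ context -> ~$0.80
print(execution_cost(10_000, 2_000, 50))  # runaway loop     -> ~$8.00
```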
Agent calls tool. Tool returns data. Agent needs more data. Calls tool again. Repeat forever.
I've seen agents make 200+ tool calls on a single query before anyone noticed. Each call adds tokens. The context grows. Costs explode.
Every message in a conversation gets included. Every tool result gets appended. Context windows fill up.
Long conversation + many tool calls = massive input tokens on every subsequent call.
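A rough sketch of why that hurts: if every turn re-sends the full history, cumulative input tokens grow quadratically with conversation length (the per-turn token count below is invented for illustration):

```python
# Illustration only: re-sending the full history every turn makes cumulative
# input tokens grow quadratically with conversation length.
TOKENS_ADDED_PER_TURN = 800  # assumed size of each new message plus tool result

def cumulative_input_tokens(turns: int) -> int:
    history = 0
    total_input = 0
    for _ in range(turns):
        history += TOKENS_ADDED_PER_TURN  # the context grows every turn
        total_input += history            # and the whole thing is re-sent
    return total_input

print(cumulative_input_tokens(10))  # 44,000 input tokens
print(cumulative_input_tokens(50))  # 1,020,000 input tokens
```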
API call fails. Retry. Fails again. Retry. Now with more context because you're explaining the failure.
Error handling that adds tokens is error handling that costs money.
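One way to keep retries from compounding the bill: cap attempts and resend the same prompt instead of appending an ever-growing error transcript. A minimal sketch, where call_model is a hypothetical stand-in for your provider call:

```python
import time

MAX_RETRIES = 3

def call_with_bounded_retries(call_model, prompt: str):
    """Retry a flaky call a fixed number of times without growing the prompt."""
    last_error = None
    for attempt in range(MAX_RETRIES):
        try:
            # Same prompt every attempt: no failure explanation gets appended to the context.
            return call_model(prompt)
        except Exception as exc:
            last_error = exc
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"Gave up after {MAX_RETRIES} attempts") from last_error
```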
Using GPT-4 for tasks that GPT-3.5 could handle. Using 128K context when 8K would suffice.
Bigger isn't always better. It's always more expensive.
Not soft limits. Not warnings. Hard stops that kill execution.
```yaml
max_iterations_per_execution: 25
max_tool_calls_per_step: 10
max_tokens_per_execution: 50000
max_cost_per_execution: 2.00   # USD
execution_timeout: 300         # seconds
```
When limits are hit, execution stops gracefully. The customer gets an explanation. You don't go bankrupt.
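A minimal enforcement sketch, assuming you can count tokens and dollars per step; the limit values mirror the config above:

```python
from dataclasses import dataclass

@dataclass
class ExecutionLimits:
    max_iterations: int = 25
    max_tokens: int = 50_000
    max_cost_usd: float = 2.00

@dataclass
class ExecutionState:
    iterations: int = 0
    tokens: int = 0
    cost_usd: float = 0.0

class BudgetExceeded(Exception):
    """Raised to stop the agent loop when any hard limit is hit."""

def check_limits(state: ExecutionState, limits: ExecutionLimits) -> None:
    if state.iterations >= limits.max_iterations:
        raise BudgetExceeded(f"iteration limit {limits.max_iterations} reached")
    if state.tokens >= limits.max_tokens:
        raise BudgetExceeded(f"token limit {limits.max_tokens} reached")
    if state.cost_usd >= limits.max_cost_usd:
        raise BudgetExceeded(f"cost limit ${limits.max_cost_usd:.2f} reached")

# In the agent loop: call check_limits() before each step, update the state after,
# and catch BudgetExceeded at the top level to return a clear explanation to the user.
```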
Different workflows have different cost ceilings.
Track spending per execution. Alert when approaching limits. Stop when exceeded.
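A sketch of per-workflow ceilings with a warning threshold before the hard stop; the dollar values are illustrative, not recommendations:

```python
# Per-workflow cost ceilings with an alert threshold before the hard stop.
# Ceiling values are illustrative only.
WORKFLOW_BUDGETS_USD = {
    "simple_lookup": 0.25,
    "complex_query": 2.00,
    "full_analysis": 6.00,
}
WARN_FRACTION = 0.8  # alert when 80% of the budget is spent

def budget_status(workflow: str, spent_usd: float) -> str:
    budget = WORKFLOW_BUDGETS_USD[workflow]
    if spent_usd >= budget:
        return "stop"   # hard limit: kill the execution
    if spent_usd >= WARN_FRACTION * budget:
        return "warn"   # fire an alert, keep going
    return "ok"

print(budget_status("complex_query", 1.70))  # "warn"
```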
Not every task needs the most capable model.
Classification tasks → Haiku/GPT-3.5
Simple generation → Sonnet/GPT-4o mini
Complex reasoning → Opus/GPT-4
Route based on task complexity. Use the smallest model that produces acceptable results.
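A minimal routing sketch. The model tiers follow the mapping above; classify_complexity is a placeholder heuristic you'd replace with your own rules or a cheap classifier:

```python
# Route each task to the cheapest model tier that produces acceptable results.
MODEL_BY_COMPLEXITY = {
    "classification": "claude-3-haiku",      # or gpt-3.5-turbo
    "simple_generation": "claude-3-sonnet",  # or gpt-4o-mini
    "complex_reasoning": "claude-3-opus",    # or gpt-4
}

def classify_complexity(task: str) -> str:
    # Placeholder heuristic; real systems use rules or a cheap classifier model.
    text = task.lower()
    if any(word in text for word in ("analyze", "plan", "compare", "debug")):
        return "complex_reasoning"
    if len(task) < 200:
        return "classification"
    return "simple_generation"

def pick_model(task: str) -> str:
    return MODEL_BY_COMPLEXITY[classify_complexity(task)]

print(pick_model("Is this ticket a bug report or a feature request?"))  # claude-3-haiku
```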
Don't dump everything into context. Be selective.
Less context = lower cost = faster response.
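One simple selection policy: always keep the system prompt, then fill a fixed token budget with the most recent messages. Token counting below is approximated by word count, which is an assumption; in practice use your provider's tokenizer:

```python
# Trim conversation history to a token budget before each model call.
def approx_tokens(text: str) -> int:
    return int(len(text.split()) * 1.3)  # crude approximation; use a real tokenizer in practice

def select_context(system_prompt: str, messages: list[str], budget_tokens: int = 8_000) -> list[str]:
    selected = [system_prompt]
    used = approx_tokens(system_prompt)
    for message in reversed(messages):   # newest first
        cost = approx_tokens(message)
        if used + cost > budget_tokens:
            break
        selected.insert(1, message)      # keep chronological order after the system prompt
        used += cost
    return selected
```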
Same question twice? Same answer. Don't re-run the model.
Cache by: a hash of the normalized prompt, the model, and the generation parameters.
Caching isn't just performance. It's cost control.
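A minimal in-memory sketch of response caching keyed on the normalized prompt, model, and parameters; in production you'd likely want Redis or similar plus an expiry policy, and call_model is hypothetical:

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cache_key(prompt: str, model: str, params: dict) -> str:
    payload = json.dumps(
        {"prompt": prompt.strip().lower(), "model": model, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(call_model, prompt: str, model: str, params: dict) -> str:
    """call_model is a hypothetical (prompt, model, params) -> completion function."""
    key = cache_key(prompt, model, params)
    if key in _cache:
        return _cache[key]  # cache hit: no model call, no cost
    result = call_model(prompt, model, params)
    _cache[key] = result
    return result
```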
Before running an expensive operation, estimate the cost:
```text
Estimated cost: ~$0.75
Your budget remaining: $2.00
Proceed? [Y/n]
```
Let users make informed decisions. Let systems auto-reject requests that would exceed budgets.
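A sketch of that pre-flight check, reusing the illustrative prices from earlier; auto_approve_under is an assumed policy knob for skipping the prompt on cheap requests:

```python
INPUT_PER_1K, OUTPUT_PER_1K = 0.01, 0.03  # illustrative prices, not current rates

def preflight(input_tokens: int, expected_output_tokens: int, expected_iterations: int,
              budget_remaining: float, auto_approve_under: float = 0.10) -> str:
    estimate = expected_iterations * (
        input_tokens / 1000 * INPUT_PER_1K + expected_output_tokens / 1000 * OUTPUT_PER_1K
    )
    if estimate > budget_remaining:
        return f"reject: ~${estimate:.2f} exceeds remaining ${budget_remaining:.2f}"
    if estimate < auto_approve_under:
        return "approve"
    return f"confirm: ~${estimate:.2f} of ${budget_remaining:.2f} remaining"

print(preflight(10_000, 2_000, 5, budget_remaining=2.00))  # asks for confirmation (~$0.80)
```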
You can't optimize what you don't measure.
Every execution should track: model used, input tokens, output tokens, iterations, latency, and total cost.
Aggregate to: cost per user, per workflow, per feature, and per day.
Dashboard this. Alert on anomalies. Review weekly.
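A minimal sketch of the per-execution record to emit to whatever metrics pipeline you already run; the field names are assumptions:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ExecutionRecord:
    workflow: str
    model: str
    input_tokens: int
    output_tokens: int
    iterations: int
    cost_usd: float
    latency_s: float

def log_execution(record: ExecutionRecord) -> None:
    event = {"timestamp": time.time(), **asdict(record)}
    print(json.dumps(event))  # stand-in for your metrics sink

log_execution(ExecutionRecord("complex_query", "gpt-4-turbo", 10_000, 2_000, 5, 0.80, 12.4))
```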
Before launching any AI feature, answer:
What does an average request cost, and what does the P99 request cost?
What hard limits stop a runaway execution, and what does the user see when one is hit?
At what usage level do model costs break your unit economics?
If you can't answer these, you're not ready for production.
From actual production:
| Workflow Type | Avg Cost | P95 Cost | P99 Cost |
|---|---|---|---|
| Simple lookup | $0.03 | $0.08 | $0.15 |
| Complex query | $0.25 | $0.60 | $1.20 |
| Full analysis | $1.50 | $3.00 | $5.00 |
The P99 is what matters for budgeting. Some queries will be expensive. Plan for it.
Two things transform AI economics:
1. Local models: Running Llama or Mistral locally changes the math completely. No per-token cost. Just compute cost.
For many tasks, local models are good enough. The trade-off shifts from "cost per query" to "capability per query."
2. Fine-tuning: A fine-tuned smaller model often beats a general-purpose larger model for specific tasks. At lower cost.
The future is likely hybrid: local models for routine tasks, API models for complex reasoning.
AI costs aren't a problem to solve later. They're a constraint to design around from day one. Build with hard limits, measure everything, route intelligently.
The teams that figure out cost-efficient AI will win. The teams that ignore it will run out of money.