Circuit Breakers, Rate Limits, and Cost Ceilings: Production Safety for AI Systems
AI in production needs more than retry logic. Here's the engineering behind keeping AI systems safe, predictable, and affordable.

I wrote about the cost problem in AI and how to think about budgets. This post is about the engineering. The actual systems that keep AI workloads from taking down your infrastructure or draining your wallet.
Three mechanisms. All essential.
An LLM provider goes down. Your system keeps sending requests. Each request times out after 30 seconds. You have 500 concurrent executions. Now you have 500 threads blocked on a dead service.
Or worse: the provider is degraded, not down. Requests take 20 seconds instead of 2. Your system slows to a crawl but never actually fails.
Retry logic makes it worse. Every failed request retries 3 times. Your 500 requests become 1,500. The provider, already struggling, gets hammered harder.
A circuit breaker sits between your system and external services. It has three states:
```mermaid
stateDiagram-v2
    [*] --> Closed
    Closed --> Open : failure threshold reached
    Open --> HalfOpen : timeout expires
    HalfOpen --> Closed : success
    HalfOpen --> Open : failure

    Closed : requests flow normally
    Open : requests fail immediately
    HalfOpen : one test request allowed

    class Closed success
    class Open reject
    class HalfOpen special
```
In my implementation:
```yaml
Configuration:
  failure_threshold: 5 failures in 60 seconds
  open_timeout: 30 seconds
  half_open_max_requests: 1
```
When a provider fails 5 times in a minute, the circuit opens. All subsequent requests fail immediately — no waiting, no timeout, no wasted resources.
After 30 seconds, it lets one request through. If that succeeds, the circuit closes and traffic resumes. If it fails, the circuit stays open.
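Here's a minimal sketch of that state machine in Python. The thresholds mirror the configuration above; the class and function names are illustrative, not a specific library.

```python
import time

class CircuitBreaker:
    """Closed -> Open after too many recent failures; Half-Open after a cooldown."""

    def __init__(self, failure_threshold=5, window_seconds=60, open_timeout=30):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.open_timeout = open_timeout
        self.failures = []            # timestamps of recent failures
        self.opened_at = None         # set when the circuit opens
        self.half_open_probe = False  # True while the single test request is in flight

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                              # Closed: requests flow normally
        if time.monotonic() - self.opened_at >= self.open_timeout:
            if not self.half_open_probe:
                self.half_open_probe = True          # Half-Open: allow one probe
                return True
        return False                                 # Open: fail fast

    def record_success(self):
        self.failures.clear()
        self.opened_at = None                        # close the circuit
        self.half_open_probe = False

    def record_failure(self):
        now = time.monotonic()
        if self.opened_at is not None:
            self.opened_at = now                     # probe failed: stay open
            self.half_open_probe = False
            return
        self.failures = [t for t in self.failures if now - t < self.window_seconds]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.opened_at = now                     # open the circuit

def call_with_breaker(breaker, fn, *args, **kwargs):
    """Wrap an outbound call: fail fast when the circuit is open."""
    if not breaker.allow_request():
        raise RuntimeError("Service temporarily unavailable (circuit open)")
    try:
        result = fn(*args, **kwargs)
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

A production version also needs locking so concurrent requests don't race on the failure count; the sketch skips that for clarity.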
Without circuit breaker: Provider goes down → requests pile up → execution latency spikes → users see timeouts → system overloaded from retries → cascading failure.
With circuit breaker: Provider goes down → 5 requests fail → circuit opens → all requests fail instantly with clear error → system stays healthy → circuit recovers automatically when provider returns.
The key insight: fast failure is better than slow failure. A user seeing "service temporarily unavailable" in 10ms is better than waiting 30 seconds for a timeout.
Circuit breakers pair well with fallback. When the primary provider's circuit opens:
```mermaid
flowchart TD
    A["Primary (Anthropic)"] -->|circuit open · fail fast| B["Secondary (OpenAI)"]
    B -->|circuit closed| C[Response]

    class A,B default
    class C success
```
One rule: never re-execute tools during fallback. The fallback only applies to the model call. Tool results from before the failure are preserved.
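A sketch of how that fallback might compose with per-provider breakers, assuming a `call_with_breaker` helper like the one above. The provider objects and their `complete()` method are placeholders; only the model call is retried, and the message list, including any tool results gathered before the failure, is passed through unchanged.

```python
def complete_with_fallback(primary, secondary, breakers, messages):
    """Try the primary provider; on an open circuit or a failed call, fall back.

    Only the model call is retried: `messages`, including tool results
    gathered before the failure, is sent to the fallback unchanged.
    """
    for provider in (primary, secondary):
        try:
            return call_with_breaker(breakers[provider.name], provider.complete, messages)
        except Exception:
            continue        # circuit open or call failed: try the next provider
    raise RuntimeError("All providers unavailable")
```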
AI costs are per-token. A single runaway user, a bug in a workflow loop, or a burst of traffic can generate thousands of expensive API calls in seconds.
Traditional rate limiting (requests per second) isn't enough. A single AI request can cost anywhere from $0.001 to $5.00 depending on context size and output length.
I use two layers:
Layer 1: Gateway rate limiting
Coarse protection at the API boundary. Limits requests per second per user.
```yaml
Rate limit:
  authenticated: 100 requests/minute
  per-workspace: 500 requests/minute
  burst: 20 requests/second
```
This catches obvious abuse — someone hammering the API, a bot gone wild, a client retry loop.
Implemented with a token bucket algorithm. Fast, and stateless on the API instances themselves (the counters live in Redis).
Layer 2: Node-level rate limiting
Fine-grained protection at the workflow execution level. Limits tokens and cost per execution step.
```yaml
Per LLM node:
  max_input_tokens: 10000
  max_output_tokens: 4000
  max_cost: $0.50

Per execution:
  max_total_tokens: 50000
  max_total_cost: $2.00
  max_iterations: 25
```
This catches the expensive problems: unbounded context accumulation, runaway loops, oversized prompts.
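A sketch of how the per-node check might look, assuming the limits above are loaded into a small config object; the names and the exception type are illustrative. `max_output_tokens` isn't checked here because it's enforced by passing it to the provider as the generation cap.

```python
from dataclasses import dataclass

@dataclass
class NodeLimits:
    max_input_tokens: int = 10_000
    max_output_tokens: int = 4_000   # passed to the provider as the generation cap
    max_cost: float = 0.50           # dollars

class NodeLimitExceeded(Exception):
    pass

def check_node_limits(limits: NodeLimits, input_tokens: int, estimated_cost: float) -> None:
    """Reject an LLM node before the call if it would blow past its per-node limits."""
    if input_tokens > limits.max_input_tokens:
        raise NodeLimitExceeded(
            f"Prompt is {input_tokens} tokens (limit {limits.max_input_tokens})")
    if estimated_cost > limits.max_cost:
        raise NodeLimitExceeded(
            f"Estimated cost ${estimated_cost:.2f} exceeds the ${limits.max_cost:.2f} cap")
```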
Gateway limiting alone doesn't catch expensive requests — one request can cost $5.
Node limiting alone doesn't catch high-frequency cheap requests — a thousand $0.01 requests add up.
Together, they cover both dimensions: volume and cost.
My gateway rate limiter uses the token bucket algorithm:
```
Bucket capacity: 100 tokens
Refill rate: 100 tokens per minute
Each request costs 1 token
Burst allowed: up to 20 tokens instantly
```
The bucket refills at a steady rate. Requests consume tokens. If the bucket is empty, the request is rejected with a 429 status.
This allows bursts (users tend to be bursty) while enforcing sustained rate limits.
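A minimal in-process sketch of that bucket, using the numbers above; the production version keeps the state in Redis, described next.

```python
import time

class TokenBucket:
    """In-process token bucket: steady refill, bounded burst."""

    def __init__(self, capacity: int = 100, refill_per_second: float = 100 / 60):
        self.capacity = capacity
        self.refill_per_second = refill_per_second   # 100 tokens per minute
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self, cost: int = 1) -> bool:
        now = time.monotonic()
        # Refill at a steady rate, never above capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost      # each request consumes one token
            return True
        return False                 # empty bucket: reject with HTTP 429
```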
Rate limit state lives in Redis. Distributed counters. TTL-based expiration.
```yaml
Key: rate:user:{user_id}:{window}
Value: request count
TTL: window duration (60s)
```
Every API instance checks the same Redis counters. No coordination needed between instances.
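A sketch of that shared counter, assuming the redis-py client and the key format above. This is the fixed-window variant; the TTL handles expiration.

```python
import time
import redis

r = redis.Redis()  # one shared Redis; every API instance talks to the same keys

def gateway_allow(user_id: str, limit: int = 100, window: int = 60) -> bool:
    """Fixed-window counter: increment the shared key, reject once it passes the limit."""
    window_id = int(time.time()) // window
    key = f"rate:user:{user_id}:{window_id}"
    count = r.incr(key)           # atomic across all instances
    if count == 1:
        r.expire(key, window)     # first request in the window sets the TTL
    return count <= limit         # False -> respond with HTTP 429
```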
Rate limits control velocity. Cost ceilings control total spend.
A workflow that makes 10 requests per minute for 10 hours? Well within rate limits. But at $0.50 per request, that's $3,000.
Every workflow execution has a budget:
```yaml
Execution budget:
  allocated: $2.00
  spent: $0.00
  remaining: $2.00
```
Every LLM call updates the budget:
```yaml
Before call:
  remaining: $1.50

Call cost:
  input: 5000 tokens × $10/M = $0.05
  output: 1000 tokens × $30/M = $0.03
  total: $0.08

After call:
  remaining: $1.42
```
When the budget hits zero, execution stops. Gracefully. The user gets a clear message: "Budget exhausted. Execution paused."
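A sketch of that bookkeeping. The per-million-token prices default to the example rates above; the class and exception names are illustrative.

```python
from dataclasses import dataclass

class BudgetExhausted(Exception):
    pass

@dataclass
class ExecutionBudget:
    allocated: float          # e.g. 2.00 dollars for a complex-analysis workflow
    spent: float = 0.0

    @property
    def remaining(self) -> float:
        return self.allocated - self.spent

    def charge(self, input_tokens: int, output_tokens: int,
               input_price_per_m: float = 10.0, output_price_per_m: float = 30.0) -> float:
        """Record the cost of one LLM call; raise once the budget would be exceeded."""
        cost = (input_tokens * input_price_per_m
                + output_tokens * output_price_per_m) / 1_000_000
        if cost > self.remaining:
            raise BudgetExhausted(
                f"Budget exhausted: ${self.spent:.2f} of ${self.allocated:.2f} already spent")
        self.spent += cost
        return cost

# ExecutionBudget(allocated=2.00).charge(5_000, 1_000) returns 0.08, as in the example above.
```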
Different workflows get different budgets:
| Workflow | Budget | Rationale |
|---|---|---|
| Quick lookup | $0.10 | Simple retrieval, few tokens |
| Customer support | $0.50 | Multi-turn, moderate context |
| Complex analysis | $2.00 | Deep reasoning, many steps |
| Full research | $5.00 | Extensive, multi-source |
These aren't arbitrary. They're based on observed P95 costs plus margin.
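As a sketch, deriving a budget from observed costs might look like this; the 1.5× margin is an assumption, not the actual factor.

```python
import math
import statistics

def suggest_budget(observed_costs: list[float], margin: float = 1.5) -> float:
    """Budget = observed P95 cost times a safety margin, rounded up to the cent."""
    p95 = statistics.quantiles(observed_costs, n=20)[-1]   # needs at least two data points
    return math.ceil(p95 * margin * 100) / 100
```

Re-running something like this periodically keeps the budgets tied to what workflows actually cost.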
Cost ceilings don't help if nobody watches them.
My alert system:
```yaml
Alerts:
  - At 70% budget: log warning
  - At 90% budget: alert to dashboard
  - At 100% budget: stop execution, notify user
  - Anomaly: cost > 2× average for workflow type → alert
```
The anomaly detection matters most. If a workflow that usually costs $0.20 suddenly costs $1.50, something changed. Maybe the prompt grew. Maybe the loop count increased. Worth investigating before it becomes a pattern.
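A sketch of those checks, assuming a rolling average cost per workflow type is already tracked somewhere; the `alert` callback is a placeholder for whatever log, dashboard, or notification sink you use.

```python
def check_cost_alerts(spent: float, allocated: float, rolling_avg: float, alert) -> None:
    """Budget-threshold and anomaly alerts for a single execution."""
    ratio = spent / allocated if allocated else 0.0
    if ratio >= 1.0:
        alert("stop", "Budget exhausted; execution paused")   # caller halts and notifies the user
    elif ratio >= 0.9:
        alert("dashboard", f"{ratio:.0%} of budget used")
    elif ratio >= 0.7:
        alert("log", f"{ratio:.0%} of budget used")
    if rolling_avg and spent > 2 * rolling_avg:
        alert("anomaly",
              f"Cost ${spent:.2f} is over 2x the ${rolling_avg:.2f} average for this workflow type")
```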
Visibility is defense.
My cost dashboard breaks spend down by workflow type. When I can see that "the support workflow's average cost jumped 40% this week," I can investigate before it becomes a budget problem.
These three mechanisms layer:
```mermaid
flowchart TD
    A[Request] --> B{Gateway Rate Limit}
    B -->|too frequent| R1[Rejected]
    B -->|pass| C{Circuit Breaker}
    C -->|provider down| R2[Fail Fast]
    C -->|pass| D{Node Rate Limit}
    D -->|too expensive| R3[Rejected]
    D -->|pass| E{Budget Check}
    E -->|exhausted| R4[Budget Exhausted]
    E -->|pass| F[LLM Call]
    F --> G[Budget Update]
    G --> H[Response]

    class B,C,D,E decision
    class R1,R2,R3,R4 reject
    class H success
```
Each layer catches different failure modes:
| Mechanism | Catches |
|---|---|
| Gateway rate limit | API abuse, client bugs, DDoS |
| Circuit breaker | Provider outages, degradation |
| Node rate limit | Expensive individual calls |
| Cost ceiling | Cumulative overspend |
No single mechanism is enough. Together, they make the system predictable.
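Composed, the layers might look like the sketch below, reusing the helpers from the earlier sketches. The circuit-breaker check happens inside `call_with_breaker` rather than as a separate step, and the `response` token fields are placeholders for whatever your provider client returns.

```python
def guarded_llm_call(user_id, provider, breaker, node_limits, budget,
                     messages, input_tokens, estimated_cost):
    """Run every cheap check before the expensive call, in the order of the flowchart."""
    if not gateway_allow(user_id):                                # gateway rate limit (Redis)
        raise RuntimeError("Rate limit exceeded")                 # respond with HTTP 429
    check_node_limits(node_limits, input_tokens, estimated_cost)  # node-level limits
    if estimated_cost > budget.remaining:                         # budget check
        raise BudgetExhausted("Budget exhausted before the call")
    # Circuit breaker: fails fast inside the wrapper if the provider's circuit is open.
    response = call_with_breaker(breaker, provider.complete, messages)
    # Budget update with the actual token counts (fields are placeholders).
    budget.charge(response.input_tokens, response.output_tokens)
    return response
```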
Start with tight limits. Loosen based on data.
```yaml
Default budget: $1.00 (not $100)
Default iterations: 10 (not 1000)
Default timeout: 5 minutes (not unlimited)
```
It's easier to increase a limit than to explain a $10,000 bill.
Every safety mechanism should produce a clear error:
"Execution paused: cost ceiling reached ($2.00 limit).
Completed 8 of 12 steps. Results so far are available."
Not just "Error 500." The user should know what happened and what they can do about it.
Rate limit rejections, circuit breaker trips, budget exhaustions — these are all signals.
High rejection rate? Maybe limits are too tight. Frequent circuit trips? Maybe the provider is unreliable. Constant budget exhaustion? Maybe workflows need optimization.
The safety mechanisms are also your observability layer.
If you've never seen your circuit breaker trip, you don't know if it works.
Chaos testing for AI systems: intentionally trigger failures, inject latency, run expensive queries. Verify the safety mechanisms activate correctly.
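As a sketch, a chaos-style test against the circuit-breaker class from earlier might look like this; the dead provider is simulated with a function that always raises.

```python
def test_circuit_opens_after_repeated_failures():
    breaker = CircuitBreaker(failure_threshold=5, window_seconds=60, open_timeout=30)

    def dead_provider():
        raise ConnectionError("simulated outage")

    # Inject five consecutive failures to trip the breaker.
    for _ in range(5):
        try:
            call_with_breaker(breaker, dead_provider)
        except ConnectionError:
            pass

    # The next request must be rejected immediately, without touching the provider.
    assert not breaker.allow_request()
```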
Production AI systems aren't just about making the AI work. They're about making the AI fail safely.
Circuit breakers prevent cascading failures. Rate limits prevent resource exhaustion. Cost ceilings prevent financial surprises. Together, they make the difference between a system you can trust and a system you hope doesn't break.
Hope isn't a production strategy.