Feb 8, 2026 · 8 min read

Circuit Breakers, Rate Limits, and Cost Ceilings: Production Safety for AI Systems

AI in production needs more than retry logic. Here's the engineering behind keeping AI systems safe, predictable, and affordable.


I wrote about the cost problem in AI and how to think about budgets. This post is about the engineering. The actual systems that keep AI workloads from taking down your infrastructure or draining your wallet.

Three mechanisms. All essential.

Circuit Breakers

The Problem

An LLM provider goes down. Your system keeps sending requests. Each request times out after 30 seconds. You have 500 concurrent executions. Now you have 500 threads blocked on a dead service.

Or worse: the provider is degraded, not down. Requests take 20 seconds instead of 2. Your system slows to a crawl but never actually fails.

Retry logic makes it worse. Every failed request retries 3 times. Your 500 requests become 1,500. The provider, already struggling, gets hammered harder.

The Solution

A circuit breaker sits between your system and external services. It has three states:

stateDiagram-v2
    [*] --> Closed
    Closed --> Open : failure threshold reached
    Open --> HalfOpen : timeout expires
    HalfOpen --> Closed : success
    HalfOpen --> Open : failure

    Closed : requests flow normally
    Open : requests fail immediately
    HalfOpen : one test request allowed

    class Closed success
    class Open reject
    class HalfOpen special

In my implementation:

Configuration:
  failure_threshold: 5 failures in 60 seconds
  open_timeout: 30 seconds
  half_open_max_requests: 1

When a provider fails 5 times in a minute, the circuit opens. All subsequent requests fail immediately — no waiting, no timeout, no wasted resources.

After 30 seconds, it lets one request through. If that succeeds, the circuit closes and traffic resumes. If it fails, the circuit stays open.
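
Here's a minimal sketch of that state machine in Python, using the thresholds above. Treat it as an illustration rather than the production code; the class and exception names are placeholders.

import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected immediately."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, window=60, open_timeout=30):
        self.failure_threshold = failure_threshold
        self.window = window              # seconds over which failures are counted
        self.open_timeout = open_timeout  # seconds before a half-open probe is allowed
        self.failures = []                # timestamps of recent failures
        self.state = "closed"
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.opened_at >= self.open_timeout:
                self.state = "half_open"  # let one test request through
            else:
                raise CircuitOpenError("service temporarily unavailable")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Success: close the circuit and reset the failure window.
        self.state = "closed"
        self.failures.clear()
        return result

    def _record_failure(self):
        now = time.time()
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if self.state == "half_open" or len(self.failures) >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now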

What This Means in Practice

Without circuit breaker: Provider goes down → requests pile up → execution latency spikes → users see timeouts → system overloaded from retries → cascading failure.

With circuit breaker: Provider goes down → 5 requests fail → circuit opens → all requests fail instantly with clear error → system stays healthy → circuit recovers automatically when provider returns.

The key insight: fast failure is better than slow failure. A user seeing "service temporarily unavailable" in 10ms is better than waiting 30 seconds for a timeout.

Provider Fallback

Circuit breakers pair well with fallback. When the primary provider's circuit opens:

flowchart TD
    A["Primary (Anthropic)"] -->|circuit open · fail fast| B["Secondary (OpenAI)"]
    B -->|circuit closed| C[Response]

    class A,B default
    class C success

One rule: never re-execute tools during fallback. The fallback only applies to the model call. Tool results from before the failure are preserved.
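
A sketch of how that fallback could sit on top of the breakers from the section above. The call_anthropic and call_openai helpers are hypothetical placeholders for the real provider clients.

# Hypothetical wiring: one breaker per provider, fall back only on a fast failure.
primary = CircuitBreaker()
secondary = CircuitBreaker()

def complete(prompt, tool_results):
    """Model call with provider fallback. Tool results are passed through
    unchanged; tools are never re-executed on fallback."""
    try:
        return primary.call(call_anthropic, prompt, tool_results)
    except CircuitOpenError:
        # Primary circuit is open: fail over to the secondary provider.
        return secondary.call(call_openai, prompt, tool_results)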

Rate Limiting

The Problem

AI costs are per-token. A single runaway user, a bug in a workflow loop, or a burst of traffic can generate thousands of expensive API calls in seconds.

Traditional rate limiting (requests per second) isn't enough. A single AI request can cost anywhere from $0.001 to $5.00 depending on context size and output length.

Hybrid Rate Limiting

I use two layers:

Layer 1: Gateway rate limiting

Coarse protection at the API boundary. Limits requests per second per user.

Rate limit:
  authenticated: 100 requests/minute
  per-workspace: 500 requests/minute
  burst: 20 requests/second

This catches obvious abuse — someone hammering the API, a bot gone wild, a client retry loop.

Implemented with a token bucket algorithm. Fast. Each API instance stays stateless; the bucket state lives in Redis.

Layer 2: Node-level rate limiting

Fine-grained protection at the workflow execution level. Limits tokens and cost per execution step.

Per LLM node:
  max_input_tokens: 10000
  max_output_tokens: 4000
  max_cost: $0.50

Per execution:
  max_total_tokens: 50000
  max_total_cost: $2.00
  max_iterations: 25

This catches the expensive problems: unbounded context accumulation, runaway loops, oversized prompts.
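
A sketch of how those node-level checks might be enforced before a call is made. The NodeLimits fields mirror the config above; the per-million-token prices are illustrative, not any provider's actual pricing.

from dataclasses import dataclass

# Illustrative prices per million tokens; real pricing varies by model.
INPUT_PRICE_PER_M = 10.0
OUTPUT_PRICE_PER_M = 30.0

@dataclass
class NodeLimits:
    max_input_tokens: int = 10_000
    max_output_tokens: int = 4_000
    max_cost: float = 0.50  # USD per LLM node

class NodeLimitExceeded(Exception):
    pass

def check_node_limits(input_tokens: int, limits: NodeLimits) -> None:
    """Reject an LLM call before it is made if it would exceed the node's limits."""
    if input_tokens > limits.max_input_tokens:
        raise NodeLimitExceeded(
            f"input tokens {input_tokens} exceed limit {limits.max_input_tokens}")
    # Worst case: the full input plus the maximum allowed output.
    worst_case_cost = (input_tokens * INPUT_PRICE_PER_M
                       + limits.max_output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
    if worst_case_cost > limits.max_cost:
        raise NodeLimitExceeded(
            f"estimated cost ${worst_case_cost:.2f} exceeds limit ${limits.max_cost:.2f}")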

Why Both Layers

Gateway limiting alone doesn't catch expensive requests — one request can cost $5.

Node limiting alone doesn't catch high-frequency cheap requests — a thousand $0.01 requests add up.

Together, they cover both dimensions: volume and cost.

The Token Bucket

My gateway rate limiter uses the token bucket algorithm:

Bucket capacity: 100 tokens
Refill rate: 100 tokens per minute
Each request costs 1 token
Burst allowed: up to 20 tokens instantly

The bucket refills at a steady rate. Requests consume tokens. If the bucket is empty, the request is rejected with a 429 status.

This allows bursts (users tend to be bursty) while enforcing sustained rate limits.
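
A minimal in-memory sketch of that bucket (the separate burst cap is omitted, and the production version keeps the state in Redis, as described next):

import time

class TokenBucket:
    def __init__(self, capacity=100, refill_per_minute=100):
        self.capacity = capacity
        self.refill_rate = refill_per_minute / 60.0  # tokens per second
        self.tokens = float(capacity)
        self.last_refill = time.time()

    def allow(self, cost=1):
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller responds with HTTP 429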

Distributed State

Rate limit state lives in Redis. Distributed counters. TTL-based expiration.

Key: rate:user:{user_id}:{window}
Value: request count
TTL: window duration (60s)

Every API instance checks the same Redis counters. No coordination needed between instances.
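
A sketch of that fixed-window counter using redis-py. The key schema follows the layout above; the 100-per-minute limit matches the gateway config.

import time
import redis

r = redis.Redis()  # the same Redis is shared by every API instance

def allow_request(user_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    """Increment this user's counter for the current window; reject once over the limit."""
    window = int(time.time() // window_seconds)
    key = f"rate:user:{user_id}:{window}"
    count = r.incr(key)
    if count == 1:
        # First request in this window: expire the counter when the window rolls over.
        r.expire(key, window_seconds)
    return count <= limit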

Cost Ceilings

The Problem

Rate limits control velocity. Cost ceilings control total spend.

A workflow that makes 10 requests per minute for 10 hours? Well within rate limits. But at $0.50 per request, that's $3,000.

Budget Tracking

Every workflow execution has a budget:

Execution budget:
  allocated: $2.00
  spent: $0.00
  remaining: $2.00

Every LLM call updates the budget:

Before call:
  remaining: $1.50

Call cost:
  input: 5000 tokens × $10/M = $0.05
  output: 1000 tokens × $30/M = $0.03
  total: $0.08

After call:
  remaining: $1.42

When the budget hits zero, execution stops. Gracefully. The user gets a clear message: "Budget exhausted. Execution paused."
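
A sketch of the bookkeeping, assuming flat per-million-token prices like the ones in the example above (real pricing varies by model and provider):

class BudgetExhausted(Exception):
    """Raised when an execution tries to continue past its cost ceiling."""

class ExecutionBudget:
    # Illustrative prices per million tokens, matching the example above.
    INPUT_PRICE_PER_M = 10.0
    OUTPUT_PRICE_PER_M = 30.0

    def __init__(self, allocated: float):
        self.allocated = allocated  # e.g. 2.00 for a $2.00 execution budget
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return self.allocated - self.spent

    def check(self) -> None:
        """Call before each LLM step; stops the execution once the ceiling is hit."""
        if self.remaining <= 0:
            raise BudgetExhausted("Budget exhausted. Execution paused.")

    def charge(self, input_tokens: int, output_tokens: int) -> float:
        """Call after each LLM step to record what it actually cost."""
        cost = (input_tokens * self.INPUT_PRICE_PER_M
                + output_tokens * self.OUTPUT_PRICE_PER_M) / 1_000_000
        self.spent += cost
        return cost

With a $2.00 allocation, charge(5000, 1000) records the $0.08 from the example and leaves $1.92 remaining.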

Per-Workflow Budgets

Different workflows get different budgets:

Workflow            Budget    Rationale
Quick lookup        $0.10     Simple retrieval, few tokens
Customer support    $0.50     Multi-turn, moderate context
Complex analysis    $2.00     Deep reasoning, many steps
Full research       $5.00     Extensive, multi-source

These aren't arbitrary. They're based on observed P95 costs plus margin.

Alerting

Cost ceilings don't help if nobody watches them.

My alert system:

Alerts:
  - At 70% budget: log warning
  - At 90% budget: alert to dashboard
  - At 100% budget: stop execution, notify user
  - Anomaly: cost > 2× average for workflow type → alert

The anomaly detection matters most. If a workflow that usually costs $0.20 suddenly costs $1.50, something changed. Maybe the prompt grew. Maybe the loop count increased. Worth investigating before it becomes a pattern.
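
A sketch of those checks; the alert(channel, message) hook is a placeholder for whatever notification path you use.

def check_cost_alerts(spent: float, allocated: float, avg_cost: float, alert) -> None:
    """Budget-threshold and anomaly alerts, per the tiers above."""
    ratio = spent / allocated if allocated else 0.0
    if ratio >= 1.0:
        alert("user", "Budget exhausted; execution stopped")
    elif ratio >= 0.9:
        alert("dashboard", f"{ratio:.0%} of budget spent")
    elif ratio >= 0.7:
        alert("log", f"{ratio:.0%} of budget spent")
    # Anomaly: this run already costs more than 2x the average for its workflow type.
    if avg_cost and spent > 2 * avg_cost:
        alert("dashboard", f"cost ${spent:.2f} is over 2x the ${avg_cost:.2f} average")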

Cost Dashboard

Visibility is defense.

My cost dashboard breaks down:

  • By provider: Which LLM provider costs the most
  • By workflow type: Which workflows are expensive
  • By time: Cost trends (daily, weekly, monthly)
  • By user: Who's driving costs

When I can see that "the support workflow's average cost jumped 40% this week," I can investigate before it becomes a budget problem.

How They Work Together

These three mechanisms layer:

flowchart TD
    A[Request] --> B{Gateway Rate Limit}
    B -->|too frequent| R1[Rejected]
    B -->|pass| C{Circuit Breaker}
    C -->|provider down| R2[Fail Fast]
    C -->|pass| D{Node Rate Limit}
    D -->|too expensive| R3[Rejected]
    D -->|pass| E{Budget Check}
    E -->|exhausted| R4[Budget Exhausted]
    E -->|pass| F[LLM Call]
    F --> G[Budget Update]
    G --> H[Response]

    class B,C,D,E decision
    class R1,R2,R3,R4 reject
    class H success

Each layer catches different failure modes:

Mechanism             Catches
Gateway rate limit    API abuse, client bugs, DDoS
Circuit breaker       Provider outages, degradation
Node rate limit       Expensive individual calls
Cost ceiling          Cumulative overspend

No single mechanism is enough. Together, they make the system predictable.
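
Stitched together, the request path might look like the sketch below, reusing the pieces from earlier sections. count_tokens, call_model, and the response's token fields are placeholders.

# Illustrative composition only.
class RateLimited(Exception):
    pass

def handle_llm_request(prompt, bucket, breaker, node_limits, budget):
    """Layers in roughly the flowchart's order: gateway limit, node limit,
    budget check, then the circuit-breaker-wrapped model call."""
    if not bucket.allow():                                 # gateway rate limit
        raise RateLimited("429: too many requests")
    check_node_limits(count_tokens(prompt), node_limits)   # node-level token/cost caps
    budget.check()                                         # cost ceiling
    response = breaker.call(call_model, prompt)            # fails fast if the circuit is open
    budget.charge(response.input_tokens, response.output_tokens)
    return response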

Implementation Notes

Be Strict by Default

Start with tight limits. Loosen based on data.

Default budget: $1.00 (not $100)
Default iterations: 10 (not 1000)
Default timeout: 5 minutes (not unlimited)

It's easier to increase a limit than to explain a $10,000 bill.

Fail Gracefully

Every safety mechanism should produce a clear error:

"Execution paused: cost ceiling reached ($2.00 limit).
 Completed 8 of 12 steps. Results so far are available."

Not just "Error 500." The user should know what happened and what they can do about it.

Monitor the Safety Mechanisms

Rate limit rejections, circuit breaker trips, budget exhaustions — these are all signals.

High rejection rate? Maybe limits are too tight. Frequent circuit trips? Maybe the provider is unreliable. Constant budget exhaustion? Maybe workflows need optimization.

The safety mechanisms are also your observability layer.

Test the Limits

If you've never seen your circuit breaker trip, you don't know if it works.

Chaos testing for AI systems: intentionally trigger failures, inject latency, run expensive queries. Verify the safety mechanisms activate correctly.


Production AI systems aren't just about making the AI work. They're about making the AI fail safely.

Circuit breakers prevent cascading failures. Rate limits prevent resource exhaustion. Cost ceilings prevent financial surprises. Together, they make the difference between a system you can trust and a system you hope doesn't break.

Hope isn't a production strategy.
