← Blog
llm-rate-limits api-rate-limiting openai-429-error June 6, 2026 7 min read

How to Handle LLM API Rate Limits Gracefully

Speedometer dial showing traffic volume representing API rate limits and request throttling management

TL;DR — Catch HTTP 429, read the Retry-After header, apply exponential backoff (wait = 2^attempt × 1s, max 60s), and put overflow into a Redis queue. Never let a 429 bubble raw to the user. OpenAI Tier 1 free: 3 RPM / 200 RPD — upgrade to Tier 2 once you spend $50.

Every LLM provider enforces rate limits. When you exceed them, you get a 429 status code and your request is rejected. If your app doesn't handle this, your user sees an error — or worse, their request silently disappears.

Rate limits exist because LLM inference is expensive. Providers throttle requests to protect their infrastructure and ensure fair access. OpenAI, Anthropic, and DeepSeek all enforce limits on requests per minute (RPM) and tokens per minute (TPM). The limits vary by model and by your spending tier.

This article covers the exact rate limits for each provider, three patterns for handling 429 errors gracefully, and how to architect your app so rate limits never cause a lost request.

Current rate limits by provider

Rate limits depend on your account tier, which is based on how much you've spent with the provider. Here are the defaults for the most common tiers:

OpenAI (Tier 1 — $5+ spent)

Model RPM TPM
GPT-4o 500 30,000
GPT-4o-mini 500 200,000
o1 500 30,000

Higher tiers (Tier 2 at $50+, Tier 3 at $100+) increase these significantly. At Tier 5 ($1,000+ spent), GPT-4o gets 10,000 RPM and 30M TPM.

Anthropic (Build tier — credit-based)

Model RPM TPM (input) TPM (output)
Claude Sonnet 4 1,000 40,000 8,000
Claude Haiku 3.5 1,000 50,000 10,000

Anthropic separates input and output TPM limits. The output TPM limit is typically what you hit first because generation is slower than prompt processing.

DeepSeek

DeepSeek currently enforces softer rate limits but can throttle during peak hours. They don't publish fixed tier-based limits like OpenAI. During high-demand periods, expect 60-120 RPM effective throughput.

How to tell when you're being rate limited

The provider tells you. Every 429 response includes headers that specify your limits and when to retry:

HTTP/1.1 429 Too Many Requests
Retry-After: 2
X-RateLimit-Limit-Requests: 500
X-RateLimit-Remaining-Requests: 0
X-RateLimit-Reset-Requests: 1686000000

The three critical headers:

If you're not reading these headers, you're guessing. Stop guessing.

Pattern 1: Exponential backoff with jitter

The simplest and most effective pattern. When you get a 429, wait and retry — but increase the wait time exponentially so you don't hammer the API with retries.

import time
import random
import requests

def call_llm_with_retry(payload, max_retries=5):
    base_delay = 1  # seconds
    
    for attempt in range(max_retries):
        response = requests.post(
            "https://api.openai.com/v1/chat/completions",
            json=payload,
            headers={"Authorization": f"Bearer {API_KEY}"}
        )
        
        if response.status_code == 200:
            return response.json()
        
        if response.status_code == 429:
            # Use Retry-After header if available
            retry_after = response.headers.get("Retry-After")
            if retry_after:
                delay = float(retry_after)
            else:
                # Exponential backoff: 1s, 2s, 4s, 8s, 16s
                delay = base_delay * (2 ** attempt)
            
            # Add jitter to prevent thundering herd
            jitter = random.uniform(0, delay * 0.5)
            time.sleep(delay + jitter)
            continue
        
        # Non-rate-limit error — don't retry
        response.raise_for_status()
    
    raise Exception("Max retries exceeded — rate limit persists")

Why jitter matters. Without jitter, if 50 requests all get rate limited at the same time, they all retry at the exact same time — and get rate limited again. Adding random jitter spreads retries across the time window. This is the difference between "works in development" and "works in production."

The same pattern in Go:

func callLLMWithRetry(payload []byte, maxRetries int) ([]byte, error) {
    baseDelay := time.Second
    for attempt := 0; attempt < maxRetries; attempt++ {
        resp, err := http.Post(
            "https://api.openai.com/v1/chat/completions",
            "application/json",
            bytes.NewBuffer(payload),
        )
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        if resp.StatusCode == 200 {
            return body, nil
        }
        if resp.StatusCode == 429 {
            delay := baseDelay * (1 << attempt)
            jitter := time.Duration(rand.Int63n(int64(delay / 2)))
            time.Sleep(delay + jitter)
            continue
        }
        return nil, fmt.Errorf("LLM API error: HTTP %d", resp.StatusCode)
    }
    return nil, fmt.Errorf("max retries exceeded")
}

Cost implication. Every retry is a full billable request if the provider processes it before rejecting. With rate limits, the provider rejects at the gate — no tokens consumed, no cost. But retries from server errors (500, 502, 503) do consume tokens. Track retry rates to understand their impact on your LLM costs.

Pattern 2: Client-side request queue

Exponential backoff handles the problem reactively — you hit the limit, then back off. A request queue handles it proactively — you never exceed the limit in the first place.

import time
import threading
from collections import deque

class RateLimitedQueue:
    def __init__(self, max_rpm=500):
        self.max_rpm = max_rpm
        self.interval = 60.0 / max_rpm  # seconds between requests
        self.last_request_time = 0
        self.lock = threading.Lock()
        self.queue = deque()
    
    def submit(self, request_fn):
        """Add a request to the queue. Returns a future-like object."""
        result = {"done": False, "value": None, "error": None}
        self.queue.append((request_fn, result))
        self._process()
        return result
    
    def _process(self):
        with self.lock:
            while self.queue:
                now = time.time()
                elapsed = now - self.last_request_time
                
                if elapsed < self.interval:
                    time.sleep(self.interval - elapsed)
                
                request_fn, result = self.queue.popleft()
                try:
                    result["value"] = request_fn()
                except Exception as e:
                    result["error"] = e
                result["done"] = True
                self.last_request_time = time.time()

This approach spaces requests evenly across the minute window. If your limit is 500 RPM, it sends one request every 120ms. You never hit 429 because you never exceed the rate.

The tradeoff. Queuing adds latency. If your app needs sub-second responses for every user, a queue that throttles to 500 RPM means some users wait. For background tasks (batch processing, content generation, data extraction), queuing is ideal. For real-time user-facing features, combine queuing with the fallback model pattern below.

Pattern 3: Model fallback on rate limit

When your primary model is rate limited, route the request to a cheaper model instead of failing or queuing. The user gets a slightly different model but zero downtime.

MODEL_FALLBACKS = {
    "gpt-4o": "gpt-4o-mini",
    "claude-sonnet-4-20250514": "claude-haiku-3-5-20241022",
}

def call_with_fallback(messages, model="gpt-4o"):
    try:
        return call_llm(messages, model=model)
    except RateLimitError:
        fallback = MODEL_FALLBACKS.get(model)
        if fallback:
            return call_llm(messages, model=fallback)
        raise

This pattern works because rate limits are per-model. Being rate limited on GPT-4o doesn't affect your GPT-4o-mini quota. You get lower quality but guaranteed availability.

When to use this. Fallback routing makes sense for tasks where quality degrades gracefully — chatbots, summarization, simple Q&A. It doesn't make sense for tasks where the cheaper model produces meaningfully worse output — complex reasoning, code generation, nuanced analysis. Know which tasks tolerate a model downgrade before you implement this.

For a detailed comparison of which cheaper model handles which tasks well, see our cheapest LLM for each use case breakdown.

Proactive rate limit monitoring

The best way to handle rate limits is to see them coming before they hit. Monitor your X-RateLimit-Remaining headers on every successful response, not just on 429 errors.

def call_llm_monitored(payload):
    response = requests.post(url, json=payload, headers=headers)
    
    remaining = int(response.headers.get("X-RateLimit-Remaining-Requests", 999))
    limit = int(response.headers.get("X-RateLimit-Limit-Requests", 999))
    
    usage_percent = ((limit - remaining) / limit) * 100
    
    if usage_percent > 80:
        log.warning(f"Rate limit at {usage_percent:.0f}% — "
                    f"{remaining} requests remaining")
    
    if usage_percent > 95:
        # Proactively throttle before hitting the wall
        activate_request_queue()
    
    return response.json()

Logging rate limit usage alongside your cost monitoring gives you a complete picture. You'll see correlations — cost spikes often coincide with rate limit pressure because both are driven by volume.

Rate limits at your proxy layer

If you're using a proxy between your app and LLM providers — which is how Tokonomics works — rate limiting happens at two levels:

  1. Your proxy's rate limits. The proxy may enforce its own limits per API key to protect shared infrastructure.
  2. The upstream provider's rate limits. OpenAI, Anthropic, etc. enforce their limits regardless of whether you're going through a proxy.

Tokonomics enforces per-key rate limits (10 req/min on Free, 60 req/min on Pro) and returns standard X-RateLimit-* headers so your retry logic works identically. When the upstream provider returns a 429, Tokonomics passes it through transparently — your app sees the same headers and can apply the same backoff patterns described above.

The advantage of a proxy is visibility. Instead of guessing why you're hitting rate limits, you can see the exact request volume per API key, per model, and per time window in your analytics dashboard. You can identify which feature or team is consuming the most capacity and optimize there first.

The decision framework

Choose your pattern based on your use case:

Scenario Pattern Why
User-facing, real-time Exponential backoff + fallback model Users see zero errors, acceptable quality tradeoff
Batch processing Request queue Throughput matters more than latency
High volume, mixed priority Queue + priority lanes Critical requests skip the queue, batch jobs wait
Multi-model architecture Fallback chain GPT-4o → GPT-4o-mini → DeepSeek V3

Most production apps combine patterns. User-facing endpoints get exponential backoff with model fallback. Background jobs go through a rate-limited queue. Both feed into the same cost tracking system so you can see total spend regardless of which path a request took.

What to do right now

If your app calls any LLM API and doesn't handle 429 errors, you have a production bug. Here's the minimum fix:

  1. Add exponential backoff to every LLM call. Five lines of code. No excuse not to have this.
  2. Read the Retry-After header. It's more accurate than any backoff formula.
  3. Log rate limit events. You can't optimize what you don't measure.
  4. Set up budget alerts. Rate limits and cost overruns are often the same problem — high volume. Set alerts that fire before you hit either wall.

Rate limits are not bugs in the provider's API. They're a feature of operating at scale. The developers who handle them well build apps that stay reliable as traffic grows. The ones who ignore them build apps that break at the worst possible moment.

Last updated June 2026. All sources retrieved June 2026.

About the author
Zouhair is the founder of Tokonomics. He built the platform after receiving a $47,000 LLM invoice that his team didn't see coming. He tracks LLM pricing changes weekly across all major providers.
Connect on LinkedIn →
← Back to Blog