What causes HTTP 429 errors in LLM APIs?

HTTP 429 means you exceeded the provider's rate limit — too many requests per minute (RPM) or too many tokens per minute (TPM). OpenAI Tier 1 allows 500 RPM on GPT-4o-mini. Anthropic Claude allows 50 RPM on Tier 1. Upgrade your tier by adding a payment method and reaching the spend threshold.

What is the correct way to implement exponential backoff for LLM APIs?

On a 429: read the Retry-After header first. If absent, wait 2^attempt seconds (1s, 2s, 4s, 8s...) up to a 60s maximum. Add ±20% jitter to avoid thundering-herd. Cap at 5 retries. Always log every retry attempt for cost tracking.

How can I increase my OpenAI API rate limits?

OpenAI rate limits scale automatically with your usage tier, which is based on cumulative spending. Tier 2 requires $50 spend. Tier 3 requires $100. Tier 4 requires $250. Tier 5 (highest) requires $1,000. Add a payment method and use the API — limits increase without any manual request.

How to Handle LLM API Rate Limits Gracefully

TL;DR: Catch HTTP 429, read the Retry-After header, apply exponential backoff (wait = 2^attempt × 1s, max 60s), and put overflow into a queue (in-process or Redis-backed). Never let a raw 429 bubble up to the user. OpenAI's free tier allows only 3 RPM / 200 RPD; Tier 1 starts at $5 spent and Tier 2 at $50.

Key Takeaways

HTTP 429 = rate limit exceeded — every LLM provider (OpenAI, Anthropic, DeepSeek) enforces RPM and TPM limits

OpenAI's free tier allows only 3 RPM / 200 RPD; Tier 1 ($5+ spent) raises GPT-4o to 500 RPM, and Tier 2 unlocks at $50

3 patterns: exponential backoff (2^attempt × 1s), client-side request queuing, and model fallback

Never let a raw 429 error reach your user — queue the request and retry automatically

Every LLM provider enforces rate limits. When you exceed them, you get a 429 status code and your request is rejected. If your app doesn't handle this, your user sees an error — or worse, their request silently disappears.

Rate limits exist because LLM inference is expensive. Providers throttle requests to protect their infrastructure and ensure fair access. OpenAI, Anthropic, and DeepSeek all enforce limits on requests per minute (RPM) and tokens per minute (TPM). The limits vary by model and by your spending tier.

This article covers the exact rate limits for each provider, three patterns for handling 429 errors gracefully, and how to architect your app so rate limits never cause a lost request.

What are the current rate limits by provider?

Rate limits depend on your account tier, which is based on how much you've spent with the provider. Here are the defaults for the most common tiers:

OpenAI (Tier 1 — $5+ spent)

Model	RPM	TPM
GPT-4o	500	30,000
GPT-4o-mini	500	200,000
o1	500	30,000

Higher tiers (Tier 2 at $50+, Tier 3 at $100+) increase these significantly. At Tier 5 ($1,000+ spent), GPT-4o gets 10,000 RPM and 30M TPM.

Anthropic (Build tier — credit-based)

Model	RPM	TPM (input)	TPM (output)
Claude Sonnet 4	1,000	40,000	8,000
Claude Haiku 3.5	1,000	50,000	10,000

Anthropic separates input and output TPM limits. The output limit is several times lower than the input limit, so workloads that generate long responses tend to hit it first.
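To see which of the three limits binds for a given workload, divide each per-minute budget by what one request costs. A minimal sketch using the Claude Sonnet 4 row from the table; the function name and the sample request shape (1,000 input tokens, 500 output tokens) are assumptions for illustration:

```python
def max_requests_per_minute(rpm, input_tpm, output_tpm,
                            input_tokens, output_tokens):
    """Return (sustainable requests per minute, name of the binding limit)."""
    candidates = {
        "rpm": rpm,
        "input_tpm": input_tpm / input_tokens,
        "output_tpm": output_tpm / output_tokens,
    }
    binding = min(candidates, key=candidates.get)
    return candidates[binding], binding


# Claude Sonnet 4 limits from the table, with an assumed request shape
# of 1,000 input tokens and 500 output tokens.
rate, limit = max_requests_per_minute(1000, 40_000, 8_000, 1000, 500)
print(rate, limit)  # 16.0 output_tpm
```

With this request shape the 1,000 RPM figure is irrelevant: the output budget caps you at 16 requests per minute.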

DeepSeek

DeepSeek currently enforces softer rate limits but can throttle during peak hours. They don't publish fixed tier-based limits like OpenAI. During high-demand periods, expect 60-120 RPM effective throughput.

How do you tell when you're being rate limited?

The provider tells you. A 429 response usually carries headers that describe your limits and when to retry. Header names and value formats vary by provider; the example below uses OpenAI-style names:

HTTP/1.1 429 Too Many Requests
Retry-After: 2
X-RateLimit-Limit-Requests: 500
X-RateLimit-Remaining-Requests: 0
X-RateLimit-Reset-Requests: 1s

The three critical headers:

Retry-After: seconds to wait before retrying (HTTP also allows a date here). Always respect this.
X-RateLimit-Remaining-Requests: how many requests you have left in the current window.
X-RateLimit-Reset-Requests: when your limit resets. OpenAI sends a duration such as 1s or 6m0s; other providers send a timestamp, so check the format before parsing.

If you're not reading these headers, you're guessing. Stop guessing.
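Reading these headers safely means handling more than one value format. The sketch below is one way to do it, not a provider-sanctioned parser: `parse_retry_after` accepts either a number of seconds or an HTTP-date, and `parse_reset_duration` assumes a duration string of the form `6m0s` or `250ms`. Both function names are hypothetical.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_retry_after(value, now=None):
    """Convert a Retry-After value to seconds; None if it cannot be parsed."""
    if value is None:
        return None
    try:
        return max(0.0, float(value.strip()))  # delay-seconds form, e.g. "2"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def parse_reset_duration(value):
    """Convert a duration such as "6m0s" to seconds; None if no match."""
    parts = _DURATION_PART.findall(value or "")
    if not parts:
        return None
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)


print(parse_retry_after("2"))        # 2.0
print(parse_reset_duration("6m0s"))  # 360.0
```

Returning None for anything unrecognized lets the caller fall back to its own backoff delay instead of crashing on an unexpected format.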

How does exponential backoff with jitter work?

The simplest and most effective pattern. When you get a 429, wait and retry — but increase the wait time exponentially so you don't hammer the API with retries.

import os
import random
import time

import requests

API_KEY = os.environ["OPENAI_API_KEY"]
MAX_DELAY = 60  # seconds; upper bound for any single wait

def call_llm_with_retry(payload, max_retries=5):
    base_delay = 1  # seconds

    for attempt in range(max_retries):
        response = requests.post(
            "https://api.openai.com/v1/chat/completions",
            json=payload,
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=60,
        )

        if response.status_code == 429:
            # Exponential backoff: 1s, 2s, 4s, 8s, 16s
            delay = base_delay * (2 ** attempt)

            # Prefer the server's Retry-After value when it is a number
            retry_after = response.headers.get("Retry-After")
            try:
                delay = float(retry_after)
            except (TypeError, ValueError):
                pass  # header missing or an HTTP-date; keep the backoff

            delay = min(delay, MAX_DELAY)

            # No point sleeping after the final attempt
            if attempt < max_retries - 1:
                # Add jitter to prevent thundering herd
                jitter = random.uniform(0, delay * 0.5)
                time.sleep(delay + jitter)
            continue

        # Non-rate-limit errors are raised immediately, not retried
        response.raise_for_status()
        return response.json()

    raise RuntimeError("Max retries exceeded: rate limit persists")

Why jitter matters. Without jitter, if 50 requests all get rate limited at the same time, they all retry at the exact same time — and get rate limited again. Adding random jitter spreads retries across the time window. This is the difference between "works in development" and "works in production."
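The effect is easy to demonstrate. The sketch below computes the third retry delay for 50 simulated clients, once without jitter and once with it. The helper name `backoff_delay` and its parameters are illustrative, and the seeded generator is there only to make the demo repeatable:

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, jitter=0.5, rng=random):
    """Exponential delay for a 0-based attempt, capped, plus random jitter.

    The cap applies to the exponential part; jitter of up to
    `jitter * delay` is added on top. jitter=0 disables it.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + rng.uniform(0, delay * jitter)


rng = random.Random(42)  # seeded only so the demo is repeatable
without = {backoff_delay(2, jitter=0, rng=rng) for _ in range(50)}
with_jitter = {backoff_delay(2, rng=rng) for _ in range(50)}

print(len(without))      # 1: all 50 clients retry at exactly 4.0s
print(len(with_jitter))  # many distinct retry times between 4.0s and 6.0s
```

Without jitter the set collapses to a single value, which is the thundering herd in miniature.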

The same pattern in Go:

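import (
    "bytes"
    "fmt"
    "io"
    "math/rand"
    "net/http"
    "os"
    "strconv"
    "time"
)

func callLLMWithRetry(payload []byte, maxRetries int) ([]byte, error) {
    baseDelay := time.Second
    for attempt := 0; attempt < maxRetries; attempt++ {
        // http.Post cannot set headers, so build the request explicitly.
        req, err := http.NewRequest(http.MethodPost,
            "https://api.openai.com/v1/chat/completions",
            bytes.NewReader(payload))
        if err != nil {
            return nil, err
        }
        req.Header.Set("Content-Type", "application/json")
        req.Header.Set("Authorization", "Bearer "+os.Getenv("OPENAI_API_KEY"))

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        body, readErr := io.ReadAll(resp.Body)
        resp.Body.Close() // close per attempt; a defer here would pile up until return
        if readErr != nil {
            return nil, readErr
        }
        if resp.StatusCode == http.StatusOK {
            return body, nil
        }
        if resp.StatusCode == http.StatusTooManyRequests {
            delay := baseDelay * (1 << attempt) // 1s, 2s, 4s, ...
            // Prefer Retry-After when it is a plain number of seconds.
            if s, err := strconv.ParseFloat(resp.Header.Get("Retry-After"), 64); err == nil && s >= 0 {
                delay = time.Duration(s * float64(time.Second))
            }
            jitter := time.Duration(rand.Int63n(int64(delay/2) + 1))
            time.Sleep(delay + jitter)
            continue
        }
        return nil, fmt.Errorf("LLM API error: HTTP %d", resp.StatusCode)
    }
    return nil, fmt.Errorf("max retries exceeded")
}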

Cost implication. A retry only costs money if the provider did work on the failed attempt. A 429 is rejected at the gate, before any inference runs: no tokens consumed, no cost. Failures after the request was accepted are different. If you also retry server errors (500, 502, 503) or timeouts, the provider may already have processed part of the request, so those attempts can consume tokens. Track retry rates to understand their impact on your LLM costs.
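Tracking retry rates does not need much machinery. The sketch below is one possible shape for it, assuming you call `record_request` once per logical call and `record_attempt` once per HTTP attempt; the class and method names are hypothetical:

```python
from collections import Counter


class RetryStats:
    """Counts attempts per HTTP status so retry overhead is visible."""

    def __init__(self):
        self.requests = 0          # logical calls made by the app
        self.attempts = Counter()  # HTTP attempts, keyed by status code

    def record_request(self):
        self.requests += 1

    def record_attempt(self, status_code):
        self.attempts[status_code] += 1

    def retries_per_request(self):
        """Average number of extra attempts per logical request."""
        if self.requests == 0:
            return 0.0
        return (sum(self.attempts.values()) - self.requests) / self.requests

    def server_error_attempts(self):
        """Attempts that failed with a 5xx and may have consumed tokens."""
        return sum(n for code, n in self.attempts.items() if 500 <= code < 600)


stats = RetryStats()
stats.record_request()
for status in (429, 503, 200):  # one call that needed two retries
    stats.record_attempt(status)

print(stats.retries_per_request())    # 2.0
print(stats.server_error_attempts())  # 1
```

Separating 429s from 5xx failures matters because only the latter are candidates for hidden cost.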

How do you build a client-side request queue?

Exponential backoff handles the problem reactively — you hit the limit, then back off. A request queue handles it proactively — you never exceed the limit in the first place.

import time
import threading
from concurrent.futures import ThreadPoolExecutor

class RateLimitedQueue:
    def __init__(self, max_rpm=500, max_workers=8):
        self.max_rpm = max_rpm
        self.interval = 60.0 / max_rpm  # seconds between request starts
        self.next_slot = time.monotonic()
        self.lock = threading.Lock()
        # The executor's internal queue holds requests that are waiting.
        self.pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit(self, request_fn):
        """Queue a request without blocking. Returns a concurrent.futures.Future."""
        return self.pool.submit(self._run, request_fn)

    def _reserve_slot(self):
        # Hold the lock only long enough to claim the next start time,
        # so slow requests never block other workers from scheduling.
        with self.lock:
            now = time.monotonic()
            slot = max(now, self.next_slot)
            self.next_slot = slot + self.interval
        return slot - now

    def _run(self, request_fn):
        wait = self._reserve_slot()
        if wait > 0:
            time.sleep(wait)
        # Runs outside the lock; exceptions surface through the Future.
        return request_fn()

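This approach spaces request starts evenly across the minute. At a limit of 500 RPM, that is one request every 120ms (60s / 500). As long as no other process shares the same API key, you stay under the RPM limit, although a run of large requests can still trip the TPM limit, so keep backoff in place as a safety net.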

The tradeoff. Queuing adds latency. If your app needs sub-second responses for every user, a queue that throttles to 500 RPM means some users wait. For background tasks (batch processing, content generation, data extraction), queuing is ideal. For real-time user-facing features, combine queuing with the fallback model pattern below.
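The added latency is simple to estimate: a request's wait is its position in the queue times the spacing between requests. A minimal sketch, with hypothetical helper names and an assumed 2-second latency budget, that also shows where a real-time endpoint might decide to switch to a fallback instead of waiting:

```python
def queue_wait_seconds(position, max_rpm):
    """Seconds until the request at a 0-based queue position is sent."""
    return position * 60.0 / max_rpm


def should_fall_back(position, max_rpm, latency_budget_s):
    """True when waiting in the queue would exceed the latency budget."""
    return queue_wait_seconds(position, max_rpm) > latency_budget_s


# A burst of 1,000 requests against a 500 RPM limit:
print(queue_wait_seconds(0, 500))    # 0.0
print(queue_wait_seconds(999, 500))  # 119.88
print(should_fall_back(999, 500, latency_budget_s=2.0))  # True
```

The last request in the burst waits almost two minutes, which is fine for a batch job and unacceptable for a chat response.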

How does model fallback on rate limit work?

When your primary model is rate limited, route the request to a cheaper model instead of failing or queuing. The user gets a slightly different model but zero downtime.

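class RateLimitError(Exception):
    """Raised by call_llm when the provider returns HTTP 429."""

MODEL_FALLBACKS = {
    "gpt-4o": "gpt-4o-mini",
    "claude-sonnet-4-20250514": "claude-3-5-haiku-20241022",
}

def call_with_fallback(messages, model="gpt-4o"):
    # call_llm is your own provider wrapper; it must raise RateLimitError on a 429
    try:
        return call_llm(messages, model=model)
    except RateLimitError:
        fallback = MODEL_FALLBACKS.get(model)
        if fallback:
            return call_llm(messages, model=fallback)
        raise  # no fallback configured: let the caller handle the 429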

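This pattern works because rate limits are per-model. Being rate limited on GPT-4o doesn't affect your GPT-4o-mini quota. You trade some quality for much better availability. The fallback model has its own limits and can be throttled too, so keep backoff or queuing behind it as the last resort.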

When to use this. Fallback routing makes sense for tasks where quality degrades gracefully — chatbots, summarization, simple Q&A. It doesn't make sense for tasks where the cheaper model produces meaningfully worse output — complex reasoning, code generation, nuanced analysis. Know which tasks tolerate a model downgrade before you implement this.
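One way to encode that decision is to let only downgrade-tolerant task types walk the fallback chain. The sketch below is illustrative: the task labels in `DOWNGRADE_OK`, the function `call_with_chain`, and the caller-supplied `call_fn` (assumed to raise `RateLimitError` on a 429) are all hypothetical names, not part of any provider SDK.

```python
class RateLimitError(Exception):
    """Raised by the caller-supplied function on HTTP 429 (hypothetical)."""


# Assumption: which task types tolerate a cheaper model is app-specific.
DOWNGRADE_OK = {"chat", "summarization", "simple_qa"}


def call_with_chain(call_fn, messages, chain, task):
    """Try each model in `chain` in order; only downgrade tolerant tasks."""
    if not chain:
        raise ValueError("chain must contain at least one model")
    models = chain if task in DOWNGRADE_OK else chain[:1]
    last_error = None
    for model in models:
        try:
            return call_fn(messages, model=model)
        except RateLimitError as err:
            last_error = err  # remember it and try the next model
    raise last_error


def fake_call(messages, model):
    """Stand-in for a real client: the primary model is always throttled."""
    if model == "gpt-4o":
        raise RateLimitError("429 on gpt-4o")
    return f"answer from {model}"


chain = ["gpt-4o", "gpt-4o-mini"]
print(call_with_chain(fake_call, [], chain, task="chat"))  # answer from gpt-4o-mini
```

A task outside the allowlist, such as code generation, never leaves the primary model: the rate limit error propagates so the caller can queue or back off instead.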

For a detailed comparison of which cheaper model handles which tasks well, see our cheapest LLM for each use case breakdown.

How do you monitor rate limits proactively?

The best way to handle rate limits is to see them coming before they hit. Monitor your X-RateLimit-Remaining headers on every successful response, not just on 429 errors.

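import logging
import os

import requests

log = logging.getLogger(__name__)
url = "https://api.openai.com/v1/chat/completions"
headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

def call_llm_monitored(payload):
    response = requests.post(url, json=payload, headers=headers, timeout=60)

    remaining = response.headers.get("X-RateLimit-Remaining-Requests")
    limit = response.headers.get("X-RateLimit-Limit-Requests")

    # Only compute usage when the provider actually sent both headers,
    # rather than assuming a made-up default limit.
    if remaining is not None and limit is not None and int(limit) > 0:
        usage_percent = (int(limit) - int(remaining)) / int(limit) * 100

        if usage_percent > 80:
            log.warning(f"Rate limit at {usage_percent:.0f}%, "
                        f"{remaining} requests remaining")

        if usage_percent > 95:
            # Proactively throttle before hitting the wall
            activate_request_queue()  # app-specific hook, defined elsewhere

    response.raise_for_status()
    return response.json()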

Logging rate limit usage alongside your cost monitoring gives you a complete picture. You'll see correlations — cost spikes often coincide with rate limit pressure because both are driven by volume.

Why set rate limits at your proxy layer?

If you're using a proxy between your app and LLM providers — which is how Tokonomics works — rate limiting happens at two levels:

Your proxy's rate limits. The proxy may enforce its own limits per API key to protect shared infrastructure.
The upstream provider's rate limits. OpenAI, Anthropic, etc. enforce their limits regardless of whether you're going through a proxy.

Tokonomics enforces per-key rate limits (10 req/min on Free, 60 req/min on Pro) and returns standard X-RateLimit-* headers so your retry logic works identically. When the upstream provider returns a 429, Tokonomics passes it through transparently — your app sees the same headers and can apply the same backoff patterns described above.

The advantage of a proxy is visibility. Instead of guessing why you're hitting rate limits, you can see the exact request volume per API key, per model, and per time window in your analytics dashboard. You can identify which feature or team is consuming the most capacity and optimize there first.

Which rate limit strategy should you choose?

Choose your pattern based on your use case:

Scenario	Pattern	Why
User-facing, real-time	Exponential backoff + fallback model	Users see zero errors, acceptable quality tradeoff
Batch processing	Request queue	Throughput matters more than latency
High volume, mixed priority	Queue + priority lanes	Critical requests skip the queue, batch jobs wait
Multi-model architecture	Fallback chain	GPT-4o → GPT-4o-mini → DeepSeek V3

Most production apps combine patterns. User-facing endpoints get exponential backoff with model fallback. Background jobs go through a rate-limited queue. Both feed into the same cost tracking system so you can see total spend regardless of which path a request took.
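The priority-lane row in the table can be sketched with a heap: critical requests sort ahead of batch jobs, and a counter keeps first-in, first-out order within each lane. The class name `PriorityLanes` and the two lane constants are assumptions for illustration. This version is not thread-safe, so guard it with a lock or use `queue.PriorityQueue` when several threads share it.

```python
import heapq
import itertools

CRITICAL, BATCH = 0, 1  # lower number is served first


class PriorityLanes:
    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO within a lane

    def put(self, request, priority=BATCH):
        heapq.heappush(self._heap, (priority, next(self._order), request))

    def get(self):
        """Pop the next request: critical first, then batch, FIFO in each."""
        _priority, _seq, request = heapq.heappop(self._heap)
        return request

    def __len__(self):
        return len(self._heap)


lanes = PriorityLanes()
lanes.put("batch-1")
lanes.put("batch-2")
lanes.put("user-1", priority=CRITICAL)

served = [lanes.get() for _ in range(len(lanes))]
print(served)  # ['user-1', 'batch-1', 'batch-2']
```

Feeding `get()` into a rate-limited sender gives user-facing traffic the next available slot while batch work absorbs the wait.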

What should you do right now?

If your app calls any LLM API and doesn't handle 429 errors, you have a production bug. Here's the minimum fix:

Add exponential backoff to every LLM call. Five lines of code. No excuse not to have this.
Read the Retry-After header. It's more accurate than any backoff formula.
Log rate limit events. You can't optimize what you don't measure.
Set up budget alerts. Rate limits and cost overruns are often the same problem — high volume. Set alerts that fire before you hit either wall.

Rate limits are not bugs in the provider's API. They're a feature of operating at scale. The developers who handle them well build apps that stay reliable as traffic grows. The ones who ignore them build apps that break at the worst possible moment.

Frequently Asked Questions

What's the difference between RPM and TPM rate limits?

RPM (requests per minute) caps how many API calls you can make. TPM (tokens per minute) caps total token throughput. You'll hit whichever limit comes first. OpenAI's Tier 1 allows 500 RPM but only 30,000 TPM on GPT-4o. That works out to an average of just 60 tokens per request (30,000 / 500) before TPM becomes the binding limit, so any realistic prompt hits TPM well before RPM.

How long should I wait before retrying a 429 error?

Check the Retry-After header first; it gives the exact wait time in seconds. If it is absent, use exponential backoff starting at 1 second and doubling each attempt (1s, 2s, 4s, 8s), with 20% jitter. Cap retries at 5 attempts and the maximum wait at 60 seconds.

Can I increase my OpenAI rate limits without paying more?

Yes, in the sense that there is no separate fee or application for a higher tier. OpenAI automatically upgrades your tier based on cumulative spend: Tier 2 unlocks at $50 total spend, Tier 3 at $100, Tier 4 at $250, and Tier 5 at $1,000 (OpenAI, 2024). No manual request needed.

Do rate limits apply differently to streaming requests?

Streaming doesn't change your rate limits. Each streaming request still counts as one RPM, and all tokens (input plus output) count toward your TPM cap. The difference is how the response arrives, not how the request is metered. Budget alerts help you catch high-volume patterns before you hit limits.

Last updated June 2026. All sources retrieved June 2026.