TL;DR: Catch HTTP 429, read the Retry-After header, apply exponential backoff (wait = 2^attempt × 1s, max 60s), and put overflow into a Redis queue. Never let a raw 429 bubble up to the user. OpenAI's free tier allows 3 RPM / 200 RPD; Tier 1 starts at $5 spent and Tier 2 at $50.
Every LLM provider enforces rate limits. When you exceed them, you get a 429 status code and your request is rejected. If your app doesn't handle this, your user sees an error — or worse, their request silently disappears.
Rate limits exist because LLM inference is expensive. Providers throttle requests to protect their infrastructure and ensure fair access. OpenAI, Anthropic, and DeepSeek all enforce limits on requests per minute (RPM) and tokens per minute (TPM). The limits vary by model and by your spending tier.
This article covers the exact rate limits for each provider, three patterns for handling 429 errors gracefully, and how to architect your app so rate limits never cause a lost request.
Current rate limits by provider
Rate limits depend on your account tier, which is based on how much you've spent with the provider. Here are the defaults for the most common tiers:
OpenAI (Tier 1 — $5+ spent)
| Model | RPM | TPM |
|---|---|---|
| GPT-4o | 500 | 30,000 |
| GPT-4o-mini | 500 | 200,000 |
| o1 | 500 | 30,000 |
Higher tiers (Tier 2 at $50+, Tier 3 at $100+) increase these significantly. At Tier 5 ($1,000+ spent), GPT-4o gets 10,000 RPM and 30M TPM.
Anthropic (Build tier — credit-based)
| Model | RPM | TPM (input) | TPM (output) |
|---|---|---|---|
| Claude Sonnet 4 | 1,000 | 40,000 | 8,000 |
| Claude Haiku 3.5 | 1,000 | 50,000 | 10,000 |
Anthropic separates input and output TPM limits. The output limit in the table above is five times lower than the input limit, so it is the one you hit first whenever your responses are longer than about a fifth of your prompts.
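Which limit binds first depends on the shape of your traffic. A rough way to estimate it is sketched below, assuming you know your average input and output tokens per request. The function name and the example workload are illustrative assumptions, not provider figures.

```python
def max_sustainable_rpm(rpm_limit, input_tpm, output_tpm,
                        avg_input_tokens, avg_output_tokens):
    """Return (requests per minute you can sustain, name of the binding limit)."""
    candidates = {
        "rpm": float(rpm_limit),
        "input_tpm": input_tpm / avg_input_tokens,
        "output_tpm": output_tpm / avg_output_tokens,
    }
    # The smallest ceiling is the one you will hit first
    binding = min(candidates, key=candidates.get)
    return candidates[binding], binding


# Claude Sonnet 4 limits from the table above, with a hypothetical workload
# of 1,500 input tokens and 400 output tokens per request.
rate, binding = max_sustainable_rpm(1000, 40_000, 8_000, 1500, 400)
print(f"{rate:.1f} requests/min, bound by {binding}")
# prints "20.0 requests/min, bound by output_tpm"
```

For this workload the 1,000 RPM figure is irrelevant: the output token budget caps you at 20 requests per minute.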
DeepSeek
DeepSeek currently enforces softer rate limits but can throttle during peak hours. They don't publish fixed tier-based limits like OpenAI. During high-demand periods, expect 60-120 RPM effective throughput.
How to tell when you're being rate limited
The provider tells you. Every 429 response includes headers that specify your limits and when to retry:
HTTP/1.1 429 Too Many Requests
Retry-After: 2
X-RateLimit-Limit-Requests: 500
X-RateLimit-Remaining-Requests: 0
X-RateLimit-Reset-Requests: 1686000000
The three critical headers:
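- Retry-After: seconds to wait before retrying. Always respect this.
- X-RateLimit-Remaining-Requests: how many requests you have left in the current window.
- X-RateLimit-Reset-Requests: when your limit resets, shown above as a Unix timestamp. The format varies by provider (OpenAI sends a duration such as 1s), so check it before parsing.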
If you're not reading these headers, you're guessing. Stop guessing.
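Reading the headers can be done in one small helper. The sketch below is one possible shape, not a provider SDK: `RateLimitInfo`, `parse_retry_after` and `parse_rate_limit_headers` are hypothetical names, and the header names are the ones from the example response above. It assumes a case-insensitive mapping such as `requests`' `response.headers`; with a plain dict the keys must match exactly.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


@dataclass
class RateLimitInfo:
    retry_after_seconds: Optional[float]
    limit: Optional[int]
    remaining: Optional[int]


def parse_retry_after(value, now=None):
    """Retry-After is either a number of seconds or an HTTP-date."""
    if value is None:
        return None
    try:
        return max(0.0, float(value))
    except ValueError:
        pass  # not a number, try the HTTP-date form
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable: let the caller fall back to backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def parse_rate_limit_headers(headers: Mapping[str, str]) -> RateLimitInfo:
    def to_int(name):
        raw = headers.get(name)
        return int(raw) if raw is not None and raw.isdigit() else None

    return RateLimitInfo(
        retry_after_seconds=parse_retry_after(headers.get("Retry-After")),
        limit=to_int("X-RateLimit-Limit-Requests"),
        remaining=to_int("X-RateLimit-Remaining-Requests"),
    )
```

Returning `None` for anything missing or malformed keeps the decision in the caller: a missing `Retry-After` means "use your own backoff", not "retry immediately".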
Pattern 1: Exponential backoff with jitter
The simplest and most effective pattern. When you get a 429, wait and retry — but increase the wait time exponentially so you don't hammer the API with retries.
import os
import random
import time

import requests

API_KEY = os.environ["OPENAI_API_KEY"]  # assumes the key is set in the environment

def call_llm_with_retry(payload, max_retries=5):
    base_delay = 1   # seconds
    max_delay = 60   # cap on the computed backoff
    for attempt in range(max_retries):
        response = requests.post(
            "https://api.openai.com/v1/chat/completions",
            json=payload,
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=60,
        )
        if response.status_code == 200:
            return response.json()
        if response.status_code == 429:
            # Use Retry-After when it is a number of seconds
            retry_after = response.headers.get("Retry-After")
            try:
                delay = float(retry_after)
            except (TypeError, ValueError):
                # Missing or HTTP-date value: exponential backoff 1s, 2s, 4s, 8s, 16s
                delay = min(max_delay, base_delay * (2 ** attempt))
            # Add jitter to prevent thundering herd
            jitter = random.uniform(0, delay * 0.5)
            time.sleep(delay + jitter)
            continue
        # Non-rate-limit error: don't retry
        response.raise_for_status()
        raise RuntimeError(f"Unexpected HTTP status {response.status_code}")
    raise Exception("Max retries exceeded: rate limit persists")
Why jitter matters. Without jitter, if 50 requests all get rate limited at the same time, they all retry at the exact same time — and get rate limited again. Adding random jitter spreads retries across the time window. This is the difference between "works in development" and "works in production."
The same pattern in Go:
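import (
	"bytes"
	"fmt"
	"io"
	"math/rand"
	"net/http"
	"os"
	"time"
)

func callLLMWithRetry(payload []byte, maxRetries int) ([]byte, error) {
	baseDelay := time.Second
	for attempt := 0; attempt < maxRetries; attempt++ {
		req, err := http.NewRequest(http.MethodPost,
			"https://api.openai.com/v1/chat/completions",
			bytes.NewReader(payload))
		if err != nil {
			return nil, err
		}
		req.Header.Set("Content-Type", "application/json")
		// Assumes the key is set in the environment
		req.Header.Set("Authorization", "Bearer "+os.Getenv("OPENAI_API_KEY"))

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return nil, err
		}
		// Close inside the loop: a defer here would keep every body open until return
		body, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()
		if readErr != nil {
			return nil, readErr
		}
		if resp.StatusCode == http.StatusOK {
			return body, nil
		}
		if resp.StatusCode == http.StatusTooManyRequests {
			// Exponential backoff (1s, 2s, 4s, ...) plus up to 50% jitter
			delay := baseDelay * (1 << attempt)
			jitter := time.Duration(rand.Int63n(int64(delay / 2)))
			time.Sleep(delay + jitter)
			continue
		}
		return nil, fmt.Errorf("LLM API error: HTTP %d", resp.StatusCode)
	}
	return nil, fmt.Errorf("max retries exceeded")
}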
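To see the effect of jitter in isolation, the delay calculation can be pulled out into its own function. This is a minimal sketch of the formula used in the retry loops above (capped exponential delay plus up to 50% random jitter). `backoff_delay` and its parameters are names chosen for illustration, and the 60-second cap is taken from the TL;DR.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Delay in seconds for a zero-based attempt: capped exponential plus up to 50% jitter."""
    delay = min(cap, base * (2 ** attempt))
    return delay + rng.uniform(0, delay * 0.5)


# 50 clients that were all rejected at the same instant, each on its third attempt.
rng = random.Random(42)  # seeded only so the sketch is repeatable
delays = [backoff_delay(2, rng=rng) for _ in range(50)]
print(f"retries land between {min(delays):.2f}s and {max(delays):.2f}s")
```

Without the jitter term all 50 clients would retry at exactly 4 seconds. With it they spread across the 4 to 6 second range. Note that the cap applies before jitter, so the longest possible wait in this sketch is 90 seconds rather than 60.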
Cost implication. A retry only costs extra if the provider processed the earlier attempt before it failed. A 429 is rejected at the gate: no tokens consumed, no cost. Retries after server errors (500, 502, 503) can consume tokens, because the request may have been partly processed before it failed. Track retry rates to understand their impact on your LLM costs.
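Tracking retry rates does not need much machinery. The sketch below is one hypothetical way to do it in process: `RetryStats` is an illustrative name, and treating every 5xx-triggered retry as potentially billable is an assumption, since providers differ in whether a failed request is charged.

```python
from collections import Counter


class RetryStats:
    def __init__(self):
        self.requests = 0
        self.retries = Counter()  # status code -> number of retries it caused

    def record_request(self):
        self.requests += 1

    def record_retry(self, status_code):
        self.retries[status_code] += 1

    def retry_rate(self):
        """Retries per original request."""
        total = sum(self.retries.values())
        return total / self.requests if self.requests else 0.0

    def possibly_billable_share(self):
        """Share of retries caused by 5xx responses, which may have consumed tokens."""
        total = sum(self.retries.values())
        billable = sum(n for code, n in self.retries.items() if code >= 500)
        return billable / total if total else 0.0
```

Call `record_request()` once per logical request and `record_retry(status)` inside the retry loop, then export both ratios to whatever metrics system you already use.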
Pattern 2: Client-side request queue
Exponential backoff handles the problem reactively — you hit the limit, then back off. A request queue handles it proactively — you never exceed the limit in the first place.
import threading
import time
from collections import deque

class RateLimitedQueue:
    def __init__(self, max_rpm=500):
        self.max_rpm = max_rpm
        self.interval = 60.0 / max_rpm  # seconds between requests
        self.last_request_time = 0.0
        self.lock = threading.Lock()
        self.queue = deque()

    def submit(self, request_fn):
        """Queue a request and block until it has run.

        Returns a dict with "done", "value" and "error" keys.
        """
        result = {"done": False, "value": None, "error": None}
        self.queue.append((request_fn, result))
        self._process()
        return result

    def _process(self):
        # The lock serializes senders, so only one request is in flight at a time
        with self.lock:
            while self.queue:
                elapsed = time.monotonic() - self.last_request_time
                if elapsed < self.interval:
                    time.sleep(self.interval - elapsed)
                request_fn, result = self.queue.popleft()
                # Stamp the send time before the call so spacing is measured start to start
                self.last_request_time = time.monotonic()
                try:
                    result["value"] = request_fn()
                except Exception as e:
                    result["error"] = e
                result["done"] = True
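This approach spaces requests evenly across the minute window. If your limit is 500 RPM, it sends one request every 120 ms (60 s / 500). As long as this queue is the only client using the key, you stay under the RPM limit. Token-per-minute limits can still trigger a 429, so keep the backoff from Pattern 1 as a safety net.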
The tradeoff. Queuing adds latency. If your app needs sub-second responses for every user, a queue that throttles to 500 RPM means some users wait. For background tasks (batch processing, content generation, data extraction), queuing is ideal. For real-time user-facing features, combine queuing with the fallback model pattern below.
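The added latency is easy to estimate. As a back-of-the-envelope sketch, assuming the queue starts empty, a burst arrives all at once, and requests are released at exactly one per interval (`queue_wait_seconds` is an illustrative name):

```python
def queue_wait_seconds(position, max_rpm):
    """Wait before the request at zero-based `position` in a burst is sent."""
    return position * (60.0 / max_rpm)


burst = 50
wait = queue_wait_seconds(burst - 1, 500)
print(f"last of {burst} requests waits {wait:.2f}s")
# prints "last of 50 requests waits 5.88s"
```

At 500 RPM a burst of 50 simultaneous requests leaves the last user waiting almost six seconds before their request is even sent, which is why real-time features need more than a queue.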
Pattern 3: Model fallback on rate limit
When your primary model is rate limited, route the request to a cheaper model instead of failing or queuing. The user gets a slightly different model but zero downtime.
# call_llm and RateLimitError are assumed to come from your own client wrapper or SDK
MODEL_FALLBACKS = {
    "gpt-4o": "gpt-4o-mini",
    "claude-sonnet-4-20250514": "claude-3-5-haiku-20241022",
}

def call_with_fallback(messages, model="gpt-4o"):
    try:
        return call_llm(messages, model=model)
    except RateLimitError:
        fallback = MODEL_FALLBACKS.get(model)
        if fallback:
            return call_llm(messages, model=fallback)
        raise
This pattern works because rate limits are per-model. Being rate limited on GPT-4o doesn't affect your GPT-4o-mini quota. You get lower quality but much better availability, as long as the fallback model is not throttled too.
When to use this. Fallback routing makes sense for tasks where quality degrades gracefully — chatbots, summarization, simple Q&A. It doesn't make sense for tasks where the cheaper model produces meaningfully worse output — complex reasoning, code generation, nuanced analysis. Know which tasks tolerate a model downgrade before you implement this.
For a detailed comparison of which cheaper model handles which tasks well, see our cheapest LLM for each use case breakdown.
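The two-model mapping extends naturally to a longer chain. The sketch below is one hypothetical way to write it: `call_llm` is passed in as a callable so the example stays self-contained, `RateLimitError` is a stand-in for whatever exception your SDK raises on a 429, and the chain lists only the two OpenAI models named earlier.

```python
class RateLimitError(Exception):
    """Stand-in for the rate-limit exception your SDK raises."""


FALLBACK_CHAIN = ["gpt-4o", "gpt-4o-mini"]


def call_with_chain(messages, call_llm, chain=FALLBACK_CHAIN):
    """Try each model in order; return (model_used, response)."""
    last_error = None
    for model in chain:
        try:
            return model, call_llm(messages, model=model)
        except RateLimitError as err:
            last_error = err  # this model is throttled, try the next one
    if last_error is None:
        raise ValueError("fallback chain is empty")
    raise last_error  # every model in the chain was rate limited
```

Returning the model that actually served the request lets you log how often users receive the downgraded model, which is the number to watch when deciding whether to move up a tier.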
Proactive rate limit monitoring
The best way to handle rate limits is to see them coming before they hit. Monitor your X-RateLimit-Remaining headers on every successful response, not just on 429 errors.
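import logging
import os

import requests

log = logging.getLogger(__name__)
URL = "https://api.openai.com/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

def call_llm_monitored(payload):
    response = requests.post(URL, json=payload, headers=HEADERS, timeout=60)
    # Default to 0 so missing headers skip the check instead of faking a limit
    limit = int(response.headers.get("X-RateLimit-Limit-Requests", 0))
    remaining = int(response.headers.get("X-RateLimit-Remaining-Requests", limit))
    if limit > 0:
        usage_percent = ((limit - remaining) / limit) * 100
        if usage_percent > 80:
            log.warning(f"Rate limit at {usage_percent:.0f}%: "
                        f"{remaining} requests remaining")
        if usage_percent > 95:
            # Proactively throttle before hitting the wall
            activate_request_queue()  # app-specific hook, defined elsewhere
    return response.json()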
Logging rate limit usage alongside your cost monitoring gives you a complete picture. You'll see correlations — cost spikes often coincide with rate limit pressure because both are driven by volume.
Rate limits at your proxy layer
If you're using a proxy between your app and LLM providers — which is how Tokonomics works — rate limiting happens at two levels:
- Your proxy's rate limits. The proxy may enforce its own limits per API key to protect shared infrastructure.
- The upstream provider's rate limits. OpenAI, Anthropic, etc. enforce their limits regardless of whether you're going through a proxy.
Tokonomics enforces per-key rate limits (10 req/min on Free, 60 req/min on Pro) and returns standard X-RateLimit-* headers so your retry logic works identically. When the upstream provider returns a 429, Tokonomics passes it through transparently — your app sees the same headers and can apply the same backoff patterns described above.
The advantage of a proxy is visibility. Instead of guessing why you're hitting rate limits, you can see the exact request volume per API key, per model, and per time window in your analytics dashboard. You can identify which feature or team is consuming the most capacity and optimize there first.
The decision framework
Choose your pattern based on your use case:
| Scenario | Pattern | Why |
|---|---|---|
| User-facing, real-time | Exponential backoff + fallback model | Users see zero errors, acceptable quality tradeoff |
| Batch processing | Request queue | Throughput matters more than latency |
| High volume, mixed priority | Queue + priority lanes | Critical requests skip the queue, batch jobs wait |
| Multi-model architecture | Fallback chain | GPT-4o → GPT-4o-mini → DeepSeek V3 |
Most production apps combine patterns. User-facing endpoints get exponential backoff with model fallback. Background jobs go through a rate-limited queue. Both feed into the same cost tracking system so you can see total spend regardless of which path a request took.
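The "priority lanes" row in the table can be sketched with a heap. This is a minimal illustration rather than a production scheduler: `PriorityLanes` and the three lane constants are hypothetical names, and the class only orders requests. Pacing them to your RPM limit would still be the job of something like the rate-limited queue from Pattern 2.

```python
import heapq
import itertools

CRITICAL, NORMAL, BATCH = 0, 1, 2  # lower number is served first


class PriorityLanes:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a lane

    def submit(self, request, priority=NORMAL):
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_request(self):
        """Pop the most urgent request, or None when nothing is queued."""
        if not self._heap:
            return None
        _, _, request = heapq.heappop(self._heap)
        return request


lanes = PriorityLanes()
lanes.submit("nightly-summary-1", priority=BATCH)
lanes.submit("nightly-summary-2", priority=BATCH)
lanes.submit("user-chat-reply", priority=CRITICAL)
print(lanes.next_request())  # prints "user-chat-reply"
```

The counter in each heap entry matters: without it, two requests with equal priority would be compared directly, which fails for objects that do not define an ordering.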
What to do right now
If your app calls any LLM API and doesn't handle 429 errors, you have a production bug. Here's the minimum fix:
- Add exponential backoff to every LLM call. Five lines of code. No excuse not to have this.
- Read the Retry-After header. It's more accurate than any backoff formula.
- Log rate limit events. You can't optimize what you don't measure.
- Set up budget alerts. Rate limits and cost overruns are often the same problem — high volume. Set alerts that fire before you hit either wall.
Rate limits are not bugs in the provider's API. They're a feature of operating at scale. The developers who handle them well build apps that stay reliable as traffic grows. The ones who ignore them build apps that break at the worst possible moment.
Last updated June 2026. All sources retrieved June 2026.