TL;DR — Yes. OpenAI Batch API cuts per-token price by 50% (GPT-4o: $2.50 → $1.25/M input). Tradeoff: results in up to 24 hours, not real-time. Best for: nightly data processing, bulk classification, async document analysis. Not for: user-facing chat, real-time search.
Yes. OpenAI's Batch API gives you a flat 50% discount on per-token pricing. GPT-4o input tokens drop from $2.50 to $1.25 per million. GPT-4o-mini drops from $0.15 to $0.075. Same models, same quality, half the price.
The tradeoff: you give up real-time responses. Batch requests are queued and processed within a 24-hour window. You submit a file of requests, OpenAI processes them when capacity is available, and you download the results later. If your workload doesn't need instant responses, this is the easiest cost optimization available.
How the Batch API works
Instead of sending individual API calls, you upload a JSONL file containing multiple requests. Each line is a complete chat completion request:
{"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize this article..."}]}}
{"custom_id": "req-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Extract entities from..."}]}}
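A file in this shape can be generated from a list of prompts. The following is a minimal sketch, not an official helper: the function names `build_batch_lines` and `write_batch_file` and the `req-N` ID scheme are illustrative assumptions, and the request body mirrors the two sample lines above.

```python
import json


def build_batch_lines(prompts, model="gpt-4o"):
    """Return one JSONL line per prompt, in the request shape shown above."""
    lines = []
    for i, prompt in enumerate(prompts, start=1):
        request = {
            "custom_id": f"req-{i}",  # hypothetical ID scheme; any unique string works
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return lines


def write_batch_file(prompts, path="requests.jsonl"):
    """Write the request lines to disk, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in build_batch_lines(prompts):
            f.write(line + "\n")
```

Keeping `custom_id` unique and meaningful matters because it is the only link between a request and its result once the batch comes back.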
Then you create a batch:
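from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL file of requests
with open("requests.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

# Create the batch against the chat completions endpoint
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Check status later; it is "completed" only once processing has finished
batch = client.batches.retrieve(batch.id)
print(batch.status)

# Download results once the batch has completed
if batch.status == "completed":
    results = client.files.content(batch.output_file_id)
    print(results.text)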
OpenAI processes the batch and returns a JSONL file with all responses, mapped back to your custom_id values.
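Because a batch finishes at an unpredictable time, the status check usually runs in a loop. Here is a rough sketch of one way to poll, assuming the terminal status names from OpenAI's batch object documentation; `wait_for_batch` is a hypothetical helper, and the lookup function is passed in so the loop can be exercised without network access.

```python
import time

# Statuses after which a batch will not change again (assumed from OpenAI's docs).
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(retrieve, batch_id, poll_seconds=60, max_polls=1440, sleep=time.sleep):
    """Poll until the batch reaches a terminal status, then return it.

    `retrieve` is any callable that takes a batch ID and returns an object
    with a `.status` attribute, for example `client.batches.retrieve`.
    """
    for _ in range(max_polls):
        batch = retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        sleep(poll_seconds)
    raise TimeoutError(f"batch {batch_id} still running after {max_polls} polls")
```

With the default values the loop checks once a minute for up to 24 hours, which matches the completion window. A scheduled job that checks once and exits is an equally reasonable design if you don't want a long-running process.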
The exact savings
| Model | Standard input | Batch input | Standard output | Batch output | Savings |
|---|---|---|---|---|---|
| GPT-4o | $2.50/1M | $1.25/1M | $10.00/1M | $5.00/1M | 50% |
| GPT-4o-mini | $0.15/1M | $0.075/1M | $0.60/1M | $0.30/1M | 50% |
| o1 | $15.00/1M | $7.50/1M | $60.00/1M | $30.00/1M | 50% |
The discount is the same 50% for every model listed here, on both input and output tokens. For the full pricing table on standard rates, see our LLM API pricing guide.
Real savings example: A team processing 10,000 product descriptions daily through GPT-4o for categorization:
- Average input: 500 tokens per request
- Average output: 100 tokens per request
- Daily tokens: 5M input + 1M output
| Pricing | Daily cost | Monthly cost |
|---|---|---|
| Standard | $12.50 + $10.00 = $22.50 | $675 |
| Batch | $6.25 + $5.00 = $11.25 | $337.50 |
| Savings | $11.25/day | $337.50/month |
That's $337.50/month saved by changing how you submit requests — no prompt changes, no model downgrade, no quality loss.
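The arithmetic behind that table can be reproduced in a few lines. This is a small sketch using the GPT-4o prices quoted above; the `daily_cost` function and the 30-day month are assumptions made for illustration.

```python
# Prices in USD per 1M tokens, copied from the GPT-4o row of the pricing table.
PRICES = {
    "standard": {"input": 2.50, "output": 10.00},
    "batch": {"input": 1.25, "output": 5.00},
}


def daily_cost(requests, input_tokens, output_tokens, tier):
    """Daily cost in USD for a given request volume and average token counts."""
    price = PRICES[tier]
    input_millions = requests * input_tokens / 1_000_000
    output_millions = requests * output_tokens / 1_000_000
    return input_millions * price["input"] + output_millions * price["output"]


standard = daily_cost(10_000, 500, 100, "standard")
batch = daily_cost(10_000, 500, 100, "batch")
monthly_savings = (standard - batch) * 30  # assumes a 30-day month
print(standard, batch, monthly_savings)  # 22.5 11.25 337.5
```

Swapping in your own request volume and token averages gives a quick estimate before you commit to any migration work.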
When the Batch API makes sense
The Batch API works for any workload where you don't need an instant response. Common use cases:
Content generation at scale. Generating product descriptions, marketing copy, or blog drafts. You prepare a batch of prompts, submit overnight, and review results in the morning.
Data processing and extraction. Parsing invoices, extracting entities from documents, classifying support tickets. These are typically queued jobs anyway — the batch API just makes them cheaper.
Evaluation and testing. Running your prompt suite against a new model version. Instead of firing 1,000 API calls in real time, batch them and get results later at half the cost.
Embeddings generation. Processing a large corpus for RAG indexing. Embedding 100,000 documents doesn't need to happen in real time.
Dataset labeling. Using GPT-4o to label training data for a fine-tuned model. Label quality doesn't change whether you process the request at 2pm or 2am.
When the Batch API doesn't work
User-facing features. Chatbots, search, real-time recommendations — anything where a user is waiting for a response. A 24-hour completion window is obviously not acceptable.
Interactive workflows. Multi-turn conversations, agent loops, or any flow where the next step depends on the previous response. Batch can't handle sequential dependencies.
Time-sensitive processing. Fraud detection, real-time content moderation, or alerts that need to fire immediately.
Small volumes. If you're making 100 calls/day, the Batch API saves you a few cents a day. The engineering effort to restructure your code for batch processing isn't worth it below roughly 1,000 requests/day.
Combining batch and real-time
The highest-impact approach is splitting your workload:
Real-time (standard pricing):
├── User-facing chatbot responses
├── Search and recommendations
└── Interactive agent steps
Batch (50% discount):
├── Nightly content generation
├── Daily report summarization
├── Weekly data classification
└── Embedding index updates
Most teams have both workload types. A SaaS app might use real-time API calls for the chatbot feature (user is waiting) and batch processing for the nightly analytics digest (nobody is waiting). Running the batch portion through the Batch API cuts that segment's cost in half.
To understand which parts of your app could move to batch, start by tracking costs per feature. Once you see the cost breakdown, you can identify which features are batch-eligible and calculate the potential savings.
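Once a per-feature breakdown exists, the estimate is a one-line sum. The sketch below assumes you already have monthly cost per feature. The feature names and dollar figures are hypothetical, and `batch_savings` is an illustrative helper rather than part of any library.

```python
def batch_savings(feature_costs, batch_eligible, discount=0.5):
    """Estimate monthly savings from moving eligible features to batch.

    feature_costs: {feature name: current monthly cost in USD}
    batch_eligible: set of feature names that tolerate delayed results
    """
    eligible_spend = sum(
        cost for name, cost in feature_costs.items() if name in batch_eligible
    )
    return eligible_spend * discount


# Hypothetical figures for illustration only.
costs = {"chatbot": 400.0, "nightly_digest": 250.0, "embedding_updates": 150.0}
savings = batch_savings(costs, {"nightly_digest", "embedding_updates"})
print(f"${savings:.2f} saved of ${sum(costs.values()):.2f} total")
```

In this made-up case the chatbot stays on standard pricing, so only half of the total spend is eligible and the overall bill drops by a quarter, not by half.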
Batch API limitations to know
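24-hour window, not instant. OpenAI targets completion within 24 hours, and in practice most batches finish in 1-4 hours, but you can't rely on a specific completion time. A batch that doesn't finish inside the window expires: unfinished requests are cancelled and results for the completed ones are still returned. Don't build workflows that assume results arrive within an hour.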
File size limits. A batch can contain at most 50,000 requests, and the input file can be at most 200 MB. For larger workloads, split into multiple batch submissions.
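Splitting a large workload is mostly list slicing. A minimal sketch, assuming the request lines are already in memory as a list; `split_into_batches` is a hypothetical helper, and it enforces only the request-count limit, not any limit on file size in bytes.

```python
MAX_REQUESTS_PER_BATCH = 50_000  # the per-batch request limit cited above


def split_into_batches(lines, max_requests=MAX_REQUESTS_PER_BATCH):
    """Yield consecutive chunks of at most `max_requests` JSONL lines."""
    for start in range(0, len(lines), max_requests):
        yield lines[start:start + max_requests]
```

Each chunk would then be written to its own JSONL file and submitted as a separate batch. Since `custom_id` values carry over unchanged, results from all the batches can be merged back together afterwards.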
No streaming. Batch responses are returned as complete JSON objects. If you normally rely on streaming for progress indicators, you'll need a different UX pattern for batch results.
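Error handling is different. In real time, you catch a 429 or 500 error and retry immediately. In batch, each request succeeds or fails on its own: failures are reported per line with an error field, keyed by custom_id, and are written to a separate error file referenced by the batch's error_file_id. You need post-processing logic to identify and resubmit failures.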
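That post-processing step might look something like the following sketch. It assumes each result line is a JSON object with `custom_id`, `response` (holding `status_code` and `body`), and `error` keys, which is the shape described in OpenAI's Batch documentation; verify it against your own results before relying on it. The helper names are illustrative, and the same scan works on result lines from the output file or from an error file.

```python
import json


def split_results(jsonl_text):
    """Separate result lines into successes and failures, keyed by custom_id."""
    succeeded, failed = {}, {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            failed[record["custom_id"]] = record.get("error") or response
        else:
            succeeded[record["custom_id"]] = response.get("body")
    return succeeded, failed


def lines_to_resubmit(original_lines, failed):
    """Pick the original request lines whose custom_id appears in `failed`."""
    return [line for line in original_lines if json.loads(line)["custom_id"] in failed]
```

The lines returned by `lines_to_resubmit` can be written to a new JSONL file and submitted as a follow-up batch. In practice you would probably cap the number of resubmission rounds so a permanently invalid request doesn't loop forever.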
Not all endpoints are supported. The Batch API covers a fixed set of endpoints that includes chat completions and embeddings. It does not support image generation, audio, or fine-tuning.
Tracking batch vs real-time costs
When you split workloads between batch and real-time, you need to track costs for both channels. Provider dashboards don't always separate batch and standard usage clearly.
With Tokonomics, you can tag batch requests differently from real-time ones:
# Real-time calls
headers = {"X-Metering-Tags": '{"channel":"realtime","feature":"chatbot"}'}
# Before submitting to batch, log the expected cost
headers = {"X-Metering-Tags": '{"channel":"batch","feature":"digest"}'}
This lets you see in your cost dashboard exactly how much you're spending on each channel, verify the 50% savings is materializing, and catch any drift.
The bottom line
The Batch API is the simplest cost optimization OpenAI offers. No prompt engineering, no model switching, no quality tradeoff. You restructure how you submit requests and save 50%.
Action items:
- Audit your workloads. Which features don't need real-time responses?
- Calculate potential savings. Take your current monthly spend on batch-eligible workloads and halve it.
- Start with one workload. Pick your highest-volume non-real-time task and migrate it to batch.
- Track the savings. Use budget monitoring to verify the cost reduction shows up.
If you're spending more than $500/month on OpenAI and any portion of your workload is non-real-time, the Batch API should be your first optimization — before prompt optimization, before model switching, before anything else. It's free money.
Last updated June 2026. All sources retrieved June 2026.