June 6, 2026

OpenAI Batch API: Can It Really Save You Money?


TL;DR — Yes. OpenAI Batch API cuts per-token price by 50% (GPT-4o: $2.50 → $1.25/M input). Tradeoff: results in up to 24 hours, not real-time. Best for: nightly data processing, bulk classification, async document analysis. Not for: user-facing chat, real-time search.

Yes. OpenAI's Batch API gives you a flat 50% discount on per-token pricing. GPT-4o input tokens drop from $2.50 to $1.25 per million. GPT-4o-mini drops from $0.15 to $0.075. Same models, same quality, half the price.

The tradeoff: you give up real-time responses. Batch requests are queued and processed within a 24-hour window. You submit a file of requests, OpenAI processes them when capacity is available, and you download the results later. If your workload doesn't need instant responses, this is the easiest cost optimization available.

How the Batch API works

Instead of sending individual API calls, you upload a JSONL file containing multiple requests. Each line is a complete chat completion request:

{"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize this article..."}]}}
{"custom_id": "req-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Extract entities from..."}]}}
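If your prompts already live in a list or a database table, a small helper can generate this file for you. The sketch below is one way to do it, not the only one: the function names `build_batch_lines` and `write_batch_file` and the `req-N` ID scheme are our own choices for illustration.

```python
import json


def build_batch_lines(prompts, model="gpt-4o"):
    """Return one JSONL line per prompt, in the request shape shown above."""
    lines = []
    for i, prompt in enumerate(prompts, start=1):
        request = {
            "custom_id": f"req-{i}",  # hypothetical ID scheme; must be unique per file
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        # json.dumps escapes newlines inside prompts, so each request stays on one line
        lines.append(json.dumps(request))
    return lines


def write_batch_file(prompts, path="requests.jsonl", model="gpt-4o"):
    """Write the requests to a JSONL file and return how many were written."""
    lines = build_batch_lines(prompts, model)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)
```

Pick `custom_id` values you can map back to your own records (a product SKU or ticket ID works well), since that is the only handle you get on each result.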

Then you create a batch:

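from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL file of requests
with open("requests.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

# Create the batch
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Check status later (a new batch starts as "validating", then "in_progress")
batch = client.batches.retrieve(batch.id)
print(batch.status)  # "completed" once every request has been processed

# Download results only after the batch has completed
if batch.status == "completed":
    results = client.files.content(batch.output_file_id).text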

OpenAI processes the batch and returns a JSONL file with all responses, mapped back to your custom_id values.
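To turn that file back into something usable, you parse it line by line and key each reply by its `custom_id`. Here is a minimal sketch, assuming the documented output shape in which each line carries `custom_id`, a `response` object with `status_code` and `body`, and an `error` field. The function name `parse_batch_output` is ours.

```python
import json


def parse_batch_output(jsonl_text):
    """Map each custom_id to the assistant's reply text, skipping failed requests."""
    replies = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            continue  # failed request; handle these separately
        body = response["body"]
        replies[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return replies
```

Look results up by `custom_id` rather than by position: the output lines are not guaranteed to come back in the order you submitted them.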

The exact savings

| Model | Standard input | Batch input | Standard output | Batch output | Savings |
|---|---|---|---|---|---|
| GPT-4o | $2.50/1M | $1.25/1M | $10.00/1M | $5.00/1M | 50% |
| GPT-4o-mini | $0.15/1M | $0.075/1M | $0.60/1M | $0.30/1M | 50% |
| o1 | $15.00/1M | $7.50/1M | $60.00/1M | $30.00/1M | 50% |

The 50% discount is the same for every model in this table, on both input and output tokens. For the full pricing table on standard rates, see our LLM API pricing guide.

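Real savings example: A team processing 10,000 product descriptions daily through GPT-4o for categorization. The figures below work out to about 500 input and 100 output tokens per request (5M input and 1M output tokens per day), with monthly costs based on 30 days: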

| Pricing | Daily cost (input + output) | Monthly cost |
|---|---|---|
| Standard | $12.50 + $10.00 = $22.50 | $675 |
| Batch | $6.25 + $5.00 = $11.25 | $337.50 |
| Savings | $11.25/day | $337.50/month |

That's $337.50/month saved by changing how you submit requests — no prompt changes, no model downgrade, no quality loss.
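To rerun this arithmetic with your own volumes, a few lines of Python are enough. This sketch uses the GPT-4o and GPT-4o-mini prices from the table above and assumes roughly 500 input and 100 output tokens per request, which is what the daily figures in the example imply. The function name and the token counts are illustrative, so swap in your own averages.

```python
# Prices in USD per 1M tokens, as (input, output), taken from the table above
PRICES_PER_MILLION = {
    "gpt-4o": {"standard": (2.50, 10.00), "batch": (1.25, 5.00)},
    "gpt-4o-mini": {"standard": (0.15, 0.60), "batch": (0.075, 0.30)},
}


def daily_cost(requests, input_tokens, output_tokens, model, tier):
    """Daily cost in USD for a given request volume and average token counts."""
    input_price, output_price = PRICES_PER_MILLION[model][tier]
    total_input = requests * input_tokens
    total_output = requests * output_tokens
    return total_input / 1_000_000 * input_price + total_output / 1_000_000 * output_price


standard = daily_cost(10_000, 500, 100, "gpt-4o", "standard")
batch = daily_cost(10_000, 500, 100, "gpt-4o", "batch")

print(standard, batch)                 # 22.5 11.25
print((standard - batch) * 30)         # 337.5 saved over a 30-day month
```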

When the Batch API makes sense

The Batch API works for any workload where you don't need an instant response. Common use cases:

Content generation at scale. Generating product descriptions, marketing copy, or blog drafts. You prepare a batch of prompts, submit overnight, and review results in the morning.

Data processing and extraction. Parsing invoices, extracting entities from documents, classifying support tickets. These are typically queued jobs anyway — the batch API just makes them cheaper.

Evaluation and testing. Running your prompt suite against a new model version. Instead of firing 1,000 API calls in real-time, batch them and get results in a few hours at half the cost.

Embeddings generation. Processing a large corpus for RAG indexing. Embedding 100,000 documents doesn't need to happen in real-time.

Dataset labeling. Using GPT-4o to label training data for a fine-tuned model. Label quality doesn't change whether you process the request at 2pm or 2am.
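For the embeddings case, the request lines look slightly different from the chat examples: the `url` points at the embeddings endpoint and the body carries an `input` instead of `messages`. A sketch, assuming `text-embedding-3-small` as the model and a `doc-N` ID scheme of our own:

```python
import json


def build_embedding_lines(texts, model="text-embedding-3-small"):
    """Return one JSONL batch request per document for the embeddings endpoint."""
    lines = []
    for i, text in enumerate(texts, start=1):
        request = {
            "custom_id": f"doc-{i}",  # hypothetical ID scheme
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        lines.append(json.dumps(request))
    return lines
```

When you create the batch for a file like this, the `endpoint` argument has to match the lines, so it would be `/v1/embeddings` rather than `/v1/chat/completions`.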

When the Batch API doesn't work

User-facing features. Chatbots, search, real-time recommendations — anything where a user is waiting for a response. A 24-hour completion window is obviously not acceptable.

Interactive workflows. Multi-turn conversations, agent loops, or any flow where the next step depends on the previous response. Requests in a batch are processed independently, so every dependent step would have to wait for a whole new batch to finish.

Time-sensitive processing. Fraud detection, real-time content moderation, or alerts that need to fire immediately.

Small volumes. If you're making 100 calls/day, the Batch API saves you a few cents. The engineering effort to restructure your code for batch processing isn't worth it below roughly 1,000 requests/day.

Combining batch and real-time

The highest-impact approach is splitting your workload:

Real-time (standard pricing):
├── User-facing chatbot responses
├── Search and recommendations
└── Interactive agent steps

Batch (50% discount):
├── Nightly content generation
├── Daily report summarization
├── Weekly data classification
└── Embedding index updates

Most teams have both workload types. A SaaS app might use real-time API calls for the chatbot feature (user is waiting) and batch processing for the nightly analytics digest (nobody is waiting). Running the batch portion through the Batch API cuts that segment's cost in half.
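In code, the split can be as simple as a rule that looks at whether anyone is waiting on the answer. The sketch below is a hypothetical routing helper, not part of any SDK. The `Workload` fields and the `choose_channel` name are ours, and real systems will likely need more criteria (deadlines, volume, retries).

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    user_is_waiting: bool
    depends_on_previous_response: bool = False


def choose_channel(workload):
    """Return "realtime" or "batch" for a workload, using the rules described above."""
    if workload.user_is_waiting or workload.depends_on_previous_response:
        return "realtime"  # standard pricing, immediate response
    return "batch"         # 50% discount, results within the completion window


workloads = [
    Workload("chatbot", user_is_waiting=True),
    Workload("agent_step", user_is_waiting=False, depends_on_previous_response=True),
    Workload("nightly_digest", user_is_waiting=False),
]
plan = {w.name: choose_channel(w) for w in workloads}
print(plan)
```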

To understand which parts of your app could move to batch, start by tracking costs per feature. Once you see the cost breakdown, you can identify which features are batch-eligible and calculate the potential savings.

Batch API limitations to know

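24-hour window, not instant. OpenAI targets completion within 24 hours, and most batches finish in 1-4 hours. A batch that doesn't finish inside the window expires, and you get results only for the requests that completed. You can't rely on a specific completion time, so don't build workflows that assume results arrive within an hour.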
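Because completion time varies, the usual pattern is to poll until the batch reaches a terminal status instead of sleeping for a fixed period. A minimal sketch follows. The `wait_for_batch` helper is ours, and it takes the retrieve function as an argument so you can pass `client.batches.retrieve` in production or a stub in tests. The terminal statuses listed are the ones we are aware of from the API reference.

```python
import time

# Statuses after which a batch will not change any further
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(retrieve, batch_id, poll_seconds=300,
                   max_wait_seconds=24 * 60 * 60, sleep=time.sleep):
    """Poll until the batch reaches a terminal status, then return it."""
    waited = 0
    while True:
        batch = retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        if waited >= max_wait_seconds:
            raise TimeoutError(f"batch {batch_id} still {batch.status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

A call would look like `wait_for_batch(client.batches.retrieve, batch.id)`. Check the returned status before reading results, since `expired` and `failed` batches need different handling from `completed` ones.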

File size limits. Maximum 50,000 requests per batch, and the input file can be at most 200 MB. For larger workloads, split into multiple batch submissions.
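Splitting a large workload is a simple slicing exercise. This sketch chunks a list of request lines by the 50,000-request limit. It does not check file size, which you would need to handle separately if your prompts are long. The function name is ours.

```python
def split_into_batches(request_lines, max_requests=50_000):
    """Split request lines into consecutive chunks of at most max_requests each."""
    return [
        request_lines[i:i + max_requests]
        for i in range(0, len(request_lines), max_requests)
    ]


chunks = split_into_batches([f"line-{i}" for i in range(120_000)])
print([len(chunk) for chunk in chunks])  # [50000, 50000, 20000]
```

Each chunk then becomes its own JSONL file and its own batch.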

No streaming. Batch responses are returned as complete JSON objects. If you normally rely on streaming for progress indicators, you'll need a different UX pattern for batch results.

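Error handling is different. In real-time, you catch a 429 or 500 error and retry immediately. In batch, failures are reported per request after the fact: each failed request is written out with its custom_id and error details, in a separate error file referenced by the batch's error_file_id. You need post-processing logic to identify and resubmit failures.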
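That post-processing step can be sketched in two small functions: one that collects the `custom_id` of every failed request from the result lines, and one that pulls the matching lines out of your original input so they can go into a retry batch. The names are ours, and the check assumes each result line carries `custom_id`, `response.status_code`, and an `error` field. Run it over whichever result file the batch returns.

```python
import json


def find_failed_ids(result_jsonl):
    """Return the custom_id of every request that errored or returned a non-200 status."""
    failed = []
    for line in result_jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            failed.append(record["custom_id"])
    return failed


def build_retry_lines(original_lines, failed_ids):
    """Select the original request lines whose custom_id needs to be resubmitted."""
    wanted = set(failed_ids)
    return [line for line in original_lines if json.loads(line)["custom_id"] in wanted]
```

Before resubmitting, look at why each request failed. A malformed request will fail again, while a rate-limit or server error is usually worth retrying.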

Not all endpoints supported. The Batch API supports chat completions and embeddings. It does not support image generation, audio, or fine-tuning.

Tracking batch vs real-time costs

When you split workloads between batch and real-time, you need to track costs for both channels. Provider dashboards don't always separate batch and standard usage clearly.

With Tokonomics, you can tag batch requests differently from real-time ones:

# Real-time calls
headers = {"X-Metering-Tags": '{"channel":"realtime","feature":"chatbot"}'}

# Before submitting to batch, log the expected cost
headers = {"X-Metering-Tags": '{"channel":"batch","feature":"digest"}'}

This lets you see in your cost dashboard exactly how much you're spending on each channel, verify the 50% savings is materializing, and catch any drift.

The bottom line

The Batch API is the simplest cost optimization OpenAI offers. No prompt engineering, no model switching, no quality tradeoff. You restructure how you submit requests and save 50%.

Action items:

  1. Audit your workloads. Which features don't need real-time responses?
  2. Calculate potential savings. Multiply your batch-eligible volume by current cost, then halve it.
  3. Start with one workload. Pick your highest-volume non-real-time task and migrate it to batch.
  4. Track the savings. Use budget monitoring to verify the cost reduction shows up.
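Steps 1 and 2 fit in a few lines once you have per-feature spend. The sketch below uses made-up monthly figures purely for illustration, and the function name is ours. Replace the dictionary with your own cost breakdown.

```python
BATCH_DISCOUNT = 0.50  # the flat 50% discount discussed above


def estimate_monthly_savings(monthly_spend_by_feature, batch_eligible):
    """Return the estimated monthly savings per feature that can move to batch."""
    return {
        feature: spend * BATCH_DISCOUNT
        for feature, spend in monthly_spend_by_feature.items()
        if feature in batch_eligible
    }


# Hypothetical monthly spend in USD, not real data
spend = {"chatbot": 400.0, "nightly_digest": 300.0, "ticket_classification": 120.0}
savings = estimate_monthly_savings(spend, {"nightly_digest", "ticket_classification"})

print(savings)                # {'nightly_digest': 150.0, 'ticket_classification': 60.0}
print(sum(savings.values()))  # 210.0
```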

If you're spending more than $500/month on OpenAI and any portion of your workload is non-real-time, the Batch API should be your first optimization — before prompt optimization, before model switching, before anything else. It's free money.

Last updated June 2026. All sources retrieved June 2026.

About the author
Zouhair is the founder of Tokonomics. He built the platform after receiving a $47,000 LLM invoice that his team didn't see coming. He tracks LLM pricing changes weekly across all major providers.