The Batch API Playbook: 50% Off for Workloads That Can Wait

The cheapest optimization you are not using

OpenAI's Batch API charges 50% of standard pricing for requests processed within 24 hours. Not 5% cheaper. Not 15% for high-volume customers. Half price, available to every account, today.

The engineering cost to adopt it: two API calls and a .jsonl file.

If you are paying $10,000/month on GPT-4o for document classification, nightly report generation, or bulk embedding, the path to $5,000/month is a half-day of work.

What qualifies

Batch works for any workload where you can tolerate up to 24 hours of latency. The practical categories:

Nightly reports. Summarize the day's activity, generate the weekly digest, produce the Monday standup brief.

Document indexing. Extract entities, classify documents, generate embeddings; all of this is batch-safe.

Evaluation runs. LLM-as-judge evals on your test set do not need real-time responses.

Data enrichment. Product description generation, SEO metadata, schema extraction from raw documents.

Offline analysis. Sentiment analysis on customer support tickets, classification of inbound emails, categorization of log messages.

The only disqualifying criterion: the user is waiting. If a human expects a response in under a minute, it is not a batch workload.

How it works

Three steps: build a .jsonl file, upload it, poll for completion.

Step 1: Build your request file

```python
import json

# `documents` is assumed to be a list of strings you want classified.
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [
                {"role": "system", "content": "Classify the sentiment: positive, neutral, or negative."},
                {"role": "user", "content": document},
            ],
            "max_tokens": 10,
        },
    }
    for i, document in enumerate(documents)
]

# One JSON object per line, as the .jsonl format requires.
with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
```

Step 2: Submit the batch

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the request file, then create a batch that points at it.
with open("batch_requests.jsonl", "rb") as f:
    batch_input_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print(f"Batch ID: {batch.id}")
```

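Step 3: Retrieve results

```python
import json
import time

# Poll until the batch reaches a terminal state, not only "completed",
# so a failed or expired batch does not loop forever.
terminal_states = {"completed", "failed", "expired", "cancelled"}
while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in terminal_states:
        break
    time.sleep(60)

if batch.status != "completed":
    raise RuntimeError(f"Batch ended with status {batch.status}")

content = client.files.content(batch.output_file_id)
results = [json.loads(line) for line in content.text.strip().split("\n")]
```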

Each result maps back to your custom_id. Failed requests are in a separate error file; you can resubmit only the failures.
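As a rough sketch of that mapping step, the helper below joins output lines back to their custom_id and collects the IDs worth resubmitting. The function names `split_results` and `requests_to_retry` are hypothetical, and the record shape (a `response` object carrying `status_code` and `body`, plus a top-level `error` field) is an assumption about the output format that you should check against a real output file.

```python
import json


def split_results(output_lines):
    """Return ({custom_id: reply_text}, [custom_ids that failed])."""
    answers, failed_ids = {}, []
    for line in output_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        response = record.get("response") or {}
        # Assumed shape: a request succeeded if there is no error
        # and the embedded HTTP status is 200.
        if record.get("error") or response.get("status_code") != 200:
            failed_ids.append(record["custom_id"])
            continue
        message = response["body"]["choices"][0]["message"]
        answers[record["custom_id"]] = message["content"].strip()
    return answers, failed_ids


def requests_to_retry(requests, failed_ids):
    """Pick the original request dicts whose custom_id failed."""
    failed = set(failed_ids)
    return [req for req in requests if req["custom_id"] in failed]
```

Writing the output of `requests_to_retry` to a new .jsonl file gives you a retry batch that contains only the failures.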

The math

Say you run nightly document classification on 10,000 documents, averaging 500 input tokens and 10 output tokens each.

Standard pricing (gpt-4o as of May 2026):

Input: 10,000 x 500 = 5M tokens at $2.50/M = $12.50

Output: 10,000 x 10 = 100K tokens at $10/M = $1.00

Nightly cost: $13.50

Batch pricing (50% off):

Nightly cost: $6.75

Annual savings: ~$2,464

For 100,000 documents/night, that is roughly $24,640/year for adopting an asynchronous queue you already effectively have.

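
To rerun this arithmetic for your own volumes, a small calculator along these lines may help. It is a sketch: `nightly_cost` is a hypothetical helper, the default prices are the gpt-4o figures quoted above, and the 50% discount is applied to the whole bill as in the example.

```python
def nightly_cost(docs, input_tokens, output_tokens,
                 input_price_per_m=2.50, output_price_per_m=10.00,
                 batch_discount=0.0):
    """Cost in dollars for one nightly run; prices are per million tokens."""
    input_cost = docs * input_tokens / 1_000_000 * input_price_per_m
    output_cost = docs * output_tokens / 1_000_000 * output_price_per_m
    return (input_cost + output_cost) * (1 - batch_discount)


standard = nightly_cost(10_000, 500, 10)
batch = nightly_cost(10_000, 500, 10, batch_discount=0.5)
annual_savings = (standard - batch) * 365
print(f"standard ${standard:.2f}, batch ${batch:.2f}, saved per year ${annual_savings:,.2f}")
# standard $13.50, batch $6.75, saved per year $2,463.75
```
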
Compound it with context compression

The batch discount and context compression are orthogonal. A 500-token document that compresses to 150 tokens before it hits the API drops your input cost by 70%. Combine the two:

Uncompressed, real-time: 100% of cost

Compressed, real-time: ~32% of cost

Uncompressed, batch: ~50% of cost

Compressed, batch: ~16% of cost

For offline workloads, compressed batch processing costs roughly one-sixth of naive real-time inference. The engineering effort is a .jsonl formatter and a compression call.
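
The exact shares depend on your token mix, because compression shrinks only the input side. The sketch below assumes the gpt-4o prices and the 500-input, 10-output request from the earlier example; `cost_share` is a hypothetical helper, not part of any SDK. For that workload the compressed shares come out a few points above the rounded figures in the list, since the output tokens are not compressed.

```python
def cost_share(input_tokens, output_tokens, compression_ratio=1.0, batch=False,
               input_price_per_m=2.50, output_price_per_m=10.00):
    """Per-request cost as a fraction of the uncompressed, real-time cost."""
    def cost(tokens_in):
        # Only the ratio matters, so the per-million scaling cancels out.
        return tokens_in * input_price_per_m + output_tokens * output_price_per_m

    baseline = cost(input_tokens)
    scenario = cost(input_tokens * compression_ratio)
    if batch:
        scenario *= 0.5  # assumed 50% batch discount on the whole bill
    return scenario / baseline


# 500 -> 150 input tokens is a compression ratio of 0.3.
for label, ratio, is_batch in [
    ("uncompressed, real-time", 1.0, False),
    ("compressed, real-time", 0.3, False),
    ("uncompressed, batch", 1.0, True),
    ("compressed, batch", 0.3, True),
]:
    print(f"{label}: {cost_share(500, 10, ratio, is_batch):.1%}")
```

The more of your bill that comes from output tokens, the less compression alone buys you, so it is worth running this with your real averages.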

What to watch for

Size limits apply per batch. Very large batches (>50K requests) need to be split into several smaller ones. If you exceed the limit, the API returns an error that states it.
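
One way to do the split, sketched under the assumption that the 50K figure above is the per-batch cap for your account: chunk the request list and write one .jsonl file per chunk. The helper names and the file-name pattern are hypothetical.

```python
import json


def chunk_requests(requests, max_requests=50_000):
    """Yield consecutive slices of at most max_requests items."""
    for start in range(0, len(requests), max_requests):
        yield requests[start:start + max_requests]


def write_batch_files(requests, prefix="batch_requests", max_requests=50_000):
    """Write one .jsonl file per chunk and return the file paths."""
    paths = []
    for index, chunk in enumerate(chunk_requests(requests, max_requests)):
        path = f"{prefix}_{index:03d}.jsonl"
        with open(path, "w") as f:
            for req in chunk:
                f.write(json.dumps(req) + "\n")
        paths.append(path)
    return paths
```

Each file is then uploaded and submitted as its own batch. Keeping custom_id values unique across all chunks lets you merge the results afterwards.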

Batch quotas exist per organization. Your first batches will hit a lower limit that increases with usage history.

Output tokens are discounted too. The 50% rate applies to input and output tokens alike, which is what the math above assumes. Confirm the current batch rates for your model before you budget around them.

The 24h window is a ceiling, not a floor. Most batches complete in 1-4 hours, but that is typical behavior, not a guarantee. If a downstream job depends on the results, schedule it against the full 24-hour window, not the typical completion time.
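
If a downstream job has its own cutoff, a polling loop with an explicit deadline makes the late case visible instead of blocking forever. This is a minimal sketch: `wait_for_batch` is a hypothetical helper, `retrieve` can be any callable that returns an object with a `.status` attribute (for example `client.batches.retrieve`), and the set of terminal status names is an assumption to verify against the API reference.

```python
import time

# Assumed terminal statuses; check them against the API reference.
TERMINAL_STATES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(retrieve, batch_id, deadline_s, poll_s=60,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the batch is terminal or deadline_s seconds have passed.

    Returns the final status, or "deadline_exceeded" if the batch is
    still running when the deadline is reached.
    """
    start = clock()
    while True:
        status = retrieve(batch_id).status
        if status in TERMINAL_STATES:
            return status
        if clock() - start >= deadline_s:
            return "deadline_exceeded"
        sleep(poll_s)
```

A caller might use `wait_for_batch(client.batches.retrieve, batch.id, deadline_s=6 * 3600)` and fall back to yesterday's results, or alert, when the return value is anything other than "completed".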


Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{batch-api-50-percent-off-async-workloads-2026,
  title  = {The Batch API Playbook: 50% Off for Workloads That Can Wait},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://www.gotcontext.ai/blog/batch-api-50-percent-off-async-workloads},
  note   = {gotcontext.ai engineering blog.},
}

APA

James Hollingsworth. (2026, May 8). The Batch API Playbook: 50% Off for Workloads That Can Wait. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/batch-api-50-percent-off-async-workloads.
