The Batch API Playbook: 50% Off for Workloads That Can Wait
OpenAI charges half price for requests processed within 24 hours. Nightly reports, bulk classification, and offline enrichment all qualify. The engineering cost: two API calls and a .jsonl file. Stack batch pricing with context compression and you are at roughly one-sixth of naive real-time cost.
The cheapest optimization you are not using
OpenAI's Batch API charges 50% of standard pricing for requests processed within 24 hours. Not 5% cheaper. Not 15% for high-volume customers. Half price, available to every account, today.
The engineering cost to adopt it: two API calls and a .jsonl file.
If you are spending $10,000/month on real-time calls for document classification, nightly report generation, or bulk embedding, the path to $5,000/month is a half-day of work.
What qualifies
Batch works for any workload that can tolerate up to 24 hours of latency. The practical categories are nightly report generation, bulk classification, offline enrichment, and bulk embedding.
The disqualifying criterion is simple: the user is waiting. If a human expects a response in under a minute, it is not a batch workload.
How it works
Three steps: build a .jsonl file, upload it, poll for completion.
Step 1: Build your request file
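```python
import json

# Placeholder input; replace with your own list of document strings
documents = ["Great product, works as advertised.", "Arrived broken and late."]

# One request per document; custom_id ties each result back to its input
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [
                {"role": "system", "content": "Classify the sentiment: positive, neutral, or negative."},
                {"role": "user", "content": document},
            ],
            "max_tokens": 10,
        },
    }
    for i, document in enumerate(documents)
]

# JSONL: one JSON object per line
with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
```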
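At the 100,000-documents-per-night scale discussed below, one file may not be enough. OpenAI's Batch guide caps a single batch at 50,000 requests (and the input file at 200 MB) at the time of writing, so treat that number as an assumption and check the current limits. The sketch below splits a request list into several `.jsonl` files and rejects duplicate `custom_id` values, since those are what results are matched on. `write_batch_files` and `MAX_REQUESTS_PER_BATCH` are hypothetical names, not part of the OpenAI SDK.

```python
import json
from pathlib import Path

# Assumption: per-batch request cap from OpenAI's Batch guide at the time of
# writing; verify against current limits before relying on it.
MAX_REQUESTS_PER_BATCH = 50_000


def write_batch_files(requests, prefix="batch_requests", limit=MAX_REQUESTS_PER_BATCH):
    """Write `requests` to .jsonl files of at most `limit` lines each.

    Hypothetical helper; returns the written paths in order.
    """
    ids = [req["custom_id"] for req in requests]
    if len(ids) != len(set(ids)):
        # Results are matched back by custom_id, so duplicates are ambiguous
        raise ValueError("custom_id values must be unique")

    paths = []
    for start in range(0, len(requests), limit):
        path = Path(f"{prefix}_{start // limit:03d}.jsonl")
        with path.open("w") as f:
            for req in requests[start:start + limit]:
                f.write(json.dumps(req) + "\n")
        paths.append(path)
    return paths
```

Each returned path is then uploaded and submitted as its own batch; 100,000 requests produce two files.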
Step 2: Submit the batch
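```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the request file for batch use
with open("batch_requests.jsonl", "rb") as f:
    batch_input_file = client.files.create(file=f, purpose="batch")

# Create the batch against the same endpoint the requests target
batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}")
```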
Step 3: Retrieve results
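```python
import json
import time

# Poll until the batch reaches a terminal state; checking only for
# "completed" would loop forever on a failed or expired batch
while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

results = []
if batch.output_file_id:  # may be None, e.g. if the batch failed validation
    content = client.files.content(batch.output_file_id)
    results = [json.loads(line) for line in content.text.splitlines() if line]
```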
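Each result maps back to its request through custom_id. Results are not guaranteed to come back in input order, so join on custom_id rather than on position. Failed requests land in a separate error file (batch.error_file_id), so you can resubmit only the failures.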
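Joining results to requests and collecting the failures can be sketched as below. This assumes the record shape described in OpenAI's Batch guide (`{"custom_id": ..., "response": {"status_code": ..., "body": ...}, "error": ...}`) and chat-completion response bodies; `split_results` is a hypothetical helper. Any request without a successful answer is queued for retry, which also covers requests missing from both files, such as those left unprocessed by an expired batch.

```python
import json


def split_results(lines, requests):
    """Return ({custom_id: reply_text}, [requests to resubmit]).

    `lines` are raw JSONL lines from the output file plus, if present, the
    error file. Hypothetical helper; record shape assumed from the Batch guide.
    """
    answers = {}
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") is None and response.get("status_code") == 200:
            body = response["body"]
            answers[record["custom_id"]] = body["choices"][0]["message"]["content"]

    # Everything without a successful answer is retried, in original order
    retry = [req for req in requests if req["custom_id"] not in answers]
    return answers, retry
```

In practice `lines` would be the output file's text plus the error file's, both split into lines, and `retry` can be written out with the Step 1 loop and submitted as a new batch.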
The math
Say you run nightly document classification on 10,000 documents, averaging 500 input tokens and 10 output tokens each.
Standard pricing (gpt-4o as of May 2026):
Batch pricing (50% off):
For 100,000 documents/night, that is roughly $24,650/year in savings from adopting an asynchronous queue you effectively already have.
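The arithmetic can be sketched in a few lines. The per-token prices are assumptions ($2.50 per 1M input tokens and $10.00 per 1M output tokens, gpt-4o list prices at one point in time); substitute current rates from OpenAI's pricing page. `nightly_cost` is a hypothetical helper.

```python
# Assumed gpt-4o list prices in USD per 1M tokens; check current pricing.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00
BATCH_DISCOUNT = 0.5  # Batch API bills at 50% of standard pricing


def nightly_cost(docs, input_tokens, output_tokens, batch=False):
    """USD cost of one nightly run over `docs` documents."""
    cost = (
        docs * input_tokens * INPUT_PRICE_PER_M
        + docs * output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000
    return cost * BATCH_DISCOUNT if batch else cost


for docs in (10_000, 100_000):
    standard = nightly_cost(docs, 500, 10)
    batched = nightly_cost(docs, 500, 10, batch=True)
    yearly_savings = (standard - batched) * 365
    print(f"{docs:>7,} docs: ${standard:.2f}/night standard, "
          f"${batched:.2f}/night batch, ${yearly_savings:,.2f}/year saved")
```

Under these assumed prices, 10,000 documents cost $13.50/night standard versus $6.75/night batched, and the 100,000-document case saves about $24,640/year, in line with the figure above.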
Compound it with context compression
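Batch discount and context compression are orthogonal. A 500-token document that compresses to 150 tokens before it hits the API cuts your input cost by 70%. Combine the two and the discounts multiply: 30% of the input tokens at 50% of the price is 15% of the original input cost.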
For offline workloads, compressed batch processing costs roughly one-sixth of naive real-time inference. The engineering effort is a `.jsonl` formatter and a compression call.
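To see where "roughly one-sixth" comes from, the sketch below reuses the same assumed gpt-4o prices ($2.50 and $10.00 per 1M input and output tokens; check current pricing) and the 500-to-150-token compression from the example above, with output tokens left uncompressed. Compression is modeled only as a smaller input-token count; the compression step itself is out of scope, and `run_cost` is a hypothetical helper.

```python
# Assumed gpt-4o list prices in USD per 1M tokens; check current pricing.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def run_cost(docs, input_tokens, output_tokens, discount=1.0):
    """USD cost of one run; `discount` is the fraction of list price billed."""
    raw = docs * (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M)
    return raw / 1_000_000 * discount


docs = 10_000
naive = run_cost(docs, input_tokens=500, output_tokens=10)
# Compression shrinks only the input; the batch discount halves everything
compressed_batch = run_cost(docs, input_tokens=150, output_tokens=10, discount=0.5)
print(f"naive real-time: ${naive:.2f}, compressed + batch: ${compressed_batch:.2f}")
print(f"ratio: {compressed_batch / naive:.3f}")
```

With these assumptions the run drops from $13.50 to about $2.38, a ratio of about 0.176. Input cost alone falls to 15%; the uncompressed output tokens pull the total ratio up toward one-sixth.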
What to watch for
Cite this
Researchers, analysts, or journalists referencing this post can use either format below.
@misc{batch-api-50-percent-off-async-workloads-2026,
title = {The Batch API Playbook: 50% Off for Workloads That Can Wait},
author = {James Hollingsworth},
year = {2026},
month = {May},
url = {https://www.gotcontext.ai/blog/batch-api-50-percent-off-async-workloads},
note = {gotcontext.ai engineering blog.},
}

James Hollingsworth. (2026, May 8). The Batch API Playbook: 50% Off for Workloads That Can Wait. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/batch-api-50-percent-off-async-workloads.