Prompt Caching: Anthropic vs OpenAI vs Google — The Mechanics That Actually Determine Your Bill

Prompt caching is not the same product at every provider

All three major LLM providers now offer prompt caching: the ability to reuse a cached prefix across multiple API calls, paying reduced rates for cached tokens. But the mechanics differ in ways that materially change the cost calculation.

Four things differ: who controls cache activation, how long the cache persists, what you pay to write or store the cache, and how deep the read discount actually is.

Anthropic (Claude API)

Anthropic uses explicit, developer-controlled caching. You mark a content block with cache_control in the request, and everything up to and including that block becomes the cached prefix. Without a marker, nothing is cached, regardless of how often the content appears.

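```python
# Mark the reusable prefix with cache_control; the query after it stays uncached.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_prompt_text,  # long, stable prefix
                "cache_control": {"type": "ephemeral"},
            },
            {"type": "text", "text": user_query},  # changes on every call
        ],
    }
]
```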

Pricing (Claude Sonnet 4, May 2026):

Token type	Price per MTok
Input (standard)	$3.00
Cache write	$3.75 (25% surcharge)
Cache read	$0.30 (90% discount)
Output	$15.00


TTL: The default ephemeral cache lasts 5 minutes and is refreshed each time it is read. A 1-hour TTL is available by adding "ttl": "1h" inside cache_control (the type stays "ephemeral"), with cache writes billed at 2x the base input price instead of 1.25x. There is no longer-duration option.

Minimum cacheable size: 1,024 tokens for Claude Sonnet; 2,048 tokens for Claude Haiku.

What this means operationally: The 25% write surcharge means your first call to a cold cache costs more than an uncached call: 1.25x the input price instead of 1x. Every later read costs 0.1x, so the surcharge is recovered by the first cache read: two calls inside the TTL cost 1.35x, against 2x uncached. Caching loses money only when the cache expires before it is read at all. For applications with burst query patterns inside a 5-minute window, the economics are favorable. For single-turn APIs where the same system prompt is reused across sessions that are hours apart, the 5-minute TTL means the cache almost never survives to be read, and every call pays the surcharge.
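
The break-even arithmetic for a single TTL window can be sketched as below. This is a minimal illustration, not Anthropic's billing logic: the function name `anthropic_prefix_cost` is hypothetical, the prices are the Sonnet figures from the pricing table above, and it assumes every call after the first lands inside the TTL and hits the cache.

```python
def anthropic_prefix_cost(prefix_tokens: int, calls: int, cached: bool = True) -> float:
    """Input cost in USD for sending the same prefix `calls` times in one TTL window."""
    # USD per million tokens, taken from the pricing table in this post.
    input_price, write_price, read_price = 3.00, 3.75, 0.30
    mtok = prefix_tokens / 1_000_000
    if not cached or calls == 0:
        return calls * mtok * input_price
    # Assumption: the first call writes the cache, every later call reads it.
    return mtok * (write_price + (calls - 1) * read_price)


if __name__ == "__main__":
    for n in (1, 2, 10):
        with_cache = anthropic_prefix_cost(10_000, n)
        without = anthropic_prefix_cost(10_000, n, cached=False)
        print(f"{n} calls: cached ${with_cache:.4f} vs uncached ${without:.4f}")
```

For a 10,000-token prefix, one call costs $0.0375 cached against $0.03 uncached, while two calls cost $0.0405 against $0.06.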

OpenAI (GPT API)

OpenAI uses automatic, provider-controlled caching. There is no API to explicitly mark content for caching. For prompts of 1,024 tokens or more, OpenAI automatically caches the longest prefix it has recently seen, so repeated calls that start with identical content get the discount.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# No cache_control needed (OpenAI handles it automatically)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ],
)

# Check cache usage in the response
cached_tokens = response.usage.prompt_tokens_details.cached_tokens
```

Pricing (GPT-4o, May 2026):

Token type	Price per MTok
Input (standard)	$2.50
Cache read	$1.25 (50% discount)
Cache write	No surcharge
Output	$10.00


TTL: OpenAI does not expose a TTL setting. Its documentation describes cached prefixes as typically evicted after 5 to 10 minutes of inactivity and always within an hour of last use, but the caching policy is not developer-controllable. Cache hits are reported in the response.

What this means operationally: No write surcharge means you never pay a penalty for cache misses. The 50% read discount is smaller than Anthropic's 90%, but automatic activation means you get cache benefits without instrumentation. The tradeoff: you cannot guarantee what is cached or predict cache behavior; the most you can do is keep stable content at the start of the prompt so the prefix matches. For stateless API integrations where you want caching without adding complexity, OpenAI is simpler. For precise control over what gets cached, Anthropic gives you the lever.
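
Because activation is automatic, the only way to know what you saved is to read it back from the usage figures. The sketch below is a hedged illustration: `openai_input_cost` and `cache_hit_ratio` are hypothetical helpers, the prices are the GPT-4o figures from the pricing table above, and it assumes `cached_tokens` is counted as a subset of `prompt_tokens`.

```python
def openai_input_cost(prompt_tokens: int, cached_tokens: int) -> float:
    """Input cost in USD for one call, splitting cached and uncached tokens."""
    # USD per million tokens, taken from the pricing table in this post.
    input_price, cached_price = 2.50, 1.25
    uncached_tokens = prompt_tokens - cached_tokens
    return (uncached_tokens * input_price + cached_tokens * cached_price) / 1_000_000


def cache_hit_ratio(prompt_tokens: int, cached_tokens: int) -> float:
    """Fraction of the prompt that was served from cache."""
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


if __name__ == "__main__":
    # Example figures only; real values come from the API response.
    print(f"cost: ${openai_input_cost(10_000, 8_192):.5f}")
    print(f"hit ratio: {cache_hit_ratio(10_000, 8_192):.2%}")
```

In practice the two arguments come from `response.usage.prompt_tokens` and `response.usage.prompt_tokens_details.cached_tokens`; logging the ratio over time is a simple way to see whether prompt ordering is helping or hurting.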

Google (Gemini API)

Google uses explicit developer-controlled caching with a storage-fee model distinct from the other two.

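```python
import datetime

import google.generativeai as genai
from google.generativeai import caching

# Create the cache once; storage is billed for as long as it exists.
cached_content = caching.CachedContent.create(
    model="gemini-2.5-pro",
    contents=[system_prompt_content],
    ttl=datetime.timedelta(hours=2),
)

# Bind a model to the cached prefix, then send only the new query.
model = genai.GenerativeModel.from_cached_content(cached_content=cached_content)
response = model.generate_content([user_query])
```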

Pricing (Gemini 2.5 Pro, May 2026):

Token type	Price per MTok
Input (standard)	$1.25 (<=200K) / $2.50 (>200K)
Cache read	$0.31 / $0.63 (tiered)
Cache storage	$4.50 per MTok per hour
Output	$10.00


Default TTL: 1 hour. You can set a different TTL when you create the cache, and storage is billed for as long as the cache exists.

Minimum cacheable size: 32,768 tokens.

What this means operationally: The storage fee changes the math entirely. At $4.50/MTok/hour, caching 100K tokens for 1 hour costs $0.45 in storage alone, before any reads. Reading those 100K cached tokens once costs $0.031, against $0.125 uncached, so a single read per hour saves about $0.09 and leaves you roughly $0.36 behind. The storage cost dominates unless you are making many reads per hour against the same cached content.

The breakeven: each read saves $0.94 per MTok (from $1.25 standard to $0.31 cache read), and storage costs $4.50 per MTok per hour, so you need 4.50 / 0.94 ≈ 4.8, or about 5, reads per hour to cover the storage fee.
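
The storage-fee break-even can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: the helper names are hypothetical, the prices are the Gemini 2.5 Pro figures for prompts up to 200K tokens from the pricing table above, and the one-time cost of sending the tokens to create the cache is ignored.

```python
# USD per million tokens, taken from the pricing table in this post (<=200K tier).
STANDARD_INPUT = 1.25
CACHE_READ = 0.31
STORAGE_PER_HOUR = 4.50


def gemini_breakeven_reads_per_hour() -> float:
    """Reads per hour at which read savings equal the storage fee."""
    return STORAGE_PER_HOUR / (STANDARD_INPUT - CACHE_READ)


def gemini_hourly_cost(prefix_tokens: int, reads_per_hour: int, cached: bool = True) -> float:
    """Hourly USD cost of the prefix, with or without a cache."""
    mtok = prefix_tokens / 1_000_000
    if cached:
        return mtok * (STORAGE_PER_HOUR + reads_per_hour * CACHE_READ)
    return mtok * reads_per_hour * STANDARD_INPUT


if __name__ == "__main__":
    print(f"break-even: {gemini_breakeven_reads_per_hour():.2f} reads/hour")
    for reads in (1, 4, 5, 20):
        cached = gemini_hourly_cost(100_000, reads)
        plain = gemini_hourly_cost(100_000, reads, cached=False)
        print(f"{reads} reads/hr: cached ${cached:.3f} vs uncached ${plain:.3f}")
```

For a 100K-token prefix, 4 reads per hour cost $0.574 cached against $0.50 uncached, while 5 reads per hour cost $0.605 against $0.625, which is where caching starts to win.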

Direct comparison table

Feature	Anthropic	OpenAI	Google
Cache activation	Explicit (cache_control)	Automatic	Explicit (CachedContent)
Write surcharge	+25% (5-min TTL)	None	None (storage fee instead)
Read discount	90%	50%	~75%
TTL options	5 min / 1 hr	Not configurable	1 hr default, configurable
Storage fee	None	None	$4.50/MTok/hr
Min cache size	1,024–2,048 tokens	1,024 tokens	32,768 tokens
Reads to break even	1 before TTL	N/A (no write cost)	5+ per hour

Which provider wins on caching?

It depends entirely on your application pattern:

- High read frequency, long sessions (>1 hr), large system prompts (>32K tokens): Google can be cheapest if you make 10+ reads/hr per cached object, well past the roughly 5 reads/hr needed to amortize the storage fee.
- Medium read frequency, short sessions (<5 min), small-to-medium system prompts: Anthropic wins. The 90% discount on reads is the deepest, and the 5-minute TTL matches session-scoped caching patterns.
- Stateless integrations, no instrumentation budget, mixed query patterns: OpenAI is simplest. No write surcharge, a low 1,024-token minimum, automatic activation. You leave some savings on the table but pay nothing for misses.

Compounding with context compression

Prompt caching reduces the cost of the *repeated prefix*: the system prompt or document that appears in every call. Context compression reduces the size of that prefix before it gets cached, which means:

- The cached prefix costs less to store, since storage is billed per token (this matters most on Google)
- Cache writes and reads are cheaper because there are fewer tokens to write and read
- A smaller prefix can fall below the minimum cache size (especially Google's 32,768 tokens), so check the compressed token count before relying on the cache

Compress the context first, then cache the compressed version. The savings compound.
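
One thing worth verifying before caching compressed output is that the prefix still clears the provider's minimum cacheable size. A minimal sketch, using the minimums quoted earlier in this post; the `MIN_CACHE_TOKENS` keys and the `still_cacheable` helper are hypothetical names, and the token count is assumed to come from your own tokenizer.

```python
# Minimum cacheable prefix sizes as quoted in this post (tokens).
MIN_CACHE_TOKENS = {
    "anthropic-sonnet": 1_024,
    "anthropic-haiku": 2_048,
    "google": 32_768,
}


def still_cacheable(compressed_tokens: int) -> dict[str, bool]:
    """Report, per provider, whether a compressed prefix meets the minimum size."""
    return {
        provider: compressed_tokens >= minimum
        for provider, minimum in MIN_CACHE_TOKENS.items()
    }


if __name__ == "__main__":
    # Example: a 40,000-token document compressed to 20,000 tokens.
    for provider, ok in still_cacheable(20_000).items():
        print(f"{provider}: {'cacheable' if ok else 'below minimum'}")
```

In this example the compressed prefix still caches on Anthropic but falls under the 32,768-token minimum quoted for Google, so on Google the choice is between caching the uncompressed prefix and sending the compressed one uncached.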

gotcontext handles compression via the `ingest_context` MCP tool; you then cache the compressed output at whichever provider you use.

Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{prompt-caching-by-the-numbers-anthropic-vs-openai-vs-google-2026,
  title  = {Prompt Caching: Anthropic vs OpenAI vs Google — The Mechanics That Actually Determine Your Bill},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://www.gotcontext.ai/blog/prompt-caching-by-the-numbers-anthropic-vs-openai-vs-google},
  note   = {gotcontext.ai engineering blog.},
}

APA

James Hollingsworth. (2026, May 8). Prompt Caching: Anthropic vs OpenAI vs Google — The Mechanics That Actually Determine Your Bill. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/prompt-caching-by-the-numbers-anthropic-vs-openai-vs-google.
