
Prompt Caching: Anthropic vs OpenAI vs Google — The Mechanics That Actually Determine Your Bill

All three major providers offer prompt caching. The pricing models, TTLs, and activation mechanics are different enough that the cheapest option depends on your specific usage pattern. Here is a direct comparison.

James Hollingsworth (Contributor) · Published · 6 min · ~846 words

Prompt caching is not the same product at every provider

All three major LLM providers now offer prompt caching: the ability to reuse a cached prefix across multiple API calls, paying reduced rates for cached tokens. But the mechanics differ in ways that materially change the cost calculation.

The differences: who controls cache activation, how long the cache persists, what you pay during the write phase, and what the read discount actually is.

Anthropic (Claude API)

Anthropic uses explicit, developer-controlled caching. You mark specific content blocks with `cache_control` in the request, and the prefix up to and including each marked block becomes cacheable. A request with no marker is never cached, regardless of how often its content repeats.

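```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_prompt_text,
                # Marks the end of the cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            },
            # Not marked, so this part is billed as regular input.
            {"type": "text", "text": user_query},
        ],
    }
]
```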

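Pricing (Claude Sonnet 4, May 2026):

| Token type | Price per MTok |
| --- | --- |
| Input (standard) | $3.00 |
| Cache write | $3.75 (25% surcharge) |
| Cache read | $0.30 (90% discount) |
| Output | $15.00 |

TTL: The default ephemeral cache lasts 5 minutes, and the timer resets each time the cache is read. A 1-hour cache is requested by adding `"ttl": "1h"` to the `cache_control` object (there is no separate "persistent" type); its writes are billed at 2x the base input price instead of 1.25x. There is no longer-duration option.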

Minimum cacheable size: 1,024 tokens for Claude Sonnet; 2,048 tokens for Claude Haiku.

What this means operationally: The 25% write surcharge means your first call to a cold cache costs more than an uncached call. Caching only pays off if the prefix is requested at least twice before expiry: one write plus at least one read, since the write costs an extra $0.75/MTok and each read saves $2.70/MTok. For applications with a 5-minute session window and burst query patterns, the economics are favorable. For single-turn APIs where the same system prompt is reused across sessions but individual sessions are hours apart, the 5-minute TTL means the cache almost never survives to be read.
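The surcharge arithmetic can be checked with a short sketch. It uses the Sonnet prices from the table above; the function name and the assumption that every follow-up call lands inside the TTL are illustrative, not part of Anthropic's API.

```python
# Sketch: cost of sending the same prefix over several calls, with and
# without Anthropic-style caching. Prices are $/MTok from the table above.
INPUT_PER_MTOK = 3.00
CACHE_WRITE_PER_MTOK = 3.75
CACHE_READ_PER_MTOK = 0.30


def anthropic_prefix_cost(prefix_tokens: int, calls: int, cached: bool) -> float:
    """Dollar cost of the shared prefix over `calls` requests.

    Assumes (hypothetically) that every call after the first arrives
    before the cache expires, so it is billed as a cache read.
    """
    mtok = prefix_tokens / 1_000_000
    if not cached or calls == 0:
        return calls * mtok * INPUT_PER_MTOK
    return mtok * (CACHE_WRITE_PER_MTOK + (calls - 1) * CACHE_READ_PER_MTOK)


for calls in (1, 2, 10):
    plain = anthropic_prefix_cost(10_000, calls, cached=False)
    cached = anthropic_prefix_cost(10_000, calls, cached=True)
    print(f"{calls:>2} calls: uncached ${plain:.4f}, cached ${cached:.4f}")
```

Under these assumptions a lone call is more expensive with caching, and two calls inside the TTL are already cheaper.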

OpenAI (GPT API)

OpenAI uses automatic, provider-controlled caching. There is no API to explicitly mark content for caching. OpenAI caches the longest common prefix of your prompts automatically when it detects repeated prefixes.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# No cache_control needed (OpenAI handles it automatically)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ],
)

# Check cache usage in response
cached_tokens = response.usage.prompt_tokens_details.cached_tokens
```

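Pricing (GPT-4o, May 2026):

| Token type | Price per MTok |
| --- | --- |
| Input (standard) | $2.50 |
| Cache read | $1.25 (50% discount) |
| Cache write | No surcharge |
| Output | $10.00 |

TTL: OpenAI does not guarantee a specific TTL. Its documentation describes cached prefixes as typically surviving 5 to 10 minutes of inactivity, and caching applies only to prompts of 1,024 tokens or more. Cache hits are reported in the response but the caching policy is not developer-controllable.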

What this means operationally: No write surcharge means you never pay a penalty for cache misses. The 50% read discount is smaller than Anthropic's 90%, but the automatic activation means you get cache benefits without instrumentation. The tradeoff: you cannot guarantee what is cached or predict cache behavior. For stateless API integrations where you want caching without adding complexity, OpenAI is simpler. For precise control over what gets cached, Anthropic gives you the lever.
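Because activation is automatic, the `cached_tokens` counter in each response is the only feedback you get. A minimal sketch of turning that counter into an effective input cost, using the GPT-4o prices above; the helper name and the sample usage numbers are hypothetical.

```python
# Sketch: effective input cost of one request from its usage counters.
# Prices are $/MTok from the GPT-4o table above.
INPUT_PER_MTOK = 2.50
CACHED_INPUT_PER_MTOK = 1.25


def openai_input_cost(prompt_tokens: int, cached_tokens: int) -> float:
    """Dollar input cost; `cached_tokens` is a subset of `prompt_tokens`."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    total = uncached * INPUT_PER_MTOK + cached_tokens * CACHED_INPUT_PER_MTOK
    return total / 1_000_000


# Hypothetical usage: a 6,000-token prompt of which 5,120 tokens hit the cache.
prompt_tokens, cached_tokens = 6_000, 5_120
cost = openai_input_cost(prompt_tokens, cached_tokens)
print(f"hit rate {cached_tokens / prompt_tokens:.0%}, input cost ${cost:.6f}")
```

Logging this per request is one way to see whether the automatic cache is actually being hit for a given prompt layout.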

Google (Gemini API)

Google uses explicit developer-controlled caching with a storage-fee model distinct from the other two.

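```python
import datetime

import google.generativeai as genai
from google.generativeai import caching

# Legacy google-generativeai SDK; assumes the API key is already configured
# (for example via genai.configure(api_key=...)).
cached_content = caching.CachedContent.create(
    model="gemini-2.5-pro",
    contents=[system_prompt_content],
    ttl=datetime.timedelta(hours=2),
)

# Bind a model to the cache, then send only the uncached part of the prompt.
model = genai.GenerativeModel.from_cached_content(cached_content=cached_content)
response = model.generate_content(user_query)
```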

Pricing (Gemini 2.5 Pro, May 2026):

| Token type | Price per MTok |
| --- | --- |
| Input (standard) | $1.25 (<=200K) / $2.50 (>200K) |
| Cache read | $0.31 / $0.63 (tiered) |
| Cache storage | $4.50 per MTok per hour |
| Output | $10.00 |

Minimum TTL: 1 hour. You cannot cache for less than 1 hour.

Minimum cacheable size: 32,768 tokens.

What this means operationally: The storage fee changes the math entirely. At $4.50/MTok/hour, caching 100K tokens for 1 hour costs $0.45 in storage alone, before any reads. If you read those 100K cached tokens once per hour, you pay $0.031 in cache read fees. The storage cost dominates unless you are making many reads per hour against the same cached content.

The breakeven: storage costs $4.50/MTok/hr and each read saves $0.94 per MTok (from $1.25 standard to $0.31 cache read). Since 4.50 / 0.94 ≈ 4.8, you need about 5 reads per hour to break even on storage costs, regardless of prefix size.
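The same breakeven can be reproduced in a few lines. This is a sketch under the article's prices for prompts of 200K tokens or less; it ignores any cost of creating the cache, and the function names are illustrative.

```python
import math

# $/MTok figures from the Gemini 2.5 Pro table above (<=200K tier).
INPUT_PER_MTOK = 1.25
CACHE_READ_PER_MTOK = 0.31
STORAGE_PER_MTOK_HOUR = 4.50


def gemini_breakeven_reads_per_hour() -> int:
    """Smallest whole number of reads per hour that covers the storage fee."""
    saving_per_read = INPUT_PER_MTOK - CACHE_READ_PER_MTOK
    return math.ceil(STORAGE_PER_MTOK_HOUR / saving_per_read)


def gemini_hourly_net_saving(prefix_tokens: int, reads_per_hour: int) -> float:
    """Dollars saved per hour versus not caching (negative means a loss)."""
    mtok = prefix_tokens / 1_000_000
    saved = reads_per_hour * mtok * (INPUT_PER_MTOK - CACHE_READ_PER_MTOK)
    storage = mtok * STORAGE_PER_MTOK_HOUR
    return saved - storage


print("breakeven reads/hr:", gemini_breakeven_reads_per_hour())
for reads in (1, 4, 5, 20):
    net = gemini_hourly_net_saving(100_000, reads)
    print(f"{reads:>2} reads/hr on 100K tokens: net ${net:+.3f}")
```

The prefix size scales both the saving and the storage fee, which is why the breakeven read count does not depend on it.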

Direct comparison table

| Feature | Anthropic | OpenAI | Google |
| --- | --- | --- | --- |
| Cache activation | Explicit (`cache_control`) | Automatic | Explicit (`CachedContent`) |
| Write surcharge | +25% (5-min cache) | None | None (storage fee instead) |
| Read discount | 90% | 50% | ~75% |
| TTL options | 5 min / 1 hr | Automatic (not guaranteed) | 1 hr minimum |
| Storage fee | None | None | $4.50/MTok/hr |
| Min cache size | 1,024–2,048 tokens | 1,024 tokens (automatic) | 32,768 tokens |
| Reads to break even | 1+ before TTL (2 calls total) | N/A (no write cost) | 5+ per hour |

Which provider wins on caching?

It depends entirely on your application pattern:

High read frequency, long sessions (>1 hr), large system prompts (>32K tokens): Google can be cheapest if you make 10+ reads/hr per cached object. The storage fee is amortized.

Medium read frequency, short sessions (<5 min), small-to-medium system prompts: Anthropic wins. The 90% discount on reads is the deepest, and the 5-minute TTL matches session-scoped caching patterns.

Stateless integrations, no instrumentation budget, mixed query patterns: OpenAI is simplest. No write surcharge, a low 1,024-token minimum, automatic activation. You leave some savings on the table but pay nothing for misses.
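To see how the ranking shifts with call spacing, here is a rough sketch that prices only the shared prefix for one hour, using the figures quoted in this article. The model is deliberately crude and every modeling choice is an assumption: calls are evenly spaced, the Anthropic and OpenAI caches stay warm only when the gap between calls is at most `warm_window_min`, and Gemini cache creation is treated as free. The three models also differ in base price and capability, so this is not a like-for-like benchmark.

```python
# Sketch: hourly cost of the shared prefix only, per provider.
GEMINI_MIN_CACHE_TOKENS = 32_768


def hourly_prefix_cost(
    prefix_tokens: int, calls_per_hour: int, warm_window_min: float = 5.0
) -> dict[str, float]:
    """Return {provider: dollars per hour} for evenly spaced calls."""
    mtok = prefix_tokens / 1_000_000
    n = calls_per_hour
    # Assumed: the cache survives between calls only if the gap is short enough.
    warm = n > 0 and 60 / n <= warm_window_min
    misses = min(n, 1) if warm else n  # calls billed at the write/uncached rate
    hits = n - misses

    anthropic = mtok * (misses * 3.75 + hits * 0.30)
    openai = mtok * (misses * 2.50 + hits * 1.25)
    if prefix_tokens >= GEMINI_MIN_CACHE_TOKENS:
        google = mtok * (4.50 + n * 0.31)  # one hour of storage plus reads
    else:
        google = mtok * n * 1.25  # too small to cache: standard input rate
    return {"anthropic": anthropic, "openai": openai, "google": google}


for calls in (10, 20):
    costs = hourly_prefix_cost(50_000, calls)
    cheapest = min(costs, key=costs.get)
    print(calls, {k: round(v, 4) for k, v in costs.items()}, "->", cheapest)
```

Under these assumptions the cheapest option changes between 10 and 20 evenly spaced calls per hour, purely because the 5-minute window goes from cold to warm.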

Compounding with context compression

Prompt caching reduces the cost of the *repeated prefix*: the system prompt or document that appears in every call. Context compression reduces the size of that prefix before it gets cached, which means:

  • The cached prefix costs less in storage (especially on Google, where storage is billed per token per hour)
  • Cache writes and reads are cheaper because there are fewer tokens to write and read
  • One caveat: the compressed prefix must still clear the provider's minimum cacheable size (especially Google at 32,768 tokens), or it will not be cached at all

Compress the context first, then cache the compressed version. The savings compound.
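One thing worth checking when combining the two is whether the compressed prefix still meets the provider's minimum cacheable size. A small sketch using the Google figures from this article; `keep_ratio` is a hypothetical compression outcome, not a measured gotcontext number.

```python
# Sketch: size, cacheability, and Gemini storage cost after compression.
GEMINI_MIN_CACHE_TOKENS = 32_768
STORAGE_PER_MTOK_HOUR = 4.50


def compressed_cache_plan(prefix_tokens: int, keep_ratio: float) -> dict:
    """Describe the prefix after compression.

    `keep_ratio` is the assumed fraction of tokens that survive compression
    (0.4 means the prefix shrinks to 40% of its original size).
    """
    compressed = round(prefix_tokens * keep_ratio)
    cacheable = compressed >= GEMINI_MIN_CACHE_TOKENS
    storage = compressed / 1_000_000 * STORAGE_PER_MTOK_HOUR if cacheable else 0.0
    return {"tokens": compressed, "cacheable": cacheable, "storage_per_hour": storage}


print(compressed_cache_plan(200_000, 0.4))  # large prefix stays cacheable
print(compressed_cache_plan(60_000, 0.4))   # small prefix drops below the minimum
```

A large prefix keeps its cache eligibility and pays proportionally less storage, while a prefix that compresses below the minimum can no longer be cached on that provider.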

gotcontext handles compression via the `ingest_context` MCP tool; you then cache the compressed output at whichever provider you use.


Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

```bibtex
@misc{prompt-caching-by-the-numbers-anthropic-vs-openai-vs-google-2026,
  title  = {Prompt Caching: Anthropic vs OpenAI vs Google — The Mechanics That Actually Determine Your Bill},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://www.gotcontext.ai/blog/prompt-caching-by-the-numbers-anthropic-vs-openai-vs-google},
  note   = {gotcontext.ai engineering blog.},
}
```

APA

```text
James Hollingsworth. (2026, May 8). Prompt Caching: Anthropic vs OpenAI vs Google — The Mechanics That Actually Determine Your Bill. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/prompt-caching-by-the-numbers-anthropic-vs-openai-vs-google.
```
