Measured savings across 11 LLMs — Claude Opus 4.7 to Gemini Flash. → See per-model data
Get Started →

Measured 2026-04-23 · 11 production models

Up to 38% off your flagship LLM bill.

Measured live on Opus 4.7, GPT-5.4, and Gemini 3.1 Pro.

We sent the same prompt to each vendor's flagship, compressed and uncompressed, and recorded the input_tokens they actually billed. Opus 4.7 billed 38.3% less · GPT-5.4 billed 35.3% less · Gemini 3.1 Pro billed 34.6% less. Same prompt, live API calls, today.

Get free API key · Integration docs

1,000 compressions/mo free. No card required.

Live flagship billing · 2026-04-23

Claude Opus 4.7 · −38.3%

992 → 612 input_tokens

GPT-5.4 · −35.3%

515 → 333 prompt_tokens

Gemini 3.1 Pro · −34.6%

566 → 370 promptTokenCount

See the full methodology →
How it works

Same answer. Shorter prompt. Smaller bill.

Before your prompt reaches Claude, GPT, or Gemini, gotcontext spots repeated ideas, filler phrases, and redundant context — then rewrites the prompt to say the same thing with fewer words. The AI gets a shorter prompt, gives the same answer, and your input bill drops ~35%.

§01 / Input

You send a prompt

Same prompt you'd send today. No code changes. gotcontext sits between your app and the LLM — one line to add us, one line to remove us.
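
A minimal sketch of what that one line can look like, assuming an OpenAI-compatible proxy integration in Python. The base_url below is illustrative (the only endpoint documented on this page is the MCP gateway in the methodology section); check the integration docs for the real value.

from openai import OpenAI

# Hypothetical proxy-style integration: point your existing SDK at a
# gotcontext endpoint instead of the provider. The base_url is an assumed
# illustration; the gc_ key prefix matches the keys described on this page.
client = OpenAI(
    base_url="https://api.gotcontext.ai/v1",  # the one line you add
    api_key="gc_your_key",
)

resp = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "...the same prompt you send today..."}],
)

# Usage comes back as normal, now counted against the compressed prompt.
print(resp.usage.prompt_tokens)

Removing us is the reverse edit: restore the provider's base URL and key.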

§02 / Compress

We compress it

We keep what matters — names, numbers, constraints, instructions — and drop what doesn't: repeated boilerplate, filler words, restated context. Takes about 80ms.

§03 / Output

Your LLM answers normally

The AI reads a shorter version of your prompt and gives the same answer. You pay for fewer input tokens. Across flagships that's 35-38% off your bill.

Before · what you'd send · 992 tokens

"The FastAPI app initializes Sentry. The FastAPI app wires Clerk auth. The FastAPI app exposes an MCP gateway at /mcp. Circuit breakers guard Redis, Postgres, and Polar. When a breaker opens, DegradationHeader adds X-Degraded. The MCP gateway is at /mcp and validates gc_ API keys..."

After · what gotcontext sends · 612 tokens (−38%)

"§1=The FastAPI app initializes Sentry, wires Clerk auth, exposes MCP gateway at /mcp. Circuit breakers (Redis/Postgres/Polar) trigger X-Degraded header on open. gc_ API keys validate via verify_api_key..."

§04 / Fidelity

"Does the AI give worse answers?"

No. Every compression is scored for semantic fidelity against the original — if a compression would meaningfully change the answer, we ship the original through untouched. You can dial the fidelity-vs-savings balance per request (fidelity: "minimal" | "balanced" | "full"). Default is balanced, which is what produced the 38% Opus 4.7 number above.

Lossless mode: 0% quality loss
Balanced: 96%+ fidelity, ~35% savings
Aggressive: for high-volume routing / log summaries
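
For the per-request dial, a minimal request sketch, assuming a REST-style compress endpoint: the /v1/compress path and the text/compressed field names are illustrative, while the fidelity values are the documented ones.

import requests

# Hypothetical per-request fidelity dial. The endpoint path and payload
# field names are assumptions for illustration; the fidelity values are
# the ones documented above.
resp = requests.post(
    "https://api.gotcontext.ai/v1/compress",  # assumed endpoint
    headers={"Authorization": "Bearer gc_your_key"},
    json={"text": "...your prompt...", "fidelity": "balanced"},  # or "minimal" / "full"
    timeout=10,
)
print(resp.json().get("compressed"))  # assumed response field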
Real provider-billed savings

What each flagship actually charged us

We sent the same 279-word prompt to each vendor's flagship (Opus 4.7, GPT-5.4, and Gemini 3.1 Pro), once uncompressed and once compressed through gotcontext, then read back the billed input-token field each provider reports (input_tokens, prompt_tokens, promptTokenCount). These are the numbers that hit your invoice.

Provider | Model tested | Uncompressed | Compressed | Billed savings
Anthropic | claude-opus-4-7 | 992 tokens | 612 tokens | −38.3%
OpenAI | gpt-5.4 | 515 tokens | 333 tokens | −35.3%
Google | gemini-3.1-pro-preview | 566 tokens | 370 tokens | −34.6%

Three different tokenizers, three different raw counts, same compressed input. Opus 4.7 surfaces the most savings because its tokenizer is the most granular: it picks up small wins other vendors round off. Our semantic tokenizer reported 39.8% on this corpus. The reproducible benchmark script is open-source. For a head-to-head against the most active OSS competitor, see the gotcontext vs Headroom benchmark (76% vs 41% savings on real OpenAI bills, measured 2026-04-25).

Projected dollar savings across 11 models

Dollar savings per compression — all 11 models

We project the dollar value of one compression per model by applying each provider family's measured billed savings (Anthropic 38.3%, OpenAI 35.3%, Google 34.6%) to a 691-token reference prompt, the mean of the three uncompressed counts above. Bars are colored by provider.

Claude Opus 4.7 · $0.003970 / call · $3,970.00 / month @ 1M
Claude Opus 4.6 · $0.003970 / call · $3,970.00 / month @ 1M
GPT-5.5 · $0.001220 / call · $1,220.00 / month @ 1M
Claude Sonnet 4.6 · $0.000794 / call · $794.00 / month @ 1M
GPT-5.4 · $0.000610 / call · $610.00 / month @ 1M
GPT-4.1 · $0.000488 / call · $488.00 / month @ 1M
Gemini 3.1 Pro · $0.000299 / call · $299.00 / month @ 1M
Claude Haiku 4 · $0.000212 / call · $212.00 / month @ 1M
Gemini Auto · $0.000143 / call · $143.00 / month @ 1M
GPT-4.1 mini · $0.000098 / call · $98.00 / month @ 1M
Gemini 3.1 Flash · $0.000072 / call · $72.00 / month @ 1M
GPT-5.4 mini · $0.000061 / call · $61.00 / month @ 1M
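
These projections are easy to check by hand. The snippet below reproduces the Claude Opus 4.7 row using only numbers quoted on this page; no gotcontext API involved.

# Reproduce the Claude Opus 4.7 projection from this page's own numbers.
reference_prompt = 691          # tokens: mean of 992, 515, 566
savings = 0.383                 # Anthropic's measured billed savings
input_rate = 15.00 / 1_000_000  # Opus 4.7 input price, dollars per token

saved_per_call = reference_prompt * savings * input_rate
print(f"${saved_per_call:.6f} / call")                     # $0.003970 / call
print(f"${saved_per_call * 1_000_000:,.2f} / month @ 1M")  # ~$3,969.80, shown rounded as $3,970.00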
Anthropic · Claude family

Claude — Opus, Sonnet, Haiku

Claude uses a proprietary tokenizer, with 200K or 1M context depending on tier. Compression savings are largest here because Anthropic's premium tier carries the most expensive per-token input rate of any provider.

Claude Opus 4.7

Flagship · Anthropic

Flagship reasoning. Newest Anthropic model, 1M context.

Input: $15.00/1M · Output: $75.00/1M · Context: 1M
Saved / call: $0.003970 · 1M calls/mo: $3,970.00 · 10M calls/mo: $39,700.00

Claude Opus 4.6

Anthropic

Premium tier. Pinned predecessor to 4.7.

Input: $15.00/1M · Output: $75.00/1M · Context: 200K
Saved / call: $0.003970 · 1M calls/mo: $3,970.00 · 10M calls/mo: $39,700.00

Claude Sonnet 4.6

Anthropic

Balanced cost and quality. The workhorse tier.

Input: $3.00/1M · Output: $15.00/1M · Context: 200K
Saved / call: $0.000794 · 1M calls/mo: $794.00 · 10M calls/mo: $7,940.00

Claude Haiku 4

Anthropic

Low-cost routing and summary tasks.

Input: $0.80/1M · Output: $4.00/1M · Context: 200K
Saved / call: $0.000212 · 1M calls/mo: $212.00 · 10M calls/mo: $2,120.00
OpenAI · GPT family

GPT — 5.5, 5.4, 5.4 mini, 4.1, 4.1 mini

OpenAI input prices vary 20× between the mini and flagship tiers ($0.25 vs $5.00 per 1M tokens). GPT-5.4 has a 400K context — useful for very long prompts, and compression amplifies that advantage by letting you fit more source material.

GPT-5.5

Flagship · Projected · OpenAI

Flagship OpenAI model (2026-04-23). Input and output rates are double GPT-5.4's, so per-call compression ROI is the highest in the family.

Input: $5.00/1M · Output: $30.00/1M · Context: 1M
Saved / call: $0.001220 · 1M calls/mo: $1,220.00 · 10M calls/mo: $12,200.00

GPT-5.4

OpenAI

Standard OpenAI tier — the model we measured directly.

Input: $2.50/1M · Output: $15.00/1M · Context: 400K
Saved / call: $0.000610 · 1M calls/mo: $610.00 · 10M calls/mo: $6,100.00

GPT-5.4 mini

OpenAI

Low-cost GPT-5 tier with the same 400K window.

Input: $0.25/1M · Output: $2.00/1M · Context: 400K
Saved / call: $0.000061 · 1M calls/mo: $61.00 · 10M calls/mo: $610.00

GPT-4.1

OpenAI

Mid-cost OpenAI profile suited to compact prompts.

Input: $2.00/1M · Output: $8.00/1M · Context: 128K
Saved / call: $0.000488 · 1M calls/mo: $488.00 · 10M calls/mo: $4,880.00

GPT-4.1 mini

OpenAI

Bulk preprocessing and light classification.

Input: $0.40/1M · Output: $1.60/1M · Context: 128K
Saved / call: $0.000098 · 1M calls/mo: $98.00 · 10M calls/mo: $980.00
Google · Gemini family

Gemini — 3.1 Pro, Auto, 3.1 Flash

Gemini ships the cheapest mainstream tier (Flash) and a 1M context window across all three tiers. Auto routes between Pro and Flash per turn for a middle-tier effective rate.

Gemini 3.1 Pro

Flagship · Google

Gemini large-context profile.

Input: $1.25/1M · Output: $10.00/1M · Context: 1M
Saved / call: $0.000299 · 1M calls/mo: $299.00 · 10M calls/mo: $2,990.00

Gemini Auto

Google

Auto-routes between Pro and Flash per turn.

Input: $0.60/1M · Output: $5.00/1M · Context: 1M
Saved / call: $0.000143 · 1M calls/mo: $143.00 · 10M calls/mo: $1,430.00

Gemini 3.1 Flash

Google

Low-cost Gemini tier. Ideal for bulk compression.

Input: $0.30/1M · Output: $2.50/1M · Context: 1M
Saved / call: $0.000072 · 1M calls/mo: $72.00 · 10M calls/mo: $720.00
Methodology

How we measured

We drove compression through compress_meta_tokens on our production MCP gateway at https://api.gotcontext.ai/mcp. Each call sets _meta.model to the target so the gateway's per-model attribution logs correctly.

Corpus: 279-word technical prose with realistic repetition (FastAPI + MCP documentation). Same prompt sent to all three providers, compressed and uncompressed.

Real-provider calls: Direct API calls to api.anthropic.com/v1/messages, api.openai.com/v1/chat/completions, and generativelanguage.googleapis.com. We read the billed input-token field each response returns (usage.input_tokens, usage.prompt_tokens, usageMetadata.promptTokenCount) — the same number the provider bills. A sketch of the pattern follows this list.

Semantic-layer measurement: gotcontext's own tokenizer compresses the 279 words to 168 tokens (39.8% reduction). We quote this separately because it's what the compressor achieves at the structural level, independent of provider billing.

Pricing source: Live /v1/models endpoint on our API. Synced within 24 hours of provider rate changes.
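
For concreteness, the read-back pattern looks like the sketch below. The usage field names are the ones each provider's API returns; the model IDs and prompt are the ones this page quotes, and the full script in the next section wraps the same calls.

import os
import requests

prompt = "...the 279-word benchmark prompt..."

# Anthropic bills usage.input_tokens on /v1/messages.
r = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
    },
    json={"model": "claude-opus-4-7", "max_tokens": 64,
          "messages": [{"role": "user", "content": prompt}]},
)
print("anthropic:", r.json()["usage"]["input_tokens"])

# OpenAI bills usage.prompt_tokens on /v1/chat/completions.
r = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"model": "gpt-5.4",
          "messages": [{"role": "user", "content": prompt}]},
)
print("openai:", r.json()["usage"]["prompt_tokens"])

# Google bills usageMetadata.promptTokenCount on generateContent.
r = requests.post(
    "https://generativelanguage.googleapis.com/v1beta/models/"
    "gemini-3.1-pro-preview:generateContent",
    params={"key": os.environ["GEMINI_API_KEY"]},
    json={"contents": [{"parts": [{"text": prompt}]}]},
)
print("gemini:", r.json()["usageMetadata"]["promptTokenCount"])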

Honest disclaimer: 39.8% is the ratio for this specific corpus. Short documents (<100 tokens) may expand due to skeleton overhead. Medium-to-large documents (500+ tokens) typically compress 5–10×. Highly repetitive content (logs, API responses, boilerplate) can hit 85%+. Run your own benchmark on a representative sample before scaling.

Reproduce it

Run the benchmark yourself

Two public scripts. Both are open-source in the main repo — no sign-up needed to reproduce our numbers.

1. Real-provider billing (the headline data)

Hits Claude, GPT, and Gemini with the same prompt, compressed and uncompressed. Reads each provider's billed input-token field from the response. Requires your own Anthropic / OpenAI / Google keys.

ANTHROPIC_API_KEY=... OPENAI_API_KEY=... GEMINI_API_KEY=... \
python benchmarks/real_llm_cross_provider_smoke.py
benchmarks/real_llm_cross_provider_smoke.py →

2. Per-model attribution (gotcontext side)

Drives compress_meta_tokens once per registered model with _meta.model set. Cross-checks against /v1/usage/by-model.

GC_API_KEY=gc_your_key python benchmarks/per_model_savings_smoke.py
benchmarks/per_model_savings_smoke.py →
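
Under the hood, per-model attribution is a standard MCP tools/call request with _meta.model set on the params. A minimal sketch: the "text" argument name and the bearer-token header are illustrative assumptions, while the tool name, gateway URL, and _meta.model mechanism come from the methodology above.

import requests

# One compress_meta_tokens call, tagged for per-model attribution via
# params._meta.model. The "text" argument name and Authorization header
# are assumptions; the tool name and URL are from this page.
payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "compress_meta_tokens",
        "arguments": {"text": "...prompt to compress..."},
        "_meta": {"model": "claude-opus-4-7"},
    },
}
r = requests.post(
    "https://api.gotcontext.ai/mcp",
    headers={"Authorization": "Bearer gc_your_key"},
    json=payload,
)
print(r.json())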

Ready to cut your LLM bill?

The per-call savings gap between Opus 4.7 and GPT-5.4 mini is 65×. Pick the right model, add compression, and stop paying for tokens you never needed.