Guide · April 14, 2026 · 8 min read

How to Reduce LLM Token Costs by 85%

The Token Cost Problem

Every LLM API call costs money. GPT-4, Claude, and Gemini all charge per token — and context windows are getting larger, not cheaper. A typical coding agent session can burn through 100K+ tokens per task.

The math is simple: if you can compress your context by 85% without losing meaning, you save 85% on token costs.
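The arithmetic can be made concrete with a back-of-the-envelope calculation (the session size, task volume, and per-token price below are illustrative assumptions, not any provider's actual rates):

```python
def monthly_savings(tokens_per_task: int, tasks_per_month: int,
                    price_per_1k_tokens: float, compression: float) -> float:
    """Dollars saved per month if `compression` is the fraction of tokens removed."""
    original_cost = tokens_per_task * tasks_per_month * price_per_1k_tokens / 1000
    return original_cost * compression

# 100K-token sessions, 200 tasks/month, $0.01 per 1K input tokens, 85% compression
saved = monthly_savings(100_000, 200, 0.01, 0.85)
print(f"${saved:.2f}/month saved")  # $170.00/month saved
```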

What is Semantic Compression?

Semantic compression goes beyond simple text truncation. Instead of cutting text at an arbitrary character limit, it:

  • Parses the document structure — headings, paragraphs, code blocks, lists
  • Builds a semantic graph — maps relationships between concepts
  • Ranks by importance — uses PageRank-style algorithms on the semantic graph
  • Preserves key information — keeps the skeleton that carries meaning
  • Removes redundancy — eliminates repeated concepts and filler

The result reads naturally and preserves the information an LLM needs to produce high-quality outputs.
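The ranking step above can be sketched with plain power-iteration PageRank over a toy concept graph. This is a minimal illustration of the idea, not gotcontext's implementation — the graph, damping factor, and node names are all made up:

```python
def pagerank(graph: dict[str, list[str]], damping: float = 0.85,
             iters: int = 50) -> dict[str, float]:
    """Rank nodes of an adjacency-list graph via power iteration."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for v, outs in graph.items():
            if outs:
                share = rank[v] / len(outs)
                for u in outs:
                    new[u] += damping * share
            else:  # dangling node: spread its rank evenly
                for u in nodes:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank

# Toy semantic graph: edges point from a concept to concepts it supports
graph = {
    "intro": ["auth", "compression"],
    "auth": ["compression"],
    "compression": ["intro"],
    "filler": ["compression"],
}
ranks = pagerank(graph)
keep = sorted(ranks, key=ranks.get, reverse=True)[:2]  # keep top-ranked concepts
```

Low-ranked nodes (here, `filler`) are the ones a compressor can drop with the least loss of meaning.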

Getting Started

1. Create an account

Sign up at gotcontext.ai — the free tier includes 1,000 compressions/month.

2. Generate an API key

Go to your dashboard settings and create a new API key.

3. Connect via MCP

Add to your Claude Code config:

```json
{
  "mcpServers": {
    "gotcontext": {
      "url": "https://api.gotcontext.ai/mcp",
      "headers": {
        "Authorization": "Bearer gc_live_YOUR_API_KEY"
      }
    }
  }
}
```

4. Start saving

Your AI tool now has automatic access to the compression tools. Add a note to your CLAUDE.md:

```
When context is large (>10K tokens), use gotcontext's ingest_context tool to compress before processing.
```

Real-World Results

| Document Type | Original | Compressed | Savings |
| --- | --- | --- | --- |
| Large codebase (50 files) | 48,000 tokens | 7,200 tokens | 85% |
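The savings figure follows directly from the token counts in the table:

```python
original, compressed = 48_000, 7_200
savings = (original - compressed) / original
print(f"{savings:.0%}")  # 85%
```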

When to Compress

Compression works best for:

  • Large context windows — documentation, codebases, chat histories
  • Repeated context — the same background info sent with every prompt
  • Retrieval augmented generation — compress retrieved chunks before injection

It's less useful for:

  • Very short texts (< 100 tokens)
  • Highly structured data (JSON, CSV) — these are already compact
  • Content where every word matters (legal contracts, poetry)
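These guidelines can be captured in a simple client-side gate. The 10K-token threshold comes from the CLAUDE.md note above; the 4-characters-per-token estimate and the JSON check are rough heuristics of my own, not gotcontext parameters:

```python
def should_compress(text: str, threshold_tokens: int = 10_000) -> bool:
    """Heuristic gate: compress only large, unstructured contexts.

    Tokens are estimated at ~4 characters each; swap in a real
    tokenizer for anything cost-sensitive.
    """
    if text.lstrip().startswith(("{", "[")):  # likely JSON: already compact
        return False
    return len(text) // 4 > threshold_tokens  # skip short texts too

# usage sketch (compress() stands in for a call to the compression tool):
# if should_compress(context):
#     context = compress(context)
```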

Pricing

  • Free: 1,000 compressions/month, 17 MCP tools
  • Pro ($29/mo): 50,000 compressions/month, 97 MCP tools, team access
  • Enterprise: Unlimited, custom deployment, SLA

Get started free →