How to Measure Context Waste Before It Becomes a Cost Problem

Most teams know their LLM bill is higher than it should be. Few know which part of their context is causing the problem. Measuring context waste before optimizing is not optional -- it is the difference between a targeted fix and an expensive guess.

Two recent studies quantify the scale of the problem. Research on MCP tool descriptions (arXiv:2602.14878) audited 856 tool schemas across 103 MCP servers and found that 97.1% contain at least one quality defect, which the authors call a "smell": descriptions that fail to state what the tool does, ambiguous parameter explanations, or overlapping functionality. A separate Chroma study on context rot, covering 18 frontier LLMs, found that stuffing irrelevant documents into context actively degrades model quality. It does not just waste money; it makes answers worse.

These are not edge cases. They describe the default behavior of most production LLM applications.

The Four Sources of Context Waste

Context waste concentrates in four places. You almost certainly have all four.

Tool schema bloat. If you use function calling or MCP tools, your tool descriptions consume tokens on every request. The arXiv:2602.14878 study found 56% of tools failed to clearly state their purpose in the description, which means a substantial fraction of your tool manifest is paying tokens for content that doesn't help the model route correctly. A single tool with detailed parameter descriptions, type annotations, examples, and edge case notes can consume 800-1,200 tokens. Multiply by 15 tools and you have 12,000 to 18,000 tokens of scaffolding before the user's message.

Stale conversation history. Most chat applications append the full conversation history to every request. This is the simplest implementation and the most expensive one. A 20-turn conversation accumulates quickly, and most of the early turns are irrelevant to the current question. Turns 1-5 of a debugging session rarely help with the question being asked in turn 20.

Redundant RAG chunks. Retrieval-augmented generation systems fetch the top-K chunks by similarity and inject all of them. The top chunk is usually highly relevant. Chunks 4 through 8 are often near-duplicates of each other or tangentially related. They add tokens and, per the Chroma study, they add noise that degrades answer quality on the chunks that actually matter.

Structural document overhead. PDFs, HTML pages, and formatted documents contain navigation elements, headers, footers, boilerplate legal text, and table-of-contents sections that are irrelevant to most queries. Naive extraction dumps all of this into context. It inflates token count without adding information.

How to Measure Each Type

Measurement comes before optimization. Here is a practical audit for each category.

Tool schema audit: Log the token count of your tool description block for 1,000 representative requests. Compute it as a percentage of total context. If it exceeds 20%, start compressing descriptions. The test is simple: remove every word from a tool description that is not required to correctly invoke the tool, then verify the model still invokes it correctly. Most descriptions can lose 50-70% of their tokens this way.
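As a rough sketch of the tool schema audit, the snippet below computes the tool block's share of input tokens per request and compares the average against the 20% threshold. The `approx_tokens` helper (about four characters per token) and the `tools` / `context` keys on each logged request are assumptions made for illustration; swap in your model's real tokenizer and your own log schema.

```python
import json
from statistics import mean


def approx_tokens(text: str) -> int:
    """Crude stand-in tokenizer (~4 characters per token).

    This is an assumption for illustration only; use your model's
    real tokenizer when you need accurate numbers.
    """
    return max(1, len(text) // 4)


def tool_schema_share(tool_schemas: list[dict], other_context: str) -> float:
    """Fraction of a request's input tokens spent on the tool description block."""
    tool_tokens = approx_tokens(json.dumps(tool_schemas))
    total_tokens = tool_tokens + approx_tokens(other_context)
    return tool_tokens / total_tokens


def audit_tool_share(requests: list[dict], threshold: float = 0.20) -> dict:
    """Average the tool share over logged requests and flag it against a threshold.

    Each request is a hypothetical log record with a "tools" list and a
    "context" string holding everything else sent to the model.
    """
    shares = [tool_schema_share(r["tools"], r["context"]) for r in requests]
    avg = mean(shares)
    return {"mean_share": avg, "over_threshold": avg > threshold}
```

Running `audit_tool_share` over a sample of around 1,000 logged requests gives the single number the audit asks for, and rerunning it after trimming descriptions shows how much was saved.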

Conversation history audit: For each turn in your logs, compute what percentage of the injected history comes from turns that were referenced in the model's response. You can approximate this by checking whether the response content overlaps with earlier turns. A healthy reference rate is 60% or higher. If you are below 40%, you are injecting stale history.
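One way to approximate the reference rate is simple word overlap. The sketch below treats a history turn as "referenced" when it shares at least two longer words with the response, then weights each turn by its word count. The function names, the five-character word filter, and the two-word threshold are all illustrative assumptions, not a validated metric.

```python
import re


def content_words(text: str, min_len: int = 5) -> set[str]:
    """Lowercased words of at least min_len characters (a rough stopword filter)."""
    return {w for w in re.findall(r"[a-z0-9_]+", text.lower()) if len(w) >= min_len}


def is_referenced(turn: str, response: str, min_shared: int = 2) -> bool:
    """Treat a turn as referenced if it shares enough content words with the response."""
    return len(content_words(turn) & content_words(response)) >= min_shared


def reference_rate(history: list[str], response: str) -> float:
    """Share of injected history (by word count) that comes from referenced turns."""
    sizes = [len(turn.split()) for turn in history]
    total = sum(sizes)
    if total == 0:
        return 0.0
    used = sum(size for turn, size in zip(history, sizes) if is_referenced(turn, response))
    return used / total
```

Word overlap will miss paraphrased references and over-count shared jargon, so treat the result as a trend indicator rather than a precise measurement.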

RAG chunk audit: For each retrieval call, compute pairwise cosine similarity between the injected chunks. If your average top-5 chunk similarity is above 0.85, you are retrieving near-duplicates. Filter them before injection. Also measure whether the model's response cites chunk 1 more than chunk 5 -- if so, the lower-ranked chunks are probably noise.
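A minimal sketch of the chunk audit, assuming you already have one embedding vector per retrieved chunk, in rank order. It computes the mean pairwise cosine similarity and greedily drops any chunk that is too similar to a higher-ranked chunk already kept. Reusing 0.85 as the per-pair drop threshold is an assumption made here; the figure above refers to the average.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(embeddings: list[list[float]]) -> float:
    """Average cosine similarity over all unordered pairs of chunks."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def drop_near_duplicates(embeddings: list[list[float]], threshold: float = 0.85) -> list[int]:
    """Return indices of chunks to keep, preserving rank order.

    A chunk is dropped when its similarity to any already-kept,
    higher-ranked chunk reaches the threshold.
    """
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

Because the filter walks chunks in rank order, the highest-ranked member of each near-duplicate cluster is the one that survives.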

Document extraction audit: Compare raw document token counts to content-only token counts after stripping navigation, boilerplate, and formatting artifacts. For a well-extracted document, the content-only count is typically 60-80% of the raw count. If your extractor's output is still at 95% of raw, it is doing almost no cleanup. If it is down at 40%, it is cleaning aggressively, so spot-check that it is not discarding real content.
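The extraction ratio can be tracked with a few lines of code. The sketch below assumes you have both the raw extracted text and the cleaned, content-only text for a document; `approx_tokens` is again a crude stand-in tokenizer, and the band labels are hypothetical names for positions relative to the 60-80% range.

```python
def approx_tokens(text: str) -> int:
    """Crude stand-in tokenizer (~4 characters per token); an assumption, not a real tokenizer."""
    return max(1, len(text) // 4)


def extraction_ratio(raw_text: str, cleaned_text: str) -> float:
    """Content-only token count as a fraction of the raw extraction's token count."""
    return approx_tokens(cleaned_text) / approx_tokens(raw_text)


def classify_ratio(ratio: float, low: float = 0.60, high: float = 0.80) -> str:
    """Place a ratio relative to the 60-80% band used in the audit."""
    if ratio > high:
        return "above_band"  # little was stripped from the raw extraction
    if ratio < low:
        return "below_band"  # a large share of the raw extraction was stripped
    return "in_band"
```

Logging the ratio per document type (PDF, HTML, and so on) makes it easier to see which extractor needs attention.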

What Context Rot Actually Does

The Chroma study's finding on context rot is worth taking seriously beyond the cost angle. The degradation is not linear. Adding irrelevant documents to context does not just waste tokens -- it actively interferes with the model's ability to use the relevant ones.

This happens because attention is distributed across the full context window. Irrelevant content captures attention weight that would otherwise go to useful tokens. The model's effective context -- the portion it actually attends to meaningfully -- is smaller than the raw token count suggests.

The practical implication is that your quality ceiling scales with context relevance, not context size. Sending 8,000 tokens of highly relevant content often produces better answers than sending 16,000 tokens where half is noise. This means context compression is not just a cost optimization -- it is a quality optimization.

Building a Waste Measurement Pipeline

A minimal measurement pipeline has three components:

Token accounting per layer. Log, for each request: system prompt tokens, tool schema tokens, history tokens, retrieved context tokens, user message tokens, output tokens. Persist these with the request ID so you can correlate cost to feature.
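Per-layer token accounting can be as small as one structured log record per request. The sketch below is one possible shape, not a prescribed schema: the `TokenAccounting` class, its field names, and the `context_accounting` logger name are all hypothetical.

```python
import json
import logging
from dataclasses import asdict, dataclass

logger = logging.getLogger("context_accounting")  # hypothetical logger name


@dataclass
class TokenAccounting:
    """Token counts for one LLM request, broken down by context layer."""

    request_id: str
    system_prompt: int
    tool_schema: int
    history: int
    retrieved_context: int
    user_message: int
    output: int

    @property
    def input_total(self) -> int:
        """Sum of all input-side layers (everything except output tokens)."""
        return (
            self.system_prompt
            + self.tool_schema
            + self.history
            + self.retrieved_context
            + self.user_message
        )

    def to_log_line(self) -> str:
        """Serialize to a single JSON line, including the derived input total."""
        record = asdict(self)
        record["input_total"] = self.input_total
        return json.dumps(record, sort_keys=True)


def log_request(acct: TokenAccounting) -> None:
    """Emit one structured line per request, keyed by request ID."""
    logger.info(acct.to_log_line())
```

Emitting JSON lines keeps the records easy to load into whatever analysis tool you already use for the weekly review.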

Utilization proxies. Track whether retrieved chunks appear in responses (for RAG), whether tool schemas produce tool calls (for function calling), and whether early history turns are referenced in later turns. These are imperfect but practical proxies for context utilization.
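As an illustration of one such proxy, the sketch below computes how often each tool is actually called on the requests where its schema was sent. The `tools_offered` and `tools_called` keys are hypothetical log fields; adapt them to whatever your call sites record.

```python
from collections import Counter


def tool_utilization(request_logs: list[dict]) -> dict[str, float]:
    """Per-tool fraction of requests that offered the tool and also called it.

    Each log record is assumed to carry "tools_offered" and "tools_called"
    lists of tool names (hypothetical field names).
    """
    offered: Counter = Counter()
    called: Counter = Counter()
    for log in request_logs:
        offered_set = set(log["tools_offered"])
        offered.update(offered_set)
        # Only count calls to tools that were actually in the manifest.
        called.update(set(log["tools_called"]) & offered_set)
    return {tool: called[tool] / offered[tool] for tool in offered}
```

Tools with a utilization near zero are candidates for removal from the default manifest or for loading only on the routes that need them.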

A/B infrastructure. Run 10% of requests with compressed context variants. Compare output quality on your existing eval metrics. The compression win rate on quality metrics is almost always above 80% for tasks where the context was already noisy.
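Routing a fixed share of traffic to the compressed variant does not need a feature-flag service. One simple approach, sketched below, hashes the request ID into a stable bucket so the same request always lands in the same arm. The function name `in_compressed_arm` is hypothetical, and the 10% default mirrors the figure above.

```python
import hashlib


def in_compressed_arm(request_id: str, fraction: float = 0.10) -> bool:
    """Deterministically assign a request to the compressed-context arm.

    The first 8 bytes of a SHA-256 digest are read as an integer in
    [0, 2**64) and compared against fraction * 2**64, so roughly
    `fraction` of request IDs return True.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(fraction * 2**64)
```

Log the arm alongside the per-layer token counts so quality metrics and cost can be compared per arm in the same review.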

This does not require a new data pipeline. It requires adding structured logging to your LLM call sites and a weekly review of the resulting data.

The teams that get context cost under control are not the ones with the best compression algorithms. They are the ones that know which 30% of their context is doing 80% of the work -- and have stopped paying for the other 70%.


Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{measuring-context-waste-2026,
  title  = {How to Measure Context Waste Before It Becomes a Cost Problem},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://www.gotcontext.ai/blog/measuring-context-waste},
  note   = {gotcontext.ai engineering blog.},
}

APA

Hollingsworth, J. (2026, May 8). How to Measure Context Waste Before It Becomes a Cost Problem. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/measuring-context-waste
