
Your LLM Gets Measurably Worse as the Conversation Grows. All of Them Do.

Chroma tested 18 frontier LLMs and found that every one of them degrades as context length grows, a phenomenon known as context rot. The fix for quality and the fix for cost turn out to be the same intervention.

James Hollingsworth (Contributor) | Published | 5 min | ~581 words

Context rot is not a bug. It's a property.

Chroma recently published a technical report testing 18 frontier LLMs (Claude Opus 4, GPT-4.1, Gemini 2.5 Pro, and 15 others) on a consistent benchmark as input context length grew. The finding was unambiguous: every model tested showed measurable quality degradation as context length increased. No model was immune. The phenomenon has a name: context rot.

This isn't a qualitative observation. It's a consistent, reproducible pattern across model families, parameter scales, and providers. The models don't uniformly forget; they become increasingly unreliable. Performance on the same question, with the same answer present in the context, degrades as more surrounding text is added.

Three patterns that make it worse

The Chroma study identified specific structural factors that accelerate degradation:

Semantic distance matters more than context length alone. When the question and the relevant passage are semantically dissimilar (a technical question whose answer is buried in adjacent narrative text), performance degrades faster than when question and answer are closely matched in vocabulary and framing. Long context windows don't just dilute signal; they penalize the cases where retrieval is hardest.

Distractors amplify with scale. Irrelevant but plausible content placed near the answer degrades accuracy more at 100K tokens than at 10K tokens. The model's ability to suppress misleading context weakens as the total input grows.

Coherent content hurts more than shuffled content. This is the counterintuitive result: logically structured, well-organized surrounding content degrades performance *more* than randomly shuffled noise. The hypothesis is that coherent text activates the model's tendency to read across sections, pulling attention away from the specific answer location.

A parallel study (arXiv:2601.11564) found a non-linear relationship between KV cache growth and performance on dense transformer architectures. Performance doesn't degrade linearly with context; it drops faster as context length increases, particularly when input mixes relevant and irrelevant material.
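The distractor and shuffling effects described in this section can be probed with a small needle-in-a-haystack prompt builder. The sketch below is an illustration only, not Chroma's harness: the function name `build_haystack_prompt` and its parameters are hypothetical, and word count is used as a rough stand-in for token count.

```python
import random


def build_haystack_prompt(needle, question, filler_sentences, distractors,
                          target_words, shuffle=False, seed=0):
    """Bury `needle` (followed by any distractors) in about `target_words` words of filler."""
    if not filler_sentences:
        raise ValueError("filler_sentences must not be empty")
    rng = random.Random(seed)  # seeded so every run builds the same prompt
    body = []
    words = 0
    i = 0
    # Cycle through the filler sentences until the word budget is reached.
    while words < target_words:
        sentence = filler_sentences[i % len(filler_sentences)]
        body.append(sentence)
        words += len(sentence.split())
        i += 1
    if shuffle:
        # Shuffled filler is the "incoherent haystack" condition.
        rng.shuffle(body)
    # Place the needle at a random position, with distractors right after it.
    pos = rng.randrange(len(body) + 1)
    body[pos:pos] = [needle] + list(distractors)
    context = " ".join(body)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    filler = ["The harbor was quiet that morning.",
              "Gulls circled over the fish market."]
    for size in (200, 2000, 20000):
        prompt = build_haystack_prompt(
            "The vault code is 4921.", "What is the vault code?",
            filler, ["The locker code is 1337."], target_words=size)
        print(size, len(prompt.split()))
```

Sending each prompt to the same model and scoring the answers at every size, with and without `shuffle`, gives a rough local view of how accuracy moves with length for your own data.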

The practical implication nobody wants to say out loud

LLM providers have spent two years competing on context window size. 128K, 200K, 1M tokens. The marketing framing is: bigger window = more powerful model. Feed it your entire codebase. Your whole conversation history. Everything.

The research says otherwise: longer context doesn't mean better results, and it frequently means worse ones. Irrelevant tokens are not neutral. Each one you add to the context window actively degrades performance on what you care about.

| Pattern | Effect on quality |
| --- | --- |
| Full conversation history in every call | Degrading; each turn adds distractor tokens |
| Full codebase as context | Degrading; semantically distant files suppress relevant signal |
| Complete tool output recirculation | Degrading; verbose outputs bury the relevant lines |
| Compressed, query-relevant context | Quality-preserving; model sees what matters |
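To make the tool-output row concrete, here is a minimal sketch of trimming a verbose output down to the lines that share a term with the query, plus a little surrounding context (similar in spirit to `grep -C`). The helper name `relevant_lines` and the "terms longer than three characters" heuristic are assumptions chosen for illustration, not a recommended production filter.

```python
import re


def relevant_lines(tool_output, query, context=1):
    """Keep lines sharing a term with the query, plus `context` neighbours on each side."""
    # Ignore very short query words ("the", "is", ...) to reduce accidental matches.
    terms = {t for t in re.findall(r"[a-z0-9_]+", query.lower()) if len(t) > 3}
    lines = tool_output.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        words = set(re.findall(r"[a-z0-9_]+", line.lower()))
        if terms & words:
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    out = []
    prev = None
    for i in sorted(keep):
        if prev is not None and i > prev + 1:
            out.append("...")  # mark where lines were dropped between kept regions
        out.append(lines[i])
        prev = i
    return "\n".join(out)


if __name__ == "__main__":
    log = "\n".join([
        "INFO starting worker 1",
        "INFO loaded config",
        "INFO cache warm",
        "WARN retrying request",
        "ERROR connection timeout after 30s",
        "INFO shutting down",
        "INFO flushed metrics",
        "INFO bye",
    ])
    print(relevant_lines(log, "connection timeout error"))
```

On this sample log, only the matching `ERROR` line and its two neighbours are passed on, so the model sees three lines instead of eight.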

Context rot is a cost problem and a quality problem simultaneously

Removing unnecessary tokens from your context window doesn't just save money. It makes your agent more accurate.

The two objectives (cost reduction and quality improvement) point at the same intervention: feed the model less, not more, of what it doesn't need.
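The cost side is simple arithmetic. If every call resends the full history, the input size of call k grows with k, so total input tokens grow roughly quadratically with the number of turns. The sketch below assumes a fixed number of new tokens per turn and models compression as a fixed cap on context size; both are simplifying assumptions, and the numbers are illustrative rather than measured.

```python
def cumulative_input_tokens(turns, tokens_per_turn, cap=None):
    """Total input tokens sent over a conversation that resends history on every call."""
    total = 0
    for k in range(1, turns + 1):
        context = k * tokens_per_turn  # history so far, including the new turn
        if cap is not None:
            context = min(context, cap)  # compressed context never exceeds the cap
        total += context
    return total


if __name__ == "__main__":
    full = cumulative_input_tokens(turns=50, tokens_per_turn=500)
    capped = cumulative_input_tokens(turns=50, tokens_per_turn=500, cap=4000)
    print(full)    # 637500
    print(capped)  # 186000
```

Under these assumptions, a 50-turn conversation sends 637,500 input tokens with full history and 186,000 with a 4,000-token cap, and the model never reads more than 4,000 tokens in any single call.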

gotcontext's compression pipeline is designed around this finding. Rather than truncating arbitrarily (which throws away information at random), it builds a semantic graph of the input, ranks content by structural importance, and emits a compressed form that preserves what the model needs to answer the query at hand. The result is a shorter context that gets better answers, not despite being shorter, but because of it.

```json
{
  "mcpServers": {
    "gotcontext": {
      "url": "https://api.gotcontext.ai/mcp",
      "headers": {
        "Authorization": "Bearer gc_live_YOUR_KEY"
      }
    }
  }
}
```
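The general idea of building a graph over the input and ranking content before emitting a shorter context can be illustrated with a generic, TextRank-style sketch. This is not gotcontext's implementation, which the post does not describe in detail. Everything here is an assumption made for illustration: the function name `compress`, Jaccard word overlap as a stand-in for semantic similarity, query-biased PageRank as the ranking step, and a word budget in place of a token budget.

```python
import re


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def compress(chunks, query, budget_words, damping=0.85, iterations=30):
    """Rank chunks by query-biased PageRank on a similarity graph; keep the best that fit."""
    n = len(chunks)
    if n == 0:
        return []
    bags = [tokenize(c) for c in chunks]
    q = tokenize(query)
    # Edge weights: lexical overlap between chunks (a crude proxy for semantic similarity).
    weights = [[jaccard(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_sum = [sum(row) for row in weights]
    # Restart distribution: chunks that overlap with the query receive more mass.
    bias = [jaccard(b, q) for b in bags]
    total_bias = sum(bias)
    bias = [b / total_bias for b in bias] if total_bias > 0 else [1.0 / n] * n
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            inflow = sum(scores[j] * weights[j][i] / out_sum[j]
                         for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) * bias[i] + damping * inflow)
        scores = new_scores
    # Greedily keep the highest-scoring chunks that fit, then restore original order.
    kept = []
    used = 0
    for i in sorted(range(n), key=lambda idx: scores[idx], reverse=True):
        size = len(chunks[i].split())
        if used + size <= budget_words:
            kept.append(i)
            used += size
    return [chunks[i] for i in sorted(kept)]


if __name__ == "__main__":
    chunks = [
        "The deploy script reads the API token from the environment.",
        "Lunch on Friday was moved to the third floor kitchen.",
        "If the API token is missing the deploy script exits with code 2.",
        "The office plants need watering twice a week.",
    ]
    for chunk in compress(chunks, "Why does the deploy script exit when the token is missing?", 25):
        print(chunk)
```

With a 25-word budget, this example keeps the two deploy-script sentences and drops the two unrelated ones. A real system would likely use embeddings and a tokenizer-accurate budget; the point here is only the shape of the pipeline: graph, rank, emit.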

Eighteen models. Same result. Longer context degrades quality. The fix is the same as the cost fix: compress before the model reads it.


Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{why-your-llm-gets-dumber-as-the-conversation-grows-2026,
  title  = {Your LLM Gets Measurably Worse as the Conversation Grows. All of Them Do.},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://www.gotcontext.ai/blog/why-your-llm-gets-dumber-as-the-conversation-grows},
  note   = {gotcontext.ai engineering blog.},
}

APA

Hollingsworth, J. (2026, May 8). Your LLM Gets Measurably Worse as the Conversation Grows. All of Them Do. gotcontext.ai. https://www.gotcontext.ai/blog/why-your-llm-gets-dumber-as-the-conversation-grows
