Your LLM Gets Measurably Worse as the Conversation Grows. All of Them Do.

Context rot is not a bug. It's a property.

Chroma recently published a technical report testing 18 frontier LLMs (Claude Opus 4, GPT-4.1, Gemini 2.5 Pro, and 15 others) on a consistent benchmark as input context length grew. The finding was unambiguous: every model tested showed measurable quality degradation as context length increased. No model was immune. The phenomenon has a name: context rot.

This isn't a qualitative observation. It's a consistent, reproducible pattern across model families, parameter scales, and providers. The models don't uniformly forget; they become increasingly unreliable. Performance on the same question, with the same answer present in the context, degrades as more surrounding text is added.

Three patterns that make it worse

The Chroma study identified specific structural factors that accelerate degradation:

Semantic distance matters more than context length alone. When the question and the relevant passage are semantically dissimilar (a technical question whose answer is buried in adjacent narrative text), performance degrades faster than when question and answer are closely matched in vocabulary and framing. Long context windows don't just dilute signal; they penalize the cases where retrieval is hardest.

Distractors amplify with scale. Irrelevant but plausible content placed near the answer degrades accuracy more at 100K tokens than at 10K tokens. The model's ability to suppress misleading context weakens as the total input grows.

Coherent content hurts more than shuffled content. This is the counterintuitive result: logically structured, well-organized surrounding content degrades performance *more* than randomly shuffled noise. The hypothesis is that coherent text activates the model's tendency to read across sections, pulling attention away from the specific answer location.
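
The three factors above can be probed with a small needle-in-a-haystack harness. The sketch below is a hypothetical prompt builder, not the Chroma study's code: `build_haystack` and its parameters are names invented here, and context length is measured in sentences rather than tokens for simplicity. It varies total length, places distractors next to the answer, and optionally shuffles the surrounding filler so coherent and shuffled contexts can be compared.

```python
import random


def build_haystack(needle, filler_sentences, distractors, target_sentences,
                   shuffle=False, seed=0):
    """Assemble a test context with one answer sentence buried in filler.

    needle: the single sentence that contains the answer.
    filler_sentences: background text, in its original (coherent) order.
    distractors: plausible but wrong sentences placed around the needle.
    target_sentences: how many filler sentences to include.
    shuffle: if True, shuffle the filler to destroy its logical flow.
    """
    if not filler_sentences:
        raise ValueError("filler_sentences must not be empty")
    rng = random.Random(seed)
    # Repeat the filler as needed so the context reaches the target length.
    filler = [filler_sentences[i % len(filler_sentences)]
              for i in range(target_sentences)]
    if shuffle:
        rng.shuffle(filler)
    # Put the needle in the middle, with distractors on both sides of it.
    mid = len(filler) // 2
    half = len(distractors) // 2
    block = distractors[:half] + [needle] + distractors[half:]
    return " ".join(filler[:mid] + block + filler[mid:])


filler = [
    "The harbor was quiet that morning.",
    "Boats rocked against the pier.",
    "A gull circled overhead.",
]
needle = "The access code for the archive is 7391."
distractors = [
    "The access code for the gym is 2280.",
    "The old archive code was 1157.",
]

short_ctx = build_haystack(needle, filler, distractors, target_sentences=10)
long_ctx = build_haystack(needle, filler, distractors, target_sentences=1000)
shuffled_ctx = build_haystack(needle, filler, distractors,
                              target_sentences=1000, shuffle=True)
```

Sending the same question ("What is the access code for the archive?") with each context to a model, and scoring the answers, gives a rough local check of whether accuracy drops as the haystack grows.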

A parallel study (arXiv:2601.11564) found a non-linear relationship between KV cache growth and performance on dense transformer architectures. Performance doesn't degrade linearly with context; it drops faster as context length increases, particularly when input mixes relevant and irrelevant material.

The practical implication nobody wants to say out loud

LLM providers have spent two years competing on context window size. 128K, 200K, 1M tokens. The marketing framing is: bigger window = more powerful model. Feed it your entire codebase. Your whole conversation history. Everything.

The research says: longer context doesn't mean better results. It frequently means worse results. Every token of irrelevant content you add to the context window is not neutral. It actively degrades performance on what you care about.

| Pattern | Effect on quality |
| --- | --- |
| Full conversation history in every call | Degrading; each turn adds distractor tokens |
| Full codebase as context | Degrading; semantically distant files suppress relevant signal |
| Complete tool output recirculation | Degrading; verbose outputs bury the relevant lines |
| Compressed, query-relevant context | Quality-preserving; model sees what matters |
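
As a rough illustration of the last row, the sketch below selects which earlier messages to resend instead of forwarding the full history. Everything in it is an assumption made for the example: `select_context` is a hypothetical helper, the token estimate is a crude four-characters-per-token guess, and word overlap stands in for the embedding similarity a real system would more likely use.

```python
import re


def estimate_tokens(text):
    # Crude assumption: roughly 4 characters per token.
    # Swap in a real tokenizer when one is available.
    return max(1, len(text) // 4)


def words(text):
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def select_context(query, messages, budget_tokens, keep_last=2):
    """Choose which messages to send with the next call.

    The last `keep_last` messages are always kept (even if they alone
    exceed the budget). Older messages are added in order of word
    overlap with the query, as long as they fit in `budget_tokens`.
    """
    recent_start = max(0, len(messages) - keep_last)
    chosen = set(range(recent_start, len(messages)))
    used = sum(estimate_tokens(messages[i]) for i in chosen)
    q = words(query)
    # Rank older messages by how many query words they share.
    ranked = sorted(range(recent_start),
                    key=lambda i: len(q & words(messages[i])),
                    reverse=True)
    for i in ranked:
        overlap = len(q & words(messages[i]))
        cost = estimate_tokens(messages[i])
        if overlap > 0 and used + cost <= budget_tokens:
            chosen.add(i)
            used += cost
    # Keep chronological order so the conversation still reads naturally.
    return [messages[i] for i in sorted(chosen)]


history = [
    "We deployed the billing service on Tuesday.",
    "Lunch options nearby are limited.",
    "The billing service timeout is set to 30 seconds.",
    "Can you remind me of something?",
    "What is the billing service timeout?",
]
context = select_context("What is the billing service timeout?",
                         history, budget_tokens=1000)
# The lunch message shares no words with the query, so it is dropped.
```

Plain word overlap also counts stopwords such as "the", so it is only a placeholder for a proper relevance score.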

Context rot is a cost problem and a quality problem simultaneously

Removing unnecessary tokens from your context window doesn't just save money. It makes your agent more accurate.

The two objectives (cost reduction and quality improvement) point at the same intervention: feed the model less, not more, of what it doesn't need.
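
The cost side is simple arithmetic. If every call resends the full history, the call at turn k carries k turns of text, so total input tokens grow with the square of the conversation length. The sketch below works this through with illustrative numbers (50 turns, 500 tokens per turn, a 4,000-token cap); these figures are assumptions chosen for the example, not measurements.

```python
def cumulative_input_tokens(turns, tokens_per_turn):
    """Total input tokens when every call resends the full history."""
    # Call k carries k turns, so the total is t * (1 + 2 + ... + n).
    return tokens_per_turn * turns * (turns + 1) // 2


def cumulative_with_cap(turns, tokens_per_turn, cap_tokens):
    """Same conversation, with each call's context held to `cap_tokens`."""
    return sum(min(k * tokens_per_turn, cap_tokens)
               for k in range(1, turns + 1))


full = cumulative_input_tokens(50, 500)      # 637500
capped = cumulative_with_cap(50, 500, 4000)  # 186000
print(full, capped)
```

Under these assumptions the capped conversation sends a bit under a third of the input tokens, and each individual call also stays in the short-context range where the research above reports better accuracy.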

gotcontext's compression pipeline is designed around this finding. Rather than truncating arbitrarily (which throws away information at random), it builds a semantic graph of the input, ranks content by structural importance, and emits a compressed form that preserves what the model needs to answer the query at hand. The result is a shorter context that gets better answers, not despite being shorter, but because of it.
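
To make the idea of graph-based ranking concrete, here is a minimal TextRank-style sketch. It is not gotcontext's pipeline, whose internals this post does not describe: the functions `rank_sentences` and `compress` are hypothetical, the graph is built from plain word overlap between sentences, and the ranking ignores the query entirely.

```python
import re


def tokenize(sentence):
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def rank_sentences(sentences, damping=0.85, iterations=30):
    """Score sentences with a PageRank-style walk over a word-overlap graph."""
    n = len(sentences)
    if n == 0:
        return []
    bags = [tokenize(s) for s in sentences]
    # Edge weight = Jaccard similarity between two sentences' word sets.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and bags[i] and bags[j]:
                weights[i][j] = len(bags[i] & bags[j]) / len(bags[i] | bags[j])
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Each neighbor passes on its score in proportion to edge weight.
            incoming = sum(scores[i] * weights[i][j] / out_sum[i]
                           for i in range(n) if out_sum[i] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def compress(sentences, keep):
    """Keep the `keep` highest-scoring sentences, in their original order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]


doc = [
    "The cache stores recent query results.",
    "Query results expire from the cache after ten minutes.",
    "Bananas taste sweet.",
    "Expired cache results are refetched on the next query.",
]
print(compress(doc, keep=3))
```

The sentence with no connection to the rest of the text receives only the baseline score and is the first to be dropped, which is the behavior that separates ranking from arbitrary truncation.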

To connect an MCP client to gotcontext, add the server to the client's configuration:

```json
{
  "mcpServers": {
    "gotcontext": {
      "url": "https://api.gotcontext.ai/mcp",
      "headers": {
        "Authorization": "Bearer gc_live_YOUR_KEY"
      }
    }
  }
}
```

Eighteen models. Same result. Longer context degrades quality. The fix is the same as the cost fix: compress before the model reads it.


Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{why-your-llm-gets-dumber-as-the-conversation-grows-2026,
  title  = {Your LLM Gets Measurably Worse as the Conversation Grows. All of Them Do.},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://www.gotcontext.ai/blog/why-your-llm-gets-dumber-as-the-conversation-grows},
  note   = {gotcontext.ai engineering blog.},
}

APA

Hollingsworth, J. (2026, May 8). Your LLM Gets Measurably Worse as the Conversation Grows. All of Them Do. gotcontext.ai. https://www.gotcontext.ai/blog/why-your-llm-gets-dumber-as-the-conversation-grows
