Why Long Agent Sessions Fall Apart (And the Paper That Explains It)
Chroma tested 18 LLMs and found every one degrades as context grows. A 2024 paper shows compression at 5–51x beats retrieval for preserving causal structure in agent sessions.
Long agent sessions don't fail because the model runs out of tokens. They fail because the model starts ignoring most of what you gave it.
Chroma tested 18 large language models on a task called conversational QA. Their finding, published in their Context Rot research, was blunt: every single model degraded as context length grew. A focused ~300-token prompt consistently outperformed the full ~113,000-token conversation window. More context made answers worse.
They called the phenomenon context rot.
What Context Rot Actually Means for Agents ¶
Context rot isn't a bug in any particular model. It's a structural property of how attention works at scale. As conversations accumulate, the signal-to-noise ratio in the context window degrades. Early exchanges get deprioritized. Repeated patterns get overweighted. The model spends attention budget on old, irrelevant turns instead of the current task.
For a single-turn chat session, this doesn't matter much. For an agent running 50+ turns across a multi-hour debugging session, it's the primary failure mode.
The practical symptoms are recognizable if you've hit them:
All of these are context rot. The model isn't broken. It's overwhelmed.
The Compression Approach: Semantic Anchor Compression ¶
A 2024 paper from researchers at several universities proposed a different framing for the problem. Instead of asking "how do we fit more context into the window," they asked: what is the minimal representation of a conversation that preserves the information that actually matters?
The result was Semantic Anchor Compression (SAC), published as arXiv:2510.08907.
SAC works by identifying anchor tokens: the tokens in a conversation that carry the most semantic weight. Rather than summarizing or paraphrasing (which introduces drift), SAC aggregates KV representations around these anchors, producing a compressed version of the conversation that the model can attend to as if it were normal context.
No autoencoder. No separate compression model. The compression happens in the KV cache layer using the same model that will consume the result.
The compression ratios the paper demonstrates are not incremental:
At 15× compression, SAC outperformed RAG approaches by up to 23.5% F1 and 26.8% EM on certain tasks. The compressed representation outperformed retrieval because compression preserves conversational structure (the order and flow of the dialogue) while retrieval collapses it into a bag of relevant chunks.
Why Compression Outperforms Retrieval for Agent Sessions ¶
This is the counterintuitive part. RAG is the standard answer to long-context problems: embed the conversation, retrieve the relevant bits, feed only those to the model. It works for document QA. It fails for agent sessions.
Agent sessions have causal dependencies. The fact that you told the agent "don't touch the production database" in turn 5 is not just a relevant fact. It's a constraint that must be present for every subsequent tool call. Retrieval will surface it when you explicitly ask about databases. It won't surface it when the agent is deciding whether to run a migration script.
Conversation compression preserves this causality. The compressed context still contains the constraint, in the right temporal position, even at 15× compression. Retrieval does not guarantee this.
The same logic applies to:
All of these are load-bearing facts in an agent session. All of them are at risk under context rot. None of them are reliably retrievable without the conversational structure that compression preserves.
The Engineering Tradeoff ¶
Compression adds latency to context preparation. At 5×, this is usually acceptable. At 51×, the preparation step is non-trivial. The practical operating range for production agent systems is 5–15×, which brings a 100,000-token conversation down to 6,600–20,000 tokens, well within the sweet spot where attention is focused and generation is fast.
The other cost is implementation complexity. SAC requires access to the KV cache layer, which is not exposed in standard API calls to hosted models. For teams using Claude, GPT-4o, or Gemini via API, a proxy compression step (compressing the text representation before sending) achieves similar results with less fidelity.
This is exactly what gotcontext.ai's compress_codebase and ingest_context tools do: they compress the context representation before it reaches the model, trading some fidelity for a dramatic reduction in the tokens the model actually processes.
What This Means for Your Agent Architecture ¶
If you're building agents that run long sessions, the Chroma and SAC findings together suggest a clear design principle: never let the raw conversation accumulate in the context window.
Instead:
Context rot is inevitable in any system that accumulates context without managing it. The models aren't going to get better at ignoring irrelevant history. The architecture has to do that work.
Compress your agent sessions automatically with gotcontext.ai →
Cite this¶
Researchers, analysts, or journalists referencing this post can use either format below — both are copyable.
@misc{conversation-compression-long-agent-sessions-2026,
title = {Why Long Agent Sessions Fall Apart (And the Paper That Explains It)},
author = {James Hollingsworth},
year = {2026},
month = {May},
url = {https://www.gotcontext.ai/blog/conversation-compression-long-agent-sessions},
note = {gotcontext.ai engineering blog.},
}James Hollingsworth. (2026, May 8). Why Long Agent Sessions Fall Apart (And the Paper That Explains It). gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/conversation-compression-long-agent-sessions.