Why Long Agent Sessions Fall Apart (And the Paper That Explains It)

Long agent sessions don't fail because the model runs out of tokens. They fail because the model starts ignoring most of what you gave it.

Chroma tested 18 large language models on a task called conversational QA. Their finding, published in their Context Rot research, was blunt: every single model degraded as context length grew. A focused ~300-token prompt consistently outperformed the full ~113,000-token conversation window. More context made answers worse.

They called the phenomenon context rot.

What Context Rot Actually Means for Agents

Context rot isn't a bug in any particular model. It's a structural property of how attention works at scale. As conversations accumulate, the signal-to-noise ratio in the context window degrades. Early exchanges get deprioritized. Repeated patterns get overweighted. The model spends attention budget on old, irrelevant turns instead of the current task.

For a single-turn chat session, this doesn't matter much. For an agent running 50+ turns across a multi-hour debugging session, it's the primary failure mode.

The practical symptoms are recognizable if you've hit them:

The agent "forgets" a constraint you set in turn 3 by turn 40

Responses grow longer and less actionable as the session continues

The agent starts re-asking questions you already answered

Tool call quality degrades: more hallucinated arguments, more retries

All of these are context rot. The model isn't broken. It's overwhelmed.

The Compression Approach: Semantic Anchor Compression

A 2025 paper from researchers at several universities proposed a different framing for the problem. Instead of asking "how do we fit more context into the window," they asked: what is the minimal representation of a conversation that preserves the information that actually matters?

The result was Semantic Anchor Compression (SAC), published as arXiv:2510.08907.

SAC works by identifying anchor tokens: the tokens in a conversation that carry the most semantic weight. Rather than summarizing or paraphrasing (which introduces drift), SAC aggregates KV representations around these anchors, producing a compressed version of the conversation that the model can attend to as if it were normal context.

No autoencoder. No separate compression model. The compression happens in the KV cache layer using the same model that will consume the result.
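To make the idea of "aggregating around anchors" concrete, here is a toy sketch. It is not the paper's algorithm: real SAC operates on the model's own KV cache and chooses anchors by semantic weight, while this example assumes the anchor positions are already given and simply averages each per-token vector into its nearest anchor. The function name `pool_into_anchors` and the two-dimensional vectors are made up for illustration.

```python
from typing import List

Vector = List[float]

def pool_into_anchors(vectors: List[Vector], anchor_idx: List[int]) -> List[Vector]:
    """Average every token vector into its nearest anchor position.

    Toy stand-in for anchor-based aggregation; anchors are assumed given.
    """
    anchors = sorted(set(anchor_idx))
    if not anchors:
        raise ValueError("need at least one anchor")
    if anchors[0] < 0 or anchors[-1] >= len(vectors):
        raise ValueError("anchor positions must index into vectors")

    dim = len(vectors[0])
    sums = [[0.0] * dim for _ in anchors]
    counts = [0] * len(anchors)
    for pos, vec in enumerate(vectors):
        # Closest anchor by position; ties go to the earlier anchor.
        nearest = min(range(len(anchors)), key=lambda a: abs(anchors[a] - pos))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += vec[d]
    # Every anchor owns at least its own position, so counts are never zero.
    return [[s / c for s in row] for row, c in zip(sums, counts)]

# Four token vectors pooled into two anchors: a 2x reduction in sequence length.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
print(pool_into_anchors(tokens, [0, 3]))  # [[2.0, 0.0], [0.0, 3.0]]
```

The output keeps one vector per anchor, in the original order, which is the property the rest of this post leans on: the sequence gets shorter, but its shape survives.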

The compression ratios the paper demonstrates are not incremental:

5× compression with quality comparable to full context

15× compression with F1 score of 54.95 vs 51.52 for retrieval-augmented baselines

51× compression still functional on standard QA benchmarks

At 15× compression, SAC outperformed RAG approaches by up to 23.5% F1 and 26.8% EM on certain tasks. The compressed representation outperformed retrieval because compression preserves conversational structure (the order and flow of the dialogue) while retrieval collapses it into a bag of relevant chunks.

Why Compression Outperforms Retrieval for Agent Sessions

This is the counterintuitive part. RAG is the standard answer to long-context problems: embed the conversation, retrieve the relevant bits, feed only those to the model. It works for document QA. It fails for agent sessions.

Agent sessions have causal dependencies. The fact that you told the agent "don't touch the production database" in turn 5 is not just a relevant fact. It's a constraint that must be present for every subsequent tool call. Retrieval will surface it when you explicitly ask about databases. It won't surface it when the agent is deciding whether to run a migration script.

Conversation compression preserves this causality. The compressed context still contains the constraint, in the right temporal position, even at 15× compression. Retrieval does not guarantee this.

The same logic applies to:

File paths established early in a session

User preferences stated once and assumed thereafter

Error states the agent encountered and resolved

Decisions made with rationale that affects later choices

All of these are load-bearing facts in an agent session. All of them are at risk under context rot. None of them are reliably retrievable without the conversational structure that compression preserves.
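The migration-script failure described above is easy to reproduce with a toy retriever. The sketch below assumes a naive word-overlap scorer standing in for embedding search, and a made-up four-turn history; production retrievers are better at matching, but the underlying issue (the constraint shares little surface similarity with the query that needs it) is the same.

```python
import re
from typing import List, Set

def words(text: str) -> Set[str]:
    """Lowercase alphabetic tokens of a turn."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(history: List[str], query: str, k: int = 1) -> List[str]:
    """Return the k turns sharing the most words with the query."""
    q = words(query)
    ranked = sorted(history, key=lambda turn: len(words(turn) & q), reverse=True)
    return ranked[:k]

# Hypothetical session history, oldest turn first.
history = [
    "use the staging config for all deploys",
    "don't touch the production database",
    "the failing test is in the signup flow",
    "the migration script is ready to run",
]

retrieved = retrieve(history, "should I run the migration script now")
constraint_present = any("production database" in turn for turn in retrieved)
print(retrieved)           # ['the migration script is ready to run']
print(constraint_present)  # False
```

The retriever returns the turn that looks most like the question and drops the one that should veto the action. Any approach that carries every turn forward in order, even in shortened form, does not have this blind spot.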

The Engineering Tradeoff

Compression adds latency to context preparation. At 5×, this is usually acceptable. At 51×, the preparation step is non-trivial. The practical operating range for production agent systems is 5–15×, which brings a 100,000-token conversation down to roughly 6,700–20,000 tokens, well within the sweet spot where attention is focused and generation is fast.
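The token budgets follow directly from dividing the original length by the compression ratio. A quick check, using a hypothetical helper named `compressed_tokens`:

```python
def compressed_tokens(original_tokens: int, ratio: float) -> int:
    """Approximate token count after compressing by the given ratio."""
    return round(original_tokens / ratio)

for ratio in (5, 15, 51):
    print(f"{ratio}x: {compressed_tokens(100_000, ratio):,} tokens")
# 5x: 20,000 tokens
# 15x: 6,667 tokens
# 51x: 1,961 tokens
```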

The other cost is implementation complexity. SAC requires access to the KV cache layer, which is not exposed in standard API calls to hosted models. For teams using Claude, GPT-4o, or Gemini via API, a proxy compression step (compressing the text representation before sending) approximates the effect at lower fidelity.

This is exactly what gotcontext.ai's compress_codebase and ingest_context tools do: they compress the context representation before it reaches the model, trading some fidelity for a dramatic reduction in the tokens the model actually processes.
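For a sense of what a text-level proxy step can look like, here is a minimal sketch. It is not how compress_codebase or ingest_context work internally; it is a deliberately simple heuristic under stated assumptions: the `Turn` type, the `pinned` flag, and the `keep_recent` and `max_words` parameters are all hypothetical. Older turns are shortened, while pinned constraints and recent turns pass through verbatim and in their original order.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    role: str
    text: str
    pinned: bool = False  # hypothetical flag for constraints that must survive

def compress_history(turns: List[Turn], keep_recent: int = 4, max_words: int = 12) -> List[Turn]:
    """Shorten older unpinned turns; keep pinned and recent turns verbatim, in order."""
    cutoff = max(len(turns) - keep_recent, 0)
    out: List[Turn] = []
    for i, turn in enumerate(turns):
        if turn.pinned or i >= cutoff:
            out.append(turn)
            continue
        tokens = turn.text.split()
        short = " ".join(tokens[:max_words])
        if len(tokens) > max_words:
            short += " ..."  # mark that the turn was truncated
        out.append(Turn(turn.role, short, turn.pinned))
    return out

session = [
    Turn("user", "don't touch the production database", pinned=True),
    Turn("assistant", "one two three four five six seven eight"),
    Turn("user", "what does the stack trace say"),
]
for turn in compress_history(session, keep_recent=1, max_words=3):
    print(f"{turn.role}: {turn.text}")
# user: don't touch the production database
# assistant: one two three ...
# user: what does the stack trace say
```

Word truncation is the crudest possible compressor; a summarizer or a dedicated compression tool would replace the shortening step. The part worth keeping is the shape: every turn stays in sequence, and constraints are never candidates for shortening.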

What This Means for Your Agent Architecture

If you're building agents that run long sessions, the Chroma and SAC findings together suggest a clear design principle: never let the raw conversation accumulate in the context window.

Instead:

Compress conversation history before each turn, not just when you hit the limit

Prefer compression over retrieval for preserving causal dependencies

Monitor answer quality over session length as an early signal of context rot onset

Set a compression threshold (10× is a reasonable starting point) and apply it proactively
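Two of the points above, monitoring quality and setting a proactive threshold, reduce to a few lines of bookkeeping. The sketch below assumes you already have some per-turn quality score between 0 and 1 (from an evaluator, a rubric, or user feedback; where it comes from is outside this example). The names `rot_suspected` and `target_budget`, the window size, and the 0.15 drop threshold are illustrative defaults, not recommendations from either study.

```python
from statistics import mean
from typing import Sequence

def rot_suspected(scores: Sequence[float], window: int = 5, max_drop: float = 0.15) -> bool:
    """Flag when recent average quality falls well below the session's opening average."""
    if len(scores) < 2 * window:
        return False  # not enough turns to compare two separate windows
    baseline = mean(scores[:window])
    recent = mean(scores[-window:])
    return baseline - recent > max_drop

def target_budget(history_tokens: int, ratio: float = 10.0) -> int:
    """Token budget for the compressed history at the chosen ratio."""
    return max(1, round(history_tokens / ratio))

healthy = [0.9] * 10
degrading = [0.9] * 5 + [0.6] * 5
print(rot_suspected(healthy))    # False
print(rot_suspected(degrading))  # True
print(target_budget(100_000))    # 10000
```

Comparing against the session's own opening turns avoids needing an absolute quality bar: the question is whether this session is getting worse, not whether it meets a global standard.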

Context rot is inevitable in any system that accumulates context without managing it. The models aren't going to get better at ignoring irrelevant history. The architecture has to do that work.


Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{conversation-compression-long-agent-sessions-2026,
  title  = {Why Long Agent Sessions Fall Apart (And the Paper That Explains It)},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://www.gotcontext.ai/blog/conversation-compression-long-agent-sessions},
  note   = {gotcontext.ai engineering blog.},
}

APA

James Hollingsworth. (2026, May 8). Why Long Agent Sessions Fall Apart (And the Paper That Explains It). gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/conversation-compression-long-agent-sessions.
