KV Cache Compression: A Field Guide for Practitioners

Five distinct compression families (eviction, quantization, low-rank, architecture, and streaming) each trade off differently across sequence length and task type. A 2025 review of 25+ methods found no single winner. Here is the decision matrix.

James Hollingsworth (Contributor) | 7 min read | ~824 words

The memory wall at inference time

The KV (key-value) cache is not an implementation detail. It is the dominant memory bottleneck in LLM inference. At a 32K context window with a 70B-parameter model in BF16, the KV cache takes roughly 10 GiB per sequence even with grouped-query attention (80 layers, 8 KV heads, head dimension 128) and roughly 80 GiB with full multi-head attention. Scale to 128K and you have a memory problem the compute side cannot solve.
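The size of the cache follows directly from the model shape, so it is worth computing for your own configuration. A minimal sketch, assuming a Llama-style 70B layout (80 layers, head dimension 128, and either 8 or 64 KV heads); the function name and the shape are illustrative assumptions, not taken from any particular library:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor of 2 covers the separate K and V tensors stored in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

GIB = 1024 ** 3

# Assumed 70B-class shape: 80 layers, head_dim 128, BF16 (2 bytes per element).
gqa = kv_cache_bytes(80, 8, 128, 32 * 1024)    # grouped-query attention, 8 KV heads
mha = kv_cache_bytes(80, 64, 128, 32 * 1024)   # full multi-head attention, 64 KV heads

print(f"GQA, 8 KV heads:  {gqa / GIB:.1f} GiB per 32K sequence")
print(f"MHA, 64 KV heads: {mha / GIB:.1f} GiB per 32K sequence")
```

The cache grows linearly in sequence length and in batch size, which is why long contexts and large batches compete for the same memory.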

A 2025 survey (arXiv:2508.06297) covering the major KV compression families found no single method dominates. Each trades differently across sequence length, task type, and hardware budget. This post is the decision tree.

Five compression families

The families are distinct in mechanism, not just in implementation.

1. Selective token eviction

Evict tokens from the KV cache based on attention score history. The most-cited implementations are H2O (Heavy-Hitter Oracle) and SnapKV.

H2O (arXiv:2306.14048) keeps a fixed-size KV budget by retaining tokens that received the most cumulative attention mass (the "heavy hitters"). SnapKV (arXiv:2404.14469) does a prefill-time selection by observing which tokens draw attention during the prompt and pre-selecting them before generation begins.
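The heavy-hitter idea can be sketched in a few lines of NumPy. This is a simplified one-shot selection for a single head, assuming the attention weights are already available as a matrix; H2O itself evicts greedily at every decoding step, and the function name and parameters here are hypothetical:

```python
import numpy as np

def heavy_hitter_keep(attn, budget, recent):
    """Choose cache positions to keep: the `recent` newest tokens plus the
    older tokens with the largest cumulative attention mass.

    attn: (num_queries, num_keys) attention weights observed so far.
    """
    num_keys = attn.shape[1]
    if budget >= num_keys:
        return np.arange(num_keys)
    recent = min(recent, budget)
    scores = attn.sum(axis=0)                      # cumulative attention per key
    recent_idx = np.arange(num_keys - recent, num_keys)
    older_scores = scores[: num_keys - recent]
    n_heavy = budget - recent
    heavy_idx = np.argsort(older_scores)[::-1][:n_heavy]
    return np.sort(np.concatenate([heavy_idx, recent_idx]))

# Two query rows over six cached tokens; tokens 0 and 2 draw most of the attention.
attn = np.array([
    [0.50, 0.05, 0.30, 0.05, 0.05, 0.05],
    [0.40, 0.05, 0.40, 0.05, 0.05, 0.05],
])
keep = heavy_hitter_keep(attn, budget=4, recent=2)
print(keep)
```

With a budget of four, the sketch keeps the two most recent tokens and the two older tokens with the highest accumulated attention.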

Both work well on tasks where attention is concentrated (summarization, factual QA). Both degrade on tasks where relevant tokens are spread across the full context (long-dependency code generation, for instance).

When to use: Batch inference, short-answer tasks, summarization. 2-4x KV memory reduction at minimal quality loss on concentrated-attention tasks.

2. Quantization

Reduce the bit-width of stored KV activations from BF16 or FP16 down to INT8, INT4, or lower. The key paper here is KIVI (arXiv:2402.02750), which quantizes keys per-channel and values per-token to 2-bit, keeps a small full-precision residual cache for the most recent tokens, and reports quality close to the FP16 baseline on its benchmarks.
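KIVI's asymmetry is that keys are quantized per-channel and values per-token. A minimal NumPy sketch of that idea, assuming plain min/max uniform quantization over the whole tensor; it omits KIVI's grouping, residual window, and bit-packing, and the helper names are illustrative:

```python
import numpy as np

def quantize(x, bits, axis):
    """Asymmetric uniform quantization with one (scale, offset) per slice,
    where statistics are computed by reducing over `axis`."""
    levels = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    codes = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
keys = rng.standard_normal((64, 16)).astype(np.float32)     # (tokens, channels)
values = rng.standard_normal((64, 16)).astype(np.float32)

# Keys: one scale per channel, so reduce over the token axis.
k_codes, k_scale, k_lo = quantize(keys, bits=2, axis=0)
# Values: one scale per token, so reduce over the channel axis.
v_codes, v_scale, v_lo = quantize(values, bits=2, axis=1)

k_err = np.abs(dequantize(k_codes, k_scale, k_lo) - keys).max()
print("key scales:", k_scale.shape, "value scales:", v_scale.shape)
print("max key reconstruction error:", float(k_err))
```

The codes are left in `uint8` here for readability; a real implementation would pack four 2-bit codes into each byte to realize the memory saving.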

NVIDIA's kvpress library (github.com/NVIDIA/kvpress) implements KV quantization as a drop-in HuggingFace hook; no custom CUDA required. The library covers 25+ compression methods including KIVI-style quantization.

When to use: Self-hosted inference, any task type. Orthogonal to eviction; can stack with H2O for multiplicative reduction. Minimal implementation effort with kvpress.
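The multiplicative stacking is simple arithmetic. A rough sketch that ignores quantization metadata (scales and offsets) and any full-precision residual, with a hypothetical helper name:

```python
def stacked_reduction(keep_fraction, orig_bits, quant_bits):
    """Combined KV memory reduction from evicting tokens and quantizing the rest."""
    eviction = 1.0 / keep_fraction          # e.g. keep half the tokens -> 2x
    quantization = orig_bits / quant_bits   # e.g. 16-bit -> 4-bit -> 4x
    return eviction * quantization

# Keep 50% of tokens, then store the survivors in 4-bit instead of 16-bit.
combined = stacked_reduction(keep_fraction=0.5, orig_bits=16, quant_bits=4)
print(f"{combined:.0f}x smaller KV cache")
```

In practice the achieved ratio is somewhat lower than this upper bound because of the per-group metadata that quantization stores alongside the codes.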

3. Low-rank approximation

Instead of evicting tokens or reducing bit-width, represent the KV cache in a lower-dimensional subspace. ShadowKV (arXiv:2410.21465) keeps a low-rank SVD approximation of the keys on the GPU, offloads the values to CPU memory, and reconstructs only the small set of KV pairs it needs at each decoding step. The paper reports support for up to 6x larger batch sizes while retaining accuracy on RULER long-context benchmarks, where eviction methods start breaking down.

Low-rank methods require more compute per attention step (reconstruction adds FLOPs) in exchange for better quality at high compression ratios.
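The storage-versus-compute trade can be seen with a plain truncated SVD. This is a toy sketch on synthetic keys that are deliberately close to rank 8 (an assumption made for the demo), not ShadowKV's actual pipeline:

```python
import numpy as np

def low_rank_factors(K, rank):
    """Truncated SVD: K (tokens x dim) ~= A @ B, with A (tokens x rank), B (rank x dim)."""
    U, S, Vt = np.linalg.svd(K, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    B = Vt[:rank]
    return A, B

rng = np.random.default_rng(0)
# Synthetic key matrix: rank-8 structure plus a small amount of noise.
K = rng.standard_normal((1024, 8)) @ rng.standard_normal((8, 128))
K += 0.01 * rng.standard_normal(K.shape)

A, B = low_rank_factors(K, rank=8)
rel_err = np.linalg.norm(K - A @ B) / np.linalg.norm(K)
stored = A.size + B.size         # elements kept instead of the full K

print(f"stored elements: {stored} vs {K.size} ({K.size / stored:.1f}x smaller)")
print(f"relative reconstruction error: {rel_err:.4f}")
```

The `A @ B` product is the reconstruction cost referred to above: it has to be paid, at least for the rows that are needed, every time the keys are used in attention.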

When to use: Long-context tasks (>64K tokens), tasks where uniform eviction would miss semantically important distant tokens.

4. Architecture-level KV reduction

This is not a compression technique applied post-hoc. It is a design choice that changes how much KV data is generated in the first place.

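Grouped Query Attention (GQA), used in Llama 3, Mistral, and Gemma, shares each key-value head across a group of query heads. The KV cache shrinks by the ratio of query heads to KV heads: with 32 query heads and 8 KV heads (4 query heads per group), the model stores 4x less KV data than Multi-Head Attention with the same head count and head dimension.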
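The sharing is easiest to see in the shapes. A minimal single-token decoding sketch in NumPy, assuming 32 query heads and 8 KV heads; the function name and sizes are illustrative:

```python
import numpy as np

def gqa_attention(q, k_cache, v_cache):
    """q: (n_q_heads, d); k_cache, v_cache: (n_kv_heads, seq, d). Returns (n_q_heads, d)."""
    n_q_heads, d = q.shape
    n_kv_heads = k_cache.shape[0]
    group = n_q_heads // n_kv_heads            # query heads per shared KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                        # cached KV head this query head reads
        scores = k_cache[kv] @ q[h] / np.sqrt(d)
        w = np.exp(scores - scores.max())      # numerically stable softmax
        w /= w.sum()
        out[h] = w @ v_cache[kv]
    return out

rng = np.random.default_rng(0)
n_q_heads, n_kv_heads, seq, d = 32, 8, 16, 64
q = rng.standard_normal((n_q_heads, d))
k_cache = rng.standard_normal((n_kv_heads, seq, d))
v_cache = rng.standard_normal((n_kv_heads, seq, d))

out = gqa_attention(q, k_cache, v_cache)
mha_elems = 2 * n_q_heads * seq * d            # what MHA would have cached
cache_ratio = mha_elems / (k_cache.size + v_cache.size)
print(out.shape, f"cache is {cache_ratio:.0f}x smaller than MHA")
```

Only `n_kv_heads` key and value tensors are ever written to the cache; the query heads in a group all read the same one.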

Multi-head Latent Attention (MLA), introduced in DeepSeek-V2 (arXiv:2405.04434), compresses keys and values into a shared low-rank latent vector before storage. The DeepSeek-V2 paper reports a 93.3% smaller KV cache than the earlier DeepSeek 67B model, an order-of-magnitude reduction.

If you are deploying a model you control (fine-tuning on an open-weight base), architecture choice here has larger impact than any post-hoc compression method.

5. Streaming / windowed attention

StreamingLLM (arXiv:2309.17453) retains only attention sinks (early tokens the model consistently attends to) plus a sliding window of recent tokens. Memory is bounded at O(window size) regardless of sequence length. Quality holds on tasks where the relevant information is recent. Quality collapses on tasks requiring recall of information outside the window.
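The retention policy itself is tiny. A sketch of which positions survive, assuming 4 sink tokens and a 1020-token window as defaults chosen for illustration; it does not model how positions are re-indexed inside the cache:

```python
def streaming_keep(seq_len, num_sinks=4, window=1020):
    """Cache positions kept: the first `num_sinks` tokens plus the last `window` tokens."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))            # everything still fits
    return list(range(num_sinks)) + list(range(seq_len - window, seq_len))

small = streaming_keep(10, num_sinks=2, window=3)
print(small)
print(len(streaming_keep(100_000)))            # bounded regardless of sequence length
```

Everything between the sinks and the window is gone for good, which is why recall of mid-context information fails under this policy.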

When to use: Real-time chat, streaming inference, use cases where long-range recall is not required.

Practical starting point with kvpress

NVIDIA's kvpress (github.com/NVIDIA/kvpress) wraps 25+ compression methods as HuggingFace hooks. Installation:

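```bash
pip install kvpress
```

Drop-in integration with any pipeline:

```python
from transformers import pipeline
from kvpress import ExpectedAttentionPress

# compression_ratio is the fraction of KV pairs to prune.
press = ExpectedAttentionPress(compression_ratio=0.4)

pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device="cuda",
)

prompt = "..."  # placeholder: supply your own long-context prompt

# The press registers its hooks on the model only inside this block.
with press(pipe.model):
    output = pipe(prompt, max_new_tokens=256)
```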

The hook intercepts the KV cache write path. No model surgery, no retraining. Compression ratio is tunable at runtime.

Decision matrix

| Task type | Sequence length | Recommended family |
| --- | --- | --- |
| Summarization, QA | Short (<32K) | Eviction (H2O, SnapKV) |
| Any task, self-hosted | Any | Quantization (KIVI, kvpress) |
| Long-range retrieval | Long (>64K) | Low-rank (ShadowKV) |
| Real-time chat | Unbounded | Streaming (StreamingLLM) |
| New deployment | Any | Architecture (GQA or MLA) |

The review (arXiv:2508.06297) is explicit: no method dominates. Quantization is the lowest-friction entry point because it stacks with eviction and is hardware-agnostic. Start there, measure, then layer eviction on top for tasks where attention concentration is high.

What this means for your token bill

KV cache compression does not reduce your input token count. You still pay for prompt tokens the same way. It reduces memory pressure at inference time, which relaxes the batch-size and context-length limits that cap throughput. The benefit shows up as lower latency and more work per GPU-hour, not as a lower per-token price.

If you are using the OpenAI or Anthropic APIs, KV compression is transparent to you: they manage it. If you self-host any open-weight model (Llama, Mistral, Gemma), KV cache compression is a first-order concern for production viability at scale.

Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX
@misc{kv-cache-compression-field-guide-2026,
  title  = {KV Cache Compression: A Field Guide for Practitioners},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://www.gotcontext.ai/blog/kv-cache-compression-field-guide},
  note   = {gotcontext.ai engineering blog.},
}
APA
James Hollingsworth. (2026, May 8). KV Cache Compression: A Field Guide for Practitioners. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/kv-cache-compression-field-guide.