What Chunking Strategy Actually Matters for RAG Quality

The wrong question

Most RAG teams spend their chunking budget on the wrong problem.

The engineering question is usually: should we use fixed-size chunks, sentence chunks, or semantic chunks? The research question, per a January 2026 systematic analysis (arXiv:2601.14123), is different: why does retrieval fail, and which chunk design failure causes it?

The answer changes what you optimize.

What the research found

The paper (arXiv:2601.14123) systematically compared chunking strategies across multiple QA benchmarks. The key results:

Sentence chunking and semantic chunking perform similarly up to approximately 5,000 tokens of chunk size. Below that threshold, the additional complexity of semantic segmentation (embedding-based boundary detection, topic modeling, etc.) does not translate into measurable retrieval improvement. The simpler approach wins on implementation cost without losing on quality.

Chunk overlap is not a free lunch. Overlapping chunks (the common advice is 10-20% overlap to avoid splitting context) increase index size and retrieval computation without a consistent quality benefit. The paper found that overlap does not reliably improve answer quality on the benchmarks tested. You are paying storage and compute costs for marginal or zero gain.

Retrieval precision is the dominant variable. When the authors controlled for chunking strategy and varied retrieval quality (by injecting noise into the retrieved set), quality degraded far more steeply from retrieval precision changes than from chunking strategy changes. A 10% drop in retrieval precision costs more QA accuracy than switching from semantic to fixed-size chunking.

What this means in practice

Start with sentence chunking

Sentence chunking gives you natural semantic units without depending on a similarity model. Implementation is a single nltk.sent_tokenize() call or equivalent. Per the paper's finding, this matches semantic chunking performance for chunk sizes up to roughly 5,000 tokens.

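```python
import nltk
from typing import List

# Fetch the Punkt sentence tokenizer data.
# Newer NLTK releases may require "punkt_tab" instead of "punkt".
nltk.download("punkt", quiet=True)


def sentence_chunks(text: str, max_sentences: int = 5) -> List[str]:
    """Split text into sentences and group them into chunks of max_sentences."""
    sentences = nltk.sent_tokenize(text)
    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunk = " ".join(sentences[i:i + max_sentences])
        chunks.append(chunk)
    return chunks
```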

Group sentences into chunks of 3-7 sentences, depending on your embedding model's context window. Target 128-512 tokens per chunk for most embedding models.
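To target a token budget rather than a fixed sentence count, the grouping step can close a chunk when either limit is reached. The following is a minimal sketch under stated assumptions, not a definitive implementation: `approx_tokens` and `pack_sentences` are hypothetical names, and the whitespace word count is only a rough stand-in for your embedding model's tokenizer.

```python
from typing import List


def approx_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word is roughly one token.
    # Swap in your embedding model's tokenizer for real counts.
    return len(text.split())


def pack_sentences(
    sentences: List[str],
    max_tokens: int = 512,
    max_sentences: int = 7,
) -> List[str]:
    """Greedily group sentences until a token or sentence limit is hit."""
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = approx_tokens(sentence)
        over_budget = current_tokens + n > max_tokens
        full = len(current) >= max_sentences
        # Close the current chunk before it would exceed either limit.
        if current and (over_budget or full):
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The input is a list of sentences that has already been split, for example by nltk.sent_tokenize(). A single sentence longer than max_tokens still becomes its own chunk in this sketch; whether to split it further depends on your corpus.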

Skip the overlap

Do not add chunk overlap until you have evidence it helps your specific retrieval setup. Start at 0% overlap. Measure retrieval precision (do the right chunks come back for test questions?). If precision is high and answer quality is still low, overlap is unlikely to fix it — look upstream at query expansion or downstream at context assembly.
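To put a rough number on what overlap costs in index size, the sketch below counts sliding-window chunks for a corpus at different overlap settings. It assumes fixed-size token windows, which is a simplification of sentence-based chunking, and `chunk_count` is a hypothetical helper name. The corpus and chunk sizes are illustrative, not figures from the paper.

```python
import math


def chunk_count(total_tokens: int, chunk_size: int, overlap_fraction: float) -> int:
    """Number of sliding-window chunks needed to cover total_tokens."""
    if not 0 <= overlap_fraction < 1:
        raise ValueError("overlap_fraction must be in [0, 1)")
    if total_tokens <= chunk_size:
        return 1
    # Each new chunk advances by the stride: the part that does not overlap.
    stride = max(1, round(chunk_size * (1 - overlap_fraction)))
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


# Illustrative corpus: 1,000,000 tokens split into 500-token chunks.
baseline = chunk_count(1_000_000, 500, 0.0)
for overlap in (0.10, 0.20):
    n = chunk_count(1_000_000, 500, overlap)
    print(f"{overlap:.0%} overlap: {n} chunks, {n / baseline:.2f}x the baseline")
```

Under these assumptions the index grows by a factor of about 1 / (1 - overlap): roughly 1.11x at 10% overlap and 1.25x at 20%, with embedding and storage costs scaling accordingly.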

Fix retrieval before chunking

The paper's finding that retrieval precision dominates chunking strategy suggests a clear prioritization: if you have limited optimization effort, spend it on the retrieval layer first.

Practical ways to improve retrieval precision without changing chunks:

Reranking. A cross-encoder reranker (Cohere Rerank, BGE-Reranker, Jina Reranker) reads query + candidate chunk pairs and scores them jointly. Bi-encoder retrieval (standard embedding cosine search) misses relevance signals that are visible to cross-encoders. Adding a reranker to the top-50 bi-encoder results before passing the top-5 to the LLM is the highest-ROI single retrieval improvement in most RAG stacks.

HyDE (Hypothetical Document Embeddings). Generate a hypothetical answer to the query, embed it, and retrieve chunks similar to the hypothetical answer rather than the raw query. Works well when queries are short and chunks are long.

Query expansion. Paraphrase the query in 3-5 ways, retrieve for all, merge and deduplicate. Catches vocabulary mismatches between user phrasing and document phrasing.
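The three techniques above share one shape: produce a candidate list, then narrow it. The sketch below shows that shape with toy scorers so it runs without any model. All function names are hypothetical; `word_overlap` stands in for embedding cosine similarity, `phrase_match` stands in for a cross-encoder, and `generate` stands in for an LLM call. None of them reflect a specific library's API.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]  # (query, chunk) -> relevance score


def word_overlap(query: str, chunk: str) -> float:
    """Toy first-stage scorer; a real stack would use embedding similarity."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def phrase_match(query: str, chunk: str) -> float:
    """Toy joint scorer; a real stack would call a cross-encoder reranker."""
    return 1.0 if query.lower() in chunk.lower() else 0.0


def retrieve(query: str, chunks: List[str], score: Scorer = word_overlap,
             k: int = 50) -> List[str]:
    """First stage: score each chunk independently and keep the top k."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]


def rerank(query: str, candidates: List[str], joint_score: Scorer,
           k: int = 5) -> List[str]:
    """Second stage: re-score only the shortlisted candidates."""
    return sorted(candidates, key=lambda c: joint_score(query, c), reverse=True)[:k]


def hyde_retrieve(query: str, chunks: List[str],
                  generate: Callable[[str], str], k: int = 50) -> List[str]:
    """HyDE: search with a generated hypothetical answer, not the raw query."""
    return retrieve(generate(query), chunks, k=k)


def expanded_retrieve(paraphrases: List[str], chunks: List[str],
                      k: int = 50) -> List[str]:
    """Query expansion: retrieve per paraphrase, merge, and deduplicate."""
    merged: List[str] = []
    seen = set()
    for q in paraphrases:
        for chunk in retrieve(q, chunks, k=k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged


chunks = [
    "overlap costs storage",
    "chunk overlap is not free",
    "reranking improves precision",
]
candidates = expanded_retrieve(["chunk overlap", "overlapping windows cost"], chunks, k=2)
top = rerank("chunk overlap", candidates, phrase_match, k=1)
print(top)
```

Reranking against the original query after merging is one design choice among several. It keeps the paraphrases as a recall aid only, so the final ordering reflects what the user actually asked.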

Compress retrieved context before injection

Once you have retrieved the right chunks, you often have more content than the LLM needs. A 5-chunk result at 500 tokens each is 2,500 tokens of context. Semantic compression of retrieved chunks to the 20-30% of content most relevant to the specific query reduces the context window load and, per context rot research, improves answer quality by reducing noise in the attended context.
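As an illustration of query-focused compression, the sketch below keeps only the sentences of a chunk that share the most words with the query. It is an extractive toy under loud assumptions: `compress_chunk` is a hypothetical name, the word-overlap relevance score stands in for an embedding model or reranker, the regex sentence splitter is crude, and `keep_fraction` counts sentences rather than tokens.

```python
import math
import re


def compress_chunk(query: str, chunk: str, keep_fraction: float = 0.3) -> str:
    """Keep the most query-relevant sentences of a chunk, in original order."""
    # Crude sentence split on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", chunk.strip()) if s]
    if not sentences:
        return ""
    query_words = set(re.findall(r"\w+", query.lower()))

    def relevance(sentence: str) -> int:
        # Assumption: shared word count approximates relevance.
        return len(query_words & set(re.findall(r"\w+", sentence.lower())))

    n_keep = max(1, math.ceil(len(sentences) * keep_fraction))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: relevance(sentences[i]), reverse=True)
    # Restore document order so the compressed text still reads coherently.
    kept = sorted(ranked[:n_keep])
    return " ".join(sentences[i] for i in kept)


chunk = ("Overlap increases index size. The office is in Berlin. "
         "Lunch is at noon. Overlap rarely improves answer quality.")
print(compress_chunk("does overlap improve quality", chunk, keep_fraction=0.5))
```

With keep_fraction=0.5 the two off-topic sentences are dropped and the two overlap-related sentences remain, in their original order.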

The chunking decision tree

Given the research:

Default to sentence chunking (3-7 sentences per chunk, no overlap)

Add semantic chunking only if your corpus has extremely variable document structure (code + prose + tables mixed)

Use code-aware chunking for software corpora (split on function/class boundaries, not sentences)

Before optimizing chunks, measure retrieval precision: what fraction of your test queries return the chunk containing the correct answer in the top-3 results?

If precision is below 80%, fix retrieval (reranker, HyDE, query expansion) before touching chunking

If precision is high but answer quality is low, check context compression and prompt quality
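The measurement step in this checklist can be sketched as a top-k hit rate over a labeled test set, followed by the two branches above. The function names and data shapes are hypothetical: `retrieved` maps each test query to its ranked chunk IDs, and `gold` maps each query to the ID of the chunk that contains the correct answer. The final fallback branch is an assumption of this sketch, not something the checklist states.

```python
from typing import Dict, List


def hit_rate_at_k(retrieved: Dict[str, List[str]], gold: Dict[str, str],
                  k: int = 3) -> float:
    """Fraction of test queries whose gold chunk appears in the top k results."""
    if not gold:
        raise ValueError("gold must contain at least one labeled query")
    hits = sum(
        1 for query, chunk_id in gold.items()
        if chunk_id in retrieved.get(query, [])[:k]
    )
    return hits / len(gold)


def next_step(precision: float, answer_quality_ok: bool,
              threshold: float = 0.8) -> str:
    """Map the measurement onto the branches of the checklist."""
    if precision < threshold:
        return "fix retrieval (reranker, HyDE, query expansion)"
    if not answer_quality_ok:
        return "check context compression and prompt quality"
    # Assumption: neither branch fired, so no change is indicated.
    return "no change indicated"


retrieved = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z", "g2"]}
gold = {"q1": "b", "q2": "g2"}
precision = hit_rate_at_k(retrieved, gold, k=3)
print(precision, next_step(precision, answer_quality_ok=True))
```

In the toy data, the gold chunk for q2 sits at rank 4, so it does not count as a hit at k=3 and the measured rate falls below the 80% threshold.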

What this costs to change

Switching from semantic to sentence chunking on an existing corpus means a re-index. With a managed vector database (Pinecone, Weaviate, Qdrant) this is a batch embedding job — typically a few hours for medium-sized corpora. The operational disruption is real but bounded.

Not switching when your chunking is not the bottleneck costs you time on the wrong optimization. Measure retrieval precision first.


Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{what-chunking-strategy-matters-for-rag-2026,
  title  = {What Chunking Strategy Actually Matters for RAG Quality},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://www.gotcontext.ai/blog/what-chunking-strategy-matters-for-rag},
  note   = {gotcontext.ai engineering blog.},
}

APA

James Hollingsworth. (2026, May 8). What Chunking Strategy Actually Matters for RAG Quality. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/what-chunking-strategy-matters-for-rag.
