Back in September 2024, Anthropic introduced Contextual Retrieval, a technique for improving retrieval quality in RAG systems. It addresses a common problem: when you extract chunks from a document, they often lose critical contextual information once separated from their original surroundings.
Here’s a concrete example from our D&D dataset. Imagine a martial weapons table gets split across multiple chunks, leaving one chunk with just:
```
| Battleaxe | 1d8 Slashing | Versatile (1d10) | Topple | 4 lb. | 10 GP |
| Flail | 1d8 Bludgeoning | — | Sap | 2 lb. | 10 GP |
| Longsword | 1d8 Slashing | Versatile (1d10) | Sap | 3 lb. | 15 GP |
| Maul | 2d6 Bludgeoning | Heavy, Two-Handed | Topple | 10 lb. | 10 GP |
```
This chunk has all the stats for the maul, but it’s missing the table header that tells you these are Martial Melee Weapons. A semantic similarity search might completely miss this chunk when a user asks “Which martial weapon deals the most damage?”
Contextual Retrieval tackles this by enriching each chunk with a brief description that situates it within its source document. An LLM generates this context by looking at both the individual chunk and the full document, producing an accurate contextual summary. This context gets prepended to the chunk before embedding. We’ll dig into the details later in this post.
Contextual retrieval shines when chunks lose essential meaning once separated from their source document.
Anthropic reported significant retrieval improvements, especially when combined with reranking.
It’s been over a year since that original post, which got us thinking: does Contextual Retrieval still hold up in 2026?
A lot has changed. The model landscape has evolved dramatically. Rerankers in particular have come a long way. Models like Cohere Rerank v4 Pro and newer open-source options from the Qwen 3 family deliver substantially better performance than what was available a year ago.
This raised an interesting question: how does Contextual Retrieval perform alongside modern rerankers and on challenging datasets? Does enriching chunks with context still provide meaningful gains? We ran the experiments to find out.
We ran our experiments on three datasets: the D&D SRD plus two private datasets (Private Dataset 1 and Private Dataset 2).
As we explained in the previous blog post, our dataset’s character-based structure lets us precisely determine which chunks are needed to answer each question, regardless of chunking strategy.
Contextual Retrieval modifies the ingestion phase: before inserting chunks into the vector store, you need to enrich them with contextual information that anchors them within the original document. This way, the embedding captures both the chunk content and its broader context.
We built our ingestion using the IngestionPipeline from Datapizza AI. This modular architecture lets us chain different processing steps as independent components: chunking, contextualization, embedding, and upserting to a vector store. We added custom components for LLM-based context generation.
Our pipeline expects documents in markdown format. For PDF-to-markdown conversion, see the preprocessing steps in our rag-dataset-builder.
For the D&D SRD dataset, we kept things simple with character-based chunking:
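A minimal sketch of such a chunker — the specific chunk size and overlap values below are illustrative, not the exact settings from our pipeline:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks with overlapping windows."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

Character-based chunking ignores document structure, which is exactly why a chunk can end up stranded without its table header — the problem contextualization is meant to fix.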
This is where the magic happens. For each chunk, we feed the LLM both the full source document and the chunk itself.
The LLM generates a brief context that we prepend to the chunk like this:
```
CONTEXT: [generated context]
CONTENT: [original chunk text]
```
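A minimal sketch of the enrichment step, assuming the context string has already been generated by the LLM:

```python
def enrich_chunk(context: str, chunk: str) -> str:
    """Prepend the LLM-generated context to the chunk text before embedding."""
    return f"CONTEXT: {context}\nCONTENT: {chunk}"
```

The enriched string — not the bare chunk — is what gets embedded, so the vector captures both the content and its place in the document.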
Anthropic’s prompt (from their original blog post) looks like:
```
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within
the overall document for the purposes of improving search retrieval
of the chunk. Answer only with the succinct context and nothing else.
```
Our approach differs in a few ways: we batch multiple chunks into a single request and add domain-specific instructions:
```
<document>
{{ whole_document }}
</document>
You are an expert on the Dungeons & Dragons 5.2.1 System Reference Document.
Your task is to contextualize text chunks for improved semantic search retrieval.
For each chunk below:
{% for chunk in chunks %}
<chunk id="{{ chunk.id }}">
{{ chunk.text }}
</chunk>
{% endfor %}
```
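The template above uses Jinja syntax; a plain-Python stand-in that assembles the same batched prompt (the function name and chunk dict shape are our own illustration):

```python
def build_batched_prompt(whole_document: str, chunks: list[dict]) -> str:
    """Assemble the batched contextualization prompt: full document first,
    then task instructions, then one <chunk> element per chunk in the batch."""
    parts = [
        f"<document>\n{whole_document}\n</document>",
        "You are an expert on the Dungeons & Dragons 5.2.1 System Reference Document.",
        "Your task is to contextualize text chunks for improved semantic search retrieval.",
        "For each chunk below:",
    ]
    for chunk in chunks:
        parts.append(f'<chunk id="{chunk["id"]}">\n{chunk["text"]}\n</chunk>')
    return "\n".join(parts)
```

Batching matters for cost: the whole document is sent once per request instead of once per chunk.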
These domain-specific prompts help the LLM generate more accurate and relevant contexts.
With batching and Gemini 2.5 Flash, contextualizing the entire D&D SRD dataset cost around $0.63.
We embed all chunks (with or without context) using Cohere embed-v4.0 (1536 dimensions).
After embedding, we load the chunks into Qdrant, our vector database of choice. Each chunk is stored with:
We set up two parallel pipelines for our experiments:
<!--IMG:0--><!--IMG:1-->
The only difference is the contextualization step, which lets us isolate its impact on performance.
We measured retrieval quality using recall.
In our setup, recall is the fraction of chunks needed to answer a query that the system actually retrieves. If a question requires 4 chunks and we retrieve 3 of them, that’s 0.75 recall.
To get a single metric for each dataset, we average recall across all questions at a fixed value of k (the number of chunks retrieved per query). With k=10, for example, the system returns the top 10 chunks by similarity, and we measure how many of those are actually needed to answer each question.
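In code, the metric is straightforward; a minimal sketch:

```python
def recall_at_k(retrieved: list[str], required: set[str], k: int) -> float:
    """Fraction of the chunks required to answer a query that appear
    in the top-k retrieved chunks."""
    hits = required.intersection(retrieved[:k])
    return len(hits) / len(required)

def mean_recall(results: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, required_ids) pairs,
    one pair per question."""
    return sum(recall_at_k(r, g, k) for r, g in results) / len(results)
```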
Our first experiment directly compares base retrieval against contextual retrieval, without any reranker. We tested k = 5, 10, and 20.
Embeddings were generated with Cohere embed-v4.0 (1536 dimensions), with similarity search on Qdrant for retrieval.
The charts below show recall (y-axis) across k = 5, 10, 20 (x-axis), comparing base retrieval against contextual retrieval.
For easy questions, we have results from two datasets: D&D SRD and Private Dataset 2. Private Dataset 1 doesn’t include easy questions.
For medium questions, we have results across all three datasets: D&D SRD, Private Dataset 1, and Private Dataset 2.
Each chart includes one subplot per dataset for direct comparison.

<!--IMG:5-->

The table below shows the recall gain from contextual retrieval over the baseline (recall_contextual − recall_base) at different k values:

Something interesting emerges here: the biggest improvement happens at k=5, while k=10 shows minimal or even slightly negative gains.
One possible explanation: contextual retrieval may surface more relevant chunks overall. But with k capped at 10, some of these additional relevant chunks get cut off, reducing the advantage over the baseline.
To test this hypothesis, we plotted the average recall gain for every k from 1 to 20:

The graph confirms our hunch: for both Easy and Medium tiers, the gain bottoms out around k=10 and peaks at low k values. This suggests contextual retrieval is especially effective at pushing the most relevant chunks to the top of the ranking.
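The sweep behind this plot can be reproduced with a small helper; the run format below (parallel lists of retrieved/required chunk IDs per question) is our own illustration:

```python
def recall_at_k(retrieved, required, k):
    """Fraction of required chunks found in the top-k retrieved."""
    return len(required.intersection(retrieved[:k])) / len(required)

def recall_gain_curve(base_runs, ctx_runs, k_max=20):
    """Average recall gain (contextual - base) for each k in 1..k_max.
    base_runs/ctx_runs: parallel lists of (retrieved_ids, required_ids)
    pairs, one per question."""
    gains = []
    for k in range(1, k_max + 1):
        base = sum(recall_at_k(r, g, k) for r, g in base_runs) / len(base_runs)
        ctx = sum(recall_at_k(r, g, k) for r, g in ctx_runs) / len(ctx_runs)
        gains.append(ctx - base)
    return gains
```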
We added a reranker to the pipeline for two reasons:
Here’s the flow:
Unlike embedders that work with vector representations, rerankers analyze text directly, enabling more fine-grained relevance judgments.
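The retrieve-then-rerank flow can be sketched as follows; both callables are hypothetical stand-ins, not the actual vector-store or reranker clients:

```python
def retrieve_then_rerank(query, vector_search, rerank_score,
                         n_candidates=50, k=20):
    """Two-stage retrieval: pull n_candidates by vector similarity, then
    keep the top-k after scoring each candidate's raw text against the
    query with the reranker."""
    candidates = vector_search(query, n_candidates)  # list of chunk texts
    scored = sorted(candidates,
                    key=lambda c: rerank_score(query, c),
                    reverse=True)
    return scored[:k]
```

The first stage keeps latency low by narrowing the pool; the second stage spends the reranker's per-pair compute only on that shortlist.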
We evaluated two rerankers: Cohere Rerank v4 Pro and an open-source reranker from the Qwen 3 family.
We set up the open-source model as follows:
1. Deployed on Scaleway with an L4 GPU (24GB VRAM)
2. Ran a vLLM container to serve the model
3. Connected via an OpenAI-compatible API using Datapizza AI

This gives us a self-hosted reranker with low costs and no external API dependencies.
Here are the medium-difficulty recall results with k=20:

<!--IMG:6-->
We also applied the same analysis from Experiment 1 to examine recall gains (recall_contextual − recall_base) across all k values from 1 to 20:

With a reranker in the mix, the dip around k=10 disappears. Instead, we see peak improvement around k=5, with gains tapering off at higher k values. This makes sense: as k grows, there’s less room for contextual retrieval to add value.
After running these experiments, here’s what we took away:
Use it when:
Skip it if:
Several directions look promising:
All the code for these experiments is available in our GitHub repository.
***
*Raul Singh —* GitHub — LinkedIn — AI R&D Engineer @ Datapizza
*Ling Xuan “Emma” Chen —* GitHub — LinkedIn — AI R&D Engineer @ Datapizza
*Francesco Foresi —* GitHub — LinkedIn — GenAI R&D Team Lead @ Datapizza