If you want to test a RAG system—or even just a single component like retrieval or generation—you need datasets that reflect real-world challenges. Precise metrics tell you not just whether one solution beats another, but when and why it works better. We specifically wanted to stress-test:
We explored several public datasets:
The pattern we kept seeing: our solutions would perform well on public datasets but deliver disappointing results on client data.
We needed datasets that better aligned with our goals. Public datasets often lack multi-hop queries with many steps and intensive reasoning—exactly what we encounter in production. Most also use chunk-based or coarse-grained annotations, with schemas that vary across datasets, limiting portability.
We built two internal evaluation datasets and one public dataset based on D\&D SRD 5.2.1. All three have high skill ceilings (multi-hop, reasoning, broad coverage). The key feature: a standardized char-based structure that makes the dataset chunk-agnostic.
Each dataset item contains:

- `id`
- `question`
- `answer`
- `passages`: a list of textual evidence, each with:
  - `document_path` (the markdown file from the KB)
  - `start_char`, `end_char` (intervals in the source text)
  - `content` (the extracted span)

Why char-based? Character intervals are independent of your chunking strategy. Whatever splitter you use downstream, the spans remain valid—you can verify whether a chunk contains the necessary content.
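Concretely, one item might look like this (the field names follow the schema above; the values, including the character offsets, are illustrative):

```python
item = {
    "id": "dnd-easy-0001",
    "question": "What is the casting time of the Fireball spell?",
    "answer": "One action.",
    "passages": [
        {
            "document_path": "srd/spells.md",     # markdown file in the KB
            "start_char": 10450,                  # interval in the source text
            "end_char": 10472,
            "content": "Casting Time: 1 action",  # the extracted span
        }
    ],
}
```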
Our KB is a collection of markdown files parsed from PDFs. The start_char/end_char references point to normalized text, maximizing reproducibility and portability.
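A minimal sketch of the chunk-agnostic property (the function and variable names here are ours, not part of the dataset): given any chunker's output as character intervals, you can check whether a chunk fully contains an evidence span.

```python
def chunk_covers_passage(chunk_start, chunk_end, passage_start, passage_end):
    """True if the chunk's [start, end) interval fully contains the passage span."""
    return chunk_start <= passage_start and passage_end <= chunk_end

# Whatever splitter produced these chunks, the dataset spans stay valid:
chunks = [(0, 500), (450, 1000), (950, 1500)]  # (start_char, end_char) per chunk
passage = (470, 640)                           # evidence span from a dataset item

retrievable = any(chunk_covers_passage(cs, ce, *passage) for cs, ce in chunks)
```

The same check works for overlapping, fixed-size, or semantic chunks, which is exactly why the annotations survive a change of splitter.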
For easy questions, we automate: randomly select a markdown document from the KB and ask an LLM to generate a focused question about that document. At each iteration, we pass previously generated questions for that file to avoid duplicates.
This produces focused single-document questions, but not cross-document queries.
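The easy-question loop can be sketched as follows; `ask_llm` is a hypothetical stand-in for the real LLM call.

```python
import random

def generate_easy_questions(kb_files, ask_llm, n_questions=5):
    """Sketch of the easy-question generation loop.

    kb_files maps path -> document text; ask_llm(document_text,
    previous_questions) -> question stands in for the LLM call.
    """
    questions = {}  # path -> questions already generated for that file
    for _ in range(n_questions):
        path = random.choice(sorted(kb_files))
        previous = questions.setdefault(path, [])
        # pass the history so the model avoids duplicates for this file
        previous.append(ask_llm(kb_files[path], previous))
    return questions
```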
Medium questions require domain experts from the start. We want cross-document questions and non-trivial reasoning. Passing the entire KB to an LLM isn't always realistic and tends to produce questions that are too easy or not useful. Here, experts write questions from scratch. We chose D\&D because we have expertise in it (we also maintain two internal NDA-covered datasets).
Given a question and a source document, we call an LLM to produce the answer and identify the necessary passages with their spans. Each question-answer-passages triplet goes through quality control by a human expert who can accept, reject, or correct it.
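A minimal sketch of this step, assuming a hypothetical `llm_extract` call that returns an answer plus character spans:

```python
def annotate(question, document_path, document_text, llm_extract):
    """llm_extract(question, text) -> (answer, [(start, end), ...]) stands in
    for the real LLM call; we map its spans onto the dataset schema."""
    answer, spans = llm_extract(question, document_text)
    passages = [
        {
            "document_path": document_path,
            "start_char": start,
            "end_char": end,
            # content is derived from the span, so it always matches the source
            "content": document_text[start:end],
        }
        for start, end in spans
    ]
    return {"question": question, "answer": answer, "passages": passages}
```

Each returned triplet is what the human expert then accepts, rejects, or corrects.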
Unlike "easy" questions, where answers can be found in a single document with substantially contiguous text, creating a medium-difficulty dataset required us to imagine scenarios where the necessary information is scattered.
We identified two main scenarios that characterize this medium difficulty:

- **Multi-hop questions**, where the answer requires a chain of reasoning across multiple documents: for example, first finding a general rule in one document, then a specific exception in another, and finally applying both to a particular case described in a third.
- **Wide questions**, where the complete answer requires aggregating information from multiple documents without necessarily following a complex reasoning chain, instead gathering scattered pieces of information that all contribute to the final answer.
The general generation loop (medium questions only):
For medium questions, we separated these two problem types, each with its own strategy. In both cases, if the answer isn't satisfactory, we enter an iterative loop of expert suggestions until we reach acceptable quality.
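The iterative loop can be sketched like this; both callables are hypothetical stand-ins for the generation pipeline and the human expert.

```python
def refine_until_accepted(question, generate_answer, expert_review, max_rounds=5):
    """generate_answer(question, hints) -> candidate item and
    expert_review(item) -> (accepted, hint) stand in for the pipeline
    and the domain expert."""
    hints = []
    for _ in range(max_rounds):
        item = generate_answer(question, hints)
        accepted, hint = expert_review(item)
        if accepted:
            return item
        hints.append(hint)  # feed the expert's suggestion into the next attempt
    return None  # never reached acceptable quality: reject the question
```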
The goal was to create a Claude Skill that mimicked the behavior of a highly evolved RAG system, prioritizing answer quality over cost and latency. High cost and latency are hard to accept in many production applications, but they matter far less during dataset creation. The skill operates on our complete knowledge base in markdown (for the D\&D dataset, this consists of 20 markdown files extracted from SRD 5.2.1). The domain expert provides a custom-crafted question, and the skill executes the following steps:
For each question, the steps are:
This approach is also expensive, but:
We excluded excessively generic or ambiguous questions from the dataset, such as "When can I use ability X instead of Y?". These types of questions tend to generate unsatisfactory answers with both approaches (Claude Skills and LLM Retriever), as they often require contextual interpretation or admit multiple valid answers, making them unsuitable for objective evaluation.
Using this framework, we built a public dataset based on D\&D SRD 5.2.1, which covers a subset of fifth edition D\&D content (Player's Handbook, DM Guide, Monster Manual). The SRD is released under Creative Commons, so we can make it public. The dataset follows the structure above, is chunk-agnostic thanks to char-based annotations, and includes both easy and medium questions.
Domain experts perform quality control in three stages:
Are the start_char/end_char intervals aligned and reproducible in the KB files? If validation fails, we reject the item or (for medium questions) revise it with additional hints.
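The span check itself is simple. A sketch, assuming `kb` maps each `document_path` to its normalized markdown text:

```python
def validate_item(item, kb):
    """Return a list of error strings; an empty list means the item's
    passage spans reproduce exactly in the KB files."""
    errors = []
    for p in item["passages"]:
        text = kb.get(p["document_path"])
        if text is None:
            errors.append(f"missing document: {p['document_path']}")
            continue
        if text[p["start_char"]:p["end_char"]] != p["content"]:
            errors.append(f"span mismatch in {p['document_path']}")
    return errors
```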
The complete dataset, along with the code for generation and validation, is publicly available in the project's GitHub repository and on HuggingFace.
We've presented a practical methodology for building evaluation datasets for RAG pipelines: char-based structure, difficulty levels, semi-automated generation with human oversight, and specialized tools for multi-hop and wide questions. We're releasing a public dataset based on D\&D SRD 5.2.1 to enable transparent, reproducible evaluations.
Coming soon: a dedicated post on Anthropic’s Contextual Retrieval, where we'll use these datasets to evaluate improvements over classical retrieval approaches.
***
*Raul Singh -* GitHub - LinkedIn - AI R\&D Engineer - Datapizza
*Ling Xuan “Emma” Chen -* GitHub - LinkedIn - AI R\&D Engineer - Datapizza
*Francesco Foresi -* GitHub - LinkedIn - GenAI R\&D Team Lead - Datapizza