Overview
As the legal community increasingly examines the use of large language models (LLMs) for various legal applications, legal AI developers have turned to retrieval-augmented LLMs (“RAG” systems) to improve system performance and robustness. An obstacle to the development of specialized RAG systems is the lack of realistic legal RAG benchmarks that capture the complexity of both legal retrieval and downstream legal question-answering. To address this, we introduce two novel legal RAG datasets: Bar Exam QA and Housing Statute QA. Below, we offer a brief description of each dataset; we encourage you to read the paper for more details and to experiment with the datasets. If you have questions, please reach out to us!
Bar Exam QA
Bar Exam QA is a dataset of multistate bar exam (MBE) questions designed for retrieval-augmented question answering in the legal domain. It consists of questions from historical bar exams released by the National Conference of Bar Examiners (NCBE) and practice bar exams from the Barbri MBE test preparation workbook. Each example contains a novel legal scenario, a specific legal question about the scenario, four answer choices, and gold explanation passages that support the correct answer. For questions from Barbri practice exams, the gold passages come directly from the answer key, while for historical bar exams, law students hand-annotated each example with supporting passages through a process modeled after real legal research. The dataset includes approximately 2,000 question-answer pairs with labeled gold passages. The retrieval passage pool contains around 900K passages, consisting of gold passages, U.S. caselaw from 2019-2021 (segmented at paragraph level), and legal encyclopedia entries from Cornell Law School's Legal Information Institute.
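To make the structure concrete, here is what a single Bar Exam QA example might look like. The field names and values below are illustrative placeholders, not the dataset's actual schema.

```python
# Illustrative sketch of one Bar Exam QA example.
# Field names and values are hypothetical placeholders, not the released schema.
bar_exam_example = {
    "scenario": "<novel legal fact pattern>",
    "question": "<specific legal question about the scenario>",
    "choices": {
        "A": "<option A>",
        "B": "<option B>",
        "C": "<option C>",
        "D": "<option D>",
    },
    "answer": "B",
    "gold_passages": [
        "<passage from the answer key or a law-student annotation supporting the answer>",
    ],
}
```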
Housing Statute QA
Housing Statute QA is a dataset covering statutory housing law across 50+ U.S. jurisdictions. Created by adapting the Legal Services Corporation (LSC) Eviction Laws Database, it provides questions about housing and eviction law with binary (Yes/No) answers. Each sample contains a legal question about housing law in a specific state, the corresponding answer, and the relevant supporting statutes. The dataset was created by legally trained researchers who manually searched the housing laws of each jurisdiction to answer these questions, simulating the real-world legal research process. The released version consists of two splits: a primary evaluation set of 6,853 question-answer pairs with labeled supporting statutes, and a larger set of 9,297 additional question-answer pairs intended purely for evaluating LLM knowledge. The retrieval passage pool contains approximately 2 million state statute passages compiled from Justia's 2021 database of state laws.
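As a rough picture of how the two stages fit together, the sketch below evaluates answer accuracy on Housing Statute QA with a simple retrieve-then-read loop. The `retriever` and `llm` objects and the field names are hypothetical stand-ins, not the paper's exact setup.

```python
# Minimal retrieve-then-read evaluation sketch for Housing Statute QA.
# `retriever.search` and `llm.answer` are hypothetical stand-ins; field names are assumed.
def evaluate_housing_qa(examples, retriever, llm, k=10):
    correct = 0
    for ex in examples:
        # Retrieve top-k candidate statute passages for the state-specific question.
        passages = retriever.search(ex["question"], top_k=k)
        prompt = (
            "Context:\n" + "\n\n".join(passages)
            + f"\n\nQuestion: {ex['question']}\nAnswer Yes or No."
        )
        prediction = llm.answer(prompt)
        # Compare the model's Yes/No prediction against the gold answer.
        correct += int(prediction.strip().lower().startswith(ex["answer"].strip().lower()))
    return correct / len(examples)
```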
Takeaways
We're particularly excited about these datasets because they allow us to study each stage of the RAG pipeline: both retriever performance and downstream question-answering performance. Beyond RAG, we also think this is a useful benchmark for measuring legal reasoning over statutes and other legal resources. We encourage you to read the paper for more details, but here are some highlights:
- Low lexical similarity challenges traditional retrievers: Both datasets feature much lower lexical similarity between queries and relevant passages compared to existing benchmarks, making them particularly challenging for traditional retrieval methods like BM25.
- Structured reasoning improves retrieval performance: Query expansion with structured legal reasoning rollouts significantly improves retrieval performance, with gains of up to 10 percentage points in Recall@10 for lexical retrievers (see the sketch after this list).
- Challenging even for advanced models: The performance gap between current retrievers and perfect retrieval highlights significant room for improvement in legal retrieval systems.
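To illustrate the query-expansion idea referenced above, here is a minimal sketch that appends a generated reasoning rollout to the query before BM25 retrieval and scores the result with Recall@10. It assumes the `rank_bm25` package and a hypothetical `generate_reasoning` helper that wraps an LLM call; the paper's exact prompting and retrieval setup may differ.

```python
# Minimal sketch: query expansion with a generated legal-reasoning rollout, then BM25.
# Assumes `pip install rank-bm25`; `generate_reasoning` is a hypothetical LLM wrapper.
from rank_bm25 import BM25Okapi

def retrieve_with_expansion(question, passages, generate_reasoning, k=10):
    # Append an LLM-generated reasoning sketch so the query's vocabulary overlaps
    # more with the relevant legal passages (the low-lexical-similarity problem above).
    expanded_query = question + " " + generate_reasoning(question)
    bm25 = BM25Okapi([p.lower().split() for p in passages])
    scores = bm25.get_scores(expanded_query.lower().split())
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]  # indices of the top-k passages

def recall_at_k(retrieved_indices, gold_indices, k=10):
    # Fraction of gold passages found among the top-k retrieved results.
    return len(set(retrieved_indices[:k]) & set(gold_indices)) / max(len(gold_indices), 1)
```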
Acknowledgments
We thank Maura Carey for annotating gold passages for the Bar Exam QA dataset. We thank Yvonne Hong for research assistance in processing the Bar Exam QA dataset. We thank Isaac Cui, Olivia Martin, and Catherina Xu for piloting early iterations of the retrieval method. We are grateful to Varun Magesh, Faiz Surani, Suvir Mirchandani, Isabel Gallegos, Jihyeon Je, Chenglei Si, and Aryaman Arora for helpful discussion. LZ is supported by the Stanford Interdisciplinary Graduate Fellowship (SIGF). NG is supported by the Stanford Interdisciplinary Graduate Fellowship (SIGF) and the HAI Graduate Fellowship.
This work is dedicated to Andrea Vallebueno, in loving memory. Andrea was a dear friend and labmate. She had a brilliant, warm spirit with a special gift for research and teaching others. Her light overflowed onto each person in her life and inspired so, so many, close and far. We remember and hope to carry on the legacy of her life, the dignity and respect with which she treated every person she encountered, her welcoming and inclusive nature, and her passion for her research on computational methods for addressing socially impactful problems.
BibTeX
@inproceedings{zheng2025,
author = {Zheng, Lucia and Guha, Neel and Arifov, Javokhir and Zhang, Sarah and Skreta, Michal and Manning, Christopher D. and Henderson, Peter and Ho, Daniel E.},
title = {A Reasoning-Focused Legal Retrieval Benchmark},
year = {2025},
series = {CSLAW '25 (forthcoming)}
}