Developer Builds AI Search Tool for 2 Million Pages of Epstein Documents
An independent developer has created an open-source AI system capable of searching and answering questions across a massive dataset of over 2 million pages related to the Jeffrey Epstein case. The project, built using Retrieval-Augmented Generation (RAG) technology, aims to make the sprawling document collection more accessible for research and analysis. The tool is publicly available on GitHub under an MIT license.

By Investigative Tech Desk
In a project blending cutting-edge artificial intelligence with one of the most scrutinized document troves in recent history, a developer has constructed a sophisticated search and analysis pipeline for the vast collection of materials known as the "Epstein Files." According to a post on the technical forum Reddit, the system processes over 2 million pages of news articles, legal documents, and other records, allowing users to pose complex questions and receive answers drawn directly from the source material.
The Technical Challenge of Scale
The developer, operating under the username Cod3Conjurer, described the undertaking as a passion project driven by the technical challenges of working at an immense scale. "I love playing around with RAG and AI, optimizing every layer to squeeze out better performance," the developer wrote. "Last night I thought: why not tackle something massive?"
The project, dubbed "EpsteinFiles-RAG," utilizes a technology known as Retrieval-Augmented Generation (RAG). This AI architecture allows large language models to pull specific, relevant information from a designated database—in this case, the 2-million-page dataset—before generating a response. This grounding reduces the risk of the AI "hallucinating," or inventing facts, by anchoring its answers in the provided documents.
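The retrieve-then-generate pattern described above can be sketched in a few lines of Python. This is an illustrative toy, not the project's actual code: the bag-of-words "embedding" and cosine ranking stand in for the trained embedding model and vector store a real RAG system would use.

```python
# Minimal sketch of the RAG pattern: retrieve relevant chunks first,
# then build a prompt that grounds the model in those chunks.
# The embedding here is a toy bag-of-words vector, used only to
# illustrate similarity-based retrieval.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Prepend retrieved context so the model answers from the
    supplied documents rather than from its training memory."""
    context = "\n---\n".join(retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In a production pipeline, `build_prompt`'s output would be sent to a large language model; the grounding comes entirely from the retrieval step that precedes generation.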
According to the project's documentation, the core technical accomplishments include building a full RAG pipeline with optimized data processing, cleaning and chunking the massive dataset, converting the text into searchable vectors, and implementing semantic search and Q&A functionalities. The entire codebase is written in Python and released under a permissive MIT license, inviting collaboration and scrutiny from other developers.
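The cleaning and chunking stages listed in the documentation might look something like the following. This is a hypothetical sketch: the function names and parameters are illustrative assumptions, not taken from the EpsteinFiles-RAG repository.

```python
# Hypothetical sketch of two documented pipeline stages: cleaning raw
# (often OCR'd) text and splitting it into overlapping chunks for
# vectorization. Parameter values are illustrative defaults.
import re

def clean(text: str) -> str:
    """Strip control characters and collapse runs of whitespace."""
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows so that an answer spanning
    a chunk boundary remains retrievable from at least one chunk."""
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Each chunk would then be passed to an embedding model and stored in a vector index; the overlap is a common design choice that trades a little storage for better recall at chunk boundaries.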
Navigating a Sensitive Dataset
The dataset powering the tool originates from a Hugging Face repository named "teyler/epstein-files-20k." It aggregates a wide array of public documents and news coverage related to the late financier Jeffrey Epstein, who was accused of operating a sex trafficking ring involving minors and whose associations with powerful figures in politics, business, and academia have been the subject of intense public interest and media scrutiny for years.
Major news organizations, including CNN, maintain dedicated online portals, such as "The Epstein Files" on CNN Politics, to archive and present their reporting on the unfolding story, court documents, and related investigations. The new AI tool effectively creates a parallel, machine-readable index of a much broader swath of material, potentially surfacing patterns and connections across millions of pages that no human researcher could review manually.
Implications for Research and Open Source Intelligence
The developer framed the project primarily as a technical exercise. "It’s trending, real-world data at scale, the perfect playground," they stated. "When you operate at scale, every optimization matters. This project lets me experiment with RAG architectures, data pipelines, and AI performance tuning on real-world workloads."
However, the release of such a tool has significant implications. For journalists, legal researchers, and open-source investigators, it could dramatically reduce the time required to locate specific names, events, or details within the colossal archive. The ability to ask natural language questions—such as "What documents mention visits to a particular island?" or "Which flight manifests fall within a specific date range?"—transforms a passive repository into an interactive query system.
Yet, the power of the tool also raises questions about verification and bias. The quality of the AI's answers is intrinsically linked to the completeness and accuracy of the underlying dataset. Furthermore, the system's outputs would require careful fact-checking against primary sources, a standard practice for any serious research.
Open Source and Future Development
By releasing the project as open-source software on GitHub, the developer has invited a community of peers to examine, improve, and adapt the code. The repository is already open for "ideas, optimizations, and technical discussions." This collaborative approach could lead to enhancements in search accuracy, processing speed, and user interface, potentially making powerful document analysis technology more accessible to newsrooms and research institutions with limited budgets.
The emergence of EpsteinFiles-RAG highlights a growing trend: the application of advanced AI and data engineering techniques to massive, complex real-world datasets. As these technologies become more democratized, they are poised to change how society investigates and understands large-scale information, from historical archives to leaked document dumps. The project stands as a testament to how a single developer, motivated by technical curiosity, can build a tool that intersects with some of the most consequential investigative narratives of our time.
The developer concluded their announcement with an open invitation for collaboration, signaling that this may be just the first iteration of a tool designed to bring unprecedented scale to document-driven inquiry.


