QED-Nano: Hugging Face’s 4B Model Solves Olympiad Math Without External Tools
Hugging Face has released QED-Nano, a compact 4-billion-parameter AI model that matches the performance of far larger systems at constructing Olympiad-level mathematical proofs, entirely in natural language and without relying on formal proof assistants like Lean. The breakthrough, powered by reinforcement learning and a novel agent scaffold, makes frontier mathematical reasoning accessible on consumer hardware.

In a landmark development at the intersection of artificial intelligence and mathematical reasoning, Hugging Face has unveiled QED-Nano, a 4-billion-parameter language model capable of autonomously generating rigorous proofs for International Mathematical Olympiad (IMO)-level problems. Unlike previous systems that depend on formal verification frameworks such as Lean or Coq, QED-Nano operates entirely in natural language, leveraging a sophisticated agent scaffold to decompose complex problems into manageable sub-tasks—achieving performance parity with models over four times its size.
According to the official blog post published on Hugging Face Spaces, QED-Nano was trained using reinforcement learning (RL) on a proprietary dataset called FineProofs-RL, with reward functions derived from structured proof rubrics rather than binary correctness signals. This approach allows the model to learn not just whether a proof is correct, but how to construct it with logical coherence and stylistic precision—mimicking the reasoning patterns of human mathematicians. The model’s smaller footprint—just 4B parameters—makes it uniquely suited for deployment on consumer-grade hardware, a stark contrast to frontier models like Gemini 3 Pro, which require massive cloud infrastructure.
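The rubric-based reward idea can be illustrated with a minimal sketch. The dimension names, weights, and function below are illustrative assumptions, not Hugging Face's actual FineProofs-RL implementation; the point is that each rubric dimension contributes partial credit rather than a single pass/fail signal.

```python
# Hypothetical sketch of a rubric-based reward: each dimension is scored
# in [0, 1] and combined into one scalar, instead of a binary
# correct/incorrect signal. Names and weights are illustrative only.
from dataclasses import dataclass

@dataclass
class RubricScores:
    logical_flow: float      # are steps justified in order?
    lemma_selection: float   # were appropriate lemmas invoked?
    clarity: float           # is the argument readable?

def rubric_reward(scores: RubricScores,
                  weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted combination of rubric dimensions into a scalar RL reward."""
    dims = (scores.logical_flow, scores.lemma_selection, scores.clarity)
    return sum(w * s for w, s in zip(weights, dims))

# A proof with sound logic but mediocre clarity still earns partial
# credit, which gives the RL phase a denser learning signal.
reward = rubric_reward(RubricScores(0.9, 0.8, 0.5))
```

Under a binary signal, both a near-miss and a nonsense proof would score zero; a weighted rubric like this preserves the gradient between them.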
The innovation lies in QED-Nano’s test-time compute architecture. While most AI proof systems are constrained by token limits during inference, QED-Nano employs an agent scaffold that dynamically expands its reasoning context to over one million tokens per proof. This enables iterative refinement: the model can generate multiple proof attempts, evaluate them internally, backtrack from dead ends, and synthesize a final, validated argument—all without external tools. This scalability at inference time, combined with its small size, results in a system that achieves comparable accuracy to state-of-the-art models while being approximately four times more cost-efficient.
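The generate-evaluate-backtrack loop described above can be sketched as a simple scaffold. The orchestration logic below is a hypothetical reconstruction, with `generate` and `evaluate` as stand-ins for the actual model calls; the real system additionally manages a million-token reasoning context, which is omitted here.

```python
# Hypothetical sketch of the test-time loop: generate an attempt, score
# it internally, feed failure notes back into the next attempt, and
# keep the best proof found. Not Hugging Face's actual implementation.
def refine_proof(problem, generate, evaluate,
                 max_attempts=4, threshold=0.9):
    """Iterative refinement scaffold.

    `generate(problem, feedback)` and `evaluate(proof)` stand in for
    model calls; this function only orchestrates them."""
    best_proof, best_score = None, float("-inf")
    feedback = []  # notes on failed attempts, reused when regenerating
    for _ in range(max_attempts):
        proof = generate(problem, feedback)
        score = evaluate(proof)
        if score > best_score:
            best_proof, best_score = proof, score
        if score >= threshold:
            break  # internally judged good enough: stop early
        feedback.append(f"attempt scored {score:.2f}; revise weak steps")
    return best_proof, best_score

# Toy stand-ins: successive attempts score progressively higher.
attempt_scores = iter([0.4, 0.7, 0.95])
proof, score = refine_proof(
    "Show n^2 >= n for all integers n >= 1",
    generate=lambda p, fb: f"proof v{len(fb) + 1}",
    evaluate=lambda pr: next(attempt_scores),
)
```

The loop terminates on the third attempt once the internal score clears the threshold, mirroring the article's description of iterative refinement without any external verifier.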
QED-Nano builds on the conceptual foundation of earlier models like VibeThinker 1.5B, but takes a more structured, reward-guided approach. The training pipeline involved two phases: supervised fine-tuning (SFT) on the FineProofs-SFT dataset, followed by RL fine-tuning using human-annotated proof rubrics as reward signals. These rubrics evaluated elements such as logical flow, lemma selection, and clarity—turning theorem proving into a multi-dimensional optimization problem. The result is a model that doesn’t just memorize proof patterns but learns to reason compositionally.
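The two-phase pipeline can be summarized in code. Everything below is an illustrative stub, not Hugging Face's training API: the function names, model representation, and toy data are assumptions made only to show the SFT-then-RL ordering.

```python
# Hypothetical sketch of the two-phase pipeline: supervised fine-tuning
# on worked proofs (imitation), then RL fine-tuning against a rubric
# reward (optimization). All names and stubs are illustrative.
def supervised_finetune(model, proofs):
    """Phase 1: imitate human-written proof demonstrations (stub)."""
    return {**model, "phase": "sft", "examples_seen": len(proofs)}

def rl_finetune(model, episodes, reward_fn):
    """Phase 2: optimize the multi-dimensional rubric reward (stub)."""
    avg_reward = sum(reward_fn(e) for e in episodes) / len(episodes)
    return {**model, "phase": "rl", "avg_reward": avg_reward}

base = {"name": "qed-nano-base"}
sft_model = supervised_finetune(base, ["proof_1", "proof_2"])
final = rl_finetune(sft_model, [0.6, 0.8], reward_fn=lambda r: r)
```

The ordering matters: SFT first anchors the model in valid proof style, so the RL phase optimizes rubric dimensions from a competent starting point rather than from scratch.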
Open-source accessibility is central to the project’s mission. Hugging Face has released both the QED-Nano and QED-Nano-SFT models, along with the training datasets (FineProofs-SFT and FineProofs-RL) and full evaluation code on GitHub under the CMU-AIRe organization. This transparency invites academic and industrial researchers to reproduce results, adapt the architecture, or extend it to other domains such as computer science or physics proofs.
On LinkedIn, one of the project’s contributors described QED-Nano as “the smallest theorem proving model to date,” emphasizing its potential to democratize access to high-level mathematical reasoning. For educators, this could mean AI-powered tutoring systems that guide students through proof construction in real time. For researchers, it offers a new benchmark for evaluating the emergent reasoning capabilities of language models.
While QED-Nano does not yet solve every problem on the IMO-ProofBench, its performance on the most challenging subset rivals that of much larger models. Its success suggests that parameter count alone is no longer the primary determinant of reasoning capability—architectural ingenuity and training methodology may be more decisive. As AI continues to blur the lines between pattern recognition and genuine logical deduction, QED-Nano stands as a compelling case study: sometimes, less is more.


