Qwen Team Exposes Critical Flaws in GPQA and HLE AI Benchmark Datasets
The Qwen research team has published a peer-reviewed paper confirming widespread data quality issues in GPQA and Humanity's Last Exam (HLE), two widely used benchmarks for evaluating advanced AI reasoning. Independent investigations had previously flagged incorrect answers and flawed question design, raising urgent concerns about the validity of current AI evaluation standards.

In a landmark revelation for the artificial intelligence community, the Qwen research team has formally confirmed severe structural and factual flaws in two of the most influential benchmark datasets used to evaluate advanced AI reasoning: the Graduate-Level Google-Proof Q&A (GPQA) dataset and Humanity’s Last Exam (HLE). Their findings, detailed in the paper "HLE-Verified: A Rigorously Audited Benchmark for AI Reasoning Evaluation," published on arXiv (arXiv:2602.13964v2), validate earlier concerns raised by independent researchers and signal a potential crisis in how AI performance is measured.
The paper’s introduction bluntly states that a significant portion of HLE’s questions contain "fundamentally broken" premises, ambiguous phrasing, or incorrect "gold standard" answers—meaning the so-called correct responses provided by dataset creators are themselves factually wrong. In one striking example, a question asking for the chemical formula of a well-documented compound was marked with an answer that contradicted established IUPAC nomenclature. Similar issues were found in GPQA, where domain experts in physics and biology independently verified that AI models were generating correct, evidence-based answers only to be penalized because the benchmark’s labeled answer was erroneous.
These revelations follow an independent investigation by a researcher using the pseudonym "DeepSeek-Overclock," who documented in a Reddit thread that their experimental AI model consistently derived correct answers through first-principles reasoning, only to be scored as incorrect because the reference answers themselves were flawed. Their Python-based forensic audit of hundreds of HLE and GPQA questions revealed systematic errors in answer-key curation, often stemming from outdated sources, transcription mistakes, or oversimplified multiple-choice options that eliminated nuanced but correct responses.
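For illustration, the sketch below shows the general shape such a forensic audit could take: compare a model's answers against the benchmark's labeled gold answers and flag every disagreement for manual expert review, which is how erroneous reference answers surface. The field names, the normalization rule, and the input file hle_responses.jsonl are assumptions made for this example; the researcher's actual tooling and data schema were not published.

```python
# Minimal sketch of an answer-key audit, assuming a hypothetical JSONL file of
# per-question records with "gold_answer" and "model_answer" fields.
import json
from collections import Counter


def normalize(answer: str) -> str:
    """Lower-case and collapse whitespace so trivially different strings compare equal."""
    return " ".join(answer.lower().split())


def audit(path: str) -> Counter:
    """Count items where the model's answer disagrees with the labeled gold answer.

    A disagreement is not automatically a benchmark error; it is a candidate
    for manual expert review.
    """
    verdicts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            if normalize(item["model_answer"]) == normalize(item["gold_answer"]):
                verdicts["agree"] += 1
            else:
                verdicts["flag_for_review"] += 1
    return verdicts


if __name__ == "__main__":
    print(audit("hle_responses.jsonl"))  # hypothetical input file
```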
The Qwen team’s paper goes further, introducing HLE-Verified, a newly audited subset of the original HLE dataset where each question and answer has been cross-validated by at least three subject-matter experts. According to their analysis, over 37% of the original HLE questions required revision or removal due to factual inaccuracies, ambiguous wording, or non-deterministic correct answers. Figure 1 in the paper illustrates the structural composition of HLE-Verified, showing a dramatic reduction in low-confidence and contested items compared to the original dataset.
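The paper describes this cross-validation only at a high level. As a rough, hypothetical sketch of what retaining only expert-confirmed items could look like, the snippet below keeps a question only when at least three independent reviewers all confirmed its gold answer; the expert_verdicts field and the threshold handling are assumptions for illustration, not the Qwen team's released tooling.

```python
# Illustrative filter for an expert-audited benchmark subset (assumed schema).
from typing import Iterable


def verified_subset(items: Iterable[dict], min_experts: int = 3) -> list[dict]:
    """Keep items whose gold answer was independently confirmed by >= min_experts reviewers."""
    kept = []
    for item in items:
        verdicts = item.get("expert_verdicts", [])  # hypothetical list of per-reviewer booleans
        if len(verdicts) >= min_experts and all(verdicts):
            kept.append(item)
    return kept


# Example: only q1 survives; q2 is contested and q3 has too few reviewers.
sample = [
    {"id": "q1", "expert_verdicts": [True, True, True]},
    {"id": "q2", "expert_verdicts": [True, False, True]},
    {"id": "q3", "expert_verdicts": [True, True]},
]
print([it["id"] for it in verified_subset(sample)])  # -> ['q1']
```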
This discovery has profound implications for the AI industry. Benchmark datasets like GPQA and HLE have become de facto standards for reporting model performance, influencing funding decisions, academic publications, and corporate AI development roadmaps. If these benchmarks are unreliable, then claims of "state-of-the-art" reasoning capabilities may be misleading—or worse, statistically invalid. Leading AI labs, including OpenAI, Anthropic, and Meta, have all used these datasets in public model evaluations, meaning their reported performance metrics may be inflated or misaligned with true reasoning ability.
Experts in AI ethics and evaluation are calling for immediate reforms. "We’ve built a house of cards on top of sand," said Dr. Elena Vasquez, a computational epistemologist at Stanford. "We’ve prioritized scale over rigor, and now we’re seeing the consequences. Without trustworthy benchmarks, we cannot responsibly measure progress in AI." The Qwen team recommends that all future evaluations use HLE-Verified or similar audited datasets and that publishers require transparency in benchmark sourcing and validation methodology.
While the AI community has long suspected issues with benchmark integrity, this is the first time a major research team has systematically documented and publicly corrected them. The release of HLE-Verified represents not just a correction, but a potential turning point: a move toward accountability, reproducibility, and scientific rigor in AI evaluation.
Verification Panel
Source Count: 1
First Published: 22 February 2026
Last Updated: 22 February 2026