SOOHAK Benchmark (2026): Why AI Models Like Google Gemini Fail on Unsolvable Math Problems
A new AI benchmark for mathematics reveals that while models like Google's Gemini can solve complex problems, they struggle to identify tasks with no solution. The SOOHAK benchmark, created by a consortium of mathematicians, highlights a critical gap in AI's research-level reasoning. No model scored above 50% in detecting deliberately broken problems.

A new 2026 AI math benchmark has revealed a significant flaw in artificial intelligence: leading models like Google Gemini confidently generate answers to problems that have no solution. According to The Decoder, the SOOHAK benchmark was developed by 64 mathematicians and includes 439 handwritten tasks, including 99 deliberately unsolvable problems. The results show that while AI can tackle research-level questions, its ability to recognize that a problem is logically impossible remains weak.
What the SOOHAK Benchmark Tests
The SOOHAK benchmark moves beyond simple problem-solving to assess the broader research skills needed for genuine mathematical inquiry, including recognizing when a problem has no solution. Its problems range from undergraduate level to current research topics.
Key Performance Findings
Google's Gemini 3 Pro model emerged as the top performer on solvable research-level problems, achieving approximately 30 percent accuracy. However, Gemini 3 Pro and every other tested model struggled to recognize the unsolvable tasks.
The Critical Failure Point
- No AI model scored above 50% in correctly identifying the 99 "broken" problems.
- The results point to a profound disconnect between computational capacity and critical judgment.
- Increasing training data improves solution-finding but not problem-validation skills.
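To make the two headline numbers concrete, here is a minimal Python sketch of how accuracy on solvable problems and the detection rate on broken problems could be tallied separately. The `Result` record, its field names, and the `score` function are hypothetical illustrations, not the SOOHAK authors' grading code, and the sample data is made up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Result:
    """One graded model response (hypothetical record layout)."""
    solvable: bool        # ground truth: the problem has a valid solution
    model_refused: bool   # the model flagged the problem as unsolvable
    answer_correct: bool  # the model's final answer matched the reference


def score(results: List[Result]) -> Dict[str, float]:
    """Report solvable-problem accuracy and broken-problem detection separately."""
    solvable = [r for r in results if r.solvable]
    broken = [r for r in results if not r.solvable]
    # A solvable problem counts as solved only if the model answered it correctly.
    solved = sum(1 for r in solvable if r.answer_correct and not r.model_refused)
    # A broken problem counts as detected only if the model refused to answer.
    detected = sum(1 for r in broken if r.model_refused)
    return {
        "solvable_accuracy": solved / len(solvable) if solvable else 0.0,
        "broken_detection_rate": detected / len(broken) if broken else 0.0,
    }


# Made-up sample: 10 solvable problems (3 answered correctly) and
# 4 broken problems (1 correctly refused).
sample = (
    [Result(True, False, True)] * 3
    + [Result(True, False, False)] * 7
    + [Result(False, True, False)] * 1
    + [Result(False, False, False)] * 3
)
print(score(sample))
```

Keeping the two rates apart matters because a model that never refuses can still post a respectable accuracy on solvable problems while scoring zero on detection.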
Key Findings on AI Overconfidence and Hallucination
The benchmark reveals an overconfidence problem: models exhibit what researchers call "mathematical hallucination," confidently presenting solutions to unsolvable problems. This flaw in machine reasoning has direct implications for AI as a research tool.
Implications for Academic Research
As AI systems assist in drafting research papers and exploring conjectures, the inability to spot inconsistencies becomes a major risk. The Daily Star raises crucial questions about authorship and accountability when AI contributes to mathematical papers.
The Core Reasoning Gap
The issue stems from overconfidence and a lack of meta-reasoning. AI models, trained on solvable problems, are optimized to produce an answer and have no mechanism for evaluating whether the question itself is valid. In effect, they behave like brilliant but uncritical students.
Implications for AI Research and Development
Researchers suggest the SOOHAK benchmark will guide future AI development toward systems that understand problem boundaries, not just solve more problems. This requires training AI to recognize contradictions and ill-defined conditions.
New Training Paradigms Needed
- Exposure to more unsolvable problems during training
- Techniques for problem deconstruction before solution attempts
- Integration of formal logic checks into reasoning processes
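As a toy illustration of the last two ideas, the Python sketch below checks whether a problem's stated conditions can be satisfied at all before reporting an answer. Everything here is an assumption made for illustration: the `find_witness` and `answer_or_refuse` helpers are hypothetical, and a bounded brute-force search stands in for the formal checks a real system would need. Research-level problems would not yield to this kind of search.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List, Optional

Assignment = Dict[str, int]
Constraint = Callable[[Assignment], bool]


def find_witness(
    variables: List[str],
    domain: Iterable[int],
    constraints: List[Constraint],
) -> Optional[Assignment]:
    """Return one assignment satisfying every constraint, or None if the
    bounded search finds none."""
    values = list(domain)
    for combo in product(values, repeat=len(variables)):
        assignment = dict(zip(variables, combo))
        if all(check(assignment) for check in constraints):
            return assignment
    return None


def answer_or_refuse(
    variables: List[str],
    domain: Iterable[int],
    constraints: List[Constraint],
) -> str:
    """Validate the problem first; only report an answer if one exists."""
    witness = find_witness(variables, domain, constraints)
    if witness is None:
        return "no solution in the searched domain"
    return f"solution: {witness}"


search_range = range(-20, 21)  # hypothetical bounded integer domain

# Well-posed: x + y = 10 and x - y = 2 has the integer solution x = 6, y = 4.
well_posed = [lambda a: a["x"] + a["y"] == 10, lambda a: a["x"] - a["y"] == 2]
# Broken: x + y = 10 and x - y = 3 would force 2x = 13, impossible for integers.
broken = [lambda a: a["x"] + a["y"] == 10, lambda a: a["x"] - a["y"] == 3]

print(answer_or_refuse(["x", "y"], search_range, well_posed))
print(answer_or_refuse(["x", "y"], search_range, broken))
```

The refusal is only as strong as the search behind it: "no solution in the searched domain" is a weaker claim than a proof of impossibility, which is why the sketch words its output that way.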
The Path Forward
Until this gap closes, AI's promise as a true partner in mathematical discovery remains limited. The 2026 SOOHAK benchmark findings are a sobering reminder that impressive narrow-domain performance can mask broader reasoning deficiencies. In rigorous sciences, identifying unsolvable problems is as crucial as solving solvable ones, and current AI models have not yet crossed that threshold.


