AI Models Still Struggle with Math Despite Improvements, New Test Reveals

Despite significant strides in natural language understanding and generative capabilities, artificial intelligence models continue to struggle with fundamental mathematical reasoning, according to a new evaluation using the ORCA test suite. Even the most advanced commercially available models, including Google’s Gemini 3 Flash, would receive a grade of C if assessed using traditional academic standards, a finding that raises serious concerns about their deployment in fields requiring precision, such as finance, engineering, and scientific research.

The ORCA test, developed by a consortium of AI safety researchers and mathematicians, evaluates models on multi-step arithmetic, algebraic reasoning, symbolic manipulation, and logic-based word problems. Unlike previous benchmarks that focused on surface-level pattern matching, ORCA includes adversarial questions designed to expose the underlying limitations of large language models (LLMs) as probabilistic predictors rather than reasoning engines. The results are sobering: while performance has improved since 2023, the average score across top-tier models remains below a B−, with Gemini 3 Flash achieving 78% accuracy—a marginal gain over its predecessor but still far from human-level reliability.

"LLMs are not solving math problems—they’re guessing the most statistically probable answer," said Dr. Elena Vasquez, lead researcher at the Center for AI Transparency. "They’ve memorized patterns from vast datasets, but they lack true comprehension. When faced with a novel combination of variables or a subtle misdirection, they confidently generate incorrect solutions. This isn’t a bug; it’s a fundamental architectural constraint."

One striking example from the ORCA dataset involved a problem asking models to calculate the probability of drawing two red marbles from a bag containing five red and three blue marbles without replacement. While humans and traditional algorithms solve this with straightforward combinatorics, several LLMs—including GPT-4o and Claude 3 Opus—incorrectly applied independent probability assumptions, yielding answers that were off by over 30%. In another case, a model confidently asserted that 17 × 23 equals 371, despite the correct answer being 391—a mistake that would be flagged immediately by any middle-school student.
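Both examples can be checked exactly in a few lines. The sketch below is illustrative only: it assumes the "independent probability" mistake means treating each draw as a 5-in-8 chance, which the article does not spell out, and the function and variable names are made up for this example.

```python
from fractions import Fraction
from math import comb

RED, BLUE = 5, 3          # marbles in the bag, as described in the article
TOTAL = RED + BLUE


def p_two_red_without_replacement(red: int, total: int) -> Fraction:
    """Exact probability that two draws without replacement are both red."""
    # Count the red pairs and divide by the count of all possible pairs.
    return Fraction(comb(red, 2), comb(total, 2))


def p_two_red_independent(red: int, total: int) -> Fraction:
    """Probability if the two draws are (wrongly) treated as independent.

    Assumption: this is the error the article describes; the actual model
    outputs are not given in the source.
    """
    return Fraction(red, total) ** 2


correct = p_two_red_without_replacement(RED, TOTAL)
mistaken = p_two_red_independent(RED, TOTAL)

print("without replacement:", correct, float(correct))
print("independent draws:  ", mistaken, float(mistaken))
print("17 x 23 =", 17 * 23)
```

Using `Fraction` keeps the arithmetic exact, so the two approaches can be compared without floating-point rounding: the combinatorial answer is 5/14, while the independent-draw shortcut gives 25/64.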

Industry leaders argue that these models are sufficient for many applications where approximate answers are acceptable. "In customer service or content generation, a 90% accuracy rate is often good enough," said Mark Chen, Head of AI Strategy at a major tech firm. "But when you’re calculating drug dosages, structural loads, or financial risk models, that margin of error becomes dangerous."

Researchers are now calling for hybrid architectures that integrate symbolic reasoning engines with neural networks. Projects like DeepMind’s AlphaGeometry and Microsoft’s LeanDojo are exploring ways to combine LLMs with formal proof systems, but these remain experimental. Meanwhile, regulatory bodies in the EU and U.S. are beginning to scrutinize AI systems used in high-stakes decision-making, with draft guidelines requiring transparency in model limitations—including math performance metrics.

The ORCA results underscore a broader truth: AI’s apparent intelligence is often an illusion of fluency. As organizations increasingly outsource critical calculations to black-box models, the risk of undetected errors grows. Without rigorous, standardized testing—and public disclosure of performance gaps—users may be misled into trusting systems that are, at their core, still guessing.

For now, the message from the ORCA team is clear: don’t let AI do your math homework—especially if the stakes are high.

Sources: go.theregister.com
