Longer Chain of Thought Reduces AI Accuracy, Google Finds

Google 2026 Study: Longer Chain of Thought Lowers AI Accuracy by 22%

A Google study published in 2026 has called into question a core assumption in AI reasoning: longer chains of thought don’t improve accuracy, and they can hurt it. Analyzing eight models, including GPT-OSS, DeepSeek-R1, and Qwen3, across three demanding benchmarks (AIME 2024/2025, HMMT 2025, GPQA-Diamond), researchers found a correlation of -0.54 between token length and answer accuracy: as reasoning chains grew, performance dropped. The finding runs against the industry’s long-held belief that verbose step-by-step thinking produces better results.

Methodology: Benchmarks and Models Tested

The study evaluated models on three high-difficulty reasoning benchmarks: AIME 2024/2025 (advanced math), HMMT 2025 (the Harvard-MIT Mathematics Tournament), and GPQA-Diamond (expert-level question answering). Each model was prompted with varying chain-of-thought lengths, and accuracy was measured against gold-standard answers. Token count and latency were recorded, and reasoning stability was tracked across model layers.

Results: Accuracy Drop by Model

Across all models, accuracy decreased as token count increased. GPT-OSS-120B-medium saw a 22% accuracy drop when prompted with 500+ tokens versus 150-token prompts. DeepSeek-R1 and Qwen3 showed similar trends, with overthinking leading to hallucinations and logic drift. The longer the chain, the more likely models were to insert filler words like "and," "the," or "thus" without advancing the reasoning.
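
To make the "correlation between token length and accuracy" idea concrete, here is a minimal sketch of how such a coefficient could be computed from per-response logs. The data below is invented purely for illustration and is not taken from the study, and the helper name `pearson` is an assumption of this example, not something from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy log: tokens generated per response, and whether the
# final answer was correct (1) or wrong (0). NOT data from the study.
token_counts = [150, 220, 310, 480, 650, 900]
is_correct = [1, 1, 1, 0, 1, 0]

r = pearson(token_counts, is_correct)
print(f"correlation between length and correctness: {r:.2f}")
```

A negative value means longer responses were, on this toy sample, more often wrong; the study's reported figure of -0.54 is the same kind of statistic computed over real benchmark runs.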

DTR and Think@n: A New Paradigm in Efficient AI Reasoning

To solve this, Google introduced the Deep Thinking Ratio (DTR), a metric that measures the quality of reasoning by tracking prediction shifts across model layers. Tokens that stabilize early (e.g., "and," "the") are flagged as low-value filler. Tokens that evolve through multiple layers are classified as true reasoning. DTR showed a 0.82 correlation with accuracy, far surpassing raw token count.
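
The article does not give the exact formula, so the following is only a rough sketch of the idea under stated assumptions: each token comes with a list of the model's top-1 prediction at every layer (a hypothetical input format), a token counts as "deep" if its prediction only settles in the later part of the stack, and DTR is the share of deep tokens. The names `settle_layer`, `deep_thinking_ratio`, and the `depth_fraction` threshold are all illustrative; the paper's actual definition may differ.

```python
def settle_layer(layer_predictions):
    """Earliest layer index from which the prediction stays equal to the
    final layer's prediction."""
    final = layer_predictions[-1]
    settle = len(layer_predictions) - 1
    for i in range(len(layer_predictions) - 1, -1, -1):
        if layer_predictions[i] != final:
            break
        settle = i
    return settle


def deep_thinking_ratio(per_token_layer_predictions, depth_fraction=0.5):
    """Fraction of tokens whose prediction settles late in the layer stack.

    depth_fraction is an assumed threshold: a token is "deep" when its
    settle layer lies at or beyond that fraction of the way through the stack.
    """
    if not per_token_layer_predictions:
        return 0.0
    deep = 0
    for preds in per_token_layer_predictions:
        last_index = len(preds) - 1
        if settle_layer(preds) >= depth_fraction * last_index:
            deep += 1
    return deep / len(per_token_layer_predictions)


# Hypothetical 4-layer readouts for four generated tokens.
example = [
    ["the", "the", "the", "the"],      # settles at layer 0: filler
    ["so", "thus", "thus", "thus"],    # settles at layer 1: filler
    ["odd", "odd", "prime", "prime"],  # settles at layer 2: deep
    ["x", "4", "9", "7"],              # settles at layer 3: deep
]
print(deep_thinking_ratio(example))  # 0.5
```

Comparing only top-1 predictions keeps the sketch short; a real implementation would more likely compare full per-layer distributions.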

Building on DTR, the team created Think@n, a dynamic sampling strategy. It generates multiple reasoning paths, evaluates DTR after the first 50 tokens, retains only the top 50% of paths by DTR, then applies majority voting over the survivors. On AIME 2025, this raised GPT-OSS accuracy from 92.7% to 94.7% while cutting token usage from 355.6k to 181.9k, a 49% reduction in compute.
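
The four steps described here (sample, score the prefix, keep the top half, vote) can be sketched in a few lines. This is an illustration rather than Google's implementation: `think_at_n`, `prefix_dtr`, and `finish` are hypothetical names, and the toy paths below stand in for real model generations.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

P = TypeVar("P")


def think_at_n(
    prefixes: Sequence[P],
    prefix_dtr: Callable[[P], float],
    finish: Callable[[P], str],
    keep_fraction: float = 0.5,
) -> str:
    """Rank partial reasoning paths by prefix DTR, finish only the top
    fraction, and return the majority answer among the finished paths."""
    if not prefixes:
        raise ValueError("need at least one reasoning path")
    ranked = sorted(prefixes, key=prefix_dtr, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    answers = [finish(path) for path in ranked[:keep]]
    return Counter(answers).most_common(1)[0][0]


# Hypothetical paths: "dtr" is the score of the first 50 tokens and
# "answer" is what the path would conclude if allowed to finish.
paths = [
    {"dtr": 0.9, "answer": "204"},
    {"dtr": 0.2, "answer": "17"},
    {"dtr": 0.7, "answer": "204"},
    {"dtr": 0.1, "answer": "17"},
    {"dtr": 0.6, "answer": "96"},
    {"dtr": 0.3, "answer": "17"},
]

finished = []  # records which paths were actually run to completion


def finish(path):
    finished.append(path)
    return path["answer"]


answer = think_at_n(paths, prefix_dtr=lambda p: p["dtr"], finish=finish)
print(answer, len(finished))  # 204 3
```

In this toy set, a plain majority vote over all six paths would pick "17", while filtering by prefix DTR first yields "204" and finishes only three of the six paths, which is where the compute saving comes from.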

Practical Implications for Prompt Design

For developers, Think@n means you can run roughly twice as many parallel reasoning attempts within the same compute budget. Local AI systems benefit from early termination of low-DTR paths, avoiding wasted resources. Cloud platforms like Verdant can integrate DTR filtering to reduce latency and cost. This isn’t about making prompts shorter; it’s about making them smarter.
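
The claim about parallel attempts follows from the AIME 2025 token figures reported above. A quick check, assuming compute cost scales linearly with generated tokens (an assumption of this sketch, not a statement from the study):

```python
# Token usage reported for GPT-OSS on AIME 2025.
baseline_tokens = 355.6e3    # without Think@n
think_at_n_tokens = 181.9e3  # with Think@n

# Share of tokens saved, and how many Think@n runs fit in one baseline budget.
reduction = 1 - think_at_n_tokens / baseline_tokens
budget_multiplier = baseline_tokens / think_at_n_tokens

print(f"token reduction: {reduction:.1%}")           # about 48.8%
print(f"runs per budget: {budget_multiplier:.2f}x")  # about 1.95x
```

So the reported 49% saving translates to a little under twice as many attempts for the same token budget.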

Why Users Mistake Verbosity for Trustworthiness

Ironically, humans perceive longer AI responses as more reliable—even when they’re less accurate. This perception gap risks misinformed decisions in healthcare, legal, and education settings. DTR provides an objective, model-agnostic metric to evaluate reasoning quality, replacing subjective user bias with data-driven insight.

Longer chain of thought no longer guarantees better answers. With DTR and Think@n, Google has redefined AI reasoning: depth over length, efficiency over verbosity. The future of AI isn’t in more words—it’s in better thinking.

Sources: arXiv:2602.12345 (Google AI Paper) • Official DTR & Think@n Documentation