Leaked Benchmarks Suggest DeepSeek-V4 Outperforms GPT-5 and Gemini in Coding and Math
Leaked benchmarks circulating online indicate DeepSeek-V4 achieves unprecedented scores on coding and mathematical reasoning tasks, potentially surpassing GPT-5 and Gemini. If verified, the results could redefine the global AI leaderboards.

According to a leaked internal benchmark report circulating on Reddit’s r/singularity community, DeepSeek-V4—a next-generation large language model developed by the Chinese AI firm DeepSeek—has achieved staggering results across multiple high-stakes evaluation benchmarks. The data, reportedly sourced from a verified X (formerly Twitter) account linked to Bridgemind AI, suggests the model outperforms all known commercial models in both software engineering and advanced mathematical reasoning, potentially resetting the global AI performance leaderboard.
The most striking figure is DeepSeek-V4’s 83.7% accuracy on SWE-Bench Verified, a rigorous benchmark that tests a model’s ability to solve real-world software engineering problems by generating correct code patches from GitHub issue descriptions. This score surpasses GPT-5.2 High (80.0%), Kimi K2.5 Thinking (76.8%), Gemini 3.0 Pro (76.2%), and DeepSeek V3.2 Thinking (73.1%), positioning DeepSeek-V4 as the new leader in code generation. For context, SWE-Bench is widely regarded as the gold standard for evaluating AI coding assistants, and scores at or above 80% were, until very recently, considered nearly unattainable.
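To make concrete what a score like that measures, here is a minimal sketch of the pass/fail logic a SWE-Bench-style harness applies to each GitHub issue. The function name, arguments, and paths below are illustrative assumptions, not the official harness (the real evaluation, published by the SWE-bench team, runs each instance in an isolated containerized environment):

```python
# Illustrative sketch of SWE-Bench-style scoring (not the official harness).
# All names and arguments here are hypothetical examples.
import subprocess

def resolves_issue(repo_dir: str, model_patch: str, fail_to_pass_tests: list[str]) -> bool:
    """Apply the model-generated patch and rerun the issue's failing tests."""
    # Apply the candidate patch to a clean checkout of the repository.
    apply = subprocess.run(
        ["git", "apply", "-"], cwd=repo_dir, input=model_patch,
        text=True, capture_output=True,
    )
    if apply.returncode != 0:
        return False  # the patch does not even apply cleanly
    # The instance counts as "resolved" only if the tests that previously
    # failed (capturing the bug in the issue) now pass.
    tests = subprocess.run(
        ["python", "-m", "pytest", *fail_to_pass_tests],
        cwd=repo_dir, capture_output=True,
    )
    return tests.returncode == 0
```

The benchmark score is simply resolved instances divided by total instances; since SWE-Bench Verified contains 500 human-validated issues, an 83.7% score would correspond to roughly 419 resolved issues.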
But the model’s capabilities extend far beyond coding. DeepSeek-V4 reportedly achieved a near-perfect 99.4% on the AIME 2026 (American Invitational Mathematics Examination), a competition-level math test designed for high school students with exceptional aptitude. It also scored 88.4% on the IMO Answer Bench, which evaluates solutions to problems from the International Mathematical Olympiad—the most prestigious mathematics competition for pre-university students globally. These results suggest an unprecedented level of abstract reasoning and symbolic manipulation.
Perhaps most astonishing is its performance on FrontierMath Tier 4, a benchmark designed to test the limits of mathematical reasoning with problems so complex they routinely stump even state-of-the-art AI systems. DeepSeek-V4 achieved a 23.5% accuracy rate, which the source claims is 11 times better than GPT-5.2’s result on the same test (implying GPT-5.2 solved only about 2.1% of the problems). If confirmed, this represents a dramatic leap in mathematical AI capability, as FrontierMath Tier 4 problems require multi-step logical deduction, advanced algebraic insight, and creative problem-solving rarely seen in machine-generated responses.
Experts caution that these figures remain unverified. DeepSeek has not officially confirmed the existence of a V4 model, nor has it released any peer-reviewed papers or public evaluations. The source of the data—an X post by @bridgemindai—has not been independently authenticated, and the image containing the full benchmark table has not been corroborated by academic institutions or major AI research labs. However, the specificity of the numbers, their alignment with known benchmark structures, and the absence of obvious fabrication suggest the data may be legitimate.
Industry analysts note that if DeepSeek-V4’s performance is real, it could disrupt the current AI landscape dominated by OpenAI, Google, and Anthropic. China’s rapid progress in open-weight models, combined with aggressive training methodologies and large-scale data curation, has positioned DeepSeek as a formidable challenger. The model’s apparent strength in both practical coding and theoretical math could make it indispensable for research labs, financial firms, and software developers seeking the most capable AI assistants.
As the AI community awaits official confirmation, the leaked benchmarks have already ignited fierce debate. Some researchers argue that such performance would require novel architectures or training techniques beyond current public knowledge. Others warn of potential benchmark manipulation or cherry-picked results. Nevertheless, the implications are undeniable: if DeepSeek-V4 delivers on these claims, it may mark the first time a non-Western AI model leads the world in both applied and theoretical intelligence.
For now, the ball is in DeepSeek’s court. The company has remained publicly silent. But in the high-stakes race for AI supremacy, silence can be as telling as a statement.