Claude 4.5 Opus Tops SWE-bench February 2025 Leaderboard Amid Chinese AI Surge
The SWE-bench February 2025 leaderboard shows Claude 4.5 Opus leading in automated software engineering tasks, ahead of Gemini 3 Flash and MiniMax M2.5. Notably, Chinese models claim six of the top ten spots, signaling a shift in global AI coding capabilities.

According to Simon Willison’s analysis of the newly updated SWE-bench leaderboard, the February 2025 benchmark results reveal a dramatic shift in the landscape of AI-powered software engineering agents. Claude 4.5 Opus, developed by Anthropic, has emerged as the top-performing model on the "Bash Only" benchmark, resolving 76.8% of 2,294 real-world coding problems drawn from 12 major open-source repositories, including Django, scikit-learn, and pytest. This marks a significant milestone: it is the first time a Claude variant has topped the leaderboard over competitors from Google, OpenAI, and emerging Chinese AI labs.
The benchmark, administered by the SWE-bench team rather than self-reported by any AI developer, evaluates the ability of AI agents to independently solve software issues by interacting with real codebases under a standardized system prompt. The test harness, known as mini-swe-agent, is a deliberately minimal Python framework of roughly 100 lines that simulates a developer’s workflow: reading issue reports, navigating code, running shell commands, and producing patches, all without human intervention. The dataset’s diversity, spanning Python-based projects from scientific computing to web frameworks, ensures real-world relevance beyond synthetic benchmarks.
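To make the "Bash Only" setup concrete, here is a minimal sketch of the kind of agent loop such a harness runs: the model receives the issue text, proposes one shell command per turn, observes the output, and eventually emits a patch. The prompt wording, the query_model placeholder, and the turn limit below are illustrative assumptions, not the actual mini-swe-agent code.

    import subprocess

    SYSTEM_PROMPT = (
        "You are a software engineering agent. Reply with exactly one bash "
        "command per turn. When the fix is complete, reply with: submit"
    )  # illustrative wording, not the real standardized SWE-bench prompt

    def query_model(messages):
        """Placeholder for an API call to whichever model is being evaluated."""
        raise NotImplementedError

    def solve_issue(issue_text, max_turns=50):
        """Feed a GitHub issue to the model and let it iterate via bash commands."""
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": issue_text},
        ]
        for _ in range(max_turns):
            command = query_model(messages).strip()
            if command == "submit":
                # Final answer: capture the repository diff as the candidate patch.
                diff = subprocess.run(["git", "diff"], capture_output=True, text=True)
                return diff.stdout
            # Run the proposed command and show the model what happened.
            try:
                result = subprocess.run(
                    command, shell=True, capture_output=True, text=True, timeout=120
                )
                observation = result.stdout + result.stderr
            except subprocess.TimeoutExpired:
                observation = "command timed out"
            messages.append({"role": "assistant", "content": command})
            messages.append({"role": "user", "content": observation})
        return None  # gave up without submitting a patch

In the real evaluation, the resulting patch is applied in a clean environment and the repository’s test suite determines whether the issue counts as resolved.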
Surprisingly, Claude 4.5 Opus outperformed its successor, Claude Opus 4.6, by a narrow margin of 0.2 percentage points—a counterintuitive result that has sparked debate within the AI research community. Meanwhile, Google’s Gemini 3 Flash and MiniMax’s newly released M2.5 model tied for second place at 75.8%, underscoring the rapid advancement of multi-modal and reasoning-optimized architectures. MiniMax, a Chinese AI startup, released M2.5 just days before the benchmark run, and its immediate high placement signals the growing competitiveness of China’s private AI sector.
Chinese models collectively claim six of the top ten positions, with GLM-5, Kimi K2.5, DeepSeek V3.2, and MiniMax M2.5 all featuring prominently alongside Anthropic’s Claude 4.5 Sonnet and Haiku. This showing contrasts sharply with much of the Western AI ecosystem: OpenAI’s GPT-5.2 ranks sixth at 72.8%, and its specialized coding model, GPT-5.3-Codex, is absent from the leaderboard entirely, likely due to limited API availability. That absence raises the question of whether leaderboard standings are shaped by API accessibility as much as by capability.
One of the most compelling aspects of this benchmark is its methodological rigor. Unlike many proprietary evaluations, SWE-bench applies identical prompts and environment configurations to every model, eliminating any advantage from custom-tuned system instructions and ensuring that performance differences reflect model capability rather than prompt engineering. Simon Willison also showed how easily the published results can be built upon, using Claude to inject JavaScript that overlays percentage values onto the leaderboard’s bar charts, a small but telling example of the self-referential nature of modern AI tools.
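To illustrate what "identical prompts and environment configurations" means in practice, the hypothetical sketch below holds everything constant except the model identifier. The model names and the run_swebench helper are assumptions for illustration only, not the leaderboard’s actual tooling.

    # Illustrative sketch: every model is evaluated under one shared configuration,
    # so score differences cannot come from per-model prompt tuning.
    SHARED_CONFIG = {
        "system_prompt": "<the single standardized SWE-bench system prompt>",
        "agent_harness": "mini-swe-agent",
        "max_turns": 50,
        "container_image": "swe-bench-default",
    }

    # Hypothetical model identifiers, for illustration only.
    MODELS = ["claude-4.5-opus", "gemini-3-flash", "minimax-m2.5"]

    def run_swebench(model_id: str, config: dict) -> float:
        """Placeholder: run every benchmark instance for one model and
        return the fraction of issues resolved."""
        raise NotImplementedError

    scores = {model_id: run_swebench(model_id, SHARED_CONFIG) for model_id in MODELS}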
The implications are far-reaching. As AI agents increasingly handle complex software maintenance tasks, the ability to resolve real-world bugs at scale could transform developer workflows and open-source sustainability. With models like Claude 4.5 Opus and MiniMax M2.5 now resolving roughly three-quarters of the benchmark’s real-world issues, companies may soon rely on AI not just for code generation but for continuous integration and bug triage. The rise of Chinese AI models also suggests a global redistribution of innovation, challenging the dominance of U.S.-based labs and prompting renewed scrutiny of export controls and model transparency.
Looking ahead, the SWE-bench team plans to expand the benchmark to include multi-language support and more complex system-level tasks. For now, the February 2025 results serve as a definitive snapshot: the era of AI as a primary software engineer is no longer speculative—it is measurable, benchmarked, and already reshaping the industry.


