New Benchmark APEX Testing Reveals True Coding Capabilities of LLMs Across Real Codebases
A groundbreaking independent benchmark called APEX Testing evaluates 65 real-world coding tasks across diverse codebases, exposing significant gaps between marketing claims and how leading LLMs actually perform. The privately funded, ELO-ranked leaderboard reveals unexpected rankings and stark cost-performance disparities among leading models.

A new, rigorously designed benchmark named APEX Testing has emerged as a critical tool for cutting through the noise in the rapidly evolving field of coding large language models (LLMs). Created by an independent researcher under the pseudonym hauhau901, APEX Testing evaluates 65 real-world coding tasks drawn from actual production codebases — not synthetic or toy problems — and ranks models using an ELO system akin to chess ratings. The initiative, funded entirely out-of-pocket, has already exposed stark inconsistencies in how leading AI models perform under authentic engineering conditions, challenging marketing claims that dominate headlines.
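The article describes the ranking method only as "an ELO system akin to chess ratings." For readers unfamiliar with ELO, the sketch below shows a standard rating update applied to a head-to-head comparison between two models on a single task; the K-factor of 32, the starting rating of 1500, and the function names are illustrative assumptions, not details published by APEX Testing.

```python
# Illustrative sketch: a standard ELO update applied to a pairwise model comparison.
# The K-factor, starting ratings, and function names are assumptions for illustration,
# not APEX Testing's actual implementation.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the ELO model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one head-to-head task comparison.

    score_a is 1.0 if model A's solution was judged better, 0.0 if worse,
    and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two models start at 1500; model A wins one task comparison.
a, b = update_elo(1500.0, 1500.0, score_a=1.0)
print(round(a), round(b))  # 1516 1484
```

Repeating such updates over every pairwise task judgment is what produces a single leaderboard number per model, which is why aggregate ELO can hide the category-level collapses discussed below.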
Unlike conventional benchmarks that rely on isolated code snippets or artificial prompts, APEX Testing requires LLMs to clone real repositories, understand complex dependency graphs, debug race conditions, refactor legacy modules, and implement features in context — mirroring the full scope of a software engineer’s daily workflow. Each task is graded by multiple state-of-the-art models and manually reviewed by the creator to eliminate false negatives caused by infrastructure failures or timeouts. This dual-layer validation ensures fairness and reproducibility, setting a new standard for empirical evaluation in AI-assisted coding.
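APEX Testing's grading pipeline is not public, but the dual-layer idea described above, several automated graders followed by a manual pass that overturns spurious failures, could be sketched roughly as follows. The field names, the majority-vote rule, and the decision to exclude infrastructure failures rather than count them as losses are assumptions made for illustration only.

```python
# Rough sketch of dual-layer grading: several grader models vote on each attempt,
# and a manual review can overturn failures caused by infrastructure problems
# rather than bad code. All field names and rules are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class GradedAttempt:
    task_id: str
    grader_verdicts: list[bool]          # pass/fail verdicts from each grader model
    infra_failure: bool = False          # flagged when the run died from timeouts or setup errors
    manual_override: bool | None = None  # creator's review, if one was performed

def final_verdict(attempt: GradedAttempt) -> bool | None:
    """Combine automated verdicts with the manual review layer."""
    if attempt.manual_override is not None:
        return attempt.manual_override   # human review has the last word
    if attempt.infra_failure:
        return None                      # excluded rather than counted as a failure
    passes = sum(attempt.grader_verdicts)
    return passes > len(attempt.grader_verdicts) / 2  # simple majority vote

print(final_verdict(GradedAttempt("task-42", [True, False, True])))                 # True
print(final_verdict(GradedAttempt("task-07", [False, False], infra_failure=True)))  # None
```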
Among the most surprising findings is the performance of GPT-5.1 Codex Mini, which outperformed its newer, larger counterpart, GPT-5.2 Codex, in both consistency and task completion rate despite being the smaller model. The Mini achieved this, however, by consuming significantly more tokens per task, raising questions about cost-efficiency versus raw performance. Other models posted high average scores but collapsed entirely in specific categories such as CLI tool development or concurrency debugging, underscoring the danger of relying on aggregate metrics. The benchmark also revealed that two models with nearly identical ELO scores could differ by more than 300% in API cost per successful task completion, a critical factor for enterprises scaling AI-assisted development.
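The report does not publish the underlying spend figures, but the cost metric itself is plain arithmetic: total API spend divided by the number of tasks the model actually completed. The numbers below are invented, not APEX Testing's data, and simply show how two models with similar completion counts (and therefore similar ELO) can still differ by more than a factor of three on cost per success.

```python
# Hypothetical illustration of cost per successful task.
# All figures are invented and are not APEX Testing's published results.

def cost_per_success(total_api_cost_usd: float, tasks_passed: int) -> float:
    """API dollars spent per task the model actually completed."""
    if tasks_passed == 0:
        return float("inf")
    return total_api_cost_usd / tasks_passed

model_a = cost_per_success(total_api_cost_usd=120.0, tasks_passed=48)
model_b = cost_per_success(total_api_cost_usd=410.0, tasks_passed=50)

print(f"Model A: ${model_a:.2f} per successful task")  # $2.50
print(f"Model B: ${model_b:.2f} per successful task")  # $8.20, over 3x model A despite a similar pass count
```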
While commercial benchmarks often prioritize speed and simplicity, APEX Testing’s emphasis on real-world fidelity aligns with emerging academic efforts like FeatureBench, a recently published arXiv framework that similarly evaluates LLMs on complex, multi-step feature development tasks in real codebases. According to the FeatureBench paper (arXiv:2602.10975v1), “agentic coding systems must demonstrate sustained contextual awareness across extended codebases, not just syntactic completion.” APEX Testing complements this by adding operational realism, including version control workflows, package manager integration, and environment configuration, elements frequently absent from academic evaluations.
The benchmark currently spans eight categories: React UI components, Python CLI tools, JavaScript state management, Go concurrency bugs, Rust memory safety, SQL schema migrations, Dockerfile optimization, and legacy Java refactoring. Users can filter results by category to identify which models excel in frontend versus backend tasks, or which falter under tight resource constraints. The creator plans to expand the suite with more quantized and open-weight models, and is actively seeking community contributions to maintain neutrality.
With enterprise adoption of AI coding assistants accelerating, APEX Testing offers a much-needed reality check. As one senior engineering lead at a Fortune 500 firm remarked, “We’ve burned budgets on models that looked great in demos but failed when integrated into our CI/CD pipeline. APEX is the first benchmark that actually mirrors our pain points.”
The full benchmark, detailed task descriptions, and model rankings are available at www.apex-testing.org. The site openly displays the creator’s total expenditure, over $18,000 to date, reinforcing its commitment to transparency over profit. In an era of hype-driven AI evaluation, APEX Testing stands as a rare, principled effort to measure what truly matters: real code, real bugs, and real results.


