GLM-5 Surpasses Kimi K2.5 as Top Open-Weights Model on NYT Connections Benchmark
GLM-5, the latest open-weights AI model from ZAI, has achieved a record score of 81.8 on the Extended NYT Connections benchmark, outperforming Kimi K2.5 Thinking. The breakthrough underscores advances in agentic reasoning and sparse architecture design.

On February 12, 2026, ZAI Labs unveiled GLM-5, a next-generation open-weights large language model that has rapidly ascended to the top of the Extended NYT Connections benchmark with a score of 81.8—surpassing Kimi K2.5 Thinking’s previous record of 78.3. The achievement, confirmed by independent testing published on GitHub by researcher Lech Mazur, marks a pivotal moment in the open-source AI landscape, demonstrating that scalable, efficient architectures can outperform proprietary models in complex reasoning tasks.
According to a technical report published by ZAI Labs, GLM-5 scales to 744 billion total parameters with 40 billion active parameters during inference, a significant leap from its predecessor GLM-4.5. The model was trained on 28.5 trillion tokens of multilingual, code-inclusive, and reasoning-rich data, enabling unprecedented contextual understanding. Crucially, GLM-5 integrates DeepSeek Sparse Attention (DSA), a novel mechanism that reduces computational overhead by 40% while preserving long-context retention up to 128K tokens—making it uniquely suited for multi-step agentic workflows.
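The practical effect of activating only 40 billion of 744 billion parameters can be sketched with back-of-envelope arithmetic, using the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token. The figures below are illustrative estimates, not measurements from ZAI's report:

```python
# Rough per-token inference cost for a sparsely activated model:
# only the *active* parameters contribute to each forward pass.
# Rule of thumb: ~2 FLOPs per active parameter per token.

def flops_per_token(active_params: float) -> float:
    return 2.0 * active_params

TOTAL_PARAMS = 744e9   # GLM-5 total parameters (from the report)
ACTIVE_PARAMS = 40e9   # parameters active per token (from the report)

print(f"Active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
print(f"Compute vs. fully dense: {flops_per_token(ACTIVE_PARAMS) / flops_per_token(TOTAL_PARAMS):.3f}")
```

On this estimate, each token touches only about 5% of the model's weights, which is why sparse activation (compounded by DSA's reported 40% attention savings) makes deployment far cheaper than the headline parameter count suggests.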
The Extended NYT Connections benchmark, developed by Lech Mazur and hosted on GitHub, evaluates AI models on their ability to identify semantic groupings in the popular New York Times word puzzle. Unlike traditional QA or translation benchmarks, Connections requires abstract reasoning, pattern recognition, and contextual inference across ambiguous categories, a task that closely mirrors human-like cognitive flexibility. GLM-5’s score of 81.8 represents a 3.5-point gain (roughly 4.5% relative) over the prior leader, Kimi K2.5 Thinking, and is the first time an open-weights model has broken the 80-point threshold on this benchmark.
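The task format is simple to state: sixteen words must be partitioned into four groups of four. A minimal sketch of how such a partition might be scored against the gold answer follows; this is a simplified illustration, not Lech Mazur's actual evaluation harness, and the puzzle words are invented:

```python
# Connections-style scoring sketch: a predicted group counts only if
# it exactly matches one gold group (membership, not order, matters).

def score_puzzle(predicted, gold):
    gold_sets = {frozenset(g) for g in gold}
    correct = sum(1 for g in predicted if frozenset(g) in gold_sets)
    return correct / len(gold)

gold = [
    ["chili", "salsa", "wasabi", "jalapeno"],   # spicy things
    ["super", "fire", "moon", "spider"],        # words before "-man"
    ["oven", "kettle", "toaster", "grill"],     # things that get hot
    ["viral", "trending", "buzzy", "hyped"],    # "hot" topics
]
predicted = [
    ["chili", "salsa", "wasabi", "jalapeno"],
    ["super", "fire", "moon", "grill"],         # one word misplaced...
    ["oven", "kettle", "toaster", "spider"],    # ...breaks two groups
    ["viral", "trending", "buzzy", "hyped"],
]
print(score_puzzle(predicted, gold))  # 0.5
```

Note how a single swapped word invalidates two groups at once; this all-or-nothing structure is what makes the benchmark punishing for models that rely on surface-level association.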
Industry analysts attribute GLM-5’s success to its agentic engineering framework, a paradigm shift from “vibe coding” (where models generate outputs based on statistical likelihood) to structured, goal-oriented reasoning. As described in ZAI’s blog, GLM-5 is designed to decompose complex tasks into sub-goals, self-correct using internal feedback loops, and iteratively refine solutions. This architecture allows it to navigate the nuanced, often misleading categories in Connections, such as distinguishing between “things that are ‘hot’” (spicy, temperature, celebrity) and “words that precede ‘-man’” (super, fire, moon), with remarkable precision.
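The decompose / attempt / self-correct loop described in ZAI's blog can be sketched in a few lines. The functions `propose`, `critique`, and `revise` below are hypothetical stand-ins for model calls, since the real system's internal interfaces are not public:

```python
# Minimal sketch of an iterative self-correction loop: propose an
# answer, check it against internal feedback, and revise until the
# critique passes or the iteration budget runs out.

def solve(task, propose, critique, revise, max_iters=3):
    answer = propose(task)
    for _ in range(max_iters):
        feedback = critique(task, answer)
        if feedback is None:   # internal check passes
            return answer
        answer = revise(task, answer, feedback)
    return answer

# Toy usage with dummy callables: the first attempt fails the
# critique, so the loop revises once and then succeeds.
target = "hot"
propose = lambda task: "cold"
critique = lambda task, ans: None if target in ans else f"answer lacks '{target}'"
revise = lambda task, ans, fb: target
print(solve("group the words", propose, critique, revise))  # hot
```

The key design point is that the critique step is internal to the model's workflow rather than a human-in-the-loop check, which is what distinguishes agentic refinement from single-shot generation.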
Despite its scale, GLM-5 is fully open-sourced under an Apache 2.0 license and is available on Hugging Face and GitHub. Its deployment efficiency, thanks to DSA, enables inference on consumer-grade GPUs, a rarity among models of comparable performance. The model’s release has sparked renewed interest in open-weight alternatives to proprietary systems like GPT-4o and Claude 3.5, particularly among researchers and developers in emerging markets.
While Business Wire’s press release heralds GLM-5 as signaling “a new era in AI: when models become engineers,” critics caution that benchmark scores alone do not guarantee real-world robustness. Nevertheless, the model’s performance on Connections—a test of abstract reasoning rather than memorization—suggests a meaningful step toward generalist AI capabilities. With active development continuing on GitHub and community fine-tuning already underway, GLM-5 may well become the new standard for open-source agentic AI.
For developers interested in experimenting with GLM-5, ZAI provides detailed documentation and a coding plan at z.ai/subscribe, and the model weights are accessible via Hugging Face and GitHub.