Fine-Tuned Qwen 14B Outperforms GPT-4o on NYT Connections Puzzle

In a striking demonstration of efficient AI fine-tuning, an independent researcher has successfully trained a 14-billion-parameter open-source model to outperform GPT-4o on the New York Times’ daily Connections puzzle—a word game requiring nuanced semantic reasoning. According to a detailed report published on Substack and shared on Reddit’s r/LocalLLaMA, the researcher’s fine-tuned Qwen 2.5 14B model achieved a 30% success rate, surpassing GPT-4o’s 22.7% and far exceeding the base Qwen 14B’s 9.3% performance. The breakthrough was accomplished using only $10 in cloud compute and a 20-minute training session on an A100 GPU, challenging conventional assumptions that superior AI performance requires massive proprietary models.

The key innovation lay not in data volume or model size, but in pedagogical distillation. The researcher employed Claude Sonnet 4.5, a high-performing commercial model with an 87.3% solve rate, as a "teacher" to generate step-by-step reasoning traces across approximately 350 NYT Connections puzzles. Rather than feeding the Qwen model only correct answers, it was trained on the full chain of logical deductions: how Sonnet identified potential categories, eliminated red herrings, and validated groupings. This approach—known as reasoning distillation—enabled Qwen to internalize the cognitive process behind solving the puzzles, not just memorize patterns.

Previous attempts at improving performance failed spectacularly. Fine-tuning on solution-only datasets caused the model to mimic the puzzle’s output format without understanding the underlying logic. Synthetic puzzle generation, where Sonnet was asked to create new puzzles for training, yielded trivial, unrealistic examples that did not reflect the complexity of real NYT puzzles. Similarly, embedding-based similarity scoring—common in word association tasks—proved inadequate, as Connections often hinges on idiomatic, cultural, or contextual groupings that defy vector-space semantics.

The technical implementation was remarkably lightweight. Using QLoRA (Quantized Low-Rank Adaptation) via the Unsloth library, the researcher applied a LoRA rank of 32 and trained for 2.5 epochs. The entire process required less than 20 minutes on a single A100, making it accessible to academic labs and independent developers without access to multi-million-dollar compute budgets. The resulting model, while smaller than GPT-4o, demonstrated superior performance on a task demanding abstract reasoning, contextual awareness, and pattern recognition.

This experiment carries broader implications for the future of AI development. It suggests that open-source models, when properly guided by high-quality reasoning data, can rival or exceed the performance of significantly larger proprietary systems. The success of reasoning distillation over raw scaling may accelerate a shift toward "cognitive apprenticeship" training paradigms, where smaller models are mentored by expert systems rather than trained on vast, unstructured corpora.

The researcher, who remains anonymous under the username john_enev, has released the full code and training dataset on Substack, inviting replication and extension by the community. As AI accessibility continues to democratize, this case study stands as a compelling example of how creativity, not just computational power, drives innovation. Whether this approach can be generalized to other reasoning tasks—such as logic puzzles, math word problems, or legal reasoning—remains an open and exciting question for the AI research community.

AI-Powered Content

Sources: www.reddit.com

Fine-Tuned Qwen 14B Outperforms GPT-4o on NYT Connections Puzzle

Fine-Tuned Qwen 14B Outperforms GPT-4o on NYT Connections Puzzle

summarize3-Point Summary

psychology_altWhy It Matters

AI Terms in This Article

recommendRelated Articles

Attention Residuals (2026): Moonshot AI's Breakthrough for Efficient Transformer Scaling

Amazon Nova 2 Lite Content Moderation (2026): How New Prompts Beat Larger AI Models

Cursor Composer 2 AI Model (2026 Review): Beats Claude Opus 4.6 with 86% Lower Cost & Superior Be...