Gemini 3.1 Pro Preview Breaks Benchmark Record with 98.4% Score on Extended NYT Connections
Google's Gemini 3.1 Pro Preview has achieved a record-breaking 98.4% accuracy on the Extended NYT Connections benchmark, surpassing its predecessor and setting a new standard for AI reasoning. The milestone underscores rapid advancements in large language models' ability to handle complex semantic and contextual puzzles.

Google’s latest model, the Gemini 3.1 Pro Preview, has shattered previous records by scoring 98.4% on the Extended NYT Connections puzzle benchmark, a significant leap from the 96.3% posted by its predecessor, Gemini 3 Pro. Framed as error rate, the improvement cuts remaining mistakes from 3.7% to 1.6%, more than halving them. The result, first reported by a user on Reddit’s r/singularity forum, marks a watershed moment in AI’s capacity to understand nuanced language relationships, categorization, and contextual inference. The performance not only outpaces earlier models but also challenges the notion that human intuition is irreplaceable in tasks requiring subtle semantic reasoning.
The NYT Connections benchmark, originally developed by The New York Times as a daily word game, has evolved into a rigorous evaluation tool for AI systems. Players must sort 16 words into four groups of four based on hidden thematic links, such as types of bones, things that are ‘hot,’ or words that precede ‘-ball.’ The Extended version increases the difficulty with overlapping categories, red herrings, and multi-layered wordplay, making it a formidable test of contextual intelligence. According to Lech Mazur, who maintains the GitHub repository that adapts the benchmark for AI testing, the 98.4% score is the highest ever recorded by any AI model, surpassing even state-of-the-art competitors from OpenAI and Anthropic.
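To make the task concrete, here is a minimal Python sketch of how a single Connections answer can be checked. The puzzle words, categories, and scoring rule are invented for illustration; the actual benchmark in Lech Mazur’s repository defines its own format and scoring.

```python
# Toy scorer for one Connections-style puzzle. All words and categories are
# invented for illustration; this is not the benchmark's real format.

SOLUTION = {
    "ROYAL SYMBOLS":  {"crown", "scepter", "throne", "orb"},
    "TYPES OF BONES": {"femur", "tibia", "ulna", "radius"},
    "HOT ___":        {"dog", "spring", "sauce", "rod"},
    "___-BALL":       {"base", "basket", "foot", "pin"},
}

def score_answer(groups, solution):
    """Count proposed groups that exactly match a hidden category."""
    remaining = list(solution.values())
    correct = 0
    for group in groups:
        proposed = set(group)
        if proposed in remaining:      # set equality against each category
            remaining.remove(proposed)
            correct += 1
    return correct

answer = [
    ["crown", "scepter", "throne", "orb"],
    ["femur", "tibia", "ulna", "radius"],
    ["dog", "spring", "sauce", "rod"],
    ["base", "basket", "foot", "pin"],
]
print(score_answer(answer, SOLUTION))  # 4 -> a perfect solve
```

Note the trap built into even this toy example: ‘radius’ could plausibly anchor a geometry group, which is exactly the kind of red herring the Extended version multiplies.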
While Google has not yet released detailed technical documentation for the Gemini 3.1 Pro Preview, the company’s Gemini platform confirms ongoing enhancements to its reasoning capabilities. The platform, accessible at gemini.google.com, highlights improvements in ‘Plan, Research, and Learn’ functionality, features that align closely with the cognitive demands of the Connections puzzle. Industry analysts suggest the model’s success stems from a combination of refined training data, improved multi-step reasoning architecture, and enhanced attention mechanisms that better track and eliminate false associations.
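While official details are sparse, the preview can in principle be probed the same way enthusiasts tested it: by handing it a puzzle directly. The sketch below uses Google’s google-generativeai Python client; the model identifier is a placeholder, since Google has not published an API name for the 3.1 Pro Preview.

```python
# Prompting a Gemini model with a Connections-style puzzle via Google's
# google-generativeai client. The model name below is hypothetical: no API
# identifier for the 3.1 Pro Preview has been published.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3.1-pro-preview")  # placeholder name

words = ["crown", "scepter", "throne", "orb",
         "femur", "tibia", "ulna", "radius",
         "dog", "spring", "sauce", "rod",
         "base", "basket", "foot", "pin"]

prompt = (
    "Group these 16 words into four categories of four words each. "
    "Name each category and list its four words on one line:\n"
    + ", ".join(words)
)

response = model.generate_content(prompt)
print(response.text)
```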
This achievement comes amid growing interest in AI’s role beyond mere information retrieval. Unlike traditional language models that excel at pattern matching, Gemini 3.1 Pro demonstrates an emergent ability to simulate human-like deduction. In one test case, which required grouping ‘Crown,’ ‘Scepter,’ ‘Throne,’ and ‘Orb’ under ‘Royal Symbols,’ the model identified the category correctly without being misled by alternate senses of ‘Crown,’ as in ‘Crown Jewel’ or ‘Crown Prince.’ Such precision reflects a deeper grasp of semantic hierarchies and cultural context.
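One way to picture what tracking and eliminating false associations could mean mechanically is to score candidate groups by semantic coherence. The sketch below is purely illustrative: it uses tiny hand-made vectors in place of learned embeddings, and it does not reflect Gemini’s actual, undisclosed internals.

```python
# Toy illustration of rejecting a decoy group for an ambiguous word.
# The 3-d "embeddings" are hand-made for this example; a real system would
# use learned, high-dimensional vectors. Not a model of Gemini's internals.
import numpy as np

EMB = {
    "crown":   np.array([0.9, 0.1, 0.4]),   # royal sense dominates
    "scepter": np.array([1.0, 0.0, 0.1]),
    "throne":  np.array([0.9, 0.1, 0.0]),
    "orb":     np.array([0.8, 0.2, 0.1]),
    "jewel":   np.array([0.1, 0.9, 0.5]),   # pulls toward 'crown jewel'
    "prince":  np.array([0.5, 0.3, 0.9]),   # pulls toward 'crown prince'
}

def coherence(words):
    """Mean pairwise cosine similarity: higher means a tighter group."""
    vecs = [EMB[w] / np.linalg.norm(EMB[w]) for w in words]
    sims = [float(a @ b) for i, a in enumerate(vecs) for b in vecs[i + 1:]]
    return sum(sims) / len(sims)

print(f"royal symbols: {coherence(['crown', 'scepter', 'throne', 'orb']):.3f}")
print(f"decoy group:   {coherence(['crown', 'jewel', 'prince', 'orb']):.3f}")
# The royal grouping scores markedly higher, so the decoy is rejected.
```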
While the benchmark is not a formal industry standard like GLUE or SuperGLUE, its popularity among AI researchers and enthusiasts has made it a de facto yardstick for verbal reasoning. The Reddit post that broke the news has sparked widespread discussion, with users noting that the model’s near-perfect score may necessitate the creation of even harder benchmarks. ‘I’ll need a new, harder version that combines multiple puzzles into one sooner than I thought,’ wrote user /u/zero0_one1, the original poster.
As AI continues to blur the line between machine and human cognition, the Gemini 3.1 Pro Preview’s record underscores a broader trend: the evolution of AI from reactive tools into proactive reasoning agents. While challenges remain in areas like factual consistency and bias mitigation, this milestone suggests that the next frontier in artificial intelligence may lie not in the volume of data, but in the quality of understanding.


