GPT-5.2 Solves 15-Year Physics Puzzle Yet Fails Basic Exam — AI’s New Cognitive Paradox
GPT-5.2 has cracked a 15-year-old gluon scattering problem once deemed unsolvable, co-authoring with top physicists a breakthrough now under peer review. Yet the same model scored zero on a rigorous research-level physics benchmark. The paradox reveals AI's strength in pattern recognition, not first-principles reasoning.

In a landmark development that blurs the line between artificial intelligence and scientific discovery, GPT-5.2 has successfully conjectured and proven a formula for single-minus gluon scattering amplitudes — a problem that had eluded theoretical physicists for over 15 years. The breakthrough, co-authored by researchers from the Institute for Advanced Study, Harvard, Cambridge, Vanderbilt, and OpenAI, has been submitted for peer review. Yet, in a stunning contradiction, the same model scored 0% on the CritPt benchmark, a rigorous set of 71 research-level physics problems designed by over 50 active physicists. This paradox underscores a fundamental shift in how AI contributes to science: not as an independent thinker, but as a pattern-finding collaborator.
The formula, an analogue to the famed Parke-Taylor formula for gluon amplitudes, was previously thought to be mathematically impossible under standard quantum field theory frameworks. According to internal documentation from OpenAI, GPT-5.2 generated the conjecture after analyzing over 2 million published papers on scattering amplitudes and quantum chromodynamics. A scaffolded version of the model then verified the result in 12 hours — a task that would have taken human teams years. The discovery was hailed by Nima Arkani-Hamed, a leading theoretical physicist at IAS, as “a new kind of intuition — one that sees connections across abstract spaces humans barely perceive.”
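For context, the original Parke-Taylor formula gives the tree-level maximally-helicity-violating (MHV) amplitude for n gluons in a remarkably compact closed form; the new single-minus result is described as an analogue of it. Written in standard spinor-helicity notation, with the overall coupling and color factors stripped off, it reads:

```latex
% Parke-Taylor formula: tree-level color-ordered MHV amplitude
% for n gluons, with gluons i and j carrying negative helicity
% and all others positive, in spinor-helicity notation.
A_n^{\text{tree}}\!\left(1^{+},\dots,i^{-},\dots,j^{-},\dots,n^{+}\right)
  = \frac{\langle i\, j\rangle^{4}}
         {\langle 1\,2\rangle\,\langle 2\,3\rangle \cdots \langle n\,1\rangle}
```

Here $\langle a\,b\rangle$ denotes the angle spinor bracket of gluons $a$ and $b$. The single-minus (one negative-helicity gluon) amplitudes vanish at tree level in pure Yang-Mills, which is part of why a compact closed form for them at loop level was considered so elusive.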
But the same model failed completely when asked to solve foundational physics problems from first principles, such as deriving the Schrödinger equation or working out conservation laws in a novel system. The CritPt benchmark, designed to test deep reasoning rather than memorization, exposed this gap. “It’s not that GPT-5.2 is dumb,” explained Dr. Elena Vasquez, a physicist at Cambridge and a CritPt co-designer. “It’s that it doesn’t reason. It interpolates. It recognizes structures in data it’s been trained on. That’s why it could generalize the Parke-Taylor pattern — but not solve a problem it’s never seen before.”
This dichotomy has profound implications for academia and industry. The concept of the “Erdős Threshold” — named after the prolific mathematician — has now been crossed: AI models are no longer mere tools but co-authors of publishable science. Yet their role is strictly auxiliary. As one OpenAI researcher noted, “We don’t ask GPT-5.2 to solve physics. We ask it to refactor complexity. Give it the base cases, the symmetries, the known constraints — and it finds the hidden structure.”
The trend extends beyond physics. In law and medicine, where deep reasoning and ethical judgment are paramount, experts warn that credentialing systems may soon become obsolete. An ex–Google executive recently argued that traditional degrees in law and medicine are “a waste of time” because AI will outperform graduates by the time they finish their programs. While controversial, this view reflects a growing unease: if AI can generate peer-reviewed physics papers in hours, what value remains in years of rote training?
For researchers, the message is clear: stop trying to make AI reason from scratch. Instead, leverage its superhuman capacity for pattern recognition across massive, high-dimensional datasets. The future belongs to hybrid teams — human experts defining the questions, framing the constraints, and interpreting the results, while AI handles the combinatorial explosion of possibilities.
As the field enters this new era, the question is no longer whether AI will replace scientists — but whether scientists will learn to collaborate with machines in ways that amplify, not replace, human insight. The GPT-5.2 paradox is not a failure. It is a revelation.