Vision-Language Models Rely on Text Recognition to Identify Basic Shapes, Study Finds
A new preprint reveals that leading vision-language models struggle to recognize simple geometric shapes like squares unless they can first detect textual labels, suggesting their spatial reasoning is mediated by language rather than visual perception.

A groundbreaking study published on arXiv under the identifier 2602.15950 has uncovered a startling limitation in modern vision-language models (VLMs): their ability to perceive basic geometric shapes such as squares appears to depend heavily on text recognition rather than on genuine visual understanding. The research, conducted by an interdisciplinary team of AI researchers, tested three major families of VLMs (models derived from CLIP, LLaVA, and Qwen-VL) on a series of visual reasoning tasks centered on identifying squares in abstract images, both with and without textual labels. The results challenge the prevailing assumption that these models possess robust spatial reasoning capabilities derived from visual input alone.
The study’s authors designed controlled experiments using synthetic images containing shapes of varying complexity: squares, rectangles, circles, and irregular polygons. In some cases, the shapes were labeled with the word "square" in clear, legible text; in others, no text was present. When textual cues were available, all three model families achieved near-perfect accuracy in identifying squares. However, when the same shapes were presented without any text, performance plummeted—often falling to near-random levels, even when the visual geometry was unambiguous. This pattern held consistently across hundreds of test cases, suggesting that the models are not "seeing" the shape as a geometric entity, but rather recognizing the word "square" and associating it with the visual pattern.
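The paper's stimulus code is not described in detail, but the setup above is straightforward to reproduce in spirit. The sketch below is a minimal illustration assuming a Pillow-based pipeline; the function name, canvas size, and label placement are this article's own choices, not the authors'. It produces a matched pair of images: the same square once annotated with the word "square" and once left bare.

```python
# Minimal, illustrative reconstruction of the labeled vs. unlabeled stimulus pair.
# Assumes Pillow; all names and dimensions are hypothetical, not the study's own.
from PIL import Image, ImageDraw

def make_square_stimulus(size=224, side=100, labeled=False):
    """Draw one square on a white canvas, optionally annotated with the word 'square'."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    x0 = (size - side) // 2
    y0 = (size - side) // 2
    draw.rectangle([x0, y0, x0 + side, y0 + side], outline="black", width=3)
    if labeled:
        # The text-present condition, in which all three model families
        # reportedly reached near-perfect accuracy.
        draw.text((size // 2 - 20, size // 2 - 5), "square", fill="black")
    return img

if __name__ == "__main__":
    make_square_stimulus(labeled=True).save("square_labeled.png")
    make_square_stimulus(labeled=False).save("square_unlabeled.png")
```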
"This isn't a failure of perception—it's a failure of reasoning," said one of the study’s lead authors, speaking anonymously due to institutional policy. "The models aren’t deducing that four equal sides and right angles define a square. They’re matching the label they’ve seen during training with the visual context. It’s a form of pattern correlation, not geometric understanding."
The implications extend beyond academic curiosity. As VLMs become embedded in critical infrastructure—from autonomous vehicle perception systems to medical imaging analysis and robotics—relying on text-mediated recognition introduces dangerous fragility. A square drawn without a label, or one obscured by visual noise, could be misclassified, potentially leading to catastrophic errors in safety-critical applications. The study also raises concerns about bias: models trained predominantly on labeled datasets may fail to generalize to real-world scenarios where labels are absent or inconsistent.
Interestingly, the researchers found that models with stronger optical character recognition (OCR) components performed better on text-present trials, but showed no improvement on text-free ones. This further supports the hypothesis that text recognition acts as a proxy for spatial reasoning. In one striking example, a model correctly identified a square labeled "square" even when it was rotated 45 degrees and appeared as a diamond—but failed entirely when the same rotated shape was unlabeled, despite its geometric identity remaining unchanged.
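That rotation probe is easy to picture as a variant of the same stimulus: the square is turned 45 degrees so that it reads visually as a diamond while its geometry stays identical. A short extension of the sketch above, again assuming Pillow and again illustrative rather than the authors' code, is enough:

```python
# Rotated-square ("diamond") condition: same geometry, different canonical appearance.
# Assumes Pillow; expand=True keeps the rotated shape fully inside the canvas.
from PIL import Image

def rotate_stimulus(img: Image.Image, degrees: float = 45.0) -> Image.Image:
    """Rotate a stimulus image, filling the exposed corners with white."""
    return img.rotate(degrees, expand=True, fillcolor="white")

if __name__ == "__main__":
    rotate_stimulus(Image.open("square_unlabeled.png")).save("square_unlabeled_rot45.png")
```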
The findings echo earlier critiques of AI’s "black box" reasoning, but here the issue is more fundamental: the model’s perception of basic visual categories is mediated by language, not vision. This suggests that current VLMs may be more adept at linguistic interpretation than visual cognition, blurring the line between understanding and memorization.
While the study does not claim that VLMs are incapable of spatial reasoning, it does demonstrate that their current architecture and training paradigms do not reliably produce it. The authors call for new benchmarks that explicitly exclude textual cues and urge the AI community to develop evaluation metrics that distinguish between language-mediated associations and true visual-spatial intelligence.
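A benchmark of that kind can be scored with a very simple statistic: the gap between accuracy on text-present trials and accuracy on matched text-free trials, where a large gap signals label-mediated recognition rather than visual-spatial reasoning. The sketch below is one hedged way to compute it; `query_model` is a hypothetical stand-in for whatever VLM interface a team actually uses, not an API from the paper.

```python
# Hypothetical scoring harness for a text-free shape benchmark.
# `query_model` stands in for any VLM call mapping an image path to a predicted shape name.
from typing import Callable, Iterable, Tuple

def text_reliance_gap(
    query_model: Callable[[str], str],
    trials: Iterable[Tuple[str, str, bool]],  # (image_path, true_shape, has_text_label)
) -> float:
    """Return accuracy on text-present trials minus accuracy on text-free trials.
    A gap near zero is consistent with genuinely visual recognition; a large
    positive gap suggests the model is reading the label, not seeing the shape."""
    correct = {True: 0, False: 0}
    total = {True: 0, False: 0}
    for image_path, true_shape, has_text in trials:
        total[has_text] += 1
        if query_model(image_path).strip().lower() == true_shape.lower():
            correct[has_text] += 1
    accuracy = {k: (correct[k] / total[k]) if total[k] else 0.0 for k in total}
    return accuracy[True] - accuracy[False]
```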
As AI systems increasingly interact with the physical world, the ability to perceive and reason about shapes, spatial relationships, and geometry without relying on text labels may become a prerequisite for trustworthiness. This study serves as both a warning and a roadmap: if we want machines to truly "see," we must stop letting them read their way to understanding.