AI Image Editing Models Show Wild Hallucinations in Historical Photo Test

A new comparison reveals major AI image editing models, including GPT Image and Grok, produce significant hallucinations when colorizing a historical photo. The test, centered on a cropped section of the Solvay Conference, shows models inventing colors, textures, and details not present in the original. Independent benchmarks highlight the ongoing challenge of factual accuracy in multimodal AI systems.

Byline: Investigative Tech Journalist

Date: October 10, 2024

A recent, informal but revealing comparison of leading AI image editing models has exposed significant issues with factual accuracy and hallucination, raising questions about their reliability for precise editing tasks. The test, which asked multiple models to colorize a cropped section of the famous 1927 Solvay Conference photograph, resulted in outputs ranging from faithful recoloring to complete fabrications of color, texture, and detail.

The comparison, shared on a popular online forum, placed outputs from models like OpenAI's GPT Image, xAI's Grok, Nano Banana Pro, Seedream 4.5, and Tencent's Hunyuan side by side. According to the user analysis, GPT Image and Grok were cited as particularly poor performers, with outputs described as "completely different from the original image." In contrast, Nano Banana Pro and Seedream 4.5 were highlighted as the clear winners, producing more constrained and accurate colorizations, though still with notable differences from the original.

"I don't understand how GPT Image is currently the top model for image editing," the original poster, enilea, wrote, pointing to the dramatic discrepancies. The visual evidence shows models inventing suit colors, skin tones, and background elements that bear little resemblance to a historically plausible colorization of the black-and-white original.

The Hallucination Spectrum

The test serves as a practical demonstration of the hallucination problem pervasive in large language and multimodal models. While often discussed in the context of text generation—where models invent facts or citations—this comparison shows the issue translates vividly to visual tasks. Models tasked with a straightforward colorization job instead engaged in creative reinterpretation, altering the fundamental content of the source material.
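
A rough way to quantify this kind of drift is to check whether a colorized output still carries the luminance structure of the grayscale source: faithful colorization mostly adds chrominance, so converting the output back to grayscale should closely match the original. The sketch below assumes the standard Pillow, NumPy, and scikit-image libraries; the `luminance_fidelity` helper and the file names are hypothetical, and SSIM is only a coarse proxy for semantic faithfulness, not a formal benchmark.

```python
# Hypothetical sketch: measure how much of the original grayscale structure
# survives in a model's colorized output. Low scores hint at invented detail.
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity as ssim

def luminance_fidelity(original_path: str, colorized_path: str) -> float:
    """SSIM between the grayscale original and the colorized output
    converted back to grayscale (1.0 = structure fully preserved)."""
    original = Image.open(original_path).convert("L")
    colorized = Image.open(colorized_path).convert("L").resize(original.size)
    a = np.asarray(original, dtype=np.float64)
    b = np.asarray(colorized, dtype=np.float64)
    return ssim(a, b, data_range=255.0)

# Example with hypothetical file names:
# print(luminance_fidelity("solvay_crop_bw.png", "gpt_image_colorized.png"))
```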

Side-by-side comparisons like this are crucial for understanding the practical strengths and weaknesses of competing AI systems. Because every model received the same constrained prompt and the same source image, the test offers a stark, visual illustration of how different architectures and training datasets handle an identical task.

The results suggest a trade-off. Some models appear to prioritize creative freedom or aesthetic interpretation, leading to greater hallucination, while others stay more tightly bound to the input data, producing conservative but more accurate outputs. The poor showing of some top-tier names indicates that leaderboard position or brand recognition does not guarantee strong results on specific, detail-oriented tasks like historical photo editing.

Benchmarking the Multimodal Challenge

This informal test aligns with broader academic efforts to systematically evaluate multimodal AI. A recent arXiv preprint, "MIBench: Evaluating Multimodal Large Language Models over Multiple Images," reflects how the research community is actively developing rigorous benchmarks to assess how well these models understand, reason about, and faithfully manipulate visual information.

The MIBench work underscores the complexity of evaluating models that process multiple images. A task like colorization requires the model to understand the content of a single image (clothing, faces, objects) and apply plausible color based on real-world knowledge, without inventing new features. The Solvay Conference test, while not a controlled study, acts as a compelling case study for the very challenges formal benchmarks seek to quantify.

TechCrunch reports that the rapid evolution of multimodal models has outpaced the development of standardized evaluation suites, leading to a gap between marketed capabilities and real-world performance. User-generated tests, like this colorization comparison, often fill this gap, providing immediate, tangible evidence of model behavior.

Implications for Professional and Casual Use

The implications are significant for both professional and casual users. A historian seeking to colorize an archive photo would find the hallucinations of GPT Image or Grok unacceptable, as they introduce ahistorical elements. A social media user might not care about strict accuracy but could be misled by a convincingly colored but fabricated detail.

The performance of Hunyuan, which according to the user comment looked like its "input was heavily downscaled and then upscaled again badly," points to another common issue: loss of fidelity during processing, which degrades output quality independent of hallucination.
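
One way to sanity-check that impression, assuming Pillow and NumPy are available, is a simple round-trip heuristic: downscale the output, upscale it back to its original size, and measure how much it changes. An image that was already softened by bad rescaling loses very little in the round trip. The `roundtrip_loss` helper and the file name below are hypothetical, and any threshold for "suspiciously smooth" would need calibration against known-good images.

```python
# Hypothetical sketch: flag outputs that look as if they were downscaled and
# then re-upscaled. If a downscale/upscale round trip barely changes the
# image, most of its fine detail was already gone before the round trip.
import numpy as np
from PIL import Image

def roundtrip_loss(path: str, factor: int = 4) -> float:
    """Mean absolute pixel difference after a downscale/upscale round trip.
    Values near zero suggest the image carries little high-frequency detail."""
    img = Image.open(path).convert("L")
    w, h = img.size
    small = img.resize((max(1, w // factor), max(1, h // factor)))
    restored = small.resize((w, h))
    a = np.asarray(img, dtype=np.float64)
    b = np.asarray(restored, dtype=np.float64)
    return float(np.mean(np.abs(a - b)))

# Example with a hypothetical file name:
# print(roundtrip_loss("hunyuan_colorized.png"))
```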

Reuters has noted in industry analyses that enterprise clients are increasingly wary of AI "creativity" in contexts demanding precision, such as medical imaging, technical documentation, and historical preservation. The market is subsequently seeing a differentiation between models optimized for creative generation and those fine-tuned for factual, detail-preserving tasks.

Conclusion: A Call for Transparency and Specialization

This simple colorization comparison reveals a complex landscape. It highlights that users must carefully compare models for their specific use case, as overall rankings can be misleading. The test also reinforces the need for model developers to be transparent about their systems' tendencies—whether they are optimized for imaginative variation or strict fidelity.

As academic work like MIBench progresses, providing more nuanced ways to compare and contrast model capabilities, end-users will gain better tools for selection. For now, empirical, side-by-side tests remain a vital resource. The lesson from the Solvay Conference is clear: in AI image editing, what you ask for is not always what you get, and the gap between request and result is filled with the specter of hallucination.

This report synthesizes information from user-generated comparative analysis, academic benchmarking research, and industry reporting on AI model evaluation.
