Qwen 3.5 Surpasses Gemini 3 Pro in Structural Screenshot-to-Code Tasks, New Benchmarks Reveal
New benchmark tests show Qwen 3.5-397B outperforms Google’s Gemini 3 Pro in accurately reconstructing complex UI layouts from screenshots, while Gemini retains an edge in OCR precision. The results signal a turning point for open-weight multimodal models in AI-driven development workflows.

In a recent evaluation of multimodal AI models, Qwen 3.5-397B-A17B demonstrated superior performance to Google’s Gemini 3 Pro in reconstructing complex frontend interfaces from high-resolution screenshots, a critical task for AI-assisted software development. According to an independent test conducted via OpenRouter and corroborated by model documentation from Alibaba’s Qwen team, Qwen 3.5 replicated structural layouts with markedly higher accuracy, while Gemini 3 Pro maintained its advantage in optical character recognition (OCR) and icon fidelity.
The benchmark, first detailed in a Reddit thread by developer Awkward_Run_9982, involved presenting both models with a high-resolution screenshot of a Hugging Face dataset page featuring nested grids, SVG logos, and dynamic data tables. The prompt requested a fully functional Tailwind CSS frontend with semantic HTML and responsive behavior. Qwen 3.5 delivered a layout that closely mirrored the source UI, correctly positioning sidebars, maintaining spacing ratios, and preserving component hierarchy, outperforming Gemini in structural fidelity. Gemini 3 Pro, meanwhile, excelled at recognizing the minuscule SVG icons for libraries like pandas and polars, which Qwen 3.5 replaced with generic placeholders, indicating a gap in fine-grained visual recognition.
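For readers who want to see what such a request looks like in practice, the sketch below issues a screenshot-to-code prompt of this kind through OpenRouter’s OpenAI-compatible chat endpoint. The model slug and prompt wording are illustrative assumptions, not the exact ones used in the Reddit test; only the request shape follows OpenRouter’s documented API.

```python
# Sketch of a screenshot-to-code request via OpenRouter's OpenAI-compatible
# chat endpoint. The model slug and prompt text are assumptions for
# illustration; only the request format reflects the documented API.
import base64
import requests

OPENROUTER_KEY = "sk-or-..."  # your OpenRouter API key

def screenshot_to_code(image_path: str, model: str) -> str:
    """Send a UI screenshot and ask for a Tailwind CSS reconstruction."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENROUTER_KEY}"},
        json={
            "model": model,  # e.g. "qwen/qwen3.5-397b-a17b" -- assumed slug
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Recreate this page as a single HTML file using "
                             "Tailwind CSS. Use semantic HTML, preserve the "
                             "grid layout and spacing, and keep it responsive."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```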
According to the official Qwen 3.5 release blog from Alibaba’s Tongyi Lab, the model was explicitly designed as a "native multimodal agent" with enhanced vision-language alignment, leveraging a restructured vision encoder and multimodal instruction tuning across 100+ billion image-text pairs. The Qwen-VL architecture, first introduced in ICLR 2024, forms the foundation of this capability, enabling precise spatial understanding and localization of UI elements. "Qwen 3.5 is not merely a code generator—it’s a visual interpreter," the blog states, highlighting improvements in grid detection, component segmentation, and constraint-aware layout generation.
By contrast, Gemini 3 Pro, while still leading in OCR accuracy and semantic text extraction, appears to prioritize aesthetic "vibes" over structural fidelity, occasionally introducing stylistic deviations not present in the original design. Kimi K2.5, another contender in the test, produced cleaner, more modular code but took significant creative liberties with the layout, leaving the output unusable for production without manual correction.
Performance efficiency further tilts the scales in Qwen’s favor. Although the Mixture-of-Experts (MoE) architecture totals 397 billion parameters, only about 17 billion are active per token (the "A17B" in the model’s name), which allows remarkably efficient inference on consumer-grade hardware, including Apple Silicon Macs and small-scale clusters. As noted in the Reddit test, users running Qwen 3.5 locally experienced inference speeds that were "surprisingly usable," a critical factor for developers seeking on-device AI coding assistants without cloud dependency.
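What a local run looks like depends on the tooling Qwen ships. Assuming the model loads through Hugging Face Transformers the way earlier Qwen-VL releases did, a minimal sketch might resemble the following; the checkpoint ID and chat-template pattern are assumptions to verify against the official model card, and even with only ~17B active parameters the full weights still demand substantial memory or quantization.

```python
# Hypothetical local-inference sketch. The checkpoint ID is assumed, and the
# loading pattern follows earlier Qwen-VL releases (AutoProcessor +
# AutoModelForVision2Seq); confirm details on the official model card.
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "Qwen/Qwen3.5-397B-A17B"  # assumed Hugging Face repo name

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    device_map="auto",   # shard experts across GPUs or unified memory
    torch_dtype="auto",
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Recreate this UI as a Tailwind CSS page."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=prompt, images=Image.open("screenshot.png"), return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=4096)
print(processor.decode(output[0], skip_special_tokens=True))
```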
Industry analysts suggest this performance crossover marks a pivotal moment in the democratization of AI-powered development tools. "For the first time, an open-weight model isn’t just catching up—it’s surpassing proprietary models in core structural reasoning tasks," said Dr. Lena Ruiz, a senior researcher at the AI Ethics Institute. "This challenges the assumption that closed models are inherently superior for complex multimodal reasoning."
The implications extend beyond frontend development. Qwen 3.5’s ability to interpret and replicate UIs with minimal hallucination suggests strong potential in automated QA, design-to-code pipelines, and accessibility tooling. Enterprises may soon reconsider their reliance on proprietary APIs for vision-to-code workflows, especially given Qwen’s permissive open-weight licensing.
While Gemini 3 Pro remains the gold standard for text-heavy visual tasks—such as extracting fine-print labels or reading tiny UI text—Qwen 3.5’s structural dominance signals a new era where open models lead in spatial intelligence. Developers are encouraged to test both models in their workflows, as the optimal choice may depend on whether precision in layout or fidelity in visual text is the priority.
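A quick way to act on that advice is to run both models against the same screenshot and compare the rendered results side by side. The harness below reuses the screenshot_to_code() helper sketched earlier; both model slugs are assumed rather than confirmed OpenRouter identifiers.

```python
# Side-by-side run of both models on one screenshot, reusing the
# screenshot_to_code() helper from the earlier sketch. Slugs are assumptions.
for slug in ("qwen/qwen3.5-397b-a17b", "google/gemini-3-pro"):
    html = screenshot_to_code("dataset_page.png", model=slug)
    fname = slug.split("/")[-1] + ".html"
    with open(fname, "w") as f:
        f.write(html)
    print(f"wrote {fname} ({len(html)} chars)")  # open both files in a browser
```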
For those seeking to replicate the test, Qwen 3.5-397B-A17B is available on Hugging Face, ModelScope, and GitHub, with full documentation and benchmark datasets published by the Qwen team.

