SanityBoard Unveils Major AI Agent Eval Update: Qwen3.5, Gemini 3.1 Pro, and New Open-Source Agents Shine
A comprehensive new evaluation platform, SanityBoard, has released 27 fresh benchmark results comparing top AI models including Qwen3.5 Plus, Gemini 3.1 Pro, and Sonnet 4.6, alongside three new open-source coding agents. The findings reveal striking performance differences tied to model architecture, infrastructure variability, and iterative behavior.

SanityBoard, an independent AI evaluation platform, has released a major update featuring 27 new benchmark results that offer a detailed snapshot of the current landscape of large language models and autonomous coding agents. Among the most notable inclusions are Qwen3.5 Plus, GLM 5, Gemini 3.1 Pro, and Anthropic's Sonnet 4.6, alongside three newly evaluated open-source coding agents: kilocode CLI, cline CLI, and pi*. The update, compiled over three days of continuous testing by platform creator lemon07r, highlights not only model performance but also how strongly infrastructure, evaluation design, and agent behavior shape benchmark outcomes.
According to the SanityBoard report, GPT-Codex-style models—known for their iterative, trial-and-error approach—dominated coding tasks. These models repeatedly refine outputs, making them exceptionally suited to long-running, multi-step programming challenges. In contrast, Claude-family models like Sonnet 4.6, while praised for their reasoning clarity and safety, scored lower in these specific evaluations due to their reluctance to iterate. "In an interactive coding scenario, Claude models are still superior," the author notes, "but for set-and-forget automation, GPT-Codex variants are unmatched."
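The gap the report draws between iterative and single-shot agents comes down to a simple control loop: generate, test, feed the failure back, try again. The sketch below is a minimal illustration of that trial-and-error pattern, not SanityBoard's harness; `generate` and `run_tests` are hypothetical stand-ins for the model call and the task's test runner.

```python
from typing import Callable, Optional

def iterate_until_passing(
    generate: Callable[[str], str],               # stand-in: prompt -> candidate code
    run_tests: Callable[[str], tuple[bool, str]], # stand-in: candidate -> (passed, failure log)
    task: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Trial-and-error loop: regenerate with the failing log appended each round."""
    prompt = task
    for _ in range(max_rounds):
        candidate = generate(prompt)
        passed, log = run_tests(candidate)
        if passed:
            return candidate
        # Feed the failure back so the next attempt can correct it.
        prompt = f"{task}\n\nPrevious attempt failed with:\n{log}\nFix the code."
    return None  # a single-shot agent effectively stops after the first round
```

An agent that refuses to loop, or that is run in a set-and-forget pipeline without a feedback channel, only ever sees the first pass through this loop, which is the behavioral difference the report attributes to the scoring gap.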
Google’s Gemini 3.1 Pro emerged as a standout performer, aligning with recent analyses from VentureBeat, which described it as a "Deep Think Mini" capable of adjustable reasoning on demand. The model’s ability to dynamically allocate computational resources to complex reasoning tasks appears to give it an edge in multi-phase evaluations, particularly when precision and step-by-step validation are required. According to VentureBeat’s February 2026 assessment, Gemini 3.1 Pro’s efficiency in balancing speed and depth makes it a compelling choice for enterprise workflows—findings corroborated by SanityBoard’s results.
Meanwhile, Alibaba’s Qwen3.5 Plus, recently highlighted by Latent.Space as the "smallest Open-Opus class" model, demonstrated remarkable efficiency without sacrificing performance. The 397B-A17B variant, though not directly tested here, points to a trend toward highly optimized, compact architectures that rival larger models. SanityBoard’s results indicate Qwen3.5 Plus excels in both code generation and contextual retention, positioning it as a strong contender across open-source and commercial ecosystems.
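The "397B-A17B" naming convention describes a mixture-of-experts split between total and active parameters: roughly 397 billion weights stored, about 17 billion activated per token. A back-of-the-envelope calculation shows why that split matters for serving cost; the figures below are illustrative approximations, not vendor-published measurements.

```python
# Rough illustration of total vs. active parameters in a 397B-A17B MoE model.
# All numbers are illustrative assumptions, not measured values.
total_params = 397e9    # weights that must fit in memory
active_params = 17e9    # weights actually used per token

bytes_per_param = 2     # assuming 16-bit weights
memory_gb = total_params * bytes_per_param / 1e9
active_fraction = active_params / total_params

print(f"Weight memory needed: ~{memory_gb:.0f} GB")            # ~794 GB
print(f"Per-token compute vs. dense 397B: ~{active_fraction:.1%}")  # ~4.3%
```

The memory footprint stays that of a large model, but per-token compute is closer to a mid-sized dense model, which is the efficiency-without-sacrificing-performance trade-off the article describes.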
Perhaps the most revealing aspect of the update is the emphasis on infrastructure variability. The author notes that even minor differences in API latency, server load, or token throttling significantly impacted scores. For example, evaluations using the now-defunct z.ai infrastructure yielded inconsistent results, with some models appearing to underperform due to backend instability rather than intrinsic limitations. "The same model, run on different providers or even at different times, can score 15-20% differently," the report states. To mitigate this, SanityBoard implemented generous retry limits and manual vetting for every run, though the author admits human oversight remains imperfect.
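A generous retry policy for flaky providers typically looks like a backoff wrapper around the evaluation call. The following is a minimal sketch of that idea, assuming a transient-error signal from the provider; the exception type, attempt counts, and delays are assumptions, not SanityBoard's actual configuration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class TransientProviderError(Exception):
    """Stand-in for rate limits, timeouts, or backend instability."""

def run_with_retries(call: Callable[[], T], max_attempts: int = 6) -> T:
    """Retry a flaky evaluation call with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientProviderError:
            if attempt == max_attempts:
                raise  # give up; the run gets flagged for manual review
            # Back off 2, 4, 8, ... seconds plus jitter to avoid hammering the backend.
            time.sleep(2 ** attempt + random.uniform(0, 1))
    raise RuntimeError("unreachable")
```

Retries smooth over transient failures, but they cannot correct a provider that is consistently slow or degraded, which is why the report still pairs them with manual vetting of every run.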
The inclusion of three new open-source agents—kilocode, cline, and pi*—also provides critical context. While kilocode and cline demonstrated iterative capabilities similar to commercial models, pi* stood out for its single-shot, non-iterative behavior. This design choice, while unique, resulted in consistently poor scores, forcing the SanityBoard team to overhaul their retry and output-capture logic. "No other agent buffers all output until completion," the report explains. "This required custom handling and delayed the entire release by over 48 hours."
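Because pi* reportedly emits nothing until it finishes, a harness that expects line-by-line streaming can misread a fully buffering agent as hung. One hedged way to capture both behaviors is to apply a single wall-clock deadline to the whole process rather than a per-line timeout; the sketch below is an illustration of that approach, and the agent command and timeout are hypothetical, not SanityBoard's implementation.

```python
import subprocess

def capture_agent_output(cmd: list[str], deadline_s: int = 1800) -> str:
    """Capture stdout from an agent that may stream lines or buffer everything.

    communicate() with an overall deadline works for both cases; a per-line
    read timeout would wrongly kill an agent that buffers until completion.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    try:
        out, _ = proc.communicate(timeout=deadline_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        out, _ = proc.communicate()  # collect whatever was produced before the kill
    return out

# Hypothetical invocation; "pi-star" is a placeholder, not the agent's real CLI name.
# transcript = capture_agent_output(["pi-star", "--task", "fix_failing_tests"])
```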
On the technical front, SanityBoard introduced a date-slider filter and expanded UI options to help users account for model version drift and infrastructure changes over time. These features are crucial as AI models evolve rapidly—Gemini 3.1 Pro, for instance, was only released three weeks before these evaluations. The platform now serves as a living archive, capturing the transient nature of AI performance.
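In practice, a date-slider filter amounts to restricting results to runs whose timestamps fall inside a chosen window, so comparisons only span comparable model versions and infrastructure. A minimal sketch of that filter over a list of result records follows; the field names and dates are assumptions, not SanityBoard's schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class EvalResult:
    model: str
    score: float
    run_date: date  # when the run happened, not when the model was released

def filter_by_date(results: list[EvalResult], start: date, end: date) -> list[EvalResult]:
    """Keep only runs inside the slider's window, so version drift stays comparable."""
    return [r for r in results if start <= r.run_date <= end]

# Example usage with placeholder dates:
# window = filter_by_date(all_results, date(2026, 1, 1), date(2026, 2, 1))
```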
As open-source agents grow in sophistication and commercial models push toward multimodal autonomy—Qwen3.5’s recent blog touting "native multimodal agents" underscores this trajectory—the need for transparent, reproducible benchmarking becomes paramount. SanityBoard’s latest update doesn’t just rank models; it exposes the hidden variables behind the numbers, offering researchers, developers, and enterprises a more nuanced view of what truly drives AI performance in real-world scenarios.


