ClawBench: AI Agents Succeed in Just 33.3% of Real Tasks (2026 Study)

ClawBench, a new benchmark testing AI agents on 153 real-world online tasks across 144 live websites, reveals even the best models succeed in only 33.3% of tasks. Finance and academic tasks are easier, while travel and development tasks remain daunting.

ClawBench Reveals AI Agents Struggle with Real-World Online Tasks

ClawBench, a groundbreaking evaluation framework for AI browser agents, has exposed critical limitations in today’s most advanced AI systems. According to the research paper published on arXiv, even the top-performing model—Claude Sonnet 4.6—achieved only a 33.3% success rate across 153 real-world tasks conducted on 144 live, production websites. Unlike synthetic benchmarks that use simulated environments, ClawBench tests agents on actual platforms like Airbnb, GitHub, and university portals, revealing how far AI still is from reliably handling everyday digital tasks.

Why 33.3% Success Rate Matters for AI Autonomy

This low success rate underscores a fundamental gap: AI agents can process instructions but fail at dynamic, real-time web interaction. Even minor changes—like a redesigned checkout button or a CAPTCHA—cause breakdowns. For AI to achieve true autonomy, it must adapt to unpredictable interfaces, not just follow static prompts.

Claude Sonnet 4.6 vs GLM-5: Performance Breakdown

Claude Sonnet 4.6 led with a 33.3% success rate, while GLM-5, a text-only model without visual input, ranked second at 24.2%. This challenges the assumption that multimodal perception is essential for web navigation. Strong reasoning and instruction-following can partially compensate for missing visual data, though neither model exceeded 50% success in any category.
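To make the percentages concrete, the reported rates can be converted back into approximate task counts. The snippet below is a minimal sketch: the total of 153 tasks and the two rates come from the article, while the per-model success counts are inferred by rounding and are not stated in the source. The function name `implied_successes` is illustrative.

```python
# Reported figures from the article.
TOTAL_TASKS = 153
REPORTED_RATES = {"Claude Sonnet 4.6": 33.3, "GLM-5": 24.2}  # percent


def implied_successes(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Infer how many tasks a reported success rate corresponds to.

    This is an inference by rounding, not a number taken from the paper.
    """
    return round(rate_percent / 100 * total)


def rate_from_count(successes: int, total: int = TOTAL_TASKS) -> float:
    """Recompute the success rate (percent, one decimal) from a task count."""
    return round(successes / total * 100, 1)


if __name__ == "__main__":
    for model, rate in REPORTED_RATES.items():
        count = implied_successes(rate)
        # Sanity check: the inferred count should round back to the reported rate.
        print(f"{model}: ~{count} of {TOTAL_TASKS} tasks "
              f"(recomputed rate: {rate_from_count(count)}%)")
```

Under this rounding, 33.3% corresponds to about 51 of 153 tasks and 24.2% to about 37, so the gap between the two models is roughly 14 tasks.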

Task Difficulty by Domain: Finance vs. Travel

Task success varied dramatically by domain. Finance and academic tasks—like checking account balances or retrieving course syllabi—saw the best model achieve up to 50% success. In contrast, travel and developer tasks, such as booking non-refundable flights or configuring GitHub Actions workflows, saw success rates below 20%. These workflows demand deep domain knowledge, time-sensitive decisions, and contextual awareness—areas where AI still lags.

How ClawBench Ensures Safe, Accurate Evaluation

ClawBench captures five behavioral data streams: session replays, screenshots, HTTP traffic logs, agent reasoning traces, and browser action logs. A request interceptor blocks irreversible actions such as payments or bookings, which keeps testing on live sites safe and ethical. Human annotators validate outcomes, and an agentic evaluator provides step-by-step failure diagnostics. The article describes this combination as making ClawBench the most transparent AI benchmark to date.
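The idea of a request interceptor can be illustrated with a small sketch. This is not ClawBench's implementation, which the article does not describe in detail. It assumes a simple rule: a request is blocked when it uses a state-changing HTTP method and its URL path contains a keyword associated with payments or bookings. The `Request` class, the keyword list, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical keywords suggesting an irreversible action (assumption, not from the paper).
BLOCKED_PATH_HINTS = ("checkout", "payment", "purchase", "booking", "order")
# HTTP methods that can change state on the server.
STATE_CHANGING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


@dataclass(frozen=True)
class Request:
    """Minimal stand-in for an outgoing browser request."""
    method: str
    url: str


def should_block(request: Request) -> bool:
    """Return True if the request looks like an irreversible action."""
    if request.method.upper() not in STATE_CHANGING_METHODS:
        return False  # read-only requests are always forwarded
    path = urlparse(request.url).path.lower()
    return any(hint in path for hint in BLOCKED_PATH_HINTS)


def intercept(requests):
    """Split requests into (forwarded, blocked) so both can be logged."""
    forwarded, blocked = [], []
    for request in requests:
        (blocked if should_block(request) else forwarded).append(request)
    return forwarded, blocked


if __name__ == "__main__":
    traffic = [
        Request("GET", "https://example.com/flights/search"),
        Request("POST", "https://example.com/checkout/confirm"),
        Request("POST", "https://example.com/search"),
    ]
    forwarded, blocked = intercept(traffic)
    print("forwarded:", [r.url for r in forwarded])
    print("blocked:", [r.url for r in blocked])
```

Keeping the blocked requests rather than discarding them matters for evaluation: the attempted final action is evidence of whether the agent would have completed the task, even though it never reaches the live site.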

Implications for AI Browser Automation

With even the best model failing about two-thirds of real-world tasks, ClawBench is more than a metric; it is a wake-up call. The future of AI agents depends on robustness, not just speed or scale. Developers must prioritize contextual understanding, error recovery, and safety over raw performance metrics. ClawBench's open dataset and tools (available on Hugging Face and GitHub) give researchers a basis for building more reliable AI browser automation systems.

The ClawBench dataset, now publicly available on Hugging Face, includes full task descriptions, annotated traces, and replay data for researchers worldwide. The team behind the benchmark, from the Natural and Artificial Intelligence Lab, has also released open-source tools via GitHub and PyPI, enabling others to reproduce results and build upon their methodology. This transparency is critical as the AI community races to develop truly autonomous agents capable of operating in uncontrolled digital environments.
