[r/ML] ClawBench: Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, best model at 33.3% [R]

Summary

ClawBench is a new benchmark that evaluates AI browser agents on 153 real-world tasks across 144 live websites rather than in synthetic test environments. Even the best-performing model, Claude Sonnet 4.6, achieved only a 33.3% success rate, while GLM-5, a text-only model, reached a surprisingly strong 24.2%. The results point to significant limitations in current AI agents on everyday online tasks, and task difficulty varies widely across categories.

Editorial note

AI Dose summarizes public reporting and links to original sources when they are available. Review the Editorial Policy, Disclaimer, or Contact page if you need to flag a correction or understand how this site handles sources.
