
[r/ML] ClawBench: Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, best model at 33.3% [R]

Impact: 7/10

Summary

ClawBench is a new benchmark that evaluates AI browser agents on 153 real-world tasks across 144 live websites, moving beyond synthetic test environments. Even the best model, Claude Sonnet 4.6, achieved only a 33.3% success rate, while GLM-5, a text-only model, was surprisingly strong at 24.2%. The results highlight significant limitations in current AI agents on everyday online tasks, with success rates varying widely across task categories.
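As a quick sanity check on the headline number, here is a minimal sketch of how a per-model success rate over a 153-task benchmark is tallied. The function name and the pass/fail split are illustrative assumptions, not details from ClawBench itself:

```python
def success_rate(results):
    """results: list of booleans, one per task (True = task completed)."""
    return sum(results) / len(results)

# Hypothetical outcome list: 51 of 153 tasks passing corresponds to
# roughly the reported 33.3% for the top model.
results = [True] * 51 + [False] * 102
print(f"{success_rate(results):.1%}")  # -> 33.3%
```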

Editorial note

AI Dose summarizes public reporting and links to original sources when they are available. Review the Editorial Policy, Disclaimer, or Contact page if you need to flag a correction or understand how this site handles sources.

Continue Reading

Explore related coverage about community news and adjacent AI developments: [r/ML] [D] MYTHOS-INVERSION STRUCTURAL AUDIT, [r/LocalLLaMA] karpathy / autoresearch, [r/ML] You can decompose models into a graph database [N], [r/ML] KIV: 1M token context window on a RTX 4070 (12GB VRAM), no retraining, drop-in HuggingFace cache replacement - Works with any model that uses DynamicCache [P].

