Poster in Workshop: Agentic Markets Workshop
WebCanvas: Benchmarking Web Agents in Online Environments
Yichen Pan · Dehan Kong · Sida Zhou · Cheng Cui · Yifei Leng · Bing Jiang · Hangyu Liu · Yanni Shawn · Shuyan Zhou · Sherry Tongshuang Wu · Zhengyang Wu
For web agents to be practically useful, they must generalize to the ever-changing web environment: UI updates, page content updates, and so on. Unfortunately, most traditional benchmarks capture only a static snapshot of a web page. We introduce WebCanvas, an online evaluation framework for web agents designed to address the dynamic nature of web interactions. WebCanvas contains three main components supporting realistic assessment: (1) a key-node-based evaluation metric, which stably captures the critical actions or states necessary for task completion while disregarding noise from insignificant events or changed web elements; (2) a benchmark dataset called Mind2Web-Live, a refined version of the original, static Mind2Web dataset, containing 542 tasks with 2,439 intermediate evaluation states; (3) lightweight, generalizable annotation tools and testing pipelines, which allow us to maintain a high-quality, up-to-date dataset and automatically detect shifts in live action sequences. Despite these advances, the best-performing model achieves only a 23.1% task success rate, highlighting substantial room for improvement in future work.
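To make the key-node idea concrete, here is a minimal sketch of how such a metric might score a trajectory: credit is given for reaching each predefined key node in order, while incidental steps (scrolls, retries, cosmetic UI changes) are skipped. This is an illustration under assumed data shapes, not the WebCanvas implementation; the names KeyNode, key_node_score, and the step dictionaries are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class KeyNode:
    """A critical intermediate state a trajectory must reach.

    `predicate` checks one observed step (e.g., a URL or action match).
    All names here are illustrative, not the WebCanvas API.
    """
    name: str
    predicate: Callable[[dict], bool]

def key_node_score(trajectory: list[dict], key_nodes: list[KeyNode]) -> float:
    """Fraction of key nodes matched, in order, anywhere in the trajectory.

    Steps that match no key node are simply skipped, so the score is
    robust to noise from insignificant events or changed page elements.
    """
    idx = 0
    for step in trajectory:
        if idx >= len(key_nodes):
            break
        if key_nodes[idx].predicate(step):
            idx += 1
    return idx / len(key_nodes) if key_nodes else 1.0

if __name__ == "__main__":
    # Toy task: "search for flights on example.com and open the results page".
    nodes = [
        KeyNode("on_search_page", lambda s: "example.com/search" in s.get("url", "")),
        KeyNode("typed_query", lambda s: s.get("action") == "type" and s.get("value") == "flights"),
        KeyNode("results_loaded", lambda s: "example.com/results" in s.get("url", "")),
    ]
    trajectory = [
        {"url": "example.com", "action": "click"},
        {"url": "example.com/search", "action": "load"},
        {"url": "example.com/search", "action": "scroll"},  # noise: ignored
        {"url": "example.com/search", "action": "type", "value": "flights"},
        {"url": "example.com/results", "action": "load"},
    ]
    print(f"key-node score: {key_node_score(trajectory, nodes):.2f}")  # 1.00
```

Because credit accrues per key node rather than per exact action sequence, a scheme like this yields partial scores for incomplete runs and remains stable when the surrounding page layout changes.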