CyberCycle: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities
Abstract
AI has the potential to transform cybersecurity by enabling systems that autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope and fail to capture the end-to-end lifecycle of real-world vulnerability discovery and remediation. To address this gap, we propose CyberCycle, a large-scale, realistic benchmark that evaluates AI agents across the full lifecycle of vulnerability discovery, proof-of-concept (PoC) generation, and patch generation. CyberCycle is both comprehensive and scalable: we build an automated, agent-enhanced pipeline that transforms open-source vulnerability data into realistic evaluation environments. The benchmark currently comprises 615 real-world vulnerabilities spanning 120 open-source projects.