OpenAI and Paradigm announce EVMbench, a benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities
Making smart contracts safer by evaluating AI agents' ability to detect, patch, and exploit vulnerabilities in blockchain environments.
EVMbench measures the ability of agents to detect, patch, and exploit smart contract vulnerabilities. Opus 4.6 getting mogged by GPT-5.2 and GPT-5.3. Although its detection accuracy is technically higher, it's precision is much lower. (Opus is going shizo) [image]
new collab from @paradigm and @OpenAI: evmbench is a benchmark and agent harness for exploiting smart contract bugs a few months ago, the best models found <20% of critical, fund-draining @Code4rena bugs in our benchmark. today they find > 70% [video]
Introducing EVMbench—a new benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities. https://openai.com/...