OpenAI and Paradigm announce EVMbench, a benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities

Making smart contracts safer by evaluating AI agents' ability to detect, patch, and exploit vulnerabilities in blockchain environments.

OpenAI 2026-02-19

Context & Ripple Effects

EVMbench follows evidence that frontier models could collaboratively develop exploits against contracts in the SCONE-bench smart-contract evaluation. It gives OpenAI and Paradigm a common evaluation target spanning discovery, exploitation, and remediation rather than treating those capabilities separately.

The release also fits OpenAI’s earlier work on measuring AI-system monitorability, extending the emphasis on evaluation into a security domain where an agent’s harmful and defensive abilities are tightly coupled.

First-order effects

Smart-contract security teams and AI developers gain a named benchmark for testing whether agents can identify, reproduce, and fix high-severity EVM flaws.
EVMbench makes exploit capability an explicit part of security-agent assessment, alongside detection and patching, raising the bar for claims that an agent can secure contracts.

Second-order effects

Competing model providers, audit firms, and agent-tool builders can be pressed to report comparable performance on a shared task set rather than relying on isolated demonstrations.
Because the benchmark evaluates both attack and remediation workflows, users will need to distinguish a strong benchmark score from readiness for safe operational deployment; validation and access controls become part of the product question.

Third-order effects

If adopted beyond its creators, task-specific benchmarks such as EVMbench could make security-agent competition more measurable and shift differentiation toward reliable remediation, not just vulnerability finding.
The benchmark underscores a durable dual-use tension: the same agentic capabilities that improve audits can lower the effort to develop exploits, making evaluation governance as consequential as model performance.

The trend: AI security is moving from general coding evaluations toward domain-specific, end-to-end benchmarks that test agents across attack, verification, and repair workflows.

Discussion

@scaling01 @scaling01 on x
EVMbench measures the ability of agents to detect, patch, and exploit smart contract vulnerabilities. Opus 4.6 getting mogged by GPT-5.2 and GPT-5.3. Although its detection accuracy is technically higher, its precision is much lower. (Opus is going shizo) [image]
@gdb Greg Brockman on x
measuring agentic security capabilities with smart contracts:
@0xalpo Alpin Yukseloglu on x
new collab from @paradigm and @OpenAI: evmbench is a benchmark and agent harness for exploiting smart contract bugs. a few months ago, the best models found <20% of critical, fund-draining @Code4rena bugs in our benchmark. today they find >70% [video]
@openai @openai on x
Introducing EVMbench—a new benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities. https://openai.com/...
@antoniogm Antonio García Martínez on x
Finally a non-bullshit crypto/AI crossover.
r/singularity on reddit
OpenAI introduces EVMbench, new Benchmark to test AI Agents
r/OpenAI on reddit
OpenAI: Introducing EVMbench, a new benchmark

Chronicles