OpenAI and Paradigm announce EVMbench, a benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities

Making smart contracts safer by evaluating AI agents' ability to detect, patch, and exploit vulnerabilities in blockchain environments.

OpenAI 2026-02-18

Context & Ripple Effects

EVMbench follows evidence from SCONE-bench testing that produced real smart-contract exploits, which made agent capability in this domain a concrete security concern rather than a purely coding-quality question. It also extends OpenAI's work on evaluating whether AI reasoning can be monitored, applying it to a task where successful outputs can directly affect deployed code.

The benchmark evaluates the full security loop (finding flaws, demonstrating their impact, and repairing them) rather than treating vulnerability detection as a standalone capability.

First-order effects

AI-agent developers and smart-contract security teams gain a common test for comparing performance on high-severity vulnerability discovery, exploitation, and patching.
The benchmark makes trade-offs across offensive validation and defensive remediation more visible, giving evaluators a basis to test whether an agent can complete the entire workflow.

Second-order effects

Security-tool vendors and model providers face pressure to substantiate smart-contract security claims against a shared evaluation rather than isolated demonstrations.
For blockchain teams, stronger evidence of agent capability could change how they assess automated review tools alongside existing audit processes; the benchmark's results will determine how useful that comparison is.

Third-order effects

If such benchmarks are adopted, AI security competition is likely to shift from general coding ability toward measurable, domain-specific assurance across detection, validation, and remediation.
The benchmark's dual-use structure also underscores a growing need to govern access to, and deployment of, agents that can both identify and exploit vulnerabilities, not merely flag them.

The trend: AI cybersecurity is moving toward operational benchmarks that measure agents on complete, high-consequence remediation workflows rather than single-task code generation.
