OpenAI and Paradigm have launched EVMbench, a benchmark that evaluates how well AI agents can detect, patch, and exploit vulnerabilities in Ethereum smart contracts, which collectively secure over $100 billion in crypto assets. EVMbench is built on 120 vulnerabilities drawn from 40 security audits (including several from the Tempo blockchain) and will include scenarios involving payment-oriented smart contract code, reflecting the expected growth of agentic stablecoin transactions.
How does EVMbench evaluate an AI agent’s capabilities?
EVMbench measures each AI agent’s performance in three modes:
- Detect (audit contracts to identify their vulnerabilities)
- Patch (modify contracts so that the vulnerabilities are removed while the intended functionality is fully preserved)
- Exploit (carry out a liquidity-draining attack on a smart contract in a sandboxed environment)
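
As a rough illustration of what "success" could mean in the patch and exploit modes, the sketch below encodes two pass/fail checks. It is a minimal sketch under stated assumptions, not EVMbench's actual grader: the names PatchOutcome, patch_is_accepted, exploit_is_accepted, and min_gain_wei are hypothetical, and the real benchmark may apply different or stricter criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchOutcome:
    """Hypothetical record of what happened after an agent's patch was applied."""
    functional_tests_passed: bool  # the contract's intended behaviour still works
    exploit_still_succeeds: bool   # the known attack was replayed against the patch


def patch_is_accepted(outcome: PatchOutcome) -> bool:
    # Assumed rule: a patch counts only if it keeps functionality intact
    # AND the known attack no longer works.
    return outcome.functional_tests_passed and not outcome.exploit_still_succeeds


def exploit_is_accepted(balance_before_wei: int, balance_after_wei: int,
                        min_gain_wei: int = 1) -> bool:
    # Assumed rule: an exploit counts if the attacker's balance in the sandbox
    # grew by at least min_gain_wei (a hypothetical threshold).
    return balance_after_wei - balance_before_wei >= min_gain_wei
```

Under these assumptions, a patch that blocks the attack but breaks an existing function is rejected, which is one way to express the "functionality preserved" requirement of patch mode.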

Each test runs in a Rust harness built on Anvil, a local Ethereum node, so that agents are evaluated deterministically and reproducibly in isolated environments rather than on live networks.
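
To make the idea of a reproducible sandbox concrete, the sketch below builds a command line for a local Anvil node with a pinned chain ID, genesis timestamp, and account mnemonic. This is an assumption-laden illustration in Python, not the EVMbench harness (which is written in Rust); the function name anvil_command and the chosen default values are hypothetical, and the article does not say how the real harness configures Anvil.

```python
from typing import List

# Anvil's well-known default development mnemonic; pinning it explicitly keeps
# the funded test accounts identical from run to run.
DEV_MNEMONIC = "test test test test test test test test test test test junk"


def anvil_command(port: int,
                  chain_id: int = 31337,
                  genesis_timestamp: int = 1_700_000_000,
                  mnemonic: str = DEV_MNEMONIC) -> List[str]:
    """Build an argv list for a local Anvil node with fixed starting state.

    All default values here are illustrative assumptions.
    """
    return [
        "anvil",
        "--port", str(port),
        "--chain-id", str(chain_id),
        "--timestamp", str(genesis_timestamp),  # fixed genesis block time
        "--mnemonic", mnemonic,                 # fixed set of dev accounts
    ]
```

The returned list could be passed to subprocess.Popen to start one isolated node per task, with a different port for each, so that no run touches a live network.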

Frontier AI models perform significantly better than those measured six months ago. For example, GPT-5.3-Codex averages 72.2% on exploit tasks, while the earlier GPT-5 reached only 31.9% on the same tasks. Detect and patch modes remain more challenging: agents often stop auditing after finding the first vulnerability, or struggle to remove a vulnerability while keeping the contract’s functionality fully intact.
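
The cost of stopping an audit early is easy to see if detect mode is scored by recall, meaning the fraction of known vulnerabilities the agent reports. The sketch below assumes that scoring rule for illustration; the function detect_recall and the finding IDs are hypothetical and are not taken from the benchmark.

```python
from typing import Set


def detect_recall(ground_truth: Set[str], reported: Set[str]) -> float:
    """Fraction of known vulnerabilities that the agent reported."""
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one finding")
    return len(ground_truth & reported) / len(ground_truth)


# Hypothetical audit with three known findings.
known = {"H-01", "H-02", "H-03"}

# An agent that stops after its first finding covers one third of them.
early_stop_score = detect_recall(known, {"H-01"})

# An agent that keeps auditing and reports all three covers everything.
full_audit_score = detect_recall(known, {"H-01", "H-02", "H-03"})
```

Under this assumed rule, the early-stopping agent scores one third and the thorough agent scores 1.0, even though both found a real vulnerability.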
Significance of Crypto Security
EVMbench addresses both sides of AI in cybersecurity: tracking emerging threats while encouraging defensive applications. As part of this initiative, OpenAI recently committed $10M in API credits through its Cybersecurity Grant Program to support defensive security research, and it is expanding the private beta of Aardvark, its security research agent.