Artificial Intelligence & Cybersecurity

OpenAI Launches EVMbench: A New Framework to Detect and Exploit Blockchain Vulnerabilities

OpenAI has collaborated with crypto investment firm Paradigm to release EVMbench, a new benchmark designed to evaluate how artificial intelligence agents interact with smart contract security.

Smart contracts currently secure over $100 billion in open-source crypto assets, so the ability of AI to read, write, and audit that code is becoming critical to financial infrastructure.

EVMbench Capabilities and Methodology

The EVMbench framework is built on a dataset of 120 curated high-severity vulnerabilities sourced from 40 different audits and open code competitions.

It also includes specific vulnerability scenarios derived from the security audit of the Tempo blockchain.

To ensure safety and reproducibility, the system utilizes a Rust-based harness that restricts unsafe RPC methods and runs all exploit tasks in an isolated, local Anvil environment rather than on live networks.

This allows for rigorous testing without risking actual assets or network stability.
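The idea of restricting unsafe RPC methods can be illustrated with a minimal sketch. The article does not list which methods the harness blocks, and the real harness is written in Rust, so the prefix list and the function names below (`BLOCKED_PREFIXES`, `is_allowed`, `filter_request`) are assumptions chosen for illustration only. The sketch rejects node-control namespaces such as `anvil_` and `evm_`, which on a local Anvil node would let an agent set balances or rewind state directly instead of exploiting the contract.

```python
# Hypothetical deny-list: the article does not say which methods are blocked.
# These prefixes cover node-control namespaces exposed by local dev nodes.
BLOCKED_PREFIXES = ("anvil_", "hardhat_", "evm_", "debug_")


def is_allowed(method: str) -> bool:
    """Return True if a JSON-RPC method name may be forwarded to the node."""
    return not method.startswith(BLOCKED_PREFIXES)


def filter_request(request: dict) -> dict:
    """Decide whether to forward a single JSON-RPC request or reject it."""
    method = str(request.get("method", ""))
    if is_allowed(method):
        return {"forward": True, "request": request}
    # JSON-RPC 2.0 reserves error code -32601 for "Method not found".
    return {
        "forward": False,
        "response": {
            "jsonrpc": "2.0",
            "id": request.get("id"),
            "error": {"code": -32601, "message": f"method {method} is disabled"},
        },
    }
```

Under these assumptions, an ordinary call such as `eth_call` passes through, while `anvil_setBalance` is answered with an error and never reaches the node.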

The framework evaluates agents across three distinct capability modes designed to mimic real-world security tasks.

The Detect mode challenges agents to audit a smart contract repository and identify ground-truth vulnerabilities based on historical data.
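Recall, the metric used for Detect mode, is the share of known vulnerabilities that the agent actually reports. A rough sketch follows, assuming each finding has already been matched to a ground-truth identifier; the matching step itself is not described in the article, and the identifiers `H-01` through `H-04` are made up for the example.

```python
def detect_recall(ground_truth: set[str], reported: set[str]) -> float:
    """Fraction of ground-truth vulnerabilities present in the agent's report."""
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one vulnerability")
    # Findings outside the ground truth do not raise or lower recall.
    return len(ground_truth & reported) / len(ground_truth)


# Hypothetical audit: four known issues, the agent reports two of them
# plus one finding that is not in the ground truth.
known = {"H-01", "H-02", "H-03", "H-04"}
found = {"H-02", "H-04", "X-99"}
print(detect_recall(known, found))  # 0.5
```

Because recall ignores extra findings, an agent is not penalized here for reporting issues outside the ground-truth set.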

EVMbench Capability Modes

| Capability Mode | Description | Verification Method |
|---|---|---|
| Detect | Agents audit repositories to find known vulnerabilities. | Scored on recall of ground-truth vulnerabilities and audit rewards. |
| Patch | Agents modify contracts to remove exploits while keeping functionality. | Verified through automated tests to ensure the exploit is gone and code compiles. |
| Exploit | Agents attempt to drain funds from a deployed contract. | Graded programmatically via transaction replay on a sandboxed blockchain. |

The Patch mode requires agents to fix the identified issues without breaking the contract’s intended functionality or causing compilation errors.
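The acceptance conditions for a patch can be written as a single predicate. This is a sketch under the assumption that the three checks are run separately and recorded as booleans; the `PatchResult` type and its field names are hypothetical and do not come from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class PatchResult:
    """Outcome of the automated checks for one patched contract (hypothetical)."""
    compiles: bool               # the patched code builds without errors
    functional_tests_pass: bool  # the contract's intended behavior is preserved
    exploit_succeeds: bool       # the original exploit still works after the patch


def patch_accepted(result: PatchResult) -> bool:
    """A patch counts only if it builds, keeps functionality, and blocks the exploit."""
    return (
        result.compiles
        and result.functional_tests_pass
        and not result.exploit_succeeds
    )
```

A patch that stops the exploit by breaking the contract's normal behavior fails the second condition, which is the failure the article describes.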

Finally, the Exploit mode tests an agent’s ability to execute end-to-end fund-draining attacks in a sandboxed environment, providing a clear metric of offensive capability that defenders must guard against.

Model Performance and Safety Initiatives

The release of EVMbench highlights significant progress in AI model capabilities regarding cybersecurity tasks.

In the Exploit mode evaluation, OpenAI’s GPT-5.3-Codex achieved a success rate of 72.2 percent. This is a substantial increase over the GPT-5 model, which scored 31.9 percent just over six months ago.
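To put the two reported scores side by side, the gap can be expressed both in percentage points and as a ratio. The short calculation below uses only the two figures quoted above; the variable names are illustrative.

```python
gpt_5_3_codex = 72.2  # Exploit mode success rate, percent (from the article)
gpt_5 = 31.9          # Exploit mode success rate, percent (from the article)

# Absolute gain in percentage points and relative gain as a multiple.
point_gain = round(gpt_5_3_codex - gpt_5, 1)
ratio = round(gpt_5_3_codex / gpt_5, 2)

print(point_gain)  # 40.3
print(ratio)       # 2.26
```

In other words, the newer model's reported success rate is about 40 percentage points higher, or a little more than twice the earlier figure.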

Despite these gains in offensive testing, the report notes that performance remains weaker in detection and patching tasks.

Agents often struggle to maintain full functionality while removing subtle bugs, indicating that human oversight remains essential in the auditing process.

Recognizing the dual-use nature of cybersecurity tools, OpenAI is emphasizing an evidence-based approach to aid defenders.

This includes expanding its security research agent, Aardvark, and committing $10 million in API credits through its Cybersecurity Grant Program.

These initiatives aim to accelerate cyber defense for open-source software and critical infrastructure.

EVMbench has limitations, such as not supporting complex timing mechanics or mainnet forks, but it represents a significant step toward standardizing how AI interacts with blockchain security.
