Can AI detect software backdoors?

33 tasks | 3 categories | 16 models | 42% pass rate | Updated 9 Feb 2026 | QuesmaOrg/BinaryAudit

Can AI agents detect malicious backdoors hidden in compiled binaries? We tested leading models on reverse engineering tasks using tools like Ghidra and Radare2 to see if they can identify backdoors, timebombs, and other security threats in real software.
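Before reaching for a decompiler, agents typically start with cheap triage such as string extraction (what `strings` or radare2's `rabin2 -z` would print). The sketch below is a purely illustrative first-pass scan, not the benchmark's methodology: the patterns and thresholds are our own assumptions about what might look suspicious in a network daemon.

```python
import re

# Patterns a first-pass triage might flag; purely illustrative,
# not checks used by the benchmark itself.
SUSPICIOUS_PATTERNS = [
    rb"/bin/(?:ba)?sh",      # embedded shell invocation
    rb"backdoor",            # lazy naming in the binary
    rb"0\.0\.0\.0:\d{2,5}",  # hardcoded listen address
]

def extract_strings(blob: bytes, min_len: int = 4):
    """Yield printable ASCII runs, like the `strings` utility."""
    for match in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, blob):
        yield match.group()

def triage(blob: bytes) -> list[bytes]:
    """Return extracted strings that match a suspicious pattern."""
    return [
        s
        for s in extract_strings(blob)
        if any(re.search(p, s) for p in SUSPICIOUS_PATTERNS)
    ]
```

A hit only justifies a closer look in a disassembler; a clean result proves nothing, since real backdoors rarely announce themselves in plaintext.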

Read our blog post introducing BinaryAudit: We hid backdoors in binaries — Opus 4.6 found 49% of them

Model ranking

Models ranked by their success rate in each task category. The benchmark tests three distinct capabilities: detecting malicious code in binaries, using reverse engineering tools, and avoiding false positives on clean code.

Detect backdoors in compiled binaries

| Rank | Model | Pass rate |
|------|-------|-----------|
| 1 | Anthropic claude-opus-4.6 | 49% |
| 4 | Anthropic claude-opus-4.5 | 37% |
| 5 | Anthropic claude-sonnet-4.5 | 26% |
| 6 | Kimi kimi-k2.5 | 25% |
| 7 | Z.ai glm-4.7 | 25% |
| 8 | Google gemini-2.5-pro | 21% |
| 9 | OpenAI gpt-5.2 | 18% |
| 10 | Anthropic claude-sonnet-4 | 14% |
| 11 | DeepSeek deepseek-v3.2 | 12% |
| 12 | Anthropic claude-haiku-4.5 | 11% |
| 13 | OpenAI gpt-5.2-codex | 9% |
| 14 | Grok grok-4 | 9% |
| 15 | Grok grok-4.1-fast | 7% |
| 16 | OpenAI gpt-5 | 2% |

Security analysis tasks

Tasks cover three categories: binary analysis for backdoor detection, tooling usage for decompilation and analysis, and false-positive tests, in which no backdoor is present and models should correctly report that none exists rather than raise a false alarm. Target software is real-world network infrastructure into which we artificially inserted backdoors: web servers, DNS servers, SSH servers, proxies, and load balancers.

- Binary Analysis
- Tooling
- Verification

View all tasks →

Model-task matrix


A detailed view of which tasks each model solved or failed. This helps identify models that handle specific security analysis patterns well, even if their overall score is lower.

Detection vs false alarms

Pass rate (identifying a backdoor and pointing to its location in the binary) plotted against false positive rate (how often a model incorrectly flags clean code). Models in the upper left detect more backdoors while raising fewer false alarms.

Pareto frontier

| Model | Pass rate | False positive rate |
|-------|-----------|---------------------|
| OpenAI gpt-5.2 | 18% | 0% |
| Anthropic claude-opus-4.6 | 49% | 22% |

Cost efficiency

We map total API cost against success rate for Binary Analysis tasks. The Pareto frontier (blue line) highlights the most cost-efficient models for a given performance level.

Speed vs quality

This chart compares accuracy against average generation time for Binary Analysis tasks, helping identify models that balance solution quality with response latency.

Pareto frontier

| Model | Pass rate | Time |
|-------|-----------|------|
| Google gemini-3-pro-preview | 44% | 5m |
| Anthropic claude-opus-4.6 | 49% | 54m |

Performance over time

We plot model pass rates against their release dates to track performance changes over time. This timeline shows how capability on Binary Analysis tasks compares across model generations.

Pareto frontier

| Model | Pass rate | Released |
|-------|-----------|----------|
| Google gemini-2.5-pro | 21% | Mar 25 |
| Anthropic claude-sonnet-4.5 | 26% | Sep 25 |
| Google gemini-3-pro-preview | 44% | Nov 25 |
| Anthropic claude-opus-4.6 | 49% | Feb 26 |

Run it yourself

For reproducibility, we open-sourced the full benchmark at QuesmaOrg/BinaryAudit. It is built on the Harbor framework, so you can verify our findings and test new models and agents; see our post Migrating CompileBench to Harbor: standardizing AI agent evals.

uv tool install harbor
git clone [email protected]:QuesmaOrg/BinaryAudit.git
cd BinaryAudit

We welcome contributions of new tasks. See the repository for details.


All product names, logos, and brands (™/®) are the property of their respective owners; they're used here solely for identification and comparison, and their use does not imply affiliation, endorsement, or sponsorship.