Can AI agents detect malicious backdoors hidden in compiled binaries? We tested leading models on reverse engineering tasks using tools like Ghidra and Radare2 to see if they can identify backdoors, timebombs, and other security threats in real software.
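To make the tooling side concrete, here is a minimal sketch of the kind of Radare2-driven triage an agent can perform, using the r2pipe Python bindings. The binary path, suspicious-string patterns, and function-name heuristics are illustrative assumptions, not the benchmark's actual harness.

```python
import r2pipe

# Illustrative only: the path and heuristics below are assumptions,
# not the benchmark harness.
BINARY = "./dns-server"  # hypothetical target binary
SUSPICIOUS = ["backdoor", "magic", "debug_shell", "/bin/sh", "system"]

r2 = r2pipe.open(BINARY)
r2.cmd("aaa")  # run radare2's full auto-analysis

# List all strings in the binary and flag ones matching suspicious patterns.
for s in r2.cmdj("izzj") or []:
    text = s.get("string", "")
    if any(p in text.lower() for p in SUSPICIOUS):
        print(f"suspicious string at {hex(s['vaddr'])}: {text!r}")

# List recovered functions so a follow-up pass can decompile the interesting ones.
for fn in r2.cmdj("aflj") or []:
    if "auth" in fn["name"] or "login" in fn["name"]:
        print(f"candidate function {fn['name']} at {hex(fn['offset'])}")

r2.quit()
```

String and function triage like this is only a first pass; the agents in our tasks still have to decompile the flagged functions and point to the actual backdoor logic.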
Models ranked by their success rate in each task category. The benchmark tests three distinct capabilities: detecting malicious code in binaries, using reverse engineering tools, and avoiding false positives on clean code.
Detect backdoors in compiled binaries
Tasks cover three categories: binary analysis for backdoor detection, tooling usage for decompilation and analysis, and false positive tests on clean code, where the correct answer is to report that no backdoor is present. Target software is real-world network infrastructure into which we artificially added backdoors: web servers, DNS servers, SSH servers, proxies, and load balancers.
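The injected backdoors follow patterns like the one sketched below: a hard-coded credential that bypasses normal authentication. This is an illustrative reconstruction in Python of the general pattern, not code taken from any benchmark task; the actual targets are compiled binaries, where the same logic hides inside native authentication routines.

```python
import hashlib
import hmac

# Illustrative reconstruction of a typical injected backdoor pattern;
# not taken from any benchmark task.
MAGIC_TOKEN = "X-Debug-7f3a"  # hypothetical hard-coded bypass value

def hash_password(username: str, password: str) -> str:
    # Simplified stand-in for a real password hashing scheme.
    return hashlib.sha256(f"{username}:{password}".encode()).hexdigest()

def check_credentials(username: str, password: str, stored_hash: str) -> bool:
    # Backdoor: one specific password unlocks any account, skipping the real check.
    if password == MAGIC_TOKEN:
        return True
    # Legitimate path: constant-time comparison against the stored hash.
    return hmac.compare_digest(hash_password(username, password), stored_hash)
```

In a stripped binary, this amounts to little more than an extra string and one extra comparison branch inside the authentication routine, which is what makes the detection tasks nontrivial.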
A detailed view of which tasks each model solved or failed. This helps identify models that handle specific security analysis patterns well, even if their overall score is lower.
Pass rate (identifying a backdoor and pointing to its location in the binary) plotted against false positive rate (how often a model incorrectly flags clean code). Models in the upper left detect more backdoors while raising fewer false alarms.
Pareto frontier
| Model | Pass Rate | False Positive Rate |
|---|---|---|
| gpt-5.2 | 18% | 0% |
| claude-opus-4.6 | 49% | 22% |
We map total API cost against success rate for Binary Analysis tasks. The Pareto frontier (blue line) highlights the most cost-efficient models for a given performance level.
Pareto frontier
| Model | Pass Rate | Cost |
|---|---|---|
| grok-4.1-fast | 7% | $1 |
| deepseek-v3.2 | 12% | $5 |
| gemini-3-flash-preview | 37% | $18 |
| gemini-3-pro-preview | 44% | $28 |
| claude-opus-4.6 | 49% | $286 |
This chart compares accuracy against average generation time for Binary Analysis tasks, helping identify models that balance solution quality with response latency.
Pareto frontier
| Model | Pass Rate | Time |
|---|---|---|
| gemini-3-pro-preview | 44% | 5m |
| claude-opus-4.6 | 49% | 54m |
We plot model pass rates against their release dates to track performance changes over time. This timeline shows how capability on Binary Analysis tasks compares across model generations.
Pareto frontier
| Model | Pass Rate | Released |
|---|---|---|
| gemini-2.5-pro | 21% | Mar 2025 |
| claude-sonnet-4.5 | 26% | Sep 2025 |
| gemini-3-pro-preview | 44% | Nov 2025 |
| claude-opus-4.6 | 49% | Feb 2026 |
For reproducibility, we open-sourced the full benchmark at QuesmaOrg/BinaryAudit. It is built on the Harbor framework, so you can verify our findings and test new models and agents; see our post Migrating CompileBench to Harbor: standardizing AI agent evals.
uv tool install harbor
git clone [email protected]:QuesmaOrg/BinaryAudit.git
cd BinaryAudit

We welcome contributions of new tasks. See the repository for details.
All product names, logos, and brands (™/®) are the property of their respective owners; they're used here solely for identification and comparison, and their use does not imply affiliation, endorsement, or sponsorship.