Can AI agents detect malicious backdoors hidden in compiled binaries? We tested leading models on reverse engineering tasks using tools like Ghidra and Radare2 to see if they can identify backdoors, timebombs, and other security threats in real software.
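To make the tooling side concrete, here is a minimal sketch of the kind of Radare2-driven triage an agent can perform, using the r2pipe Python bindings. The binary path, suspicious-string patterns, and function-name heuristics are illustrative assumptions, not the benchmark's actual harness.

```python
import r2pipe

# Illustrative only: the path and heuristics below are assumptions,
# not the benchmark harness.
BINARY = "./dns-server"  # hypothetical target binary
SUSPICIOUS = ["backdoor", "magic", "debug_shell", "/bin/sh", "system"]

r2 = r2pipe.open(BINARY)
r2.cmd("aaa")  # run radare2's full auto-analysis

# List all strings in the binary and flag ones matching suspicious patterns.
for s in r2.cmdj("izzj") or []:
    text = s.get("string", "")
    if any(p in text.lower() for p in SUSPICIOUS):
        print(f"suspicious string at {hex(s['vaddr'])}: {text!r}")

# List recovered functions so a follow-up pass can decompile the interesting ones.
for fn in r2.cmdj("aflj") or []:
    if "auth" in fn["name"] or "login" in fn["name"]:
        print(f"candidate function {fn['name']} at {hex(fn['offset'])}")

r2.quit()
```

String and function triage like this is only a first pass; the agents in our tasks still have to decompile the flagged functions and point to the actual backdoor logic.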
Models ranked by their success rate in each task category. The benchmark tests three distinct capabilities: detecting malicious code in binaries, using reverse engineering tools, and avoiding false positives on clean code.
Detect backdoors in compiled binaries
Tasks cover three categories: binary analysis for backdoor detection, tooling usage for decompilation and analysis, and false positive tests on clean code, where the correct answer is to report that no backdoor is present. Target software is real-world network infrastructure into which we artificially added backdoors: web servers, DNS servers, SSH servers, proxies, and load balancers.
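The injected backdoors follow patterns like the one sketched below: a hard-coded credential that bypasses normal authentication. This is an illustrative reconstruction in Python of the general pattern, not code taken from any benchmark task; the actual targets are compiled binaries, where the same logic hides inside native authentication routines.

```python
import hashlib
import hmac

# Illustrative reconstruction of a typical injected backdoor pattern;
# not taken from any benchmark task.
MAGIC_TOKEN = "X-Debug-7f3a"  # hypothetical hard-coded bypass value

def hash_password(username: str, password: str) -> str:
    # Simplified stand-in for a real password hashing scheme.
    return hashlib.sha256(f"{username}:{password}".encode()).hexdigest()

def check_credentials(username: str, password: str, stored_hash: str) -> bool:
    # Backdoor: one specific password unlocks any account, skipping the real check.
    if password == MAGIC_TOKEN:
        return True
    # Legitimate path: constant-time comparison against the stored hash.
    return hmac.compare_digest(hash_password(username, password), stored_hash)
```

In a stripped binary, this amounts to little more than an extra string and one extra comparison branch inside the authentication routine, which is what makes the detection tasks nontrivial.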
A detailed view of which tasks each model solved or failed. This helps identify models that handle specific security analysis patterns well, even if their overall score is lower.
Pass rate (identifying a backdoor and pointing to its location in the binary) plotted against false positive rate (how often a model incorrectly flags clean code). Models in the upper left detect more backdoors while raising fewer false alarms.
Pareto frontier
| Model | Pass Rate | False Positive Rate |
|---|---|---|
| gpt-5.2 | 18% | 0% |
| claude-opus-4.6 | 49% | 22% |
We map total API cost against success rate for Binary Analysis tasks. The Pareto frontier (blue line) highlights the most cost-efficient models for a given performance level.
Pareto frontier
| Model | Pass Rate | Cost |
|---|---|---|
| grok-4.1-fast | 7% | $1 |
| deepseek-v3.2 | 12% | $5 |
| gemini-3-flash-preview | 37% | $18 |
| gemini-3-pro-preview | 44% | $28 |
| claude-opus-4.6 | 49% | $286 |
This chart compares accuracy against average generation time for Binary Analysis tasks, helping identify models that balance solution quality with response latency.
Pareto frontier
| Model | Pass Rate | Time |
|---|---|---|
| gemini-3-pro-preview | 44% | 5m |
| claude-opus-4.6 | 49% | 54m |
We plot model pass rates against their release dates to track performance changes over time. This timeline shows how capability on Binary Analysis tasks compares across model generations.
Pareto frontier
| Model | Pass Rate | Released |
|---|---|---|
| gemini-2.5-pro | 21% | Mar 2025 |
| claude-sonnet-4.5 | 26% | Sep 2025 |
| gemini-3-pro-preview | 44% | Nov 2025 |
| claude-opus-4.6 | 49% | Feb 2026 |
For reproducibility, we open-sourced the full benchmark at QuesmaOrg/BinaryAudit. It is built on the Harbor framework, so you can verify our findings and test new models and agents; see our post Migrating CompileBench to Harbor: standardizing AI agent evals.
uv tool install harbor
git clone [email protected]:QuesmaOrg/BinaryAudit.git
cd BinaryAudit

We welcome contributions of new tasks. See the repository for details.
All product names, logos, and brands (™/®) are the property of their respective owners; they're used here solely for identification and comparison, and their use does not imply affiliation, endorsement, or sponsorship.