Benchmarking OpenTelemetry: Can AI trace your failed login?
A lot of vendors pitch AI SRE. We tested 14 models across 11 programming languages; even the best struggle to instrument code with OpenTelemetry, the leading open-source observability standard.
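For context, here is a minimal sketch, using the OpenTelemetry Python SDK, of the kind of manual instrumentation such a task involves: wrapping a login attempt in a span and marking it as errored when authentication fails. The handle_login and check_credentials functions are hypothetical stand-ins for application code, not part of the benchmark.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.trace import StatusCode

# Wire up a tracer provider that prints finished spans to stdout.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("auth")


def check_credentials(username: str, password: str) -> bool:
    # Hypothetical stand-in for a real credential check.
    return False


def handle_login(username: str, password: str) -> bool:
    # Wrap the login attempt in a span so a failed login shows up as a trace.
    with tracer.start_as_current_span("handle_login") as span:
        span.set_attribute("enduser.id", username)
        ok = check_credentials(username, password)
        if not ok:
            # Mark the span as errored so backends can surface the failure.
            span.set_status(StatusCode.ERROR, "invalid credentials")
        return ok


handle_login("alice", "wrong-password")
```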
Making AI agents production-ready through independent evaluation and training for the AI agent ecosystem. Real-world complexity through simulation environments where agents face multi-hour tasks.
Large-scale RL datasets with tuned difficulty distributions. Cheat-proof reward functions. Teach skills scarce in public data (e.g. dependency hell, distributed system debugging).
Measure quality and uncover blind spots. Pick optimal models, tune prompts in a fast-changing world. Benchmark against competitors. Win deals and deliver on performance promises.
Independent verification of what actually works. Design processes based on real capabilities, not marketing hype. ROI-driven deployment decisions. Move from FOMO to measurable P&L impact.
Explore our research on AI agents, benchmarking, and evaluation
Prompts are specs, not code. That changes git workflows for vibe coding: how you track LLM prompts in GitHub repositories, write commit messages, and debug non-deterministic AI outputs.
AI reasoning models like DeepSeek-R1, agentic coding tools like Claude Code, and image generation with Nano Banana Pro are setting new standards for everyday software engineering.
The Quesma database gateway IP has been acquired by Hydrolix to ensure continued support.
Read the announcement.