Compare harnesses, not models: Blitzy vs GPT-5.4 on SWE-Bench Pro
An independent audit of agentic scaffolding and harnesses. We analyze how agent workflows, codebase documentation, and test verification affect performance relative to raw base models like GPT-5.4, Gemini 3.1 Pro, and Claude Code.