Terminal-Bench 2.0: same model, 24-point gap

Terminal-Bench 2.0 is a benchmark for terminal-based coding agents: real tasks in real terminals, not just patch generation. It gets interesting when you filter the leaderboard down to a single model.

When you look at agents all running Claude Opus 4.6, the spread is enormous:

Agent	Accuracy
ForgeCode	81.8%
Capy	75.3%
Terminus-KIRA	74.7%
Junie CLI (JetBrains)	71.0%
Droid (Factory)	69.9%
Claude Code (Anthropic)	58.0%

That’s a ~24-point gap (81.8% vs. 58.0%) between the best third-party harness and Anthropic’s own Claude Code, using the exact same model. The irony: Anthropic builds the best model but ships the lowest-scoring harness on this list.
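To make the arithmetic explicit, here is a minimal sketch that recomputes the gap from the table. The scores are copied from the list above; the variable names, and the relative-gap figure, are additions for illustration only.

```python
# Accuracy figures copied from the table above
# (all agents running Claude Opus 4.6 on Terminal-Bench 2.0).
scores = {
    "ForgeCode": 81.8,
    "Capy": 75.3,
    "Terminus-KIRA": 74.7,
    "Junie CLI (JetBrains)": 71.0,
    "Droid (Factory)": 69.9,
    "Claude Code (Anthropic)": 58.0,
}

best = max(scores, key=scores.get)    # highest-scoring harness
worst = min(scores, key=scores.get)   # lowest-scoring harness

# Absolute gap in percentage points, and the same gap relative to the lower score.
gap = round(scores[best] - scores[worst], 1)
relative = round(100 * gap / scores[worst], 1)

print(f"{best} vs {worst}: {gap} points ({relative}% relative)")
```

In relative terms, the top harness solves roughly 41% more tasks than the bottom one with the same model underneath.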

This is the strongest quantitative evidence yet for the “harness > model” thesis. The model is a commodity; the scaffolding around it — tool design, context management, edit formats, retry logic, task decomposition — determines whether it performs at 58% or 82%.

Three independent data points now converge on this:

1. The Hashline edit format showed that changing how the model references code lines dramatically reduces edit failures.
2. OpenAI’s Harness Engineering post showed that agent environment design matters more than model capability.
3. Terminal-Bench 2.0 puts hard numbers on the gap: same model, vastly different results.
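
To give a feel for the first point, here is a rough sketch of what a hash-anchored edit format can look like. It is not the actual Hashline implementation, whose exact tag format may differ. The function names (`tag`, `render`, `apply_edit`), the `number:hash|text` layout, and the 6-character SHA-1 prefix are all assumptions made for illustration. The idea is that the model names a line by a short content hash instead of retyping the text it wants to replace, and the harness rejects the edit if the file has changed underneath it.

```python
import hashlib


def tag(line: str) -> str:
    """Short content hash for one line (6 hex chars is an arbitrary choice)."""
    return hashlib.sha1(line.encode("utf-8")).hexdigest()[:6]


def render(lines: list[str]) -> str:
    """What the model would be shown: 'number:hash|text' for every line."""
    return "\n".join(f"{i}:{tag(text)}|{text}" for i, text in enumerate(lines, start=1))


def apply_edit(lines: list[str], anchor: str, new_text: str) -> list[str]:
    """Replace the line named by anchor ('number:hash'); reject stale anchors."""
    number, expected = anchor.split(":")
    index = int(number) - 1
    if not 0 <= index < len(lines) or tag(lines[index]) != expected:
        # The file no longer matches what the model saw, so fail loudly
        # instead of editing the wrong line.
        raise ValueError(f"stale anchor {anchor!r}: re-read the file")
    return lines[:index] + [new_text] + lines[index + 1:]


source = ["def area(r):", "    return 3.14 * r * r"]
print(render(source))

# The model refers to line 2 by number and hash, not by retyping its content.
anchor = f"2:{tag(source[1])}"
edited = apply_edit(source, anchor, "    return math.pi * r * r")
```

Reusing the same anchor on `edited` raises `ValueError`, because line 2 no longer hashes to the value the model saw. That check is the harness-side guard that turns a silent bad edit into a retry.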

If you’re building with AI agents, stop chasing the next model release. Invest in your harness.
