Claude Opus 4.8: Anthropic Bets on Honesty and Subagent Orchestration

Anthropic released Claude Opus 4.8 on May 28 as its most capable generally available model. Like Opus 4.7 six weeks ago, it is a point upgrade rather than a new generation, and the headline benchmark gains are modest. The more telling change is what Anthropic chose to lead with: not a bigger number, but a model that is measurably more honest about its own work.

Base pricing did not move: $5 per million input tokens and $25 per million output. The model is live on day one across Claude’s Pro, Max, Team, and Enterprise tiers, the Anthropic API (claude-opus-4-8), AWS, Google Cloud Vertex AI, and Microsoft Foundry.

The honesty pitch

The number Anthropic put at the front of the release: Opus 4.8 is roughly four times less likely than Opus 4.7 to let flaws in code it wrote pass unremarked. Early testers also report it is more likely to flag uncertainty about its own progress and less likely to make claims it can’t support.

This is a deliberate repositioning. For two years the frontier race has been scored on capability — can the model solve the harder problem — and Anthropic is now arguing that for agentic work, calibration matters as much as capability. A model that runs unsupervised for an hour and quietly ships a subtle bug is worse than a slower model that stops and says it isn’t sure. The failure mode of long-horizon agents isn’t that they can’t do the work; it’s that they do it wrong and report success. Opus 4.8 is tuned against exactly that.

Whether the four-times figure holds up outside Anthropic’s evals is the open question, but the direction is right. It is also the kind of improvement that doesn’t show up on a leaderboard, which may be why Anthropic led with it rather than burying it.

The benchmark picture

The capability gains are real but incremental. Anthropic’s own reported scores, Opus 4.7 to 4.8:

Capability (benchmark)	Opus 4.7	Opus 4.8
Agentic coding (SWE-bench Pro)	64.3%	69.2%
Agentic computer use (OSWorld-Verified)	82.8%	83.4%
Multidisciplinary reasoning w/ tools (HLE)	54.7%	57.9%
Agentic financial analysis (Finance Agent v2)	51.5%	53.9%
Knowledge work (GDPval-AA, Elo)	1753	1890

The strongest result is SWE-bench Pro, where 69.2% leads both GPT-5.5 and Gemini 3.1 Pro. The caveat, which Anthropic states plainly, is that GPT-5.5 still leads on raw terminal coding. The pattern from the 4.7 release holds: Opus is the strongest general-purpose engineering model, GPT is narrowly ahead on the terminal, and the gap depends on the workload. The bigger jump is the GDPval-AA knowledge-work score, up 137 Elo, which tracks the kind of professional-grade reasoning that enterprise buyers actually evaluate against.
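For a rough sense of what a 137-point Elo gap means, the sketch below converts it into an expected head-to-head preference rate. It assumes GDPval-AA uses the conventional logistic Elo formula with a 400-point scale, which the release does not confirm, so read the result as an order-of-magnitude guide rather than a reported figure. The function name is illustrative.

```python
def expected_score(elo_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated side under the standard Elo formula.

    Assumption: the benchmark uses the conventional 400-point logistic scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / scale))


# Scores quoted in the table above.
OPUS_4_7_ELO = 1753
OPUS_4_8_ELO = 1890

gap = OPUS_4_8_ELO - OPUS_4_7_ELO  # 137 points
win_rate = expected_score(gap)

if __name__ == "__main__":
    print(f"Elo gap: {gap}")
    print(f"Expected preference rate for Opus 4.8 over 4.7: {win_rate:.2f}")
```

Under that assumption, the gap works out to Opus 4.8 being preferred in roughly 69% of pairwise comparisons against Opus 4.7, which is a more tangible way to read the jump than the raw Elo numbers.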

Dynamic workflows: orchestration moves into the model

The feature with the longest reach is dynamic workflows, in research preview inside Claude Code. Instead of one agent grinding through a task linearly, Claude writes a JavaScript orchestration script that fans work out to parallel subagents — up to 16 running concurrently and 1,000 total per run.

The design detail that matters is where the intermediate results live. They stay in the script’s variables, not in Claude’s context window; only the final answer comes back to your session. That is the constraint that has capped multi-agent systems until now: every subagent’s output eats into the orchestrator’s context until it runs out of room. Offloading that state to a script is what makes a 1,000-subagent run tractable rather than a context-window bonfire. The use case Anthropic points to is codebase-scale work: migrations and refactors spanning hundreds of thousands of lines, the kind of job that was previously too big to hand to a single agent. OpenAI is chasing the same long-horizon goal from a different angle: its Codex update the same week pushed multi-day Goal Mode out of beta. Where Anthropic bets on parallelism inside one run, OpenAI bets on persistence across many.
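The fan-out pattern described above can be sketched in a few lines. This is not Anthropic’s implementation: the real scripts are JavaScript written by Claude, while this illustration uses Python, and run_subagent is a hypothetical stand-in for an actual subagent call. What it shows is the shape of the idea: a concurrency gate of 16, a hard cap of 1,000 tasks, and intermediate results held in a local variable so that only a short summary is returned.

```python
import asyncio

MAX_CONCURRENT = 16  # concurrency cap quoted in the article
MAX_TOTAL = 1000     # per-run subagent cap quoted in the article


async def run_subagent(task: str) -> str:
    """Hypothetical stand-in for a subagent; a real one would call a model."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"done:{task}"


async def orchestrate(tasks: list[str]) -> str:
    if len(tasks) > MAX_TOTAL:
        raise ValueError(f"{len(tasks)} tasks exceeds the {MAX_TOTAL}-subagent cap")

    gate = asyncio.Semaphore(MAX_CONCURRENT)
    in_flight = 0
    peak = 0

    async def bounded(task: str) -> str:
        nonlocal in_flight, peak
        async with gate:  # at most MAX_CONCURRENT subagents run at once
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await run_subagent(task)
            finally:
                in_flight -= 1

    # Intermediate results live only in this local variable,
    # not in the orchestrating model's context.
    results = await asyncio.gather(*(bounded(t) for t in tasks))

    # Only a compact summary goes back to the caller.
    completed = sum(r.startswith("done:") for r in results)
    return f"{completed}/{len(tasks)} subtasks completed (peak concurrency {peak})"


if __name__ == "__main__":
    files = [f"module_{i}.py" for i in range(40)]  # hypothetical migration targets
    print(asyncio.run(orchestrate(files)))
```

The design choice worth noticing is the return value: however many subagents run, the caller sees one short string, which is the property the article credits with keeping the orchestrator’s context from filling up.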

Paired with this is finer control over compute. An effort control lets you dial how hard the model works a given response, trading more tokens for higher quality, and a new ultracode setting combines the top reasoning tier with automatic workflow orchestration for the hardest coding problems. This is the natural extension of the xhigh tier and task budgets that shipped with Opus 4.7: Anthropic is steadily giving developers knobs to trade latency and cost against quality instead of guessing.

Fast mode got a lot cheaper

Quietly, the economics of fast mode changed. Opus 4.8 fast mode runs at $10 per million input tokens and $50 per million output, at roughly 2.5x the output speed of standard mode and, Anthropic says, identical quality. The number to anchor on is the comparison to the prior generation, where fast mode ran $30 input and $150 output per million tokens. Fast mode used to cost six times the base rate; now it costs twice. For anyone running latency-sensitive agentic loops in production, that repricing is more consequential than a couple of benchmark points.
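To make the repricing concrete, here is a small cost comparison using the per-million-token rates quoted above. The workload (2 million input tokens, 400,000 output tokens) is invented purely for illustration, and the class and constant names are not from any Anthropic SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Total USD cost for the given token counts."""
        return (input_tokens * self.input_per_mtok
                + output_tokens * self.output_per_mtok) / 1_000_000


STANDARD = Rate(5.0, 25.0)    # Opus 4.8 base rate, from the article
FAST_NEW = Rate(10.0, 50.0)   # Opus 4.8 fast mode, from the article
FAST_OLD = Rate(30.0, 150.0)  # prior-generation fast mode, from the article

# Hypothetical workload, chosen only to make the arithmetic easy to follow.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 400_000

standard_cost = STANDARD.cost(INPUT_TOKENS, OUTPUT_TOKENS)
fast_new_cost = FAST_NEW.cost(INPUT_TOKENS, OUTPUT_TOKENS)
fast_old_cost = FAST_OLD.cost(INPUT_TOKENS, OUTPUT_TOKENS)

if __name__ == "__main__":
    print(f"standard:      ${standard_cost:.2f}")
    print(f"fast (new):    ${fast_new_cost:.2f}  ({fast_new_cost / standard_cost:.0f}x base)")
    print(f"fast (old):    ${fast_old_cost:.2f}  ({fast_old_cost / standard_cost:.0f}x base)")
```

On this made-up workload the bill comes to $20 at the standard rate, $40 in the new fast mode, and $120 at the old fast-mode price. Because input and output rates scale by the same factor, the 2x and 6x multiples hold for any mix of input and output tokens.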

Mythos is weeks away

The release closes a loop we flagged back in April. Claude Mythos — the cybersecurity-specialized model Anthropic has said outperforms its generally available line — is currently in preview with a small number of organizations, and Anthropic now says it expects to bring “Mythos-class models” to all customers in the coming weeks.

That reframes Opus 4.8. It is the best model most teams can buy today, but Anthropic is openly signaling that a more capable tier is about to land. The work on Mythos and Project Glasswing was the first sign that Anthropic’s internal capability ceiling sat above its public models; the Opus 4.8 announcement is the first time the company has put a rough timeline on closing that gap. If Mythos-class capability ships broadly in June, the “generally available frontier” resets again, and the safeguards Opus 4.7 introduced against cyber misuse stop being precautionary and start being load-bearing.

What this means if you’re already on Opus

For teams running Opus 4.7 in production, the upgrade is a drop-in: same base price, better calibration, cheaper fast mode. The reasons to move are the honesty gains and the fast-mode repricing, not the benchmark deltas — a few points on SWE-bench Pro won’t change your architecture, but a model that flags its own uncertainty changes how much you can trust an unsupervised run.

The thing actually worth piloting is dynamic workflows, and only if you have a real codebase-scale task to point it at. A 1,000-subagent orchestration is not the tool for a feature branch; it’s the tool for the migration you’ve been deferring because no single agent could hold it. Test it on one bounded job, measure the cost and the correctness, and decide from there. The pattern that holds for any first AI proof of concept holds here too: prove it on one process before you rebuild your pipeline around it.

What ties the release together is a bet that the hard part of agentic AI is no longer raw capability but trust — whether you can hand the model a long job and walk away. The honesty gains, the subagent orchestration, and the cheaper fast mode all serve that bet more than they move the leaderboard. The complication is Mythos. Anthropic is telling customers a more capable tier lands within weeks, which means Opus 4.8’s real test isn’t GPT-5.5; it’s whether Anthropic can ship that capability jump without giving up the calibration it just spent a release earning.