Anthropic released Claude Opus 4.6 today, the latest flagship in its Claude model family. The headline changes are adaptive thinking — a replacement for extended thinking that lets the model dynamically decide how much reasoning a task requires — and a 1 million token context window now available in beta. On benchmarks, Opus 4.6 sets new marks for agentic coding tasks, scoring 80.8% on SWE-bench Verified and 72.7% on OSWorld.
The release comes at a moment when the AI industry is increasingly focused on models that don’t just answer questions but autonomously complete complex, multi-step tasks. Opus 4.6 is Anthropic’s most direct play for that market.
## Adaptive Thinking
Previous Claude models offered "extended thinking," a mode where the model could spend additional compute on harder problems. The limitation was that users had to decide when to enable it, and once enabled, the model would reason at length even on tasks that didn't need it.
Adaptive thinking removes that decision. Opus 4.6 automatically calibrates its reasoning depth based on task complexity. A straightforward factual question gets a quick response. A complex debugging session or multi-step analysis gets deeper reasoning. Anthropic reports this approach improves both response quality on hard tasks and latency on easy ones.
This matters for production deployments. When you’re routing thousands of diverse queries through a single model endpoint, you don’t want to pay extended-thinking latency on simple lookups. Adaptive thinking lets the model do the routing internally, which simplifies architecture and reduces unnecessary compute spend.
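Before adaptive thinking, teams that wanted this behavior had to route queries themselves. A minimal sketch of the kind of client-side heuristic that now becomes unnecessary (the thresholds and keyword list here are purely illustrative, not from Anthropic's documentation):

```python
def needs_deep_reasoning(query: str) -> bool:
    """Crude client-side heuristic: flag queries that look like
    multi-step work (debugging, refactoring, analysis).
    Illustrative only -- real routers are far more involved."""
    heavy_markers = ("debug", "refactor", "analyze", "prove", "traceback")
    return len(query) > 500 or any(m in query.lower() for m in heavy_markers)

def pick_thinking_mode(query: str) -> str:
    # With adaptive thinking, this branch effectively moves inside the model.
    return "extended" if needs_deep_reasoning(query) else "standard"

print(pick_thinking_mode("What year was Python released?"))
# -> standard
print(pick_thinking_mode("Debug this intermittent deadlock in the worker pool"))
# -> extended
```

The point is not the heuristic itself but that this entire layer, and the latency misses it inevitably produces, can now be dropped from the architecture.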
## 1M Token Context Window
The context window expands to 1 million tokens in beta, up from 200K in the standard tier. Requests whose inputs exceed 200K tokens are billed at a premium: $10.00 per million input tokens and $37.50 per million output tokens, versus $5.00 and $25.00 at standard-context rates.
A million tokens is roughly equivalent to several large codebases, hundreds of pages of documentation, or extensive conversation histories. For enterprises working with large document sets — legal discovery, financial analysis, codebase-wide refactoring — this removes a constraint that previously required chunking, summarization, or retrieval-augmented generation workarounds.
Maximum output length also increases to 128K tokens, which enables the model to generate substantial documents, detailed analyses, or large code changes in a single response.
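As a back-of-the-envelope capacity check, assuming the common rough heuristic of ~4 characters per token (actual tokenization varies by model and content) and reserving room for the 128K maximum output:

```python
CHARS_PER_TOKEN = 4  # rough heuristic; real tokenizers vary

def estimated_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(docs: list[str], window: int = 1_000_000,
                    reserve_output: int = 128_000) -> bool:
    """Check whether a document set plausibly fits in the 1M window,
    leaving headroom for the 128K maximum output."""
    total = sum(estimated_tokens(d) for d in docs)
    return total <= window - reserve_output

# A ~2 MB corpus (~500K estimated tokens) fits comfortably:
corpus = ["x" * 1_000_000, "y" * 1_000_000]
print(fits_in_context(corpus))  # True
```

A check like this is only a first-pass filter before committing to a single-request design over a RAG pipeline; the real token count should come from the provider's tokenizer.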
## Benchmark Performance
Opus 4.6 posts strong numbers across agentic and reasoning benchmarks:
| Benchmark | Score | Context |
|---|---|---|
| SWE-bench Verified | 80.8% | Real-world software engineering tasks |
| Terminal-Bench 2.0 | 65.4% | Terminal-based task completion |
| OSWorld-Verified | 72.7% | Operating system interaction tasks |
| ARC-AGI-2 | 68.8% | Abstract reasoning (83% improvement over Opus 4.5’s 37.6%) |
| GDPval-AA | 1606 Elo | General capability assessment |
The ARC-AGI-2 jump is particularly notable: an 83% relative improvement over its predecessor suggests meaningful training or architectural advances, not just incremental scaling. That said, benchmarks measure performance on specific task distributions, so scores like these are a starting point for evaluation rather than a verdict. For businesses weighing whether frontier models are still improving meaningfully, Opus 4.6 is fresh evidence that capability gains have not plateaued.
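The 83% figure is straightforward to verify from the raw scores in the table above:

```python
opus_4_5 = 37.6  # ARC-AGI-2 score, Opus 4.5
opus_4_6 = 68.8  # ARC-AGI-2 score, Opus 4.6

# Relative improvement: (new - old) / old
relative_gain = (opus_4_6 - opus_4_5) / opus_4_5
print(f"{relative_gain:.0%}")  # 83%
```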
## Pricing and Availability
| Tier | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Standard (≤200K context) | $5.00 | $25.00 |
| Long context (>200K) | $10.00 | $37.50 |
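Since the tier is determined by input size, estimating per-request cost is simple arithmetic. A small helper using the rates from the table above (illustrative, not an official SDK utility):

```python
def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost for one Opus 4.6 request.
    Long-context pricing applies when the input exceeds 200K tokens."""
    if input_tokens > 200_000:
        in_rate, out_rate = 10.00, 37.50  # per 1M tokens
    else:
        in_rate, out_rate = 5.00, 25.00
    return round(input_tokens / 1e6 * in_rate
                 + output_tokens / 1e6 * out_rate, 2)

print(request_cost(150_000, 4_000))   # 0.85
print(request_cost(500_000, 20_000))  # 5.75
```

Note the cliff at the 200K boundary: once input crosses it, the higher rate applies to the whole request, so batching strategy can meaningfully affect spend.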
Opus 4.6 is available through the Anthropic API, AWS Bedrock, Google Vertex AI, and Microsoft Foundry. The model ID is `claude-opus-4-6`.
## Competitive Landscape
Opus 4.6 arrives the same day as OpenAI’s GPT-5.3-Codex release, making February 5 one of the most competitive days in AI model launches. While GPT-5.3-Codex posts a higher Terminal-Bench score (77.3% vs 65.4%), Opus 4.6 leads on SWE-bench Verified (80.8% vs 80.0%) and OSWorld (72.7% vs 64.7%).
The practical takeaway is that the frontier is increasingly crowded. Enterprises choosing between top models now face genuine tradeoffs across different capability dimensions rather than a single clear winner. The right choice depends on your specific use case — a pattern that’s consistent with how AI adoption decisions should work in practice.
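To see the tradeoff concretely, here are the head-to-head numbers from this piece in a form you can extend with your own workload weights (scores as reported above; the structure is just an illustrative starting point):

```python
# Head-to-head scores reported in this article
scores = {
    "SWE-bench Verified": {"Opus 4.6": 80.8, "GPT-5.3-Codex": 80.0},
    "Terminal-Bench 2.0": {"Opus 4.6": 65.4, "GPT-5.3-Codex": 77.3},
    "OSWorld-Verified":   {"Opus 4.6": 72.7, "GPT-5.3-Codex": 64.7},
}

def leader_per_benchmark(scores: dict) -> dict:
    """Which model leads on each capability dimension."""
    return {bench: max(models, key=models.get)
            for bench, models in scores.items()}

for bench, model in leader_per_benchmark(scores).items():
    print(f"{bench}: {model}")
```

No model sweeps all three dimensions, which is exactly the "genuine tradeoffs" situation: weight the benchmarks by how closely each resembles your workload before picking.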
## What This Means
Adaptive thinking is the most consequential change. It signals a shift from models that offer fixed reasoning modes to models that self-optimize their compute usage. For teams building AI-powered products, this simplifies integration — you send the query, and the model figures out how hard to think about it.
The 1M token context window, combined with 128K output, makes Opus 4.6 viable for workflows that previously required complex RAG pipelines or multi-step processing chains. There’s still a cost premium for long context, but the architectural simplification may be worth it for many use cases.
For organizations already invested in Claude, Opus 4.6 is a straightforward upgrade. For those evaluating options, the increasingly competitive landscape means the best strategy is testing against your specific workloads rather than relying on benchmark comparisons alone.
Official announcement: Anthropic blog | Bloomberg coverage
