GPT-5.5: OpenAI Reclaims Terminal Coding a Week After Opus 4.7

OpenAI released GPT-5.5 on April 23, 2026, one week to the day after Anthropic shipped Opus 4.7 and claimed it had narrowly retaken the frontier. That claim held for exactly seven days. GPT-5.5 (internal codename “Spud”) reclaims raw terminal coding by a wide margin and posts the strongest frontier-math scores any public model has shown. The picture is more mixed than OpenAI’s headline suggests, but on the two axes where it leads, it leads clearly.

It is available immediately across the API, ChatGPT Plus and Pro, Codex, and Copilot. API pricing is $5 per million input tokens and $30 per million output tokens: input matches Opus 4.7, output runs $5 higher. The context window is 1M tokens, narrowing to 400K inside Codex. A GPT-5.5 Pro tier lists at $30 input and $180 output per million tokens for the hardest problems, and batch and flex modes cut the standard rate in half.

The benchmark picture

OpenAI’s reported scores against the two models that mattered on launch day:

Benchmark	GPT-5.5	Opus 4.7	Gemini 3.1 Pro
Terminal-Bench 2.0	82.7%	69.4%	~67%
SWE-bench Verified	88.7%	87.6%	80.6%
SWE-bench Pro	58.6%	64.3%	54.2%
FrontierMath (Tier 4)	35.4%	—	16.7%
GPQA Diamond	93.6%	—	—

The terminal-coding result is the decisive one. At 82.7% on Terminal-Bench 2.0, GPT-5.5 clears Opus 4.7’s 69.4% by 13.3 percentage points. The gap on raw terminal work that Opus 4.7 already conceded to GPT-5.4 didn’t close with this release; it widened. FrontierMath is the other standout: 35.4% on the hardest tier is a little more than double Gemini 3.1 Pro’s 16.7%, and the hardest math problems are where reasoning models genuinely separate rather than trade rounding error.

The complication is SWE-bench, and OpenAI is unusually candid about it. GPT-5.5 edges Opus 4.7 on SWE-bench Verified (88.7% vs 87.6%, close enough that the eval harness matters more than the number), but trails it on the harder SWE-bench Pro (58.6% vs 64.3%). OpenAI also disclosed that GPT-5.5 may have seen parts of the public SWE-bench Pro set during training, which is a reason to discount that whole column on both sides. The takeaway is not that GPT-5.5 “won.” It is that on agentic terminal work and math it leads outright, and on broad software engineering neither model has a lead that survives scrutiny: the Verified gap is inside the margin of error, and the Pro column is hard to trust.
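To make "inside the margin of error" concrete, here is a rough sketch that expresses the SWE-bench Verified gap in units of standard error. It assumes each score is an independent binomial pass rate over 500 tasks, which is the published size of SWE-bench Verified; the function names are illustrative. Both models are in fact run on the same tasks, so this is only a back-of-envelope check.

```python
import math

# Assumption: both scores are pass rates over the same number of tasks.
SWE_BENCH_VERIFIED_TASKS = 500


def diff_standard_error(p_a: float, p_b: float, n: int) -> float:
    """Standard error of (p_a - p_b), treating each run as an
    independent binomial sample of n tasks."""
    var_a = p_a * (1 - p_a) / n
    var_b = p_b * (1 - p_b) / n
    return math.sqrt(var_a + var_b)


def gap_in_standard_errors(p_a: float, p_b: float, n: int) -> float:
    """How many standard errors separate the two pass rates."""
    return (p_a - p_b) / diff_standard_error(p_a, p_b, n)


# Scores from the table above: GPT-5.5 88.7%, Opus 4.7 87.6%.
verified_gap = gap_in_standard_errors(0.887, 0.876, SWE_BENCH_VERIFIED_TASKS)
print(f"SWE-bench Verified gap: {verified_gap:.2f} standard errors")
```

Under these assumptions the 1.1-point Verified gap comes out at roughly half a standard error, which is why harness differences can easily outweigh it. Other benchmarks need their own task counts before the same check means anything.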

The context window is the upgrade that travels

The 1M-token context is the change most likely to alter how teams actually use the model. It puts GPT-5.5 in range of long-context agentic coding — feeding most of a mid-size repository into a single call instead of building retrieval scaffolding around a smaller window. The 400K ceiling inside Codex is lower, but it is still a meaningful step up for the coding surface that OpenAI has been pushing hardest since GPT-5.3-Codex, and it pairs naturally with the terminal-coding lead. A model that is both best-in-class on Terminal-Bench and able to hold a large codebase in context is aimed squarely at the long-horizon coding agent, which is where the GPT-5.4 line was already heading.
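Whether a given repository actually fits is easy to estimate before committing to a single-call design. The sketch below is a rough check, not a tokenizer: the four-characters-per-token ratio, the output reserve, the file extensions, and the surface names are all assumptions for illustration. Only the 1M and 400K limits come from the figures above.

```python
import math
import os

CHARS_PER_TOKEN = 4  # assumption: common rough heuristic, not a real tokenizer
CONTEXT_LIMITS = {"api": 1_000_000, "codex": 400_000}  # limits quoted above


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_repo_tokens(root: str, extensions=(".py", ".ts", ".go", ".md")) -> int:
    """Sum estimated tokens over source files under root, skipping dot-directories."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories such as .git in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total += estimate_tokens(handle.read())
    return total


def fits(tokens: int, surface: str, reserve_for_output: int = 50_000) -> bool:
    """True if the prompt plus a reserve for the response fits the window."""
    return tokens + reserve_for_output <= CONTEXT_LIMITS[surface]
```

Calling `fits(estimate_repo_tokens("."), "codex")` gives a quick yes or no for the Codex surface. A real deployment should count tokens with the provider's own tokenizer.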

The price of the answer

The pricing is where the competitive math gets specific. GPT-5.5’s $30 output rate sits 20% above Opus 4.7’s $25. For coding-heavy workloads where the terminal lead is real and repeated, that premium is easy to justify on output quality alone. For the SWE-bench-Pro-style work where Opus 4.7 is still ahead, it isn’t: you would be paying more for the model that scores lower on your task. The GPT-5.5 Pro tier at $30/$180, six times the standard rate, sits in a different bracket entirely; it is priced for problems where a single correct answer is worth dollars, not for production throughput.
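The premium is easier to judge against a concrete workload. The sketch below uses only the per-million-token prices quoted in this article; the model keys are illustrative labels rather than API identifiers, and the monthly token volumes are made up. The half-price batch option is applied only to standard GPT-5.5, since that is the only tier the article states it for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Prices as quoted in the article; keys are illustrative labels only.
PRICES = {
    "gpt-5.5": Price(5.0, 30.0),
    "gpt-5.5-pro": Price(30.0, 180.0),
    "opus-4.7": Price(5.0, 25.0),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Cost of a workload in USD; batch halves the standard GPT-5.5 rate."""
    if batch and model != "gpt-5.5":
        raise ValueError("batch discount is only stated for standard GPT-5.5")
    price = PRICES[model]
    cost = (input_tokens / 1e6) * price.input_per_m
    cost += (output_tokens / 1e6) * price.output_per_m
    return cost / 2 if batch else cost


# Hypothetical monthly volume: 200M input tokens, 20M output tokens.
for name in PRICES:
    print(name, cost_usd(name, 200_000_000, 20_000_000))
```

On an input-heavy workload like this one the output premium shrinks to a small share of the bill, so the decision turns on how output-heavy the job is.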

Input parity at $5 is the quieter signal. OpenAI matched Anthropic on the cost of feeding the model rather than undercutting it, which says the two labs now agree on roughly what frontier inference costs. The competition has moved off price and onto which benchmark you weight.

What this cadence means for buyers

A week is not enough time to migrate a production stack, and that is the real story of April 2026: the frontier lead is changing hands faster than anyone can responsibly act on it. Opus 4.7 led for seven days. GPT-5.5 leads now, on some axes, until the next drop. The teams getting value out of this aren’t the ones chasing the weekly leaderboard — they’re the ones who picked the model that fits their dominant workload and built a harness loose enough to swap it when the numbers move.
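One way to keep a harness "loose enough to swap" is to hide each provider behind a small interface and score every candidate on the same private eval cases. The sketch below is a minimal illustration of that idea: `CodingModel`, `EvalCase`, and `FixedAnswerModel` are hypothetical names, and the stub stands in for a real provider client, which would wrap an SDK call behind the same `complete` method.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class CodingModel(Protocol):
    """Anything with a complete() method can be plugged into the harness."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable


class FixedAnswerModel:
    """Stub model for demonstration; always returns the same text."""

    def __init__(self, answer: str) -> None:
        self.answer = answer

    def complete(self, prompt: str) -> str:
        return self.answer


def pass_rate(model: CodingModel, cases: list[EvalCase]) -> float:
    """Fraction of eval cases whose output passes its check."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for case in cases if case.check(model.complete(case.prompt)))
    return passed / len(cases)


def pick_best(models: dict[str, CodingModel], cases: list[EvalCase]) -> str:
    """Name of the model with the highest pass rate on these cases."""
    scores = {name: pass_rate(model, cases) for name, model in models.items()}
    return max(scores, key=scores.get)
```

With this shape, swapping models after the next release is a one-line change to the `models` dictionary, and the decision comes from your own cases rather than a public leaderboard.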

If your work is terminal-heavy, GPT-5.5 is the obvious pick today. If it’s broad software engineering, the gap against Opus 4.7 is narrower than either company’s launch post admits, and the right answer is whichever one your own evals prefer. Run them now — the leaderboard will have moved again before you finish.
