GPT-5.3-Codex: OpenAI's Agentic Code Model

OpenAI released GPT-5.3-Codex today, an agentic coding model that combines the coding capabilities of GPT-5.2-Codex with the reasoning abilities of GPT-5.2 — and that was, according to OpenAI, partially used in its own creation. The model scores 77.3% on Terminal-Bench 2.0, the highest score recorded on that benchmark, and is designed for long-running, multi-step coding tasks with tool use and persistent context.

The release landed just minutes after Anthropic’s Claude Opus 4.6 announcement, making February 5 a landmark day for competitive AI releases.

Architecture and Capabilities

GPT-5.3-Codex isn’t a new foundation model — it’s a fusion of two existing ones. OpenAI combined GPT-5.2-Codex’s specialized coding abilities with GPT-5.2’s general reasoning, producing a model that can handle complex software engineering tasks that require both deep code understanding and broader analytical thinking.

The result is 25% faster than GPT-5.2-Codex while maintaining or exceeding its predecessor’s coding performance. More importantly, it’s designed for agentic workflows: long-running tasks where the model uses tools, maintains context across multiple steps, and makes autonomous decisions about how to approach a problem. These capabilities reflect the agentic AI architecture patterns — ReAct loops, persistent memory, and tool orchestration — that are becoming standard in frontier coding models.
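To make the pattern concrete, here is a minimal sketch of a ReAct-style loop (reason, act, observe) with a persistent memory list and a small tool registry. It is an illustration of the general pattern only, not OpenAI's implementation: the `Agent` and `Step` classes, the `scripted_policy` function that stands in for a model call, and the `run_tests` and `apply_patch` tools are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    thought: str                 # the model's reasoning for this turn
    tool: Optional[str] = None   # tool to call; None means "finished"
    arg: str = ""                # tool argument, or the final answer


@dataclass
class Agent:
    policy: Callable[[List[str]], Step]      # hypothetical stand-in for a model call
    tools: Dict[str, Callable[[str], str]]   # tool name -> function
    memory: List[str] = field(default_factory=list)  # context kept across steps

    def run(self, task: str, max_steps: int = 8) -> str:
        self.memory.append(f"task: {task}")
        for _ in range(max_steps):
            step = self.policy(self.memory)                     # reason
            self.memory.append(f"thought: {step.thought}")
            if step.tool is None:
                return step.arg
            observation = self.tools[step.tool](step.arg)       # act
            self.memory.append(f"observation: {observation}")   # observe
        return "step budget exhausted"


def scripted_policy(memory: List[str]) -> Step:
    """Hard-coded decisions that mimic what a model might choose."""
    observations = [m for m in memory if m.startswith("observation: ")]
    last = observations[-1] if observations else ""
    if not observations or "applied:" in last:
        return Step("Run the test suite.", tool="run_tests", arg="tests/")
    if "1 failed" in last:
        return Step("A test failed; patch it.", tool="apply_patch", arg="fix off-by-one")
    return Step("All tests pass.", arg="done: tests pass")


def demo():
    repo = {"patched": False}  # fake repository state shared by the tools

    def run_tests(path: str) -> str:
        return "0 failed" if repo["patched"] else "1 failed"

    def apply_patch(description: str) -> str:
        repo["patched"] = True
        return f"applied: {description}"

    agent = Agent(scripted_policy, {"run_tests": run_tests, "apply_patch": apply_patch})
    result = agent.run("make the tests pass")
    return result, agent.memory
```

Calling `demo()` walks through four turns: run the tests, patch the failure, re-run the tests, then stop. In a real system the policy would be a model call and the tools would touch a real shell or file system; the `max_steps` budget is one simple way to keep a long-running task bounded.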

OpenAI’s claim that GPT-5.3-Codex was partially used in its own creation is both a marketing statement and a technical milestone. It suggests the model is capable enough to contribute meaningfully to the kind of complex engineering work that goes into building frontier AI systems. The system card provides more detail on safety evaluations and capability assessments.

Benchmark Performance

Benchmark	Score	What it measures
Terminal-Bench 2.0	77.3%	Terminal and command-line task completion
SWE-bench Verified	80.0%	Real-world software engineering
SWE-Bench Pro (Public)	56.8%	Advanced software engineering tasks
OSWorld-Verified	64.7%	Operating system interaction

The Terminal-Bench 2.0 score of 77.3% is a new high-water mark, suggesting particularly strong performance on the kind of command-line, file-manipulation, and system-administration tasks that make up real-world development work. The SWE-bench Verified score of 80.0% puts it in the same tier as Anthropic’s Opus 4.6 (80.8%), with the difference small enough to be within noise for most practical applications.
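A back-of-the-envelope check shows why a 0.8-point gap is hard to distinguish from noise. SWE-bench Verified contains 500 tasks, so the sketch below treats each score as a binomial proportion over 500 independent trials. That independence assumption is a simplification, and the calculation ignores run-to-run variance and differences in evaluation harness; the function names are illustrative.

```python
import math

N_TASKS = 500  # number of tasks in SWE-bench Verified


def standard_error(p: float, n: int = N_TASKS) -> float:
    """Standard error of a pass rate p measured over n independent tasks."""
    return math.sqrt(p * (1 - p) / n)


def gap_in_tasks(a: float, b: float, n: int = N_TASKS) -> int:
    """How many tasks separate two pass rates on an n-task benchmark."""
    return round(abs(a - b) * n)


def z_score(a: float, b: float, n: int = N_TASKS) -> float:
    """Gap between two pass rates, in units of the standard error of the difference."""
    se_diff = math.sqrt(standard_error(a, n) ** 2 + standard_error(b, n) ** 2)
    return abs(a - b) / se_diff


codex, opus = 0.800, 0.808
print(f"gap: {gap_in_tasks(codex, opus)} tasks")
print(f"standard error at 80%: {standard_error(codex) * 100:.2f} points")
print(f"z-score of the gap: {z_score(codex, opus):.2f}")
```

Under these assumptions the gap is 4 tasks out of 500, while the standard error on a single 80% score is about 1.8 percentage points, so the difference sits well inside one standard error.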

Availability and Pricing

GPT-5.3-Codex is available through:

ChatGPT for paid subscribers (Plus, Pro, Team, Enterprise)
GitHub Copilot — generally available as of February 9, 2026
API — rolling out to developers

Official API pricing was not announced at launch. For reference, the predecessor GPT-5.2-Codex was priced at $1.25 per million input tokens and $10.00 per million output tokens. Given the performance improvements, pricing is expected to be in a similar range.
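For rough budgeting, the predecessor's rates can be turned into a per-task estimate. The sketch below uses the GPT-5.2-Codex prices quoted above as a placeholder, since GPT-5.3-Codex pricing is not yet known; the token counts in the example are hypothetical and real agentic sessions vary widely.

```python
# Predecessor (GPT-5.2-Codex) rates as quoted in this article, used as a
# placeholder only; GPT-5.3-Codex rates may differ.
INPUT_USD_PER_MILLION = 1.25
OUTPUT_USD_PER_MILLION = 10.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for one session at the placeholder rates."""
    return (input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
            + output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION)


# Hypothetical long-running session: 2M input tokens, 150k output tokens.
print(f"${estimate_cost_usd(2_000_000, 150_000):.2f}")
```

Because agentic sessions re-read context on every step, input tokens often dominate the volume even though output tokens cost more per token, which is why both terms are worth tracking separately.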

The GitHub Copilot integration is particularly significant. It means GPT-5.3-Codex’s agentic coding capabilities will be accessible directly within development environments, lowering the barrier for developers who want AI-assisted development without building custom integrations.

Security Considerations

Fortune reported that OpenAI’s own evaluation flagged GPT-5.3-Codex as presenting heightened cybersecurity risks. A model capable of autonomously executing complex coding tasks across long sessions can, in theory, find and exploit vulnerabilities as effectively as it patches them.

OpenAI’s system card acknowledges these dual-use concerns and describes guardrails implemented to limit misuse. This is a tension that will only grow as coding models become more capable — the same abilities that make a model excellent at software engineering also make it potentially useful for offensive security. Organizations deploying these models should factor security review into their integration plans.

What the Same-Day Release Means

OpenAI and Anthropic releasing flagship models within minutes of each other is unlikely to be a coincidence. Both companies are competing for the same enterprise market, and both treat agentic coding as the current frontier. Their top models now score within a percentage point of each other on SWE-bench Verified (80.0% vs 80.8%), which suggests that current approaches may be nearing a performance ceiling.

For businesses, this convergence is good news. When multiple frontier models offer similar capabilities, competition tends to drive down prices and improve service. The meaningful differentiation shifts from raw benchmark scores to factors like reliability, latency, tool ecosystem, and integration quality, the same factors that matter in any enterprise technology decision. Meanwhile, the open-source community is building on this convergence with projects like Moltbot, which aim to bring agentic AI capabilities to individual developers without vendor lock-in.

Key Takeaway

GPT-5.3-Codex represents the state of the art in agentic coding. Its strength is in sustained, multi-step tasks where the model needs to reason about code, use tools, and maintain context over long sessions. The GitHub Copilot integration makes it immediately accessible to millions of developers. Combined with the competitive pressure from Anthropic’s same-day release, this is a moment where the practical capability of AI coding tools takes a visible step forward.

Official announcement: OpenAI blog | System Card | TechCrunch coverage