
Gemini 3.1 Pro Leads 12 of 18 Benchmarks

Gemini 3.1 Pro scores 77.1% on ARC-AGI-2, leads on 12 benchmarks, and doubles reasoning power at no price increase.

S5 Labs Team · February 19, 2026

Google DeepMind released Gemini 3.1 Pro today, and the benchmark results are hard to ignore: 77.1% on ARC-AGI-2, 94.3% on GPQA Diamond, and leading scores on 12 of the 18 major benchmarks tracked by independent evaluators. It’s the broadest benchmark lead any single model has held in over a year.

The pricing is unchanged from Gemini 3 Pro: $2.00 per million input tokens and $12.00 per million output tokens, making this effectively a 2x+ reasoning upgrade at zero additional cost.

Benchmark Dominance

| Benchmark | Gemini 3.1 Pro | Best Competitor | Context |
|---|---|---|---|
| ARC-AGI-2 | 77.1% | 68.8% (Opus 4.6) | Abstract reasoning |
| GPQA Diamond | 94.3% | 88.4% (Qwen 3.5) | Graduate-level QA |
| LiveCodeBench Pro | 2887 Elo | | Coding evaluation |
| BrowseComp | 85.9% | | Web browsing comprehension |
| SciCode | 59% | | Scientific coding |
| APEX-Agents | 33.5% | | Agent capability |
| MCP Atlas | 69.2% | | Multi-capability |

The ARC-AGI-2 result is particularly striking. This benchmark tests abstract reasoning, the ability to identify novel patterns and generalize from few examples, and has historically been resistant to scaling improvements. (For a deeper look at how evaluations like ARC-AGI-2, SWE-bench, and GPQA Diamond actually work, see our guide on how AI benchmarks work.) Gemini 3.1 Pro's 77.1% is nearly 2.5x Gemini 3 Pro's 31.1% and sits 8.3 percentage points above Claude Opus 4.6's 68.8%. Gains of this magnitude on reasoning benchmarks typically require architectural innovations, not just more compute.

The GPQA Diamond score of 94.3% is approaching the ceiling of what’s meaningful on graduate-level question answering — there’s limited room for improvement before the benchmark loses discriminative power.

Three Thinking Levels

Gemini 3.1 Pro introduces configurable thinking modes: Low, Medium, and High. This is Google’s version of the compute-scaling approach that Anthropic calls “adaptive thinking” — letting users or applications control how much reasoning the model applies to a given task.

The difference is that Gemini 3.1 Pro makes this explicit and user-controllable rather than automatic. For production deployments, explicit control means you can route simple queries to Low thinking (faster, cheaper) and complex analysis to High thinking (slower, more thorough) without relying on the model’s own judgment about task difficulty.
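In practice, the routing can be a simple lookup from task class to level. Here is a minimal sketch with the google-genai Python SDK, assuming the three levels surface through ThinkingConfig's thinking_level field the way Gemini 3's low/high did; the gemini-3.1-pro model ID, the "medium" value, and the task classes are illustrative assumptions:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# Map task classes to thinking levels: cheap and fast for routine
# queries, slow and thorough for hard ones. Classes are illustrative.
THINKING_LEVELS = {
    "lookup": "low",      # extraction, formatting, FAQ-style answers
    "standard": "medium", # assumed: "medium" joins Gemini 3's low/high
    "analysis": "high",   # multi-step reasoning, hard coding tasks
}

def ask(prompt: str, task_class: str = "standard") -> str:
    response = client.models.generate_content(
        model="gemini-3.1-pro",  # assumed model ID for illustration
        contents=prompt,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(
                thinking_level=THINKING_LEVELS[task_class],
            ),
        ),
    )
    return response.text
```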

The model supports a 1 million-token context window and outputs of up to 64K tokens.

Pricing in Context

| Model | Input (per 1M tokens) | Output (per 1M tokens) | ARC-AGI-2 |
|---|---|---|---|
| Gemini 3.1 Pro | $2.00 | $12.00 | 77.1% |
| Claude Sonnet 4.6 | $3.00 | $15.00 | |
| Claude Opus 4.6 | $5.00 | $25.00 | 68.8% |
| MiniMax M2.5 | $0.30 | $1.10 | |

At $2.00 in and $12.00 out, Gemini 3.1 Pro is cheaper than both Claude tiers while leading on more benchmarks. It's not the cheapest option (MiniMax M2.5 and Qwen 3.5 undercut it significantly), but for organizations that want frontier performance from a major Western lab with enterprise support, the price-performance ratio is currently the best available.
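The gap compounds at volume. Here is a back-of-envelope comparison at the list prices above, for an assumed workload of 4K input and 1K output tokens per request (the token counts are illustrative, not measurements):

```python
# Per-request cost at the list prices above, for an assumed workload
# of 4K input tokens and 1K output tokens (illustrative numbers only).
PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.6": (5.00, 25.00),
    "MiniMax M2.5": (0.30, 1.10),
}

IN_TOKENS, OUT_TOKENS = 4_000, 1_000

for model, (p_in, p_out) in PRICES.items():
    cost = IN_TOKENS / 1e6 * p_in + OUT_TOKENS / 1e6 * p_out
    print(f"{model:18} ${cost:.4f}/request, ${cost * 1e6:,.0f} per 1M requests")
```

At that mix, Opus 4.6 comes out to about 2.25x Gemini 3.1 Pro's per-request cost.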

Google keeping pricing flat while delivering a substantial capability upgrade is a competitive signal. Rather than extracting more revenue from improved performance, Google is using price stability to attract volume and ecosystem lock-in through Google Cloud, Vertex AI, and their developer tooling.

What Google Gets Right (and Where It Lags)

The benchmark breadth is Gemini 3.1 Pro’s standout quality. Leading on 12 of 18 tracked benchmarks means this isn’t a model that’s been optimized for a narrow set of evaluations — it’s broadly capable across reasoning, coding, science, browsing, and agent tasks.

Gemini 3.1 Pro's position is less clear in agentic coding specifically. Google hasn't published SWE-bench Verified scores in its announcement, and the APEX-Agents score of 33.5%, while potentially a leading result, suggests the agent evaluation landscape is still immature. The models from Anthropic and OpenAI that score 80%+ on SWE-bench have a clearer narrative around software engineering automation.

The 64K output limit is also worth noting. While generous, it’s half of Opus 4.6’s 128K output, which matters for tasks that require generating large code changes or documents in a single pass.
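Where a single response must exceed the cap, the usual workaround is multi-pass generation with stitching. Below is a naive sketch under the same SDK assumptions as above (the continuation prompt and stop heuristic are illustrative; production pipelines typically chunk by file or document section instead):

```python
from google import genai

client = genai.Client()  # reads the API key from the environment

def generate_long(prompt: str, max_passes: int = 4) -> str:
    """Stitch a long output together across several calls when one
    response would exceed the 64K output cap (naive continuation)."""
    parts: list[str] = []
    for _ in range(max_passes):
        if parts:
            # Feed back only the tail of what we have so far, to keep
            # the continuation request's input small.
            contents = (
                prompt
                + "\n\nContinue the text below exactly where it stops, "
                + "without repeating anything:\n\n"
                + "".join(parts)[-8_000:]
            )
        else:
            contents = prompt
        response = client.models.generate_content(
            model="gemini-3.1-pro",  # assumed model ID, as above
            contents=contents,
        )
        parts.append(response.text)
        # Stop once the model finished naturally rather than hitting
        # its output cap ("MAX_TOKENS").
        if response.candidates[0].finish_reason.name == "STOP":
            break
    return "".join(parts)
```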

February’s Competitive Picture

Gemini 3.1 Pro caps off a February that has seen an extraordinary density of frontier model releases:

  • January 27: Kimi K2.5 — 1T parameter open-weight model
  • February 5: Claude Opus 4.6 and GPT-5.3-Codex — same-day flagship releases
  • February 12: MiniMax M2.5 — frontier SWE-bench at 1/10th pricing
  • February 16: Qwen 3.5 — 201-language open-weight release
  • February 17: Claude Sonnet 4.6 — near-Opus performance at mid-tier pricing
  • February 19: Gemini 3.1 Pro — broadest benchmark lead in over a year

Seven frontier model releases in 24 days. The pace is unprecedented, and it has practical implications for anyone making technology decisions. Models that were frontier-leading two weeks ago are now trailing on key benchmarks. The AI disruption of the software industry isn’t slowing down — it’s accelerating.

For businesses, the takeaway is that betting on a single model provider is increasingly risky. The smartest approach is building model-agnostic architectures that can swap between providers as the performance landscape shifts — because in February 2026, it shifts weekly.
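Concretely, model-agnostic can mean nothing more elaborate than a single call interface plus a routing table that lives in configuration. A minimal Python sketch; every name in it is illustrative rather than a real framework:

```python
from typing import Protocol

class ChatModel(Protocol):
    """The only interface application code is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...

class EchoAdapter:
    """Stand-in for a real provider adapter. Real adapters would wrap
    google-genai, anthropic, openai, etc. behind the same signature."""
    def __init__(self, model_id: str) -> None:
        self.model_id = model_id
    def complete(self, prompt: str) -> str:
        return f"[{self.model_id}] {prompt}"

# Routing lives in data, not code: when the leaderboard shifts next
# week, swapping a provider is a one-line config change.
ROUTES: dict[str, ChatModel] = {
    "reasoning": EchoAdapter("gemini-3.1-pro"),
    "agentic-coding": EchoAdapter("claude-opus-4.6"),
    "bulk": EchoAdapter("minimax-m2.5"),
}

def complete(task_class: str, prompt: str) -> str:
    return ROUTES[task_class].complete(prompt)

if __name__ == "__main__":
    print(complete("reasoning", "Summarize the quarterly report."))
```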

Official announcement: Google blog | DeepMind model card | Techzine analysis
