Frontier AI Models: Every Major Release (February–March 2026)

The Most Packed Month in AI History

The frontier AI race has gone into overdrive. Ten major providers are actively shipping frontier models, each one pushing the boundaries of what language models can do.

Here's the timeline:

Date	Provider	Model	Highlight
Dec 2	Mistral AI	Mistral Large 3	675B MoE, #2 open-source on LMArena
Jan 27	Moonshot AI	Kimi K2.5	1T open-source MoE with Agent Swarm
Feb 5	OpenAI	GPT-5.3 Codex	First "self-improving" agentic coding model
Feb 11	Zhipu AI	GLM-5	745B open-source model trained on Chinese chips
Feb 12	DeepSeek	V3.2 Update	Context window expanded 10x to 1M+ tokens
Feb 15	Moonshot AI	Kimi Claw	Browser-based agent platform powered by K2.5
Feb 16	ByteDance	Seed 2.0 Pro	#6 LMSYS Text, #3 Vision, multimodal, ~10x cheaper
Feb 17	Anthropic	Claude Sonnet 4.6	Near-Opus performance at 1/5th the price
Feb 17	xAI	Grok 4.2 RC	"Rapid learning" model that improves weekly
Feb 17	DeepSeek	V4 (expected)	1T-param model targeting coding dominance
Feb 19	Google	Gemini 3.1 Pro	2x reasoning jump, ARC-AGI-2 score of 77.1%
2026	MiniMax	M2.5	#1 Multi-SWE-Bench, 10B active params, $0.30/M

This isn't just incremental improvement. This is a fundamental shift in what AI models can do, how much they cost, and who's building them.

Let's break down each release.

OpenAI: GPT-5.3 Codex

Released: February 5, 2026

OpenAI's GPT-5.3 Codex represents a paradigm shift from "model that writes code" to "model that does nearly anything developers can do on a computer."

What's New

GPT-5.3 Codex combines the frontier coding performance of GPT-5.2-Codex with the reasoning and professional knowledge of GPT-5.2. The result is a model that can take on long-running tasks involving research, tool use, and complex multi-step execution.

Key improvements:

25% faster than GPT-5.2-Codex
Fewer tokens consumed per task — builds more with less
State-of-the-art on SWE-Bench Pro and Terminal-Bench
Strong results on OSWorld and GDPval

The Cybersecurity Flag

This is the first OpenAI model to hit "high" on their cybersecurity preparedness framework — meaning they believe GPT-5.3 Codex is capable enough at coding and reasoning to "meaningfully enable real-world cyber harm, especially if automated or used at scale." It's a milestone that underscores just how capable these models have become.

Availability

Available to paid ChatGPT users via Codex app, CLI, IDE extension, and web. A lighter GPT-5.3-Codex-Spark variant was also released. API access coming soon.

Pricing

Model	Input (per 1M)	Output (per 1M)	Cached Input
GPT-5	$1.25	$10.00	$0.625
GPT-5.3 Codex	TBA (API pending)	TBA	TBA
o3	$2.00	$8.00	—
o4-mini	$1.10	$4.40	$0.55

Anthropic: Claude Sonnet 4.6

Released: February 17, 2026

Claude Sonnet 4.6 is Anthropic's answer to a question few would have asked seriously a year ago: can a mid-tier model match a flagship?

What's New

This isn't a minor version bump. Sonnet 4.6 is a full upgrade across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. It ships with a 1M token context window (in beta).

Benchmark Highlights

Benchmark	Sonnet 4.6	Opus 4.6	Gap
SWE-bench Verified	79.6%	—	—
OSWorld (Computer Use)	72.5%	72.7%	0.2 pts
Office Productivity	1633 Elo	1559 Elo	Sonnet leads
Financial Analysis	63.3%	62.0%	Sonnet leads

The computer use number is remarkable: 72.5% on OSWorld-Verified, up from 14.9% when computer use first launched just 16 months earlier.

User Preference

Anthropic reports that 70% of users prefer Sonnet 4.6 over Sonnet 4.5, and 59% prefer it over the older Opus 4.5. At $3/$15 per million tokens — one-fifth of Opus 4.6's $15/$75 — this is the best value in frontier AI right now for enterprise workloads.

Claude Opus 4.6

The flagship Opus 4.6 remains the ceiling for Anthropic's capabilities, powering the most demanding agentic and reasoning tasks. But the gap with Sonnet is now razor-thin, making the mid-tier model the pragmatic choice for most applications.

Google: Gemini 3.1 Pro

Released: February 19, 2026

Google is framing Gemini 3.1 Pro not as a niche upgrade but as a sturdier default model for complex tasks.

What's New

The headline number: an ARC-AGI-2 score of 77.1%, more than double Gemini 3 Pro's result on the same reasoning benchmark. The model is designed for tasks that require advanced multi-step reasoning, like synthesizing data across sources or explaining complex interdependent topics.

Availability

Rolling out across the full Google ecosystem:

Gemini app (higher limits for Pro and Ultra plan users)
NotebookLM (Pro and Ultra users)
Gemini API via AI Studio, Vertex AI, Gemini CLI, and Android Studio
Pricing unchanged from Gemini 3 Pro (~$1.25/$10 per million tokens standard)

Why It Matters

Google kept pricing flat while dramatically improving reasoning. For enterprises already on Google Cloud, 3.1 Pro slots in as a direct upgrade with zero budget impact.

DeepSeek: V4 & The 10x Context Expansion

V3.2 Update: February 12, 2026

V4 Expected: Mid-February 2026

DeepSeek continues to be the most disruptive force in AI pricing while pushing genuine frontier capabilities.

V3.2: 10x Context Expansion

In early February, DeepSeek expanded V3.2's context window from 128,000 tokens to over 1 million — a tenfold increase. At $0.27/$1.10 per million tokens, this is now the cheapest way to process massive documents with a frontier-class model.

V4: The Next Frontier

DeepSeek V4 is expected to launch with:

1 trillion parameters (MoE architecture)
1M+ token context native
Three architectural breakthroughs: Engram conditional memory, Manifold-Constrained Hyper-Connections, and DeepSeek Sparse Attention
Target: 80%+ on SWE-bench — which would put it at the very top of coding benchmarks
Expected to be open-weight under a permissive license

The Cost Story

The pricing gap between DeepSeek and Western providers remains staggering:

Task Cost Example	GPT-5	Claude Opus 4.6	DeepSeek V3.2
100K input + 10K output	$0.225	$2.25	$0.038
Ratio to DeepSeek	6x	59x	1x

At the 6x ratio above, a complex task costing $15 with GPT-5 costs approximately $2.50 with DeepSeek. This isn't just a cost advantage; it changes what's economically viable to automate.
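The cost table follows directly from the per-million-token prices quoted in this article. A minimal Python sketch of that arithmetic, assuming the listed input and output prices and ignoring any caching discounts (the `PRICES` dictionary and `task_cost` helper are illustrative names, not part of any provider's API):

```python
# Per-million-token prices in USD as quoted in this article: (input, output).
PRICES = {
    "GPT-5": (1.25, 10.00),
    "Claude Opus 4.6": (15.00, 75.00),
    "DeepSeek V3.2": (0.27, 1.10),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed prices (no caching)."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Same workload as the table: 100K input tokens plus 10K output tokens.
    baseline = task_cost("DeepSeek V3.2", 100_000, 10_000)
    for model in PRICES:
        cost = task_cost(model, 100_000, 10_000)
        print(f"{model}: ${cost:.3f} ({cost / baseline:.0f}x DeepSeek)")
```

Changing the token counts shows how the ratio moves with the workload mix: input-heavy jobs sit closer to the input-price ratio, output-heavy jobs closer to the output-price ratio.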

Zhipu AI: GLM-5

Released: February 11, 2026

The biggest open-source model release of the month, and possibly the most geopolitically significant.

What's New

GLM-5 is a 745 billion parameter MoE model (44B active parameters) with five core capabilities: creative writing, code generation, multi-step reasoning, agentic intelligence, and long-context processing.

Benchmark Performance

Benchmark	GLM-5	Comparison
SWE-bench Verified	77.8%	Matches Claude Opus 4.5
AIME 2026	92.7%	—
GPQA-Diamond	86.0%	—
Humanity's Last Exam	50.4%	Beats Claude Opus 4.5
Hallucination Rate	34%	Down from 90% (GLM-4.7)

The drop in hallucination rate from 90% to 34%, achieved with a novel RL technique called Slime, is particularly impressive and puts GLM-5 at the top of the Artificial Analysis Omniscience Index.

The Geopolitical Signal

GLM-5 was trained entirely on Huawei Ascend chips using the MindSpore framework — zero US-manufactured hardware. This demonstrates that China's domestic compute stack can produce frontier-quality models despite export controls.

Native Agent Mode

GLM-5 ships with a native "Agent Mode" that can transform prompts into professional office documents (.docx, .pdf, .xlsx) — directly competing with Anthropic's computer use and OpenAI's Codex on practical business tasks.

Following the launch, Zhipu's shares surged 34% on the Hong Kong Stock Exchange.

Moonshot AI: Kimi K2.5 & Kimi Claw

K2.5 Released: January 27, 2026

Kimi Claw: February 15, 2026

Moonshot AI is building the most complete open-source agentic ecosystem in the Chinese AI space.

Kimi K2.5

A 1 trillion parameter MoE model (32B active parameters) that understands text, images, and video. Key innovation: Agent Swarm capability, powered by a new RL technique called Parallel Agent Reinforcement Learning (PARL) that trains the model to decompose and parallelize complex tasks.

The model is fully open-source and available on Hugging Face.

Kimi Claw

Launched February 15, Kimi Claw is a cloud-native browser-based AI agent platform built on the OpenClaw framework. Think of it as Moonshot's answer to Anthropic's computer use — but running entirely in the cloud.

xAI: Grok 4.2 Release Candidate

Public Beta: February 17, 2026

Elon Musk's Grok 4.2 introduces a fundamentally different approach to model improvement: rapid learning.

What's New

Unlike every other model on this list, Grok 4.2 is designed to improve every week based on public usage. Musk described it as "able to learn rapidly" with weekly improvement cycles and release notes.

New capabilities:

4-agent parallel collaboration — specialized AI agents that synthesize outputs into a single response
Medical document analysis via photo upload
Improved engineering reasoning

Pricing

xAI maintains its aggressive pricing strategy:

Model	Input (per 1M)	Output (per 1M)
Grok 4.1	$0.20	$0.50
Grok 4.2 RC	TBA (beta)	TBA

Current Status

Grok 4.2 is currently in public beta, and users can select it in the Grok interface. The general public release is expected in March 2026. Official benchmarks will be published after the beta concludes.

Mistral AI: Large 3 & The Coding Stack

Mistral Large 3: December 2, 2025

Devstral 2: December 2025

Mistral continues to punch above its weight as Europe's frontier AI lab, shipping models that compete at the top of open-source leaderboards.

Mistral Large 3

A 675 billion parameter MoE model with 41B active parameters. It debuted at #2 in open-source non-reasoning models on the LMArena leaderboard — behind only the much larger models from Chinese labs.

Key models in Mistral's current lineup:

Model	Focus	Pricing (per 1M)
Mistral Large 3	General frontier	~$2.00 / $6.00
Mistral Medium 3.1	Multimodal (40k ctx)	$2.00 / $5.00
Magistral Medium 1.2	Reasoning	$2.00 / $5.00
Codestral	Code completion	Premier tier
Devstral 2	Agentic coding	Open-weight

Devstral Small 2

The standout from the December release: a 24B parameter coding model that beats Qwen 3 Coder Flash despite being significantly smaller. For teams that need self-hosted coding AI without massive GPU requirements, Devstral Small 2 is a compelling option.

Ministral 3

Mistral's small-model family (3B, 8B, 14B parameters) is pitched as having the best cost-to-performance ratio of any open-source model, matching or exceeding comparable models while producing an order of magnitude fewer tokens.

MiniMax: M2.5

M2.5 Released: 2026

The dark horse of the frontier race. MiniMax's M2.5 delivers benchmark-topping coding performance with just 10 billion active parameters — a fraction of what competitors use.

What's New

MiniMax M2.5 is purpose-built for coding and agentic execution, with a focus on doing more with less:

#1 on Multi-SWE-Bench with a score of 51.3
Surpasses Claude Opus 4.6 on SWE-Bench Pro
Leading scores on FinSearch, BrowseComp, and RISE benchmarks
100 tokens per second throughput — described as "3x faster than Opus"
Chain of Thought reasoning up to 128K tokens

The Efficiency Story

The standout stat: MiniMax M2.5 completes 327.8 tasks per $100 budget — over 10x more than Opus. At $0.30 per million input tokens ($0.06 with cache), it's in DeepSeek territory for pricing while matching or exceeding premium models on coding tasks.

Model	Input (per 1M)	With Cache	Speed
M2.5	$0.30	$0.06	100 TPS
M2.5-highspeed	$0.30	$0.06	Faster variant
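The tasks-per-budget figure converts to an average cost per task by simple division. A small Python sketch, assuming the 327.8 tasks per $100 quoted above and treating "over 10x" as exactly 10x to get a lower bound for Opus (the `cost_per_task` helper is a hypothetical name used only for illustration):

```python
def cost_per_task(tasks_per_budget: float, budget_usd: float = 100.0) -> float:
    """Average USD spent per completed task for a given budget."""
    return budget_usd / tasks_per_budget


# MiniMax M2.5: 327.8 tasks per $100, as quoted in the article.
m25 = cost_per_task(327.8)

# "Over 10x more than Opus" means Opus completes fewer than 32.78 tasks
# per $100, so this is a lower bound on its cost per task, not a measurement.
opus_floor = cost_per_task(327.8 / 10)

print(f"M2.5: ~${m25:.2f} per task")
print(f"Opus: at least ~${opus_floor:.2f} per task")
```

These are averages over the benchmark's task mix, so a real workload may land elsewhere.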

Open Weights

MiniMax has open-sourced M2.5 weights on HuggingFace, supporting vLLM, SGLang, and Transformers for self-hosting. This makes it one of the most cost-effective options for teams running their own inference infrastructure.

ByteDance: Seed 2.0

Released: February 2026

ByteDance enters the frontier race with Seed 2.0, a full model family designed for production deployment and complex real-world task execution. It is available in four variants: Pro, Lite, Mini, and a dedicated Code model.

What's New

Seed 2.0 Pro is a true multimodal model with industry-leading performance across visual understanding tasks — parsing documents, tables, graphs, and video content. It can process hour-long videos using a novel VideoCut tool for chunked analysis.

The reasoning capabilities are formidable: Seed 2.0 Pro achieves gold-medal results in the ICPC programming contest and the IMO and CMO math competitions, and ByteDance claims it "can explore Erdos-level math problems."

Benchmark Performance

Benchmark	Seed 2.0 Pro	Ranking
LMSYS Text Arena	—	#6 overall
LMSYS Vision Arena	—	#3 overall
MathVista / MathVision	Top tier	Industry-leading
DUDE / MMLongBench	Top tier	Industry-best long-context
SuperGPQA / FrontierSci	Competitive	Near GPT-5.2 level

The Cost Story

ByteDance claims token pricing is "lowered by approximately an order of magnitude" compared to competitors. If accurate, this puts Seed 2.0 in DeepSeek territory for pricing while matching tier-1 model quality — a combination that could reshape enterprise AI economics.

Availability

Models are available through:

Volcano Engine (full API access for all Seed 2.0 variants)
Doubao App and TRAE (Pro and Code models)
Open-source variant: Seed-OSS-36B-Instruct (Apache-2.0, $0.16/$0.65 per million tokens via OpenRouter)

Why It Matters

ByteDance has the infrastructure (serving billions of TikTok recommendations daily) and the incentive to build frontier models at scale. Seed 2.0's multimodal + long-context + competitive pricing makes it a serious contender, especially for video-heavy and document-heavy enterprise workloads.

The Pricing Landscape

Here's how all frontier models stack up on cost (per million tokens):

Provider	Model	Input	Output	Context
ByteDance	Seed-OSS-36B	$0.16	$0.65	131K
xAI	Grok 4.1	$0.20	$0.50	—
DeepSeek	V3.2	$0.27	$1.10	1M+
MiniMax	M2.5	$0.30	—	128K
OpenAI	o4-mini	$1.10	$4.40	—
Google	Gemini 3.1 Pro	~$1.25	~$10.00	1M
OpenAI	GPT-5	$1.25	$10.00	400K
Mistral AI	Medium 3.1	$2.00	$5.00	40K
Mistral AI	Large 3	~$2.00	~$6.00	128K
OpenAI	o3	$2.00	$8.00	—
Anthropic	Sonnet 4.6	$3.00	$15.00	1M (beta)
Anthropic	Opus 4.6	$15.00	$75.00	200K
Zhipu AI	GLM-5	Open weights	Free to self-host	—
Moonshot AI	Kimi K2.5	Open weights	Free to self-host	—
DeepSeek	V4 (expected)	Open weights	Free to self-host	1M+

The roughly 56x gap in input price between DeepSeek V3.2 ($0.27/M) and Opus 4.6 ($15/M), which widens to about 94x against the cheapest API in the table (Seed-OSS-36B at $0.16/M), represents a real architectural decision for businesses. The question is no longer "can we afford AI?" but "which tier of AI fits our use case?"
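The ratios between pricing tiers can be recomputed from the table's input prices. A short Python sketch, assuming the per-million input prices listed above (approximate "~" figures are taken at face value, and `INPUT_PRICE` and `price_ratio` are illustrative names):

```python
# Input prices in USD per million tokens, copied from the pricing table.
# Values marked "~" in the table are treated as exact for this comparison.
INPUT_PRICE = {
    "Seed-OSS-36B": 0.16,
    "Grok 4.1": 0.20,
    "DeepSeek V3.2": 0.27,
    "MiniMax M2.5": 0.30,
    "o4-mini": 1.10,
    "Gemini 3.1 Pro": 1.25,
    "GPT-5": 1.25,
    "Mistral Medium 3.1": 2.00,
    "Mistral Large 3": 2.00,
    "o3": 2.00,
    "Sonnet 4.6": 3.00,
    "Opus 4.6": 15.00,
}


def price_ratio(expensive: str, cheap: str) -> float:
    """How many times more the first model costs per input token."""
    return INPUT_PRICE[expensive] / INPUT_PRICE[cheap]


cheapest = min(INPUT_PRICE, key=INPUT_PRICE.get)
print(f"Cheapest listed API: {cheapest}")
print(f"Opus 4.6 vs {cheapest}: {price_ratio('Opus 4.6', cheapest):.0f}x")
print(f"Opus 4.6 vs DeepSeek V3.2: {price_ratio('Opus 4.6', 'DeepSeek V3.2'):.0f}x")
print(f"Opus 4.6 vs Sonnet 4.6: {price_ratio('Opus 4.6', 'Sonnet 4.6'):.0f}x")
```

Input price alone understates the gap for output-heavy workloads, so a full comparison should weight both columns by the expected token mix.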

Key Trends

1. The Open-Source Surge

Six recent models (GLM-5, Kimi K2.5, Mistral Large 3, MiniMax M2.5, ByteDance Seed-OSS-36B, and the still-unreleased DeepSeek V4) are open-weight or expected to be. They're not just catching up to closed-source: GLM-5 matches Claude Opus 4.5 on SWE-bench and beats it on Humanity's Last Exam, and Mistral Large 3 sits at #2 among open-source non-reasoning models on LMArena. On these benchmarks, the quality gap between open and closed models has narrowed sharply.

2. China's Independent AI Stack

GLM-5 (Huawei Ascend), DeepSeek V4, and ByteDance Seed 2.0 demonstrate that Chinese labs can produce frontier models at scale. Export controls have slowed but not stopped China's AI progress — and may have accelerated their investment in domestic alternatives. ByteDance's entry is particularly notable given their massive infrastructure advantage from serving TikTok at global scale.

3. The Agentic Everything

Nearly every release covered here includes agentic capabilities: GPT-5.3 Codex handles long-running multi-step tasks, Claude Sonnet 4.6 scores 72.5% on OSWorld computer use, Grok 4.2 runs 4-agent parallel collaboration, GLM-5 has a native Agent Mode, and Kimi K2.5 has Agent Swarm. 2026 is the year models stopped being chatbots and started being workers.

4. The Mid-Tier Revolution

Claude Sonnet 4.6 proving that a $3/M model can match a $15/M flagship is a watershed moment. Combined with DeepSeek's $0.27/M pricing achieving ~90% of GPT-5 quality, the value proposition of premium API pricing is under serious pressure.

5. Context Window Convergence

Multiple models now offer token context windows of 1M or more: Gemini 3.1 Pro, Claude Sonnet 4.6 (beta), and DeepSeek V3.2, with DeepSeek V4 expected to follow. Processing entire codebases, legal documents, or research corpora in a single pass is no longer a differentiator; it's table stakes.

What This Means for Business Users

If you're building AI into your business workflow in 2026, here's the practical takeaway:

For coding and development: GPT-5.3 Codex and Claude Sonnet 4.6 lead the pack. Codex for long-running agentic tasks, Sonnet for versatile coding + computer use.

For cost-sensitive workloads: DeepSeek V3.2 at $0.27/$1.10 per million tokens is hard to beat for high-volume tasks that need a frontier-class model with a 1M+ context window. Open-weight models (GLM-5, Kimi K2.5) carry no per-token fee if you self-host on your own GPU infrastructure.

For enterprise reasoning: Gemini 3.1 Pro's 2x reasoning improvement makes it the default for Google Cloud shops. Claude Opus 4.6 remains the ceiling for complex analysis.

For rapid iteration: Grok 4.2's weekly improvement model is unique — if you need a model that gets better at your specific use cases over time, it's worth watching.

For multimodal workloads: ByteDance Seed 2.0 Pro's #3 ranking on Vision Arena and hour-long video processing make it ideal for document-heavy and video-heavy enterprise tasks.

For independence: Open-weight models (GLM-5, Kimi K2.5, Seed-OSS-36B, and DeepSeek V4 once it ships) give you full control over deployment, customization, and data privacy.

Last Updated

March 2, 2026 — This article is updated as new frontier models are released. Follow us for the latest coverage.

Previous updates: Added ByteDance Seed 2.0 (Mar 2, 2026) • Initial publication (Feb 20, 2026)
